Friday Aug 01, 2014

Oracle Solaris Cluster 4.2 - Agent development just got better and easier

Agent or data service development allows you to deploy your application within the Oracle Solaris Cluster infrastructure. The Generic Data Service (GDS) was designed and developed by Oracle Solaris Cluster Engineering to reduce the complexity of data service development: all you need to do is supply the GDS with a script to start your application. Of course, the GDS can also use other scripts to validate, stop and probe your application. Essentially, the GDS is our preferred data service development tool and is extensively deployed within our fully supported data service portfolio of popular applications. If we do not have a data service for your application, the GDS will be your friend for developing your own.

With Oracle Solaris Cluster 4.2 we are delivering a new version of the GDS. If you are familiar with data service development within Oracle Solaris Cluster then you will be pleased to know that we have kept the consistent behaviour that makes the GDS a trusted and robust data service development tool. By delivering a new and enhanced version of the GDS, data service development within Oracle Solaris Cluster is now much more flexible.

We have included some enhancements that previously required somewhat awkward workarounds when using the original GDS resource type SUNW.gds. For example, previously your application was always started under the control of the Process Monitor Facility (PMF). With the new version of the GDS, using the PMF is optional: simply set the extension property PMF_managed=TRUE or FALSE. As you may be aware, when an application is started under the control of the PMF it has to leave behind at least one process for the PMF to monitor. While most applications do this, you may want your GDS resource to execute a script or command that does not leave behind any processes. Setting PMF_managed=FALSE for your GDS resource tells the GDS not to start your script or command under the control of the PMF; instead, your script or command is simply executed. This is very useful if your GDS resource performs some task that does not leave behind a process.
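As a sketch of how this might look with the new resource type, here is a hypothetical one-shot task resource with PMF monitoring disabled. The group and resource names and the script path are made up; check the GDS guide for the exact property names on your release.

```shell
# Hypothetical sketch: task-rg, task-rs and the script path are made up.
# Register the new GDS resource type and create a resource whose start
# command performs a one-shot task, so PMF monitoring is switched off.
clresourcetype register ORCL.gds
clresourcegroup create task-rg
clresource create -g task-rg -t ORCL.gds \
    -p Start_command="/opt/myapp/bin/run-task" \
    -p PMF_managed=FALSE \
    task-rs
clresourcegroup online -M task-rg
```

Because PMF is out of the picture, the resource goes online as soon as the start command completes, even though it leaves no process behind.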

We have also enhanced the GDS probe's behaviour so that your probe command will always be executed, even on a heavily loaded system, potentially eliminating a probe timeout. We have bundled a new resource type that lets you proxy a service's state, and we have given you the ability to subclass this new version of the GDS. Subclassing allows you to use the new version of the GDS under your own resource type name, with new or modified extension properties. In fact, there are over 20 improvements in this new version of the GDS.

While it is not practical to showcase all the capabilities of the new version of the GDS within this blog, its capability is significant when you consider that the new extension properties and other improvements, such as subclassing, can all be combined. Furthermore, the new version of the GDS will validate your configuration and inform you of any conflicts. Consequently, the new version of the GDS has enough functionality and flexibility to remain our preferred data service development tool.

The original GDS resource type SUNW.gds still exists and is fully supported with Oracle Solaris Cluster 4.2. The new version of the GDS includes two new resource types, ORCL.gds and ORCL.gds_proxy. It is possible to use your existing SUNW.gds start/stop and probe scripts with ORCL.gds. However, if you included workarounds within your scripts, you should consider removing or disabling those workarounds and instead using the new extension property that delivers the same behaviour.

Oracle Solaris Cluster 4.2 delivers enterprise Data Center high availability and business continuity for your physical and virtualised infrastructures. Deploying your application within that infrastructure is now better and easier with ORCL.gds and ORCL.gds_proxy. Comprehensive documentation, including demo applications, for the new version of the GDS can be found within the Oracle Solaris Cluster Generic Data Service (GDS) Guide.

Additionally, see the description of data services in the Oracle Solaris Cluster Concepts Guide, all our data service documentation for Oracle Solaris Cluster 4.2, and our Oracle Solaris Cluster 4 Compatibility Guide, which lists our fully supported data services.

Neil Garthwaite
Oracle Solaris Cluster Engineering

Friday Mar 08, 2013

Stay in sync with a changing SAP world

Over the years, the SAP kernel has changed a lot, and eventually the architecture of our SAP agents reached its limits. It was clearly time for a rewrite of the SAP agent. To accommodate the changes in the SAP kernel, we created the new HA for NetWeaver agent. We also took this chance to support both central instance and traditional HA deployments within this one agent, which reduces resource complexity.

Most recently, with SAP kernel 7.20_EXT, SAP introduced the ability to integrate HA frameworks. Of course, we did not miss the opportunity to integrate our new SAP NetWeaver agent with this functionality. With the new HA for NetWeaver agent, it is now possible to use SAP commands and the SAP Management Console to manage SAP instances under cluster control, without having root access. This improves the ease of use of SAP in an Oracle Solaris Cluster environment. We made these functionalities available on Oracle Solaris 11 starting with Oracle Solaris Cluster 4.0 SRU 4.
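For example, once the agent is registered with SAP's HA interface, an SAP administrator could inspect the integration from the SAP side using sapcontrol; the instance number 00 below is purely illustrative.

```shell
# Illustrative only: replace 00 with your SAP instance number.
# Show which HA framework the instance is integrated with:
sapcontrol -nr 00 -function HAGetFailoverConfig
# Run SAP's own consistency checks against the HA setup:
sapcontrol -nr 00 -function HACheckConfig
```

These checks run as the SAP administration user, not root, which is exactly the ease-of-use point made above.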

One of the larger cities in Europe migrated their SAP system from Oracle Solaris 10 to Oracle Solaris 11 as soon as the HA for NetWeaver agent was available. They reduced their system administration costs thanks to the benefits of Oracle Solaris 11 and Oracle Solaris Cluster 4.0. Information about this customer use case is available at:

We look forward to your feedback and input!


Friday Jun 13, 2008

Oracle Data Guard Replication Support in Sun Cluster Express Geographic Edition

If you read the Solaris Cluster Express 6/08 available for download blog posting, you will have noticed that it mentioned that SCX Geographic Edition had been enhanced to support Oracle Data Guard replication. If you run Oracle 10g or 11g RAC in your data centers and need to integrate them into a disaster recovery framework that already supports Sun StorageTek Availability Suite (AVS), Hitachi TrueCopy and EMC SRDF, then read on.

SCX Geographic Edition (SCXGE) is a binary distribution of the open source version of Sun Cluster Geographic Edition (SCGE) that runs on Solaris Cluster Express (SCX) on top of Solaris Express Community Edition (SXCE) build 86. [We love our acronyms and abbreviations]. Trivia: did you know that SUN itself is an acronym that originally stood for Stanford University Networks?

Back to the plot... Services made highly available by SCX resource groups can be grouped together in "protection groups". SCXGE then uses the robust start and stop methods of SCX to ensure that the protection group is migrated in a controlled fashion from a primary location to a standby site. This migration includes performing all the tasks needed to reverse the direction of data flow provided by the underlying replication technology. The advantage is that SCXGE hides the complexity of the individual replication methods behind a common command line interface. Thus switching a service from site A to site B is reduced to:

# geopg switchover -m newhost_cluster my-protection-pg

regardless of whether the replication was AVS, TC or SRDF based.

Support for Oracle Data Guard (ODG) extends SCXGE's capability beyond just software and storage based replication technologies. Internally, the ODG module uses the Oracle Data Guard broker interface which restricts its use to Oracle 10g and 11g RAC configurations only, i.e. no support for HA-Oracle.

So once you have set up your Oracle RAC databases, configured ODG and created your Data Guard broker configuration, you can add that configuration to your SCXGE setup and have it monitored and controlled, just as you do for your other non-ODG protection groups. The module allows you to have multiple ODG protection groups, each containing one or more ODG broker configurations. Remember that each protection group spans a single SCXGE partnership pair, so you can only have one primary and one standby database in each of these broker configurations.
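As a rough sketch, creating an ODG protection group and attaching a broker configuration might look like this; the partnership, protection group and broker configuration names are all hypothetical.

```shell
# Hypothetical names throughout.
# Create a protection group of replication type 'odg' on the primary:
geopg create -s paris-newyork-ps -o primary -d odg sales-pg
# Attach an existing Data Guard broker configuration to the group:
geopg add-replication-component sales-broker sales-pg
# Activate the protection group on both partner clusters:
geopg start -e global sales-pg
```

From there, the same `geopg switchover` invocation shown earlier works for ODG just as it does for the other replication types.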

If this has got you interested, feel free to download and install the SCX software bundle with the new ODG module and try it out. Just remember that this is the first build. I'm working on fixing all the bugs that my colleagues in QA discover, so keep an eye on this blog and the Open HA Cluster site for news of any updates.

Although we don't offer support for this product, you can post questions in any of the usual forums (or is it fora :-) ).

Tim Read

Friday Apr 18, 2008

Improving Sun Java System Application Server availability with Solaris Cluster


Sun Java System Application Server is one of the leading middleware products in the market, with its robust architecture, stability, and ease of use. The design of the Application Server by itself has some high availability (HA) features in the form of node agents (NA), which are spread over multiple nodes to avoid a single point of failure (SPoF). A simple illustration of the design:

However, as we can see from the above block diagram, the Domain Administration Server (DAS) is not highly available. If the DAS goes down, administrative tasks cannot be performed. Although client connections are redirected to other instances of the cluster when an instance or NA fails or becomes unavailable, an automated recovery is desirable to reduce the load on the remaining instances. There are also hardware, OS and network failure scenarios that need to be accounted for in critical deployments, in which uptime is one of the main requirements.

Why is a High Availability Solution Required?

A high availability solution is required to handle those failures from which the Application Server, or for that matter any user-land application, cannot recover, such as network, hardware and operating system failures, and human errors. Beyond these, there are other scenarios, like providing continuous service while OS or hardware upgrades and maintenance are performed.

Apart from handling failures, a high availability solution helps the deployment take full advantage of other operating system features, like network-level load distribution, link failure detection, and virtualization.

How to decide on the best solution?

Once customers decide that their deployment is better served by a high availability solution, they need to decide which solution to choose from the market. The answers to the following questions will help in the decision making:

Is the solution very mature and robust?

Does the vendor provide an Agent that is specifically designed for Sun Java System Application Server?

Is the solution very easy to use and deploy?

Is the solution cost effective?

Is the solution complete? Can it provide high availability for associated components like
Message Queue?

And importantly, can they get very good support in the form of documentation, customer service and a single point of support?

Why Solaris Cluster?

Solaris Cluster is the best high availability solution available for the Solaris platform. It offers excellent integration with the Solaris Operating System and helps customers make use of new Solaris features without modifying their deployments. Solaris Cluster supports applications running in containers and offers a wide choice of file systems and processor architectures. Some of the highlights include:

Kernel level integration to make use of Solaris features like containers, ZFS, FMA, etc.

A wide portfolio of agents to support the most widely used applications in the market.

A very robust and quick failure detection mechanism, with stability even under very high loads.

IPMP-based network failure detection and load balancing.

The same agent can be used for both Sun Java Application Server and Glassfish.

Data Services Configuration Wizards for most common Solaris Cluster tasks.

Sophisticated fencing mechanism to avoid data corruption.

Detection of loss of access to storage by monitoring the disk paths.

How does Solaris Cluster Provide High Availability?

Solaris Cluster provides high availability by using redundant components: storage, servers and network cards. The following figure illustrates a simple two-node cluster with the recommended redundant interconnects, storage accessible to both nodes, and redundant public network interfaces on each node. It is important to note that this is the recommended configuration; a minimal configuration can have just one shared storage device, one interconnect, and one public network interface per node. Solaris Cluster even provides the flexibility of a single-node cluster, based on individual needs.

LH = Logical hostname, a type of virtual IP used for moving IP addresses across NICs.

RAID = any suitable software- or hardware-based RAID mechanism that provides both redundancy and performance.

One can opt to provide high availability for the DAS alone or for the node agents as well; the choice depends on the environment. Scalability of the node agents is not a problem in highly available deployments, since multiple node agents can be deployed on a single Solaris Cluster installation. These node agents are configured in multiple resource groups, each resource group having a single logical host, an HAStoragePlus resource and a node agent resource. Since node agents are spread over multiple nodes in a normal deployment, no additional hardware is needed just because a highly available architecture is being used. Storage can be made redundant with either software- or hardware-based RAID.

Solaris Cluster Failover Steps in Case of a Failure

Solaris Cluster provides a set of sophisticated algorithms that determine whether to restart an application or to fail it over to the redundant node. Typically the IP address, the file system on which the application binaries and data reside, and the application resource itself are grouped into a logical entity called a resource group (RG). As the name implies, the IP address, file system, and application are each viewed as resources, and each is identified by a resource type (RT), typically referred to as an agent. The recovery mechanism, i.e., restart or failover to another node, is determined by a combination of timeouts, number of restarts, and history of failovers. An agent typically has start, stop, and validate methods that are used to start, stop, and verify prerequisites every time the application changes state. It also includes a probe, executed at a predetermined interval, to determine application availability.
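As a minimal sketch of such a resource group (all names are hypothetical, and type-specific properties required by the Application Server resource types are omitted for brevity):

```shell
# Hypothetical names: app-rg, app-lh, /global/appdata, das-rs.
clresourcegroup create app-rg
# Logical hostname resource -- the virtual IP that follows the group:
clreslogicalhostname create -g app-rg app-lh
# Failover file system holding the application data:
clresource create -g app-rg -t SUNW.HAStoragePlus \
    -p FilesystemMountPoints=/global/appdata app-hasp-rs
# DAS resource, started only after storage and IP are available:
clresource create -g app-rg -t SUNW.jsas \
    -p Resource_dependencies=app-hasp-rs,app-lh das-rs
clresourcegroup online -M app-rg
```

The dependency ordering ensures the IP address and file system are in place before the application's start method runs, on whichever node the group lands.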

Solaris Cluster has two RTs, or agents, for the Sun Java System Application Server. The resource type SUNW.jsas is used for the DAS, and SUNW.jsas_na for the node agents. The probe mechanism involves executing the "asadmin list-domains" and "asadmin list-node-agents" commands and interpreting the output to determine whether the DAS and the node agents are in the desired state. The Application Server software, file system, and IP address are moved to the redundant node in case of a failover. Please refer to the Sun Cluster Data Service guide for more details.

The following is a simple illustration of a failover in case of a server crash.

In the previously mentioned setup, the Application Server is not failed over to the second node if only one of the NICs fails. The redundant NIC, which is part of the same IPMP group, hosts the logical host that the DAS and NA use. A temporary network delay will be noticed until the logical host is moved from nic1 to nic2.

The Global File System (GFS) is recommended for Application Server deployments, since there is very little write activity other than logs on the file system holding the configuration files and, in specific deployments, the binaries. Because GFS is always mounted on all nodes, it results in better failover times and quicker startup of the Application Server after a node crash or similar problem.

Maintenance and Upgrades

The same properties that help Solaris Cluster provide recovery during failures can be used to provide service continuity in case of maintenance and upgrade work. 

During any planned OS maintenance or upgrade, the RGs are switched over to the redundant node, and the node that needs maintenance is rebooted into non-cluster mode. The planned actions are performed and the node is then rebooted back into the cluster. The same procedure can be repeated for all the remaining nodes of the cluster.
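A sketch of that procedure, assuming hypothetical node and resource group names:

```shell
# Hypothetical names: node1 is being maintained, node2 is the survivor.
clresourcegroup switch -n node2 app-rg   # move the service off node1
clnode evacuate node1                    # drain anything else off node1
# On node1: reboot into non-cluster mode for the maintenance work.
reboot -- -x
# After maintenance, boot node1 back into the cluster normally, then
# optionally switch the resource group back:
clresourcegroup switch -n node1 app-rg
```

Throughout the procedure the service stays online on node2, so clients see at most a brief switchover delay rather than an outage.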

Application Server maintenance or upgrade depends on the way in which the binaries, data, and configuration files are stored. There are two common approaches:

1.) Storing the binaries on each node's internal hard disk and storing the domain and node agent related files on shared storage. This method is preferable for environments in which frequent updates are necessary. The downside is the possibility of inconsistency between the application binaries on different nodes, due to differences in patches or upgrades.

2.) Storing both the binaries and the data on shared storage. This method keeps the data consistent at all times but makes upgrades and maintenance without outages difficult.

The choice has to be made by taking into account the procedures and processes followed in the organization.

Other Features

Solaris Cluster also provides features that can be used for co-locating services, based on the concept of affinities. For example, you can use a negative affinity to evacuate a test environment when a production environment is switched to a node, or a positive affinity to move the Application Server resources to the same node that hosts the database server, for better performance.
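With hypothetical group names, such affinities could be declared like this ("++" denotes a strong positive affinity, "--" a strong negative one):

```shell
# Hypothetical group names throughout.
# Keep the Application Server group on the same node as the database:
clresourcegroup set -p RG_affinities=++ora-rg as-rg
# Evacuate the test group whenever the production group arrives:
clresourcegroup set -p RG_affinities=--prod-rg test-rg
```

The affinities are evaluated automatically on every switchover and failover, so no manual co-location step is needed.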

Solaris Cluster has an easy-to-use and intuitive GUI management tool called Sun Cluster Manager, which can be used to perform most management tasks.

Solaris Cluster has a built-in telemetry feature that can be used to monitor the usage of resources like CPU and memory.

The Sun Java System Application Server doesn't require any modification to run under Solaris Cluster, as the agent is designed with this scenario in mind.

The same agent can be used for Glassfish as well.

The Message Queue Broker can be made highly available as well, with the HA for Sun Java Message Queue agent.

Consistent with Sun's philosophy, the product is being open sourced in phases and the agents are already available under the CDDL license.

An open source product based on the same code base is available for OpenSolaris releases called Open High Availability Cluster.  For more details on the product and community, please visit .

The open-source product also has a comprehensive test suite that helps users test their deployment satisfactorily. For more details, please read


For mission-critical environments, availability in the face of all types of failures is a very important criterion. Solaris Cluster is best placed to provide the highest availability for the Application Server by virtue of its integration with the Solaris OS, its stability, and its agent designed specifically for the Sun Java System Application Server.

Madhan Kumar
Solaris Cluster Engineering


Oracle Solaris Cluster Engineering Blog

