Tuesday May 27, 2008

OSP blog for May/08

Solaris Cluster 3.2u1 (2/08) has launched! You'll find that most of the vendors listed on the matrix are supported or in the process of supporting this latest release. You can always find the latest updates at the OSP URL, where we welcome your comments and feedback.

We've got some significant updates this month that increase the Solaris Cluster node count on two particular configurations. The first is EMC, where we now support (16) node connectivity to Symmetrix arrays on various platforms.

* Solaris and Solaris Cluster connected to EMC Symmetrix storage is now supported in combinations between (8) and (16) nodes. Support includes the following Sun server platforms: T1000/T2000 and M4000/M5000/M8000, along with the T5120 and T5220. Combinations of servers are supported on Symmetrix 800 and 1000 arrays. Solaris 10 11/06, 8/07, and 4/08 along with Solaris Cluster 3.2 are included.

For the details of these specific configurations, always reference the Solaris Cluster Open Storage matrix or the EMC External Support matrix.

In other EMC-related configurations, ZFS and MPxIO are now supported with EMC on Solaris 10 x86 (Opteron) and Solaris 10 11/06 (SPARC).

Solaris Cluster on NetApp rounds out the node-count increases, with the NAS storage product increased from (32) to (48) node attach. We've also added the following servers to the matrix: M4000/M5000/M8000/M9000, SPARC Enterprise T5120 and T5220, and Sun Fire X4150 and X4450. Sun Blade T6300, T6320, T6220, and T6250 are included.

In other related news, Solaris Cluster on HP now includes the new EVA 4400 Storage Array which is a significant addition to the Solaris Cluster storage portfolio.

Until next time.

Roger Autrand
Sr. Manager, Sun Cluster Availability

Friday Apr 18, 2008

Improving Sun Java System Application Server availability with Solaris Cluster


Sun Java System Application Server is one of the leading middleware products in the market, with its robust architecture, stability, and ease of use.  The design of the Application Server itself has some high availability (HA) features in the form of node agents (NA), which are spread across multiple nodes to avoid a single point of failure (SPoF).  A simple illustration of the design:

However, as the above block diagram shows, the Domain Administration Server (DAS) is not highly available. If the DAS goes down, administrative tasks cannot be performed.  Although client connections are redirected to other instances of the cluster when an instance or NA fails or becomes unavailable, automated recovery is desirable to reduce the load on the remaining instances.  There are also hardware, OS, and network failure scenarios that need to be accounted for in critical deployments, in which uptime is one of the main requirements.

Why is a High Availability Solution Required?

A high availability solution is required to handle those failures that Application Server, or for that matter any user-land application, cannot recover from: network, hardware, and operating system failures, and human errors. It also covers scenarios like providing continuous service while OS or hardware upgrades and maintenance are performed.

Apart from failure handling, a high availability solution helps the deployment take full advantage of other operating system features, such as network-level load distribution, link failure detection, and virtualization.

How to decide on the best solution?

Once customers decide that their deployment is better served by a high availability solution, they need to choose one from the market.  The answers to the following questions will help in the decision making:

Is the solution very mature and robust?

Does the vendor provide an Agent that is specifically designed for Sun Java System Application Server?

Is the solution very easy to use and deploy?

Is the solution cost effective?

Is the solution complete? Can it provide high availability for associated components like
Message Queue?

And importantly, can they get very good support in the form of documentation, customer service and a single point of support?

Why Solaris Cluster?

Solaris Cluster is the best high availability solution available for the Solaris platform. It offers excellent integration with the Solaris Operating System and helps customers take advantage of new Solaris features without modifying their deployments.  Solaris Cluster supports applications running in containers and offers a wide choice of file systems, processor architectures, and more.  Some of the highlights include:

Kernel level integration to make use of Solaris features like containers, ZFS, FMA, etc.

A wide portfolio of agents to support the most widely used applications in the market.

A very robust and quick failure detection mechanism, with stability even under very high loads.

IPMP-based network failure detection and load balancing.

The same agent can be used for both Sun Java Application Server and Glassfish.

Data Services Configuration Wizards for most common Solaris Cluster tasks.

Sophisticated fencing mechanism to avoid data corruption.

Detection of loss of access to storage through disk path monitoring.

How does Solaris Cluster Provide High Availability?

Solaris Cluster provides high availability by using redundant components: the storage, servers, and network cards are all redundant.  The following figure illustrates a simple two-node cluster with the recommended redundant interconnects, storage accessible to both nodes, and redundant public network interfaces on each node. Note that this is the recommended configuration; the minimal configuration can have just one shared storage device, one interconnect, and one public network interface.  Solaris Cluster even provides the flexibility of a single-node cluster, depending on individual needs.

LH = logical hostname, a type of virtual IP address used to move IP addresses across NICs.

RAID = any suitable software- or hardware-based RAID mechanism that provides both redundancy and performance.

One can opt to provide high availability for the DAS alone or for the node agents as well; the choice depends on the environment. Scalability of the node agents is not a problem in high availability deployments, since multiple node agents can be deployed on a single Solaris Cluster installation. These node agents are configured in multiple resource groups, each containing a single logical host, an HAStoragePlus resource, and a node agent resource. Since node agents are spread over multiple nodes in a normal deployment, no additional hardware is needed just because a highly available architecture is being used.  Storage can be made redundant with either software- or hardware-based RAID.
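As a sketch, a node agent resource group of the kind described above might be assembled with the Sun Cluster 3.2 command-line tools. All group, resource, and hostname names and the mount point below are hypothetical; adjust them to your environment:

```shell
# Register the storage resource type once per cluster (no-op if done).
clresourcetype register SUNW.HAStoragePlus

# Create a failover resource group for one node agent
# (all names here are hypothetical).
clresourcegroup create na1-rg

# Add a logical hostname resource -- the virtual IP the NA binds to.
clreslogicalhostname create -g na1-rg -h na1-lh na1-lh-rs

# Add an HAStoragePlus resource for the shared file system that
# holds the node agent's files.
clresource create -g na1-rg -t SUNW.HAStoragePlus \
    -p FilesystemMountPoints=/global/appserver na1-hasp-rs

# Manage the group and bring it online on the current node.
clresourcegroup online -M na1-rg
```

Each additional node agent gets its own resource group built the same way, which is what allows them to fail over independently.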

Solaris Cluster Failover Steps in Case of a Failure

Solaris Cluster applies a set of sophisticated algorithms to determine whether to restart an application or to fail it over to the redundant node. Typically the IP address, the file system on which the application binaries and data reside, and the application resource itself are grouped into a logical entity called a resource group (RG).  As the name implies, the IP address, file system, and application are viewed as resources, and each of them is identified by a resource type (RT), typically referred to as an agent. The recovery mechanism, i.e., restart or failover to another node, is determined by a combination of timeouts, number of restarts, and history of failovers. An agent typically has start, stop, and validate methods that are used to start and stop the application and verify prerequisites every time the application changes state.  It also includes a probe that is executed at a predetermined interval to determine application availability.
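The restart-versus-failover decision described above is driven by standard resource properties; a hedged example of tuning them (the resource name is hypothetical):

```shell
# Hypothetical resource name (das-rs). Retry_count and Retry_interval
# are standard Sun Cluster resource properties: if the resource fails
# more than Retry_count times within Retry_interval seconds, the
# framework stops attempting local restarts and fails the resource
# group over to another node.
clresource set -p Retry_count=2 -p Retry_interval=600 das-rs
```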

Solaris Cluster has two RTs, or agents, for the Sun Java System Application Server: the resource type SUNW.jsas is used for the DAS, and SUNW.jsas_na for the node agent. The probe mechanism involves executing the “asadmin list-domains” and “asadmin list-node-agents” commands and interpreting the output to determine whether the DAS and the node agents are in the desired state.  The Application Server software, file system, and IP address are moved to the redundant node in case of a failover. Please refer to the Sun Cluster Data Service guide (http://docs.sun.com/app/docs/doc/819-2988) for more details.
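The probe logic can be pictured as a small shell check. This is only an illustration: the sample status lines below are assumed formats, and the real agent parses the actual output of the asadmin commands it runs:

```shell
#!/bin/sh
# check_das interprets one (assumed) status line of `asadmin
# list-domains` output and prints the probe verdict. The real
# SUNW.jsas probe runs the command itself and is more thorough.
check_das() {
  case "$1" in
    *"not running"*) echo "probe: DAS down" ;;
    *running*)       echo "probe: DAS healthy" ;;
    *)               echo "probe: unknown state" ;;
  esac
}

# Sample lines, assumed for illustration only.
check_das "domain1 running"       # prints "probe: DAS healthy"
check_das "domain1 not running"   # prints "probe: DAS down"
```

A "down" verdict is what feeds the restart/failover decision described earlier.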

The following is a simple illustration of a failover in case of a server crash.

In the previously mentioned setup, Application Server does not fail over to the second node if only one of the NICs fails. The redundant NIC, which is part of the same IPMP group, hosts the logical host that the DAS and NA use. Only a temporary network delay will be noticed until the logical host is moved from nic1 to nic2.

The Global File System (GFS) is recommended for Application Server deployments, since there is very little write activity other than logs on the file system on which the configuration files and, in some deployments, the binaries are installed. Because GFS is always mounted on all nodes, it results in better failover times and quicker startup of Application Server in case of a node crash or similar problems.

Maintenance and Upgrades

The same properties that help Solaris Cluster provide recovery during failures can be used to provide service continuity in case of maintenance and upgrade work. 

During any planned OS maintenance or upgrade, the RGs are switched over to the redundant node, and the node that needs maintenance is rebooted into non-cluster mode. The planned actions are performed, and the node is then rebooted back into the cluster.  The same procedure can be repeated for the remaining nodes of the cluster.
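On the command line, the rolling-maintenance procedure above might look roughly like this (node and group names are hypothetical; the non-cluster boot syntax varies by platform, e.g. `boot -x` from the OBP prompt on SPARC):

```shell
# Hypothetical node/group names. Move a specific resource group
# from this node onto node2 before maintenance...
clresourcegroup switch -n node2 appserver-rg

# ...or evacuate everything from node1 at once.
clnode evacuate node1

# Reboot node1 into non-cluster mode for the maintenance work
# (passing -x to boot; verify the syntax for your platform).
reboot -- -x

# After maintenance, a plain reboot rejoins the node to the cluster.
reboot
```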

Application Server maintenance and upgrade procedures depend on how the binaries and the data and configuration files are stored.

1.) Storing the binaries on the node's internal hard disk and the domain and node agent related files on shared storage.  This method is preferable for environments in which frequent updates are necessary. The downside is the possibility of inconsistency in the application binaries due to differences in patches or upgrades.

2.) Storing both the binaries and the data on shared storage.  This method provides consistent data at all times but makes upgrades and maintenance without outages difficult.

The choice has to be made by taking into account the procedures and processes followed in the organization.

Other Features

Solaris Cluster also provides features for co-locating services based on the concept of affinities. For example, you can use a negative affinity to evacuate a test environment when a production environment is switched to a node, or a positive affinity to move the Application Server resources to the same node that hosts the database server for better performance.
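As a sketch, both affinity examples above map onto the RG_affinities resource-group property (group names are hypothetical):

```shell
# Strong positive affinity (++): keep the app server group on the
# same node as the database group. Group names are hypothetical.
clresourcegroup set -p RG_affinities=++database-rg appserver-rg

# Strong negative affinity (--): push the test group off any node
# that the production group is switched or failed over to.
clresourcegroup set -p RG_affinities=--production-rg test-rg
```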

Solaris Cluster has an easy-to-use and intuitive GUI management tool called Sun Cluster Manager, which can be used to perform most management tasks.

Solaris Cluster has a built-in telemetry feature that can be used to monitor the usage of resources like CPU, memory, etc.

Sun Java System Application Server doesn't require any modification to run under Solaris Cluster, as the agent is designed with this scenario in mind.

The same agent can be used for Glassfish as well.

The Message Queue Broker can be made highly available as well with the HA  for Sun Java Message Queue agent.

Consistent with Sun's philosophy, the product is being open-sourced in phases, and the agents are already available under the CDDL license.

An open source product based on the same code base is available for OpenSolaris releases called Open High Availability Cluster.  For more details on the product and community, please visit http://www.opensolaris.org/os/communities/ohac .

The open-source product also has a comprehensive test suite that helps users test their deployments satisfactorily.  For more details, please read http://opensolaris.org/os/community/ha-clusters/ohac/Documentation/Tests/.


For mission-critical environments, availability in the face of all types of failures is a very important criterion.  Solaris Cluster is well positioned to provide the highest availability for Application Server by virtue of its integration with the Solaris OS, its stability, and its agent designed specifically for Sun Java System Application Server.

Madhan Kumar
Solaris Cluster Engineering

Tuesday Apr 08, 2008

Solaris Cluster and MySQL

As you might imagine, the recent news about Sun and MySQL has led to increased interest in MySQL solutions on Solaris.  One key element of any database deployment is high availability, so here in Solaris Cluster we're fielding a wave of queries about our HA offerings for MySQL.

The first question is often "Does Solaris Cluster have an agent for MySQL?" and the answer is yes, we do.  You can read the documentation for this agent here and find a detailed blog on how to deploy MySQL in Solaris Cluster environments.

The next question is often about open source, and again we respond in the affirmative, inviting people to browse the source code for the agent and participate in its evolution.

And these past few weeks, many people have been wondering whether we'll be at the MySQL Conference and Expo in Santa Clara, CA next week, and the answer is yes, sí, oui, jawohl, etc.!

You can learn more about the conference here, and you can learn more about Solaris Cluster at the conference by visiting one of the booths where we'll be talking about Solaris Cluster with MySQL (don't miss the demo!), and by attending Ritu Kamboj's talk on Wednesday the 16th at 2:00 p.m. -- "Best Practices for Deploying MySQL on the Solaris Platform."

We hope to see you there! 

Burt Clouse
Senior Engineering Manager, Solaris Cluster


Wednesday Mar 26, 2008

HA-xVM Server x86-64 agent

Open HA Cluster has now made available an agent for xVM Server x86-64 guest domains. The nice part of the HA-xVM Server x86-64 agent is that it takes a holistic approach, gluing OpenSolaris projects together: Open HA Cluster (based on Solaris Cluster) and xVM (based on Xen). The agent simply aims to mitigate the failure of a physical server that may be hosting several heterogeneous xVM Server x86-64 guest domains.

Actually, it's a very simple idea to visualise. The HA-xVM Server x86-64 agent monitors guest domains as well as the physical server those domains are running on. If either fails, then the HA-xVM Server x86-64 agent fails over the domain to another x86-64 xVM node.

If you're familiar with the term “live migration,” then you may already know that live migration is not high availability. Suppose you have just lost the server hosting your domain; you can't now live migrate off that server. If, on the other hand, you live migrated 5 minutes before the server failed (read: hindsight), then you just got lucky. If the server fails before or during live migration, then you can only restart that domain on another node. Essentially, the HA-xVM Server x86-64 agent detects and reacts to these scenarios.

Anyway, if you're interested, check out the OpenSolaris HA-xVM Project, where you'll find an extensive HA-xVM Server x86-64 cheat sheet and an installable package of the agent to try out.

Neil Garthwaite
Solaris Cluster Engineering

Sunday Nov 11, 2007

SWIFTAlliance Access and SWIFTAlliance Gateway 6.0 support on SC 3.2 / S10 in global and non-global zones

The Solaris 10 packages for the Sun Cluster 3.2 Data Service for SWIFTAlliance Access and the Sun Cluster 3.2 Data Service for SWIFTAlliance Gateway are available from the Sun Download page. They introduce support for SWIFTAlliance Access and SWIFTAlliance Gateway 6.0 on Sun Cluster 3.2 with Solaris 10 11/06 or newer. It is now possible to configure the data services in resource groups that can fail over between the global zones of the nodes or between non-global zones. For more information, consult the updated documentation, which is part of the PDF file in the downloadable tar archive.
The data services were tested and verified in a joint effort between SWIFT and Sun Microsystems at the SWIFT labs in Belgium. Many thanks to the SWIFT engineering team and our Sun colleagues in Belgium for the ongoing help and support!
For completeness, here is the support matrix for SWIFTAlliance Access and SWIFTAlliance Gateway with Sun Cluster 3.1 and 3.2 software:

Failover Services for Sun Cluster 3.1 SPARC

Application            Version  Solaris                  SC version  Comments
SWIFTAlliance Access   5.0      8, 9, 10 11/06 or newer  3.1 SPARC   Requires Patch 118050-05 or newer
SWIFTAlliance Gateway  5.0      10 11/06 or newer        3.1 SPARC   Requires Patch 118984-04 or newer

Failover Services for Sun Cluster 3.2 SPARC

Application            Version  Solaris                  SC version  Comments
SWIFTAlliance Access   5.9      10 11/06 or newer        3.2 SPARC   Requires Patch 126085-01 or newer; package available for download
SWIFTAlliance Gateway  5.0      10 11/06 or newer        3.2 SPARC   Package available for download

If you want to study the data services source code, you can find it online for SWIFTAlliance Access and SWIFTAlliance Gateway on the community page for Open High Availability Cluster.

Thorsten Früauf
Solaris Cluster Engineering



