Monday Jul 14, 2008

LDoms guest domains supported as Solaris Cluster nodes

Folks, when late last year we announced support for Solaris Cluster in LDoms I/O domains in this blog entry, we also hinted at support for LDoms guest domains. It has taken a bit longer than we envisaged, but I am pleased to report that SC Marketing has just announced support for LDoms guest domains with Solaris Cluster!

So, what exactly does "support" mean here? It means that you can create an LDoms guest domain running Solaris, and then treat that guest domain as a cluster node by installing SC software (specific version and patch information noted later in the blog) inside the guest domain and have the SC software work with the virtual devices in the guest domain. The technically inclined reader would, at this point, have several questions pop into his head... How exactly does SC work with virtual devices? What do I have to do to make SC recognize these devices? Are there any differences between how SC is configured in LDoms guest domains and in non-virtualized environments? Read on below for a high-level summary of the specifics:

  • For shared storage devices (i.e. those accessible from multiple cluster nodes), the virtual device must be backed by a full SCSI LUN. That means, no file backed virtual devices, no slices, no volumes. This limitation is required because SC needs advanced features in the storage devices to guarantee data integrity and those features are available only for virtual storage devices backed by full SCSI LUNs.

  • One may need to use storage which is unshared (i.e. accessed from only one cluster node) for things such as OS image installation for the guest domain. For such usage, any type of virtual device can be used, including those backed by files in the I/O domain. However, make sure to configure such virtual devices to be synchronous. Check the LDoms documentation and release notes on how to do that. Currently (as of July 2008) one needs to add "set vds:vd_file_write_flags = 0" to the /etc/system file in the I/O domain exporting the file. This is required because the Cluster stores some key configuration information on the root filesystem (in /etc/cluster) and expects information written to this location to be written synchronously to disk. If the root filesystem of the guest domain is on a file in the I/O domain, it needs this setting to be synchronous.

  • Network based storage (NAS etc.) is fine when used from within the guest domain. Check cluster support matrix for specifics. LDoms guest domains don't change this support.

  • For the cluster private interconnect, the LDoms virtual device "vnet" can be used just fine; however, the virtual switch to which it maps must have the option "mode=sc" specified for it. So essentially, when you run the ldm add-vsw subcommand to create the virtual switch that will be used for the cluster private interconnect inside the guest domains, you add the extra argument "mode=sc" on the command line. This option enables a fastpath in the I/O domain for the Cluster heartbeat packets so that those packets do not compete with application network packets in the I/O domain for resources. This greatly improves the reliability of the Cluster heartbeats, even under heavy load, leading to a very stable cluster membership for applications to work with. Note, however, that good engineering practices should still be followed while sizing your server resources (both in the I/O domain and in the guest domains) for the application load expected on the system.

  • With this announcement, all features of Solaris Cluster supported in non-virtualized environments are supported in LDoms guest domains, unless explicitly noted in the SC release notes. Some limitations come from LDoms themselves, such as lack of jumbo frame support over virtual networks or lack of link-based failure detection with IPMP in guest domains. Check the LDoms documentation and release notes for such limitations, as support for these missing features is improving all the time.

  • For support of specific applications with LDoms guest domains and SC, check with your ISV. Support for applications in LDoms guest domains is improving all the time, so check often.

  • Software version requirements. LDoms_1.0.3 or higher, S10U5 and patches 137111-01, 137042-01, 138042-02, and 138056-01 or higher are required in BOTH the LDoms guest domains as well as in the I/O domains exporting virtual devices to the guest domains. Solaris Cluster SC32U1 (3.2 2/08) with patch 126106-15 or higher is required in the LDoms guest domains.

  • Licensing for SC in LDoms guest domains follows the same model as that for the I/O domains. You basically pay for the physical server, irrespective of how many guest domains and I/O domains are deployed on that physical server.

This covers the high-level overview of how SC is to be deployed inside the LDoms guest domains. Check out the SC Release notes for additional details and some sample configurations. The whole virtualization space is evolving very rapidly and new developments are happening ever so quickly. Keep this blog page bookmarked and visit it frequently to find out how Solaris Cluster is evolving along with this space.
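
Two of the requirements listed in this post, the minimum patch revisions and the synchronous-write setting in /etc/system, lend themselves to a mechanical check. What follows is only a minimal sketch in Python, not an official Sun tool: the function names are hypothetical, the patch numbers and the /etc/system line are copied from the text above, and the caller is assumed to supply the installed patch list and the file contents by whatever means is convenient.

```python
import re

# Minimum patch revisions quoted in the post (patch ID -> minimum revision).
REQUIRED_SOLARIS_PATCHES = {"137111": 1, "137042": 1, "138042": 2, "138056": 1}
REQUIRED_CLUSTER_PATCHES = {"126106": 15}


def missing_patches(installed, required):
    """Return required patches that are absent or below the minimum revision.

    `installed` is an iterable of strings such as "137111-03".
    """
    best = {}
    for patch in installed:
        patch_id, _, rev = patch.partition("-")
        if rev.isdigit():
            best[patch_id] = max(best.get(patch_id, 0), int(rev))
    return sorted(
        f"{pid}-{min_rev:02d}"
        for pid, min_rev in required.items()
        if best.get(pid, 0) < min_rev
    )


# Matches the setting quoted in the post, with flexible whitespace.
_FLAG = re.compile(r"^\s*set\s+vds:vd_file_write_flags\s*=\s*(\S+)")


def file_backends_synchronous(etc_system_text):
    """True if the given /etc/system text sets vds:vd_file_write_flags to 0."""
    value = None
    for line in etc_system_text.splitlines():
        if line.lstrip().startswith("*"):  # '*' starts a comment in /etc/system
            continue
        match = _FLAG.match(line)
        if match:
            value = match.group(1)  # assumption of this sketch: last setting wins
    if value is None:
        return False
    try:
        return int(value, 0) == 0
    except ValueError:
        return False
```

For example, `missing_patches(["137111-01", "138042-01"], REQUIRED_SOLARIS_PATCHES)` reports the patches that still need to be applied, in both the guest domain and the I/O domain exporting its devices.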


    Ashutosh Tripathi
    Solaris Cluster Engineering

    Wednesday Jun 04, 2008

    Solaris Cluster Express 6/08 available for download

    Solaris Cluster Express 6/08 is now available for download! You can download the DVD image here

    What is new in this release?

    * This release runs on OpenSolaris Nevada build 86. The version of the Sun Management Center is now 3.1.

    * The HA agent for Solaris Containers is now enhanced to include support for Solaris 9 Branded Zones on the SPARC platform. This is very useful for those customers who still need to run some applications on Solaris 9 while taking advantage of the new features of Solaris 10 and above.

    * The HA agent for PostgreSQL Database is now enhanced to support WAL shipping. This feature greatly improves PostgreSQL deployments in enterprise environments.

    * Support for Solaris Containers configured with exclusive IP is included in this release.

    * The SCX Geographic Edition is enhanced to support Oracle Data Guard based replication.

    * This release also contains the mandatory bug fixes and other minor enhancements not mentioned above.

    Stay tuned for more milestones along the open source journey!

    Munish Ahuja
    Madhan Kumar B.
    Jonathan Mellors
    Arun Kurse
    Venugopal N.S.

    Thursday May 29, 2008

    HA For Grid Engine at osgc2008

    Last week I presented Open HA Cluster at the Open Source Grid Cluster Conference in Oakland, California. The conference had three different tracks, dedicated to Globus (GlobusWorld), Grid Engine (Grid Engine Workshop), and Rocks (Rocks Cluster Workshop).
    My presentation about making Sun Grid Engine highly available using Open HA Cluster (OHAC) was part of the Grid Engine Workshop.

    I noticed that the term Cluster was a bit overused at this conference, with different products and technologies using it in slightly different ways. So I started by clarifying the term "HA Cluster" to refer to the technology which OHAC brings to the arena, which is about high availability in spite of failures. A quick show of hands revealed that about 25% of the participants were aware of the concept of "HA Clusters" in general, with about 15% actually being aware of OHAC itself. Given that, I spent a larger portion of my talk on the concepts of single points of failure, redundancy, and failover, and on how OHAC recovers from system failures. Towards the end of my talk, I also talked about using OHAC to make Sun Grid Engine highly available and about the key advantages of the HA solution based on OHAC. These points and the slides are courtesy of Thorsten Frueauf. The key points about how OHAC helps in improving the availability of Sun Grid Engine are noted in this blog entry.

    The presentation did generate a couple of questions from the audience. One question was about how OHAC takes care of the MAC address change when it fails over an HA IP address from one node to another. I explained that OHAC uses gratuitous ARPs to update the ARP cache of any routers on the network, and that this works with all but a very few exceptions. Another question was about data recovery during disk/mirror failures and whether the end application needs to worry about it. I explained that this recovery is typically performed by a volume manager and the end application is blissfully unaware of it. The OHAC framework makes sure that the end application has its data available when and where (on the node where the app is) the application is started. Another question was about the speed of failover (how fast the recovery is upon various failures). I turned that question into an advantage by explaining how OHAC is tightly integrated with Solaris and thus can detect failures quickly and recover from them quickly. I then invited folks to view the failover demo on my laptop the next day, in the "Grill the Gurus" portion of the conference.
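
To make the gratuitous ARP answer a little more concrete, here is a minimal Python sketch of what such a frame looks like on the wire. It only illustrates the general gratuitous ARP format (an ARP request in which sender and target IP are the same, sent to the Ethernet broadcast address); it is not OHAC code, and the function name and the example addresses are made up for illustration.

```python
import socket
import struct


def gratuitous_arp_frame(mac: str, ip: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP request for `ip`."""
    sender_mac = bytes.fromhex(mac.replace(":", ""))
    if len(sender_mac) != 6:
        raise ValueError("MAC address must be 6 bytes")
    sender_ip = socket.inet_aton(ip)

    broadcast = b"\xff" * 6
    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    ethernet = broadcast + sender_mac + struct.pack("!H", 0x0806)
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6, 4,         # hardware / protocol address lengths
        1,            # opcode: request
        sender_mac,   # sender MAC: the node now hosting the address
        sender_ip,    # sender IP: the address that just moved
        b"\x00" * 6,  # target MAC: not meaningful in an announcement
        sender_ip,    # target IP equals sender IP, which makes it gratuitous
    )
    return ethernet + arp
```

A receiver that already has an ARP cache entry for the IP address replaces the cached MAC with the sender MAC from this frame, which is how traffic follows the address to the new node.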

    I was also somewhat curious about the audience mix, and whether the larger share came from the academic community or the commercial community. A quick show of hands revealed that the commercial users were well represented, roughly in the same numbers as the academic/research users. After the talk, I spoke to a variety of people during the coffee/lunch breaks. Here are some folks I remember: a sysadmin at a European oil company interested in using Grid Engine to minimize application license usage for the commercial software he uses for geological data analysis, an IT manager for a medical software startup based in San Francisco who was interested in open source software as a way to minimize costs, a deployment architect for an IT consultancy who was interested in geographical data replication and content-based routing of incoming jobs, a lab manager from an Ivy League university who wanted to figure out an easy way for his students to be effective at managing his compute lab environment, and an IT admin for a storage manufacturer who was interested in learning about techniques for efficient monitoring of workloads.

    For the demo the next day, I had Sun Grid Engine configured as an HA service across two zones on my laptop. I was able to demo the very quick restart of the Grid Engine qmaster and scheduler daemons. People seemed interested to learn how that happens, which led me to explain how Solaris Contracts are used by the process monitoring implementation in OHAC, which leads to quick detection of and recovery from application failures. Most people were simply interested in chatting about the general concept of clusters itself and discussing their own "Grid and Cluster" scenarios.

    If you are interested in the actual slides I used for the talk, you can check them out here. If you missed this conference, you will have an opportunity to learn more about Open HA Cluster and OpenSolaris at the upcoming LinuxTag conference in Berlin, Germany, from May 28th to 31st, 2008.

    The picture at the top was taken during a coffee break at the conference. Check out this link for other photos I took at the conference. Also, Deirdré Straughan made a video of my talk, complete with neat fading in and out of the presentation slides. Click in the embedded window below to watch the presentation in Flash.

    If you'd like, you can get the video in iPod format and watch it on your video iPod. Beware that the file is rather big though.

    This conference was a nice opportunity for me to talk to a lot of people and make them aware of Open HA Cluster, and also to learn about what is going on in other open source communities such as Grid. I hope you found some of the things in this blog useful and interesting.

    Ashutosh Tripathi
    Solaris Cluster Engineering

    Friday May 02, 2008

    Solaris Cluster, Sun Cluster Geographic Edition and Virtualization: The Art of the Possible

    Virtualization is a hot topic right now. You only need look at what Sun is doing with our CoolThreads servers, our storage, our software, our Open High Availability Cluster (OHAC) HA-xVM and HA Container agent development work ... deep breath ... and our OEM agreements and I think you'll agree - it's everywhere. Consequently, it's generating lots of questions on our external cluster forum and resulting in several posts to the Solaris Cluster blog, including this one.

    So does virtualization mean that the need for clustering goes away? Far from it! Now that you've consolidated your multiple, independent servers into a single virtualized platform, you've effectively put more of your 'eggs' in a single basket. If the services are mission or business critical, you are still going to need to protect them. And how are you going to do that? Solaris Cluster, of course. Oh, and what about disaster recovery? Critical services need protection against fire, flood, natural disasters. That's where Sun Cluster Geographic Edition comes in. Is your head hurting yet?

    As with any rapidly evolving technology, it's important to understand the ramifications of what you are doing, as well as be able to navigate the what-works-with-what matrix. Failure in either department could result in an unsupported configuration or an implementation that doesn't achieve the levels of service you would expect.

    When I first suggested to my colleagues that I would author something on this topic, I had no idea what I was letting myself in for. I can usually talk to one expert, or consult one document and get all the information I need. But as you'll have realised from the preamble, the combination of virtualization and clustering spans many boundaries. So, I am indebted to the many colleagues who provided information and reviews for the Blueprint that resulted from my rash suggestion.

    I hope you find "Using Solaris Cluster and Sun Cluster Geographic Edition with Virtualization Technologies" saves you a lot of head scratching when your colleague, manager, or professor asks you, "And how do Solaris Cluster and virtualization technologies work together?".



    Friday Apr 18, 2008

    Improving Sun Java System Application Server availability with Solaris Cluster


    Sun Java System Application Server is one of the leading middleware products in the market, with its robust architecture, stability, and ease of use. The design of the Application Server by itself has some high availability (HA) features in the form of node agents (NA), which are spread over multiple nodes to avoid a single point of failure (SPoF). A simple illustration of the design:

    However, as we can see from the above block diagram, the Domain Administration Server (DAS) is not highly available. If the DAS goes down, administrative tasks cannot be performed. Although client connections are redirected to other instances of the cluster when an instance or NA fails or is unavailable, an automated recovery would be desirable to reduce the load on the remaining instances of the cluster. There are also the hardware, OS, and network failure scenarios that need to be accounted for in critical deployments, in which uptime is one of the main requirements.

    Why is a High Availability Solution Required?

    A high availability solution is required to handle those failures which Application Server, or for that matter any user-land application, cannot recover from, such as network, hardware, and operating system failures, and human errors. Apart from these, there are other scenarios, such as providing continuous service while OS or hardware upgrades or maintenance are performed.

    Apart from failures, a high availability solution helps the deployment take full advantage of other operating system features, such as network-level load distribution, link failure detection, and virtualization.

    How to decide on the best solution?

    Once customers decide that their deployment is better served by a high availability solution, they need to decide on which solution to choose from the market.  The answer to the following questions will help in the decision making:

    Is the solution very mature and robust?

    Does the vendor provide an Agent that is specifically designed for Sun Java System Application Server?

    Is the solution very easy to use and deploy?

    Is the solution cost effective?

    Is the solution complete? Can it provide high availability for associated components like Message Queue?

    And importantly, can they get very good support in the form of documentation, customer service and a single point of support?

    Why Solaris Cluster?

    Solaris Cluster is the best high availability solution available for the Solaris platform. It offers excellent integration with the Solaris Operating System and helps customers make use of new features introduced in Solaris without making modifications to their deployments. Solaris Cluster supports applications running in containers, and offers a very good choice of file systems, a choice of processor architectures, etc. Some of the highlights include:

    Kernel level integration to make use of Solaris features like containers, ZFS, FMA, etc.

    A wide portfolio of agents to support the most widely used applications in the market.

    Very robust and quick failure detection mechanism and stability even during very high loads.

    IPMP-based network failure detection and load balancing.

    The same agent can be used for both Sun Java Application Server and Glassfish.

    Data Services Configuration Wizards for most common Solaris Cluster tasks.

    Sophisticated fencing mechanism to avoid data corruption.

    Detect loss of access to storage by monitoring the disk paths.

    How does Solaris Cluster Provide High Availability?

    Solaris Cluster provides high availability by using redundant components: the storage, servers, and network cards are all redundant. The following figure illustrates a simple two-node cluster with the recommended redundant interconnects, storage accessible to both nodes, and redundant public network interfaces on each node. Note that this is the recommended configuration; the minimal configuration can have just one shared storage device, one interconnect, and one public network interface. Solaris Cluster even provides the flexibility of a single-node cluster, based on individual needs.

    LH =  Logical hostname, type of virtual IP used for moving IP addresses across NICs.

    RAID =  any suitable software or hardware based RAID mechanism that provides both redundancy and performance.

    One can opt to provide high availability for the DAS alone or for the node agents as well. The choice depends on the environment. Scalability of the node agents is not a problem with high availability deployments, since multiple node agents can be deployed on a single Solaris Cluster installation. These node agents are configured in multiple resource groups, with each resource group having a single logical host, an HAStoragePlus resource, and a node agent resource. Since node agents are spread over multiple nodes in a normal deployment, there is no need for additional hardware just because a highly available architecture is being used. Storage can be made redundant with either software or hardware based RAID.

    Solaris Cluster Failover Steps in Case of a Failure

    Solaris Cluster provides a set of sophisticated algorithms that are applied to determine whether to restart an application or to fail over to the redundant node. Typically the IP address, the file system on which the application binaries and data reside, and the application resource itself are grouped into a logical entity called a resource group (RG). As the name implies, the IP address, file system, and application itself are viewed as resources, and each one of them is identified by a resource type (RT), typically referred to as an agent. The recovery mechanism, i.e. restart or failover to another node, is determined based on a combination of timeouts, number of restarts, and history of failovers. An agent typically has start, stop, and validate methods that are used to start, stop, and verify prerequisites every time the application changes state. It also includes a probe which is executed at a predetermined interval to determine application availability.
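
The restart-or-failover decision described above can be pictured with a small Python sketch. This is a deliberately simplified model and not the Solaris Cluster algorithm: it only counts recent restarts inside a sliding time window and ignores probe timeouts and failover history. The class name and the parameter names `retry_count` and `retry_interval` are assumptions chosen for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RecoveryPolicy:
    retry_count: int        # restarts allowed inside the window
    retry_interval: float   # window length in seconds
    restart_times: list = field(default_factory=list)

    def on_failure(self, now: float) -> str:
        """Return "restart" or "failover" for a probe failure seen at `now`."""
        window_start = now - self.retry_interval
        # Keep only the restarts that are still inside the sliding window.
        self.restart_times = [t for t in self.restart_times if t > window_start]
        if len(self.restart_times) < self.retry_count:
            self.restart_times.append(now)
            return "restart"
        # Too many recent restarts: give up on this node.
        self.restart_times.clear()  # this sketch resets history after a move
        return "failover"
```

With `RecoveryPolicy(retry_count=2, retry_interval=300)`, two failures in quick succession are answered with local restarts and a third one inside the same five minutes triggers a failover, whereas failures spaced further apart than the window keep being handled by restarts.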

    Solaris Cluster has two RTs or agents for the Sun Java System Application Server. The resource type SUNW.jsas is used for the DAS, and SUNW.jsas_na for the node agent. The probe mechanism involves executing the “asadmin list-domains” and “asadmin list-node-agents” commands and interpreting the output to determine whether the DAS and the node agents are in the desired state. The Application Server software, file system, and IP address are moved to the redundant node in case of a failover. Please refer to the Sun Cluster Data Service guide for more details.

    The following is a simple illustration of a failover in case of a server crash.

    In the previously mentioned setup, Application Server is not failed over to the second node if one of the NICs alone fails. The redundant NIC, which is part of the same IPMP group, hosts the logical host that the DAS and NA use. A temporary network delay will be noticed until the logical host is moved from nic1 to nic2.

    The Global File System (GFS) is recommended for Application Server deployments, since there is very little write activity, other than logs, on the file system in which the configuration files and, in specific deployments, the binaries are installed. Because GFS is always mounted on all nodes, it results in better failover times and quicker startup of Application Server in case of a node crash or similar problems.

    Maintenance and Upgrades

    The same properties that help Solaris Cluster provide recovery during failures can be used to provide service continuity in case of maintenance and upgrade work. 

    During any planned OS maintenance or upgrade, the RGs are switched over to the redundant node and the node that needs maintenance is rebooted into non-cluster mode. The planned actions are performed and the node is then rebooted into the cluster.  The same procedure can be repeated for all the remaining nodes of the cluster.
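
The node-by-node procedure above can be written down as an ordered plan. The following Python sketch merely restates the steps from the paragraph for an arbitrary set of nodes; it does not invoke any cluster commands, and the function name, node names, and resource group names are hypothetical.

```python
def rolling_maintenance_plan(nodes, placement):
    """Return the ordered steps for maintaining every node in turn.

    `nodes` is a list of node names; `placement` maps each resource
    group (RG) name to the node currently hosting it.
    """
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to keep services online")
    placement = dict(placement)  # do not mutate the caller's mapping
    plan = []
    for i, node in enumerate(nodes):
        standby = nodes[(i + 1) % len(nodes)]
        # Move every RG off the node before it leaves the cluster.
        for rg in sorted(rg for rg, host in placement.items() if host == node):
            plan.append(f"switch {rg} from {node} to {standby}")
            placement[rg] = standby
        plan.append(f"reboot {node} into non-cluster mode")
        plan.append(f"perform maintenance on {node}")
        plan.append(f"reboot {node} into the cluster")
    return plan
```

Calling `rolling_maintenance_plan(["node1", "node2"], {"das-rg": "node1", "na-rg": "node2"})` yields the switchover of das-rg first, then the three maintenance steps for node1, and only then the evacuation and maintenance of node2, so no resource group is ever hosted on a node that is out of the cluster.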

    Application Server maintenance or upgrade depends on the way in which the binaries and the data and  configuration files are stored. 

    1.) Storing the binaries on the node's internal hard disk and storing the domain and node agent related files on the shared storage. This method is preferable for environments in which frequent updates are necessary. The downside is the possibility of inconsistency in the application binaries, due to differences in patches or upgrades.

    2.) Storing both the binaries and the data on the shared storage. This method provides consistent data at all times but makes upgrades and maintenance without outages difficult.

    The choice has to be made by taking into account the procedures and processes followed in the organization.

    Other Features

    Solaris Cluster also provides features that can be used for co-locating services based on the concept of affinities. For example, you can use negative affinity to evacuate the test environment when a production environment is switched to a node or use positive affinity to move the Application Server resources to the same node on which database server is hosted for better performance etc.

    Solaris Cluster has an easy-to-use and intuitive GUI management tool called Sun Cluster Manager, which can be used to perform most management tasks.

    Solaris Cluster has an inbuilt telemetry feature that can be used to monitor the usage of resources like CPU, memory, etc.

    Sun Java Application server doesn't require any modification for Solaris Cluster as the agent is designed with this scenario in mind.

    The same agent can be used for Glassfish as well.

    The Message Queue Broker can be made highly available as well with the HA  for Sun Java Message Queue agent.

    Consistent with Sun's philosophy, the product is being open sourced in phases and the agents are already available under the CDDL license.

    An open source product based on the same code base is available for OpenSolaris releases called Open High Availability Cluster.  For more details on the product and community, please visit .

    The open-source product also has a comprehensive test suite that helps users test their deployment satisfactorily.  For more details, please read


    For mission-critical environments, availability against all types of failures is a very important criterion.  Solaris Cluster is best designed to provide the highest availability for  Application Server by virtue of its integration with Solaris OS, stability, and having an agent specifically designed for Sun Java System Application Server.

    Madhan Kumar
    Solaris Cluster Engineering


    Oracle Solaris Cluster Engineering Blog

