Tuesday Feb 24, 2009

Eating our own dog food

You have heard the saying "Practice what you preach", and here at Solaris Cluster Oasis we often talk about how important high availability is for your critical applications. Beyond just the good sense of using our own products, there is no substitute for actually using your own product day in and day out. It gives us engineers a very important dose of reality, in that any problems with the product have a direct impact on our own daily functioning. That raises the question: How is the Solaris Cluster group dealing with its own high availability needs?

In this blog entry we teamed up with our Solaris Community Labs team to give our regular visitors to Oasis a peek into how SC plays a role in running key pieces of our own internal infrastructure. While a lot of Sun internal infrastructure uses Solaris Cluster, for the purpose of this blog entry we ended up choosing one of the internal clusters that is used directly by the Solaris Cluster Engineering team for their own home directories (yes, that is right, home directories, where all of our stuff lives, are on Solaris Cluster) and developer source code trees.

See below for a block diagram of the cluster; continue after the diagram for more details about the configuration.


Here are some more specifications of the Cluster:

- Two T2000 servers

- Storage consists of four 6140s presenting RAID5 LUNs. We chose the 6140s to provide RAID, partly because they were there and partly to leverage the disk cache on these boxes to improve performance

- Two Zpools configured as RAID 1+0, one for home directories and another for workspaces (workspace is engineer-speak for source code tree)

- Running S10U5 (5/08) and SC3.2U1 (2/08)

High availability was a key requirement for this deployment, as downtime for a home directory server with a large number of users was simply not an option. For the developer source code too, downtime would mean that long-running source code builds would have to be restarted, leading to a costly loss of time. Not to mention that having lots of very annoyed developers roaming the corridors of your workplace is never a good thing :-)

Note that it is not sufficient to merely move the NFS services from one node to the other during a failover. One has to make sure that any client state (including file locks) is failed over as well. This ensures that the clients truly don't see any impact (apart from perhaps a momentary pause). Additionally, deploying different Zpools on different cluster nodes means that the compute power of both nodes is utilized when both are up, while we continue to provide services when one of them is down.

Not only did the users benefit from the high availability, but the cluster administrators gained maintenance flexibility. Recently, the SAN fabric connected to this cluster was migrated from 2 Gbps to 4 Gbps, and a firmware update (performed in single-user mode) was needed on the Fibre Channel host bus adapters (FC-HBAs). The work was completed without impacting services, and the users never noticed. This was achieved simply by moving one of the Zpools (along with the associated NFS shares and HA IP addresses) from one node to the other (with a simple click in the GUI) and upgrading the FC-HBA firmware on the node that was now idle. Once that update was complete, we repeated the same steps for the other node, and the work was done!

While the above sounds useful for sure, we think there is a subtler point here, that of "confidence in the product". Allow us to explain: while doing a hardware upgrade on a live production system as described above is interesting and useful, what is really important is the ability of the system administrator to do this without taking a planned outage. That is only possible if the administrator has full confidence that, no matter what, the applications will keep running and the end users will not be impacted. That is the TRUE value of having a rock solid product.

We hope our readers found this example useful. We are happy to report that the cluster has been performing very well and we haven't (yet) had episodes of angry engineers roaming our corridors. Touch wood!

During the course of writing this blog entry, I got curious about the origins of the phrase "Eating one's own dog food". Some googling led me to this page. Apparently the phrase has its origins in TV advertising and came over into IT jargon via Microsoft. Interesting...

Ashutosh Tripathi - Solaris Cluster Engineering

Rob Lagunas: Solaris Community Labs

Monday Feb 09, 2009

Software Quorum

This blog post describes a new feature called software quorum, which is introduced in Solaris Cluster 3.2 1/09.

To explain things well, let us start with a simple explanation of the concept of quorum.

A cluster can potentially break up into multiple partitions due to the complete failure of all paths across the private interconnects for any pair of cluster nodes. In such scenarios, it is imperative for the cluster software to ensure that only one of those partitions survives as the running cluster, and all other partitions kill themselves. This is essential to ensure data integrity on storage shared between cluster nodes, and to ensure that only one set of cluster nodes functions as the cluster hosting services and applications.

How do we decide which partition survives? There is a concept of 'quorum', quite similar to the usual meaning of the word. Each cluster node is given a vote. If the sum of the votes given to cluster nodes is V, a cluster partition can survive as a running cluster only when the sum of the votes of the nodes present in the partition is strictly more than half of the total votes V, in other words at least floor(V/2) + 1 votes. For example, with V = 7 a partition needs at least 4 votes.
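The majority rule can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function names are made up for this example and are not part of Solaris Cluster.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest whole number of votes that is strictly more than half of total_votes."""
    return total_votes // 2 + 1


def partition_survives(partition_votes: int, total_votes: int) -> bool:
    """True when the partition holds a strict majority of all configured votes."""
    return partition_votes >= votes_needed(total_votes)


# A 3-node cluster with one vote per node: V = 3, so 2 votes are needed.
print(votes_needed(3))           # 2
print(partition_survives(2, 3))  # True
print(partition_survives(1, 3))  # False
```

Note that with an even total, such as V = 2, each half of an even split holds 1 vote while 2 are needed, so neither half survives on node votes alone.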

What happens if two partitions have the same number of votes? Consider a simple example. Each node of a 2-node cluster wants to survive as an independent cluster, when network communications fail. We could say let neither partition form a cluster, but we want high availability as well. So we have the concept of a quorum device (think of it as a 'token' for simplification). We say in such an equal-partitions case, the partition that holds the token should form the surviving cluster. So the idea is: all partitions 'race' to get the quorum device ('token'), but only one partition can win the race and survive. The other partition simply commits suicide. That is the basic simplistic idea of a quorum device. There are many complications in practice, but let's leave it at that simple explanation.
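Here is a small Python sketch of the tie-break for the 2-node case. It assumes the quorum device carries one vote and that the winner of the race is simply handed to the function; the names (surviving_partitions, race_winner) are hypothetical, and the real race is of course decided on the device itself.

```python
def surviving_partitions(partitions, node_votes, device_votes, race_winner):
    """Return the partitions that reach a strict majority of all votes.

    partitions   -- list of partitions, each a list of node names
    node_votes   -- dict mapping node name to its vote count
    device_votes -- votes carried by the quorum device (0 if there is none)
    race_winner  -- index of the partition that won the race, or None
    """
    total = sum(node_votes.values()) + device_votes
    needed = total // 2 + 1
    survivors = []
    for index, nodes in enumerate(partitions):
        votes = sum(node_votes[name] for name in nodes)
        if index == race_winner:
            votes += device_votes  # only the race winner counts the device votes
        if votes >= needed:
            survivors.append(nodes)
    return survivors


node_votes = {"node1": 1, "node2": 1}
split = [["node1"], ["node2"]]

# Without a quorum device, neither half reaches the 2 votes needed out of 2.
print(surviving_partitions(split, node_votes, device_votes=0, race_winner=None))  # []

# With a 1-vote quorum device, the race winner holds 2 of 3 votes and survives.
print(surviving_partitions(split, node_votes, device_votes=1, race_winner=0))  # [['node1']]
```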

Now what do we use as a quorum device? Note that the term 'device' does not mean a disk necessarily. There are various entities that could constitute a quorum device. Apart from shared disks, Solaris Cluster can use NAS units as quorum devices. An external machine can be configured to run a quorum server for Solaris Cluster, and the quorum server will serve the purpose of a quorum device for the cluster. But we will not discuss those as part of this blog post.

Solaris Cluster can use a shared disk as a quorum device. But there is a caveat. Traditionally, Solaris Cluster could not use just any shared disk as a quorum device. For the quorum device acquisition 'race', Solaris Cluster uses SCSI reservation protocols on the shared disk. Over the years, we have encountered multiple cases where a device did not correctly support the SCSI reservation protocols. These configurations need an alternative mechanism to support their shared disks as quorum devices. Today, there are disks that do not support the SCSI reservations at all. So here is another reason for having an alternative mechanism to support the shared disk as a quorum device.

The software quorum feature, introduced in Solaris Cluster 3.2 1/09, addresses this need. The Solaris Cluster software quorum protocol completely supports all aspects of quorum device behavior for a shared disk, and does so entirely in software without any SCSI reservation-related operations.

So how does software quorum work? Solaris Cluster has exclusive use of 65 sectors on the disk from the Solaris disk driver. Traditionally, these sectors have been used to implement what is called Persistent Group Reservation emulation in Solaris Cluster software (more on that in a separate post). The software quorum subsystem reuses the reserved space on shared disks for storing quorum-related information (lock data, registration and reservation keys) and performs block reads and writes to access this information. Using these reserved sectors, software quorum essentially implements the Persistent Group Reservation (PGR) emulation algorithm in software, without using SCSI reservation operations for shared-disk access control. During cluster reconfiguration, different cluster nodes (possibly in different partitions) attempt to access (read, write, remove) registration and reservation keys on the quorum device in a distributed fashion, in their 'race' to acquire the quorum device. At any time, only one node can manipulate (write, preempt) the keys on the quorum device. Software quorum ensures proper mutual exclusion on concurrent reservation data access (read, write).
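To make the race a little more concrete, here is a toy Python model of it. This is emphatically not the Solaris Cluster implementation: the real feature keeps its lock data and keys in the reserved disk sectors and coordinates through block reads and writes, whereas this sketch uses an in-process lock and a set in memory. The class and method names are invented for the illustration.

```python
import threading


class ToyQuorumDisk:
    """In-memory stand-in for the reserved quorum area on a shared disk."""

    def __init__(self, registered_nodes):
        self._lock = threading.Lock()            # stands in for the on-disk lock data
        self.registration_keys = set(registered_nodes)
        self.reservation_holder = None

    def try_acquire(self, node, partition_members):
        """Race for the device on behalf of a partition; return True on a win."""
        with self._lock:                         # one node manipulates the keys at a time
            if node not in self.registration_keys:
                return False                     # our key was already preempted: we lost
            # Preempt (remove) the keys of every node outside our partition.
            self.registration_keys &= set(partition_members)
            self.reservation_holder = node
            return True


disk = ToyQuorumDisk(["node1", "node2"])
print(disk.try_acquire("node1", ["node1"]))  # True: node1 gets there first
print(disk.try_acquire("node2", ["node2"]))  # False: node2's key is gone
```

The point of the sketch is only the ordering: whichever node gets exclusive access first removes the other partition's keys, so the latecomer finds its own key missing and knows it has lost.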

To summarize this concept in a few words: Any Sun-supported disk that is connected to multiple nodes in a cluster can serve as a software-quorum disk. To highlight the advantages further, notice that the disk you choose need not support SCSI reservation protocols - it might well be a SATA disk. So if you do not have a shared disk that supports SCSI reservation protocols and you do not have an external machine to use as a quorum server, then you now have the flexibility to choose any Sun-supported shared disk.

To use a shared disk as a software-quorum disk, SCSI fencing should be disabled on the shared disk. The Solaris Cluster device-management framework uses SCSI fencing to control access to the shared disks from cluster nodes and external machines. There are certain configurations where users want to turn off fencing for some shared disks. A new feature called optional fencing, which comes with Solaris Cluster 3.2 1/09, provides users with a simple method to achieve this flexibility. We will look at the optional fencing feature in another blog post. For the purpose of software quorum, all we need to know is that optional fencing provides a simple command (shown in the example below) to turn off SCSI fencing for the shared disk that is to be used as a software-quorum disk.

Now let's see an example of how to configure such a software-quorum disk. Let us assume that d4 is the shared disk that is connected to multiple cluster nodes, and you want to use d4 as a software-quorum disk.

(1) First disable SCSI fencing for the shared disk, from one cluster node:

schost1$ cldevice set -p default_fencing=nofencing d4
Updating shared devices on node 1
Updating shared devices on node 2
Updating shared devices on node 3
Updating shared devices on node 4

Let's check whether that worked fine:

schost1$ cldevice show d4
=== DID Device Instances ===
DID Device Name:                                /dev/did/rdsk/d4
 Full Device Path:                                schost1:/dev/rdsk/c4t600C0FF00000000009280F30888FB604d0
 Full Device Path:                                schost2:/dev/rdsk/c4t600C0FF00000000009280F30888FB604d0
 Full Device Path:                                schost3:/dev/rdsk/c4t600C0FF00000000009280F30888FB604d0
 Full Device Path:                                schost4:/dev/rdsk/c4t600C0FF00000000009280F30888FB604d0
 Replication:                                     none
 default_fencing:                                 nofencing

Notice that the 'default_fencing' parameter above for d4 is set to 'nofencing', which means SCSI fencing is disabled for d4.

(2) Now from one cluster node, let's add d4 as a quorum device. The cluster software will automatically detect that d4 is set to use 'nofencing', and hence the cluster software should use software quorum protocol for d4.

schost1$ clquorum add d4

Let's check what the quorum configuration shows:

schost1$ clquorum show d4
=== Quorum Devices ===
Quorum Device Name:                             d4
 Enabled:                                         yes
 Votes:                                           3
 Global Name:                                     /dev/did/rdsk/d4s2
 Type:                                            shared_disk
 Access Mode:                                     sq_disk
 Hosts (enabled):                                 schost1, schost2, schost3, schost4

Notice that the 'Access Mode' parameter above for d4 is set to 'sq_disk', which is a short form for 'software-quorum disk'. Let's check the quorum status:

schost1$ clquorum status
=== Cluster Quorum ===
--- Quorum Votes Summary ---
           Needed   Present   Possible
           ------   -------   --------
           4        7         7
--- Quorum Votes by Node ---
Node Name       Present       Possible       Status
---------       -------       --------       ------
schost1         1             1              Online
schost2         1             1              Online
schost3         1             1              Online
schost4         1             1              Online
--- Quorum Votes by Device ---
Device Name       Present      Possible      Status
-----------       -------      --------      ------
d4                3            3             Online

That's it! You have configured d4 as a software-quorum disk; it functions as any other traditional quorum device.

So, don't ignore non-SCSI disks anymore - you can use them as software-quorum disks.

Sambit Nayak

Solaris Cluster Engineering

Wednesday Feb 04, 2009

Zone Clusters

The Solaris(TM) Cluster 3.2 update 2 release, also called Sun Cluster 3.2 1/09, introduces a new feature called Zone Clusters, also known as Solaris Containers Clusters. This blog introduces the reader to Zone Clusters. Here you will find an overview that defines a Zone Cluster and identifies some important reasons why you would want to use one. Blogs should be short and concise, so this will be the introductory blog. I plan to provide a series of blogs, where each blog covers one important aspect of Zone Clusters. Subsequent blogs will cover the major use cases, a comparison of Zone Clusters with other zone solutions, and explanations of various aspects of the technologies that support a Zone Cluster.

Now let’s begin by defining the feature.

A Zone Cluster is a virtual cluster, where each virtual node is a non-global zone.

Thus we are entering a world where a set of machines (defined as something that can host an operating system image) can now support multiple clusters. Prior to this feature, there was exactly one cluster and we did not have a unique name for that kind of cluster. The original cluster type has as voting member nodes all of the global zones, which led us to apply the name Global Cluster to that kind of cluster. Starting with SC3.2 1/09 (also called update 2) there will always be exactly one Global Cluster on a set of machines that Sun Cluster software supports.

The same set of machines can optionally also support an arbitrary number of Zone Clusters concurrently. The number of Zone Clusters is limited by the amount of CPUs, memory, and other resources needed to support the applications in the Zone Clusters. Exactly one Solaris operating system instance and exactly one Sun Cluster instance support the one Global Cluster and all Zone Clusters. A Zone Cluster cannot be up unless the Global Cluster is also up. The Global Cluster does not contain the Zone Clusters. Each cluster has its own private name spaces for a variety of purposes, including application management.

A Zone Cluster appears to applications as a cluster dedicated for those applications. This same principle applies to administrators logged in to a Zone Cluster.

The Zone Cluster design follows the minimalist approach about what items are present. Those items that are not directly used by the applications running in that Zone Cluster are not available in that Zone Cluster.

A typical application A stores data in a file system F. The application needs a network resource N (authorized IP address and NIC combination) to communicate with clients. The Zone Cluster would contain just the application A, file system F, and network resource N. Normally, the storage device for the file system would not be present in that Zone Cluster.

Many people familiar with the Global Cluster will remember that the Global Cluster has other things, such as a quorum device. The Zone Cluster applications do not directly use the quorum device, so there is no quorum device in the Zone Cluster. When dealing with the Zone Cluster, the administrator can ignore quorum devices and other things that exist only in the Global Cluster.

The Zone Cluster design results in a much simpler cluster that greatly reduces administrative support costs.

A Zone Cluster provides the following major features:

  • Application Fault Isolation – A problem with an application in one Zone Cluster does not affect applications in other Zone Clusters. Those operations that might crash an entire machine are generally disallowed in a Zone Cluster. Some operations have been made safe. For example, a reboot operation in a Zone Cluster becomes a zone reboot. So even an action that can boot or halt one Zone Cluster will not affect another Zone Cluster.

  • Security Isolation – An application in one Zone Cluster cannot see and cannot affect resources not explicitly configured to be present in that specific Zone Cluster. A resource only appears in a Zone Cluster when the administrator explicitly configures that resource to be in that Zone Cluster.

  • Resource Management – The Solaris Resource Management facilities can operate at the granularity of the zone. We have made it possible to manage resources across the entire Zone Cluster. All of the facilities of Solaris Resource Management can be applied to a Zone Cluster. This includes controls on CPUs, memory, etc. This enables the administrator to manage Quality of Service and to control application license fees that are based upon the number of CPUs.

We recognize that administrators are overworked. So we designed Zone Clusters to reduce the amount of work that administrators must do. We provide a single command that can create/modify/destroy an entire Zone Cluster from any machine. This eliminates the need for the administrator to go to each machine to create the various zones.

Since a Zone Cluster is created after the creation of the Global Cluster, we use knowledge of the Global Cluster to further reduce administrative work. At this point we already know the configuration of the cluster private interconnect, and thus can automate the private interconnect set up for a Zone Cluster. We can specify reasonable default values for a wide range of parameters. For example, a Zone Cluster usually runs with the same time zone as the Global Cluster.

Once you have installed Sun Cluster 3.2 1/09 on Solaris 10 5/08 (also called update 5) or a later release, the Zone Cluster feature is ready to use. There is no need to install additional software. The Zone Cluster feature is maintained by the regular patches and updates for the Sun Cluster product.

So a Zone Cluster is a truly simplified cluster.

Now, let’s talk at a high level about why you would use a Zone Cluster.

Many organizations run multiple applications or multiple data bases. It has been common practice to place each application or data base on its own hardware. Figure 1 shows an example of three data bases running on different clusters.

Moore’s Law continues to apply to computers, and the industry continues to produce ever more powerful computers. The trend towards ever more powerful processors has been accompanied by increases in storage capacity, network bandwidth, etc. Along with greater power has come improved price/performance ratios. Over time, application processing demands have grown, but in many cases the application processing demands have grown at a much slower rate than that of the processing capacity of the system. The result is that many clusters now have considerable surplus processing capacity in all areas: processor, storage, and networking.

Such large amounts of idle processing capacity present an almost irresistible opportunity for better system utilization. Organizations seek ways to reclaim this unused capacity. Thus, they are choosing to host multiple cluster applications on a single cluster. However, concerns about interactions between cluster applications, especially in the areas of security and resource management, make people wary. Zone Clusters provide safe ways to host multiple cluster applications on a single cluster hardware configuration. Figure 2 shows the same data bases from the previous example now consolidated onto one set of cluster hardware using three Zone Clusters.

Zone Clusters can support a variety of use cases:

  • Data Base Consolidation – You can run separate data bases in separate Zone Clusters. We have run Oracle RAC 9i, RAC 10g, and RAC 11g in separate Zone Clusters on the same hardware concurrently.

  • Functional Consolidation – Test and development activities can occur concurrently while also being independent.

  • Multiple Application Consolidation – Zone Clusters can support applications generally. So you can run both data bases and also applications that work with data bases in the same or separate Zone Clusters. We will be announcing certification of other applications in Zone Clusters in the coming months.

  • License Fee Cost Containment – Resource controls can be used to control costs. There are many use cases where someone can save many tens of thousands of dollars per year. The savings are heavily dependent upon the use case.

    Here is an arbitrary example: the cluster runs two applications, where each application takes half of the CPU resources. The two applications come from different vendors, who each charge a license fee where: Total_Charge = Number_CPUs * Per_CPU_Charge. The administrator places each application in its own Zone Cluster with half the CPUs. This reduces the number of CPUs available to each application. The result is that the administrator has now reduced the Total Charge cost by 50%.
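The license fee arithmetic above can be checked with a short Python sketch. The CPU count and the per-CPU charge are invented numbers used only for illustration; only the formula comes from the example.

```python
def total_charge(number_cpus, per_cpu_charge):
    """Total_Charge = Number_CPUs * Per_CPU_Charge, as in the example above."""
    return number_cpus * per_cpu_charge


cluster_cpus = 8         # hypothetical number of CPUs in the cluster
per_cpu_charge = 10_000  # hypothetical per-CPU license fee for one vendor

# Application licensed for every CPU in the cluster.
charge_whole_cluster = total_charge(cluster_cpus, per_cpu_charge)

# Same application confined to a Zone Cluster that has half of the CPUs.
charge_zone_cluster = total_charge(cluster_cpus // 2, per_cpu_charge)

reduction = 1 - charge_zone_cluster / charge_whole_cluster
print(charge_whole_cluster, charge_zone_cluster, reduction)  # 80000 40000 0.5
```

Because the charge is linear in the number of CPUs, the 50% reduction holds for each vendor separately, whatever the actual per-CPU fee is.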

In future blogs, I plan to explain how to take the most advantage of Zone Cluster in these various use cases.

Please refer to this video blog that provides a long detailed explanation of Zone Cluster.

Dr. Ellard Roush

Technical Lead Solaris Cluster Infrastructure

Tuesday Feb 03, 2009

Business Continuity with Solaris Cluster Geographic Edition 3.2 1/09

In case you have not seen it yet, we just uploaded a new video clip on 'Business Continuity with Solaris Cluster Geographic Edition 3.2 1/09'. Click here to view the Flash version and click here to download the iPod version.

Jatin Jhala

Solaris Cluster 3.2U2 Release Lead

Thursday Jan 29, 2009

Working Across Boundaries: Solaris Cluster 3.2 1/09!

I am sure you have seen the recent blog post and the announcement of Solaris Cluster 3.2 1/09. This release has a cool set of features, which includes providing high availability and disaster recovery in virtualized environments. It is also exciting to see distributed applications like Oracle RAC run in separate virtual clusters! Some of the features integrated into this release were developed in the Open HA Cluster community. That is another first for us!

I am sure the engineers in the team will be writing blog journals in the coming weeks, detailing the features that have been developed. Stay tuned for more big things from this very energetic and enthusiastic team!

It is very interesting to see the product getting rave reviews from you, our customers. We value your feedback and take extra steps to make the product even better. We appreciate the acknowledgement! Here is one customer success story from EMBARQ that will catch your attention:
"Solaris Cluster provides a superior method of high availability. Our overall availability is 99.999%. No other solution has been able to ensure us with such tremendous uptime" - Kevin McBride, Network Systems Administrator III at EMBARQ

That is one big compliment! Thank you for your continued support and feedback!

A distributed team has made this release possible. Some of the features are large and needed coordinated effort across various boundaries (teams, organizations, continents)! A BIG thanks to the entire Cluster Team for their hard work, to get yet another quality product released. It is a great team!

-Meenakshi Kaul-Basu

Director, Availability Engineering


Oracle Solaris Cluster Engineering Blog

