Monday Feb 09, 2009

Software Quorum

This blog post describes a new feature called software quorum, which is introduced in Solaris Cluster 3.2 1/09.

Let us start with a simple explanation of the concept of quorum.

A cluster can potentially break up into multiple partitions due to the complete failure of all paths across the private interconnects for any pair of cluster nodes. In such scenarios, it is imperative for the cluster software to ensure that only one of those partitions survives as the running cluster, and all other partitions kill themselves. This is essential to ensure data integrity on storage shared between cluster nodes, and to ensure that only one set of cluster nodes functions as the cluster hosting services and applications.

How do we decide which partition survives? There is a concept of 'quorum' - quite similar to the usual meaning of the word. Each cluster node is given a vote. If the sum of the votes given to cluster nodes is V, a cluster partition can survive as a running cluster only when the nodes present in the partition hold a majority of the total votes, in other words more than V/2. Since votes are whole numbers, that means at least floor(V/2) + 1 votes: 2 out of 3, or 3 out of 4.
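The majority rule can be written down in a couple of lines. This is only an illustrative sketch, not Solaris Cluster code: the function names are made up for this post, and it reads "more than half" as integer arithmetic, floor(V/2) + 1.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest vote count that is strictly more than half of total_votes."""
    return total_votes // 2 + 1


def can_survive(partition_votes: int, total_votes: int) -> bool:
    """A partition may stay up only if it holds a majority of all configured votes."""
    return partition_votes >= votes_needed(total_votes)


# Three nodes with one vote each: a 2-node partition survives, a 1-node one does not.
print(votes_needed(3))      # 2
print(can_survive(2, 3))    # True
print(can_survive(1, 3))    # False
```

Note that with an even total, such as 4, two equal halves of 2 votes each both fall short of the 3 votes needed.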

What happens if two partitions have the same number of votes? Consider a simple example: when network communication fails in a 2-node cluster, each node wants to survive as an independent cluster. We could let neither partition form a cluster, but we want high availability as well. So we have the concept of a quorum device (think of it as a 'token' for simplicity). In such an equal-partitions case, the partition that holds the token forms the surviving cluster. The idea is that all partitions 'race' to get the quorum device ('token'), but only one partition can win the race and survive; the other partition shuts itself down. That is the basic idea of a quorum device. There are many complications in practice, but let's leave it at that simple explanation.
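Here is a small sketch of how the 'token' breaks a tie. It assumes the quorum device contributes votes of its own to whichever partition wins the race; the function and parameter names are hypothetical and chosen only for illustration.

```python
def surviving_partition(partitions, device_votes, device_winner):
    """Return the name of the partition that reaches a majority, or None.

    partitions:    dict mapping partition name -> node votes in that partition
    device_votes:  votes contributed by the quorum device (0 if there is none)
    device_winner: name of the partition that won the race for the device
    """
    total = sum(partitions.values()) + device_votes
    needed = total // 2 + 1
    for name, node_votes in partitions.items():
        votes = node_votes + (device_votes if name == device_winner else 0)
        if votes >= needed:
            return name
    return None


# 2-node cluster split into two 1-node partitions.
split = {"A": 1, "B": 1}
print(surviving_partition(split, device_votes=0, device_winner=None))  # None
print(surviving_partition(split, device_votes=1, device_winner="B"))   # B
```

Without a quorum device neither half reaches the 2 votes it needs; with a one-vote device, the partition that wins the race holds 2 of 3 votes and survives.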

Now what do we use as a quorum device? Note that the term 'device' does not necessarily mean a disk. There are various entities that could constitute a quorum device. Apart from shared disks, Solaris Cluster can use NAS units as quorum devices. An external machine can also be configured to run a quorum server for Solaris Cluster, and the quorum server then serves as the quorum device for the cluster. But we will not discuss those as part of this blog post.

Solaris Cluster can use a shared disk as a quorum device. But there is a caveat. Traditionally, Solaris Cluster could not use just any shared disk as a quorum device. For the quorum device acquisition 'race', Solaris Cluster uses SCSI reservation protocols on the shared disk. Over the years, we have encountered multiple cases where a device did not correctly support the SCSI reservation protocols. These configurations need an alternative mechanism to support their shared disks as quorum devices. Today, there are also disks that do not support SCSI reservations at all, which is another reason for having an alternative mechanism to support a shared disk as a quorum device.

The software quorum feature, introduced in Solaris Cluster 3.2 1/09, addresses this need. The Solaris Cluster software quorum protocol completely supports all aspects of quorum device behavior for a shared disk, and does so entirely in software without any SCSI reservation-related operations.

So how does software quorum work? The Solaris disk driver gives Solaris Cluster exclusive use of 65 sectors on the disk. Traditionally, these sectors have been used to implement what is called Persistent Group Reservation emulation in Solaris Cluster software (more on that in a separate post). The software quorum subsystem reuses this reserved space on shared disks for storing quorum-related information (lock data, registration and reservation keys) and performs block reads and writes to access this information. Using these reserved sectors, software quorum essentially implements the Persistent Group Reservation (PGR) emulation algorithm in software, without using SCSI reservation operations for shared-disk access control. During cluster reconfiguration, different cluster nodes (possibly in different partitions) attempt to access (read, write, remove) registration and reservation keys on the quorum device in a distributed fashion, in their 'race' to acquire the quorum device. At any time, only one node can manipulate (write, preempt) the keys on the quorum device. Software quorum ensures proper mutual exclusion on concurrent reservation data access (read, write).
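To make the 'race' a bit more concrete, here is a toy in-memory model of it. It is a deliberately simplified sketch and not the actual Solaris Cluster algorithm or on-disk format: the class `ToyQuorumDisk`, its methods, and the use of a thread lock to stand in for the mutual exclusion on the reserved sectors are all assumptions made for illustration.

```python
import threading


class ToyQuorumDisk:
    """In-memory stand-in for the reserved sectors; NOT the real on-disk layout."""

    def __init__(self):
        self._lock = threading.Lock()   # models mutual exclusion on key access
        self.registered = set()         # registration keys, one per node
        self.reservation = None         # key of the current owner, if any

    def register(self, key):
        with self._lock:
            self.registered.add(key)

    def try_acquire(self, my_key, other_keys):
        """Remove the keys of unreachable nodes and take the reservation.

        Fails if my_key has already been removed by a competing node.
        """
        with self._lock:
            if my_key not in self.registered:
                return False            # we were preempted: lost the race
            self.registered -= set(other_keys)
            self.reservation = my_key
            return True


def race(disk, keys):
    """Let every node in keys try to acquire the disk concurrently."""
    results = {}

    def contender(key):
        others = [k for k in keys if k != key]
        results[key] = disk.try_acquire(key, others)

    threads = [threading.Thread(target=contender, args=(k,)) for k in keys]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


disk = ToyQuorumDisk()
for node in ("node1", "node2"):
    disk.register(node)
print(race(disk, ["node1", "node2"]))   # exactly one node reports True
```

Whichever node gets to the keys first removes the other node's registration key, so the loser finds its own key gone and knows it must not form a cluster. Which node wins depends on timing, just as in a real race.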

To summarize this concept in a few words: Any Sun-supported disk that is connected to multiple nodes in a cluster can serve as a software-quorum disk. To highlight the advantages further, notice that the disk you choose need not support SCSI reservation protocols - it might well be a SATA disk. So if you do not have a shared disk that supports SCSI reservation protocols and you do not have an external machine to use as a quorum server, then you now have the flexibility to choose any Sun-supported shared disk.

To use a shared disk as a software-quorum disk, SCSI fencing must be disabled on that disk. The Solaris Cluster device-management framework uses SCSI fencing to control access to shared disks from cluster nodes and external machines. There are certain configurations where users want to turn off fencing for some shared disks. A new feature called optional fencing, which comes with Solaris Cluster 3.2 1/09, provides users with a simple method to achieve this flexibility. We will look at the optional fencing feature in another blog post. For the purpose of software quorum, it is enough to know that optional fencing provides a simple command (shown in the example below) to turn off SCSI fencing for the shared disk that is to be used as a software-quorum disk.

Now let's see an example of how to configure such a software-quorum disk. Let us assume that d4 is the shared disk that is connected to multiple cluster nodes, and you want to use d4 as a software-quorum disk.

(1) First disable SCSI fencing for the shared disk, from one cluster node:

schost1$ cldevice set -p default_fencing=nofencing d4
Updating shared devices on node 1
Updating shared devices on node 2
Updating shared devices on node 3
Updating shared devices on node 4

Let's check whether that worked:

schost1$ cldevice show d4
=== DID Device Instances ===

DID Device Name:                                /dev/did/rdsk/d4
 Full Device Path:                                schost1:/dev/rdsk/c4t600C0FF00000000009280F30888FB604d0
 Full Device Path:                                schost2:/dev/rdsk/c4t600C0FF00000000009280F30888FB604d0
 Full Device Path:                                schost3:/dev/rdsk/c4t600C0FF00000000009280F30888FB604d0
 Full Device Path:                                schost4:/dev/rdsk/c4t600C0FF00000000009280F30888FB604d0
 Replication:                                     none
 default_fencing:                                 nofencing

Notice that the 'default_fencing' parameter above for d4 is set to 'nofencing', which means SCSI fencing is disabled for d4.

(2) Now from one cluster node, let's add d4 as a quorum device. The cluster software will automatically detect that d4 is set to use 'nofencing', and will therefore use the software quorum protocol for d4.

schost1$ clquorum add d4

Let's check what the quorum configuration for d4 shows:

schost1$ clquorum show d4
=== Quorum Devices ===

Quorum Device Name:                             d4
 Enabled:                                         yes
 Votes:                                           3
 Global Name:                                     /dev/did/rdsk/d4s2
 Type:                                            shared_disk
 Access Mode:                                     sq_disk
 Hosts (enabled):                                 schost1, schost2, schost3, schost4

Notice that the 'Access Mode' parameter above for d4 is set to 'sq_disk', which is a short form for 'software-quorum disk'. Let's check the quorum status:

schost1$ clquorum status
=== Cluster Quorum ===

--- Quorum Votes Summary ---

           Needed   Present   Possible
           ------   -------   --------
           4        7         7

--- Quorum Votes by Node ---

Node Name       Present       Possible       Status
---------       -------       --------       ------
schost1         1             1              Online
schost2         1             1              Online
schost3         1             1              Online
schost4         1             1              Online

--- Quorum Votes by Device ---

Device Name       Present      Possible      Status
-----------       -------      --------      ------
d4                3            3             Online

That's it! You have configured d4 as a software-quorum disk; it functions as any other traditional quorum device.
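As a quick sanity check on the numbers in the status output above, the following sketch recomputes the vote summary from the per-node and per-device votes. The helper name `vote_summary` is made up for this post; the vote values are the ones shown in the example.

```python
def vote_summary(node_votes, device_votes):
    """Return (needed, possible) for the given per-node and per-device votes."""
    possible = sum(node_votes.values()) + sum(device_votes.values())
    needed = possible // 2 + 1
    return needed, possible


nodes = {"schost1": 1, "schost2": 1, "schost3": 1, "schost4": 1}
devices = {"d4": 3}

needed, possible = vote_summary(nodes, devices)
print(needed, possible)             # 4 7, matching the Needed and Possible columns

# A single surviving node that wins d4 holds 1 + 3 = 4 votes, which is enough.
print(1 + devices["d4"] >= needed)  # True
# Two nodes without d4 hold only 2 votes, which is not.
print(2 >= needed)                  # False
```

With three votes on d4, even a lone node that wins the quorum device can keep the cluster running, while any partition that loses the race cannot reach the four votes needed.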

So, don't ignore non-SCSI disks anymore - you can use them as software-quorum disks.

Sambit Nayak

Solaris Cluster Engineering


Oracle Solaris Cluster Engineering Blog

