News, tips, partners, and perspectives for the Oracle Solaris operating system

Software Quorum

Guest Author
This blog post describes software quorum, a new feature introduced in Solaris Cluster 3.2 1/09.

Let us start with a simple explanation of the concept of quorum.

A cluster can potentially break up into multiple partitions due to the complete failure of all paths across the private interconnects for any pair of cluster nodes. In such scenarios, it is imperative for the cluster software to ensure that only one of those partitions survives as the running cluster, and all other partitions kill themselves. This is essential to ensure data integrity on storage shared between cluster nodes, and to ensure that only one set of cluster nodes functions as the cluster hosting services and applications.

How do we decide which partition survives? There is a concept of 'quorum' - quite similar to the usual meaning of the word. Each cluster node is given a vote. If the sum of the votes given to cluster nodes is V, a cluster partition can survive as a running cluster only when the nodes present in the partition hold a strict majority of the total votes, in other words at least floor(V/2) + 1 votes.
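The majority rule above can be sketched in a few lines. This is purely illustrative (Python, with function names invented for this example); it is not Solaris Cluster code:

```python
# Quorum vote arithmetic: a partition survives only if it holds a
# strict majority of the total configured votes V, i.e. floor(V/2) + 1.

def votes_needed(total_votes):
    """Minimum votes a partition needs to survive as the cluster."""
    return total_votes // 2 + 1

def partition_survives(partition_votes, total_votes):
    return partition_votes >= votes_needed(total_votes)

# A 4-node cluster with one vote per node plus a 3-vote quorum
# device has V = 7 total votes, so a surviving partition needs 4.
print(votes_needed(7))            # 4
print(partition_survives(3, 7))   # False
print(partition_survives(4, 7))   # True
```

Note that with this rule two disjoint partitions can never both reach the threshold, which is exactly the data-integrity guarantee we want.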

What happens if two partitions have the same number of votes? Consider a simple example: when network communications fail in a 2-node cluster, each node wants to survive as an independent cluster. We could say that neither partition gets to form a cluster, but we also want high availability. So we have the concept of a quorum device (think of it as a 'token' for simplicity). In such an equal-partitions case, the partition that holds the token forms the surviving cluster. The idea is: all partitions 'race' to acquire the quorum device (the 'token'), but only one partition can win the race and survive. The other partition simply kills itself. That is the basic idea of a quorum device; there are many complications in practice, but this simple explanation will do for now.
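The tie-break race can be modeled as an atomic claim on a shared token: exactly one claimant succeeds. This is a toy model in Python (the class and method names are invented for illustration), not the actual Solaris Cluster implementation:

```python
import threading

# Toy model of the quorum-device "race": two equal partitions both
# try to claim the token; exactly one claim succeeds, the loser halts.

class QuorumToken:
    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None

    def claim(self, partition):
        # Atomically record the first claimant; later claims fail.
        with self._lock:
            if self._owner is None:
                self._owner = partition
                return True
            return False

token = QuorumToken()
print(token.claim("partition-A"))  # True  -> survives as the cluster
print(token.claim("partition-B"))  # False -> must halt itself
```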

Now what do we use as a quorum device? Note that the term 'device' does not mean a disk necessarily. There are various entities that could constitute a quorum device. Apart from shared disks, Solaris Cluster can use NAS units as quorum devices. An external machine can be configured to run a quorum server for Solaris Cluster, and the quorum server will serve the purpose of a quorum device for the cluster. But we will not discuss those as part of this blog post.

Solaris Cluster can use a shared disk as a quorum device, but there is a caveat: traditionally, Solaris Cluster could not use just any shared disk as a quorum device. For the quorum device acquisition 'race', Solaris Cluster uses SCSI reservation protocols on the shared disk. Over the years, we have encountered multiple cases where a device did not correctly support the SCSI reservation protocols; such configurations need an alternative mechanism before their shared disks can serve as quorum devices. Moreover, there are disks today that do not support SCSI reservations at all - another reason for an alternative mechanism.

The software quorum feature, introduced in Solaris Cluster 3.2 1/09, addresses this need. The Solaris Cluster software quorum protocol completely supports all aspects of quorum device behavior for a shared disk, and does so entirely in software without any SCSI reservation-related operations.

So how does software quorum work? Solaris Cluster has exclusive use of 65 sectors on the disk from the Solaris disk driver. Traditionally, these sectors have been used to implement what is called Persistent Group Reservation emulation in Solaris Cluster software (more on that in a separate post). The software quorum subsystem reuses the reserved space on shared disks for storing quorum-related information (lock data, registration and reservation keys) and performs block reads and writes to access this information. Using these reserved sectors, software quorum essentially implements the Persistent Group Reservation (PGR) emulation algorithm in software, without using SCSI reservation operations for shared-disk access control. During cluster reconfiguration, different cluster nodes (possibly in different partitions) attempt to access (read, write, remove) registration and reservation keys on the quorum device in a distributed fashion, in their 'race' to acquire the quorum device. At any time, only one node can manipulate (write, preempt) the keys on the quorum device. Software quorum ensures proper mutual exclusion on concurrent reservation data access (read, write).
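The registration/preemption scheme described above can be sketched as follows. This is a highly simplified Python illustration of a PGR-style algorithm, under the assumption that all key operations are mutually exclusive (as the text states); the class and function names are invented, and real Solaris Cluster code performs block I/O against the reserved sectors rather than in-memory updates:

```python
# Sketch of PGR-style emulation: registration keys are kept in the
# reserved area on the shared disk; the partition that preempts the
# other partition's keys owns the quorum device.

class ReservedArea:
    """Stands in for the reserved sectors; real code does block I/O."""
    def __init__(self):
        self.registration_keys = set()
        self.reservation_key = None

def register(area, node_key):
    area.registration_keys.add(node_key)

def preempt(area, my_key, victim_keys):
    # Only a registered node may preempt; remove the victims' keys
    # and take the reservation, as one mutually exclusive step.
    if my_key not in area.registration_keys:
        return False
    area.registration_keys -= set(victim_keys)
    area.reservation_key = my_key
    return True

area = ReservedArea()
for key in ("node1", "node2", "node3", "node4"):
    register(area, key)

# Partition {node1, node2} wins the race and preempts the others.
preempt(area, "node1", ["node3", "node4"])
print(sorted(area.registration_keys))  # ['node1', 'node2']
print(area.reservation_key)            # node1
```

A node whose key has been preempted can tell, on its next access, that it has lost the race and must leave the cluster.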

To summarize this concept in a few words: Any Sun-supported disk that is connected to multiple nodes in a cluster can serve as a software-quorum disk. To highlight the advantages further, notice that the disk you choose need not support SCSI reservation protocols - it might well be a SATA disk. So if you do not have a shared disk that supports SCSI reservation protocols and you do not have an external machine to use as a quorum server, then you now have the flexibility to choose any Sun-supported shared disk.

To use a shared disk as a software-quorum disk, SCSI fencing must be disabled on that disk. The Solaris Cluster device-management framework uses SCSI fencing to control access to shared disks from cluster nodes and external machines. There are certain configurations where users want to turn off fencing for some shared disks. A new feature called optional fencing, which also comes with Solaris Cluster 3.2 1/09, gives users a simple way to do this. We will look at the optional fencing feature in another blog post; for the purpose of software quorum, it is enough to know that optional fencing provides a simple command (shown in the example below) to turn off SCSI fencing for the shared disk to be used as a software-quorum disk.

Now let's see an example of how to configure such a software-quorum disk. Let us assume that d4 is the shared disk that is connected to multiple cluster nodes, and you want to use d4 as a software-quorum disk.

(1) First disable SCSI fencing for the shared disk, from one cluster node:

schost1$ cldevice set -p default_fencing=nofencing d4
Updating shared devices on node 1
Updating shared devices on node 2
Updating shared devices on node 3
Updating shared devices on node 4

Let's check whether that worked fine:

schost1$ cldevice show d4
=== DID Device Instances ===

DID Device Name:                                /dev/did/rdsk/d4
 Full Device Path:                                schost1:/dev/rdsk/c4t600C0FF00000000009280F30888FB604d0
 Full Device Path:                                schost2:/dev/rdsk/c4t600C0FF00000000009280F30888FB604d0
 Full Device Path:                                schost3:/dev/rdsk/c4t600C0FF00000000009280F30888FB604d0
 Full Device Path:                                schost4:/dev/rdsk/c4t600C0FF00000000009280F30888FB604d0
 Replication:                                     none
 default_fencing:                                 nofencing

Notice that the 'default_fencing' parameter above for d4 is set to 'nofencing', which means SCSI fencing is disabled for d4.

(2) Now, from one cluster node, let's add d4 as a quorum device. The cluster software will automatically detect that d4 is set to 'nofencing', and will therefore use the software quorum protocol for d4.

schost1$ clquorum add d4

Let's check the result:

schost1$ clquorum show d4
=== Quorum Devices ===

Quorum Device Name:                             d4
 Enabled:                                         yes
 Votes:                                           3
 Global Name:                                     /dev/did/rdsk/d4s2
 Type:                                            shared_disk
 Access Mode:                                     sq_disk
 Hosts (enabled):                                 schost1, schost2, schost3, schost4

Notice that the 'Access Mode' parameter above for d4 is set to 'sq_disk', which is a short form for 'software-quorum disk'. Let's check the quorum status:

schost1$ clquorum status
=== Cluster Quorum ===
--- Quorum Votes Summary ---
           Needed   Present   Possible
           ------   -------   --------
           4        7         7
--- Quorum Votes by Node ---
Node Name       Present       Possible       Status
---------       -------       --------       ------
schost1         1             1              Online
schost2         1             1              Online
schost3         1             1              Online
schost4         1             1              Online
--- Quorum Votes by Device ---
Device Name       Present      Possible      Status
-----------       -------      --------      ------
d4                3            3             Online

That's it! You have configured d4 as a software-quorum disk; it functions like any other traditional quorum device.

So, don't ignore non-SCSI disks anymore - you can use them as software-quorum disks.

Sambit Nayak

Solaris Cluster Engineering

Comments (7)
  • John Monday, February 9, 2009

    Does this allow for the use of zfs disks as quorum devices? I seem to remember something in previous releases about a disk in use by a zfs pool not having those 65 sectors due to the efi label.

  • Sambit Nayak Tuesday, February 10, 2009

    Hi John,

    A disk in use by a ZFS pool can still be used as a quorum device.

    The catch is this: changing the labelling of a disk from VTOC to EFI, or vice versa, modifies the partition information.

    With respect to quorum, this means: if a disk in use as a quorum disk is put into a ZFS pool (which can convert the labelling to EFI), then the partition information is modified; the quorum software does not know that the labelling has changed and will still try to read using the old offsets. That will fail.

    In other words, one should not change the label of a disk while it is being used as a quorum device.

    Configure a disk as quorum device after you finish changing the label of the disk.

    To summarize:

    - a disk that is already in a ZFS pool can be configured as a quorum device

    - if a disk is being used as a quorum device and you want to put the disk in a ZFS pool, then do the following:

    (i) unconfigure the quorum device

    (ii) add the disk to the zfs pool

    (iii) if you are not rebooting nodes before you do step (iv), then run each of the following commands, one by one, serially on each cluster node:

    - devfsadm (it's in /usr/sbin)

    - devfsadm -C

    - scdidadm -C (it's in /usr/cluster/bin)

    - scgdevs (it's in /usr/cluster/bin)

    These steps make the DID driver aware of the label change. If the nodes are rebooted, the software automatically runs these steps at boot time.

    (iv) configure the disk as a quorum disk

    Hoping that answers your query.

  • Jeroen Tuesday, February 10, 2009

    Would using the command 'cldevice populate' be the equivalent of scdidadm -C and scgdevs ? It's a little confusing with the mix of old/new command formats.

  • venku Tuesday, February 10, 2009

    The steps with the new CLI are:

    # devfsadm

    # devfsadm -C

    # cldevice clear

    # cldevice populate

    cldevice clear is equivalent to scdidadm -C, which cleans up any unused DID mappings.

  • Geoff Thursday, February 19, 2009

    Here's a somewhat related question... is there any work being done to implement this on a grander scale (software fencing)? It seems like the fundamentals of software quorum could be expanded to all non-scsi disks to provide fencing for sata devices.

  • Stephen Docy Thursday, February 19, 2009

    Software quorum does nothing to provide an alternative to device fencing, as it does not in any way prevent a 'fenced' node from performing I/O to the shared device. All software quorum can do is provide an alternative quorum configuration option for cluster setups in which our traditional fencing cannot be supported.

  • Geoff Thursday, February 19, 2009

    I understand that it doesn't fence the I/O. However, if all nodes attached to the shared disk are part of the cluster, it seems to me that these emulated SCSI reservations could be used to implement a software fencing much like the software quorum. Maybe I'm way off on that. If Sun Cluster takes part of the reserved disk space for the software quorum, can't another part be reserved for an access control that all connected cluster nodes recognize and support? It's not as good as hardware fencing, no. But it does provide some potential for non-SCSI devices.
