Availability Best Practices - using a mirrored ZFS pool with virtual disks

Previous blog entries emphasized service domain and path resiliency, rather than redundancy for disk media, though I illustrated it in Availability Best Practices - Example configuring a T5-8. In that blog entry, I configured two identically sized virtual disks, each in its own mpgroup pair, to be used as a ZFS mirrored root pool in the guest VM. That provides resiliency in case a disk fails and in case a path or service domain goes down.

The purpose of this example is slightly different: we will test a configuration that does not use mpgroup, but instead relies on resiliency provided by a ZFS mirrored pool. This is similar to bare-metal configurations in which ZFS is used to provide resiliency, but using virtual disks.

To test this, we configure a redundant 2-way ZFS mirrored pool, where each half of the pool comes from a different service domain and no mpgroup has been prepared. This will show the behavior to expect if a service domain is made unavailable, with and without specifying a timeout for the virtual disk.

We would like to see the guest VM handle the error and continue operating even if one side of the mirror becomes unavailable, and to learn how to configure the virtual disks to make this happen.

NOTE: For production purposes it is strongly recommended to use mpgroup and resilient media as shown in Availability Best Practices - Example configuring a T5-8. The method illustrated here avoids a single point of failure and provides resiliency for media, path and service domain, but it also causes a disk I/O timeout and requires manual intervention in the guest to clean things up. Use mpgroup with redundant media to provide both resiliency and continuous operation.

A simple ZFS mirror with vdisks from two service domains

In this test, we provision a guest domain with two virtual disks: one from the control domain, and one from an I/O and service domain 'alternate'. There's no mpgroup, just one disk from each service domain.

Two identically-sized LUNs will be used for this exercise, one each from the primary and alternate domains. We used the format command to determine the path names to the two disks, and then used ldm commands to define them as virtual disks and assign them to the guest domain mydom.

# ldm add-vdsdev /dev/dsk/c0t5000C5003330AC6Bd0s0 myguestdisk0@primary-vds0
# ldm add-vdsdev /dev/dsk/c0t5000C500332C1087d0s0 myguestdisk1@alternate-vds0
# ldm add-vdisk my0 myguestdisk0@primary-vds0 mydom
# ldm add-vdisk my1 myguestdisk1@alternate-vds0 mydom

The two disks were used to create a mirrored root pool when Solaris was installed in the domain (steps not shown). This creates a configuration that looks like the illustration below: two disks used for a mirrored ZFS root pool, with each half of the pair from a different service domain.

Production ZFS pools should always use redundancy, whether mirror or RAIDZn. ZFS detects errors by computing checksums on data and comparing them to stored checksums, and corrects errors if the ZFS pool has been configured for redundancy. That is a valuable feature that sets ZFS apart from most file systems.

With the default settings, when the alternate domain is stopped, guest I/O to its virtual disk is blocked, and resumes without complaint when alternate is brought back up. As stated in the Oracle VM Server for SPARC 3.1 Administration Guide:

By default, if the service domain providing access to a virtual disk back end is down, all I/O from the guest domain to the corresponding virtual disk is blocked. The I/O automatically is resumed when the service domain is operational and is servicing I/O requests to the virtual disk back end.

Guest-level error handling and disk timeouts

In some situations it is acceptable to block guest I/O until the service domain resumes, but in most cases we want the guest to continue processing. In particular, we must let the guest handle the failure and continue operation. To do this, we set a virtual disk timeout parameter so that guest I/O fails if the service is down, rather than waiting until it resumes. This generates an error if the virtual disk is unavailable past the duration of the timeout, and invokes error handling in the guest.

With mydom inactive, I issued:

# ldm set-vdisk timeout=5 my0 mydom
# ldm set-vdisk timeout=5 my1 mydom
# ldm list -l -o disk mydom

    NAME             VOLUME                      TOUT ID   DEVICE  SERVER         MPGROUP       
    my0              myguestdisk0@primary-vds0   5    1    disk@1  primary                      
    my1              myguestdisk1@alternate-vds0 5    2    disk@2  alternate

If we look at the disks from within the domain we see that all is in clean status:

# fmadm faulty
# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h4m with 0 errors on Mon Jul  1 13:38:11 2013

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c2d1s0  ONLINE       0     0     0
            c2d2s0  ONLINE       0     0     0

errors: No known data errors
# zpool status -x
all pools are healthy

Forcing an error

Now, let's reboot the alternate service domain. With the timeout set, the guest sees an error if the disk back end is unavailable for more than 5 seconds. (This doesn't imply that 5 seconds is the correct or optimal time; it's just the value I used in this example.) When the alternate service domain is stopped, the guest produces fmadm error messages and the following console error message, but continues operation.

# Jul  4 16:48:04 ldom147-123.us.oracle.com vdc: NOTICE: vdisk@2 disk access failed

SUNW-MSG-ID: ZFS-8000-NX, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Thu Jul  4 16:48:09 PDT 2013
PLATFORM: SPARC-T3-4, CSN: unknown, HOSTNAME: ldom147-123.us.oracle.com
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: 05b072d9-8a95-e1c7-e14d-949bf40c4e60
DESC: Probe of ZFS device 'c2d2s0' in pool 'rpool' has failed.
AUTO-RESPONSE: The device has been offlined and marked as faulted. An attempt will be made to activate a hot spare if available.
IMPACT: Fault tolerance of the pool may be compromised.
REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Run 'zpool status -lx' for more information. Please refer to the associated reference document at http://support.oracle.com/msg/ZFS-8000-NX for the latest service procedures and policies regarding this diagnosis.
Jul  4 16:48:56 ldom147-123.us.oracle.com last message repeated 10 time

Despite these error messages, the guest is operational - there is no interruption in application service. ZFS handles the missing volume in the mirrored root pool. Within the guest we use zpool status to display the error and show that one side of the mirror is unavailable:

# zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices are unavailable in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or 'fmadm repaired', or replace the device
        with 'zpool replace'.
        Run 'zpool status -v' to see device specific details.
  scan: scrub repaired 0 in 0h4m with 0 errors on Thu Jul  4 16:38:46 2013

        NAME        STATE     READ WRITE CKSUM
        rpool       DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            c2d1s0  ONLINE       0     0     0
            c2d2s0  UNAVAIL      0   111     0

errors: No known data errors

The important point is that we were able to tolerate loss of a service domain without a system or application outage in the domain. The ZFS pool has redundancy, and continues to provide I/O to the guest domain. That's the desired and expected behavior in this configuration.
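
A script can spot this condition by scanning the zpool status device table for anything that is not ONLINE. The following is only a minimal sketch, not a supported tool: the function name degraded_devices is hypothetical, and it assumes the column layout shown in the output above (NAME, STATE, READ, WRITE, CKSUM).

```shell
# Hypothetical helper: print "name state" for every pool/vdev row whose
# STATE is not ONLINE. Reads `zpool status` text on stdin and assumes the
# layout shown in this post.
degraded_devices() {
    awk '
        # The header row marks the start of the device table.
        $1 == "NAME" && $2 == "STATE" { in_table = 1; next }
        # A blank line or a "label:" line (for example "errors:") ends it.
        in_table && (NF == 0 || $1 ~ /:$/) { in_table = 0; next }
        # Inside the table, report every row that is not ONLINE.
        in_table && $2 != "ONLINE" { print $1, $2 }
    '
}
```

Inside the guest this could be run as `zpool status rpool | degraded_devices`. For the degraded pool shown above it would list rpool, mirror-0 and c2d2s0; for a healthy pool it prints nothing.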

Resuming normal operation

However, the ZFS pool is now in a degraded state, and would be vulnerable to another failure, either a failure on c2d1s0 or loss of the domain that services it. We want to restore the pool to a non-degraded state for performance and to restore its resiliency.

First, we bring back the alternate I/O domain, which restores access to the virtual disk. The output of zpool status is unchanged, but issuing zpool clear -f tells ZFS that the situation has been cleared up. ZFS now resilvers the disk to copy over any changes in disk content that occurred while it was unavailable. Once resilvering is complete, the pool is no longer in degraded mode, and has access to its full capacity and redundancy.
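
If you would rather not re-run zpool status by hand, the wait can be scripted. This is a rough sketch under stated assumptions: the function name wait_until_healthy and its two arguments (number of tries, seconds between tries) are made up for illustration, and it relies on zpool status -x printing exactly "all pools are healthy", as it does in the transcripts in this post. Note that it checks every pool in the guest, not only rpool.

```shell
# Hypothetical helper: poll `zpool status -x` until it reports a clean state.
# $1 = maximum number of checks (default 60)
# $2 = seconds to sleep between checks (default 10)
# Returns 0 once all pools are healthy, 1 if the checks run out first.
wait_until_healthy() {
    tries=${1:-60}
    interval=${2:-10}
    while [ "$tries" -gt 0 ]; do
        if [ "$(zpool status -x)" = "all pools are healthy" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$interval"
    done
    return 1
}
```

One possible use, after the alternate domain is back up, is `zpool clear -f rpool && wait_until_healthy 60 10`.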

Note that if you issue zpool status right after resilvering starts, or just after issuing a zpool scrub, the estimated completion time is much higher than it actually will be. For a more realistic time, wait a while after the scrub or resilvering starts before checking the status.
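
The reason is simple arithmetic: the estimate is roughly the pool's allocated data divided by the scan rate observed so far, and the rate is very low in the first second or so. As a back-of-the-envelope sketch (the helper name resilver_eta is hypothetical, and it assumes the G and K suffixes are binary units), here is the calculation for the figures in the first status output that follows, 22.2G at 474K/s:

```shell
# Hypothetical helper: estimate time to scan a pool.
# $1 = data to scan, in GiB; $2 = current scan rate, in KiB/s.
# Ignores the small amount already scanned.
resilver_eta() {
    awk -v total_gib="$1" -v rate_kib="$2" 'BEGIN {
        seconds = total_gib * 1024 * 1024 / rate_kib
        printf "%dh%dm\n", int(seconds / 3600), int((seconds % 3600) / 60)
    }'
}

resilver_eta 22.2 474    # prints 13h38m
```

That matches the "13h38m to go" reported by zpool status, even though the resilver actually finished within seconds because only about 9M of changed data had to be copied.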

# zpool clear -f rpool
# zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function in a degraded state.
action: Wait for the resilver to complete.
        Run 'zpool status -v' to see device specific details.
  scan: resilver in progress since Thu Jul  4 16:52:30 2013
    474K scanned out of 22.2G at 474K/s, 13h38m to go
    308K resilvered, 0.00% done

        NAME        STATE     READ WRITE CKSUM
        rpool       DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            c2d1s0  ONLINE       0     0     0
            c2d2s0  DEGRADED     0     0     0  (resilvering)

errors: No known data errors
... we wait a little while. Little changed while c2d2 was gone, so resilvering is fast ...
# zpool status
  pool: rpool
 state: ONLINE
  scan: resilvered 9.17M in 0h0m with 0 errors on Thu Jul  4 16:52:33 2013

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c2d1s0  ONLINE       0     0     0
            c2d2s0  ONLINE       0     0     0
errors: No known data errors
# zpool status -x
all pools are healthy

Then use fmdump and fmadm faulty to list the fault identifiers to be cleared, and then use fmadm repaired or fmadm acquit to clear them. I'm a mere novice at using fmadm, but it all worked :-)

The following snippet shows the reported errors and the use of fmadm acquit to indicate that we know about the problem and that it has been fixed.

# fmdump
Jul 04 16:48:09.9398 05b072d9-8a95-e1c7-e14d-949bf40c4e60 ZFS-8000-NX Diagnosed
Jul 04 16:48:51.8993 08f0a2c3-b2aa-4fff-89f2-a52d2df33161 ZFS-8000-FD Diagnosed
Jul 04 16:48:56.2739 baf9bc24-dd40-67ce-ed1f-8a4525e1981d ZFS-8000-LR Diagnosed
Jul 04 16:52:29.8094 baf9bc24-dd40-67ce-ed1f-8a4525e1981d FMD-8000-4M Repaired
Jul 04 16:52:29.8254 baf9bc24-dd40-67ce-ed1f-8a4525e1981d FMD-8000-6U Resolved
Jul 04 16:52:30.0832 f5bdb18a-a875-cd5e-e9fd-8c043be3ba7e ZFS-8000-QJ Diagnosed
Jul 04 16:52:34.4226 f5bdb18a-a875-cd5e-e9fd-8c043be3ba7e FMD-8000-4M Repaired
Jul 04 16:52:34.4716 f5bdb18a-a875-cd5e-e9fd-8c043be3ba7e FMD-8000-6U Resolved
# fmadm acquit 05b072d9-8a95-e1c7-e14d-949bf40c4e60
fmadm: recorded acquittal of 05b072d9-8a95-e1c7-e14d-949bf40c4e60
# fmadm acquit 08f0a2c3-b2aa-4fff-89f2-a52d2df33161
fmadm: recorded acquittal of 08f0a2c3-b2aa-4fff-89f2-a52d2df33161

This process clears the error messages and returns the guest to a "clean" state.
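
Picking out which event IDs still need attention can also be scripted. The sketch below is an illustration only, based solely on the fmdump output format shown above (event ID in the fourth column, state in the last); the function name unresolved_faults is hypothetical. It prints the IDs that were Diagnosed but never followed by a Repaired or Resolved entry.

```shell
# Hypothetical helper: read `fmdump` text on stdin and print the event IDs
# that have a Diagnosed entry with no later Repaired/Resolved entry.
unresolved_faults() {
    awk '
        $NF == "Diagnosed" {
            # Remember first-seen order so output is stable.
            if (!($4 in seen)) { order[++n] = $4; seen[$4] = 1 }
            open[$4] = 1
        }
        $NF == "Repaired" || $NF == "Resolved" { open[$4] = 0 }
        END {
            for (i = 1; i <= n; i++)
                if (open[order[i]]) print order[i]
        }
    '
}
```

Run as `fmdump | unresolved_faults` against the output above, this yields the two IDs that were acquitted by hand, 05b072d9-8a95-e1c7-e14d-949bf40c4e60 and 08f0a2c3-b2aa-4fff-89f2-a52d2df33161. Review the list before passing anything to fmadm acquit or fmadm repaired.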


In summary: if you create a mirrored zpool with disks from multiple service domains, you get continuous availability if a service domain is lost, and can resume 'normal' status by issuing the appropriate ZFS and fmadm recovery commands. This is consistent with behavior in non-virtualized, "bare metal" situations with an offline disk being placed back online. As a Best Practice, institutions should configure to provide media, path, and service domain resiliency. Using mpgroup provides resiliency against losing a path or service domain, and using mirrored pools provides media redundancy. They can be used together or, as shown in this case study, error handling can be passed to the guest.


When we tested the configuration (some time ago) of using two independent VDS devices from different IO domains and mirroring in the guest domain we found that when one device went away the guest domain could hang for 30 seconds or more due to Solaris driver timeouts (if I recall correctly the response to the case we raised).

Also, as you mention, the recovery in this situation is a manual one - in the guest domain itself. If that is not done before the other IO domain goes away then the guest domain will fail and recovery is more onerous.

Personally I think the mpgroup mechanism, while still having some flaws, is a better approach generally.


Posted by guest on December 04, 2013 at 12:24 AM MST #

Hi Ian,

I completely agree that mpgroups are correct practice, in particular for the case of a failed path or service domain. This blog entry was specifically to illustrate what happens if you don't have an mpgroup, and in practice it's not always acceptable to have that I/O timeout delay. Since mpgroups by themselves do nothing for media failure, it's important to combine it with mirroring, RAIDZ, etc.

Please see the blog cited at the top of this entry, https://blogs.oracle.com/jsavit/entry/availability_best_practices_example_configuring where I describe using mpgroup *AND* mirrors to create a disk configuration that provides protection against media problems *AND* loss of a service domain.

Thanks for the comment - I appreciate the dialogue.

cheers, Jeff

Posted by guest on December 05, 2013 at 10:32 AM MST #
