Availability Best Practices - using a mirrored ZFS pool with virtual disks
By jsavit on Oct 13, 2013
The purpose of this example is slightly different: we will test a configuration that does not use mpgroup, but instead relies on resiliency provided by a ZFS mirrored pool. This is similar to bare-metal configurations in which ZFS is used to provide resiliency, but using virtual disks.
To test this, we configure a redundant 2-way ZFS mirrored pool, where each half of the pool comes from a different service domain and no mpgroup has been prepared. This will show the behavior to expect if a service domain is made unavailable, with and without specifying a timeout for the virtual disk.
We would like to see that the guest VM handles the error and continues operation even if one side of the mirror becomes unavailable, and the way to configure the virtual disks to make this happen.
NOTE: for production use, it is strongly recommended to use mpgroup with resilient media, as shown in Availability Best Practices - Example configuring a T5-8. The method illustrated here avoids a single point of failure and provides resiliency for media, path, and service domain, but it incurs a disk I/O timeout and requires manual intervention in the guest to clean things up. Use mpgroup with redundant media to provide resiliency and continuous operation.
A simple ZFS mirror with vdisks from two service domains
In this test, we provision a guest domain with two virtual disks: one from the control domain, and one from an I/O and service domain 'alternate'. There's no mpgroup, just one disk from each service domain.
Two identically-sized LUNs are used for this exercise, one each from the primary and alternate domains. We used the format command to determine the path names of the two LUNs, and then used ldm commands to define them as virtual disks and assign them to the guest domain mydom.
# ldm add-vdsdev /dev/dsk/c0t5000C5003330AC6Bd0s0 myguestdisk0@primary-vds0
# ldm add-vdsdev /dev/dsk/c0t5000C500332C1087d0s0 myguestdisk1@alternate-vds0
# ldm add-vdisk my0 myguestdisk0@primary-vds0 mydom
# ldm add-vdisk my1 myguestdisk1@alternate-vds0 mydom
The two disks were used to create a mirrored root pool when Solaris was installed in the domain (steps not shown).
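For reference, mirroring a pool across the two vdisks follows the same pattern in a guest as on bare metal; a minimal sketch, assuming the guest device names c2d1s0 and c2d2s0 that appear later in this example:

```shell
# Attach the vdisk served by the alternate domain as a mirror of the
# existing root-pool disk; ZFS resilvers the new side automatically.
zpool attach rpool c2d1s0 c2d2s0

# For a non-root data pool, a mirror can instead be created outright:
#   zpool create tank mirror c2d1 c2d2

# Confirm the pool now shows a mirror-0 vdev with both disks.
zpool status rpool
```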
This creates a configuration that looks like the illustration below:
two disks used for a mirrored ZFS root pool, with each half of the pair from a different service domain.
Production ZFS pools should always use redundancy, whether mirror or RAIDZn. ZFS detects errors by computing checksums on data and comparing them to stored checksums, and corrects errors if the ZFS pool has been configured for redundancy. That is a valuable feature that sets ZFS apart from most file systems.
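The checksum verification described above can also be exercised on demand with a scrub; a minimal sketch, assuming the pool name rpool used in this example:

```shell
# Walk every block in the pool and verify it against its stored checksum;
# in a redundant pool, any bad blocks found are rewritten from the good copy.
zpool scrub rpool

# Reports scrub progress while running, then 'scrub repaired ...' when done.
zpool status rpool
```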
When the alternate domain is stopped, guest I/O to that disk is blocked, and resumes without complaint when alternate is brought back up. As stated in the Oracle VM Server for SPARC 3.1 Administration Guide,
By default, if the service domain providing access to a virtual disk back end is down, all I/O from the guest domain to the corresponding virtual disk is blocked. The I/O automatically is resumed when the service domain is operational and is servicing I/O requests to the virtual disk back end.
Guest-level error handling and disk timeouts
In some situations it is acceptable to block guest I/O until the service domain resumes, but in most cases we want the guest to continue processing. In particular, we want the guest to handle the failure itself and continue operation. To do this, we set a virtual disk timeout parameter so that guest I/O fails if the service domain is down, rather than waiting until it resumes. This generates an error if the virtual disk is unavailable past the duration of the timeout, which invokes error handling in the guest.
With mydom inactive, I issued:
# ldm set-vdisk timeout=5 my0 mydom
# ldm set-vdisk timeout=5 my1 mydom
# ldm list -l -o disk mydom
NAME
mydom

DISK
    NAME  VOLUME                       TOUT  ID  DEVICE  SERVER     MPGROUP
    my0   myguestdisk0@primary-vds0    5     1   disk@1  primary
    my1   myguestdisk1@alternate-vds0  5     2   disk@2  alternate
If we look at the disks from within the domain we see that all is in clean status:
# fmadm faulty
# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h4m with 0 errors on Mon Jul  1 13:38:11 2013
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c2d1s0  ONLINE       0     0     0
            c2d2s0  ONLINE       0     0     0

errors: No known data errors
# zpool status -x
all pools are healthy
Forcing an error
Now, let's reboot the alternate service domain. With the timeout set, an error is generated if the disk backend is unavailable for 5 seconds. (This doesn't imply that 5 seconds is the correct or optimal value - it's just the time I used in this example.) When the alternate service domain is stopped, the guest produces fmadm error messages and the following console error message, but continues operation.
# Jul  4 16:48:04 ldom147-123.us.oracle.com vdc: NOTICE: vdisk@2 disk access failed

SUNW-MSG-ID: ZFS-8000-NX, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Thu Jul  4 16:48:09 PDT 2013
PLATFORM: SPARC-T3-4, CSN: unknown, HOSTNAME: ldom147-123.us.oracle.com
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: 05b072d9-8a95-e1c7-e14d-949bf40c4e60
DESC: Probe of ZFS device 'c2d2s0' in pool 'rpool' has failed.
AUTO-RESPONSE: The device has been offlined and marked as faulted. An attempt
will be made to activate a hot spare if available.
IMPACT: Fault tolerance of the pool may be compromised.
REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event.
Run 'zpool status -lx' for more information. Please refer to the associated
reference document at http://support.oracle.com/msg/ZFS-8000-NX for the latest
service procedures and policies regarding this diagnosis.
Jul  4 16:48:56 ldom147-123.us.oracle.com last message repeated 10 time
Despite these error messages, the guest is operational - there is no interruption in application service. ZFS handles the missing volume in the mirrored root pool. Within the guest we use zpool status to display the error and show that one side of the mirror is unavailable:
# zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices are unavailable in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or 'fmadm repaired', or replace the device with
        'zpool replace'.
        Run 'zpool status -v' to see device specific details.
  scan: scrub repaired 0 in 0h4m with 0 errors on Thu Jul  4 16:38:46 2013
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            c2d1s0  ONLINE       0     0     0
            c2d2s0  UNAVAIL      0   111     0

errors: No known data errors
The important point is that we were able to tolerate loss of a service domain without a system or application outage in the domain. The ZFS pool has redundancy, and continues to provide I/O to the guest domain. That's the desired and expected behavior in this configuration.
Resuming normal operation
However, the ZFS pool is now in degraded state, and would be vulnerable to another failure, either a failure on c2d1s0 or loss of the domain that services it. We want to restore the pool to non-degraded state for performance and to restore its resiliency.
First, we bring back the alternate I/O domain, which restores access to the virtual disk. The output of zpool status is unchanged, but issuing zpool clear -f tells ZFS that the situation has been cleared up. ZFS then resilvers the disk to copy over any changes in disk content that occurred while it was unavailable. Once resilvering is complete the pool is no longer in degraded mode, and has access to its full capacity and redundancy.
Note that if you issue zpool status right after resilvering starts, or just after issuing a zpool scrub, the estimated completion time is much higher than it actually will be. For a more realistic time, wait a while after the scrub or resilvering starts before checking the status.
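One way to watch the resilver without trusting the early estimate is to poll the status periodically; a hypothetical monitoring loop, assuming the pool name rpool from this example:

```shell
# Poll until zpool status no longer reports a resilver in progress,
# then confirm overall pool health.
while zpool status rpool | grep -q 'resilver in progress'; do
    zpool status rpool | grep 'to go'    # show the running time estimate
    sleep 30
done
zpool status -x rpool
```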
# zpool clear -f rpool
# zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function in a degraded state.
action: Wait for the resilver to complete.
        Run 'zpool status -v' to see device specific details.
  scan: resilver in progress since Thu Jul  4 16:52:30 2013
    474K scanned out of 22.2G at 474K/s, 13h38m to go
    308K resilvered, 0.00% done
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            c2d1s0  ONLINE       0     0     0
            c2d2s0  DEGRADED     0     0     0  (resilvering)

errors: No known data errors

... we wait a little while. Little changed while c2d2 was gone, so resilvering is fast ...

# zpool status
  pool: rpool
 state: ONLINE
  scan: resilvered 9.17M in 0h0m with 0 errors on Thu Jul  4 16:52:33 2013
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c2d1s0  ONLINE       0     0     0
            c2d2s0  ONLINE       0     0     0

errors: No known data errors
# zpool status -x
all pools are healthy

Then use fmdump and fmadm faulty to list the fault identifiers to be cleared, and then use fmadm repaired or fmadm acquit to clear them. I'm a mere novice at using fmadm, but it all worked :-)
The following snippet shows reported errors and using acquit, used to indicate that we know about the problem and it has been fixed.
# fmdump
Jul 04 16:48:09.9398 05b072d9-8a95-e1c7-e14d-949bf40c4e60 ZFS-8000-NX Diagnosed
Jul 04 16:48:51.8993 08f0a2c3-b2aa-4fff-89f2-a52d2df33161 ZFS-8000-FD Diagnosed
Jul 04 16:48:56.2739 baf9bc24-dd40-67ce-ed1f-8a4525e1981d ZFS-8000-LR Diagnosed
Jul 04 16:52:29.8094 baf9bc24-dd40-67ce-ed1f-8a4525e1981d FMD-8000-4M Repaired
Jul 04 16:52:29.8254 baf9bc24-dd40-67ce-ed1f-8a4525e1981d FMD-8000-6U Resolved
Jul 04 16:52:30.0832 f5bdb18a-a875-cd5e-e9fd-8c043be3ba7e ZFS-8000-QJ Diagnosed
Jul 04 16:52:34.4226 f5bdb18a-a875-cd5e-e9fd-8c043be3ba7e FMD-8000-4M Repaired
Jul 04 16:52:34.4716 f5bdb18a-a875-cd5e-e9fd-8c043be3ba7e FMD-8000-6U Resolved
# fmadm acquit 05b072d9-8a95-e1c7-e14d-949bf40c4e60
fmadm: recorded acquittal of 05b072d9-8a95-e1c7-e14d-949bf40c4e60
# fmadm acquit 08f0a2c3-b2aa-4fff-89f2-a52d2df33161
fmadm: recorded acquittal of 08f0a2c3-b2aa-4fff-89f2-a52d2df33161
This process clears out all the error messages back to a "clean" state.
In summary: if you create a mirrored ZFS pool with disks from multiple service domains, the guest continues running if a service domain is lost, and you can return to 'normal' status by issuing the appropriate ZFS and fmadm recovery commands. This is consistent with the behavior in non-virtualized, "bare metal" situations when an offline disk is placed back online. As a Best Practice, configure for media, path, and service domain resiliency. Using mpgroup provides resiliency against losing a path or service domain, and using mirrored pools provides media redundancy. They can be used together or, as shown in this case study, error handling can be passed to the guest.