Availability Best Practices - Avoiding Single Points of Failure
By jsavit on Jul 08, 2013
This post is one of a series of "best practices" notes for Oracle VM Server for SPARC (formerly named Logical Domains)
Avoiding Single Points Of Failure (SPOF)
Highly available systems are configured without Single Points Of Failure (SPOF) to ensure that individual component failures do not result in loss of service. The general method to avoid SPOFs is to provide redundant components for each necessary resource, so service can continue if a component fails. In this article we will discuss resources to make resilient in Oracle VM Server for SPARC environments. This primarily consists of configuring redundant network and disk I/O. Subsequent articles will drill down into each resource type and provide comprehensive illustrations.
Network connectivity is made resilient in guest domains by using redundant virtual network devices, just as redundant physical network devices are used with non-virtualized environments. Multiple virtual network devices are defined to the guest domain, each based on a different virtual switch ("vswitch") associated with a different physical backend. The backend can be either a physical link or an aggregation, which can provide additional resiliency against device failure and potentially improve performance.
Solaris IP Multipathing (IPMP) is then configured within the guest much the same as in a physical instance of Solaris.
If one link in an IPMP group fails, the remaining links provide continued network connectivity.
The only visible difference within the guest is the device names, since the "synthetic" device names for virtual network links (vnet0, vnet1, etc.) are used instead of physical link names like nxge0 or ixgbe0.
This provides protection against failure of a switch, cable, NIC, and so on. Assuming that there are virtual switches in the control domain named primary-vsw0 and primary-vsw1, each using a different physical network device (that is, primary-vsw0 could be based on the control domain's net0, and primary-vsw1 based on net1) this is as simple as:
# ldm add-vnet linkprop=phys-state vnet0 primary-vsw0 ldg1
# ldm add-vnet linkprop=phys-state vnet1 primary-vsw1 ldg1
The optional parameter linkprop=phys-state indicates that the guest will use link-based IP Multipathing, and passes link up/down status from the physical device to the virtual device so Solaris can detect link status changes. This parameter can be omitted if Solaris is going to use probe-based IPMP with a test IP address. For a full discussion, see Using Virtual Networks and Configuring IPMP in a Logical Domains Environment. The remainder of the exercise is to configure IPMP in the guest domain, which will be illustrated in a later article.
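As a rough preview of that guest-side step, here is a minimal sketch of link-based IPMP on a Solaris 11 guest. It is not taken from the article: the group name ipmp0, the address 192.168.1.10/24, and the link names net0 and net1 are assumptions for illustration. The sketch assumes a Solaris 11 guest, where the virtual network devices would typically show up under generic link names, which is why the names are passed in as arguments. The script only prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: link-based IPMP inside a Solaris 11 guest.
# Assumptions (not from the article): the group name ipmp0, the address
# 192.168.1.10/24, and that the two virtual network devices appear in the
# guest as datalinks net0 and net1 (check with "dladm show-phys").

# Print each command by default; set RUN=1 to execute it instead.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

configure_ipmp() {
    group=$1
    addr=$2
    shift 2
    members=""
    for link in "$@"; do
        run ipadm create-ip "$link"      # create an IP interface on each link
        members="$members -i $link"
    done
    run ipadm create-ipmp "$group"       # create the IPMP group interface
    # $members is intentionally unquoted so it splits into "-i link" pairs
    run ipadm add-ipmp $members "$group"
    run ipadm create-addr -T static -a "$addr" "$group/v4"
    run ipmpstat -g                      # show the state of the group
}

configure_ipmp ipmp0 192.168.1.10/24 net0 net1
```

Reviewing the printed commands before running them with RUN=1 is deliberate: the link names in particular should be checked against what the guest actually sees.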
There is additional protection if virtual network devices are provisioned from different service domains. If any of the service domains fails or is shut down, network access continues over virtual network devices provided by the remaining service domains.
Use of separate service domains is transparent to the guest Solaris OS, and the only configuration difference is that virtual switches from different service domains are used in the above command sequence.
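A minimal sketch of that variation follows, starting from a system where neither virtual switch exists yet. The second service domain name "alternate", the switch name alternate-vsw0, and the assumption that each service domain sees its own physical NIC as net0 are illustrative and not from the article. The script prints the ldm commands rather than running them unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: one virtual switch per service domain, one vnet per switch.
# Assumptions (not from the article): a second service domain named
# "alternate" already owns a physical NIC that it sees as net0, and the
# switch name alternate-vsw0 is made up for this example.

# Print each command by default; set RUN=1 to execute it instead.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi; }

add_redundant_vnets() {
    guest=$1
    shift
    n=0
    for svc in "$@"; do                  # one service domain per argument
        vsw="${svc}-vsw0"
        run ldm add-vsw net-dev=net0 "$vsw" "$svc"
        run ldm add-vnet linkprop=phys-state "vnet$n" "$vsw" "$guest"
        n=$((n + 1))
    done
}

add_redundant_vnets ldg1 primary alternate
```

The guest ends up with vnet0 backed by the primary domain and vnet1 backed by the alternate domain, so an IPMP group over the two links survives the loss of either service domain.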
Let's face it, disks are essentially rapidly rotating rust ("round, brown and spinning"), so configuring for redundancy is essential in order to tolerate media failure without data loss. Disks are made resilient with Oracle VM Server for SPARC using methods similar to physical systems, either by provisioning virtual disks with "backends" that are themselves resilient, or by mirroring non-resilient disks.
Review of disk backend choices
Every virtual disk has a backend, which can be a physical disk or LUN, an iSCSI target, a disk volume based on a ZFS volume ("zvol") or volume manager such as Solaris Volume Manager (SVM), or a file on a file system (including NFS).
The following graphic shows the relationship between the virtual disk client ("vdc") device driver in the guest domain and the virtual disk server ("vds") driver in the service domain, which work together, connected by a Logical Domain Channel (LDC), to associate the virtual disk with the physical backend:
See Using Virtual Disks for a full discussion.
One of the most important domain configuration choices is whether a disk backend should be a physical device, such as a LUN or local disk, or a file in a file system mounted to a service domain. This has implications for performance, availability, and functionality such as the ability to perform live migration or support SCSI reservation. In general, physical devices (like a LUN) have the best performance, while file-based disk backends offer convenience, since they can be created as needed without needing to allocate physical devices from a storage array. Here are examples of using a physical device backend, a file-based backend, and a ZFS volume:
# virtual disk based on a LUN
# ldm add-vdsdev /dev/rdsk/c0t5000C5003330AC6Bd0s0 myguestdisk0@primary-vds0
# ldm add-vdisk my0 myguestdisk0@primary-vds0 mydom

# virtual disk based on a file (in this case, on NFS)
# mkfile -n 20g /ldomsnfs/ldom0/disk0.img
# ldm add-vdsdev /ldomsnfs/ldom0/disk0.img ldom0disk0@primary-vds0
# ldm add-vdisk vdisk00 ldom0disk0@primary-vds0 ldg0

# virtual disk based on a ZFS volume
# zfs create -V 20g rpool/export/home/ldoms/ldom1-disk0
# ldm add-vdsdev /dev/zvol/rdsk/rpool/export/home/ldoms/ldom1-disk0 ldg1-disk0@primary-vds0
# ldm add-vdisk vdisk10 ldg1-disk0@primary-vds0 ldg1
The first example takes a device visible to the service domain (which, in this case, is the control domain) and presents it directly to the guest. The control domain does not format it or mount a file system on it. That is the simplest backend to understand: the service domain does a "passthrough" of all I/O activity. The other two examples create a virtual disk from a file or ZFS volume in a file system visible to the service domain. There is a layer of indirection and file buffering that increases latency, but the guest can leverage features like ZFS mirroring or RAIDZ, compression, and cloning provided in the service domain's file system.
There are also functional differences. For example, domains using LUNs or NFS file virtual disks can be live migrated to any server that can see the disk backends. A local (non-NFS) ZFS disk backend cannot be used with live migration because a ZFS pool can only be imported to one Solaris instance (in this case, one control domain) at a time.
This is the most complex part of configuring domains, and we will return to it in later articles. Also see Best Practices - Top Ten Tuning Tips.
Making virtual disks resilient
The rule is that the virtual disk uses the redundancy of the underlying backend, or is made resilient within the guest by mirroring/RAID. There are several ways to accomplish this:
- Use disk backends based on devices that provide their own redundancy.
This capability is typical of enterprise-grade storage arrays and NAS servers like Oracle's Sun ZFS Storage Appliance and is transparent to Solaris or to logical domains.
The storage administrator configures LUNs based on a RAID group, or shares on a resilient NAS server, and the logical domain administrator simply defines virtual disks based on the backend. Virtual disks based on the resilient backend inherit its protection, regardless of whether the backend is a LUN passed directly to the guest domain or a file in a file system mounted to the service domain. Media failures in the storage device are transparently handled within the storage device.
The advantages of this method are that it offloads data integrity design to storage administrators, is simple from the server administrator perspective, and often fits companies' existing storage standards. However, it requires appropriate storage backends, and it does not provide end-to-end data protection (a storage device cannot handle errors that emerge after the bits have left its controller). That said, this is a widely used production deployment method.
- Use disk backends made resilient within the service domain.
This is often done using ZFS mirroring or RAIDZ: a zvol or file backend in a redundant ZFS pool is protected from media failure by ZFS checksum and repair.
The virtual disk can be presented locally, and also exported over the network via NFS or iSCSI.
The advantages of this method are that it can use a wide range of storage devices, including local disk on the control domain, is simple to administer, and provides full end-to-end data protection. Choosing between this method and the one immediately preceding it is similar to the question of whether ZFS pools should use ZFS mirroring or RAIDZ, or use simple ZFS pools based on hardware RAID. Both methods are widely used; however, only the ZFS data protection provides full end-to-end checksum and data repair capabilities.
- Use disk backends made resilient within the guest.
In this case, non-redundant disks are presented to the guest, which arranges them into a mirrored ZFS pool or RAIDZ group.
One must be careful to not offer disk backends on the same physical device, of course.
This has the advantages of the previous method, is simple to implement, and offloads data integrity design to the owner of the guest domain, who could use existing procedures used in non-virtual environments. It also has the advantage that virtual disks can be provisioned from different service domains, adding an additional level of resiliency just as with virtual network devices provided by multiple service domains.
Here is an example with two physical LUNs, one provisioned in the primary domain and the other in an alternate service domain. Note the use of the timeout parameter: if a path or the service domain fails, the timeout causes an I/O error to be presented to the guest domain. The guest can then take corrective action, marking the mirrored ZFS pool as degraded and continuing operation:
# ldm add-vdsdev /dev/dsk/c0t5000C5003330AC6Bd0s0 myguestdisk0@primary-vds0
# ldm add-vdsdev /dev/dsk/c0t5000C500332C1087d0s0 myguestdisk1@alternate-vds0
# ldm add-vdisk my0 myguestdisk0@primary-vds0 mydom
# ldm add-vdisk my1 myguestdisk1@alternate-vds0 mydom
# ldm set-vdisk timeout=5 my0 mydom
# ldm set-vdisk timeout=5 my1 mydom
Inside mydom, these two disks are used to create a mirrored ZFS pool. In the example below, there is already a root ZFS pool based on disk c2d1s0, and c2d2s0 is added to it to form a mirror. The format command can be used to determine the device names of disks presented to the guest.
mydom # zpool attach rpool c2d1s0 c2d2s0
Make sure to wait until resilver is done before rebooting.
If there is a media failure, Solaris ZFS will do normal ZFS checksum processing to detect and correct the error. If either the primary or alternate domain fails, or loses its hardware connection to the disk, the guest will get an I/O error but continue running.
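The warning about waiting for the resilver can be turned into a small polling helper. This is a sketch under one assumption that is not from the article: that zpool status describes an active resilver with the phrase "resilver in progress". The exact wording should be confirmed on the Solaris release in use. The 30-second interval is arbitrary.

```shell
#!/bin/sh
# Sketch only: block until a pool is no longer resilvering.
# Assumption: "zpool status" reports an active resilver with the phrase
# "resilver in progress"; confirm the wording on your Solaris release.

# Succeeds if the zpool status text on stdin shows an active resilver.
resilver_running() {
    grep -q 'resilver in progress'
}

# Polls the named pool every 30 seconds while a resilver is reported.
wait_for_resilver() {
    pool=$1
    while zpool status "$pool" | resilver_running; do
        sleep 30
    done
    echo "$pool: no resilver in progress"
}

# Usage inside the guest, after "zpool attach":
#   wait_for_resilver rpool
```

Keeping the text match in its own function makes it easy to adjust if the status wording differs, and the loop only says that no resilver is reported, so zpool status should still be checked for errors before rebooting.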
These methods are all valid, and should be evaluated in the context of application requirements and existing standards and procedures. There is no single "best practice" that fits all circumstances. Subsequent blog entries in this series will illustrate each of these methods in more detail.
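For comparison with the guest-side mirror shown earlier, the second method (resilience inside the service domain) might look like the sketch below. The pool name ldompool, the guest name ldg2, the volume name, and the choice of disks are all assumptions for illustration; the two disk names are simply reused from the earlier examples as placeholders. The script prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: a mirrored ZFS pool in the service domain backing a guest disk.
# Assumptions (not from the article): the pool name ldompool, the volume and
# guest names, and that the two whole disks below are free for ZFS to use.

# Print each command by default; set RUN=1 to execute it instead.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi; }

provision_mirrored_zvol() {
    pool=$1
    disk_a=$2
    disk_b=$3
    guest=$4
    size=$5
    vol="${guest}-disk0"
    run zpool create "$pool" mirror "$disk_a" "$disk_b"   # redundancy lives here
    run zfs create -V "$size" "$pool/$vol"                # zvol used as the backend
    run ldm add-vdsdev "/dev/zvol/rdsk/$pool/$vol" "$vol@primary-vds0"
    run ldm add-vdisk vdisk0 "$vol@primary-vds0" "$guest"
}

provision_mirrored_zvol ldompool c0t5000C5003330AC6Bd0 c0t5000C500332C1087d0 ldg2 20g
```

The guest sees a single ordinary disk; the mirroring and checksum repair happen in the service domain's pool. As noted earlier, a local pool like this ties the backend to one service domain, so it does not combine with live migration.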
Disk path availability
The preceding section mostly discussed disk availability from the perspective of media failure: what happens if there is a disk failure or checksum error. A separate and equally important aspect is ensuring redundant path availability to the device. Both Fibre Channel Storage Area Network (FC SAN) and NFS devices support multiple access paths.
An FC SAN controller can have multiple host interfaces, preferably using more than one FC switch, to provide redundant access to multiple servers. Each server can have multiple Host Bus Adapter (HBA) cards to connect to the FC fabric. Together, they provide resiliency against the loss of an individual HBA, cable, FC switch, or host interface on the FC controller. This is similar to configuring multiple channels and control units as has long been done with mainframe DASD.
The same concept is available for virtual disks based on NFS or iSCSI. In those cases, the physical transport is Ethernet, and resiliency can be provided by redundant network connections (which may be individual physical links, aggregates, or IPMP groups).
Redundant disk paths are expressed in logical domains by using virtual disks defined with multipath groups ("mpgroup") as described at Configuring Virtual Disk Multipathing. The logical domains administrator defines multiple paths to the same disk backend (emphasized because this is crucial: the same device, accessed by multiple paths) and then adds the disk to the guest domain using one of the paths. This is illustrated below using an NFS backend available to multiple service domains.
# ldm add-vdsdev mpgroup=mympgroup /ldoms/myguest/disk0.img mydisk0@primary-vds0
# ldm add-vdsdev mpgroup=mympgroup /ldoms/myguest/disk0.img mydisk0@alternate-vds0
# ldm add-vdisk mydisk0 mydisk0@primary-vds0 myguest
If a path/connection to the backend device fails, access automatically continues via the remaining path. There would also be path protection if the virtual disk was provided by one service domain instead of two, or if the device was a LUN instead of NFS; only the service names and device paths in the ldm add-vdsdev commands would be different.
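To make the LUN variation concrete, here is a sketch that takes the backend path per virtual disk service, since the same LUN may appear under a different device path in each service domain. The mpgroup name lunmp, the volume name lun0, and the device path are assumptions for illustration (the path is reused from the earlier examples as a placeholder). The commands are printed unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: one LUN exported through several virtual disk services in a
# single mpgroup, then added to the guest through the first path.
# Assumptions (not from the article): the names lunmp and lun0, and that the
# device path shown is how each service domain sees the same LUN.

# Print each command by default; set RUN=1 to execute it instead.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi; }

add_multipath_disk() {
    guest=$1
    vol=$2
    group=$3
    shift 3
    first=""
    for pair in "$@"; do                 # each argument is service=backend-path
        svc=${pair%%=*}
        path=${pair#*=}
        run ldm add-vdsdev "mpgroup=$group" "$path" "$vol@$svc"
        [ -n "$first" ] || first=$svc    # remember the first service listed
    done
    run ldm add-vdisk "$vol" "$vol@$first" "$guest"
}

add_multipath_disk myguest lun0 lunmp \
    primary-vds0=/dev/dsk/c0t5000C5003330AC6Bd0s0 \
    alternate-vds0=/dev/dsk/c0t5000C5003330AC6Bd0s0
```

Every path in the group must point at the same physical backend; the helper cannot verify that, so it remains the administrator's responsibility.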
Service Domain Availability Using Multiple Service Domains
The preceding discussion shows how to provide resiliency by using redundant network and disk resources. However, each virtual device is provided by a service domain, which can also be a single point of failure: if the service domain fails or is taken down for maintenance, then all of the devices it provides become unavailable. This can be prevented by using multiple service domains, as shown in some of the examples above.
While not mandatory, this is a powerful and widely used feature that adds resiliency and reduces single points of failure. It guards against failure, and can be used for "rolling upgrades" in which service domains can be taken out of service for maintenance without having a loss of service or requiring that a server be "evacuated". Use of multiple service domains is such an important topic that the entire next blog entry will be devoted to reviewing domain roles and how to configure for redundancy at the domain level.
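Before taking a service domain down for a rolling upgrade, it helps to know which guests depend on it alone. The sketch below works from a hand-maintained inventory, not from ldm output: each line is a hypothetical "guest resource-type service-domain" triple invented for this example. It lists every guest and resource type served by fewer than two distinct service domains.

```shell
#!/bin/sh
# Sketch only: flag guest resources that depend on a single service domain.
# Assumption: the input is a hand-written inventory with one
# "guest resource-type service-domain" triple per line. This format is
# hypothetical and is not produced by any ldm command.

find_spofs() {
    awk '
        NF == 3 && $1 !~ /^#/ {
            key = $1 " " $2
            # count each service domain once per guest/resource pair
            if (!seen[key, $3]++) count[key]++
        }
        END {
            for (key in count)
                if (count[key] < 2) print key
        }
    ' | sort
}

# Example: ldg1 has redundant networking but a single disk path.
printf '%s\n' \
    'ldg1 net primary' \
    'ldg1 net alternate' \
    'ldg1 disk primary' | find_spofs
```

Any line this prints names a resource that would go offline with its one service domain, which is worth fixing before relying on rolling maintenance.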
Oracle VM Server for SPARC lets you configure resilient virtual network and disk I/O services, and provide resiliency for the service and I/O domains that provision them. Following articles will show exactly how to configure such domains for highly available guest I/O.