This post is one of a series of "best practices" notes for Oracle VM Server for SPARC (formerly named Logical Domains).
Oracle VM Server for SPARC lets the administrator create multiple service domains to provide virtual I/O devices to guests. While not mandatory, this powerful feature adds resiliency and reduces single points of failure. Resiliency is provided by redundant I/O devices, strengthened by providing the I/O from different service domains.
First, a review of material previously presented in Best Practices - which domain types should be used to run applications, re-stated here with more focus on availability.
Traditional virtual machine designs use a "thick" hypervisor that provides all virtualization functions: core virtualization (creating and running virtual machines), management control point, resource manager, live migration (if available), and I/O. If there is a crash in code providing one of those functions, it can take the entire system down with it.
In contrast, Oracle VM Server for SPARC offloads management and I/O functionality from the hypervisor to domains (virtual machines).
This is a modern, modular alternative to older monolithic designs, and permits a simpler and smaller hypervisor. This enhances reliability and security because a smaller body of code is easier to develop and test, and has a smaller attack surface for bugs and exploits. It reduces single points of failure by assigning responsibilities to multiple system components that can be configured for redundancy.
Oracle VM Server for SPARC defines the following types of domain, each with its own role:

- Control domain - the management control point for the server; it runs the Logical Domains Manager used to configure and manage the other domains.
- Service domain - provides virtual device services (virtual disk, virtual network, virtual console) to other domains.
- I/O domain - has direct ownership of physical I/O devices, such as a PCIe root complex.
- Guest domain - runs applications, using virtual devices provided by service domains.

A domain can combine roles: the control domain is typically also a service and I/O domain.
There can be as many domains as fit in available CPU and memory, up to a limit of 128 per T-series server or M-series physical domain.
A server can have as many I/O domains as there are assignable physical devices. In these articles we will refer to I/O domains that own a PCIe bus, also called "root complex domains". There can be as many such domains as there are assignable "PCIe root complexes". For example, there are up to 4 root complex I/O domains on a T4-4 since there is one PCIe root per socket and there are 4 sockets. A T5 system has two such buses per socket, so an eight-socket T5-8 server can have 16 I/O domains.
There is one control domain per T-series server or M-series physical domain.
These functions are separable into components for different services. You can create redundant network and disk configurations that tolerate the loss of an I/O device, and you can also create multiple service domains for redundant access to I/O.
This is not mandatory: you can choose to have a single primary domain that is the sole I/O and service domain and still provide redundant I/O devices. Multiple service domains add a further level of redundancy and fault tolerance: guest domains can continue working even while a service domain, or the control domain itself, is down.
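To make that concrete, here is a minimal sketch of what redundant guest I/O from two service domains can look like. The domain names (primary, alternate, ldg1), service names, backend path, and mpgroup label are illustrative assumptions rather than a configuration from this article; a follow-up article walks through the real configuration in detail.

# Sketch only -- domain names, service names and the LUN path are hypothetical.
# Export the same shared LUN from a virtual disk service in each service domain,
# using a common mpgroup so the guest's virtual disk can fail over between them.
LUN=/dev/dsk/c0tXXXXXXXXXXXXXXXXd0s2      # the LUN as seen by each service domain
ldm add-vds primary-vds0 primary
ldm add-vds alternate-vds0 alternate
ldm add-vdsdev mpgroup=lun0grp $LUN lun0@primary-vds0
ldm add-vdsdev mpgroup=lun0grp $LUN lun0@alternate-vds0
ldm add-vdisk disk0 lun0@primary-vds0 ldg1

# One virtual switch per service domain, each backed by that domain's own NIC;
# the guest gets a vnet on each switch and combines them with IPMP inside the guest.
ldm add-vsw net-dev=net0 primary-vsw0 primary
ldm add-vsw net-dev=net0 alternate-vsw0 alternate
ldm add-vnet vnet0 primary-vsw0 ldg1
ldm add-vnet vnet1 alternate-vsw0 ldg1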
Multiple service domains can be used for "rolling upgrades" in which service domains are serially updated and rebooted without disrupting guest domain operation. This provides "serviceability", the "S" in "Reliability, Availability, and Serviceability" (RAS) for continuous application availability during a planned service domain outage.
For example, you might update Solaris in the service domain to apply fixes or get the latest version of a device driver, or even (in the case of the control domain) update the logical domains manager (the ldomsmanager package) to add new functionality. (Note: there is never a problem with doing a pkg update to create a new boot environment while guest domains continue to run; Solaris 11 permits system updates during normal operation. The only domain outage is the reboot that brings the new boot environment into operation, and that brief outage can be made invisible to guests by using multiple service domains.)
If you have two service domains providing virtual disk and network services, you can update one of them and reboot it. While that domain is rebooting, all virtual I/O continues on the other service domain. When the first service domain is up again, you can reboot the other one, providing continuous operation while upgrading system software.
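A minimal sketch of that rolling sequence, assuming the two service domains are the control domain (primary) and an alternate domain named alternate, both running Solaris 11:

# In the alternate service domain: update into a new boot environment, then reboot.
# Guest virtual I/O keeps flowing through the primary domain while alternate is down.
alternate# pkg update
alternate# init 6

# After alternate is back up and serving I/O again, repeat in the control domain.
primary# pkg update
primary# init 6

In practice you would confirm that guest disk multipathing and network redundancy have failed back to the rebooted domain before rebooting the other one.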
This avoids taking a planned application outage for upgrades, which is often difficult to schedule due to business requirements. It is also an efficient alternative to "evacuating" a server by using live migration to move guests off the box during maintenance and then moving them back. The advantages over evacuation are that overall capacity isn't reduced during maintenance, since all boxes remain in service, and the delay and overhead of migrating guest virtual machines off and then back onto a server is eliminated. Oracle VM Server for SPARC can use the "evacuation with live migration" method - but doesn't have to.
Let's discuss details for configuring alternate service domains. The reference material is at Oracle VM Server for SPARC Administration Guide - Assigning PCIe Buses. The basic process is to identify which buses must stay with the control domain (which initially owns all hardware resources), decide which buses to give to alternate service domains (making them I/O domains as well), and then assign them.
Let's demonstrate the process on a T5-2. This lab machine has a small I/O configuration, but enough to demonstrate the techniques used.
First, let's see the configuration. Note that a T5-2 has four PCIe buses, since T5 systems have two buses per socket, and that the ldm list -l -o physio command displays both the bus device and its pseudonym, showing the correspondence between the pci@NNN and pci_N bus identification format.
t5# ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  NORM  UPTIME
primary          active     -n-c--  UART    256   511G     0.2%  0.0%  24m

t5# ldm list-io
NAME                                      TYPE   BUS      DOMAIN   STATUS
----                                      ----   ---      ------   ------
pci_0                                     BUS    pci_0    primary
pci_1                                     BUS    pci_1    primary
pci_2                                     BUS    pci_2    primary
pci_3                                     BUS    pci_3    primary
... snip for brevity ...

t5# ldm list -l -o physio
IO
    DEVICE                           PSEUDONYM        OPTIONS
    pci@340                          pci_1
    pci@300                          pci_0
    pci@3c0                          pci_3
    pci@380                          pci_2
... snip for brevity ...
This section illustrates the method shown in Assigning PCIe Buses. First, confirm that the ZFS root pool is the expected one (rpool) and get the names of the disk devices in the pool. The boot disk on this T5-2 is under Solaris I/O multipathing, so mpathadm is then used to trace the device back to its physical path, which begins with the bus name pci@300.
t5# df /
/                  (rpool/ROOT/solaris-1):270788000 blocks 270788000 files
t5# zpool status rpool
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        rpool                      ONLINE       0     0     0
          c0t5000CCA03C2C7B50d0s0  ONLINE       0     0     0

errors: No known data errors
t5# mpathadm show lu /dev/rdsk/c0t5000CCA03C2C7B50d0s0
Logical Unit:  /dev/rdsk/c0t5000CCA03C2C7B50d0s2
... snip for brevity ...
        Paths:
                Initiator Port Name:  w508002000138a190
                Target Port Name:  w5000cca03c2c7b51
... snip for brevity ...
t5# mpathadm show initiator-port w508002000138a190
Initiator Port:  w508002000138a190
        Transport Type:  unknown
        OS Device File:  /devices/pci@300/pci@1/pci@0/pci@2/scsi@0/iport@1
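To see every physical path in one pass, a small loop over the initiator ports works; this is just a sketch against the multipathed boot disk shown above:

# For each initiator port on the boot disk's paths, print the OS device file;
# the leading pci@NNN component is the PCIe root complex that must stay with
# the control domain.
mpathadm show lu /dev/rdsk/c0t5000CCA03C2C7B50d0s0 |
    grep 'Initiator Port Name' | sort -u |
    while read _ _ _ port; do
        mpathadm show initiator-port $port | grep 'OS Device File'
    done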
On older systems like UltraSPARC T2 Plus, the disk may not be managed by Solaris I/O multipathing, and you'll see a traditional cNtNdN disk device name. Simply use ls -l on the device path to get the bus name. The example below uses a mirrored ZFS root pool; both disks are on bus pci@400, which has the pseudonym pci_0.
t5240# df /
/                  (rpool/ROOT/solaris-1):60673718 blocks 60673718 files
t5240# zpool status rpool
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h39m with 0 errors on Fri Jul 12 08:08:16 2013
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c3t0d0s0  ONLINE       0     0     0
            c3t1d0s0  ONLINE       0     0     0

errors: No known data errors
t5240# ls -l /dev/dsk/c3t0d0s0
lrwxrwxrwx   1 root  root  49 Jul 24  2012 /dev/dsk/c3t0d0s0 -> ../../devices/pci@400/pci@0/pci@8/scsi@0/sd@0,0:a
t5240# ls -l /dev/dsk/c3t1d0s0
lrwxrwxrwx   1 root  root  49 Jul 24  2012 /dev/dsk/c3t1d0s0 -> ../../devices/pci@400/pci@0/pci@8/scsi@0/sd@1,0:a
t5240# ldm list-io
NAME                                      TYPE   BUS      DOMAIN   STATUS
----                                      ----   ---      ------   ------
pci_0                                     BUS    pci_0    primary
pci_1                                     BUS    pci_1    primary
... snip for brevity ...
t5240# ldm list -l -o physio primary
NAME
primary

IO
    DEVICE                           PSEUDONYM        OPTIONS
    pci@400                          pci_0
    pci@500                          pci_1
... snip for brevity ...
The control domain on this T5240 requires bus pci@400, which has pseudonym pci_0. The other bus on this two-bus system can be used for an alternate service domain.
To summarize: we've identified the PCIe bus hosting the control domain's boot disks, which therefore cannot be removed from the control domain.
A similar process is used to determine which buses are needed for network access. In this example, net0 (the "vanity name" for the physical device ixgbe0) is used to log in to the control domain and also backs a virtual switch.
t5# dladm show-phys net0
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net0              Ethernet             up         10000  full      ixgbe0
t5# ls -l /dev/ixgbe0
lrwxrwxrwx   1 root  root  53 May  9 11:04 /dev/ixgbe0 -> ../devices/pci@300/pci@1/pci@0/pci@1/network@0:ixgbe0
t5# ls -l /dev/ixgbe1
lrwxrwxrwx   1 root  root  55 May  9 11:04 /dev/ixgbe1 -> ../devices/pci@300/pci@1/pci@0/pci@1/network@0,1:ixgbe1
t5# ls -l /dev/ixgbe2
lrwxrwxrwx   1 root  root  53 May  9 11:04 /dev/ixgbe2 -> ../devices/pci@3c0/pci@1/pci@0/pci@1/network@0:ixgbe2
t5# ls -l /dev/ixgbe3
lrwxrwxrwx   1 root  root  55 May  9 11:04 /dev/ixgbe3 -> ../devices/pci@3c0/pci@1/pci@0/pci@1/network@0,1:ixgbe3
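If there are many NICs, a short loop can map every physical network device to its bus in one shot; this is only a sketch, since device names and paths vary by system:

# Map each physical network device to the PCIe bus (pci@NNN) it hangs off of.
dladm show-phys -p -o link,device | while IFS=: read link dev; do
    path=$(ls -l /dev/$dev | awk '{print $NF}')          # symlink target under /devices
    bus=$(echo $path | sed 's,.*devices/\(pci@[^/]*\)/.*,\1,')
    echo "$link ($dev) is on bus $bus"
done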
Based on these commands, pci@300 is used for ixgbe0 and ixgbe1, while pci@3c0 is used for ixgbe2 and ixgbe3.
The preceding commands have shown that the control domain uses pci@300 (also called pci_0) for both network and disk, so any of the other buses can be reassigned to an alternate service and I/O domain; here we will move pci_2 and pci_3. This requires a delayed reconfiguration to remove the physical I/O from the control domain, after which we define a new domain and give it those buses. We first save the system configuration to the service processor so we can fall back to the prior state if we need to.
t5# ldm add-spconfig 20130617-originalbus
t5# ldm start-reconf primary
Initiating a delayed reconfiguration operation on the primary domain.
All configuration changes for other domains are disabled until the primary
domain reboots, at which time the new configuration for the primary domain
will also take effect.
t5# ldm remove-io pci_2 primary
t5# ldm remove-io pci_3 primary
t5# shutdown -i6 -y -g0

After the control domain reboots, the two buses are no longer assigned to any domain:
t5# ldm list-io
NAME                                      TYPE   BUS      DOMAIN   STATUS
----                                      ----   ---      ------   ------
pci_0                                     BUS    pci_0    primary
pci_1                                     BUS    pci_1    primary
pci_2                                     BUS    pci_2
pci_3                                     BUS    pci_3
... snip for brevity ...
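If anything goes wrong at this stage, the configuration saved before the change can be selected again; reverting to a saved SP configuration takes effect at the next power cycle. A sketch:

# List the configurations stored on the service processor, then select the
# pre-change one; it becomes active at the next power cycle.
t5# ldm list-spconfig
t5# ldm set-spconfig 20130617-originalbus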
At this point we create the alternate I/O domain. This is done using normal ldm commands with the exception that we do not define virtual disk and network devices, because it will use physical I/O devices on the buses it owns.
t5# ldm create alternate
t5# ldm set-core 1 alternate
t5# ldm set-mem 8g alternate
t5# ldm add-io pci_3 alternate
t5# ldm add-io pci_2 alternate
t5# ldm bind alternate
t5# ldm start alternate
LDom alternate started
t5# ldm add-spconfig 20130617-alternate
At this point, just install Solaris in the alternate domain using whatever process is convenient: a network install, or a virtual ISO install image temporarily added from the control domain and removed after installation.
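As an illustration of the temporary-ISO approach, here is a sketch; the virtual disk service name primary-vds0 and the ISO path are assumptions for illustration, not part of the configuration above:

# Export an installation ISO from the control domain as a virtual CD-ROM,
# boot the alternate domain from it, then remove it after installation.
t5# ldm add-vds primary-vds0 primary                     # if no vds exists yet
t5# ldm add-vdsdev options=ro /path/to/sol-11-iso.iso iso@primary-vds0
t5# ldm add-vdisk s11iso iso@primary-vds0 alternate
# (install Solaris in the alternate domain, booting from the virtual CD-ROM)
t5# ldm remove-vdisk s11iso alternate
t5# ldm remove-vdsdev iso@primary-vds0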
Oracle VM Server for SPARC lets you configure multiple service and I/O domains in order to provide resilient I/O services to guests.
Following articles will show exactly how to configure such domains for highly available guest I/O.
This article shows how to split out buses to create an alternate I/O domain, but we haven't made it into a service domain yet. The next article will complete the picture on an even larger server, illustrating how to create the I/O domain and then define redundant network and disk services.