Availability Best Practices - Example configuring a T5-8
By jsavit on Jul 19, 2013
This post is one of a series of "best practices" notes for Oracle VM Server for SPARC (formerly named Logical Domains), continuing the series on availability. In this post we show each step used to configure a T5-8 for availability with redundant network and disk I/O, using multiple service domains.
Overview of T5
The SPARC T5 servers are a powerful addition to the SPARC line. Details on the product can be seen at SPARC T5-8 Server, SPARC T5-8 Server Documentation, The SPARC T5 Servers have landed, and other locations.
For this discussion, the important things to know are:
- Each T5 chip (sometimes referred to as a socket or CPU) has 16 cores operating at 3.36GHz. Each core has 8 CPU threads. A server can have between 1 and 8 sockets for a maximum of 1,024 CPU threads on 128 cores. (Editorial comment: Wow!)
- Up to 4TB of RAM.
- Two PCIe 3.0 buses per socket. A T5-8 has 16 PCIe buses.
The following graphic shows T5-8 server resources. This picture labels each chip as a CPU, and shows CPU0 through CPU7 on their respective Processor Modules (PM) and the associated buses. On-board devices are connected to buses on CPU0 and CPU7.
This demo is done on a lab system with a limited I/O configuration, but enough to show availability practices. Real T5-8 systems would typically have much richer I/O. The system is delivered with a single control domain owning all CPU, I/O and memory resources. Let's view the resources bound to the control domain (the only domain at this time). Wow, that's a lot of CPUs and memory. Some output and whitespace snipped out for brevity.
primary# ldm list -l
NAME             STATE      FLAGS   CONS    VCPU  MEMORY    UTIL  NORM  UPTIME
primary          active     -n-c--  UART    1024  1047296M  0.0%  0.0%  2d 5h 11m
----snip----
CORE
    CID    CPUSET
    0      (0, 1, 2, 3, 4, 5, 6, 7)
    1      (8, 9, 10, 11, 12, 13, 14, 15)
    2      (16, 17, 18, 19, 20, 21, 22, 23)
    3      (24, 25, 26, 27, 28, 29, 30, 31)
----snip----
    124    (992, 993, 994, 995, 996, 997, 998, 999)
    125    (1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007)
    126    (1008, 1009, 1010, 1011, 1012, 1013, 1014, 1015)
    127    (1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023)
VCPU
    VID    PID    CID    UTIL  NORM  STRAND
    0      0      0      4.7%  0.2%  100%
    1      1      0      1.3%  0.1%  100%
    2      2      0      0.2%  0.0%  100%
    3      3      0      0.1%  0.0%  100%
----snip----
    1020   1020   127    0.0%  0.0%  100%
    1021   1021   127    0.0%  0.0%  100%
    1022   1022   127    0.0%  0.0%  100%
    1023   1023   127    0.0%  0.0%  100%
----snip----
IO
    DEVICE           PSEUDONYM        OPTIONS
    pci@300          pci_0
    pci@340          pci_1
    pci@380          pci_2
    pci@3c0          pci_3
    pci@400          pci_4
    pci@440          pci_5
    pci@480          pci_6
    pci@4c0          pci_7
    pci@500          pci_8
    pci@540          pci_9
    pci@580          pci_10
    pci@5c0          pci_11
    pci@600          pci_12
    pci@640          pci_13
    pci@680          pci_14
    pci@6c0          pci_15
----snip----

Let's also look at the bus device names and pseudonyms:
primary# ldm list -l -o physio primary
NAME
primary

IO
    DEVICE           PSEUDONYM        OPTIONS
    pci@300          pci_0
    pci@340          pci_1
    pci@380          pci_2
    pci@3c0          pci_3
    pci@400          pci_4
    pci@440          pci_5
    pci@480          pci_6
    pci@4c0          pci_7
    pci@500          pci_8
    pci@540          pci_9
    pci@580          pci_10
    pci@5c0          pci_11
    pci@600          pci_12
    pci@640          pci_13
    pci@680          pci_14
    pci@6c0          pci_15
----snip----
Basic domain configuration
The following commands are basic configuration steps to define virtual disk, console and network services and resize the control domain. They are shown for completeness but are not specifically about configuring for availability.
primary# ldm add-vds primary-vds0 primary
primary# ldm add-vcc port-range=5000-5100 primary-vcc0 primary
primary# ldm add-vswitch net-dev=net0 primary-vsw0 primary
primary# ldm set-core 2 primary
primary# svcadm enable vntsd
primary# ldm start-reconf primary
primary# ldm set-mem 16g primary
primary# shutdown -y -g0 -i6
This is standard control domain configuration. After reboot, we have a resized control domain, and save the configuration to the service processor.
primary# ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  NORM  UPTIME
primary          active     -n-cv-  UART    16    16G      3.3%  2.5%  4m
primary# ldm add-spconfig initial
Determine which buses to reassign
This step follows the same procedure as in the previous article to determine which buses must be kept on the control domain and which can be assigned to an alternate service domain. The official documentation is at Assigning PCIe Buses in the Oracle VM Server for SPARC 3.0 Administration Guide.
First, identify the bus used for the root pool disk (in a production environment this would be mirrored) by getting the device name and then using the mpathadm command.
primary# zpool status rpool
  pool: rpool
 state: ONLINE
  scan: none requested
config:
        NAME                       STATE     READ WRITE CKSUM
        rpool                      ONLINE       0     0     0
          c0t5000CCA01605A11Cd0s0  ONLINE       0     0     0
errors: No known data errors
primary# mpathadm show lu /dev/rdsk/c0t5000CCA01605A11Cd0s0
Logical Unit:  /dev/rdsk/c0t5000CCA01605A11Cd0s2
----snip----
        Paths:
                Initiator Port Name:  w508002000145d1b1
----snip----
primary# mpathadm show initiator-port w508002000145d1b1
Initiator Port:  w508002000145d1b1
        Transport Type:  unknown
        OS Device File:  /devices/pci@300/pci@1/pci@0/pci@4/pci@0/pci@c/scsi@0/iport@1
That shows that the boot disk is on bus pci@300 (pci_0).
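The two `mpathadm` lookups above can be chained into a small script when there are many disks to map. This is a sketch, not part of the original procedure: it assumes the `zpool status` and `mpathadm` output formats shown above, so the `awk` field positions may need adjusting on other Solaris releases.

```shell
# Sketch: map the rpool disk back to its PCIe root complex.
# Assumes output formats as shown above; adjust parsing as needed.
DISK=$(zpool status rpool | awk '/c0t/ {print $1}')
PORT=$(mpathadm show lu /dev/rdsk/${DISK} | \
       awk '/Initiator Port Name/ {print $4; exit}')
# The OS Device File line starts with /devices/pci@NNN..., naming the bus.
mpathadm show initiator-port ${PORT} | grep 'OS Device File'
```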
Next, determine which bus is used for network. Interface net0 (based on ixgbe0) is our primary interface and hosts a virtual switch, so we need to keep its bus.
primary# dladm show-phys
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net1              Ethernet             unknown    0      unknown   ixgbe1
net2              Ethernet             unknown    0      unknown   ixgbe2
net0              Ethernet             up         100    full      ixgbe0
net3              Ethernet             unknown    0      unknown   ixgbe3
net4              Ethernet             up         10     full      usbecm2
primary# ls -l /dev/ix*
lrwxrwxrwx   1 root     root          31 Jun 21 12:04 /dev/ixgbe -> ../devices/pseudo/clone@0:ixgbe
lrwxrwxrwx   1 root     root          65 Jun 21 12:04 /dev/ixgbe0 -> ../devices/pci@300/pci@1/pci@0/pci@4/pci@0/pci@8/network@0:ixgbe0
lrwxrwxrwx   1 root     root          67 Jun 21 12:04 /dev/ixgbe1 -> ../devices/pci@300/pci@1/pci@0/pci@4/pci@0/pci@8/network@0,1:ixgbe1
lrwxrwxrwx   1 root     root          65 Jun 21 12:04 /dev/ixgbe2 -> ../devices/pci@6c0/pci@1/pci@0/pci@c/pci@0/pci@4/network@0:ixgbe2
lrwxrwxrwx   1 root     root          67 Jun 21 12:04 /dev/ixgbe3 -> ../devices/pci@6c0/pci@1/pci@0/pci@c/pci@0/pci@4/network@0,1:ixgbe3
Both disk and network are on bus pci@300 (pci_0), and there are network devices on pci@6c0 (pci_15) that we can give to an alternate service domain.
Let's determine which buses are needed to give that service domain access to disk. Previously we saw that the control domain's root pool was on c0t5000CCA01605A11Cd0s0 on pci@300. The control domain currently has access to all buses and devices, so we can use the format command to see what other disks are available. There is a second disk, and it's on bus pci@6c0:
primary# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c0t5000CCA01605A11Cd0 <HITACHI-H109060SESUN600G-A244 cyl 64986 alt 2 hd 27 sec 66>
          /scsi_vhci/disk@g5000cca01605a11c
          /dev/chassis/SPARC_T5-8.1239BDC0F9//SYS/SASBP0/HDD0/disk
       1. c0t5000CCA016066100d0 <HITACHI-H109060SESUN600G-A244 cyl 64986 alt 2 hd 27 sec 668>
          /scsi_vhci/disk@g5000cca016066100
          /dev/chassis/SPARC_T5-8.1239BDC0F9//SYS/SASBP1/HDD4/disk
Specify disk (enter its number): ^C
primary# mpathadm show lu /dev/dsk/c0t5000CCA016066100d0s0
Logical Unit:  /dev/rdsk/c0t5000CCA016066100d0s2
----snip----
        Paths:
                Initiator Port Name:  w508002000145d1b0
                Target Port Name:  w5000cca016066101
----snip----
primary# mpathadm show initiator-port w508002000145d1b0
Initiator Port:  w508002000145d1b0
        Transport Type:  unknown
        OS Device File:  /devices/pci@6c0/pci@1/pci@0/pci@c/pci@0/pci@c/scsi@0/iport@1
This provides the information needed to reassign buses.
- Bus pci@300 (pci_0) has control domain's boot disk and primary network device.
- Bus pci@6c0 (pci_15) has unused disk and network devices, and can be used for another domain.
Define alternate service domain and reassign buses
We now define an alternate service domain, remove the above buses from the control domain, and assign them to the alternate. Bus reassignment cannot be done dynamically: buses cannot be added to or removed from a running domain. If I had planned ahead and obtained the bus information earlier, I could have done this when I resized the domain's memory and avoided the second reboot.
primary# ldm add-dom alternate
primary# ldm set-core 2 alternate
primary# ldm set-mem 16g alternate
primary# ldm start-reconf primary
primary# ldm rm-io pci_15 primary
primary# init 6
After rebooting the control domain, I give the unassigned bus pci_15 to the alternate domain. At this point I could install Solaris in the alternate domain using a network install server, but for convenience I use a virtual CD image in a .iso file on the control domain. Normally you do not use virtual I/O devices in the alternate service domain because that introduces a dependency on the control domain, but this is temporary and will be removed after Solaris is installed.
primary# ldm add-io pci_15 alternate
primary# ldm add-vdsdev /export/home/iso/sol-11-sparc.iso s11iso@primary-vds0
primary# ldm add-vdisk s11isodisk s11iso@primary-vds0 alternate
primary# ldm bind alternate
primary# ldm start alternate
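Before binding, it can be worth confirming that the bus actually moved. A quick check, shown here as a hedged example rather than part of the original transcript, uses `ldm list-io` (available in Oracle VM Server for SPARC 3.0) to display bus ownership:

```shell
# Confirm pci_15 is now owned by the alternate domain rather than primary.
# Column layout varies by release; filter for the bus of interest.
primary# ldm list-io | grep pci_15
```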
At this point, I installed Solaris in the domain. When the install was complete, I removed the Solaris install CD image, and saved the configuration to the service processor:
primary# ldm rm-vdisk s11isodisk alternate
primary# ldm add-spconfig 20130621-split

Note that the network devices on pci@6c0 are enumerated starting at ixgbe0, even though they were ixgbe2 and ixgbe3 when on the control domain that had all 4 installed interfaces.
alternate# ls -l /dev/ixgb*
lrwxrwxrwx   1 root     root          31 Jun 21 10:34 /dev/ixgbe -> ../devices/pseudo/clone@0:ixgbe
lrwxrwxrwx   1 root     root          65 Jun 21 10:34 /dev/ixgbe0 -> ../devices/pci@6c0/pci@1/pci@0/pci@c/pci@0/pci@4/network@0:ixgbe0
lrwxrwxrwx   1 root     root          67 Jun 21 10:34 /dev/ixgbe1 -> ../devices/pci@6c0/pci@1/pci@0/pci@c/pci@0/pci@4/network@0,1:ixgbe1
Define redundant services
We've split up the bus configuration and defined an I/O domain that can boot and run independently on its own PCIe bus. All that remains is to define redundant disk and network services to pair with the ones defined above in the control domain:
primary# ldm add-vds alternate-vds0 alternate
primary# ldm add-vsw net-dev=net0 alternate-vsw0 alternate
Note that we could increase resiliency, and potentially performance as well, by using a Solaris 11 network aggregate as the net-dev for each virtual switch. That would provide additional insulation: if a single network device fails the aggregate can continue operation without requiring IPMP failover in the guest.
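As a hypothetical illustration of that option (not done in this exercise), a Solaris 11 link aggregation could be created in the service domain and used as the virtual switch's backend. The device names here are assumptions, and both ports must connect to the same switch (or a switch pair that supports aggregation across chassis):

```shell
# Hypothetical: aggregate two physical ports in the service domain,
# then back the virtual switch with the aggregation instead of one NIC.
primary# dladm create-aggr -l net0 -l net1 aggr0
primary# ldm add-vsw net-dev=aggr0 primary-vsw0 primary
```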
In this exercise we use a ZFS storage appliance as an NFS server to host guest disk images, so we mount it on both the control and alternate domain, and then create a directory and boot disk for a guest domain. The following two commands are executed in both the primary and alternate domains:
# mkdir /ldoms
# mount zfssa:/export/mylab /ldoms

Those are the only configuration commands run in the alternate domain. All other commands in this exercise are only run from the control domain.
Define a guest domain
A guest domain will be defined with two network devices so it can use IP Multipathing (IPMP) and two virtual disks for a mirrored root pool, each with a path from both the control and alternate domains. This pattern can be repeated as needed for multiple guest domains, as shown in the following graphic with two guests.
primary# ldm add-dom ldg1
primary# ldm set-core 16 ldg1
primary# ldm set-mem 64g ldg1
primary# ldm add-vnet linkprop=phys-state ldg1net0 primary-vsw0 ldg1
primary# ldm add-vnet linkprop=phys-state ldg1net1 alternate-vsw0 ldg1
primary# ldm add-vdisk s11isodisk s11iso@primary-vds0 ldg1
primary# mkdir /ldoms/ldg1
primary# mkfile -n 20g /ldoms/ldg1/disk0.img
primary# ldm add-vdsdev mpgroup=ldg1group /ldoms/ldg1/disk0.img ldg1disk0@primary-vds0
primary# ldm add-vdsdev mpgroup=ldg1group /ldoms/ldg1/disk0.img ldg1disk0@alternate-vds0
primary# ldm add-vdisk ldg1disk0 ldg1disk0@primary-vds0 ldg1
primary# mkfile -n 20g /ldoms/ldg1/disk1.img
primary# ldm add-vdsdev mpgroup=ldg1group1 /ldoms/ldg1/disk1.img ldg1disk1@primary-vds0
primary# ldm add-vdsdev mpgroup=ldg1group1 /ldoms/ldg1/disk1.img ldg1disk1@alternate-vds0
primary# ldm add-vdisk ldg1disk1 ldg1disk1@alternate-vds0 ldg1
primary# ldm bind ldg1
primary# ldm start ldg1
Note the use of linkprop=phys-state on the virtual network definitions: this indicates that changes in physical link state should be passed to the virtual device so it can perform a failover.
Also note mpgroup on the virtual disk definitions. The ldm add-vdsdev commands define a virtual disk exported by a service domain, and the mpgroup pair indicates they are the same disk (the administrator must ensure they are different paths to the same disk) accessible by multiple paths. A different mpgroup pair is used for each multi-path disk. For each actual disk there are two "add-vdsdev" commands, and one ldm add-vdisk command that adds the multi-path disk to the guest. Each disk can be accessed from either the control domain or the alternate domain, transparent to the guest. This is documented in the Oracle VM Server for SPARC 3.0 Administration Guide at Configuring Virtual Disk Multipathing.
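The resulting multipath configuration can be inspected from the control domain. These verification commands are a suggested check rather than part of the original transcript; `ldm list -o disk` shows the guest's virtual disks with their MPGROUP, and `ldm list-services` shows the exported backends in each virtual disk service:

```shell
# Show the guest's virtual disks with their multipath group membership.
primary# ldm list -o disk ldg1
# Show the exported disk backends in both virtual disk services.
primary# ldm list-services | grep ldg1disk
```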
At this point, Solaris is installed in the guest domain without any special procedures. It will have a mirrored ZFS root pool, and each disk is available from both service domains. It also has two network devices, one from each service domain. This provides resiliency for device failure, and in case either the control domain or alternate domain is rebooted.
Configuring and testing redundancy
Multipath disk I/O is transparent to the guest domain. This was tested by serially rebooting the control domain or the alternate domain, and observing that disk I/O simply continued without noticeable effect.
Network redundancy required configuring IP Multipathing (IPMP) in the guest domain. The guest has two network devices, net0 provided by the control domain, and net1 provided by the alternate domain. The process is documented at Configuring IPMP in a Logical Domains Environment.
The following commands are executed in the guest domain to make a redundant network connection:
ldg1# ipadm create-ipmp ipmp0
ldg1# ipadm add-ipmp -i net0 -i net1 ipmp0
ldg1# ipadm create-addr -T static -a 10.134.116.224/24 ipmp0/v4addr1
ldg1# ipadm create-addr -T static -a 10.134.116.225/24 ipmp0/v4addr2
ldg1# ipadm show-if
IFNAME     CLASS    STATE    ACTIVE OVER
lo0        loopback ok       yes    --
net0       ip       ok       yes    --
net1       ip       ok       yes    --
ipmp0      ipmp     ok       yes    net0 net1
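Beyond `ipadm show-if`, the Solaris 11 `ipmpstat` utility gives a more detailed view of group health, which is handy while exercising failover. This is a suggested supplementary check, not part of the original test transcript:

```shell
# Group-level view: which interfaces belong to ipmp0 and the
# failure-detection state of the group.
ldg1# ipmpstat -g
# Per-interface view: which links are currently active vs. failed.
ldg1# ipmpstat -i
```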
This was tested by bouncing the alternate service domain and control domain (one at a time) and noting that network sessions remained intact. The guest domain console displayed messages when one link failed and was restored:
Jul 9 10:35:51 ldg1 in.mpathd: The link has gone down on net1
Jul 9 10:35:51 ldg1 in.mpathd: IP interface failure detected on net1 of group ipmp0
Jul 9 10:37:37 ldg1 in.mpathd: The link has come up on net1
While one of the service domains was down, dladm and ipadm showed link status:
ldg1# ipadm show-if
IFNAME     CLASS    STATE    ACTIVE OVER
lo0        loopback ok       yes    --
net0       ip       ok       yes    --
net1       ip       failed   no     --
ipmp0      ipmp     ok       yes    net0 net1
ldg1# dladm show-phys
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net0              Ethernet             up         0      unknown   vnet0
net1              Ethernet             down       0      unknown   vnet1
ldg1# dladm show-link
LINK                CLASS     MTU    STATE    OVER
net0                phys      1500   up       --
net1                phys      1500   down     --

When the service domain finished rebooting, the "down" status returned to "up". There was no outage at any time.
This article showed how to configure a T5-8 with an alternate service domain, and define services for redundant I/O access. This was tested by rebooting each service domain one at a time, and observing that guest operation continued without interruption. This is a very powerful Oracle VM Server for SPARC capability for configuring highly available virtualized compute environments.