Availability Best Practices - Example configuring a T5-8

This post is one of a series of "best practices" notes for Oracle VM Server for SPARC (formerly named Logical Domains). It continues the series on availability best practices by showing each step used to configure a T5-8 for availability, with redundant network and disk I/O provided by multiple service domains.

Overview of T5

The SPARC T5 servers are a powerful addition to the SPARC line. Details on the product can be seen at SPARC T5-8 Server, SPARC T5-8 Server Documentation, The SPARC T5 Servers have landed, and other locations.

For this discussion, the important things to know are:

  • Each T5 chip (sometimes referred to as a socket or CPU) has 16 cores operating at 3.36GHz. Each core has 8 CPU threads. A server can have between 1 and 8 sockets for a maximum of 1,024 CPU threads on 128 cores. (Editorial comment: Wow!)
  • Up to 4TB of RAM.
  • Two PCIe 3.0 buses per socket. A T5-8 has 16 PCIe buses.

The following graphic shows T5-8 server resources. This picture labels each chip as a CPU, and shows CPU0 through CPU7 on their respective Processor Modules (PM) and the associated buses. On-board devices are connected to buses on CPU0 and CPU7.

Initial configuration

This demo is done on a lab system with a limited I/O configuration, but enough to show availability practices. Real T5-8 systems would typically have much richer I/O. The system is delivered with a single control domain owning all CPU, I/O and memory resources. Let's view the resources bound to the control domain (the only domain at this time). Wow, that's a lot of CPUs and memory. Some output and whitespace has been snipped out for brevity.

primary# ldm list -l
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  NORM  UPTIME
primary          active     -n-c--  UART    1024  1047296M 0.0%  0.0%  2d 5h 11m

----snip----

CORE
    CID    CPUSET
    0      (0, 1, 2, 3, 4, 5, 6, 7)
    1      (8, 9, 10, 11, 12, 13, 14, 15)
    2      (16, 17, 18, 19, 20, 21, 22, 23)
    3      (24, 25, 26, 27, 28, 29, 30, 31)
----snip----
    124    (992, 993, 994, 995, 996, 997, 998, 999)
    125    (1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007)
    126    (1008, 1009, 1010, 1011, 1012, 1013, 1014, 1015)
    127    (1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023)
VCPU
    VID    PID    CID    UTIL NORM STRAND
    0      0      0      4.7% 0.2%   100%
    1      1      0      1.3% 0.1%   100%
    2      2      0      0.2% 0.0%   100%
    3      3      0      0.1% 0.0%   100%
----snip----
    1020   1020   127    0.0% 0.0%   100%
    1021   1021   127    0.0% 0.0%   100%
    1022   1022   127    0.0% 0.0%   100%
    1023   1023   127    0.0% 0.0%   100%
----snip----
IO
    DEVICE                           PSEUDONYM        OPTIONS
    pci@300                          pci_0           
    pci@340                          pci_1           
    pci@380                          pci_2           
    pci@3c0                          pci_3           
    pci@400                          pci_4           
    pci@440                          pci_5           
    pci@480                          pci_6           
    pci@4c0                          pci_7           
    pci@500                          pci_8           
    pci@540                          pci_9           
    pci@580                          pci_10          
    pci@5c0                          pci_11          
    pci@600                          pci_12          
    pci@640                          pci_13          
    pci@680                          pci_14          
    pci@6c0                          pci_15    
----snip----
Let's also look at the bus device names and pseudonyms:
primary# ldm list -l -o physio primary
NAME             
primary          

IO
    DEVICE                           PSEUDONYM        OPTIONS
    pci@300                          pci_0           
    pci@340                          pci_1           
    pci@380                          pci_2           
    pci@3c0                          pci_3           
    pci@400                          pci_4           
    pci@440                          pci_5           
    pci@480                          pci_6           
    pci@4c0                          pci_7           
    pci@500                          pci_8           
    pci@540                          pci_9           
    pci@580                          pci_10          
    pci@5c0                          pci_11          
    pci@600                          pci_12          
    pci@640                          pci_13          
    pci@680                          pci_14          
    pci@6c0                          pci_15
----snip----          

Basic domain configuration

The following commands are basic configuration steps to define virtual disk, console and network services and resize the control domain. They are shown for completeness but are not specifically about configuring for availability.

primary# ldm add-vds primary-vds0 primary
primary# ldm add-vcc port-range=5000-5100 primary-vcc0 primary
primary# ldm add-vswitch net-dev=net0 primary-vsw0 primary
primary# ldm set-core 2 primary
primary# svcadm enable vntsd
primary# ldm start-reconf primary
primary# ldm set-mem 16g primary
primary# shutdown -y -g0 -i6

This is standard control domain configuration. After the reboot, we have a resized control domain, and we save the configuration to the service processor.

primary# ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  NORM  UPTIME
primary          active     -n-cv-  UART    16    16G      3.3%  2.5%  4m
primary# ldm add-spconfig initial

Determine which buses to reassign

This step follows the same procedure as in the previous article to determine which buses must be kept on the control domain and which can be assigned to an alternate service domain. The official documentation is at Assigning PCIe Buses in the Oracle VM Server for SPARC 3.0 Administration Guide.

First, identify the bus used for the root pool disk (in a production environment this would be mirrored) by getting the device name and then using the mpathadm command.

primary# zpool status rpool
  pool: rpool
 state: ONLINE
  scan: none requested
config:
        NAME                       STATE     READ WRITE CKSUM
        rpool                      ONLINE       0     0     0
          c0t5000CCA01605A11Cd0s0  ONLINE       0     0     0
errors: No known data errors
primary# mpathadm show lu /dev/rdsk/c0t5000CCA01605A11Cd0s0
Logical Unit:  /dev/rdsk/c0t5000CCA01605A11Cd0s2
----snip----        
        Paths:  
                Initiator Port Name:  w508002000145d1b1
----snip----

primary# mpathadm show initiator-port w508002000145d1b1
Initiator Port:  w508002000145d1b1
        Transport Type:  unknown
        OS Device File:  /devices/pci@300/pci@1/pci@0/pci@4/pci@0/pci@c/scsi@0/iport@1

That shows that the boot disk is on bus pci@300 (pci_0).

Next, determine which bus is used for network. Interface net0 (based on ixgbe0) is our primary interface and hosts a virtual switch, so we need to keep its bus.

primary# dladm show-phys
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net1              Ethernet             unknown    0      unknown   ixgbe1
net2              Ethernet             unknown    0      unknown   ixgbe2
net0              Ethernet             up         100    full      ixgbe0
net3              Ethernet             unknown    0      unknown   ixgbe3
net4              Ethernet             up         10     full      usbecm2
primary# ls -l /dev/ix*
lrwxrwxrwx   1 root     root     31 Jun 21 12:04 /dev/ixgbe -> ../devices/pseudo/clone@0:ixgbe
lrwxrwxrwx   1 root     root     65 Jun 21 12:04 /dev/ixgbe0 -> ../devices/pci@300/pci@1/pci@0/pci@4/pci@0/pci@8/network@0:ixgbe0
lrwxrwxrwx   1 root     root     67 Jun 21 12:04 /dev/ixgbe1 -> ../devices/pci@300/pci@1/pci@0/pci@4/pci@0/pci@8/network@0,1:ixgbe1
lrwxrwxrwx   1 root     root     65 Jun 21 12:04 /dev/ixgbe2 -> ../devices/pci@6c0/pci@1/pci@0/pci@c/pci@0/pci@4/network@0:ixgbe2
lrwxrwxrwx   1 root     root     67 Jun 21 12:04 /dev/ixgbe3 -> ../devices/pci@6c0/pci@1/pci@0/pci@c/pci@0/pci@4/network@0,1:ixgbe3

Both disk and network are on bus pci@300 (pci_0), and there are network devices on pci@6c0 (pci_15) that we can give to an alternate service domain.

Let's determine which buses are needed to give that service domain access to disk. Previously we saw that the control domain's root pool was on c0t5000CCA01605A11Cd0s0 on pci@300. The control domain currently has access to all buses and devices, so we can use the format command to see what other disks are available. There is a second disk, and it's on bus pci@6c0:

primary# format
Searching for disks...done
AVAILABLE DISK SELECTIONS:
       0. c0t5000CCA01605A11Cd0 <HITACHI-H109060SESUN600G-A244 cyl 64986 alt 2 hd 27 sec 668>
          /scsi_vhci/disk@g5000cca01605a11c
          /dev/chassis/SPARC_T5-8.1239BDC0F9//SYS/SASBP0/HDD0/disk
       1. c0t5000CCA016066100d0 <HITACHI-H109060SESUN600G-A244 cyl 64986 alt 2 hd 27 sec 668>
          /scsi_vhci/disk@g5000cca016066100
          /dev/chassis/SPARC_T5-8.1239BDC0F9//SYS/SASBP1/HDD4/disk
Specify disk (enter its number): ^C
primary# mpathadm show lu /dev/dsk/c0t5000CCA016066100d0s0
Logical Unit:  /dev/rdsk/c0t5000CCA016066100d0s2
----snip----
        Paths:  
                Initiator Port Name:  w508002000145d1b0
                Target Port Name:  w5000cca016066101
----snip----
primary# mpathadm show initiator-port w508002000145d1b0
Initiator Port:  w508002000145d1b0
        Transport Type:  unknown
        OS Device File:  /devices/pci@6c0/pci@1/pci@0/pci@c/pci@0/pci@c/scsi@0/iport@1

This provides the information needed to reassign buses.

  • Bus pci@300 (pci_0) has control domain's boot disk and primary network device.
  • Bus pci@6c0 (pci_15) has unused disk and network devices, and can be used for another domain.

Define alternate service domain and reassign buses

We now define an alternate service domain, remove bus pci_15 from the control domain, and assign it to the alternate domain. Buses cannot currently be added to or removed from a running domain, so this requires a delayed reconfiguration and another reboot of the control domain. If I had planned ahead and obtained the bus information earlier, I could have done this when I resized the domain's memory and avoided the second reboot.

primary# ldm add-dom alternate
primary# ldm set-core 2 alternate
primary# ldm set-mem 16g alternate
primary# ldm start-reconf primary
primary# ldm rm-io pci_15 primary
primary# init 6
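
For reference, here is a sketch of how the two delayed reconfigurations could have been combined into one, assuming the bus assignments had already been worked out before resizing the control domain:

# Combine the memory resize and the bus removal in a single delayed
# reconfiguration so that only one control domain reboot is needed.
primary# ldm start-reconf primary
primary# ldm set-mem 16g primary
primary# ldm rm-io pci_15 primary
primary# init 6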

After rebooting the control domain, I give the unassigned bus pci_15 to the alternate domain. At this point I could install Solaris in the alternate domain using a network install server, but for convenience I use a virtual CD image in a .iso file on the control domain. Normally you do not use virtual I/O devices in the alternate service domain because that introduces a dependency on the control domain, but this is temporary and will be removed after Solaris is installed.

primary# ldm add-io pci_15 alternate
primary# ldm add-vdsdev /export/home/iso/sol-11-sparc.iso s11iso@primary-vds0
primary# ldm add-vdisk s11isodisk s11iso@primary-vds0 alternate
primary# ldm bind alternate
primary# ldm start alternate

At this point, I installed Solaris in the domain. When the install was complete, I removed the Solaris install CD image, and saved the configuration to the service processor:

primary# ldm rm-vdisk s11isodisk alternate
primary# ldm add-spconfig 20130621-split

Note that the network devices on pci@6c0 are enumerated in the alternate domain starting at ixgbe0, even though they were ixgbe2 and ixgbe3 when they belonged to the control domain, which had all four installed interfaces.

alternate# ls -l /dev/ixgb*
lrwxrwxrwx   1 root     root     31 Jun 21 10:34 /dev/ixgbe -> ../devices/pseudo/clone@0:ixgbe
lrwxrwxrwx   1 root     root     65 Jun 21 10:34 /dev/ixgbe0 -> ../devices/pci@6c0/pci@1/pci@0/pci@c/pci@0/pci@4/network@0:ixgbe0
lrwxrwxrwx   1 root     root     67 Jun 21 10:34 /dev/ixgbe1 -> ../devices/pci@6c0/pci@1/pci@0/pci@c/pci@0/pci@4/network@0,1:ixgbe1

Define redundant services

We've split up the bus configuration and defined an I/O domain that can boot and run independently on its own PCIe bus. All that remains is to define redundant disk and network services to pair with the ones defined above in the control domain:

primary# ldm add-vds alternate-vds0 alternate
primary# ldm add-vsw net-dev=net0 alternate-vsw0 alternate

Note that we could increase resiliency, and potentially performance as well, by using a Solaris 11 network aggregate as the net-dev for each virtual switch. That would provide additional insulation: if a single network device fails the aggregate can continue operation without requiring IPMP failover in the guest.
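
As a sketch of that approach (the aggregation name aggr0 and its member links are illustrative, and assume the service domain has two suitable physical links connected to switch ports configured for aggregation), the aggregate would be created in the service domain and then used as the virtual switch back-end:

# Create a link aggregation from two physical NICs in the service domain,
# then point the existing virtual switch at the aggregation instead of net0.
primary# dladm create-aggr -l net0 -l net1 aggr0
primary# ldm set-vsw net-dev=aggr0 primary-vsw0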

In this exercise we use a ZFS storage appliance as an NFS server to host guest disk images, so we mount it on both the control and alternate domain, and then create a directory and boot disk for a guest domain. The following two commands are executed in both the primary and alternate domains:

# mkdir /ldoms				 
# mount zfssa:/export/mylab /ldoms  
Those are the only configuration commands run in the alternate domain. All other commands in this exercise are only run from the control domain.
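
Optionally, to make that NFS mount persistent across service domain reboots, an /etc/vfstab entry along these lines could be added in both domains (a sketch; the server and share names match the example above and should be adjusted for your environment):

#device to mount     device to fsck  mount point  FS type  fsck pass  mount at boot  mount options
zfssa:/export/mylab  -               /ldoms       nfs      -          yes            -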

Define a guest domain

A guest domain will be defined with two network devices, so it can use IP Multipathing (IPMP), and with two virtual disks for a mirrored root pool, each disk having a path from both the control and alternate domains. This pattern can be repeated as needed for multiple guest domains, as shown in the following graphic with two guests.

primary# ldm add-dom ldg1
primary# ldm set-core 16 ldg1
primary# ldm set-mem 64g ldg1
primary# ldm add-vnet linkprop=phys-state ldg1net0 primary-vsw0 ldg1 
primary# ldm add-vnet linkprop=phys-state ldg1net1 alternate-vsw0 ldg1
primary# ldm add-vdisk s11isodisk s11iso@primary-vds0 ldg1
primary# mkdir /ldoms/ldg1
primary# mkfile -n 20g /ldoms/ldg1/disk0.img
primary# ldm add-vdsdev mpgroup=ldg1group /ldoms/ldg1/disk0.img ldg1disk0@primary-vds0
primary# ldm add-vdsdev mpgroup=ldg1group /ldoms/ldg1/disk0.img ldg1disk0@alternate-vds0
primary# ldm add-vdisk ldg1disk0 ldg1disk0@primary-vds0 ldg1
primary# mkfile -n 20g /ldoms/ldg1/disk1.img
primary# ldm add-vdsdev mpgroup=ldg1group1 /ldoms/ldg1/disk1.img ldg1disk1@primary-vds0
primary# ldm add-vdsdev mpgroup=ldg1group1 /ldoms/ldg1/disk1.img ldg1disk1@alternate-vds0
primary# ldm add-vdisk ldg1disk1 ldg1disk1@alternate-vds0 ldg1
primary# ldm bind ldg1
primary# ldm start ldg1

Note the use of linkprop=phys-state on the virtual network definitions: this indicates that changes in physical link state should be passed to the virtual device so it can perform a failover.

Also note mpgroup on the virtual disk definitions. The ldm add-vdsdev commands define a virtual disk backend exported by a service domain, and a matching mpgroup value indicates that the two backends are paths to the same disk (the administrator must ensure they really are different paths to the same disk). A different mpgroup name is used for each multipath disk. For each actual disk there are two ldm add-vdsdev commands and one ldm add-vdisk command that adds the multipath disk to the guest. Each disk can then be accessed through either the control domain or the alternate domain, transparently to the guest. This is documented in the Oracle VM Server for SPARC 3.0 Administration Guide at Configuring Virtual Disk Multipathing.
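
The general pattern for each multipath guest disk, with GROUP, DISK and GUEST as placeholders, is:

# Export the same backend from both service domains with a matching mpgroup,
# then add a single virtual disk to the guest; either path can serve its I/O.
primary# ldm add-vdsdev mpgroup=GROUP /path/to/backend DISK@primary-vds0
primary# ldm add-vdsdev mpgroup=GROUP /path/to/backend DISK@alternate-vds0
primary# ldm add-vdisk DISK DISK@primary-vds0 GUEST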

At this point, Solaris is installed in the guest domain without any special procedures. It will have a mirrored ZFS root pool, and each disk is available from both service domains. It also has two network devices, one from each service domain. This provides resiliency for device failure, and in case either the control domain or alternate domain is rebooted.
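
If Solaris was installed onto only the first virtual disk, the root pool mirror can be created afterwards from inside the guest. A minimal sketch, assuming the guest's two virtual disks appear as c2d0 and c2d1 (the actual device names will vary, and the second disk may need to be labeled to match the first):

# Attach the second virtual disk to the existing root pool to form a mirror,
# then watch the resilver complete before relying on the redundancy.
ldg1# zpool attach rpool c2d0s0 c2d1s0
ldg1# zpool status rpool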

Configuring and testing redundancy

Multipath disk I/O is transparent to the guest domain. This was tested by serially rebooting the control domain and the alternate domain, and observing that disk I/O operations simply proceeded without noticeable effect.
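
A simple way to observe this from inside the guest is to keep a write workload running while one service domain reboots; a sketch (the file name and sizes are arbitrary):

# Generate continuous disk writes in the background and watch the device
# statistics; the numbers should keep moving while a service domain reboots.
ldg1# while true; do dd if=/dev/zero of=/var/tmp/ioprobe bs=1024k count=100; done &
ldg1# iostat -xn 5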

Network redundancy required configuring IP Multipathing (IPMP) in the guest domain. The guest has two network devices, net0 provided by the control domain, and net1 provided by the alternate domain. The process is documented at Configuring IPMP in a Logical Domains Environment.

The following commands are executed in the guest domain to make a redundant network connection:

ldg1# ipadm create-ipmp ipmp0
ldg1# ipadm add-ipmp -i net0 -i net1 ipmp0
ldg1# ipadm create-addr -T static -a 10.134.116.224/24 ipmp0/v4addr1
ldg1# ipadm create-addr -T static -a 10.134.116.225/24 ipmp0/v4addr2
ldg1# ipadm show-if
IFNAME     CLASS    STATE    ACTIVE OVER
lo0        loopback ok       yes    --
net0       ip       ok       yes    --
net1       ip       ok       yes    --
ipmp0      ipmp     ok       yes    net0 net1

This was tested by bouncing the alternate service domain and control domain (one at a time) and noting that network sessions remained intact. The guest domain console displayed messages when one link failed and was restored:

Jul  9 10:35:51 ldg1 in.mpathd[107]: The link has gone down on net1
Jul  9 10:35:51 ldg1 in.mpathd[107]: IP interface failure detected on net1 of group ipmp0
Jul  9 10:37:37 ldg1 in.mpathd[107]: The link has come up on net1

While one of the service domains was down, dladm and ipadm showed link status:

ldg1# ipadm show-if
IFNAME     CLASS    STATE    ACTIVE OVER
lo0        loopback ok       yes    --
net0       ip       ok       yes    --
net1       ip       failed   no     --
ipmp0      ipmp     ok       yes    net0 net1
ldg1# dladm show-phys
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net0              Ethernet             up         0      unknown   vnet0
net1              Ethernet             down       0      unknown   vnet1
ldg1# dladm show-link
LINK                CLASS     MTU    STATE    OVER
net0                phys      1500   up       --
net1                phys      1500   down     --

When the service domain finished rebooting, the "down" status returned to "up". There was no outage at any time.

Summary

This article showed how to configure a T5-8 with an alternate service domain and define services for redundant I/O access. This was tested by rebooting each service domain one at a time and observing that guest operation continued without interruption. This is a very powerful Oracle VM Server for SPARC capability for configuring highly available virtualized compute environments.

Comments:

Thanks for the info - I understand that the T5-8 can support up to eight internal discs - how are these divided across the PCI buses ? Are they split across just two PCI buses (similar to the T4-4) or are they split across more ?

Posted by guest on July 22, 2013 at 01:46 AM MST #

I'm glad you like it. I am not at the office and don't have the links handy, but if I recall correctly it is the same: the controllers are on 2 buses, the same as in this example but with more disks installed. The T5 reference docs will have the definitive answer. Regards, Jeff

Posted by Jeff on July 22, 2013 at 02:13 AM MST #

That's a bit of a shame. My initial thought was that it would be more advantageous if more IO domains could be created (using all internal components) so that we could spread the IO load between multiple instances.

On another note are there any tools yet to determine which IO domain is serving the guest domains for disc IO ? And for that matter the ability to move the load between IO domains without having to reboot the IO domain currently serving the guest ?

Posted by guest on July 22, 2013 at 02:51 AM MST #

I see more than 2 I/O domains as being more for application separation (as in SuperCluster) than for raw performance, since 2 domains can provide redundancy and scale as much I/O as needed for internal devices, and even external ones. To the other point: not yet. You can get visibility indirectly by using iostat in the service domains, but there is no way yet to set which path is used. Jeff

Posted by Jeff on July 22, 2013 at 08:33 AM MST #

Very nice, I created 2 I/O domains on my T5 server, thanks.

Posted by guest on July 27, 2013 at 12:24 AM MST #

Very useful, I just created the alternate service domain on my T5, thanks.

Posted by ZW on July 27, 2013 at 12:30 AM MST #

Thanks for the comments - I'm glad you found it helpful.

Posted by guest on July 28, 2013 at 11:18 AM MST #
