Tuesday Aug 20, 2013

Live migration of a Guest with an SR-IOV VF using dynamic SR-IOV feature!

NOTE: This is only an example of how Dynamic SR-IOV can be exploited to accomplish live migration of a Guest domain with an SR-IOV VF.

OVM Server for SPARC 3.1 introduces the Dynamic SR-IOV feature, which provides the capability to dynamically add SR-IOV Virtual Functions to, and remove them from, logical domains. This is one use case showing how the Dynamic SR-IOV feature can be combined with rich Solaris I/O features to accomplish live migration of a logical domain that has an Ethernet SR-IOV Virtual Function assigned to it. The idea is to create a multipath configuration in the logical domain with a VF and a Virtual Network (vnet) device, so that we can dynamically remove the VF before the live migration and then re-assign the same VF on the target system. The vnet device in the logical domain serves two purposes: 1) it allows the VF to be dynamically removed from the domain, and 2) it carries the application traffic for the duration of the live migration, since the VF is removed at the start of the migration.

Solaris IPMP is currently the best multipath configuration available for a VF and a vnet device. We need to configure the vnet device as a standby device so that the high-performance VF is used for communication whenever it is available. If you want a VF to be assigned again on the target system, the restriction today is that you must add a VF with the exact same name, that is, a VF from the same PCIe slot and with the same VF number as on the source system. When the same device is seen by the Solaris OS in the logical domain, it automatically adds it to the same IPMP group, so no manual intervention is required. Because the VF was configured as the active device, IPMP automatically redirects the traffic back to the VF. The following diagram shows this configuration visually:

Live migration of a Guest with an SR-IOV VF

The above configuration also shows that we can choose to use the same PF as the backend device for the Virtual Switch while a VF from that PF is assigned to the Guest domain. That is, both the vnet and the VF use the same PF device. Note that you are free to use another NIC as the backend device for the virtual switch, but this example demonstrates that multiple network ports are not required for this configuration. Another recommended configuration is to use a separate Service domain to host the Virtual Switch; that way an admin can use the same setup to handle a manual reboot of the Root domain, that is, manually remove the VF from the Guest domain dynamically before rebooting the Root domain and assign the VF again dynamically after the Root domain has booted.

The following are the high-level steps to create such a configuration.

Assumptions: 

  • A Guest domain named "ldg1" is already created with configuration to meet the Live migration requirements. 
  • The Guest domain OS supports Dynamic SR-IOV; see the OVM Server for SPARC 3.1 Release Notes for the OS versions that support Dynamic SR-IOV.
  • The Physical Function /SYS/MB/NET0/IOVNET.PF0 is used to provide the VFs.
  • The network device "net0" on the primary domain maps to the PF /SYS/MB/NET0/IOVNET.PF0.
  • The desired number of SR-IOV Virtual Functions are already created on the PF /SYS/MB/NET0/IOVNET.PF0. See the OVM Server for SPARC 3.1 admin guide for how to create the Virtual Functions. This example uses the VF named /SYS/MB/NET0/IOVNET.PF0.VF0.
  • A Virtual Switch(vsw) named "primary-vsw0" is already created on the primary domain.
Steps to create the config:

  • Create and add a Vnet device to the Guest domain ldg1.
    • # ldm add-vnet vnet0 primary-vsw0 ldg1
  • Add a VF to the Guest domain ldg1
    • # ldm add-io /SYS/MB/NET0/IOVNET.PF0.VF0 ldg1
  • Boot the Guest domain ldg1
  • Login to the Guest domain and configure IPMP
    • Use "dladm show-phys" command to determine the netX names that Solaris OS assigned to the Vnet and VF devices. 
    • This example assumes net0 maps to Vnet device and net1 maps to the VF device.
  • Configure IPMP. Note that this creates a simple active/standby IPMP configuration; you can choose to create the IPMP configuration based on your network needs. The important point is that the vnet device is configured as the standby device. A quick verification sketch follows these steps.
    • # ipadm create-ip net0
    • # ipadm create-ip net1
    • # ipadm set-ifprop -p standby=on -m ip net0
    • # ipadm create-ipmp -i net0 -i net1 ipmp0
    • # ipadm create-addr -T static -a local=<ipaddr>/<netmask> ipmp0
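A quick way to verify the resulting group from inside the Guest domain is shown below. This is a minimal sketch: the interface names follow the example above, and the exact output columns are not reproduced here.

ldg1# ipmpstat -g      (summary of the ipmp0 group and its state)
ldg1# ipmpstat -i      (per-interface view; net0, the vnet, should carry the standby flag and net1, the VF, should be active)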

Live migration steps:

  • On the source system, to live migrate (a scripted sketch of the full sequence follows these steps):
    • # ldm remove-io  /SYS/MB/NET0/IOVNET.PF0.VF0 ldg1
    • # ldm migrate -p <password-file> ldg1 root@<target-system>
  • On the target system after live migration:
    • # ldm add-io /SYS/MB/NET0/IOVNET.PF0.VF0 ldg1
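For repeated migrations, the same sequence could be wrapped in a small script run from the source control domain. This is only a sketch under assumptions: the domain, VF and target names are placeholders, a password file for ldm migrate has been prepared, and ssh access to the target control domain is available.

#!/bin/sh
# Sketch: live migrate ldg1 while temporarily detaching its SR-IOV VF.
DOMAIN=ldg1
VF=/SYS/MB/NET0/IOVNET.PF0.VF0       # a VF with the same name must exist on the target
TARGET=target-system                 # placeholder for the target control domain
PWFILE=/var/tmp/migration-password   # placeholder for the ldm migrate password file

ldm remove-io $VF $DOMAIN || exit 1  # traffic fails over to the vnet via IPMP
ldm migrate -p $PWFILE $DOMAIN root@$TARGET || exit 1
ssh root@$TARGET "ldm add-io $VF $DOMAIN"   # re-assign the identically named VF on the target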

The following YouTube video is a demo of live migration with an SR-IOV VF in a guest while running a network performance test. The graphs show the traffic switching to the vnet after the VF is removed and then switching back to the VF when it is added on the target system.

Oracle Open World 2012 Demo of Live migration with SR-IOV Virtual Function


Direct I/O and SR-IOV features are now extended to Non-Primary root domains!

Until now, the OVM Server for SPARC Direct I/O and SR-IOV features were limited to PCIe buses assigned to the primary domain only. This restriction is removed with the release of OVM Server for SPARC 3.1. That is, you can now assign a PCIe bus to a logical domain and then assign PCIe slots or SR-IOV Virtual Functions from that PCIe bus to other domains. This opens up many different creative opportunities, for example it enables configurations such as the one below:

Non-Primary root domain example config

A configuration like the above, along with the Dynamic SR-IOV feature in OVM Server for SPARC 3.1, opens up various deployment opportunities. Note that this does not yet increase the availability of the I/O domains, but that will certainly come in the near future. It does provide the opportunity to handle situations such as rebooting a Root domain manually. To reboot one of the Root domains, an admin can now remove the VFs from the I/O domains, reboot that Root domain, and once it is back up, add those VFs to the I/O domains again, as sketched below. The Solaris OS in the I/O domains automatically adds those VFs (the exact same VFs need to be assigned) back to the same multipath groups, thereby bringing everything back to normal.
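For example, a planned reboot of a Non-Primary root domain might look like the following sketch. The names are placeholders in the style used elsewhere in this post, and the VF must be re-added with exactly the same name.

primary# ldm remove-io <vf-name> <io-domain>    (traffic fails over to the other multipath member)
primary# ldm stop <root-domain>
primary# ldm start <root-domain>
   ... wait for the Root domain OS to finish booting ...
primary# ldm add-io <vf-name> <io-domain>       (Solaris re-joins the VF to the same multipath group)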

OVM Server for SPARC 3.1 introduces Dynamic SR-IOV feature

OVM Server for SPARC 3.1 introduces a great enhancement to the PCIe SR-IOV feature. Until now, creating and destroying SR-IOV Virtual Functions (VFs) was a static operation, that is, it required a reboot of the root domain, and adding or removing VFs required the Guest domain to be stopped. Rebooting root domains can be disruptive as it impacts the I/O domains that depend on them. Stopping a Guest domain to add or remove a VF is also disruptive to the applications running in it. OVM Server for SPARC 3.1 enhances the PCIe SR-IOV feature with Dynamic SR-IOV for Ethernet SR-IOV devices, which removes these restrictions. That is, we can now create or destroy VFs while a root domain is running, and we can also add or remove VFs from a Guest domain while it is running, without impacting applications. Note that OVM Server for SPARC 3.1 introduces another feature named "Non-Primary Root Domains" which extends the PCIe SR-IOV and Dynamic SR-IOV features to all Root domains, not just the Primary domain. That is, you can perform all Dynamic SR-IOV operations on Non-Primary Root domains as well.

This feature is supported only when the OVM Server for SPARC 3.1 LDoms manager, the corresponding System Firmware and a supported OS version are all in place. Refer to the OVM Server for SPARC 3.1 release notes for the exact supported OS and System Firmware versions. Dynamic IOV is enabled for a given logical domain only if all of the software components are installed and the other configuration requirements are met. If not, you can always use the static method to accomplish your changes.

There is still one operation that we could not make dynamic yet: enabling I/O virtualization for a given PCIe bus. For now, I/O virtualization for a PCIe bus needs to be enabled ahead of time, that is, the following needs to be done in advance. If you already know how many VFs you need, it is a good idea to create them at the same time, since you will be rebooting anyway. The following steps enable I/O virtualization for a PCIe bus. Note that this needs to be done only once per PCIe bus, while the bus is assigned to a Root domain.

Enable I/O Virtualization for a PCIe bus:

  • If the PCIe bus is already assigned to a domain: 
    • # ldm start-reconf <root-domain-name>
    • # ldm set-io iov=on <PCIe bus name>
    • # reboot  
  • You can also enable it while adding a PCIe bus to a logical domain:
    • # ldm add-io iov=on <PCIe bus name> <domain-name>
  • You can check whether IOV is enabled for a given PCIe bus with "ldm list-io <PCIe bus name>". A complete sketch follows this list.
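Putting these together, enabling I/O virtualization on a bus that is already assigned to a root domain might look like the sketch below; the bus name pci_1 and the domain name rootdom1 are placeholders.

primary# ldm start-reconf rootdom1
primary# ldm set-io iov=on pci_1
rootdom1# reboot                      (reboot the root domain to apply the change)
primary# ldm list-io pci_1            (confirm that IOV is now enabled for the bus)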

Dynamically create or destroy VFs:

  • Once IOV is enabled for a given PCIe bus, you can create or destroy VFs with create-vf and destroy-vf without requiring a delayed reconfiguration and a reboot. However, this requires the Physical Function network device to be either not plumbed or in a multipath configuration. The dynamic create/destroy VF operations perform hotplug offline and online operations on the PF device, which causes the PF to be detached and re-attached. As a result, the PF device needs to be either not plumbed (that is, not in use) or in a multipath (IPMP or aggr) configuration so that the hotplug offline/online operations succeed. If that is not the case, the dynamic create/destroy operations will fail reporting that the device is busy, in which case you can use the static method to create or destroy VFs. A short sketch follows this list.
    • # ldm create-vf <pf-name>
    • # ldm destroy-vf <vf-name>
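As a concrete sketch, reusing the on-board PF from the live migration example above; the pre-checks simply confirm that the PF's network device is not in use before creating the VF dynamically.

primary# ldm ls-io -l /SYS/MB/NET0/IOVNET.PF0   (note the device path shown in brackets)
primary# grep <device-path> /etc/path_to_inst   (map the path to a driver instance, for example igb0)
primary# dladm show-phys -L                     (map the driver instance to its netX link name)
primary# ipadm show-if                          (confirm that link is not plumbed, or is under IPMP/aggr)
primary# ldm create-vf /SYS/MB/NET0/IOVNET.PF0  (no delayed reconfiguration, no reboot)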

Dynamically add or remove VFs:

  • You can now dynamically add and remove VFs to and from a logical domain. All you need to run is the 'add-io' and 'remove-io' commands, without stopping the Guest domain; a short example follows this list.
    • # ldm add-io <vf-name>  <domain-name>
    • # ldm remove-io <vf-name> <domain-name>
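For example, adding a VF to the running guest ldg1 and confirming that the guest sees it (a sketch; the VF name follows the earlier example, and the guest-side device name depends on the NIC, for example igbvf for the on-board Intel 1GbE device):

primary# ldm add-io /SYS/MB/NET0/IOVNET.PF0.VF1 ldg1
ldg1# dladm show-phys                     (the new VF appears as an additional netX link)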

Troubleshooting:

  • The dynamic SR-IOV operations are disabled:
    • Check if the System Firmware that is released with OVM Server for SPARC 3.1 is installed on your system.
    • Check if the OVM Server for SPARC 3.1 LDoms manager is installed.
    • Check if the given domain has the required OS version supported. Check the OVM Server for SPARC 3.1 release notes for this information.
    • Check if IOV is enabled for the given PCIe bus, use "ldm ls-io <PCIe bus name>" to check this.
  • The create-vf or destroy-vf command failed to perform the operation dynamically:
    • Verify that the network device that maps to the PF is either not plumbed or is in a multipath configuration (aggr or IPMP).
      • You can obtain the device path for the PF using "ldm ls-io -l <pf-name>" and then map it to the corresponding device in the root domain by grepping for it in /etc/path_to_inst. Then use the "dladm show-phys" command to map that to the netX device name.
  • Dynamically removing a VF from a Guest domain fails:
    • Ensure the VF is either not in use or is part of a multipath configuration.

Virtual network performance greatly improved!

With the latest OVM Server for SPARC, virtual network performance has been greatly improved. We are now able to drive line rate (9.x Gbps) on a 10Gbps NIC and up to 16Gbps for Guest-to-Guest communication. These numbers are achieved with the standard MTU (1500), that is, there is no need to use Jumbo Frames to get this performance. This is made possible by introducing support for LSO in our networking stack. The following graphs are from a SPARC T5-2 platform, with 2 cores assigned to the Control domain and to each Guest domain.

LDoms Virtual network performance graphs

Note: In general, for any network device, the performance numbers depend on the type of workload; the above numbers were obtained with an iperf workload and a message size of 8k, with the interface configured with the standard MTU.
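For reference, a run similar to the one used for the graphs could look like the sketch below; the exact options behind the published numbers are not documented here, so treat the duration and stream setup as assumptions.

guest1# iperf -s                             (receiver in one guest domain)
guest2# iperf -c <guest1-ip> -l 8k -t 60     (sender: 8 KB messages over the vnet, standard MTU)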

These improvements are available in S11.1 SRU9; the latest SRU is always recommended. S10 patches with the same improvements will be available very soon. We highly recommend using S11.1 in the Service domains.

What do you need to get this performance?

  • Install S11.1 SRU9 or later in both the Service domain (the domain that hosts the LDoms vsw) and the Guest domains. It is important that both the Service domain and the Guest domains are updated to get this performance.
    • S10 patches with equivalent performance are also available. The S10 patch 150031-07 is required to be installed in the S10 domain(s). Please contact Oracle support for any additional information.
  • Update to the latest system Firmware that is available for your platform.
    • These performance numbers can be expected only on SPARC T4 and newer platforms.
  • Ensure that extended-mapin-space is set to on for both the Service domain and the Guest domains.
    • Note that the OVM Server for SPARC 3.1 software and the associated Firmware set extended-mapin-space to on by default, so this performance comes out of the box; in any case, confirm that it is set to on in all domains.
    • You can check this with the following command:

# ldm ls -l <domain-name> |grep extended-mapin
   extended-mapin-space=on
 
    • If extended-mapin-space is not set to on, you can set it with the following command. Note that changing extended-mapin-space triggers a delayed reconfiguration on the primary domain, which then requires a reboot, and Guest domains need to be stopped for the change to take effect; a sketch follows this list.
# ldm set-domain extended-mapin-space=on <domain-name>

  • Ensure there are sufficient CPU and memory resources assigned to both the Service domain and the Guest domains. To drive 10Gbps or higher performance, a Guest domain needs to be configured appropriately; we recommend 2 CPU cores or more and 4GB or more memory for each Guest domain. As the Service domain is also involved in proxying the data for the Guest domains, it is very important to assign sufficient CPU and memory resources there as well; we recommend 2 CPU cores or more and 4GB or more memory for the Service domain.
  • No Jumbo Frames configuration is required, that is, this performance improvement is available with the standard MTU (1500) as well. We introduced support for LSO to be able to optimize performance for the standard MTU. In fact, we recommend avoiding Jumbo Frames unless you have a specific need.
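Returning to the extended-mapin-space setting mentioned above, applying the change to a Guest domain could look like the following sketch (ldg1 is a placeholder name); the primary domain uses the same set-domain command but goes through a delayed reconfiguration and a reboot instead of a stop/start.

primary# ldm stop ldg1
primary# ldm set-domain extended-mapin-space=on ldg1
primary# ldm start ldg1
primary# ldm ls -l ldg1 | grep extended-mapin      (confirm the new value)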

Saturday Jun 09, 2012

Direct IO enhancements in OVM Server for SPARC 2.2 (a.k.a. LDoms2.2)

The Direct I/O feature has been available for LDoms customers since LDoms2.0. Apart from the latest SR-IOV feature in LDoms2.2, it is worth noting a few enhancements to the Direct I/O feature. These are:

  • Support for Metis-Q and Metis-E cards.
    • These cards were highly requested and are worth mentioning because they are the only combo cards containing both Fibre Channel and Ethernet on the same card. With this support, a customer can have both SAN storage and network access with just one card and one PCIe slot assigned to a logical domain. This reduces cost and helps when a platform has fewer slots.
    • The following are the part numbers for these cards. I have tried to list the platforms on which each card is supported, but this information can quickly become outdated; the accurate information can be found in the Support Document.
 Card Name                                                                      Part Number         Platforms
 Metis-Q: StorageTek Dual 8Gb Fibre Channel Dual GbE ExpressModule HBA, QLogic  SG-XPCIEFCGBE-Q8-N  SPARC T3-4, T4-4
 Metis-E: StorageTek Dual 8Gb Fibre Channel Dual GbE ExpressModule HBA, Emulex  SG-XPCIEFCGBE-E8-N  SPARC T3-4, T4-4
  • Additional cards added to the portfolio of supported cards. These are mainly Powerville-based Ethernet cards; the part numbers are below:
 Part Number  Description
 7100477 Sun Quad Port GbE PCI Express 2.0 Low Profile Adapter, UTP
 7100481 Sun Dual Port GbE PCI Express 2.0 Low Profile Adapter, MMF
 7100483 Sun Quad Port GbE PCI Express 2.0 ExpressModule, UTP
 7110486 Sun Quad Port GbE PCI Express 2.0 ExpressModule, MMF   

Note: The Direct I/O feature has a hard dependency on the Root domain (the PCIe bus owner, here the Primary domain). That is, rebooting the Root domain for any reason may impact the logical domains that have PCIe slots assigned via the Direct I/O feature, so rebooting a root domain needs to be carefully managed. Also apply the failure-policy settings, as described in the admin guide and release notes and sketched below, to deal with unexpected cases.
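A sketch of those dependency settings, with ldg1 as a placeholder for a domain that owns a Direct I/O slot from the primary domain's bus; consult the admin guide for the exact policy that fits your deployment.

primary# ldm set-domain failure-policy=stop primary     (what happens to dependent domains if the primary fails)
primary# ldm set-domain master=primary ldg1             (declare ldg1 as dependent on the primary domain)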

Thursday May 24, 2012

SR-IOV feature in OVM Server for SPARC 2.2

One of the main features of OVM Server for SPARC (a.k.a. LDoms) 2.2 is SR-IOV support. This blog post is intended to help you understand the SR-IOV feature in LDoms a little better.

What is SR-IOV?

SR-IOV is an abbreviation for Single Root I/O Virtualization. It is a PCI-SIG standards based I/O virtualization technology that enables a PCIe function known as the Physical Function (PF) to create multiple lightweight PCIe functions known as Virtual Functions (VFs). After they are created, VFs show up and operate like regular PCIe functions. The address space of a VF is well contained so that the VF can be assigned to a Virtual Machine (a logical domain) with the help of the Hypervisor. SR-IOV provides a finer granularity of sharing compared to the other forms of direct hardware access available in LDoms technology, namely PCIe bus assignment and Direct I/O. A few important things to understand about the PF and VFs are:

  • A VF's configuration space provides access to registers needed to perform I/O only, that is, only access to DMA channels and related registers.
  • Common hardware-related configuration changes can only be performed via the PF, so a VF driver needs to contact the PF driver to perform the change on behalf of the VF. The PF driver owns the responsibility of ensuring that a VF does not impact other VFs or the PF in any way.

More details of SR-IOV can be found at the PCI-SIG website: PCI-SIG Single Root I/O Virtualization

What are the benefits of SR-IOV Virtual Functions?

  • Bare-metal-like performance.
    • None of the CPU overhead and latency issues that are seen with Virtual I/O.
  • Throughput that is limited only by the number of VFs from the same device that are actively performing I/O.
    • There is no limitation on throughput due to implementation constraints such as those that exist in Virtual I/O.
    • At a given time, if only one VF is performing I/O, it can potentially utilize the entire available bandwidth.
    • When multiple VFs are performing I/O, the bandwidth allocation depends on how the SR-IOV card hardware allocates bandwidth to VFs. The devices supported in LDoms2.2 apply a round-robin type of policy, which distributes the available bandwidth equally among all VFs that are performing I/O.
 

In summary, a logical domain with an application that requires bare-metal-like I/O performance is the best candidate for SR-IOV. Before assigning an SR-IOV Virtual Function to a logical domain, it is important to understand the limitations that come along with it; see below for more details.

 

LDoms2.2 SR-IOV Limitations:

Understand the following limitations and plan ahead on how you would deal with them in your deployment.

  • The migration feature is disabled for a logical domain that has a VF assigned to it.
    • For all practical purposes, a VF looks like a physical device in a logical domain. This brings all the limitations of having a physical device in a logical domain.
  • Hard dependency on the Root domain (the domain in which the PF device exists).
    • In LDoms2.2, the Primary domain is the only root domain that is supported. That is, rebooting the Primary domain will impact the logical domains that have a VF assigned to them; the behavior is unpredictable but the common expectation is an OS panic.
    • Caution: Prior to rebooting the Primary domain, ensure that all logical domains that have a VF assigned to them are properly shut down. See the LDoms2.2 admin guide about how to set up a failure-policy to handle unplanned cases.
  • The Primary domain is the only root domain supported. That is, SR-IOV is supported only for SR-IOV cards on a PCIe bus owned by the Primary domain.
    • If a PCIe bus is assigned to another logical domain, typically to create failover configs, then SR-IOV support for the cards on that bus is disabled. You will not see the Physical Functions from those cards.

What hardware is needed?

The following details may help you plan what hardware is needed to use the LDoms SR-IOV feature.

  • The SR-IOV feature is supported only on platforms based on the SPARC T3 processor and beyond. This feature is not available on UltraSPARC T2 and T2+ platforms.
  • At release time, LDoms2.2 supports two SR-IOV devices. These are:
    • On-board SR-IOV Ethernet devices. T3 and T4 platforms have Intel Gigabit SR-IOV capable Ethernet devices on the motherboard, so you already have a device available in your system to explore this technology.
    • The Intel 10Gbps SR-IOV card with part numbers (X)1109A-Z, (X)1110A-Z and (X)4871A-Z. See the LDoms2.2 Release Notes for accurate information.
NOTE: Make sure to update Fcode firmware on these cards to ensure all features work as expected. See LDoms2.2 release notes for details on where and how to update the card's firmware.

What software is needed?

The following are the software requirements, see the LDoms2.2 release notes and admin guide for more details.

  • LDoms2.2 Firmware and LDoms manager. See the LDoms release notes for the Firmware versions for your platform.
  • The SR-IOV feature requires major Solaris framework support and PF drivers in the Root domain. At this time, SR-IOV feature support is available only in Solaris 11 + SRU7 or later, so ensure the Primary domain runs Solaris 11 + SRU7 or later.
  • Guest domains can run either Solaris 10 or Solaris 11. If Solaris 10 is used, ensure you have Update 9 or Update 10 with the VF driver patches installed; see the LDoms2.2 release notes for the patch numbers. If Solaris 11 is used, ensure you have SRU7 or later installed.

 

References: LDoms 2.2 documentation 

 

How to create and assign SR-IOV VFs to logical domains?

This is an example that shows how to create 4 VFs from an on-board SR-IOV PF device and assign them to 4 logical domains on a T4-1 platform. The following diagram shows the end result of this example.

Example Showing SR-IOV VFs assigned to 4 Logical domains

Step1:

Run the 'ldm list-io' command to see all available PF devices. Note that the name of a PF device includes details about which slot the PF is located in. For example, a PF named /SYS/MB/RISER1/PCIE4/IOVNET.PF0 is present in the slot labeled PCIE4.

Primary# ldm ls-io
NAME                                        TYPE   DOMAIN   STATUS
----                                        ----   ------   ------
pci_0                                       BUS    primary
niu_0                                       NIU    primary
/SYS/MB/RISER0/PCIE0                        PCIE   -        EMP
/SYS/MB/RISER1/PCIE1                        PCIE   -        EMP
/SYS/MB/RISER2/PCIE2                        PCIE   -        EMP
/SYS/MB/RISER0/PCIE3                        PCIE   -        EMP
/SYS/MB/RISER1/PCIE4                        PCIE   primary  OCC
/SYS/MB/RISER2/PCIE5                        PCIE   primary  OCC
/SYS/MB/SASHBA0                             PCIE   primary  OCC
/SYS/MB/SASHBA1                             PCIE   primary  OCC
/SYS/MB/NET0                                PCIE   primary  OCC
/SYS/MB/NET2                                PCIE   primary  OCC
/SYS/MB/RISER1/PCIE4/IOVNET.PF0             PF     -
/SYS/MB/RISER1/PCIE4/IOVNET.PF1             PF     -
/SYS/MB/RISER2/PCIE5/P0/P2/IOVNET.PF0       PF     -
/SYS/MB/RISER2/PCIE5/P0/P2/IOVNET.PF1       PF     -
/SYS/MB/RISER2/PCIE5/P0/P4/IOVNET.PF0       PF     -
/SYS/MB/RISER2/PCIE5/P0/P4/IOVNET.PF1       PF     -
/SYS/MB/NET0/IOVNET.PF0                     PF     -
/SYS/MB/NET0/IOVNET.PF1                     PF     -
/SYS/MB/NET2/IOVNET.PF0                     PF     -
/SYS/MB/NET2/IOVNET.PF1                     PF     -
Primary#

Step2:

Let's use the PF named /SYS/MB/NET0/IOVNET.PF0 for this example. The PF name contains NET0, which indicates this is an on-board device. Using the -l option we can find additional details, such as the path of the device and the maximum number of VFs it supports. This device supports up to a maximum of 7 VFs.

Primary# ldm ls-io -l /SYS/MB/NET0/IOVNET.PF0
NAME                        TYPE   DOMAIN   STATUS
----                        ----   ------   ------
/SYS/MB/NET0/IOVNET.PF0     PF     -
[pci@400/pci@2/pci@0/pci@6/network@0]
    maxvfs = 7
Primary#

In the root domain, we can find the network device that maps to this PF by searching for the matching path in the /etc/path_to_inst file. This device maps to igb0.

Primary# grep pci@400/pci@2/pci@0/pci@6/network@0 /etc/path_to_inst
"/pci@400/pci@2/pci@0/pci@6/network@0" 0 "igb"
"/pci@400/pci@2/pci@0/pci@6/network@0,1" 1 "igb"
Primary#

In Solaris 11, auto vanity naming generates generic link names; you can find the link name for the device using the following command. You can see that igb0 maps to net0, so we are really using the net0 device in the Primary domain.

Primary# dladm show-phys -L
LINK     DEVICE    LOC
net0     igb0      /SYS/MB
net1     igb1      /SYS/MB
net2     igb2      /SYS/MB
net3     igb3      /SYS/MB
net4     ixgbe0    PCIE4
net5     ixgbe1    PCIE4
net6     igb4      PCIE5
net7     igb5      PCIE5
net8     igb6      PCIE5
net9     igb7      PCIE5
net10    vsw0      --
net11    usbecm2   --
Primary#

Step3:

Create 4 VFs on the PF /SYS/MB/NET0/IOVNET.PF0 using the create-vf command. Note that creating VFs in the LDoms2.2 release requires a reboot of the root domain; we can create multiple VFs and reboot only once.

NOTE: Because this operation requires a reboot, plan ahead how many VFs you would like to create and create them in advance. You might be tempted to create the maximum number of VFs and use them later, but this may not be a good idea with devices that support a large number of VFs. For example, the Intel 10Gbps SR-IOV device supported in this release supports up to a maximum of 63 VFs, but T3 and T4 platforms can only support a maximum of 15 I/O domains per PCIe bus. So, creating more than 15 VFs on the same PCIe bus requires planning how you will use them; typically you may have to assign multiple VFs to a domain, as only 15 I/O domains are supported per PCIe bus.

Primary# ldm create-vf /SYS/MB/NET0/IOVNET.PF0
Initiating a delayed reconfiguration operation on the primary domain.
All configuration changes for other domains are disabled until the primary
domain reboots, at which time the new configuration for the primary domain
will also take effect.
Created new VF: /SYS/MB/NET0/IOVNET.PF0.VF0
Primary# ldm create-vf /SYS/MB/NET0/IOVNET.PF0
------------------------------------------------------------------------------
Notice: The primary domain is in the process of a delayed reconfiguration.
Any changes made to the primary domain will only take effect after it reboots.
------------------------------------------------------------------------------
Created new VF: /SYS/MB/NET0/IOVNET.PF0.VF1
Primary# ldm create-vf /SYS/MB/NET0/IOVNET.PF0
------------------------------------------------------------------------------
Notice: The primary domain is in the process of a delayed reconfiguration.
Any changes made to the primary domain will only take effect after it reboots.
------------------------------------------------------------------------------
Created new VF: /SYS/MB/NET0/IOVNET.PF0.VF2
Primary# ldm create-vf /SYS/MB/NET0/IOVNET.PF0
------------------------------------------------------------------------------
Notice: The primary domain is in the process of a delayed reconfiguration.
Any changes made to the primary domain will only take effect after it reboots.
------------------------------------------------------------------------------
Created new VF: /SYS/MB/NET0/IOVNET.PF0.VF3
Primary#

Step4:

Reboot the Primary domain. Caution: If there are any I/O domains that have PCIe slots or VFs assigned to them, shut down those logical domains before rebooting the Primary domain.

Step5:

Once the Primary domain has rebooted, the VFs are available to assign to other logical domains. Use the list-io command to see the VFs (they appear at the end of the listing) and then assign them to I/O domains. If the list is long, you can use the PF name as the argument to limit the listing to VFs from that PF only.

Primary# ldm ls-io
NAME                                        TYPE   DOMAIN   STATUS
----                                        ----   ------   ------
pci_0                                       BUS    primary
niu_0                                       NIU    primary
/SYS/MB/RISER0/PCIE0                        PCIE   -        EMP
/SYS/MB/RISER1/PCIE1                        PCIE   -        EMP
/SYS/MB/RISER2/PCIE2                        PCIE   -        EMP
/SYS/MB/RISER0/PCIE3                        PCIE   -        EMP
/SYS/MB/RISER1/PCIE4                        PCIE   primary  OCC
/SYS/MB/RISER2/PCIE5                        PCIE   primary  OCC
/SYS/MB/SASHBA0                             PCIE   primary  OCC
/SYS/MB/SASHBA1                             PCIE   primary  OCC
/SYS/MB/NET0                                PCIE   primary  OCC
/SYS/MB/NET2                                PCIE   primary  OCC
/SYS/MB/RISER1/PCIE4/IOVNET.PF0             PF     -
/SYS/MB/RISER1/PCIE4/IOVNET.PF1             PF     -
/SYS/MB/RISER2/PCIE5/P0/P2/IOVNET.PF0       PF     -
/SYS/MB/RISER2/PCIE5/P0/P2/IOVNET.PF1       PF     -
/SYS/MB/RISER2/PCIE5/P0/P4/IOVNET.PF0       PF     -
/SYS/MB/RISER2/PCIE5/P0/P4/IOVNET.PF1       PF     -
/SYS/MB/NET0/IOVNET.PF0                     PF     -
/SYS/MB/NET0/IOVNET.PF1                     PF     -
/SYS/MB/NET2/IOVNET.PF0                     PF     -
/SYS/MB/NET2/IOVNET.PF1                     PF     -
/SYS/MB/NET0/IOVNET.PF0.VF0                 VF
/SYS/MB/NET0/IOVNET.PF0.VF1                 VF
/SYS/MB/NET0/IOVNET.PF0.VF2                 VF
/SYS/MB/NET0/IOVNET.PF0.VF3                 VF
Primary#

Step6:

Assign each VF to a logical domain using the 'add-io' command.

NOTE: LDoms2.2 requires the logical domain to which a VF is being assigned to be stopped. So, if the logical domains to which the VFs need to be assigned are running, stop them first and then assign the VFs.

Primary# ldm add-io /SYS/MB/NET0/IOVNET.PF0.VF0 ldg0
Primary# ldm add-io /SYS/MB/NET0/IOVNET.PF0.VF1 ldg1
Primary# ldm add-io /SYS/MB/NET0/IOVNET.PF0.VF2 ldg2
Primary# ldm add-io /SYS/MB/NET0/IOVNET.PF0.VF3 ldg3

Step7:

Start the logical domains to use the VFs in them. You can start each domain individually or start all logical domains with 'ldm start -a'. NOTE: A VF device can also be used to boot over the network at the OBP prompt.

Primary# ldm start ldg0
LDom ldg0 started
Primary# ldm start ldg1
LDom ldg1 started
Primary# ldm start ldg2
LDom ldg2 started
Primary# ldm start ldg3
LDom ldg3 started

Step8:

Log in to the guest domain and configure the VF device for use. The VF device appears like any other physical NIC device; you can only distinguish it by the device name using Solaris commands. The following commands show a VF in the logical domain 'ldg0' running Solaris 11 and configure it for use with DHCP.

ldg0# dladm show-phys
LINK     MEDIA      STATE    SPEED  DUPLEX   DEVICE
net0     Ethernet   unknown  0      unknown  igbvf0
ldg0#
ldg0# ipadm create-ip net0
ldg0# ipadm create-addr -T dhcp net0/dhcp
ldg0# ifconfig net0
net0: flags=1004843<UP,BROADCAST,RUNNING,MULTICAST,DHCP,IPv4> mtu 1500 index 3
        inet 10.129.241.141 netmask ffffff00 broadcast 10.129.241.255
        ether 0:14:4f:f9:48:69
ldg0#

After this, the VF device can be used like any other network device for all applications, without any of the latency or performance issues seen with Virtual I/O.

Saturday Dec 10, 2011

LDoms networking in Solaris 11

Since Oracle Solaris 11 is officially released now, I thought I would try to explain how LDoms networking is integrated with all the networking enhancements in S11, mainly with project Crossbow. The network stack for Oracle Solaris 11 has been substantially re-architected in an effort known as project Crossbow. One of the main goals of Crossbow is to virtualize physical NICs into Virtual NICs (VNICs) to provide more effective sharing of networking resources. The VNIC feature allows dividing a physical NIC into multiple virtual interfaces to provide independent network stacks for applications.

LDoms networking in Oracle Solaris 11 has been re-designed along with Crossbow to utilize the underlying infrastructure enhancements it provides. The following is a high-level view of how the LDoms virtual switch in an S11 service domain and the LDoms virtual network device in an S11 Guest domain fit together. The diagram also shows an example of an S10 Guest domain, which is fully compatible with an S11 Service domain.

High-level view of LDoms networking in Solaris11

The LDoms virtual switch in a Solaris 11 service domain is now re-designed to be layered on top of the Crossbow MAC layer, at the same level as a VNIC. The actual virtual switching is now done at the Crossbow MAC layer, so the LDoms virtual switch is fully compatible with VNICs on the same physical NIC. Note that there is no longer a specific requirement to plumb the LDoms vsw in the service domain in order to communicate with Guest domains. The LDoms virtual network device driver has also been re-designed to exploit various features such as rings, polling, etc.

Highlights: 

  • All existing LDoms networking features are fully supported with Solaris11 in both Service domain and Guest domains, this includes:
    • VLANs feature
    • Link based failure detection support
    • NIU Hybrid I/O
    • Jumbo Frames
    • Link Aggregation device as an LDoms virtual switch backend device.
  • Guest domains running both Solaris10 and Solaris11 are fully compatible with a Solaris11 Service domain.
  • A Guest domain running Solaris11 is fully compatible with a service domain running Solaris10 or Solaris11.
  • The existing LDoms configuration continues to work even if the existing Service domain or Guest domain is re-installed with Solaris11. That is, no need to re-create LDoms configuration.
  • The Crossbow VNIC features such as b/w limit and link priorities are not available for LDoms virtual network(vnet) devices.
  • Creation of VNICs on top of LDoms vsw or vnet devices is not supported.
  • Solaris11 introduces a new IPMP failure detection mechanism known as transitive probing, which helps avoid the requirement for test IP addresses. That is, virtualization customers can now use transitive probing to detect network failures without worrying about needing a large number of test IP addresses.
  • Solaris11 has a feature known as Auto Vanity Naming that generates simple names such as net0 for all network devices. When you are creating an LDoms virtual switch, you can use either the vanity name or the actual physical NIC interface name for the net-dev option. The vanity name is preferred, but make sure you are using the right network device.

Known issues: 

  • CR 7087781: First Boot After Adding a Virtual Switch to a service domain may hang. See the Solaris11 Release notes at the following URL for more details and the workaround.
    • http://docs.oracle.com/cd/E23824_01/html/E23811/glmdq.html#gloes
  • Creation of VNICs on LDoms vnet and vsw devices may succeed without the command failing, but the VNICs on the vnet or vsw won't communicate. Note that VNICs on top of LDoms vsw and vnet are not supported.
    • Zone creation on Solaris11 may auto-create a VNIC on the LDoms vnet device, which won't function. As a workaround, create a vnet device for each zone in the guest domain and explicitly assign that vnet device to the zone, as sketched below. If the deployment requires a large number of vnets, you may choose to disable the inter-vnet-link feature in LDoms to save LDC resources, thereby gaining the ability to create many more vnets or other virtual devices. NOTE: the ability to disable inter-vnet links was introduced in LDoms2.1.
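A sketch of that workaround, with placeholder names: add a dedicated vnet for the zone from the control domain, then assign that link to the zone in place of the automatically created anet.

primary# ldm add-vnet zone1-vnet0 primary-vsw0 ldg1     (one vnet per zone in the guest domain ldg1)
ldg1# zonecfg -z zone1
zonecfg:zone1> remove anet                              (drop the auto-created VNIC)
zonecfg:zone1> add net
zonecfg:zone1:net> set physical=net2                    (the netX name of the new vnet inside ldg1)
zonecfg:zone1:net> end
zonecfg:zone1> commit
zonecfg:zone1> exit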



Wednesday Dec 24, 2008

NIU Hybrid I/O

The NIU Hybrid I/O feature is one of the significant network features released with LDoms1.1. NIU Hybrid I/O provides very high bandwidth with low latency directly to a Guest domain. This technology is currently limited to UltraSPARC-T2 (Niagara-II) CPU based platforms only. More details below:

Hybrid I/O model explained:

The Hybrid I/O model combines direct (or physical) and virtualized I/O to allow flexible deployment of I/O resources to virtual machines (or Guest domains). It is particularly useful when direct I/O does not provide full I/O functionality for a Virtual Machine (VM), or when direct I/O may not be persistently or consistently available to the VM (due to resource availability or VM migration). The hybrid I/O architecture is well suited to Niagara II's Network Interface Unit (NIU), a network I/O interface integrated on chip. It allows the dynamic assignment of DMA resources to virtual networking devices, thereby providing consistent performance to applications in the domain.

Hybrid I/O and Virtualized I/O differences:

In Virtualized I/O, network packets flow through another domain (the Service domain), which sends and receives packets to and from a physical NIC via direct I/O. That is, in the case of LDoms networking, packets are sent to and received from the vswitch in a Service domain, and the vswitch in turn sends and receives packets via the physical NIC. The Virtualized I/O model is flexible, but performance suffers due to the overhead of having another domain in between performing the direct I/O to and from the physical device.

In Hybrid I/O, packets are sent and received directly via physical DMA resources, while virtualized I/O is used for broadcast and multicast packets. That is, all unicast packets to and from the external network are sent and received directly via the DMA resources, while broadcast and multicast packets go through the vswitch in the Service domain. Hybrid I/O provides high bandwidth with low latency, and it is also flexible since the assignment of DMA resources is dynamic, but it is a more complex solution.

Hybrid I/O specific details of NIU:

  • The Network Interface Unit (NIU) is on chip on the UltraSPARC-T2 CPU.
  • Currently, two platforms, the T5120 and T5220, ship with the UltraSPARC-T2 CPU.
  • The NIU has support for 2 ports. Both the T5120 and T5220 have two slots for the NIU, that is, one slot for each port. XAUI adapters need to be installed in order to gain access to the NIU ports.
  • Each NIU port can share up to 3 Hybrid mode resources. That is, a total of 6 Hybrid resources are available for virtual network devices, which means 6 virtual network devices can have a hybrid resource assigned to them.

The following diagram shows the internal blocks of NIU Hybrid I/O for two Guest domains with a Hybrid resource assigned.

Hybrid I/O diagram

  • NIU -- Network interface Unit
  • RX/TX -- A pair of Receive and Transmit DMA channels
  • VR -- Virtual Region where DMA resources are grouped and assigned to a Guest domain.
  • HIO -- Hybrid I/O
  • VIO -- Virtualized I/O
  • Hybrid nxge -- Nxge driver operating in Hybrid mode
  • Vnet -- Virtual network device
  • Virtual switch -- A vswitch that provides communication to vnet devices

Hybrid I/O CLIs:

Hybrid I/O for a vnet is enabled or disabled via the 'mode' option of the vnet CLIs. If the mode is set to 'hybrid', then Hybrid I/O is enabled for that vnet device; if not, it is disabled. By default, the 'hybrid' mode is disabled.

To enable hybrid mode, use either of the commands below:

 primary# ldm add-vnet mode=hybrid vnet0 primary-vsw0 ldg1
 primary# ldm set-vnet mode=hybrid vnet0 ldg1
To disable hybrid mode, use either of the commands below:
 primary# ldm add-vnet vnet0 primary-vsw0 ldg1
 primary# ldm set-vnet mode= vnet0 ldg1

Important Note: Setting "mode=hybrid" is treated as a hint or desired flag only. That means enabling hybrid mode doesn't guarantee the hybrid resource assignment; no specific checks are made when the hybrid mode is enabled for a vnet device. The actual assignment depends on many criteria, some of which are listed below.

    1. The Vswitch to which a vnet device is connected is backed by one of the ports of NIU.
    2. The system has the right level of Firmware, that is, a Firmware released via LDoms1.1 or later.
    3. Both the Service domain (where the vswitch exists) and the Guest domain (where the vnet exists) run Solaris 10 10/08 (a few patches may be required) or later.
    4. A free hybrid resource (VR + DMA channels) is available. Note, each port of an NIU has only 3 hybrid resources available.
    5. The vnet interface in the Guest domain is plumbed.

Important points to note about NIU Hybrid I/O:

  1. Today, the Hybrid I/O feature is implemented only for the NIU, which is on chip in the UltraSPARC-T2 CPU. That means this feature is available on all platforms that have the Niagara II CPU. Note, this is not a feature of the UltraSPARC-T2+ CPU, which doesn't have an on-chip NIU.
  2. Note, the nxge driver is used for the NIU and the same driver is used for other NICs such as the Neptune PCIe card. So, when the system has a mix of these devices, it is important to identify which nxge interfaces belong to the NIU ports.
  3. This technology provides direct DMA of unicast packets only, that means, the performance improvement is only seen for unicast packets. The broadcast, multicast type of applications may not benefit from this feature.
  4. The hybrid resource (shown as VR(NIU) above) is assigned only when a vnet is plumbed. That means DMA resources are not wasted on a virtual network device that is not currently active.
  5. OBP doesn't use Hybrid I/O, which means boot-net type traffic in OBP doesn't benefit from Hybrid I/O.
  6. Guest Domain to Guest Domain communication in the same system does not use Hybrid I/O.

Troubleshooting tips:

  1. Hybrid resource not assigned to a vnet device?
    • Identify the vswitch to which the vnet device connects and find the assigned net-dev. It should be an nxge device. Verify that the nxge device is really from the NIU, and not from another nxge supported device, as below. This grep needs to be run on the Service domain (where the vswitch is created). If you see '/niu@80' in front of the nxge instance you are using, then it is the right device to use.
      • Service# grep nxge /etc/path_to_inst 
        "/niu@80/network@0" 0 "nxge"
        "/niu@80/network@1" 1 "nxge"
        
    • Verify that the required Firmware and Software are installed on the system.
    • Verify that the mode is set to 'hybrid' for that vnet.
    • Verify that the vnet interface is plumbed in the Guest domain.
    • Verify that no more than 3 vnet devices per vswitch have 'hybrid' mode enabled.
  2. How do I confirm that a Hybrid resource is assigned to a vnet device?
    • Run "kstat nxge" in the Guest domain. If you see any nxge kstats, that indicates a hybrid-mode nxge instance has been created. If you have multiple vnet devices with hybrid mode enabled in the same Guest domain, verify that you see that many nxge instances in the "kstat nxge" output.
    • The next Solaris 10 update release will have Hybrid I/O kstats to make this more meaningful and easier to monitor.
  3. How do I verify that unicast traffic is really going through the hybrid resource (nxge)?
    • The nxge kstats show the packets going through the hybrid nxge device.
      # kstat -p nxge:\*:\*:\*packets 
      nxge:0:RDC Channel 0 Stats:rdc_packets  20
      nxge:0:RDC Channel 1 Stats:rdc_packets  30
      nxge:0:TDC Channel 0 Stats:tdc_packets  20
      nxge:0:TDC Channel 1 Stats:tdc_packets  20
      • tdc_packets – packets going via TX DMA channel
        rdc_packets – packets received by the RX DMA channel