Tuesday Jul 07, 2015

PV IPoIB in Kernel Zones in Solaris 11.3

The paravirtualization of IP over InfiniBand (IPoIB) in kernel zones is a 
new feature in Solaris 11.3 that enhances the network virtualization offering in Solaris.
It allows existing IP applications in the guest to run over InfiniBand 
fabrics. Features such as kernel zone live migration and IPMP are supported 
with the paravirtualized IPoIB datalinks, making them an appealing option.

Moreover, the device management of these guest datalinks is similar to that of 
their Ethernet counterparts, making them straightforward to configure and manage. 
In the host, zonecfg is used to configure the kernel zone's automatic network 
interface (anet): you select the datalink of the IB HCA port to paravirtualize 
and assign it as the lower-link, set the Partition Key (P_Key) within the IB 
fabric, and optionally choose the link mode, which can be either IPoIB-CM or IPoIB-UD.

The PV IPoIB datalink is a front-end guest driver emulating an IPoIB VNIC 
in the host, created over a physical IB partition datalink per P_Key and port.

Configuring a PV IPoIB datalink in a kernel zone is fairly simple. Here is an 
example.

1. Find the IB datalink in the host to paravirtualize. 

I am selecting net7 for this example.

# ibadm
HCA             TYPE      STATE     IOV    ZONE
hermon0         physical  online    off    global

# dladm show-ib
LINK      HCAGUID        PORTGUID       PORT STATE   GWNAME       GWPORT   PKEYS
net5      21280001A0D220 21280001A0D222 2    up      --           --       8001,FFFF
net7      21280001A0D220 21280001A0D221 1    up      --           --       8001,FFFF
# dladm show-phys
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net0              Ethernet             up         1000   full      igb0
net2              Ethernet             unknown    0      unknown   igb2
net3              Ethernet             unknown    0      unknown   igb3
net1              Ethernet             unknown    0      unknown   igb1
net4              Ethernet             up         10     full      usbecm0
net5              Infiniband           up         32000  full      ibp1
net7              Infiniband           up         32000  full      ibp0

2. Create an IPoIB PV datalink in the kernel zone.
To add an IPoIB PV interface to a kernel zone, say tzone1, add an anet 
resource with zonecfg and specify the mandatory lower-link and pkey properties. 
If the link mode is not specified, IPoIB-CM is the default.

# zonecfg -z tzone1
    zonecfg:tzone1> add anet
    zonecfg:tzone1:anet> set lower-link=net7
    zonecfg:tzone1:anet> set pkey=0xffff
    zonecfg:tzone1:anet> info
    anet 1:
        lower-link: net7
        pkey: 0xffff
        linkmode not specified
        evs not specified
        vport not specified
        iov: off
        lro: auto
        id: 1
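
The default link mode is IPoIB-CM. To use datagram mode instead, the anet's 
linkmode property can be set explicitly. A minimal sketch, assuming the property 
accepts the values cm (IPoIB-CM) and ud (IPoIB-UD):

# zonecfg -z tzone1
    zonecfg:tzone1> select anet id=1
    zonecfg:tzone1:anet> set linkmode=ud
    zonecfg:tzone1:anet> end
    zonecfg:tzone1> commit
    zonecfg:tzone1> exit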

3. Add additional IPoIB PV datalinks to the kernel zone.
Additional IPoIB PV interfaces, each with a lower-link and pkey, can be added 
to the kernel zone as shown above. These datalinks can also be used exclusively 
to host native zones within the kernel zone.
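
For instance, a second anet could be configured over the other P_Key (0x8001) 
listed in the dladm show-ib output above; this is only a sketch, and the choice 
of P_Key is illustrative:

# zonecfg -z tzone1
    zonecfg:tzone1> add anet
    zonecfg:tzone1:anet> set lower-link=net7
    zonecfg:tzone1:anet> set pkey=0x8001
    zonecfg:tzone1:anet> end
    zonecfg:tzone1> commit
    zonecfg:tzone1> exit

Inside the kernel zone, the resulting datalink can then be delegated to a native 
zone in the usual way, for example by setting it as the physical link of the 
native zone's net resource.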

4. The PV IPoIB datalinks appear within the kernel zone on boot.

root@tzone1:~# dladm 
LINK                CLASS     MTU    STATE    OVER
net1                phys      65520  up       --
net0                phys      65520  up       --

root@tzone1:~# ipadm
NAME              CLASS/TYPE STATE        UNDER      ADDR
lo0               loopback   ok           --         --
   lo0/v4         static     ok           --
   lo0/v6         static     ok           --         ::1/128
net0              ip         ok           --         --
   net0/v4        static     ok           --
net1              ip         ok           --         --
   net1/v4        static     ok           --

Virtual NICs (VNICs) tzone1/net0 and tzone1/net1 are created in the
host kernel; they are the backend of the PV interfaces.

# dladm show-vnic
LINK            OVER           SPEED  MACADDRESS        MACADDRTYPE IDS
tzone1/net1     net7           32000  80:0:0:4d:fe:..   fixed       PKEY:0xffff
tzone1/net0     net7           32000  80:0:0:4e:fe:..   fixed       PKEY:0xffff

Thursday Aug 14, 2014

VXLAN in Solaris 11.2

What is a VXLAN?

VXLAN, or Virtual eXtensible LAN, is essentially a tunneling mechanism used to provide isolated virtual Layer 2 (L2) segments that can span multiple physical L2 segments. Since it is a tunneling mechanism, it uses IP (IPv4 or IPv6) as its underlying network, which means we can have isolated virtual L2 segments over networks connected by IP. This allows Virtual Machines (VMs) to be in the same L2 segment even if they are located on systems that are in different physical networks. Some of the benefits of VXLAN include:

  • Better use of resources, i.e. VMs can be provisioned, based on system load, on systems that span different geographies.
  • VMs can be moved across systems without having to reconfigure the underlying physical network.
  • Fewer MAC address collision issues, i.e. MAC addresses may collide as long as they are in different VXLAN segments.

Isolated L2 segments can also be provided by existing mechanisms such as VLANs, but VLANs don't scale: the number of VLANs is limited to 4094 (0 and 1 are reserved), whereas VXLAN's 24-bit VNI can provide up to 16 million (2^24) isolated L2 networks.

Additional details, including how the protocol works, can be found in the IETF VXLAN RFC. Note that Solaris uses the IANA-specified UDP port number of 4789 for VXLAN.

The following is a quick primer on administering VXLAN in Solaris 11.2 using the Solaris administrative utility dladm(1M). Solaris Elastic Virtual Switch (EVS) can be used to manage VXLAN deployment automatically in a cloud environment; that will be the subject of a future discussion.

The following describes how VXLANs are created on Solaris. In the examples below, IPx is an IP address (IPv4 or IPv6) and VNIs y and z are different VXLAN segments. VM1, VM2 and VM3 are guests with interfaces configured on VXLAN segments y and z. vxlan1 and vxlan2 are VXLAN links, represented by a new datalink class called vxlan.

Creating VXLANs

To begin with, we need to create VXLAN links in the segments that we want to use for guests; let's assume we want to create segments 100 and 101. Additionally, we want to create the VXLAN links on IP (remember, VXLANs are an overlay over IP networks), so we need the IP address over which to create the VXLAN links; let's assume our endpoint on this system is the address configured on net4, shown below.

# ipadm show-addr net4
ADDROBJ           TYPE     STATE        ADDR
net4/v4           static   ok

Create VXLAN segments 100 and 101 on this IP address.

# dladm create-vxlan -p addr=,vni=100 vxlan1 
# dladm create-vxlan -p addr=,vni=101 vxlan2    


  • In the above example we explicitly provide the IP address; however, you could also:
    • provide a prefix and prefixlen to use an IP address that matches them, e.g.:
# dladm create-vxlan -p addr=,vni=100 vxlan1
    • provide an interface (say net4 in our case) to pick an active address on that interface, e.g.:
# dladm create-vxlan -p interface=net4,vni=100 vxlan1
(you can't provide interface and addr together)

  • VXLAN links can be created on an IP address over any interface, including an IPoIB link, but not over IPMP, loopback or VNI (Virtual Network Interface) links.
  • The IP address may belong to a VLAN segment, as sketched below.
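
For example, the VXLAN endpoint could be an address configured on a VLAN link. A sketch, in which the VLAN ID, the address and the link names vlan20/vxlan3 are purely illustrative:

# dladm create-vlan -l net4 -v 20 vlan20
# ipadm create-ip vlan20
# ipadm create-addr -T static -a 192.0.2.10/24 vlan20/v4
# dladm create-vxlan -p interface=vlan20,vni=102 vxlan3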

Displaying VXLANs

Check if we have our VXLAN links:

# dladm show-vxlan                                          
LINK                ADDR                     VNI   MGROUP
vxlan1                   100
vxlan2                   101

One thing we haven't talked about so far is the MGROUP column. Recall from the RFC that VXLAN links use IP multicast to carry broadcast traffic, so we can assign a multicast address to each VXLAN segment that we create. If we don't specify a multicast address, the all-hosts multicast address (or the all-nodes address for IPv6) is assigned to the VXLAN segment. In the above case, since we didn't specify a multicast address, both vxlan1 and vxlan2 will use the all-hosts multicast address.
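
If a dedicated multicast group is preferred over the all-hosts address, it can be supplied when the VXLAN link is created. A sketch, assuming the property is named mgroup (as the show-vxlan column suggests) and using an arbitrary administratively scoped group address:

# dladm create-vxlan -p interface=net4,vni=103,mgroup=239.1.1.103 vxlan4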

The VXLAN links created, vxlan1 and vxlan2, are just like other datalinks (physical, VNIC, VLAN, etc.) and can be displayed using 

# dladm show-link
LINK                CLASS     MTU    STATE    OVER
vxlan1              vxlan     1440   up       --
vxlan2              vxlan     1440   up       --

The STATE reflects the state of the VXLAN links, which is based on the status of the underlying IP address (the address on net4 in this case). Note that the MTU is reduced to account for the VXLAN encapsulation added to each packet on this VXLAN link.

Now that we have our VXLAN links, we can create virtual NICs (VNICs) over them. Note that the VXLAN links themselves are not active links, i.e. you can't plumb an IP address or create flows on them, but they can be snooped.

# dladm create-vnic -l vxlan1 vnic1
# dladm create-vnic -l vxlan1 vnic2
# dladm create-vnic -l vxlan2 vnic3
# dladm create-vnic -l vxlan2 vnic4

# dladm show-vnic
LINK                OVER              SPEED  MACADDRESS        MACADDRTYPE VIDS
vnic1               vxlan1            10000  2:8:20:d9:df:5f   random      0
vnic2               vxlan1            10000  2:8:20:72:9a:70   random      0
vnic3               vxlan2            10000  2:8:20:19:c7:14   random      0
vnic4               vxlan2            10000  2:8:20:88:98:6d   random      0

You can see from the above that the process of creating a VNIC on a VXLAN link is no different from creating one on any other link such as a physical link, aggregation or etherstub. This means that the VNICs created may belong to a VLAN, and properties (such as maxbw and priority) can be set on them.
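
For example, a VNIC over a VXLAN link can be given a VLAN ID at creation time, and link properties can be adjusted afterwards; the VNIC name, VID and bandwidth value below are just illustrative:

# dladm create-vnic -l vxlan1 -v 10 vnic5
# dladm set-linkprop -p maxbw=2000,priority=high vnic5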

Once created, these VNICs can be assigned explicitly to Solaris zones. Alternatively, the VXLAN links can be set as the lower-link when configuring anet (automatic VNIC) resources in Solaris zones.
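
A hedged sketch of both approaches, with hypothetical zone names: an existing VNIC assigned to a native zone, and a VXLAN link used as the lower-link of another zone's anet:

# zonecfg -z zone1
    zonecfg:zone1> add net
    zonecfg:zone1:net> set physical=vnic1
    zonecfg:zone1:net> end
    zonecfg:zone1> commit
    zonecfg:zone1> exit

# zonecfg -z zone2
    zonecfg:zone2> add anet
    zonecfg:zone2:anet> set lower-link=vxlan2
    zonecfg:zone2:anet> end
    zonecfg:zone2> commit
    zonecfg:zone2> exit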

For Logical Domains on SPARC, the virtual switch (add-vsw) can be created on the VXLAN device, which means the vnets created on the virtual switch will be part of the VXLAN segment.
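
A sketch of this on SPARC, with hypothetical virtual switch, vnet and domain names:

# ldm add-vsw net-dev=vxlan1 vxlan-vsw0 primary
# ldm add-vnet vnet1 vxlan-vsw0 ldg1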

Deleting VXLANs

A VXLAN link can be deleted once all the VNICs over it have been deleted. Thus, in our case:

# dladm delete-vnic vnic1   
# dladm delete-vnic vnic2 
# dladm delete-vnic vnic3     
# dladm delete-vnic vnic4  

# dladm delete-vxlan vxlan1
# dladm delete-vxlan vxlan2  

Additional Notes:
  • VXLAN for Solaris kernel zone and LDom guests is not supported with direct I/O.
  • Hardware capabilities such as checksum offload and LSO are not available for the encapsulated (inner) packet.
  • Some earlier implementations (e.g. Linux) might use a port number that predates the IANA assignment. If so, such implementations might have to be configured to use the IANA port number to interoperate with Solaris VXLAN.
  • IP multicast must be available in the underlying network, and if communicating across different IP subnets, multicast routing should be available as well.
  • Modifying properties (IP address, multicast address or VNI) on a VXLAN link is currently not supported; you'd have to delete the VXLAN and re-create it (see the sketch below).
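
For example, to change the VNI of vxlan1 (after any VNICs over it have been deleted), the link would be re-created; the new VNI here is arbitrary and the interface form from earlier is reused:

# dladm delete-vxlan vxlan1
# dladm create-vxlan -p interface=net4,vni=200 vxlan1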

Saturday Apr 11, 2009

Another OpenSolaris presentation from Community One

Nick Solter and Dave Miner (authors of OpenSolaris Bible) presented a session called Becoming an OpenSolaris Power User, which covers topics such as ZFS, DTrace and networking at Community One. Source: Nick Solter's blog.

