Crossbow - Network Virtualization Architecture Comes to Life

December 5th, 2008 was a joyous occasion and a humbling one at the same time. A vision created four years earlier was coming to life. I still remember the summer of 2004 when Sinyaw threw a challenge at me - can you change the world? It was the fall of that same year when I unveiled the first set of Crossbow slides to him and Fred Zlotnik over a bottle of wine. After a lot of planning we were finally ready to start, but there were still hurdles in the way. We were still trying to finish Nemo aka GLDv3, a high-performance device driver framework that was absolutely required for Crossbow (we needed absolute control over the hardware). Nemo finished in mid 2005, but then Nicolas, Yuzo and others left Sun for a startup, and Thiru was still trying to finish Yosemite (the FireEngine follow-on). So in short, 2005 was basically more planning and prototyping (especially controlling the Rx rings and dynamic polling) on my part. I think it was early 2006 when work began on Crossbow in earnest. Kais moved over from the security group, Nicolas was back at Sun, and Thiru, Eric Cheng, Mike Lim (and of course me) came together to form the core team (which later expanded to 20+ people in early 2008). So it was a long-standing dream and almost three years of hard work that finally came to life when Crossbow Phase 1 integrated into Nevada Build 105 (and will be available in the OpenSolaris 2009.06 release).

Crossbow - H/W Virtualized Lanes that Scale (10gigE over multiple cores)

One of the key tenets of the Crossbow design is the concept of hardware virtualization lanes: essentially tying a NIC receive and transmit ring, DMA channel, kernel threads, kernel queues, and processing CPUs together. There are no shared locks, counters or anything else. Each lane individually schedules its own packet processing by switching its Rx ring independently between interrupt mode and poll mode (dynamic polling). Now you can see why Nemo was so important: without it the stack couldn't control the hardware, and without Nemo the NIC vendors wouldn't have played along with us in adding the features we wanted (stateless classification, Rx/Tx rings, etc.). Once a lane is created, we can program the classifier to spread packets between the lanes based on IP addresses and ports for scaling. With multiple cores and multiple hardware threads being the way of life going forward, and 10+ gigE of bandwidth (soon we will have IPoIB working as well), scaling really matters - and we are not talking about achieving line rate on 10 gigE with jumbo grams; we are talking about the real world, a mix of small and large packets, tens of thousands of connections and 1000s of threads.
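
The spreading across lanes is programmed into the NIC classifier by the stack itself, but the same classification machinery is also exposed to administrators through Crossbow's flowadm utility. As a minimal sketch (the link name, flow name and port below are only illustrative), you can carve out a lane for a particular service like this:
# flowadm add-flow -l e1000g0 -a transport=tcp,local_port=443 https-flow
# flowadm show-flow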

To demonstrate the point, I captured a bunch of statistics while putting the final touches on the data path and getting ready to beat some world records. The table below shows mpstat output along with the packets per second serviced for an Intel Oplin (10gigE) NIC on a Niagara2-based system. The NIC has all 8 Rx/Tx rings enabled and 8 interrupts enabled (one for each Rx ring).
 CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
 38    0   0    6    21    3   31    1    5   12    0    86    0   0   0  99
 39    0   0 2563  5506 3907 3282   28   34 1170    0   178    0  21   0  78
 40    0   0 2553  5117 3948 2410   38  150 1192    0   504    1  21   0  77
 41    0   0 2651  5221 4232 2011   25   53 1195    0   210    0  20   0  80
 42    0   0 3078  5700 4743 2069   21   28 1285    0   125    0  22   0  78
 43    0   0 3280  5837 4777 2118   19   24 1328    0   101    0  22   0  78
 44    0   0 3143 19566 18801 1773  50   44 1285    0    68    0  65   0  35
 45    0   0 4570  7748 6838 1984   23   27 1697    0   118    0  29   0  71

# netstat -ia 1
    input   e1000g    output       input  (Total)    output
packets errs  packets errs  colls  packets errs  packets errs  colls 
4       0     1       0     0      61284   0     128820  0     0     
3       0     2       0     0      61015   0     129316  0     0     
4       0     2       0     0      60878   0     128922  0     0  

This link shows the interrupt binding, mpstat and intrstat output. You can see that the NIC is trying very hard to spread the load, but because the stack sees this as one NIC, there is one CPU (number 44) where all 8 threads collide. It's like an 8-lane highway becoming a single lane during rush hour.
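
These raw stats came from nothing more exotic than the standard observability tools (the one-second interval is simply what I used here):
# mpstat 1
# intrstat 1
# netstat -ia 1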

Now let's look at what happens when Crossbow enables a lane all the way up the stack for each Rx ring and also enables dynamic polling for each one individually. If you look at the corresponding mpstat and intrstat output and the packets-per-second rate, you will see that the lanes really do work independently of each other, resulting in almost linear spreading and a much higher packets-per-second rate. The benchmark represents a webserver workload and, needless to say, Crossbow with dynamic polling on a per-Rx-ring basis almost tripled the performance. The raw stats can be seen here.
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys wt idl
 37    0   0 2507 11906 10272 4267  265  326  489    0   776    4  28   0  68
 38    0   0 2111 11793 9840 6503  336  314  472    0   615    3  32   0  65
 39    0   0  500 10409 10164  565    7  125  174    0  1413    6  23   0  70
 40    0   1  660 10423 9982  950   23  288  272    0  3834    8  34   0  58
 41    0   1  658 10490 10108  847   16  238  237    0  2549    8  29   0  64
 42    0   0  584 10605 10299  708   12  181  207    0  1828    7  26   0  67
 43    0   0  732 10829 10559  598    9  141  193    0  1485    7  25   0  68
 44    0   1  306   487   25 1091   17  282  330    0  4083    9  17   0  74

# netstat -ia 1
     input   e1000g    output       input  (Total)    output
packets errs  packets errs  colls  packets errs  packets errs  colls 
2       0     1       0     0      267619  0     522226  0     0     
2       0     2       0     0      275395  0     539920  0     0     
2       0     2       0     0      251023  0     482335  0     0     
And finally, below we print some statistics from the MAC per-Rx-ring data structure (mac_soft_ring_set_t). For each Rx ring we track the number of packets received via the interrupt path, the number received via the poll path, and, for each time we polled the Rx ring, whether the chain was under 10 packets, between 10 and 50, or over 50. You can see that the poll path brings in the larger share of the packets, and in bigger chains.
[Image: per-Rx-ring statistics from mac_soft_ring_set_t]
Keep in mind that for most OSes and most NICs, the interrupt path brings in one packet at a time. This makes the Crossbow architecture more efficient for scaling as well as for performance at higher loads on high-bandwidth NICs.
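
If you want to poke at these counters on a live system, mdb's ::print dcmd can format the structure for you. This is only a sketch: how you locate the address of a particular mac_soft_ring_set_t is not shown (the <srs_addr> below is a placeholder you would substitute), and the member names vary between builds.
# echo '<srs_addr>::print mac_soft_ring_set_t' | mdb -k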

Crossbow and Network Virtualization

Once we have the ability to create these independent H/W lanes, programming the NIC classifier is easy. Instead of spreading the incoming traffic for scaling, we program the classifier to send packets for a given MAC address to an individual lane. The MAC addresses are tied to individual Virtual NICs (VNICs), which are in turn attached to guest virtual machines or Solaris Containers (Zones). The separation for each virtual machine is enforced by the hardware, and processing happens on the CPUs attached to that virtual machine (the poll thread and interrupts for a VNIC's Rx ring are bound to the assigned CPUs). The picture looks something like this:
[Figure: Crossbow Virtualization Architecture]
Since we always do dynamic polling for NICs and VNICs, enforcing a bandwidth limit is pretty easy. One can create a VNIC by specifying the bandwidth limit, priority and CPU list in one shot, and the poll thread will enforce the limit by picking up only as many packets as the limit allows. Something as simple as:
freya(67)% dladm create-vnic -l e1000g0 -p maxbw=100,cpus=2 my_guest_vm
The above command will create a VNIC called my_guest_vm with a random MAC address and assign it a bandwidth limit of 100Mbps. All the processing for this VNIC is tied to CPU 2. It's features like this that make Crossbow an integral part of the Sun Cloud Computing initiative due to roll out soon.
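
To double check what got created, the usual dladm show subcommands work on VNICs too, and the VNIC can then be handed to an exclusive-IP zone (the zone name below is just an example):
freya(68)% dladm show-vnic my_guest_vm
freya(69)% dladm show-linkprop -p maxbw,cpus my_guest_vm
freya(70)% zonecfg -z my_guest_zone
zonecfg:my_guest_zone> add net
zonecfg:my_guest_zone:net> set physical=my_guest_vm
zonecfg:my_guest_zone:net> end
zonecfg:my_guest_zone> exit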

Anyway, this should give you a flavour. There is a white paper and more detailed documents (including how to get started) at the Crossbow OpenSolaris page.



Comments:

Raw stats URL should be: http://blogs.sun.com/sunay/resource/dynamic_polling.html

Thanks for the info!

Posted by Brook on December 14, 2008 at 10:04 PM PST #

Brook,

Thanks for pointing out the broken URL for Raw Stats. I have fixed it.

Sunay

Posted by Sunay Tripathi on December 16, 2008 at 03:55 AM PST #

