Crossbow - Network Virtualization Architecture Comes to Life

December 5th, 2008 was a joyous occasion and a humbling one at the
same time. A vision created four years earlier was coming to life.
I still remember the summer of 2004 when Sinyaw
(http://blogs.sun.com/syw/entry/distinguished_engineers)
threw a challenge at me: can you change the world? And it was the
fall of that same year when I unveiled the first set of Crossbow slides to
him and Fred Zlotnik over a bottle of wine. Lots of planning, and we were
finally ready to start, but there were still hurdles in the way. We were still
trying to finish Nemo (aka GLDv3), a high-performance device driver framework
that was absolutely required for Crossbow (we needed absolute control over
the hardware). Nemo finished in mid-2005, but then Nicolas, Yuzo, and others
left Sun for a startup. Thiru was still trying to finish Yosemite
(the FireEngine follow-on). So in short, 2005 was basically more
planning and prototyping on my part (especially controlling the Rx rings and
dynamic polling). It was early 2006 when work
began on Crossbow in earnest. Kais moved over from the security group,
Nicolas was back at Sun, and Thiru, Eric Cheng, Mike Lim (and of course me)
came together to form the core team (which later expanded to 20+ people
in early 2008). So it was a long-standing dream
and almost three years of hard work that finally came to life when
Crossbow Phase 1 (http://www.opensolaris.org/os/project/crossbow)
integrated into Nevada build 105 (and will be available in the
OpenSolaris 2009.06 release).

Crossbow - H/W Virtualized Lanes that Scale (10 GigE over multiple cores)


One of the key tenets of the Crossbow design is the concept of H/W
virtualization lanes: essentially tying together a NIC receive and transmit
ring, DMA channel, kernel threads, kernel queues, and processing CPUs.
There are no shared locks, counters, or anything else. Each lane individually
schedules its packet processing by switching its Rx ring independently
between interrupt mode and poll mode (dynamic polling). Now
you can see why Nemo was so
important: without it, the stack couldn't control the H/W, and
without Nemo, the NIC vendors wouldn't have played along with us in
adding the features we wanted (stateless classification, Rx/Tx rings,
etc.). Once a lane is created, we can program the classifier to spread
packets between the lanes based on IP addresses and ports for scaling.
With multiple cores and multiple threads being the way of life going
forward, and 10+ GigE of bandwidth (soon we will
have IPoIB working as well), scaling really matters (and we are not
talking about achieving line rate on 10 GigE with jumbograms - we
are talking about the real world: a mix of small and large packets,
10,000s of connections, and 1000s of threads).
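
If you want to watch this on your own machine, the numbers below were
captured with nothing fancier than the stock observability tools (devices
and CPU numbers will of course differ on your system):

mpstat 1         (per-CPU activity: interrupts, cross calls, context switches)
intrstat 1       (which device's interrupts land on which CPU)
netstat -ia 1    (packets per second in and out, per interface and in total)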




To demonstrate the point, I captured a bunch of statistics while
finishing the final touches on the data path and getting ready to beat
some world records. The table below shows mpstat output along with
packets per second serviced for the Intel Oplin (10 GigE) NIC on a
Niagara 2 based system. The NIC has all 8 Rx/Tx rings enabled and 8
interrupts enabled (one for each Rx ring).
CPU minf mjf  xcal  intr  ithr  csw icsw migr smtx srw syscl usr sys wt idl
 38    0   0     6    21     3   31    1    5   12   0    86   0   0  0  99
 39    0   0  2563  5506  3907 3282   28   34 1170   0   178   0  21  0  78
 40    0   0  2553  5117  3948 2410   38  150 1192   0   504   1  21  0  77
 41    0   0  2651  5221  4232 2011   25   53 1195   0   210   0  20  0  80
 42    0   0  3078  5700  4743 2069   21   28 1285   0   125   0  22  0  78
 43    0   0  3280  5837  4777 2118   19   24 1328   0   101   0  22  0  78
 44    0   0  3143 19566 18801 1773   50   44 1285   0    68   0  65  0  35
 45    0   0  4570  7748  6838 1984   23   27 1697   0   118   0  29  0  71

# netstat -ia 1
      input   e1000g     output            input    (Total)    output
    packets errs  packets errs colls     packets errs  packets errs colls
          4    0        1    0     0       61284    0   128820    0     0
          3    0        2    0     0       61015    0   129316    0     0
          4    0        2    0     0       60878    0   128922    0     0



"http://blogs.sun.com/sunay/resource/no_dynamic_poll.html">This
link
shows the interrupt binding, mpstat and intrstat output. You
can see that the NIC is trying very hard to spread the load but
because the stack sees this as one NIC, there is one CPU (number 44)
where all the 8 threads collide. Its like a 8 lane highway becoming
single lane during rush hours
.




Now let's look at what happens when Crossbow enables a lane all the way up
the stack for each Rx ring and also enables dynamic polling for each one
individually. If you look at the corresponding mpstat and intrstat
output and the packets-per-second rate, you will see that the lanes
really do work independently of each other, resulting in almost
linear spreading and a much higher packets-per-second rate. The
benchmark represents a web server workload and, needless to say,
Crossbow with dynamic polling on a per-Rx-ring basis almost tripled the
performance. The raw stats can be seen here:
http://blogs.sun.com/sunay/resource/dynamic_polling.html
CPU minf mjf  xcal  intr  ithr  csw icsw migr smtx srw syscl usr sys wt idl
 37    0   0  2507 11906 10272 4267  265  326  489   0   776   4  28  0  68
 38    0   0  2111 11793  9840 6503  336  314  472   0   615   3  32  0  65
 39    0   0   500 10409 10164  565    7  125  174   0  1413   6  23  0  70
 40    0   1   660 10423  9982  950   23  288  272   0  3834   8  34  0  58
 41    0   1   658 10490 10108  847   16  238  237   0  2549   8  29  0  64
 42    0   0   584 10605 10299  708   12  181  207   0  1828   7  26  0  67
 43    0   0   732 10829 10559  598    9  141  193   0  1485   7  25  0  68
 44    0   1   306   487    25 1091   17  282  330   0  4083   9  17  0  74

# netstat -ia 1
      input   e1000g     output            input    (Total)    output
    packets errs  packets errs colls     packets errs  packets errs colls
          2    0        1    0     0      267619    0   522226    0     0
          2    0        2    0     0      275395    0   539920    0     0
          2    0        2    0     0      251023    0   482335    0     0

And finally, below we print some statistics from the MAC per-Rx-ring data
structure (mac_soft_ring_set_t). For each Rx ring, we track the number
of packets received via the interrupt path, the number received via the
poll path, and, for each time we polled the Rx ring, the number of chains
of fewer than 10 packets, chains between 10 and 50, and chains over 50.
You can see that the poll path brings in the larger share of the packets,
and in bigger chains.
[Figure: per-Rx-ring statistics - packets received via the interrupt path vs. the poll path, and poll chain sizes]

Keep in mind that for most OSes and most NICs, the interrupt path
brings in one packet at a time. This makes the Crossbow architecture more
efficient for scaling, as well as for performance at higher loads on
high-B/W NICs.

Crossbow and Network Virtualization


Once we have the ability to create these independent H/W lanes,
programming the NIC classifier is easy. Instead of spreading the
incoming traffic for scaling, we program the classifier to send
packets for a MAC address to an individual lane. The MAC addresses are
tied to individual virtual NICs (VNICs), which are in turn attached to
guest virtual machines or Solaris Containers (zones). The separation
for each virtual machine is driven by the H/W, and packets are processed
on the CPUs attached to the virtual machine (the poll thread and interrupts
for a VNIC's Rx ring are bound to the assigned CPUs). The
picture looks something like this:
[Figure: Crossbow virtualization architecture - per-VNIC H/W lanes bound to guest virtual machines and zones]

Since for NICs and VNICs we always do dynamic polling, enforcing a bandwidth
limit is pretty easy. One can create a VNIC by specifying the
B/W limit, priority, and CPU list in one shot, and the poll thread will
enforce the limit by picking up only as many packets as the limit allows.
It's something as simple as:

freya(67)% dladm create-vnic -l e1000g0 -p maxbw=100,cpus=2 my_guest_vm

The above command creates a VNIC called my_guest_vm with a random MAC
address and assigns it a B/W limit of 100Mbps. All the processing for this
VNIC is tied to CPU 2. It's features like this that make Crossbow an
integral part of the Sun Cloud computing initiative due to roll out soon.
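
And because maxbw, cpus, and priority are just link properties, they can be
inspected and changed later without recreating the VNIC; a small sketch
(using the VNIC from above):

dladm show-linkprop -p maxbw,cpus,priority my_guest_vm    (show the current settings)
dladm set-linkprop -p maxbw=200 my_guest_vm               (raise the limit to 200Mbps on the fly)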




Anyway, this should give you a flavour of it. There is a white paper and
more detailed documents (including how to get started) at the Crossbow
OpenSolaris page: http://www.opensolaris.org/os/project/crossbow











