Achieving line rate on a 10 Gb NIC

For my very first blog, I'm going to talk about some work I did to improve the networking performance on Sun's newest line of servers based on the UltraSPARC T2 multi-core chips.

I work in a performance group in Sun and in September 2006, I was asked to help out the product group on a software scaling issue in Sun's new T5220/"Huron" server using the 10 GbE NIC which is embedded in the T2 CPU.

The problem we had was that at exactly 8 connections, we could achieve 8 Gbps but with increasing connections, the throughput showed negative scaling dropping to ~5 Gbps at ~400 connections and ~2 Gbps at 800 connections.

A bigger problem, from the perspective of our group, which is responsible (among other things) for optimizing specweb numbers on this platform, was that the specweb support results, which also require high throughput from the NIC, were showing the same negative scaling. The baseline at the time we started was about 7000 [1], which was about half of what was reported for the T2000 (which is based on the UltraSPARC T1 CPU).

In this blog I'm going to talk about how I solved the problem so we could get near line rate out of that NIC using 8 or more connections. I won't be talking about the code itself but about the process by which the code reached its final form.

Now, just from the problem description, it was pretty obvious that we had a scaling bottleneck, and running lockstat pointed to one culprit, bugid 6492551: a hot lock in the nxge driver. At higher connection counts (800 connections), we were tripping over a different problem, bugid 6405012, where the GLDv3 framework went single threaded when the hardware was out of transmit resources.

My first suggestion to the driver writer was that the "right" fix would be to find a way to hold the lock for a much shorter time which would reduce the contention on the lock. Since this was a hardware resource, there was no point in trying to break up the lock. Since redesigning the locking code to reduce the hold time was non-trivial (and quite possibly risky/error-prone) I offered to help in a different direction, solving an easier problem - I'd work on a prototype that would avoid the locking problem altogether.

A little background about the NIC: the NIU (Network Interface Unit) which is embedded in the T2 is configured with 8 tx descriptor rings per port whereas the Atlas (PCI-E card, based on the same ASIC) is configured with 12 tx descriptor rings per port. Think of each descriptor ring as a hardware queue. The number of such queues limits the amount of parallelism allowed by the NIC hardware. Each ring had its own lock (in software) to control access to elements of the ring and the scaling problem was because the 64 CPU strands on the T2 were being allowed to contend over any of these 8 or 12 rings.

So my first design choice was easy: allow only a single thread per ring. To implement this I decided to design a fairly simple multiplexer (mux): All threads with data bound for a ring would have to drop off their packets on a mux queue and a worker thread assigned to each ring would process these packets. The only point of contention between the "producers" dropping off packets and the "consumer" thread working on packets could be pared down to a tiny portion of code while manipulating the shared queue between them.
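The "N producers, 1 consumer per ring" idea can be sketched roughly as follows. This is a user-space Python illustration, not the kernel C in the nxge driver; the `RingMux` class, its method names and the `transmit` callback are all hypothetical. The point it tries to show is that the shared lock covers only the queue manipulation, while the actual transmit work happens outside it in the single worker thread.

```python
import threading
from collections import deque


class RingMux:
    """Hypothetical mux: many producers, one worker thread per tx ring."""

    def __init__(self, transmit):
        self._transmit = transmit         # only ever called by the worker
        self._queue = deque()
        self._cv = threading.Condition()  # guards _queue and _closed only
        self._closed = False
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def enqueue(self, packet):
        # Producer side: hold the lock just long enough to append.
        with self._cv:
            self._queue.append(packet)
            self._cv.notify()

    def close(self):
        # Let the worker finish whatever is queued, then stop it.
        with self._cv:
            self._closed = True
            self._cv.notify()
        self._worker.join()

    def _drain(self):
        while True:
            with self._cv:
                while not self._queue and not self._closed:
                    self._cv.wait()
                if not self._queue:
                    return  # closed and nothing left to send
                # Take the whole backlog in one step, then drop the lock.
                batch, self._queue = self._queue, deque()
            for packet in batch:
                self._transmit(packet)  # no lock held while "transmitting"


def demo(producers=8, per_producer=1000):
    """Run several producer threads against one mux and return what was sent."""
    sent = []
    mux = RingMux(sent.append)

    def produce(pid):
        for seq in range(per_producer):
            mux.enqueue((pid, seq))

    threads = [threading.Thread(target=produce, args=(pid,))
               for pid in range(producers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    mux.close()
    return sent
```

Because only the worker ever calls `transmit`, the ring itself needs no lock in this sketch, and packets from any one producer leave in the order that producer queued them.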

I coded up a prototype where the overriding design goal was to minimize any time spent holding a lock and added a heuristic to reduce latency (don't queue a packet unless a queue has already built up). The actual coding was done in about 2 weeks and I thought it was a neat little piece of code (at just under 100 lines, I think) and I had coded it as an "N producer 1 consumer mux" similar to the taskq(9f) primitive which is available in Solaris. I had designed it as a separate subsystem and chose an interface identical to the taskq_{create,dispatch}/serializer_{create,enter} routines available in Solaris so it could be a drop-in replacement, if required.
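One way to read the latency heuristic ("don't queue a packet unless a queue has already built up") is sketched below, again in Python and with made-up names (`DirectOrQueueMux`, `send`, the `"direct"`/`"queued"` return values). It assumes that a caller may transmit directly only when the ring is idle and nothing is waiting; otherwise the packet joins the queue and the worker thread sends it. The extra "ring is busy" check is an assumption of this sketch, added to keep a single sender on the ring at a time.

```python
import threading
from collections import deque


class DirectOrQueueMux:
    """Hypothetical mux with a fast path: queue only if a backlog exists."""

    def __init__(self, transmit):
        self._transmit = transmit
        self._queue = deque()
        self._cv = threading.Condition()
        self._busy = False    # True while some thread is using the ring
        self._closed = False
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def send(self, packet):
        with self._cv:
            if self._busy or self._queue:
                # A backlog (or an active sender) exists: keep ordering, queue it.
                self._queue.append(packet)
                self._cv.notify()
                return "queued"
            self._busy = True  # ring is idle: claim it for a direct send
        self._transmit(packet)  # fast path, no queueing delay
        with self._cv:
            self._busy = False
            self._cv.notify_all()
        return "direct"

    def close(self):
        with self._cv:
            self._closed = True
            self._cv.notify_all()
        self._worker.join()

    def _drain(self):
        while True:
            with self._cv:
                while self._busy or (not self._queue and not self._closed):
                    self._cv.wait()
                if not self._queue:
                    return  # closed, idle, and nothing left
                batch, self._queue = self._queue, deque()
                self._busy = True
            for packet in batch:
                self._transmit(packet)
            with self._cv:
                self._busy = False
                self._cv.notify_all()


def demo():
    """Send "a" directly; "b" arrives while the ring is busy and gets queued."""
    sent, paths = [], []

    def transmit(packet):
        sent.append(packet)
        if packet == "a":
            paths.append(mux.send("b"))  # arrives mid-transmit

    mux = DirectOrQueueMux(transmit)
    paths.append(mux.send("a"))
    mux.close()
    return sent, paths
```

In the uncontended case every packet takes the direct path and the worker thread never runs, which is where the latency saving would come from.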

Now I had to test the code to see if the scaling problem was solved. Since the Huron box where I would have ideally liked to do the testing was a point of contention (among the humans trying to run tests on it - there was 1 precious new machine shared among 7 people I think), I decided to move my testing to an older T2000 (UltraSPARC T1 based machine) with an Atlas card. Since my changes were isolated to the transmit side and I didn't want to tickle TCP or upper layer issues, my test setup was a simple, transmit only, multi-process ttcp test using small packets sending 100 byte messages over UDP.

My testing showed that with the new prototype code, the lock contention problem was solved and I could reach (at that time a record) ~1.5M packets per second (Mpps). I was confident enough with the changes that I handed it off to the Specweb folks.

The Specweb folks used my fix and promptly reported a machine hang. Sigh! Debugging begins. What had happened was that I had run my little microbenchmark instead of doing a test that mimicked their workload which would have turned up the problem in my code. Anyway, despite being blindsided by overconfidence, things weren't too bad because the problem turned out to be rather obvious.

What I found was that packets that needed to be handled in interrupt context were also being placed on the mux queue. To deal with this, the first piece of ugliness crept into my otherwise elegant mux, which would have been a joy and delight to behold - at least to my CS 101 professor. Oh well, correctness trumped elegance and I had to special case interrupt handling into the code.

With this change in place, the specweb support runs ran to completion and we managed to beat our earlier internal record set with the T2000. We no longer had the negative scaling I had set out to solve.

I called this version v2 to distinguish it from the earlier (buggy) version v1.

Now that I had 'a' solution, I wanted to make sure I hadn't reinvented the wheel. I was familiar with most (if not all) of the earlier solutions that had been attempted for this problem. I'm getting into Solaris arcana here but for those who are familiar with these subsystems, here is a list of previous solutions:

  • (a) Jeff Bonwick's taskq(9f) subsystem.
  • (b) Frank Dimambro's solution from the ce driver (for Cassini, Sun's 1 Gbps NIC)
  • (c) Alexander Kolbasov's serializer_{create,enter} routines built on top of taskq
  • (d) Paul Riethmuller's and Ajoy Siddhabathuni's solution for the ipge driver (for the Intel 1 Gbps NIC)

I built and evaluated a prototype with each of the above solutions, and each approach came with problems of its own:

  • (a) The taskq routines unfortunately do memory allocation on the hot path. This is by choice, since taskqs are agnostic to the data elements that may be passed to them, but it was suboptimal in this case. While preallocating a fixed chunk of memory (using the Solaris internal interface, not the documented ddi interface) could solve this, there was the possibility that a large burst of outgoing packets could fill it all up. A second issue is that taskqs can't do flow control - well, they can, kinda, but that is done by simply dropping packets on the floor - tail drops in TCP parlance. A third issue is that taskqs were designed for the case where latency is not an issue. This is usually not true for network traffic.
  • (b) The ce driver approach was to turn the first thread that grabbed the lock into an impromptu consumer/worker thread - this approach was obviously unfair (and hurt specweb results due to some threads being unfairly made to do processing on behalf of other threads and thus hurting latency). In addition, the current code has a latent bug that allows more than one thread to contend on the lock (i.e. back to the original problem). Interestingly enough, I was trying to look up the provenance of this code recently and it turns out that the hme driver, on which the ce driver may have been based, does not have this bug. Which just goes to show that even customizing code is not necessarily trivial or bug free.
  • (c) The serializer_{create,enter} routines in Solaris are built on top of the taskq routines by reusing the "system_taskq", so in addition to the taskq-specific issues mentioned in (a) above, they had a latent bug, 6525649: at high loads, more than one consumer thread would start to run, resulting in the original contention problems that I had set out to fix.
  • (d) The ipge driver, which was later obsoleted by the e1000g driver (now that's another story waiting to be told), used the STREAMS service routine framework to achieve the same mux effect. But since the nxge driver was based on the newer GLDv3/Nemo framework and didn't use the STREAMS framework anymore, this was no longer an option. Moreover, the single threading implied by the STREAMS service routine was sufficient to keep the pipe full on a 1 Gbps link but couldn't cut it on a 10 Gbps link.
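To make the unfairness described in (b) concrete, here is a small Python sketch of the general "first thread in becomes the worker" pattern. It is not the ce driver's code; `ImpromptuWorker` and its `send` method are hypothetical names, and the return value (how many packets the caller ended up transmitting) exists only to make the effect visible.

```python
import threading
from collections import deque


class ImpromptuWorker:
    """Hypothetical sketch: whoever arrives first drains for everyone."""

    def __init__(self, transmit):
        self._transmit = transmit
        self._lock = threading.Lock()
        self._queue = deque()
        self._draining = False  # True while some caller is acting as worker

    def send(self, packet):
        """Queue a packet; return how many packets this caller transmitted."""
        with self._lock:
            self._queue.append(packet)
            if self._draining:
                return 0  # someone else is already draining; walk away
            self._draining = True  # this caller is now the worker
        handled = 0
        while True:
            with self._lock:
                if not self._queue:
                    self._draining = False
                    return handled
                batch, self._queue = self._queue, deque()
            for item in batch:
                self._transmit(item)  # may include other callers' packets
                handled += 1


def demo():
    """The sender of "a" ends up transmitting "b" and "c" as well."""
    sent, handled = [], {}

    def transmit(packet):
        sent.append(packet)
        if packet == "a":
            # Two more packets arrive while "a" is being transmitted.
            handled["b"] = worker.send("b")
            handled["c"] = worker.send("c")

    worker = ImpromptuWorker(transmit)
    handled["a"] = worker.send("a")
    return sent, handled
```

The callers that arrive late return immediately, while the first caller is stuck doing work on their behalf for as long as packets keep arriving, which is the latency penalty described above. A dedicated worker thread per ring moves that cost off the application threads.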

Building and testing all of the above prototypes ensured due diligence on my part to prove that I wasn't suffering from the NIH syndrome (but then again, that is not necessarily a bad thing - so long as it falls under the rubric of questioning basic assumptions).

So now I had a working prototype that solved an important problem in a new way.

All that remained was to get approval to integrate my code into Solaris. This first involves getting a thumbs up from the QA group that runs a battery of stress tests for days at a time on the various server platforms that Sun manufactures. Stress testing the code on UDP/DLPI transmit on a sun4u platform resulted in a memory exhaustion issue. A related stress test on an AMD Galaxy4 x86 platform showed relatively low performance. Both of these were quickly debugged to flow control issues.

The flow control issue was this: Assume we have a huge burst ("thundering herd") of activity (outgoing packets). The system would faithfully allocate memory for all of this activity and queue everything up. Until we could drain the queue, all of this memory was locked up and this was quite sub-optimal. A better approach is to slow down the thundering herd to a rate that can be absorbed by the system (think parking/metering lights on highway 101). I'll call this "in-bound" flow control - and this was what we needed for the DLPI stress test. A subtle variation of this needed to happen between the worker thread and the hardware - I'll call this the "out-bound" flow control to distinguish it from the in-bound case. Whenever the output hardware queue was full, the worker thread needed to put itself to sleep for a while. This was needed on the x86 platforms. The flow-controlled versions were named v3 and v4 respectively.
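Both kinds of flow control can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: `FakeRing` stands in for a tx descriptor ring with a fixed number of slots, `FlowControlledMux` is a made-up name, and the queue limit and back-off interval are arbitrary. In-bound control is modelled by a bounded queue that blocks producers when it is full; out-bound control is modelled by the worker sleeping whenever the ring has no free slot.

```python
import queue
import threading
import time

_STOP = object()  # sentinel telling the worker to exit


class FakeRing:
    """Stand-in for a hardware tx ring; touched only by the worker thread."""

    def __init__(self, slots):
        self.slots = slots
        self.in_flight = 0
        self.peak_in_flight = 0
        self.posted = []

    def try_post(self, packet):
        if self.in_flight >= self.slots:
            return False  # ring full: caller must back off
        self.in_flight += 1
        self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
        self.posted.append(packet)
        return True

    def reclaim(self):
        # Pretend the hardware finished sending everything posted so far.
        self.in_flight = 0


class FlowControlledMux:
    def __init__(self, ring, max_queued=8, backoff=0.001):
        self._ring = ring
        self._inbound = queue.Queue(maxsize=max_queued)  # bounded backlog
        self._backoff = backoff
        self.stalls = 0  # how often the worker had to sleep
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def send(self, packet):
        # In-bound flow control: block the producer rather than let the
        # backlog (and the memory behind it) grow without limit.
        self._inbound.put(packet)

    def close(self):
        self._inbound.put(_STOP)
        self._worker.join()

    def _drain(self):
        while True:
            packet = self._inbound.get()
            if packet is _STOP:
                return
            while not self._ring.try_post(packet):
                # Out-bound flow control: ring is full, sleep for a while.
                self.stalls += 1
                time.sleep(self._backoff)
                self._ring.reclaim()  # stand-in for hardware completions


def demo(producers=4, per_producer=25):
    ring = FakeRing(slots=4)
    mux = FlowControlledMux(ring)

    def produce(pid):
        for seq in range(per_producer):
            mux.send(pid * per_producer + seq)

    threads = [threading.Thread(target=produce, args=(pid,))
               for pid in range(producers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    mux.close()
    return ring, mux
```

With the bound in place, the memory tied up in the backlog is capped at `max_queued` packets no matter how large the burst is; the cost is that producers are slowed to the rate the ring can absorb.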

One interesting note about the flow control issue was that I had recognized the need for flow control when I wrote the initial version (in the code it showed up as the lack of an else clause to a conditional test for an exception) but I had decided to defer it because

  • (a) A single strand on a T1 was slow enough to not be able to fill up the output queue and I didn't have enough data for how much faster other processors might be to know if I needed to worry about it just yet.
  • (b) I assumed a 10 Gb NIC would be used mostly on a server with lots of memory
  • (c) I didn't see a problem show up in my testing or on specweb which used servers with 32 GB of memory.

So I'd concluded my concerns were just theoretical. But as Dijkstra pointed out a long time ago "testing can only show the presence of bugs, not their absence".

What I didn't know then was that the same ASIC used in the 10Gb NIU and Atlas was also used in a 4x1 Gbps (quad gig) NIC that could be used in our low end sun4u desktop servers that typically ship with "only" 1-2 GB of memory.

Adding flow control to the mux took care of this issue and resulted in the code growing to ~200 lines of code and I had to kiss goodbye to my hope of a "tiny" solution. (As an aside, I used to work in a sustaining group in Sun fixing bugs in various parts of Solaris and my definition of a "good fix" is a one-liner or even better a no-liner - ie a workaround).

Now fixing a software problem is one thing but getting it into Solaris is a whole different ball game (e.g. I had only written code for the hot path - so the unplumb/plumb part of the code needed additional work, as did support for suspend/resume). The work in my group ends at building prototypes to show performance gains and the product folks own the actual code, so I was only too happy to leave the code integration into Nevada and the backport to Solaris 10 to the nxge driver folks. The only suggestion I had for them was that the mux, which is a general purpose piece of code, didn't belong in the driver and could be moved to the more generic GLDv3/Nemo layer (i.e. dls/dld).

The Nemo guys were understandably reluctant to be saddled with a fairly new piece of code that would affect all subsystems/drivers. Some of the Nemo guys are now working on Crossbow, and Crossbow didn't have a solution that would be ready before server shipment began. So the unhappy solution we chose was to put the mux code into the nxge driver, with the Crossbow guys promising to "rip it out" and replace it with a solution of their own making when their code eventually integrates into Nevada.

Anyway, finally, after about 5 months of never-ending discussions, code reviews, sundry processes and the associated hand-wringing, the code was eventually put back into Nevada build 62 on April 3, 2007 and backported to Solaris 10 update 4 build 8 on April 27, 2007 - the shipping OS release for the Huron.

For Solaris 10 update 3, the nxge driver is unbundled (ie not part of Solaris) and that integration happened much sooner - sometime in January 2007, I think.

On a pessimistic note, I wonder how Fred Brooks would measure the productivity on this project. Is it 200 lines/4 weeks == the prototypical 10 lines per day for me or is it 200 lines/5 months for us as an organization?

In summary (and on a more positive note), let me do a smug self-eval to see if Larry Wall's metrics for a good programmer apply to me in this work:

  • (1) Laziness: By designing an (almost) general purpose, low latency, low contention, flow controlled mux, I was lazy enough to work around the locking problem instead of solving it using a difficult/"high risk-low return" approach. I also chose not to implement a fancy lockless design and favored a simple, plain vanilla locking strategy, based on the rule that the more critical your subsystem, the simpler it should be; moreover, it works well enough. Maybe I'll whip up a lockless version when 40Gbps or 100Gbps NICs show up.
  • (2) Impatience: Having the performance of a beautiful, new piece of hardware mangled by the software angered me enough to fix the problem.
  • (3) Hubris: With multi-core CPU designs becoming more common, the pipelining that used to happen in hardware will now need to happen to some extent in software. The mux that I designed is a perfect fit for this paradigm shift that allows such pipelining to happen in software. I hope this software meme will make its way into various pieces of code - it doesn't matter whether it is within Sun or outside Sun, thanks to OpenSolaris.

But at the same time, having fixed other people's bugs for a large portion of my professional life, I'm quite familiar with the frailty of the human mind, especially in its interactions with dumb machines that show emergent behaviour, and so it is also with some humility that I look at this software, especially when I think of the subtle bugs that likely remain and haven't made their presence felt yet in our testing - even on the world record specweb run at 37000 connections. [2]

So, there you have it: the very first blog of a competent programmer and yes, the world does need more low latency, low contention, flow controlled muxes. ;-)

For those who care to take a look, the code itself is in nxge_serialize.c (and no, I didn't perpetrate all that debugging code bloatware which is part of the code now).

As for future work on this code, I'm working on making a more generic version without many of the magic constants that are peppered in it today. I also hope to subsume taskq functionality into it and use it as the basis for the fix for bugid 6652443.

For more blogs and information on Sun's T5120, T5220 and T6320 servers that incorporate this technology, check out Allan Packer's index page on all things CMT for other perspectives from the trenches.

The following footnotes are to keep the lawyers happy:

[1] My runs used the SPECweb2005 benchmark for research purposes. Deviations from the benchmark run rules include but were not limited to

  • short runs and
  • various prototype codes which were not generally available.
[2] For published SPEC and other benchmarks numbers from Sun please refer to the official server benchmarks.

Nice blog. What I like best about the code is the delicacy between freelance() and onetrack() in nxge_serializer. So in future, if the cost of hardware lock in the card became very low, your serializing queue would not build up, and things would continue as before. Great work.

Posted by amitab on March 12, 2008 at 12:33 PM PDT #

Any chance of seeing this ported to the linux driver for the atlas card? :-)

Posted by John on March 12, 2008 at 01:50 PM PDT #

Good work!

Posted by Joe G on March 13, 2008 at 06:43 AM PDT #

Amitab, Joe:
Thanks for the kind words. ;)

The driver folks who work on Linux need to make the call on whether this needs to
be ported there. Do you know for sure if the driver does not scale on Linux?

Posted by Charles Suresh on May 06, 2008 at 08:49 AM PDT #

Charles Suresh

