Monday Oct 13, 2008

Networking on the Sun SPARC Enterprise T5440

Today Sun launched the industry's first four-socket CMT platform based on the UltraSPARC T2 Plus (Victoria Falls) processor: the Sun SPARC Enterprise T5440 server. It joins Sun's existing CMT lineup, which consists of the two-socket UltraSPARC T2 Plus based Sun SPARC Enterprise T5140/T5240, the UltraSPARC T2 based Sun SPARC Enterprise T5120/T5220, and the UltraSPARC T1 based Sun SPARC Enterprise T2000 server.

The Sun SPARC Enterprise T5440 server has an impressive network I/O capacity: 8 PCIe x8 slots, with 2 x16 connectors for graphics. These are connected to 4 PCIe x8 switches. Based on the PCI Express® 1.1 specification, the bandwidth of each x8 PCIe switch is 16 Gbits/s per direction. In practice, this translates to about 12 Gbits/s of payload in each direction for 10 GigE networking, once PCIe protocol overhead and networking protocol headers are taken into account. In this blog we investigate the performance of 10 GigE networking on this system, and provide some tuning guidelines and networking best practices. Our aim is to maximize network performance with four Sun Multithreaded 10 GigE Network Interface Cards (NICs).
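The 16 Gbits/s figure can be reproduced with a little arithmetic. The sketch below assumes the standard PCIe 1.x parameters (2.5 GT/s per lane, 8b/10b encoding) and simply compares the result with the roughly 12 Gbits/s payload figure quoted above. The function and variable names are ours, chosen for illustration.

```python
# PCIe 1.x signals at 2.5 GT/s per lane and uses 8b/10b encoding,
# so only 8 of every 10 bits on the wire carry data.
GT_PER_LANE = 2.5              # gigatransfers/s per lane (PCIe 1.x)
ENCODING_EFFICIENCY = 8 / 10   # 8b/10b line code


def pcie1_link_gbps(lanes):
    """Data bandwidth per direction, in Gbit/s, of a PCIe 1.x link."""
    return lanes * GT_PER_LANE * ENCODING_EFFICIENCY


x8_gbps = pcie1_link_gbps(8)
payload_gbps = 12.0  # approximate payload figure quoted in the text
payload_fraction = payload_gbps / x8_gbps

print(f"x8 link: {x8_gbps:.1f} Gbit/s per direction")
print(f"payload share of link bandwidth: {payload_fraction:.0%}")
```

In other words, about a quarter of the link bandwidth goes to PCIe and network protocol overhead.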


To maximize throughput, we want the traffic from each NIC to go through an independent PCIe x8 switch, so we place the cards in slots 0, 2, 5 and 7. These NICs are connected back to back with 4 Sun Fire X4150 machines (two-socket Intel Xeon platform). We installed OpenSolaris version 98 on our System Under Test (SUT): the Sun SPARC Enterprise T5440. To characterize network performance, we use Uperf 1.0.2. Uperf is installed on the SUT and on all the client machines. All the results in this blog are with the regular MTU (1500 bytes). Results with jumbo MTU (9000 bytes) are much better in terms of both CPU utilization and throughput and will be discussed in a later blog.


T5440 10GbE throughput (64 threads per interface, 64K sized writes)
# ports   TX (Gbps)   RX (Gbps)   CPU util. (%)
1         9.37        9.46
2         18.75       18.68       18
3         26.92       27.88       28
4         31.5        26.21       38

This shows that the Sun SPARC Enterprise T5440 scales very well in terms of both throughput and CPU utilization. We can achieve an astonishing 30 Gbps from a single system with more than half of the box idle for your cool applications to run. Something like this was unimaginable 5 years ago. That's the power of CMT :).
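To put a number on the scalability, here is a small sketch that compares each multi-port result with perfect linear scaling of the single-port result. The throughput values are copied from the table above. Treating the first column as transmit and the second as receive is our reading of the table, and the helper name is ours.

```python
# Throughput in Gbps from the table above: ports -> (first column, second column).
# Reading the columns as (TX, RX) is an assumption based on the discussion in the text.
throughput = {
    1: (9.37, 9.46),
    2: (18.75, 18.68),
    3: (26.92, 27.88),
    4: (31.5, 26.21),
}


def scaling_efficiency(gbps, ports, single_port_gbps):
    """Measured throughput as a fraction of ports x single-port throughput."""
    return gbps / (ports * single_port_gbps)


base_tx, base_rx = throughput[1]
efficiency = {}
for ports, (tx, rx) in sorted(throughput.items()):
    efficiency[ports] = (
        scaling_efficiency(tx, ports, base_tx),
        scaling_efficiency(rx, ports, base_rx),
    )
    tx_eff, rx_eff = efficiency[ports]
    print(f"{ports} port(s): TX {tx_eff:.0%}, RX {rx_eff:.0%} of linear scaling")
```

With these numbers, two and three ports stay close to linear, while four ports reach roughly 84% of linear on transmit and roughly 69% on receive.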


We use the following tunings for the above results:
1) Large Segment Offload (LSO):
LSO enables the network stack to deal with large segments and leaves the job of splitting them into MTU-sized packets to the lower layers. The NIC may be able to do this segmentation in hardware; this is known as hardware LSO. Alternatively, the driver may do the segmentation; this is known as soft LSO. LSO helps reduce system CPU utilization by reducing the number of trips through the network stack needed to transmit data.

The Sun Multithreaded 10 GigE driver (nxge) provides support for soft LSO, which may be enabled by uncommenting the following line in /platform/sun4v/kernel/drv/nxge.conf:

soft-lso-enable = 1;

LSO is a recommended feature for higher throughput and/or lower CPU utilization. It may lead to higher packet latency.
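To make the idea concrete, here is a conceptual sketch of the segmentation step that LSO pushes down to the driver or NIC. It is not the nxge implementation. It assumes IPv4 and TCP headers without options (20 bytes each), which gives a 1460-byte payload per 1500-byte MTU packet, and the `segment` helper is ours.

```python
MTU = 1500
IP_HEADER = 20    # assumption: IPv4 header without options
TCP_HEADER = 20   # assumption: TCP header without options
MSS = MTU - IP_HEADER - TCP_HEADER  # payload bytes per packet


def segment(payload, mss=MSS):
    """Split one large buffer into MSS-sized chunks, one per packet."""
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]


write = bytes(64 * 1024)  # one 64K write, as used in the tests above
segments = segment(write)

print(f"{len(write)} bytes -> {len(segments)} packets, "
      f"last packet carries {len(segments[-1])} bytes")
```

Without LSO, the stack would build each of these packets itself. With LSO, it hands down the single large segment and the lower layer does the splitting.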

2) Soft rings:
OpenSolaris has MSI-X support. By default, interrupts fan out to 8 CPUs per NIC. However, at high throughput the interrupt CPUs may get pinned. To avoid this bottleneck, we recommend enabling 16 soft rings by adding the following line to /etc/system:

set ip:ip_soft_rings_cnt=16

Soft rings reduce the load on the interrupt CPU by taking over the job of processing received packets up the network protocol stack. As with LSO, soft rings may lead to higher packet latency.

3) TCP settings:
We recommend the following TCP settings for higher throughput:
ndd -set /dev/tcp tcp_xmit_hiwat 262144
ndd -set /dev/tcp tcp_recv_hiwat 262144
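One way to see why larger buffers help: a single TCP connection cannot have more than one window of data in flight per round trip, so its throughput is bounded by window size divided by round-trip time. The sketch below applies that bound to the 262144-byte setting. The round-trip times are hypothetical values chosen for illustration, not measurements from our setup.

```python
WINDOW_BYTES = 262144  # tcp_xmit_hiwat / tcp_recv_hiwat from the settings above


def max_gbps_per_connection(window_bytes, rtt_seconds):
    """Upper bound on one connection's throughput: one window per round trip."""
    return window_bytes * 8 / rtt_seconds / 1e9


# Hypothetical round-trip times, in microseconds (not measured values).
for rtt_us in (100, 500, 1000):
    bound = max_gbps_per_connection(WINDOW_BYTES, rtt_us / 1e6)
    print(f"RTT {rtt_us:>4} us -> at most {bound:.2f} Gbps per connection")
```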

On the receive side, we do not scale perfectly from 3 to 4 ports because of the higher cost of sending small ACK packets when the number of connections is high. A fix is in the works that will resolve this problem (see the bug discussion below).

The XAUI card, which uses the on-board 10 GigE, performs almost identically to the Sun Multithreaded 10 GigE NIC. Here are the results.

T5440 10GbE throughput (16 threads per port, 64K sized writes)
# ports   TX (Gbps)   RX (Gbps)   CPU util. (%)
1         9.33        9.32

Now, let me talk about some bugs that are still in the works. You may easily track these bugs at

CR 6311743 will improve callout table scalability in Solaris. This issue affects small-packet performance with a large number of connections at high throughput. Callout tables are used in Solaris to maintain TCP timeout entries. In the case of TCP receive, a very large number of small ACK packets must be transmitted when both the connection count and the throughput are high. The results in the next table show relatively poor TCP receive performance at 500 connections per port using 4 ports (2000 in all). Once integrated in Solaris, this CR will resolve the problem.

T5440 10GbE throughput (500 threads per port, 64K sized messages)
# ports   TX (Gbps)   RX (Gbps)
4         30.26       14.60
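To get a feel for how many ACKs the receive side has to send, here is a rough estimate. It assumes 1460 bytes of TCP payload per packet (1500-byte MTU, headers without options) and one ACK for every two received segments. Both are our assumptions for illustration, not values measured on the SUT.

```python
MSS = 1460  # assumption: payload bytes per packet at a 1500-byte MTU


def segments_per_second(gbps, mss=MSS):
    """Number of MSS-sized segments per second at a given payload rate."""
    return gbps * 1e9 / 8 / mss


def acks_per_second(gbps, segments_per_ack=2, mss=MSS):
    """ACK rate, assuming one ACK per `segments_per_ack` received segments."""
    return segments_per_second(gbps, mss) / segments_per_ack


rx_gbps = 14.60  # receive throughput at 2000 connections, from the table above
print(f"{segments_per_second(rx_gbps):,.0f} segments/s received")
print(f"{acks_per_second(rx_gbps):,.0f} ACKs/s sent (estimate)")
```

Under these assumptions, even the reduced receive rate of 14.60 Gbps means well over half a million small ACK packets per second.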

CR 6644935 will address scalability in the credentials data structure. This bug also affects small-packet performance. As an example, the following results demonstrate small packet (64 byte) transmit using UDP.

T5440 10GbE throughput (250 threads per port, 64 byte write)
# ports   TX (packets/s)
4

While we fix this CR and integrate it in Solaris, a simple workaround is to bind the application to one UltraSPARC T2 Plus socket using:

psrset -c 0-63
(note the processor set id printed here)
psrset -b <processor set id> <shell/process id>
psradm -f 64-255
psradm -a -n
The last two commands ensure that all the soft rings run on one socket. Now we easily get more than 1.5 million packets/sec.

T5440 10GbE throughput (1000 threads, 64 byte write)
# ports   TX (packets/s)
4
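The CPU ranges used in the workaround above (0-63 for one socket, 64-255 for the rest) follow from the processor layout: each UltraSPARC T2 Plus has 8 cores with 8 hardware threads each, and the T5440 has four sockets. The sketch below assumes that virtual CPU IDs are numbered consecutively socket by socket, which is what the commands above imply. The helper name is ours.

```python
SOCKETS = 4            # T5440: four UltraSPARC T2 Plus sockets
CORES_PER_SOCKET = 8   # cores per UltraSPARC T2 Plus
THREADS_PER_CORE = 8   # hardware threads per core
CPUS_PER_SOCKET = CORES_PER_SOCKET * THREADS_PER_CORE


def socket_cpu_range(socket):
    """First and last virtual CPU ID of a socket.

    Assumes IDs are assigned consecutively, socket by socket.
    """
    first = socket * CPUS_PER_SOCKET
    return first, first + CPUS_PER_SOCKET - 1


for s in range(SOCKETS):
    first, last = socket_cpu_range(s)
    print(f"socket {s}: CPUs {first}-{last}")
```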

CR 6449810 will address the throughput that can be achieved over a single PCIe bus. Using two ports on the same NIC gave us 11.2 Gbps. Once this CR is integrated in Solaris, we expect a throughput increase of ~15%.

In summary, the Sun SPARC Enterprise T5440 is an outstanding piece of technology that supports highly multithreaded applications, as demonstrated by its great network performance.

Monday Nov 19, 2007

Google's custom 10 GigE switches

Just read an article from Nyquist Capital about Google designing its own 10 GigE switches. It's interesting how the authors traced thousands of SFP+ components back to Google to determine this. The article discusses how the strategy is very similar to Google designing its own servers and compute farms from basic components, as opposed to buying servers from companies like Sun. The article also mentions Google adding 5K+ 10 GigE ports a month to manage its 500,000+ compute nodes.

That's a ton of money being saved, given the cost of 10 GigE switches in the market today. Which brings us to the question: what is so proprietary about a switch that vendors like Cisco, Juniper, Woven, etc. can sell them at a huge price and a huge margin? Does IOS-X from Cisco have some patented stuff that open source software cannot implement or doesn't have? I would love to see the above Google technique revolutionize the switch market and motivate some startup to design a switch fabric based on commodity hardware.

This blog discusses my work as a performance engineer at Sun Microsystems. It touches upon key topics of performance issues in operating systems and the Solaris Networking stack.

