Monday Oct 13, 2008

Networking on the Sun SPARC Enterprise T5440

Today Sun launched the industry's first four-socket CMT platform based on the UltraSPARC T2 Plus (Victoria Falls) processor: the Sun SPARC Enterprise T5440 server. It joins Sun's existing CMT line, which consists of the two-socket UltraSPARC T2 Plus based Sun SPARC Enterprise T5140/T5240, the UltraSPARC T2 based Sun SPARC Enterprise T5120/T5220, and the UltraSPARC T1 based Sun SPARC Enterprise T2000 server.

The Sun SPARC Enterprise T5440 server has an impressive network I/O capacity: eight x8 PCIe slots, two of which have x16 connectors for graphics. These are connected to four PCIe x8 switches. Under the PCI Express® 1.1 specification, each x8 switch provides 16 Gbit/s of bandwidth per direction (8 lanes at 2.5 Gbit/s with 8b/10b encoding). In practice, this translates to about 12 Gbit/s of payload in each direction for 10 GigE networking, once PCIe protocol overhead and networking protocol headers are taken into account. In this blog we investigate the performance of 10 GigE networking on this impressive system and provide some tuning guidelines and best networking practices. Our aim is to maximize network performance with four Sun Multithreaded 10 GigE Network Interface Cards (NICs).
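The per-switch figures above can be sanity-checked with a little arithmetic; the 75% payload-efficiency factor is an assumption for illustration, not a measured number:

```python
# Back-of-the-envelope bandwidth for one PCIe 1.1 x8 switch
lanes = 8
lane_rate_gbps = 2.5        # PCIe 1.1 signalling rate per lane
encoding = 8 / 10           # 8b/10b encoding: 80% of raw bits carry data
link_gbps = lanes * lane_rate_gbps * encoding   # 16.0 Gbit/s per direction

payload_efficiency = 0.75   # assumed PCIe + TCP/IP header overhead factor
payload_gbps = link_gbps * payload_efficiency   # ~12 Gbit/s per direction
print(link_gbps, payload_gbps)  # 16.0 12.0
```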


To maximize throughput, we want the NIC traffic to flow through independent PCIe x8 switches, so we place the cards in slots 0, 2, 5, and 7. These NICs are connected back to back to four Sun Fire X4150 machines (a two-socket Intel Xeon platform). We installed OpenSolaris build 98 on our System Under Test (SUT), the Sun SPARC Enterprise T5440, and use Uperf 1.0.2 to characterize network performance. Uperf is installed on the SUT and on all the client machines. All the results in this blog use the regular-sized MTU (1500 bytes). Results with a jumbo MTU (9000 bytes) are much better in terms of both CPU utilization and throughput and will be discussed in a later blog.
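For reference, a uperf profile along the lines of our 64 KB write workload looks like this (an illustrative sketch: the profile name, remote address, and run duration are assumptions, not our exact test harness):

```xml
<?xml version="1.0"?>
<profile name="tcp-64k-write">
  <group nthreads="64">
    <transaction iterations="1">
      <flowop type="connect" options="remotehost=192.168.1.10 protocol=tcp"/>
    </transaction>
    <transaction duration="120s">
      <flowop type="write" options="size=64k"/>
    </transaction>
    <transaction iterations="1">
      <flowop type="disconnect"/>
    </transaction>
  </group>
</profile>
```

Each client runs uperf in slave mode (uperf -s), and the SUT drives one such profile per 10 GigE interface (uperf -m profile.xml).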


T5440 10GbE throughput (64 threads per interface, 64K sized writes)

# ports (NIC interfaces)   TX (Gbps)   RX (Gbps)   CPU util. (%)
1                           9.37        9.46        -
2                          18.75       18.68       18
3                          26.92       27.88       28
4                          31.5        26.21       38

This shows that the Sun SPARC Enterprise T5440 scales very well in terms of both throughput and CPU utilization. We can achieve an astonishing 30 Gbps from a single system with more than half of the box idle for your cool applications to run. Something like this was unimaginable 5 years ago. That's the power of CMT :).


We use the following tunings for the above results:
1) Large Segment Offload (LSO):
LSO enables the network stack to deal with large segments, leaving the job of fragmentation into MTU-sized packets to the lower layers. The NIC may provide features to perform this segmentation in hardware, which is known as hardware LSO. Alternatively, the driver may perform the segmentation, which is known as soft LSO. LSO helps reduce system CPU utilization by reducing the number of trips through the network stack needed to transmit data.

The Sun Multithreaded 10 GigE driver (nxge) provides support for soft LSO which may be enabled by uncommenting the following line from /platform/sun4v/kernel/drv/nxge.conf

soft-lso-enable = 1;

LSO is a recommended feature for higher throughput and/or lower CPU utilization, though it may lead to higher packet latency.

2) Soft rings:
OpenSolaris has MSI-X support; by default, interrupts fan out to 8 CPUs per NIC. At high throughput, however, the interrupt CPUs may become saturated. To avoid this bottleneck, we recommend enabling 16 soft rings by adding the following line to /etc/system.

set ip:ip_soft_rings_cnt=16

Soft rings reduce the load on the interrupt CPU by taking over the job of pushing received packets up the network protocol stack. As with LSO, soft rings may lead to higher packet latency.
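Conceptually, a soft ring is just a worker thread draining a per-ring queue that the interrupt handler feeds, so the interrupt CPU only enqueues and returns. A toy Python sketch of the fan-out idea (purely illustrative, not the Solaris implementation):

```python
import queue
import threading

NUM_SOFT_RINGS = 4
rings = [queue.Queue() for _ in range(NUM_SOFT_RINGS)]
processed = []
lock = threading.Lock()

def soft_ring_worker(q):
    # Stand-in for a soft ring: drain the queue and do the "protocol
    # processing" that would otherwise run on the interrupt CPU
    while True:
        pkt = q.get()
        if pkt is None:          # sentinel: shut the ring down
            break
        with lock:
            processed.append(pkt)

workers = [threading.Thread(target=soft_ring_worker, args=(q,)) for q in rings]
for w in workers:
    w.start()

# "Interrupt handler": hash each packet to a ring and return immediately
for pkt in range(100):
    rings[pkt % NUM_SOFT_RINGS].put(pkt)

for q in rings:
    q.put(None)
for w in workers:
    w.join()
print(len(processed))  # 100
```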

3) TCP settings:
We recommend the following TCP settings for higher throughput:
ndd -set /dev/tcp tcp_xmit_hiwat 262144
ndd -set /dev/tcp tcp_recv_hiwat 262144
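These ndd tunables raise the system-wide default high-water marks for the TCP send and receive buffers. An individual application can request equivalently sized buffers per connection through the portable sockets API; a minimal sketch (not Solaris-specific, and the kernel may round the requested value):

```python
import socket

HIWAT = 262144  # 256 KB, matching the ndd tunables above

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Request 256 KB send and receive buffers for this one connection
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, HIWAT)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, HIWAT)
sndbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
print(sndbuf)  # the kernel may report a rounded or doubled value
s.close()
```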

On the receive side, we do not scale perfectly from 3 to 4 ports because of the higher cost of sending small ACK packets as the number of connections grows. A fix that will resolve this problem is in the works (see the bug discussion below).

The XAUI card, which uses the on-board 10 GigE, performs almost identically to the Sun Multithreaded 10 GigE NIC. Here are the results.

T5440 10GbE throughput (16 threads per port, 64K sized writes)

# ports (NIC interfaces)   TX (Gbps)   RX (Gbps)
1                          9.33        9.32

Now, let me talk about some bugs that are still in the works. You can easily track them by their CR numbers.

CR 6311743 will improve callout table scalability in Solaris. This issue affects small-packet performance at large numbers of connections and high throughput. Callout tables are used in Solaris to maintain TCP timeout entries. In the case of TCP receive, a very large number of small ACK packets must be transmitted when both the connection count and the throughput are high. The results in the next table show relatively poor TCP receive performance at 500 connections per port using 4 ports (2000 connections in all). Once integrated in Solaris, this CR will resolve the problem.

T5440 10GbE throughput (500 threads per port, 64K sized messages)

# ports (NIC interfaces)   TX (Gbps)   RX (Gbps)
4                          30.26       14.60

CR 6644935 will address scalability of the credentials data structure. This bug also affects small-packet performance. As an example, the following results demonstrate small-packet (64 byte) transmit using UDP.

T5440 10GbE throughput (250 threads per port, 64 byte writes)

# ports (NIC interfaces)   TX (packets/s)
4                          …

While we fix this CR and integrate it in Solaris, a simple workaround is to bind the application to one UltraSPARC T2 Plus socket:

psrset -c 0-63            # create a processor set from the first socket's CPUs; note the set id it prints
psrset -b <set-id> <pid>  # bind the application shell/process to that set
psradm -f 64-255          # take the remaining CPUs offline
psradm -a -n              # bring all CPUs back online

The last two commands ensure that all the soft rings run on one socket. Now we easily get more than 1.5 million packets/sec.

T5440 10GbE throughput (1000 threads, 64 byte writes)

# ports (NIC interfaces)   TX (packets/s)
4                          …

CR 6449810 will address the throughput that may be achieved from a single PCIe bus. Using two ports on the same NIC gave us 11.2 Gbps. Once this CR is integrated in Solaris, we expect a throughput increase of ~15%.

In summary, the Sun SPARC Enterprise T5440 is an outstanding piece of technology for highly multithreaded applications, as demonstrated by its great network performance.

Thursday Feb 28, 2008

Getting more beef from your network

Let us consider the process by which the operating system (OS) handles network I/O. For simplicity we consider the receive path. While there are some differences between Solaris, Linux, and other flavors of Unix, I will try to generalize the steps to construct a high-level representative picture. Here is an outline of the steps:

1. When packets are received, the Network Interface Card (NIC) performs a Direct Memory Access (DMA) transfer of the data to main memory. Once enough data has arrived, as prescribed by the interrupt coalescing parameters, an interrupt is raised to inform the device driver of the event. The device driver tracks the DMA memory locations through data structures called receive descriptors.

2. In the interrupt handling context, the device driver handles the packet in the DMA memory. The packet is processed through the network protocol stack (MAC, IP, TCP layers) in the interrupt context and is ultimately copied to the TCP socket buffer, at which point the interrupt handler's work ends. Solaris GLDv3 based drivers have a tunable to employ independent kernel threads (also known as soft rings) to handle the network protocol stack so that the interrupt CPU does not become the bottleneck. This is sometimes required on UltraSPARC based systems because of the large number of cores they support.

3. The application thread, usually executing as a user-level process, then reads the packet from the socket buffer and processes the data appropriately.

Thus, data transfer between the NIC and the application may involve at least two copies of data: one from the DMA memory to kernel space, and the other from kernel space to user space. In addition, if the application is writing data to the disk, there may be an additional copy of data from memory to the disk. Such a large number of copies has high overhead, particularly when the network transmission line rate is high. Moreover, the CPU becomes increasingly burdened with the large amount of packet processing and copying.

The following techniques have been studied to improve the end-system performance.

Protocol Offload Engines (POE)
Offload engines implement the functionality of the network protocol in on-chip hardware (usually in the NIC), which reduces the CPU overhead for network I/O. The most common offload engines are TCP Offload Engines (TOEs). TOEs have been demonstrated to deliver higher throughput as well as reduce the CPU utilization for data transfers. Although POEs improve network I/O performance, they do not completely eliminate the I/O bottleneck, as the data still must be copied to the application buffer space.

Moreover, TOE has numerous drawbacks because of which it is not supported in mainstream operating systems. Patches to provide TOE support in Linux were rejected for many reasons, which are documented here. The main ones are: (i) difficulty of applying security updates, since TOE resides firmly in hardware; (ii) inability of TOE to perform well under stress; (iii) vulnerability to SYN flooding attacks; and (iv) difficulty of long-term kernel maintenance as TOE hardware evolves.

Zero-Copy Optimizations
Zero-copy optimizations, such as the sendfile() implementation in Linux 2.4, aim to reduce the number of copy operations between kernel and user space. For example, with sendfile() only one copy of the data occurs as it moves from the file to the NIC. Numerous zero-copy enabled versions of TCP/IP have been developed, and implementations are available for Linux, Solaris, and FreeBSD. A limitation of most zero-copy implementations is the amount of data that may be transferred; for example, sendfile() had a maximum file size of 2 GB. Although zero-copy techniques improve performance, they do not eliminate the contention for end-system resources.
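A minimal sketch of the idea using Python's wrapper around the sendfile(2) system call (illustrative only: a local socket pair stands in for a real network peer, and the file size is arbitrary):

```python
import os
import socket
import tempfile

payload = b"x" * 65536  # 64 KB of file data to ship out

with tempfile.TemporaryFile() as f:
    f.write(payload)
    f.flush()
    a, b = socket.socketpair()
    # The file's pages flow to the socket inside the kernel; no user-space
    # read()/write() copy of the payload is involved
    sent = os.sendfile(a.fileno(), f.fileno(), 0, len(payload))
    received = b.recv(len(payload), socket.MSG_WAITALL)
    a.close()
    b.close()

print(sent, len(received))
```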

Remote DMA (RDMA)
The RDMA protocol implements both POEs and zero-copy optimizations.
RDMA allows data to be directly written to/read from the application buffer without the involvement of the CPU or OS. It thus avoids the overhead of the network protocol stack and context switches, and allows transfers to continue in parallel with other executing tasks. However, apart from cluster computing environments, the acceptance of RDMA has been rather limited because of the need of a separate networking infrastructure. Moreover, RDMA has security concerns, particularly in the setting of remote end-to-end data transfers.

Large Send Offload (LSO)/ Large Receive Offload (LRO)
LSO and LRO are NIC features that allow the network protocol stack to process large (up to 64 KB) segments. The NIC has hardware features to split the segments into 1500-byte MTU packets on transmit (LSO) and to coalesce incoming MTU-sized packets into a large segment on receive (LRO). LSO and LRO help save CPU cycles consumed in the network protocol stack because a single pass can handle a 64 KB segment. LSO/LRO are supported by most NICs and are known to improve the CPU efficiency of networking considerably.
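The saving is easy to quantify: one 64 KB segment replaces dozens of per-packet passes through the stack (assuming IPv4 and TCP headers with no options):

```python
import math

MTU = 1500
HEADERS = 40                 # 20-byte IPv4 header + 20-byte TCP header
SEGMENT = 64 * 1024          # one LSO/LRO segment

payload_per_packet = MTU - HEADERS              # 1460 bytes of data per packet
packets = math.ceil(SEGMENT / payload_per_packet)
print(packets)  # 45: one pass through the stack instead of ~45
```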

Transport Protocol Mechanisms
There are several approaches to optimizing TCP performance. Most focus on improving the Additive Increase Multiplicative Decrease (AIMD) congestion control algorithm of TCP, which can be inefficient at very high bit rates because a single packet loss quenches the transfer rate. The congestion control algorithm has also been demonstrated not to scale in high Bandwidth Delay Product (BDP) settings (connections with high bandwidth and Round Trip Time (RTT)).
To remedy these issues, a large variety of TCP variants with improved congestion control algorithms have been proposed, such as FAST, High-Speed TCP (HS-TCP), Scalable TCP, BIC-TCP, and many others.


This blog discusses my work as a performance engineer at Sun Microsystems. It touches upon key topics of performance issues in operating systems and the Solaris Networking stack.

