Thursday Feb 28, 2008

Getting more beef from your network

Let us consider the process by which the operating system (OS) handles network I/O. For simplicity we consider the receive path. While there are some differernces between Solaris, Linux, and other flavors of unix, I will try to generalize the steps to construct a high-level representative picture. Here is an outline of the steps:

1. When packets are received, the Network Interface Card (NIC) performs a Direct Memory Access (DMA) to transfer the data to the main memory. Once a sufficient size of data prescribed by the interrupt coalescing parameter is received, an interrupt is raised to inform the device driver of this event. The device driver assigns a data structure called the receive descriptor to handle the memory location identified by the DMA.

2. In the interrupt handling context, the device driver handles the packet in the DMA memory. The packet is processed through the network protocol stack (MAC, IP, TCP layers) in the interrupt context and is ultimately copied to the TCP socket buffer. The work of the interrupt handler ends at this stage. Solaris GLDv3 based drivers have a tunable to employ independent kernel threads (also known as soft rings) to handle the network protocol stack so that the interrupt CPU does not become the bottleneck. This is is sometimes required on the UltraSparc based systems because of the large number of cores that they support.

3. The application thread, usually executing as a user-level process, then reads the packet from the socket buffer and processes the data appropriately.

Thus, data transfer between the NIC and the application may involve at least two copies of data: one from the DMA memory to kernel space, and the other from kernel space to user space. In addition, if the application is writing data to the disk, there may be an additional copy of data from memory to the disk. Such a large number of copies has high overhead, particularly when the network transmission line rate is high. Moreover, the CPU becomes increasingly burdened with the large amount of packet processing and copying.

The following techniques have been studied to improve the end-system performance.

Protocol Offload Engines (POE)
Offload engines implement the functionality of the network protocol in on-chip hardware (usually in the NIC), which reduces the CPU overhead for network I/O. The most common offload engines are TCP Offload Engines (TOEs). TOEs have been demonstrated to deliver higher throughput as well as reduce the CPU utilization for data transfers. Although POEs improve network I/O performance, they do not completely eliminate the I/O bottleneck, as the data still must be copied to the application buffer space.

Moreover TOE has numerous vulnerabilities because of which it is not supported by any operating system. Patches to provide TOE support to Linux were rejected for many reasons, which are documented here. The main reasons are: (i)Difficulty of patching security updates since TOE resides firmly in hardware, (ii) Inability of ToE to perform well under stress, (iii)Vulnerabilities to SYN flooding attacks, (iv)Difficulties in longterm kernel maintenance with evolving dimensions of TOE hardware.

Zero-Copy Optimizations
Zero-copy optimizations such as the sendfile() implementation in Linux 2.4 , aim to reduce the number of copy operations between kernel and user space. As an example, in sendfile(), only one copy of data occurs when data is transferred from the file to the NIC. Numerous zero-copy enabled versions of TCP/IP have been developed and implementations are available for Linux, Solaris, and FreeBSD. A limitation of most zero-copy implementations is the amount of data that may be transferred. As an example, sendfile() has the limitation of a maximum file size of 2 GB. Although zero-copies improve performance, they do not eliminate the contention for end-system resources.

Remote DMA (RDMA)
The RDMA protocol implements both POEs and zero-copy optimizations.
RDMA allows data to be directly written to/read from the application buffer without the involvement of the CPU or OS. It thus avoids the overhead of the network protocol stack and context switches, and allows transfers to continue in parallel with other executing tasks. However, apart from cluster computing environments, the acceptance of RDMA has been rather limited because of the need of a separate networking infrastructure. Moreover, RDMA has security concerns, particularly in the setting of remote end-to-end data transfers.

Large Send Offload (LSO)/ Large Receive Offload (LRO)
LSO and LRO are NIC features to allow the network protocol stack to process large (up to 64 KB) segments. The NIC has hardware features to split the segments into 1500 byte MTU packets for send (LSO) and combine incoming MTU sized packets into a large segment for receive (LRO). LSO and LRO help save CPU cycles consumed in the network protocol stack because a single call can handle a 64 KB segment. LSO/LRO are supported in most NICs and are known to improve the CPU efficiency of networking considerably.

Transport Protocol Mechanisms
There are several approaches to optimizing TCP performance. Most focus on improving the Additive Increase Multiplicative Decrease (AIMD) congestion control algorithm of TCP which is sometimes less inefficient at very high bit-rates, because a single packet-loss may quench the transfer rate. Also, the congestion control algorithm in TCP has been demonstrated to be not scalable in high Bandwidth Delay Product (BDP) settings (connections with high bandwidth and Round Trip Time (RTT)).
To improve these remedies, a large variety of TCP-variants which improve on the congestion control algorithm have been proposed, such as FAST, High-Speed TCP (HS-TCP), Scalable TCP, BIC-TCP, and many others.


This blog discusses my work as a performance engineer at Sun Microsystems. It touches upon key topics of performance issues in operating systems and the Solaris Networking stack.


« April 2014