Getting more beef from your network

Let us consider the process by which the operating system (OS) handles network I/O. For simplicity we consider the receive path. While there are some differences between Solaris, Linux, and other flavors of Unix, I will try to generalize the steps to construct a high-level representative picture. Here is an outline of the steps:

1. When packets are received, the Network Interface Card (NIC) performs a Direct Memory Access (DMA) to transfer the data to the main memory. Once the amount of data prescribed by the interrupt coalescing parameter has been received, an interrupt is raised to inform the device driver of this event. The device driver assigns a data structure called the receive descriptor to handle the memory location identified by the DMA.

2. In the interrupt handling context, the device driver handles the packet in the DMA memory. The packet is processed through the network protocol stack (MAC, IP, TCP layers) in the interrupt context and is ultimately copied to the TCP socket buffer. The work of the interrupt handler ends at this stage. Solaris GLDv3-based drivers have a tunable to employ independent kernel threads (also known as soft rings) to handle the network protocol stack so that the interrupt CPU does not become the bottleneck. This is sometimes required on UltraSPARC-based systems because of the large number of cores that they support.

3. The application thread, usually executing as a user-level process, then reads the packet from the socket buffer and processes the data appropriately.

Thus, data transfer between the NIC and the application may involve at least two copies of data: one from the DMA memory to kernel space, and the other from kernel space to user space. In addition, if the application is writing data to the disk, there may be an additional copy of data from memory to the disk. Such a large number of copies has high overhead, particularly when the network transmission line rate is high. Moreover, the CPU becomes increasingly burdened with the large amount of packet processing and copying.
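To make the kernel-to-user copy in step 3 concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the helper names `read_exact` and `demo_receive` are made up for this example, and a local socket pair stands in for a real network connection. Each `recv_into()` call asks the kernel to copy bytes from the socket buffer into a buffer owned by the application.

```python
import socket


def read_exact(sock, nbytes):
    """Read exactly nbytes from sock into a preallocated user-space buffer."""
    buf = bytearray(nbytes)
    view = memoryview(buf)
    received = 0
    while received < nbytes:
        # Each call copies data from the kernel socket buffer into buf.
        n = sock.recv_into(view[received:])
        if n == 0:
            raise ConnectionError("peer closed before all data arrived")
        received += n
    return buf


def demo_receive(size=4096):
    """Hypothetical demo: a local socket pair stands in for the network."""
    sender, receiver = socket.socketpair()
    try:
        sender.sendall(b"x" * size)  # small enough to fit in the socket buffer
        return read_exact(receiver, size)
    finally:
        sender.close()
        receiver.close()
```

Calling `demo_receive()` returns the bytes that were written on the other end of the pair. The copy performed inside `recv_into()` is the one that the techniques below try to reduce or avoid.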

The following techniques have been studied to improve the end-system performance.

Protocol Offload Engines (POE)
Offload engines implement the functionality of the network protocol in on-chip hardware (usually in the NIC), which reduces the CPU overhead for network I/O. The most common offload engines are TCP Offload Engines (TOEs). TOEs have been demonstrated to deliver higher throughput as well as reduce the CPU utilization for data transfers. Although POEs improve network I/O performance, they do not completely eliminate the I/O bottleneck, as the data still must be copied to the application buffer space.

Moreover, TOE has numerous weaknesses, because of which operating system support for it has been limited. Patches to provide TOE support in Linux were rejected for many reasons, which are documented here. The main reasons are: (i) difficulty of applying security updates, since TOE resides firmly in hardware; (ii) inability of TOE to perform well under stress; (iii) vulnerability to SYN flooding attacks; and (iv) difficulty of long-term kernel maintenance as TOE hardware evolves.

Zero-Copy Optimizations
Zero-copy optimizations, such as the sendfile() implementation in Linux 2.4, aim to reduce the number of copy operations between kernel and user space. As an example, in sendfile(), only one copy of data occurs when data is transferred from the file to the NIC. Numerous zero-copy enabled versions of TCP/IP have been developed, and implementations are available for Linux, Solaris, and FreeBSD. A limitation of most zero-copy implementations is the amount of data that may be transferred. As an example, a single sendfile() call on Linux transfers at most about 2 GB. Although zero-copy techniques improve performance, they do not eliminate the contention for end-system resources.
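As a rough illustration of how an application asks for this path, here is a small Python sketch. Python's `socket.sendfile()` uses the operating system's sendfile() call where it is available and falls back to ordinary send() otherwise, so whether a given run is truly zero-copy depends on the platform. The helper names `send_file` and `demo_sendfile` are made up for this example, and a local socket pair stands in for a real connection.

```python
import os
import socket
import tempfile


def send_file(sock, path):
    """Send a whole file over sock; returns the number of bytes sent."""
    with open(path, "rb") as f:
        # Uses os.sendfile() where supported, so the file data need not
        # pass through a user-space buffer.
        return sock.sendfile(f)


def demo_sendfile(size=4096):
    """Hypothetical demo: send a small temporary file over a socket pair."""
    data = os.urandom(size)
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    sender, receiver = socket.socketpair()
    try:
        sent = send_file(sender, path)
        received = b""
        while len(received) < sent:
            chunk = receiver.recv(65536)
            if not chunk:
                break
            received += chunk
        return sent, received == data
    finally:
        sender.close()
        receiver.close()
        os.unlink(path)
```

The demo keeps the file small so that the whole transfer fits in the socket buffer and a single thread can play both sender and receiver.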

Remote DMA (RDMA)
The RDMA protocol implements both POEs and zero-copy optimizations.
RDMA allows data to be directly written to/read from the application buffer without the involvement of the CPU or OS. It thus avoids the overhead of the network protocol stack and context switches, and allows transfers to continue in parallel with other executing tasks. However, apart from cluster computing environments, the acceptance of RDMA has been rather limited because of the need of a separate networking infrastructure. Moreover, RDMA has security concerns, particularly in the setting of remote end-to-end data transfers.

Large Send Offload (LSO)/ Large Receive Offload (LRO)
LSO and LRO are NIC features that allow the network protocol stack to process large (up to 64 KB) segments. The NIC has hardware features to split the segments into 1500-byte MTU packets for send (LSO) and to combine incoming MTU-sized packets into a large segment for receive (LRO). LSO and LRO help save CPU cycles consumed in the network protocol stack because a single call can handle a 64 KB segment. LSO/LRO are supported in most NICs and are known to improve the CPU efficiency of networking considerably.
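To get a feel for the savings, the following Python sketch works out how many wire packets a single large segment turns into. It is back-of-the-envelope arithmetic, not a model of any particular NIC: the function name `split_segment` is made up, and the 40-byte header figure assumes IPv4 and TCP headers without options.

```python
def split_segment(payload_len, mtu=1500, header_len=40):
    """Return the payload sizes of the packets a large segment is split into.

    header_len=40 is an assumption: 20 bytes of IPv4 header plus 20 bytes
    of TCP header, with no options.
    """
    mss = mtu - header_len  # payload bytes that fit in one MTU-sized packet
    full, rest = divmod(payload_len, mss)
    return [mss] * full + ([rest] if rest else [])


sizes = split_segment(64 * 1024)
print(len(sizes), "packets; last one carries", sizes[-1], "bytes")
```

Under these assumptions a 64 KB segment becomes 45 packets, so with LSO the protocol stack is traversed once instead of 45 times for the same data.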

Transport Protocol Mechanisms
There are several approaches to optimizing TCP performance. Most focus on improving the Additive Increase Multiplicative Decrease (AIMD) congestion control algorithm of TCP, which is sometimes inefficient at very high bit-rates because a single packet loss may quench the transfer rate. Also, the congestion control algorithm in TCP has been demonstrated not to scale in high Bandwidth Delay Product (BDP) settings (connections with high bandwidth and Round Trip Time (RTT)).
To remedy these problems, a large variety of TCP variants that improve on the congestion control algorithm have been proposed, such as FAST, High-Speed TCP (HS-TCP), Scalable TCP, BIC-TCP, and many others.
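A quick calculation shows why a single loss hurts so much on a high-BDP path. The Python sketch below is a simplified model, not a simulation of a real TCP stack: it assumes the window is halved on a loss and then grows by one packet per RTT, and it ignores slow start, delayed ACKs, and further losses. The function name `aimd_recovery` and the example link parameters are chosen for illustration.

```python
def aimd_recovery(bandwidth_bps, rtt_s, packet_bytes=1500):
    """Return (window in packets at line rate, seconds to recover from one loss).

    Simplified AIMD model: the window is halved on loss and then grows by
    one packet per round trip until it is back at the line-rate window.
    """
    window = bandwidth_bps * rtt_s / (8 * packet_bytes)  # BDP in packets
    rtts_to_recover = window / 2  # one packet regained per RTT
    return window, rtts_to_recover * rtt_s


window, seconds = aimd_recovery(10e9, 0.1)  # 10 Gbit/s link, 100 ms RTT
print(f"window = {window:.0f} packets, recovery = {seconds / 60:.0f} minutes")
```

For a 10 Gbit/s path with a 100 ms RTT, this model gives a window of roughly 83,000 packets and more than an hour to climb back to full rate after one loss, which is the kind of behavior the TCP variants listed above try to avoid.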


Which copy is more expensive? App <--> kernel or kernel <--> NIC?
DMA should use less CPU, right? Can we offload some of this to, say,
a GPU? Or do some page-table remapping magic to reduce the copy overhead?

Posted by Neel on February 29, 2008 at 04:15 AM PST #

DMA doesn't take much CPU since it is managed by the DMA engine. However, DMA binding (attaching a descriptor to the DMA memory) is what will eat the CPU. You cannot offload this binding since you are dealing with main memory.

Usually user to kernel copy should be more expensive than DMA binding.

I am not very sure how socketwrite() is implemented in different OSs. I suspect that it will be doing a simple copy of bits, and will not do any page copying/flipping stuff. I think part of the hard stuff when it comes to dealing with pages is that you have to be very careful about page boundaries.

Posted by amitab on February 29, 2008 at 04:35 AM PST #

You briefly mention scalability issues with stock TCP over long fat networks. Are any of the improved congestion control algorithms that you mention available for Solaris 10? We're a Solaris-only shop and we're currently having performance issues between two Solaris 10 end-point boxes separated by a large WAN circuit. Sure would be nice if Solaris 10 had "pluggable" congestion control modules like Linux. Thanks for your time.

Posted by Jeff on March 05, 2008 at 01:47 PM PST #

Wrt the previous comment by Jeff, I do not believe Solaris 10 has support for pluggable TCP variants. There should be implementations in OpenSolaris, and I will do some research and then post at this forum. If you believe that TCP variants on Solaris 10 are crucial to your business, you may want to convey this to your sales representative. Thanks for your comment.

Posted by Amitab on March 08, 2008 at 10:14 AM PST #

Hi Amitabha,

Thanks for this information.

Krishna Manoharan

Posted by Krishna Manoharan on December 21, 2008 at 01:39 PM PST #


This blog discusses my work as a performance engineer at Sun Microsystems. It touches upon key topics of performance issues in operating systems and the Solaris Networking stack.

