Wednesday Dec 17, 2008

CPU Utilization in networking - the TCP transmit path

I am planning to write a series of blog posts on CPU utilization in networking. Here is the first one, focusing on the TCP transmit path. This article primarily considers Solaris, although equivalent concepts apply to other Unix-based operating systems.

Let us first examine CPU utilization in Solaris. Here are the results for network I/O using TCP transmit. Our System Under Test (SUT) is a Sun Fire X4440 server, a 16-core, 4-socket AMD Opteron based system with 64 GB of memory and one Myricom 10 GigE card. It is connected to 15 Sun v20z clients (each using one on-board 1 GigE NIC) via a Cisco Catalyst 6500 switch. We use uperf 1.0.2 for these measurements. The profile is a bulk throughput-oriented profile, and the write size is varied across runs. The results are collected using the Network Characterization Suite (NCS) that we have developed in our group, and which will be open-sourced soon. Very briefly, cycles/Kbyte is measured using DTrace and describes how many CPU cycles are consumed to transmit each KByte of data; usr/sys/idle is measured using vmstat, while intr is measured using another DTrace script. Throughput is reported by uperf at the end of the 120-second run.
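
Until NCS is released, the cycles/Kbyte arithmetic is simple enough to approximate by hand from vmstat and the throughput uperf reports: total busy CPU cycles per second divided by KBytes transmitted per second. The sketch below is only a rough illustration, not the NCS script; the CPU count and clock rate are placeholders, so substitute what psrinfo -v reports on your own system.

# Back-of-the-envelope cycles/Kbyte (not the NCS script):
NCPU=16                   # CPUs in the SUT
CLOCK_HZ=2300000000       # placeholder clock rate; check psrinfo -v
BUSY_PCT=10               # 100 - idle, averaged from vmstat during the run
TPUT_MBPS=9240            # throughput reported by uperf, in Mbit/s

KBYTES_PER_SEC=$((TPUT_MBPS * 1000000 / 8 / 1024))
BUSY_CYCLES_PER_SEC=$((NCPU * CLOCK_HZ / 100 * BUSY_PCT))
echo "approx cycles/Kbyte: $((BUSY_CYCLES_PER_SEC / KBYTES_PER_SEC))"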

Here are the results:
TCP Transmit tests using uperf with write size = 64K
#conn    Wnd     (usr/sys/intr/idle)     cycles/Kbyte    Throughput
1        256k    (0/3/0/96)              11792          928.11Mb/s
4        256k    (0/4/0/95)              3365           3.71Gb/s
8        256k    (0/7/1/92)              2815           7.42Gb/s
32       256k    (0/8/1/91)              2793           9.22Gb/s
100      256k    (0/9/1/90)              3161           9.24Gb/s
400      32k     (0/24/3/74)             8392           8.93Gb/s
1000     32k     (0/24/3/74)             8406           8.80Gb/s
2000     32k     (0/31/4/68)             12869          7.01Gb/s
4000     32k     (0/32/5/67)             14418          6.44Gb/s
6000     32k     (0/35/5/63)             17053          6.37Gb/s

TCP Transmit tests using uperf with write size = 8K
#conn    Wnd     (usr/sys/intr/idle)     cycles/Kbyte    Throughput
1        256k    (0/4/0/95)              14259          896.23Mb/s
4        256k    (0/5/0/93)              4276           3.72Gb/s
8        256k    (0/10/1/89)             4385           7.13Gb/s
32       256k    (0/14/2/85)             4951           8.46Gb/s
100      256k    (0/16/2/83)             5515           8.11Gb/s
400      32k     (0/29/3/69)             10738          7.46Gb/s
1000     32k     (0/31/4/68)             11388          7.31Gb/s
2000     32k     (0/36/6/62)             16818          6.44Gb/s
4000     32k     (0/37/7/61)             14951          6.28Gb/s
6000     32k     (1/34/5/64)             18752          6.19Gb/s

TCP Transmit tests using uperf with write size = 1K
#conn    Wnd     (usr/sys/intr/idle)     cycles/Kbyte    Throughput
1        256k    (0/4/1/95)              13450          915.02Mb/s
4        256k    (0/21/4/77)             19239          3.53Gb/s
8        256k    (0/38/6/60)             20890          5.42Gb/s
32       256k    (0/46/8/52)             18792          5.97Gb/s
100      256k    (0/48/8/50)             21831          5.77Gb/s
400      32k     (1/58/9/40)             24547          5.81Gb/s
1000     32k     (1/53/9/45)             31557          4.73Gb/s
2000     32k     (1/51/9/47)             38520          3.89Gb/s
4000     32k     (1/58/11/40)            40116          3.98Gb/s
6000     32k     (1/53/9/45)             40209          3.97Gb/s

The key metric above is cycles/Kbyte. We would like to spend as few CPU cycles as possible per byte transmitted, so we want this number to be as low as possible. From the results above, we can infer the following:

(1) CPU efficiency drops (cycles/Kbyte rises) as the number of connections increases. The single-connection case is an exception.

(2) CPU efficiency also drops with a smaller write size.

(3) Most of the CPU is consumed in the kernel (sys). As the number of connections increases, the usr column creeps up too, due to the overhead of so many threads (in uperf, each connection runs on an independent thread).

(4) Throughput follows the same trend as CPU efficiency.

Most of the above is along expected lines, but to understand it better, let us profile CPU utilization for the case of 4000 connections doing 8K-sized writes and the case of 100 connections doing 64K writes. We use the DTrace-based er_kernel to gather the profile data, and then er_print to view the CPU utilization. er_print displays both inclusive (including calls originating from the listed function) and exclusive (excluding them) CPU utilization. The following er_print syntax lists functions sorted by inclusive CPU utilization:
er_print -metrics i.%kcycles:e.%kcycles -sort i.%kcycles -function pae4440oneportmyri.er/
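
If the Analyzer tools are not at hand, a plain DTrace kernel profile (a standard DTrace idiom, not part of NCS) gives a rough, leaf-weighted view of where kernel CPU time goes, which is a useful cross-check against the inclusive numbers below:

dtrace -n 'profile-997 /arg0/ { @[func(arg0)] = count(); } END { trunc(@, 20); printa(@); }'
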
Filtering through the data, we gather the following CPU utilization for various function calls.

For 4000 connections with 8K sized writes (Throughput=6.28 Gbps):

Function                Inclusive CPU Utilization %
write()                 5.28%
tcp_wput_data()         4.88%
myri10ge_onetrack()     6.3%
tcp_rput_data()         7%

For 100 connections with 64K sized writes (Throughput=9.24 Gbps):

Function                Inclusive CPU Utilization %
write()                 4.62%
tcp_wput_data()         1.82%
myri10ge_onetrack()     1.01%
tcp_rput_data()         2.02%

Ratio of CPU Utilizations normalized to bandwidth:

Function                Normalized CPU Utilization Ratio
                        (4000 connections, 8K writes / 100 connections, 64K writes)
write()                 1.68
tcp_wput_data()         3.94
myri10ge_onetrack()     9.17
tcp_rput_data()         5.09

Comparing the normalized values, the cost of the write() system call doesn't change much. Copying becomes a little more efficient with an increase in write size from 8K to 64K, and an increase in the number of connections is not expected to add to the cost of write().

tcp_wput_data() becomes more expensive because the effectiveness of Large Segment Offload (LSO) decreases with a higher number of connections, resulting in more function calls and reduced efficiency. See the article on LSO in the Solaris networking stack below for details.

The driver send routine myri10ge_onetrack() becomes more expensive due to a combination of smaller LSO segments and the increased cost of DMAing the larger number of segments. Of the functions compared, the cost of the driver send operation increases the most (>9x).

Finally, with a higher number of connections, we observe a TCP ACK ratio of 2:1 instead of the maximum of 8:1 that is possible on a LAN. A lower ACK ratio means more ACK packets and, consequently, a higher cost of tcp_rput_data().
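
A rough way to watch the ACK ratio on the transmit side is to compare per-interval deltas of the data-segments-sent and ACKs-received counters (assuming your netstat exposes the tcpOutDataSegs and tcpInAckSegs MIB counters, as Solaris traditionally does):

netstat -s -P tcp 1 | egrep 'tcpOutDataSegs|tcpInAckSegs'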

In conclusion, CPU efficiency in the transmit path may degrade due to the following factors:

(i) Poor LSO efficiency: this causes a higher number of function calls to drive the same volume of data.
(ii) A higher number of DMA operations: more DMA operations reduce CPU efficiency, since each one requires binding and later freeing a DMA handle, both of which are expensive operations (a rough way to observe this is sketched after this list).
(iii) A poor ACK ratio: an 8:1 ACK ratio keeps the volume of incoming TCP ACKs low and frees CPU cycles; the ACK ratio is seen to drop as the number of connections increases.
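
For factor (ii), a crude per-second gauge of DMA bind activity during a run is an fbt count on the generic DDI bind routine. This assumes the driver binds a DMA handle per transmit; drivers that pre-bind handles or bcopy small packets will not show up here.

dtrace -n 'fbt::ddi_dma_addr_bind_handle:entry { @binds = count(); } tick-1s { printa(@binds); trunc(@binds); }'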

Monday Nov 03, 2008

Examining Large Segment Offload (LSO) in the Solaris Networking Stack

In this blog article, I will share my experience with Large Segment Offload (LSO), one of the recent additions to the Solaris network stack. I will discuss a few observability tools, and also what helps achieve better LSO.

LSO saves valuable CPU cycles by allowing the network protocol stack to handle large segments instead of the traditional model of MSS-sized segments. In the traditional network stack, the TCP layer segments the outgoing data into MSS-sized segments and passes them down to the driver. This becomes computationally expensive with 10 GigE networking because of the large number of kernel function calls required for every MSS segment. With LSO, a large segment is passed by TCP to the driver, and the driver or NIC hardware does the job of TCP segmentation. An LSO segment may be as large as 64 KByte. The larger the LSO segment, the better the CPU efficiency, since the network stack works with fewer segments for the same throughput. The size of the LSO segment is the key metric we will examine in our discussion here.
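
To put a number on that: with a standard Ethernet MSS of 1460 bytes, one full-sized LSO segment stands in for roughly 44 MSS-sized segments' worth of per-segment work in the stack.

echo $((65536 / 1460))      # full 64 KByte LSO segment / standard MSS => 44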

Simply put, the LSO segment size is better (larger) when the thread draining the data can drive as much data as possible. A thread can drive only as much data as is available in the TCP congestion control window. What we need to ensure is that (i) the TCP congestion window is large enough, and (ii) enough data is ready to be transmitted by the draining thread.

It is important to remember that in the Solaris networking stack, packets may be drained by three different threads:

(i) The thread writing to the socket may drain its own and other packets in the squeue.
(ii) The squeue worker thread may drain all the packets in the squeue.
(iii) The thread in the interrupt (or soft ring) may drain the squeue.

The ratio of occurrence of these three drain paths depends on system dynamics. Nevertheless, it is useful to keep them in mind in the context of the discussion below. An easy way to monitor them is to check the stack counts from the following DTrace one-liner:

dtrace -n 'tcp_lsosend_data:entry{@[stack()]=count();}'


Experiments

Our experiment testbed is as follows. We connect a Sun Fire X4440 server (a 16-core, 4-socket AMD Opteron based system) to 15 V20z clients. The X4440 server has PCI-E x8/16 slots. Out of the different possible options for 10 GigE NICs, we chose the Myricom 10 GigE PCI-E card because it supports native PCI-E along with hardware LSO (hardware LSO is more CPU efficient). Another option is the Sun Multithreaded 10 GigE PCI-E card, which supports software LSO. LSO is enabled by default in the Myricom 10 GigE driver; in the Sun Multithreaded 10 GigE driver nxge, it may be enabled by editing the appropriate line in /kernel/drv/nxge.conf.
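
As far as I recall, the relevant nxge.conf line is a software-LSO property along the following lines; the exact property name and default may differ across driver releases, so treat this as an assumption and check the nxge.conf shipped with your driver.

# /kernel/drv/nxge.conf (excerpt; property name assumed)
soft-lso-enable = 1;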

Each of the 15 clients uses one on-board Broadcom 1 GigE NIC. The clients and the server are connected to an independent VLAN on a Cisco Catalyst 6509 switch. All systems run OpenSolaris.

We use the following DTrace script to observe the LSO segment size. It reports the average LSO segment size in bytes every 5 seconds.

bash-3.2# cat tcpavgsegsize.d
#!/usr/sbin/dtrace -s
/*
 * Print the average LSO segment size (in bytes) seen by
 * tcp_lsosend_data() every 5 seconds.
 */

tcp_lsosend_data:entry
{
        @av[0] = avg(arg5);
}

tick-5s
{
        printa(@av);
        trunc(@av);
}


Now let us run a simple experiment. We use uperf to do a throughput test with a profile that drives as much traffic as possible, writing 64 KBytes to the socket and using one connection to each client. We then run the above DTrace script on the server (X4440) during the run. Here is the output:
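
For reference, the profile is along the lines of the netperf-style XML profiles shipped with uperf. The sketch below is only an approximation for a single client host (taken from the $h environment variable) with a 64 KByte write size and a fixed run duration; it is not the exact profile used here.

<?xml version="1.0"?>
<profile name="tcp-throughput-64k">
  <group nthreads="1">
    <transaction iterations="1">
      <flowop type="connect" options="remotehost=$h protocol=tcp"/>
    </transaction>
    <transaction duration="120s">
      <flowop type="write" options="size=64k"/>
    </transaction>
    <transaction iterations="1">
      <flowop type="disconnect"/>
    </transaction>
  </group>
</profile>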

Example 1: Throughput profile with 64K writes and one connection per client.
bash-3.2# ./tcpavgsegsize.d

        0            40426

        0            40760

        0            40530
The above numbers are at 5-second intervals. We are sending roughly 40 KByte LSO segments per transmit, which is much better than one MSS in the traditional network stack.

To demonstrate what helps get better LSO, let us run the same experiment, but with a SPECweb-support-oriented profile instead of the throughput profile. In this profile, uperf writes 64 KBytes to the socket and waits for the receiver to send back 64 bytes before it writes again (emulating the request-response pattern of clients requesting large files from a server). If we measure LSO during the run using the same DTrace script, we get:

Example 2: Specweb profile with 64K writes, one connection per client.
bash-3.2# ./tcpavgsegsize.d

        0            62693

        0            58388

        0            60084


Our LSO segment size increased from about 40K to 60K. The SPECweb-support profile ensures that the next batch of writes to a connection occurs only after the previous one has been read by the client. Since the ACKs of the previous writes have been received by then, the next 64K write sees an empty TCP congestion window and can indeed drain the full 64 KBytes. Note that the LSO segment is very near its maximum possible size of 64K. Indeed, we can get 64K if we use the above profile with only one client. Here is the output:

Example 3: Specweb profile with a single client
 bash-3.2# ./tcpavgsegsize.d

        0            65536

        0            65524

        0            65524


Now let us move back to the throughput profile. If we reduce the write size in the throughput profile to 1 KB instead of 64 KB, we get much worse LSO. With smaller writes, the number of bytes drained by threads (i), (ii), or (iii) is smaller, leading to smaller LSO segments. Here is the result:

Example 4: Throughput profile with 1K writes.
bash-3.2# ./tcpavgsegsize.d

        0            11127

        0            10381

        0            10640


Now let us increase the number of connections per client to 2000. This is a bulk throughput workload. So now we have 30000 connections across 15 clients.

Example 5: Throughput profile with 2000 connections per client
bash-3.2# ./tcpavgsegsize.d

        0             5496

        0             5084

        0             5069
Here the LSO segment is smaller because we are limited by the TCP congestion window. With a larger number of connections, the per-connection TCP congestion window becomes more dependent on the clearing of ACKs, and transmits are more ACK-driven than in any other case.

The other factor to keep in mind is that out-of-order segments or duplicate ACKs may shrink the TCP congestion window. To check for these, use the following commands at the server and client respectively:

netstat -s -P tcp 1 | grep tcpInDup
netstat -s -P tcp 1 | grep tcpInUnorderSegs
Ideally, the number of duplicate ACKs and out-of-order segments should be as close to 0 as possible.

An interesting exercise would be to monitor the ratio of (i), (ii), and (iii) in each of the above cases. Here is the data.

Example     (i)     (ii)    (iii)
1           12%     0%      88%
2           70%     0%      30%
3           98%     0%      2%
4           74%     1%      24%
5           34%     37%     29%


To summarize, we have noted the following about LSO:

(1) A larger LSO segment size improves CPU efficiency, with fewer function calls per byte of data sent out.
(2) A request-response profile helps drive a larger LSO segment size compared to a throughput-oriented profile.
(3) A larger write size (up to 64K) helps drive a larger LSO segment size.
(4) A smaller number of connections helps drive a larger LSO segment size.


Since a blog is a good medium for communication both ways, I appreciate comments and suggestions from readers. Please do post them in this forum or email them to me.