CPU Utilization in networking - the TCP transmit path

I am planning to write a series of blogs on CPU utilization in networking. Here is the first one, focusing on the TCP transmit path. This blog article primarily considers Solaris, although equivalent concepts can be applied to other Unix-based operating systems.

Let us first examine CPU utilization in Solaris. Here are the results for network I/O using TCP transmit. Our System Under Test (SUT) is a Sun Fire X4440 server, a 16-core, 4-socket AMD Opteron-based system with 64 GB of memory and one Myricom 10 Gigabit Ethernet card. It is connected to 15 Sun Fire V20z clients (each using one on-board 1 GigE NIC) via a Cisco Catalyst 6500 switch. We use uperf 1.0.2 in these measurements. The profile is a bulk-throughput-oriented profile, and the write size is varied across runs. The results are collected using the Network Characterization Suite (NCS) that we have developed in our group, and which will be open-sourced soon. Very briefly, cycles/Kbyte is measured using DTrace and describes how many CPU cycles are required to transmit each KByte of data; usr/sys/idle is measured using vmstat, while intr is measured using another DTrace script. Throughput is reported by uperf at the end of the 120-second run.
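
For reference, intrstat(1M) reports similar per-CPU interrupt times; a minimal DTrace sketch in that spirit (not the actual NCS script, and with an arbitrary 10-second window) looks roughly like this:

# Rough sketch of per-CPU time spent in interrupt handlers, in the spirit of intrstat(1M).
dtrace -qn '
    sdt:::interrupt-start               { self->ts = vtimestamp; }
    sdt:::interrupt-complete /self->ts/ { @ns[cpu] = sum(vtimestamp - self->ts); self->ts = 0; }
    tick-10sec                          { printa("cpu %d spent %@d ns in interrupts\n", @ns); exit(0); }'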

Here are the results:
TCP Transmit tests using uperf with write size = 64K
#conn    Wnd     (usr/sys/intr/idle)     cycles/Kbyte    Throughput
1        256k    (0/3/0/96)              11792          928.11Mb/s
4        256k    (0/4/0/95)              3365           3.71Gb/s
8        256k    (0/7/1/92)              2815           7.42Gb/s
32       256k    (0/8/1/91)              2793           9.22Gb/s
100      256k    (0/9/1/90)              3161           9.24Gb/s
400      32k     (0/24/3/74)             8392           8.93Gb/s
1000     32k     (0/24/3/74)             8406           8.80Gb/s
2000     32k     (0/31/4/68)             12869          7.01Gb/s
4000     32k     (0/32/5/67)             14418          6.44Gb/s
6000     32k     (0/35/5/63)             17053          6.37Gb/s

TCP Transmit tests using uperf with msg size = 8K
#conn    Wnd     (usr/sys/intr/idle)     cycles/Kbyte    Throughput
1        256k    (0/4/0/95)              14259          896.23Mb/s
4        256k    (0/5/0/93)              4276           3.72Gb/s
8        256k    (0/10/1/89)             4385           7.13Gb/s
32       256k    (0/14/2/85)             4951           8.46Gb/s
100      256k    (0/16/2/83)             5515           8.11Gb/s
400      32k     (0/29/3/69)             10738          7.46Gb/s
1000     32k     (0/31/4/68)             11388          7.31Gb/s
2000     32k     (0/36/6/62)             16818          6.44Gb/s
4000     32k     (0/37/7/61)             14951          6.28Gb/s
6000     32k     (1/34/5/64)             18752          6.19Gb/s

TCP Transmit tests using uperf with msg size = 1K
#conn    Wnd     (usr/sys/intr/idle)     cycles/Kbyte    Throughput
1        256k    (0/4/1/95)              13450          915.02Mb/s
4        256k    (0/21/4/77)             19239          3.53Gb/s
8        256k    (0/38/6/60)             20890          5.42Gb/s
32       256k    (0/46/8/52)             18792          5.97Gb/s
100      256k    (0/48/8/50)             21831          5.77Gb/s
400      32k     (1/58/9/40)             24547          5.81Gb/s
1000     32k     (1/53/9/45)             31557          4.73Gb/s
2000     32k     (1/51/9/47)             38520          3.89Gb/s
4000     32k     (1/58/11/40)            40116          3.98Gb/s
6000     32k     (1/53/9/45)             40209          3.97Gb/s

The key metric above is cycles/Kbyte. We would like to spend as few cycles as possible for each byte transmitted, so we want this number to be as low as possible. From the results above, we can infer the following about CPU utilization:

(1) CPU efficiency drops (cycles/Kbyte rises) as the number of connections increases. The single-connection case is an exception.

(2) CPU efficiency also drops with a smaller write size.

(3) Most of the CPU time is consumed in the kernel (sys). As the number of connections increases, the usr column goes up too, due to the overhead of so many threads (in uperf, each connection runs on its own thread).

(4) Throughput follows the same trend as CPU efficiency.
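
As a sanity check, cycles/Kbyte is essentially the busy CPU cycles per second divided by the KBytes transmitted per second. For the 32-connection, 64K-write row, assuming for illustration 16 cores at roughly 2.3 GHz (the clock rate is not stated above), the back-of-envelope figure lands close to the measured 2793:

# Rough sanity check of cycles/Kbyte for the 32-connection, 64K-write row.
awk 'BEGIN {
    cores = 16; hz = 2.3e9;   # assumed clock rate, for illustration only
    busy  = 0.09;             # ~9% non-idle (usr + sys) from vmstat
    tput  = 9.22e9 / 8;       # 9.22 Gb/s expressed in bytes per second
    printf("cycles/KByte ~ %.0f\n", busy * cores * hz / (tput / 1024));
}'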

Most of the above is along expected lines, but to understand it better, let us profile CPU utilization for the case of 4000 connections doing 8K-sized writes and the case of 100 connections doing 64K writes. We use the DTrace-based er_kernel to gather the profile data, and then er_print to view the CPU utilization. er_print displays both inclusive CPU utilization (including function calls originating from the mentioned function) and exclusive CPU utilization (excluding all other function calls). The following er_print syntax is used to list functions and sort them in order of inclusive CPU utilization.
er_print -metrics i.%kcycles:e.%kcycles -sort i.%kcycles -function pae4440oneportmyri.er/
Filtering through the data, we gather the following CPU utilization for various function calls.

For 4000 connections with 8K sized writes (Throughput=6.28 Gbps):

Function                 Inclusive CPU Utilization %
write()                  5.28%
tcp_wput_data()          4.88%
myri10ge_onetrack()      6.3%
tcp_rput_data()          7%

For 100 connections with 64K sized writes (Throughput=9.24 Gbps):

Function                 Inclusive CPU Utilization %
write()                  4.62%
tcp_wput_data()          1.82%
myri10ge_onetrack()      1.01%
tcp_rput_data()          2.02%

Ratio of CPU Utilizations normalized to bandwidth:

Function                 Normalized CPU Utilization Ratio
                         (4000 connections, 8K writes / 100 connections, 64K writes)
write()                  1.68
tcp_wput_data()          3.94
myri10ge_onetrack()      9.17
tcp_rput_data()          5.09

Comparing the normalized values, the cost of the write() system call doesn't change much. Copying becomes a little more efficient with an increase in write size from 8K to 64K, and an increase in the number of connections is not expected to add to the cost of write().

tcp_wput_data() becomes more expensive because the effectiveness of Large Segment Offload (LSO) decreases with a higher number of connections, resulting in an increased number of function calls and reduced efficiency. Please read my blog about LSO in the Solaris networking stack here.
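
One way to see this directly is to count how often tcp_wput_data() fires while a test is running; dividing the bytes transmitted in the same window by the call count gives the average bytes handled per call, which shrinks as LSO becomes less effective. A minimal sketch, assuming fbt can instrument tcp_wput_data() on your kernel build:

# Count tcp_wput_data() calls over a 10-second window while a test is running;
# the aggregation is printed automatically when the script exits.
dtrace -n '
    fbt::tcp_wput_data:entry { @calls = count(); }
    tick-10sec               { exit(0); }'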

The driver send routine myri10ge_onetrack() becomes more expensive due to a combination of smaller LSO segments and the increased cost of DMAing the higher number of segments. We observe that, in relative terms, the cost of the driver send operations increases the most (>9x).
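
To gauge how much of this is DMA handle management, one could count DDI DMA bind and unbind calls during a run. The sketch below uses the generic ddi_dma_addr_bind_handle(9F) and ddi_dma_unbind_handle(9F) entry points; the myri10ge driver may use pre-bound buffers or a different binding path, so treat this only as a starting point.

# Count DMA handle bind/unbind operations over a 10-second window.
dtrace -n '
    fbt::ddi_dma_addr_bind_handle:entry { @binds   = count(); }
    fbt::ddi_dma_unbind_handle:entry    { @unbinds = count(); }
    tick-10sec                          { exit(0); }'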

Finally, with a higher number of connections, we observe a TCP ACK ratio of 2:1 instead of the maximum of 8:1 that is possible on a LAN. A lower ACK ratio means a higher number of ACK packets and, consequently, a higher cost in tcp_rput_data().
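
The ACK ratio can be estimated on the transmitter by diffing the TCP MIB counters around a run; the counter names below are the ones printed by netstat on Solaris, and the ratio is only approximate since all TCP traffic on the box is counted.

# Snapshot the TCP MIB counters before and after a run; the ratio of the
# deltas approximates data segments sent per ACK received.
netstat -sP tcp | egrep 'tcpOutDataSegs|tcpInAckSegs'    # before the run
sleep 120                                                # the 120-second uperf run
netstat -sP tcp | egrep 'tcpOutDataSegs|tcpInAckSegs'    # after the run
# ACK ratio ~ delta(tcpOutDataSegs) / delta(tcpInAckSegs)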

In conclusion, CPU efficiency in the transmit path may degrade due to the following factors.

(i) Poor LSO efficiency: this causes a higher number of function calls to drive the same volume of data.
(ii) Higher number of DMA calls: more DMA operations reduce CPU efficiency, since each DMA operation requires binding and later freeing DMA handles, which are expensive operations.
(iii) Poor ACK ratio: an 8:1 ACK ratio leads to a lower volume of TCP ACKs and frees CPU cycles. The ACK ratio is seen to reduce with the increase in connections.
Comments:

[Trackback] I found two really interesting articles about the networking in Solaris: The first one - Examining Large Segment Offload (LSO) in the Solaris Networking Stack - examines the effect of large segment offloading to the cpu utilisation. The second one - CP...

Posted by c0t0d0s0.org on December 17, 2008 at 06:44 PM PST #

An interesting writeup, glad you posted it. I have a few questions/comments.

For the 1K send size tests was TCP_NODELAY set or was the stack trying to create full-sized segments?

In the netperf kstat-based CPU utilization measurement, it is/was necessary to try to account for intr time being measured in parallel with usr/sys/idle rather than all together, so they can overlap (a major bummer IMO). I suspect you may need to do the same thing as I notice that several of your CPU percentage sums end-up over 100%. In some cases by quite a bit.

Is the sense of conclusion #2 correct, or should that read that CPU utilization \*increases\* with smaller write size?

Why the switch to a 32K window halfway through at each write size? Was it really necessary to have a 256K window to get GbE speed for the lower connection counts??

The ratio of 8:1 for ACKs on a LAN stems from the local maximum delayed ACK settings in TCP right?

It does seem a bit odd that with the large numbers of connections the throughput would drop but the CPU utilization would drop as well - did a single core or small number of cores become saturated or some other single-threading point? There is a reason the under-development netperf "omni" tests also attempt to report the ID and utilization of the most utilized "cpu" along with the overall CPU utilization :) Is the driver for the myricom card unable to direct interrupts to multiple CPUs in your setup?

In iii I think you meant to say "is seen to reduce with the increase in connections."

Posted by rick jones on January 07, 2009 at 11:07 AM PST #

Hi Rick,

Thanks a lot for reading this blog, and for your note. Here are the answers to your questions.

TCP_NODELAY was not set; I suspect setting it would lead to much worse CPU utilization with smaller LSO segments.

The intr time is measured using DTrace, while usr/sys/idle is measured using vmstat. Effectively, intr time is the part of sys time that is used for interrupt processing; usr + sys + idle will sum to 100%.

A smaller write size increases CPU utilization because of the higher number of copy operations during the socket write(). A higher number of DMA operations (caused in part by smaller LSO segments) leads to higher CPU utilization as well. Although these two are tied (both caused by the smaller write size), in my mind they are separate things which may be tackled differently.

A smaller number of connections required a 256K window to reach the 10 GigE bandwidth.

The 8:1 ACKs are because of the maximum delayed ACK setting.

The Myricom driver does direct interrupts to multiple CPUs in MSI-X mode. We didn't find any core running out of CPU. This raises the question of why we didn't reach 10 Gbps in all cases, since we always had CPU to spare. I suspect this is because of hardware limitations of the DMA mechanism or the NIC: with smaller LSO segments, the NIC does not drive data fast enough.

HTH. Thanks again for your comments.

Amitab

Posted by guest on January 08, 2009 at 03:31 AM PST #
