Wednesday Dec 17, 2008

CPU Utilization in networking - the TCP transmit path

I am planning to write a series of blogs on CPU Utilization in networking. Here is the first one focusing on the TCP transmit path. This blog article primarily considers Solaris, although equivalent concepts can be applied to other Unix based operating systems.

Let us first examine CPU utilization in Solaris. Here are the results for network I/O using TCP transmit. Our System Under Test(SUT) is a Sun Fire X4440 server, a 16-core, 4-socket AMD Opteron based system with 64 GB memory, and 1 Myricom 10 Gig Ethernet card. It is connected to 15 Sun v20z clients (using one on-board 1 GigE NIC on each system), via a Cisco Catalyst 6500 switch. We use uperf 1.0.2 in these measurements. The profile is a bulk throughput oriented profile, while the write size is varied. The results are collected using the Network Characterization Suite (NCS) that we have developed in our group, and which will be open-sourced soon. Very briefly, the cycles per second is measured using DTrace, and describes how many CPU cycles are required to transmit every KByte of data. usr/sys/idle is measured using vmstat, while intr is measured using another Dtrace script. Throughput is reported by uperf at the end of the 120 second run.

Here are the results:
TCP Transmit tests using uperf with write size = 64K
#conn    Wnd     (usr/sys/intr/idle)     cycles/Kbyte    Throughput
1        256k    (0/3/0/96)              11792          928.11Mb/s
4        256k    (0/4/0/95)              3365           3.71Gb/s
8        256k    (0/7/1/92)              2815           7.42Gb/s
32       256k    (0/8/1/91)              2793           9.22Gb/s
100      256k    (0/9/1/90)              3161           9.24Gb/s
400      32k     (0/24/3/74)             8392           8.93Gb/s
1000     32k     (0/24/3/74)             8406           8.80Gb/s
2000     32k     (0/31/4/68)             12869          7.01Gb/s
4000     32k     (0/32/5/67)             14418          6.44Gb/s
6000     32k     (0/35/5/63)             17053          6.37Gb/s

TCP Transmit tests using uperf with msg size = 8K
#conn    Wnd     (usr/sys/intr/idle)     cycles/Kbyte    Throughput
1        256k    (0/4/0/95)              14259          896.23Mb/s
4        256k    (0/5/0/93)              4276           3.72Gb/s
8        256k    (0/10/1/89)             4385           7.13Gb/s
32       256k    (0/14/2/85)             4951           8.46Gb/s
100      256k    (0/16/2/83)             5515           8.11Gb/s
400      32k     (0/29/3/69)             10738          7.46Gb/s
1000     32k     (0/31/4/68)             11388          7.31Gb/s
2000     32k     (0/36/6/62)             16818          6.44Gb/s
4000     32k     (0/37/7/61)             14951          6.28Gb/s
6000     32k     (1/34/5/64)             18752          6.19Gb/s

Section: TCP Transmit tests using uperf with msg size = 1K
#conn    Wnd     (usr/sys/intr/idle)     cycles/Kbyte    Throughput
1        256k    (0/4/1/95)              13450          915.02Mb/s
4        256k    (0/21/4/77)             19239          3.53Gb/s
8        256k    (0/38/6/60)             20890          5.42Gb/s
32       256k    (0/46/8/52)             18792          5.97Gb/s
100      256k    (0/48/8/50)             21831          5.77Gb/s
400      32k     (1/58/9/40)             24547          5.81Gb/s
1000     32k     (1/53/9/45)             31557          4.73Gb/s
2000     32k     (1/51/9/47)             38520          3.89Gb/s
4000     32k     (1/58/11/40)            40116          3.98Gb/s
6000     32k     (1/53/9/45)             40209          3.97Gb/s
The key metric to see above is cycles/Kbyte. We would like to spend as few cycles as possible for a bytes of transmission. So we want this number to be as low as possible. From the results above, we can infer the following about CPU utilization:

(1) The CPU utilization drops with increase in number of connections. The single connection case is an exception.

(2) The CPU utilization drops with a smaller write size.

(3) Most of the CPU is consumed in the kernel (sys). With increase in number of connection, the usr column goes up too due to the overhead of so many threads (In uperf, each connection is established on an independent thread).

(4) The throughput follows the same trend as CPU Utilization.

Most of the above is on expected lines, but to understand this better, let us profile CPU utilization for the case of 4000 connections doing 8k sized writes, and the case of 100 connections doing 32k writes. We use dtrace based er_kernel to gather the profile data, and then er_print to view the CPU utilization. er_print displays both inclusive (including function calls origination from the mentioned function), and exclusive (excluding all other function calls). The following syntax of er_print is used to list functions and sort them in order of inclusive CPU utilization.
er_print -metrics i.%kcycles:e.%kcycles -sort i.%kcycles -function
Filtering through the data, we gather the following CPU utilization for various function calls.

For 4000 connections with 8K sized writes (Throughput=6.28 Gbps):

FunctionInclusive CPU Utilization %

For 100 connections with 64K sized writes (Throughput=9.24 Gbps):

FunctionInclusive CPU Utilization %

Ratio of CPU Utilizations normalized to bandwidth:

FunctionNormalized CPU Utilization Ratio (4000 connections, 8K writes/ 100 connections, 32 K writes)

Comparing the normalized values, the cost of the system call write() doesn't change much. Copying becomes a little more efficient with a increase in write size from 8K to 64K. Increase in number of connections is not expected to add to the cost of write().

tcp_wput_data() turns more expensive as the effectiveness of Large Segment Offload (LSO) decreases with higher number of connections, resulting in increased number of function calls and reduced efficiency. Please read my blog about LSO on the Solaris networking stack here.

The driver send routine myri10ge_one_track() turns more expensive due to a combination of smaller LSO segments, and increased cost of DMAing the higher number of segments. We observer that in terms of increase, the cost of driver send operations increases the most (>9x).

Finally, with higher number of connections, we observe a TCP ACK ratio of 2:1, instead of the maximum of 8:1 that is possible on a LAN. A lower ACK ratio leads to higher number of ACK packets and subsequently, a higher cost of tcp_rput_data().

In conclusion, CPU efficiency in the transmit path may reduce due to the following factors.

(i) Poor LSO efficiency: This causes higher number of function calls for driving the same volume of data.
(ii) Higher number of DMA calls: More number of DMA operations leads to reduced CPU efficiency since each DMA operation would require binding and later freeing DMA handles which are expensive operations.
(iii) Poor ACK ratio: A 8:1 ACK ratio leads to lower volume of TCP ACKs and frees CPU cycles. The ACK ratio is seen go reduce with increase in connections.

Sunday Nov 09, 2008

Analyzing the Sun Storage 7000 with Filebench

Today, I am proud to have contributed to the industry's first open storage appliances, the Sun Storage 7000 Unified Storage Systems from Sun Microsystems. Open Storage delivers open-source software on top of standard x86 based commodity hardware delivered as an appliance, at a fraction of cost of propriery hardware. Sun's open-source software brings the added advantage of flexibility, ease of customization, as well as support from Sun Microsystems and a large, growing open-source community.

At a high level, the Sun Storage 7000 has the following components:

1)x86 based Sun Servers: Sun Storage 7000 comes with 1, 2, and 4 sockets of Quad-core AMD-Opteron processors. You can easily configure the amount of processing power and RAM according to your own storage requirements.

2) Open Solaris based appliance software: Along with existing technology of Open Solaris such as ZFS, DTrace, and Zones, the Sun Storage 7000 supports a cool-graphical browser based user-interface, that lets you configure your storage server, set up your storage environment, and then easily monitor various metrics such as CPU, network, and disk usage. I have extensively made use of an excellent feature called Analytics (described later in this blog article).

3) Solid-state devices: Sun Storage 7000 may be configured with different number of solid-state devices, which may act as an extra layer of cache for disk I/O. We have two types of solid-state devices: for read and write operations, named readzillas and logzillas respectively. These devices have been shown to help deliver better read or write performance compared to what SATA 7200rpm disk based storage would sustain. You can read more about these devices in Adam Leventhal's excellent blog.

4) JBODS (Just a Bunch of Disks). Sun Storage 7000 may be connected to arrays of disks. One example of an external disk-array is the Sun Storage J4400 Array , which contains 24 750 GB, 7200 rpm SATA disks.

One of the tools we used to evaluate this platform is Filebench, an open-source framework for simulating applications on file systems. We deploy a Sun Storage 7410, a 4-socket, 16-core system configured with 128 GB RAM, two Sun Storage J4400s, 1 Sun Multithreaded 10 GigE card, 3 logzillas, and 2 readzillas. This appliance is connected to 18 Sun v40z clients via a switch. All clients have a 1 GigE interface.

We configured the storage as a mirrored RAID10. A ZFS filesystem was created for each client and then mounted using NFSv4. While Filebench has support for synchronizing threads, our main challenge was to synchronize different instances of filebench running on the different clients, so that they may simultaneously perform operations on our Storage appliance. We ran a variety of workloads such as single and multi-threaded streaming-read, streaming writes, random-read, and file creation. While no workload can replicate the exact requirements our customers may have, we hope that the above workloads are greatly illustrative of the power of our Sun Storage 7400. You may read Roch's interesting blog on how the workloads were designed.

Coordinating so many clients was no easy task, and we struggled for days writing a script, which is available in the toolkit here. While we have added sufficient comments to the script, it is by no means ready for easy installation and use. Please do communicate with me if you plan to use it. We also monitor CPU, network, and disk metrics on our applicance using Dtrace (Please see 11metrics.d in the toolkit). Using DTrace has minimum but not negligable overhead, so the following results should be 3-5% lower than what we would achieve without DTrace running in the background.

Results from different filebench workloads are as follows:

Test Name (Executed Per Client) Aggregate Metric (18 clients) Network I/O (MBytes/s) CPU Util %
1 thread streaming reads from 20G uncached set, 30 sec 871.0 MBytes/s 936 82
1 thread streaming reads from same set, 30 sec 1012.6 MBytes/s 1086 68
20 threads streaming reads from a different 20G uncached set, 30 sec 924.2 MBytes/s 1008 88
10 threads streaming reads from same set, 30 sec 1005.9 MBytes/s 1071 83
1 thread streaming write, 120 sec 461.5 MBytes/s 589 68
20 threads streaming write, 120 sec 444.5 MBytes/s 503 81
20 threads 8K random read from a different 20G uncached set, 30 sec 6204 IOPS 52 14
128 threads 8K random read from same set, 30 sec 7047 IOPS 57 23
128 threads 8K synchronous writes to 20G set, 120 sec 24555 IOPS 111 73

We use the ZFS record size of 128K for the MBytes/s oriented tests, and 8k record size for the IOPS (I/O Operations per second) based tests. The record size of 8k helps deliver better performance by better allignment of the read/write requests with the record sizes. Please also note that the file sets mentioned above are 20 GBytes per client. So for 18 clients with 20 Gbytes per client, the total working set for these tests was 360 Gbytes.

A few observations from the data:

(1) For streaming read tests, we sustain network I/O of close to 1 GByte/sec. LSO helps us significantly in this regard (Please refer to this blog article on benefits of LSO). Also, please read how our team helped deliver better LSO here.

(2) Reading from a cached dataset, improves streaming read performance by about 15-20%. You may observe that caching helps improve CPU utilization and reduces disk activity.

(3) One thread in one client can read close to 50 MByte/sec. With 18 clients (18 threads) we can get to 900 MBytes/s. Therefore, a multithreaded read per client (20 threads per client, 360 threads) does not increase the read performance by much.

(4) For random reads, we are bottlenecked by the IOPS we can do per disk (which is about 150-180 from a 7200 rpm disk). Using the same system attached to more disks, Roch acheived between 28559/36478 IOPS (cold run/warm run) from a 400GB dataset.

(5) We get great synchronous write performance because the logzillas allow much lesser write latencies than what the raw disk would have provided.

Using analytics, we can easily track how the applicance is behaving for each workload. Here is a screen shot from analytics. You can see the trendy plots of how the appliance behaved during the test. The three charts show NFSv4 operations per second, CPU utilization per second, and disk I/O operations per second. While these charts show aggregate metrics, you can easily break down on a type of operation, per CPU, and per disk basis, respectively.

In conclusion, the Sun Storage 7000 series has integrated a nice bunch of goodies, which combined with open-source software, should give a new direction of cheap, proprietory storage in the years to come.

Thursday Mar 20, 2008

Virtualization and Networking


Since virtualization is one of the hottest areas of growth today, it would be good to blog about virtualization and networking today. This is one of the beauties of blogging, just writing about some topic in a public forum motivates one to do more research and become more thorough and proficient with the subject.

So why is virtualization so hot? It is primarily because as servers grow more and more powerful, virtualization allows consolidation of multiple hosts on one physical system. The benefits of consolidation are many, mainly power and administrative costs saving. These end hosts can be very different operating systems. The challenge is to run each independent of the other. So that the performance of one host is independent of the performance of the other. While they all share resources of the same physical system.

So what's the challenge that virtualization brings to networking. Simply put, sharing I/O is challenging. Why? Consider other components such as CPU and memory. Since modern servers have multiple CPUs, we can simply assign the desired number of CPUs to each host and not allow hosts to touch each other. If a single CPU needed to be shared, that too could be done with a scheduling algorithm that follows some time Division Multiplexed (TDM) like approach. How about memory? Since memory is always managed as virtual memory, all we need to do is play with the paging algorithm. Partition the memory and just be careful about paging algorithms. Now this is not always very simple because of memory locality issues in a system which is Non-Uniform Memory Access (NUMA). But more on that later.

Now let us consider I/O. It is hard to partition peripheral devices across multiple hosts. Consider a Network Interface Card (NIC). Suppose two hosts do network I/O using this NIC simultaneously. Who resolves this conflict? Who coordinates the device instructions so that DMA mappings do not overlap with each other? How to fairly distribute network bandwidth amongst the two hosts? These are challenging problems.

In comes the role of the hypervisor. The hypervisor is a thin layer of software which interfaces between the virtual hosts and the physical machine. Simply put, in a virtualized environment, it is the hypervisor which plays the role of managing all the resources, such as CPU, memory, and I/O, and coordinating all the instructions sent by the virtual hosts.

So now let us talk about virtualization and networking. Here are the prominent ways in which network I/O works over virtualized environments today. The hypervisor plays different roles depending on the solution chosen by the vendor.

Software solutions

Binary Translations:

The idea here is to trap the privileged instructions issued by the guest operating system (OS) at the hypervisor layer and translate them into safe instructions. Binary translations have been historically used by VMWare to support virtualization on unmodified OSs such as Microsoft Windows. The guest OS being completely ignorant of the hypervisor, and issues instructions assuming it is executing on a bare metal x86 box. The hypervisor classifies all instructions issued into two broad categories, those that may be directly executed (called non-privileged) and those that need to be translated (called priviledged). Priviledged instructions are translated on the fly and executed.

The biggest advantage of this technique is that it doesn't require any modification in guest OSs. However, performance often suffers because of the in-flight translation, and therefore the virtualization industry is moving more towards paravirtualization and hardware assisted virtualization.


In paravirtualization, the guest OS is modified to recognize the hypervisor and interact with it. The best example of this technique is in the open source Xen and Solaris XVM. In Solaris XVM, network I/O is handled by the Xen frontend driver whose source code is available here. The frontend driver interacts with the Solaris XVM backend driver (found here) which is running on the control domain, also known as Dom0. Dom0 controls and manages the network and other I/O devices directly. Thus the network path from all guest OSs is Guest OS -> Dom0 -> external world for transmit and in the reverse direction for receive. Dom0 plays the role of the arbitrator when multiple guest domains are conflicting for network I/O.

Paravirtualization typically performs better than binary translations because the hypervisor doesn't have to inspect each and every instruction. Moreover, it works great in cases like guest domain to guest domain communication, since the Dom0 can recognize the same and avoid sending packets to the hardware. However, paravirtualized solutions often require a good design (to ensure that Dom0 does not become a bottleneck as an example), and therefore higher cost of support and maintenance.

Hardware solutions

Intel I/O virtualization and AMD Pacifica virtualization technologies: Since 2006, both Intel and AMD have had hardware support to support virtualization. The hardware provides support to trap any priviledged instruction and send it to the hypervisor. This allows support of unmodified OSs on the Xen hypervisor on supported hardware. As an example, we can now run Windows XP, Solaris and Linux with Solaris XVM in the same box. Support for hardware virtualization although currently an initial step, is expected to grow and become dominant in the coming years. But as of now, paravirtualized solutions are generally seen outperforming hardware assisted solutions.

PCI-Express Technologies- I/O VT
The PCI-Express community is currently standardizing technologies to support multiple OSs running simultaneously within a single computer to natively share PCI-Express devices. There are two main technologies currently undergoing standardization, single-root I/O virtualization and multi-root I/O virtualization. The idea here is to allow an OS handle its own IOV compliant interface over PCI-Express which is also shared by other virtual OSs running in the system. This will allow more parallelism in hardware and reduce the role of the hypervisor in arbitrating amongst multiple OSs competing for the same I/O.

The current industry is in a flux of moving from software based virtualization solutions to hardware assisted ones. How much the performance of hardware solutions will improve over time is difficult to speculate. Therefore, paravirtualized solutions are still expected to be dominant for some time. It is interesting to see most vendors to support both hardware and software solutions for now.


This blog discusses my work as a performance engineer at Sun Microsystems. It touches upon key topics of performance issues in operating systems and the Solaris Networking stack.


« June 2016