Tuesday Mar 24, 2009

Improving RC4 encrypt performance.

RC4 is one of the best-known and most commonly used ciphers. It is a very small and simple algorithm that can be executed very fast. The RC4 encrypt routine (as it was in OpenSolaris) is the following:

Inputs: in (input stream), out (output stream), key (RC4 key), len (length of the streams). The indices i and j are single-byte quantities, so they wrap around modulo 256.

Version 1

    i = key->i;
    j = key->j;
    arr = key->arr;
    for (ii = 0; ii < len; ++ii) {
        i = i + 1;
        ti = arr[i];
        j = j + ti;
        arr[i] = arr[j];
        arr[j] = ti;
        out[ii] = in[ii] ^ arr[(arr[i] + arr[j]) & 0xff];  /* Line 1 */
    }

We went through an interesting exercise trying to optimize this loop. We analyzed its performance using the er_kernel/analyzer tools available in Sun Studio 12. Here is a snapshot from the analyzer depicting the stalls (marked in green) from Line 1. Two stalls are observed during load operations: one before the arr[i] + arr[j] addition in Line 1, and the other before the XOR operation in Line 1.

Looking through the RC4 encrypt code available in OpenSSL (rc4_enc.c), we observed that it orders the steps a little differently. Adapting that into OpenSolaris, the code would be:

Version 2

    i = key->i;
    j = key->j;
    arr = key->arr;
    for (ii = 0; ii < len; ++ii) {
        i = i + 1;
        ti = arr[i];
        j = j + ti;
        tj = arr[j];
        arr[j] = ti;
        arr[i] = tj;
        out[ii] = in[ii] ^ arr[(ti + tj) & 0xff];  /* Line 1 */
    }

This version stores the contents of arr[i] and arr[j] into temporaries during the initial loads, and re-uses them in Line 1. The compiler was not able to do the same automatically for us because arr is accessed through a pointer, so it must assume the stores to arr[i] and arr[j] may alias the later loads. Analyzing this code with er_kernel/analyzer, we get the following output:

Now the stalls have moved to the XOR step. The XOR cannot execute until the load of arr[(ti + tj) & 0xff] completes, and this dependency leads to the stall.

We can re-arrange the loop in the following way to avoid the stall. This code is now available at OpenSolaris.org and in the latest version of Solaris Express Community Edition (Nevada).

Version 3

    i = key->i;
    j = key->j;
    arr = key->arr;

    /* prologue: perform the first swap and look up the first arr_ij */
    i = i + 1;
    ti = arr[i];
    j = j + ti;
    tj = arr[j];
    arr[j] = ti;
    arr[i] = tj;
    arr_ij = arr[(ti + tj) & 0xff];

    for (ii = 0; ii < len - 1; ++ii) {
        i = i + 1;
        ti = arr[i];
        j = j + ti;
        tj = arr[j];
        arr[j] = ti;
        arr[i] = tj;

        /* save the result from the previous iteration: */
        out[ii] = in[ii] ^ arr_ij;

        /* look up the value needed by the next iteration: */
        arr_ij = arr[(ti + tj) & 0xff];
    }

    /* epilogue: save the result from the last iteration */
    out[ii] = in[ii] ^ arr_ij;

This code spaces out the computation of arr_ij and the XOR operations, thereby avoiding the stall and allowing for more pipelining. Analyzing this code in er_kernel, we can observe that the stalls for the XORs are absent. There are still a couple of stalls, which are hard to avoid because of the dependency between the steps of RC4 encrypt. Here is the output.

The following graph shows the improvement that we get for RC4 encrypt from Version 1 to Version 3 on a Sun Fire X4150 server. It is interesting to observe that a 15% performance gain can be achieved by making the code more efficient to avoid stalls.

About RSA decrypt performance

We started looking at some common cryptography algorithms and investigating their performance on our x86-based servers about two months ago. Crypto performance is important, particularly because x86 processors do not have the crypto hardware accelerator that the CMT systems have.

RSA is the de facto standard for public-key cryptography and is widely used for the key-exchange step in the Transport Layer Security (TLS) / Secure Sockets Layer (SSL) protocols. A client desiring a secure connection with a server negotiates a symmetric-key algorithm such as RC4, Triple DES, or AES. Then, in order to share the symmetric session key, the client encrypts it using the server's public key; only the server can decrypt it, using its private key. Following the key exchange, a secure channel is established by encrypting every communication with the negotiated cipher.

It turns out that RSA decrypt is one of the most expensive crypto operations that a server performs. RSA decrypt involves modular exponentiation, i.e., computing a^q mod n, where a, q, and n are all very large numbers. RSA-1024 requires this operation with 512-bit a and q and a 1024-bit n. The operation can be simplified a little with a technique known as Montgomery reduction, but that too requires 512-bit multiplications. Thus, improving RSA decrypt is often the key to improving how many secure sessions a web server can support.

We started out comparing the RSA decrypt provided as part of the Solaris Crypto Framework (SCF) with that of OpenSSL, a publicly available SSL library. It turned out that OpenSSL 0.9.8j has much-improved RSA decrypt performance compared to OpenSSL 0.9.8a, the version shipped with Solaris and some flavours of Linux. A comparison of the two versions using the OpenSSL speed benchmark, compiled with Sun Studio 12 on a Sun Fire X4150 server, is shown below. It is very interesting to note that, on the same hardware, RSA decrypt performs 3 times faster with the newer version of OpenSSL.

So what buys RSA decrypt such a large improvement? Looking through the OpenSSL changelog, bn_mont_mul(), the Montgomery multiplication operation, was introduced as an assembly routine for x86_64 in 0.9.8h. Montgomery multiplication is the most vital step in RSA decrypt; it allows modular multiplication (a*b mod n) to be computed without costly divisions, using only multiplications and bit-shift operations. An optimized assembly implementation of this routine in OpenSSL 0.9.8j leads to the 3x faster RSA decrypt. The assembly is available at crypto/bn/asm/x86_64-mont.pl in the OpenSSL source.

When we compared the OpenSSL 0.9.8j number with the speed of RSA decrypt provided by SCF (measured using the pkcs11 engine), we found that the latter was about 2x slower. The pkcs11 engine can be used with OpenSSL with the following command:

/usr/bin/amd64/openssl -engine pkcs11

To measure RSA performance:

/usr/bin/amd64/openssl speed rsa -engine pkcs11

Investigating the performance of the same with the er_kernel/analyzer tools available in Sun Studio 12, we found that the biggest CPU consumer was the big_mul_add_vec32() routine, which does the multiplications. Dan Anderson worked through a number of changes as part of the fix for CR 6799218, in which the bignum data structure was changed to support 64-bit chunks instead of 32-bit chunks. With 64-bit multiplication support in 64-bit processors, the number of multiply operations is reduced by a factor of 4, and this gave us nearly a 1.8x gain for RSA decrypt. We also found an unnecessary copy operation in the big_mul() routine, and took it out as the fix for CR 6811474. You can check out the new code of big_mul() at OpenSolaris.org. The following figure shows RSA decrypt performance with the OpenSolaris pkcs11 library on an X4150 server, before and after the change.

This exercise shows how big a difference performance-tuned software can make. There are usually a few hot spots of code which get executed far more often than the rest, and in crypto operations the hot spots are very prominent. As Amdahl's law suggests, it is important to prioritize performance work on the areas that give the best overall system improvement, and targeting hot spots is a great example of this. In my next blog, I will discuss how we tuned another crypto operation, RC4 encrypt, for better performance.

Wednesday Dec 17, 2008

CPU Utilization in networking - the TCP transmit path

I am planning to write a series of blog articles on CPU utilization in networking. Here is the first one, focusing on the TCP transmit path. This article primarily considers Solaris, although equivalent concepts apply to other Unix-based operating systems.

Let us first examine CPU utilization in Solaris. Here are the results for network I/O using TCP transmit. Our System Under Test (SUT) is a Sun Fire X4440 server: a 16-core, 4-socket AMD Opteron based system with 64 GB memory and 1 Myricom 10 Gigabit Ethernet card. It is connected to 15 Sun v20z clients (each using one on-board 1 GigE NIC) via a Cisco Catalyst 6500 switch. We use uperf 1.0.2 in these measurements. The profile is bulk-throughput oriented, while the write size is varied. The results are collected using the Network Characterization Suite (NCS) that we have developed in our group, which will be open-sourced soon. Very briefly, cycles/KByte is measured using DTrace and describes how many CPU cycles are required to transmit each KByte of data; usr/sys/idle is measured using vmstat, while intr is measured using another DTrace script. Throughput is reported by uperf at the end of the 120-second run.

Here are the results:
TCP Transmit tests using uperf with write size = 64K
#conn    Wnd     (usr/sys/intr/idle)     cycles/Kbyte    Throughput
1        256k    (0/3/0/96)              11792          928.11Mb/s
4        256k    (0/4/0/95)              3365           3.71Gb/s
8        256k    (0/7/1/92)              2815           7.42Gb/s
32       256k    (0/8/1/91)              2793           9.22Gb/s
100      256k    (0/9/1/90)              3161           9.24Gb/s
400      32k     (0/24/3/74)             8392           8.93Gb/s
1000     32k     (0/24/3/74)             8406           8.80Gb/s
2000     32k     (0/31/4/68)             12869          7.01Gb/s
4000     32k     (0/32/5/67)             14418          6.44Gb/s
6000     32k     (0/35/5/63)             17053          6.37Gb/s

TCP Transmit tests using uperf with msg size = 8K
#conn    Wnd     (usr/sys/intr/idle)     cycles/Kbyte    Throughput
1        256k    (0/4/0/95)              14259          896.23Mb/s
4        256k    (0/5/0/93)              4276           3.72Gb/s
8        256k    (0/10/1/89)             4385           7.13Gb/s
32       256k    (0/14/2/85)             4951           8.46Gb/s
100      256k    (0/16/2/83)             5515           8.11Gb/s
400      32k     (0/29/3/69)             10738          7.46Gb/s
1000     32k     (0/31/4/68)             11388          7.31Gb/s
2000     32k     (0/36/6/62)             16818          6.44Gb/s
4000     32k     (0/37/7/61)             14951          6.28Gb/s
6000     32k     (1/34/5/64)             18752          6.19Gb/s

TCP Transmit tests using uperf with msg size = 1K
#conn    Wnd     (usr/sys/intr/idle)     cycles/Kbyte    Throughput
1        256k    (0/4/1/95)              13450          915.02Mb/s
4        256k    (0/21/4/77)             19239          3.53Gb/s
8        256k    (0/38/6/60)             20890          5.42Gb/s
32       256k    (0/46/8/52)             18792          5.97Gb/s
100      256k    (0/48/8/50)             21831          5.77Gb/s
400      32k     (1/58/9/40)             24547          5.81Gb/s
1000     32k     (1/53/9/45)             31557          4.73Gb/s
2000     32k     (1/51/9/47)             38520          3.89Gb/s
4000     32k     (1/58/11/40)            40116          3.98Gb/s
6000     32k     (1/53/9/45)             40209          3.97Gb/s
The key metric above is cycles/Kbyte: the number of CPU cycles spent to transmit each KByte of data. We want this number to be as low as possible. From the results above, we can infer the following about CPU utilization:

(1) CPU efficiency drops (cycles/Kbyte rises) as the number of connections increases. The single-connection case is an exception.

(2) CPU efficiency also drops with a smaller write size (compare the cycles/Kbyte for 64K, 8K, and 1K writes).

(3) Most of the CPU is consumed in the kernel (sys). As the number of connections increases, the usr column goes up too, due to the overhead of so many threads (in uperf, each connection runs on its own thread).

(4) The throughput follows the same trend as CPU Utilization.

Most of the above is along expected lines, but to understand it better, let us profile CPU utilization for the case of 4000 connections doing 8K-sized writes, and the case of 100 connections doing 64K writes. We use the DTrace-based er_kernel to gather the profile data, and then er_print to view the CPU utilization. er_print displays both inclusive (including function calls originating from the mentioned function) and exclusive (excluding all other function calls) utilization. The following syntax of er_print is used to list functions sorted in order of inclusive CPU utilization.
er_print -metrics i.%kcycles:e.%kcycles -sort i.%kcycles -function pae4440oneportmyri.er/
Filtering through the data, we gather the following CPU utilization for various function calls.

For 4000 connections with 8K sized writes (Throughput=6.28 Gbps):

Function        Inclusive CPU Utilization %

For 100 connections with 64K sized writes (Throughput=9.24 Gbps):

Function        Inclusive CPU Utilization %

Ratio of CPU Utilizations normalized to bandwidth:

Function        Normalized CPU Utilization Ratio (4000 connections, 8K writes / 100 connections, 64K writes)

Comparing the normalized values, the cost of the write() system call doesn't change much. Copying becomes a little more efficient with an increase in write size from 8K to 64K. An increase in the number of connections is not expected to add to the cost of write().

tcp_wput_data() becomes more expensive because the effectiveness of Large Segment Offload (LSO) decreases with a higher number of connections, resulting in more function calls and reduced efficiency. Please read my blog about LSO on the Solaris networking stack here.

The driver send routine myri10ge_one_track() becomes more expensive due to a combination of smaller LSO segments and the increased cost of DMAing the larger number of segments. We observe that, in terms of increase, the cost of the driver send operations grows the most (>9x).

Finally, with a higher number of connections, we observe a TCP ACK ratio of 2:1, instead of the maximum of 8:1 that is possible on a LAN. A lower ACK ratio leads to a higher number of ACK packets and, subsequently, a higher cost of tcp_rput_data().

In conclusion, CPU efficiency in the transmit path may drop due to the following factors.

(i) Poor LSO efficiency: this causes a higher number of function calls for driving the same volume of data.
(ii) A higher number of DMA calls: more DMA operations reduce CPU efficiency, since each DMA operation requires binding and later freeing DMA handles, which are expensive operations.
(iii) Poor ACK ratio: an 8:1 ACK ratio lowers the volume of TCP ACKs and frees CPU cycles; the ACK ratio is seen to drop as the number of connections increases.

Sunday Nov 09, 2008

Analyzing the Sun Storage 7000 with Filebench

Today, I am proud to have contributed to the industry's first open storage appliances, the Sun Storage 7000 Unified Storage Systems from Sun Microsystems. Open Storage delivers open-source software on top of standard x86-based commodity hardware, packaged as an appliance, at a fraction of the cost of proprietary hardware. Sun's open-source software brings the added advantages of flexibility, ease of customization, and support from Sun Microsystems and a large, growing open-source community.

At a high level, the Sun Storage 7000 has the following components:

1) x86-based Sun servers: the Sun Storage 7000 comes with 1, 2, or 4 sockets of quad-core AMD Opteron processors. You can easily configure the amount of processing power and RAM according to your own storage requirements.

2) OpenSolaris based appliance software: along with existing OpenSolaris technologies such as ZFS, DTrace, and Zones, the Sun Storage 7000 provides a browser-based graphical user interface that lets you configure your storage server, set up your storage environment, and then easily monitor various metrics such as CPU, network, and disk usage. I have made extensive use of an excellent feature called Analytics (described later in this blog article).

3) Solid-state devices: the Sun Storage 7000 may be configured with a number of solid-state devices, which act as an extra layer of cache for disk I/O. There are two types of solid-state devices, for read and write operations, named Readzillas and Logzillas respectively. These devices have been shown to deliver better read and write performance than 7200 rpm SATA disk based storage alone would sustain. You can read more about these devices in Adam Leventhal's excellent blog.

4) JBODs (Just a Bunch Of Disks): the Sun Storage 7000 may be connected to arrays of disks. One example of an external disk array is the Sun Storage J4400 Array, which contains 24 750 GB, 7200 rpm SATA disks.

One of the tools we used to evaluate this platform is Filebench, an open-source framework for simulating applications on file systems. We deploy a Sun Storage 7410, a 4-socket, 16-core system configured with 128 GB RAM, two Sun Storage J4400s, 1 Sun Multithreaded 10 GigE card, 3 logzillas, and 2 readzillas. This appliance is connected to 18 Sun v40z clients via a switch. All clients have a 1 GigE interface.

We configured the storage as mirrored RAID10. A ZFS filesystem was created for each client and then mounted using NFSv4. While Filebench has support for synchronizing threads, our main challenge was to synchronize the different instances of Filebench running on the different clients, so that they would simultaneously perform operations on our storage appliance. We ran a variety of workloads such as single- and multi-threaded streaming reads, streaming writes, random reads, and file creation. While no workload can replicate the exact requirements our customers may have, we hope that the above workloads illustrate the power of the Sun Storage 7410. You may read Roch's interesting blog on how the workloads were designed.

Coordinating so many clients was no easy task, and we struggled for days writing a script, which is available in the toolkit here. While we have added sufficient comments to the script, it is by no means ready for easy installation and use; please do contact me if you plan to use it. We also monitor CPU, network, and disk metrics on our appliance using DTrace (please see 11metrics.d in the toolkit). Using DTrace has a minimal but not negligible overhead, so the following results may be 3-5% lower than what we would achieve without DTrace running in the background.

Results from different filebench workloads are as follows:

Test Name (Executed Per Client) Aggregate Metric (18 clients) Network I/O (MBytes/s) CPU Util %
1 thread streaming reads from 20G uncached set, 30 sec 871.0 MBytes/s 936 82
1 thread streaming reads from same set, 30 sec 1012.6 MBytes/s 1086 68
20 threads streaming reads from a different 20G uncached set, 30 sec 924.2 MBytes/s 1008 88
10 threads streaming reads from same set, 30 sec 1005.9 MBytes/s 1071 83
1 thread streaming write, 120 sec 461.5 MBytes/s 589 68
20 threads streaming write, 120 sec 444.5 MBytes/s 503 81
20 threads 8K random read from a different 20G uncached set, 30 sec 6204 IOPS 52 14
128 threads 8K random read from same set, 30 sec 7047 IOPS 57 23
128 threads 8K synchronous writes to 20G set, 120 sec 24555 IOPS 111 73

We use a ZFS record size of 128K for the MBytes/s oriented tests, and an 8K record size for the IOPS (I/O operations per second) based tests. The 8K record size helps deliver better performance by aligning the read/write requests with the record size. Please also note that the file sets mentioned above are 20 GBytes per client; with 18 clients at 20 GBytes each, the total working set for these tests was 360 GBytes.

A few observations from the data:

(1) For streaming read tests, we sustain network I/O of close to 1 GByte/sec. LSO helps us significantly in this regard (Please refer to this blog article on benefits of LSO). Also, please read how our team helped deliver better LSO here.

(2) Reading from a cached dataset improves streaming read performance by about 15-20%. You may observe that caching lowers CPU utilization and reduces disk activity.

(3) One thread on one client can read close to 50 MBytes/sec, so with 18 clients (18 threads) we can get to 900 MBytes/s. As a result, multithreaded reads on each client (20 threads per client, 360 threads in total) do not increase read performance by much.

(4) For random reads, we are bottlenecked by the IOPS we can do per disk (about 150-180 for a 7200 rpm disk). Using the same system attached to more disks, Roch achieved 28559/36478 IOPS (cold run/warm run) from a 400 GB dataset.

(5) We get great synchronous write performance because the Logzillas allow much lower write latencies than the raw disks would have provided.

Using Analytics, we can easily track how the appliance behaves for each workload. Here is a screenshot from Analytics showing how the appliance behaved during the test. The three charts show NFSv4 operations per second, CPU utilization, and disk I/O operations per second. While these charts show aggregate metrics, you can easily break them down by type of operation, per CPU, and per disk, respectively.

In conclusion, the Sun Storage 7000 series integrates a nice bunch of goodies which, combined with open-source software, should offer a new direction away from expensive, proprietary storage in the years to come.

This blog discusses my work as a performance engineer at Sun Microsystems. It touches upon key topics of performance issues in operating systems and the Solaris Networking stack.

