Tuesday Apr 07, 2009

Benchmarking (and improving the benchmark) for RSA decrypt

I wrote about our work on RSA decrypt in OpenSolaris two posts ago. One of the biggest obstacles we faced while improving RSA is that the bignum library (which RSA uses for the expensive multiplication routines) is a kernel module. Anything in kernel land is much harder to play around with -- to analyze, improve, debug, and measure -- so we ported the kernel code to a userland program. We have something ready now, and the userland code looks exactly like what is in the current version of bignum in OpenSolaris.

More interestingly, we can use this code as a simple benchmark for CPU performance. The current code has been performance-tuned only for x86_64 (CMT systems have hardware crypto accelerators, so the software performance of crypto algorithms is less interesting there). As an example, here are the results for RSA1024 decrypt on various systems (and processors):

Sun Fire X4150 (Intel Xeon 5450 3.16 GHz Processor) 493544 nsec
Sun Fire X4600 (AMD Opteron(tm) Processor 885, 2.6 GHz) 557934 nsec
Sun Fire X4200 (AMD Opteron(tm) Processor 280, 2.31 GHz) 606147 nsec
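A minimal sketch of how such per-call timings can be taken in userland. The harness below is illustrative, not the downloadable benchmark, and work() is only a hypothetical stand-in for the RSA1024 decrypt routine:

```c
#include <stdio.h>
#include <sys/time.h>

/*
 * Illustrative timing harness: reports the average wall-clock
 * nanoseconds per call of fn over iters calls. work() is a stand-in
 * for the real RSA1024 decrypt routine.
 */
static volatile unsigned long sink;

void
work(void)
{
	unsigned long acc = 0;
	int k;

	for (k = 0; k < 1000; k++)
		acc += (unsigned long)k * k;
	sink = acc;	/* defeat dead-code elimination */
}

long
time_nsec_per_call(void (*fn)(void), int iters)
{
	struct timeval t0, t1;
	long usec;
	int k;

	gettimeofday(&t0, NULL);
	for (k = 0; k < iters; k++)
		fn();
	gettimeofday(&t1, NULL);
	usec = (t1.tv_sec - t0.tv_sec) * 1000000L +
	    (t1.tv_usec - t0.tv_usec);
	return (usec * 1000L / iters);	/* nsec per call */
}
```

Calling `printf("%ld nsec\n", time_nsec_per_call(work, 10000));` prints a per-call figure in the same form as the numbers above.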


The benchmark code is available for download here.

However, our main objective remains to improve this benchmark, achieve better performance on RSA decrypt, and then deliver it to OpenSolaris. This effort has already led to a CR (which will be fixed soon in OpenSolaris). Hopefully you can contribute more improvements to the code. Please send in your ideas by e-mail or post a comment on this blog.

Here is how the benchmark performs on an X4150 (2-socket quad-core Intel Xeon 3.16 GHz processor). The hottest function is big_mul_add_vec, which is already implemented in assembly for x64. The analysis was done using the collect/er_print tools available in Sun Studio 12.

%collect ./bignum_test
Creating experiment database test.1.er ...

%er_print -functions test.1.er
Functions sorted by metric: Exclusive User CPU Time

Excl.     Incl.      Name  
User CPU  User CPU         
  sec.      sec.      
51.956    51.956     <Total>
26.899    26.899     big_mul_add_vec
 9.236    47.623     big_mont_mul
 4.953    11.428     big_sqr_vec
 3.663    18.863     big_mul
 1.601     1.601     big_mul_set_vec
 1.181    49.234     big_modexp_ncp_int
 0.921     0.921     big_sub_vec
 0.570     0.570     big_sub_pos
 0.380     0.380     rand_r
 0.370     0.911     genrandomstring
 0.280     3.613     big_mul_vec
 0.240     0.240     big_cmp_abs
 0.220     1.301     big_div_pos
 0.170     0.170     big_mulhalf_high
 0.160     0.160     _free_unlocked
 0.160     0.160     big_copy
 0.160     0.540     rand
 0.120     0.120     big_mulhalf_low
 0.120     0.140     mutex_unlock
 0.070     0.130     _malloc_unlocked
 0.070     0.640     big_sub_pos_high
 0.050     0.050     big_shiftleft
 0.050     0.050     sigon
 0.040     0.070     mutex_lock_impl
 0.030     0.030     _smalloc
 0.030     0.300     big_init1
 0.030     0.030     cleanfree
 0.030     0.030     gettimeofday
 0.020     0.090     big_cmp_abs_high
 0.020     0.310     big_finish
 0.020     0.020     big_n0
 0.020     0.290     free
 0.020     0.270     malloc
 0.020     0.090     mutex_lock
 0.010     0.010     big_add_abs
 0.010     0.010     big_numbits
 0.010     0.010     big_shiftright
 0.        0.        _init
 0.        0.        _rt_boot
 0.        0.        _setup
 0.       51.956     _start
 0.       51.016     big_modexp_crt_ext
 0.       50.065     big_modexp_ext
 0.        0.200     big_mont_conv
 0.        0.490     big_mont_rr
 0.        0.        call_init
 0.        0.        dtrace_dof_init
 0.        0.        ioctl
 0.       51.956     main
 0.        0.        setup

Tuesday Mar 24, 2009

Improving RC4 encrypt performance.

RC4 is one of the best-known and most commonly used ciphers. It is a very small and simple algorithm that can be executed very fast. The RC4 encrypt routine (as it was in OpenSolaris) is the following:

Inputs: in (input stream), out (output stream), key (RC4 key), len (length of the streams).

Version 1
i = key->i;
j = key->j;
arr = key->arr;
for (ii = 0; ii < len; ++ii) {
	++i;
	ti = arr[i];
	j = j + ti;
	arr[i] = arr[j];
	arr[j] = ti;
	out[ii] = in[ii] ^ arr[(arr[i] + arr[j]) & 0xff];	/* Line 1 */
}

We went through an interesting exercise trying to optimize this loop. We analyzed its performance using the er_kernel/analyzer tools available in Sun Studio 12. Here is a snapshot from analyzer depicting the stalls (marked in green) from Line 1. Two stalls are observed during load operations: one before the arr[i] + arr[j] add in Line 1, and the other before the XOR operation in Line 1.

Looking through the RC4 encrypt code available in OpenSSL (rc4_enc.c), we observed that it orders the steps a little differently. Adopting that into OpenSolaris, the code would be:

Version 2
i = key->i;
j = key->j;
arr = key->arr;
for (ii = 0; ii < len; ++ii) {
	++i;
	ti = arr[i];
	j = j + ti;
	tj = arr[j];
	arr[j] = ti;
	arr[i] = tj;
	out[ii] = in[ii] ^ arr[(ti + tj) & 0xff];	/* Line 1 */
}

This version stores the contents of arr[i] and arr[j] from the initial loads and reuses them in Line 1. The compiler was not able to do the same automatically for us because arr is accessed through a pointer, so it must assume the stores may alias the later loads. Analyzing this code with er_kernel/analyzer, we get the following output:
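The hazard can be seen in miniature. In the first variant below the compiler must re-load arr[i] and arr[j] after the stores, because the pointer accesses may alias; the second keeps the loaded values in locals, as Version 2 does. This is a hypothetical side-by-side sketch, not the OpenSolaris code:

```c
#include <string.h>

/*
 * Hypothetical miniature of the aliasing hazard. Both functions do the
 * same swap and lookup; only the second lets the compiler avoid
 * re-loading arr[i] and arr[j] after the stores.
 */
unsigned char
step_reload(unsigned char *arr, unsigned char i, unsigned char j)
{
	unsigned char ti = arr[i];

	arr[i] = arr[j];
	arr[j] = ti;
	return (arr[(arr[i] + arr[j]) & 0xff]);	/* arr[i], arr[j] re-loaded */
}

unsigned char
step_cached(unsigned char *arr, unsigned char i, unsigned char j)
{
	unsigned char ti = arr[i], tj = arr[j];

	arr[i] = tj;
	arr[j] = ti;
	return (arr[(ti + tj) & 0xff]);		/* no re-load needed */
}
```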

Now the stalls have moved to the XOR step. The XOR depends on arr[(ti + tj) & 0xff] being computed before it can execute, and this dependency causes the stall.

We can rearrange the loop in the following way to avoid the stall. This code is now available at OpenSolaris.org and in the latest version of Solaris Express Community Edition (Nevada).

Version 3
++i;
ti = arr[i];
j = j + ti;
tj = arr[j];
arr[j] = ti;
arr[i] = tj;
arr_ij = arr[(ti + tj) & 0xff];
--len;

for (ii = 0; ii < len;) {
	++i;
	ti = arr[i];
	j = j + ti;
	tj = arr[j];
	arr[j] = ti;
	arr[i] = tj;

	/* save result from previous loop: */
	out[ii] = in[ii] ^ arr_ij;

	++ii;
	arr_ij = arr[(ti + tj) & 0xff];
}
/* save result from last loop: */
out[ii] = in[ii] ^ arr_ij;

This code spaces out the computation of arr_ij and the XOR operations, thereby avoiding the stall and allowing for more pipelining. Analyzing this code in er_kernel, we can observe that the stalls for the XORs are gone. A couple of stalls remain, which are hard to avoid because of the dependencies between the steps of RC4 encrypt. Here is the output.
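The transformation can be checked end to end in userland. The following self-contained sketch adds a standard RC4 key setup (which the post omits) to the Version 1 and Version 3 loops so their outputs can be compared byte for byte; the struct and function names are illustrative, not the OpenSolaris ones:

```c
#include <stddef.h>
#include <string.h>

/* Illustrative RC4 state; not the OpenSolaris key structure. */
typedef struct {
	unsigned char i, j;
	unsigned char arr[256];
} rc4_key_t;

/* standard RC4 key-scheduling algorithm */
void
rc4_init(rc4_key_t *key, const unsigned char *kbuf, int klen)
{
	unsigned char j = 0, t;
	int k;

	key->i = key->j = 0;
	for (k = 0; k < 256; k++)
		key->arr[k] = (unsigned char)k;
	for (k = 0; k < 256; k++) {
		j = (unsigned char)(j + key->arr[k] + kbuf[k % klen]);
		t = key->arr[k];
		key->arr[k] = key->arr[j];
		key->arr[j] = t;
	}
}

/* Version 1: straightforward loop */
void
rc4_crypt_v1(rc4_key_t *key, const unsigned char *in, unsigned char *out,
    size_t len)
{
	unsigned char i = key->i, j = key->j, ti;
	unsigned char *arr = key->arr;
	size_t ii;

	for (ii = 0; ii < len; ++ii) {
		++i;
		ti = arr[i];
		j = (unsigned char)(j + ti);
		arr[i] = arr[j];
		arr[j] = ti;
		out[ii] = in[ii] ^ arr[(arr[i] + arr[j]) & 0xff];
	}
	key->i = i;
	key->j = j;
}

/* Version 3: the XOR consumes the value computed in the previous iteration */
void
rc4_crypt_v3(rc4_key_t *key, const unsigned char *in, unsigned char *out,
    size_t len)
{
	unsigned char i = key->i, j = key->j, ti, tj, arr_ij;
	unsigned char *arr = key->arr;
	size_t ii;

	if (len == 0)
		return;
	++i;
	ti = arr[i];
	j = (unsigned char)(j + ti);
	tj = arr[j];
	arr[j] = ti;
	arr[i] = tj;
	arr_ij = arr[(ti + tj) & 0xff];
	--len;
	for (ii = 0; ii < len;) {
		++i;
		ti = arr[i];
		j = (unsigned char)(j + ti);
		tj = arr[j];
		arr[j] = ti;
		arr[i] = tj;
		out[ii] = in[ii] ^ arr_ij;	/* result from previous loop */
		++ii;
		arr_ij = arr[(ti + tj) & 0xff];
	}
	out[ii] = in[ii] ^ arr_ij;		/* result from last loop */
	key->i = i;
	key->j = j;
}
```

Encrypting under the well-known test key "Key" lets the two versions be cross-checked against each other.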

The following graph shows the improvement we get for RC4 encrypt from Version 1 to Version 3 on a Sun Fire X4150 server. It is interesting to observe that a 15% performance gain can be achieved simply by restructuring the code to avoid stalls.

About RSA decrypt performance

About two months ago, we started looking at some common cryptography algorithms and investigating their performance on our x86-based servers. Crypto performance is important, particularly because x86 processors do not have the crypto hardware accelerators that CMT systems have.

RSA is the de facto standard for public-key cryptography and is widely used for the key-exchange step in the Transport Layer Security (TLS)/Secure Sockets Layer (SSL) protocols. A client which desires a secure connection with a server negotiates a symmetric-key algorithm such as RC4, Triple DES, or AES. Then, to share the symmetric key, it encrypts the key material using the public key of the server. Only the server can decrypt this, using its private key. Following the key exchange, a secure channel is established by encrypting every communication with the negotiated symmetric cipher.

It turns out that RSA decrypt is one of the most expensive crypto operations a server performs. RSA decrypt involves modular exponentiation, i.e., computing a^q mod n, where a, q, and n are all huge numbers. RSA1024 requires this operation with 512-bit a and q and a 1024-bit n. The operation can be simplified a little with a technique known as Montgomery reduction, but that too requires 512-bit multiplications. Thus, improving RSA decrypt is often the key to improving how many sessions a web server can scale to.
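As a toy model of the a^q mod n computation, here is square-and-multiply modular exponentiation on machine words; the real bignum code applies the same structure to 512-bit operands, with Montgomery reduction replacing the % operator. This sketch assumes n < 2^32 so the intermediate products fit in 64 bits:

```c
#include <stdint.h>

/*
 * Square-and-multiply modular exponentiation: a^q mod n, scanning the
 * exponent bits from least significant to most significant. Assumes
 * n < 2^32 so that the products below fit in a uint64_t.
 */
uint64_t
modexp(uint64_t a, uint64_t q, uint64_t n)
{
	uint64_t result = 1 % n;

	a %= n;
	while (q != 0) {
		if (q & 1)
			result = (result * a) % n;	/* multiply step */
		a = (a * a) % n;			/* square step */
		q >>= 1;
	}
	return (result);
}
```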

We started out by comparing RSA decrypt as provided by the Solaris Crypto Framework (SCF) with that of OpenSSL, a publicly available SSL library. It turned out that OpenSSL 0.9.8j has much-improved RSA decrypt performance compared to OpenSSL 0.9.8a, the version shipped with Solaris and some flavours of Linux. A comparison of the two versions using the OpenSSL speed benchmark, compiled with Sun Studio 12 on a Sun Fire X4150 server, is shown below. It is very interesting to note that, on the same hardware, RSA decrypt performs 3 times faster with the new version of OpenSSL.



So what buys RSA decrypt such a large improvement? Looking through the OpenSSL changelog, bn_mont_mul(), the Montgomery multiplication operation, was introduced as an assembly routine for x86_64 in 0.9.8h. Montgomery multiplication is the most vital step in RSA decrypt; it allows modular multiplication (a*b mod n) to be computed with plain multiplications and bit-shift operations instead of expensive divisions. An optimized assembly implementation of it in OpenSSL 0.9.8j leads to the 3x faster RSA decrypt. This assembly is available at /crypto/bn/asm/x86_64-mont.pl in the OpenSSL source.
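To illustrate the idea at word size: the sketch below implements Montgomery reduction (REDC) with R = 2^32 and an odd modulus under 2^31, so reduction modulo n costs only multiplies and shifts, no division. It is a hypothetical miniature, not OpenSSL's bn_mont_mul(); the real code does the same on 512-bit values:

```c
#include <stdint.h>

/*
 * Word-sized Montgomery multiplication: R = 2^32, odd modulus n < 2^31
 * (so intermediate sums fit in 64 bits).
 */
static uint32_t
neg_inv32(uint32_t n)		/* -n^{-1} mod 2^32, n odd */
{
	uint32_t x = n;		/* n*n == 1 (mod 8) for odd n */
	int k;

	for (k = 0; k < 4; k++)
		x *= 2 - n * x;	/* each Newton step doubles the precision */
	return ((uint32_t)(0U - x));
}

static uint32_t
redc(uint64_t t, uint32_t n, uint32_t np)
{
	uint32_t m = (uint32_t)t * np;		  /* t * (-n^-1) mod R */
	uint64_t u = (t + (uint64_t)m * n) >> 32; /* exact division by R */

	return ((uint32_t)(u >= n ? u - n : u));
}

/* a*b mod n (a, b < n) computed via the Montgomery domain */
uint32_t
mont_modmul(uint32_t a, uint32_t b, uint32_t n)
{
	uint32_t np = neg_inv32(n);
	uint64_t r = ((uint64_t)1 << 32) % n;
	uint32_t r2 = (uint32_t)((r * r) % n);		/* R^2 mod n */
	uint32_t am = redc((uint64_t)a * r2, n, np);	/* to Montgomery form */
	uint32_t bm = redc((uint64_t)b * r2, n, np);
	uint32_t cm = redc((uint64_t)am * bm, n, np);	/* in-domain product */

	return (redc(cm, n, np));			/* back to normal form */
}
```

In the real RSA path the conversion into the Montgomery domain is done once per exponentiation, so the per-multiplication cost is just one REDC.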

When we compared the OpenSSL 0.9.8j numbers with the speed of RSA decrypt provided by SCF (measured using the pkcs11 engine), we found that the latter was about 2x slower. The pkcs11 engine can be used with OpenSSL with the following command:

/usr/bin/amd64/openssl -engine pkcs11

To measure RSA performance:

/usr/bin/amd64/openssl speed rsa -engine pkcs11

Investigating its performance with the er_kernel/analyzer tools available in Sun Studio 12, we found that the biggest CPU consumer was the big_mul_add_vec32() routine, which does the multiplications. Dan Anderson worked through a number of changes as part of the fix for CR 6799218, in which the bignum data structure was changed to support 64-bit chunks instead of 32-bit chunks. With 64-bit multiplication support in 64-bit processors, the number of multiply operations drops by a factor of 4, and this gave us nearly a 1.8x gain for RSA decrypt. We also found an unnecessary copy operation in the big_mul() routine and took it out as the fix for CR 6811474. You can check out the new big_mul() code at OpenSolaris.org. The following figure shows RSA decrypt performance with the OpenSolaris pkcs11 library on an X4150 server, before and after the change.
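The shape of such a routine can be sketched at user level: a 64-bit-chunk multiply-accumulate in the spirit of big_mul_add_vec, using the gcc/clang unsigned __int128 extension for the 64x64 -> 128 product. The name and signature are illustrative, not the actual OpenSolaris ones:

```c
#include <stdint.h>

/*
 * r[0..len-1] += a[0..len-1] * d, returning the carry-out chunk.
 * With 64-bit chunks, one pass here covers what four 32x32
 * multiplications covered with 32-bit chunks.
 */
uint64_t
mul_add_vec64(uint64_t *r, const uint64_t *a, int len, uint64_t d)
{
	uint64_t carry = 0;
	int i;

	for (i = 0; i < len; i++) {
		unsigned __int128 t =
		    (unsigned __int128)a[i] * d + r[i] + carry;
		r[i] = (uint64_t)t;		/* low 64 bits stay in place */
		carry = (uint64_t)(t >> 64);	/* high 64 bits propagate */
	}
	return (carry);
}
```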



This exercise shows how big a difference performance-tuned software can make. There are usually a few hot spots in the code which are executed far more often than the rest; in crypto operations, the hot spots are very prominent. As Amdahl's law suggests, it is important to prioritize performance work on the areas that give the best overall system improvement, and targeting hotspots is a great example of this. In my next blog, I will discuss how we tuned another crypto operation, RC4 encrypt, for better performance.

Thursday Feb 19, 2009

Trip Report from IOM-2009

Last weekend I participated in IOM-2009, a workshop on the Influence of I/O on Microprocessor Architecture, co-located with the High Performance Computer Architecture (HPCA) 2009 conference in Raleigh, NC. Ram Huggahalli, Principal Engineer and Platform Architect in the Communication Technology Lab at Intel, served as the workshop chair. I must say that this workshop was very well organized. There were 8 presentations, 45 minutes each, leaving ample time for valuable discussions and feedback.

Challenges
The workshop addressed the challenge of how to provide more I/O to systems, particularly with 40 GigE and 100 GigE being standardized very soon. Here are the challenges as listed by the workshop chair.

1) Making I/O an integral part of chip/system design instead of having it as a peripheral device. Networking and other I/O need to be integrated with microprocessor design instead of being thought of as peripheral devices.
2) Making some kind of revolutionary change in the way network I/O is performed, because (i) memory access won't become faster, and (ii) demand for I/O (networking) will only increase, and the current approach will probably not scale.

Summary of the main viewpoints expressed in the workshop:
1) A presentation from Sandia National Labs demonstrated experimental data on pre-release Nehalem systems. Main points: (i) Nehalem shows much-improved network I/O because of reduced local-memory latency; (ii) NUMAness in Nehalem plays a major role: many commercial benchmarks perform very differently depending on whether memory is located locally or remotely. Memory placement is controlled in Linux with numactl.

2) The main stumbling block for network I/O: avoiding the copy required from user space to kernel space on transmit, and vice versa on receive. Strategies presented:
(i) Cache injection (2 papers: one from IBM Labs, Zurich, and one from the Univ. of Victoria): the strategy is to inject received data directly into the cache so that when receive() is issued there is no cache miss and the data is readily available. The challenge is that data displaced from the cache will cause cache misses, so it is hard to come up with an algorithm suited to all workloads. The presenters from the Univ. of Victoria presented strategies in the context of MPI running on an IBM Cell processor.

(ii) IOMMU to drive hardware accelerators (1 paper from Univ. of Pittsburgh and Intel): using IOMMU hardware so that a hardware accelerator device can access physical memory directly by supplying a virtual address. The presenter demonstrated this approach in the context of a USB drive.

(iii) Creating a DMA cache (Chinese Academy of Sciences): having a separate cache to hold I/O data before it is read by the application, so the primary cache is not affected.

(iv) Intelligent NICs (Virginia Tech): NICs which can interact with the CPU and transfer data when required.

3) Other interesting papers:
(i) Using network processors to create virtual NICs (Univ. of Massachusetts, Lowell).
(ii) Active end-system analysis for finding the bottleneck rate for receive network I/O: this work is mainly from UC Davis, while I contributed to the theoretical part. It demonstrates the importance of pacing on the transmit side and illustrates how to compute the receiver-side bottleneck using a stochastic model. Slides are available here.

About

This blog discusses my work as a performance engineer at Sun Microsystems. It touches upon key topics of performance issues in operating systems and the Solaris Networking stack.
