Tuesday Apr 07, 2009

Benchmarking (and improving the benchmark) for RSA decrypt

I wrote about our work on RSA decrypt in OpenSolaris two posts ago. One of the biggest obstacles we faced while improving RSA is that the bignum library, which RSA uses for its expensive multiplication routines, is a kernel module. Anything in kernel land is much harder to analyze, improve, debug, and measure, so we ported the kernel code to a userland program. That port is ready now, and the userland code matches the current version of bignum in OpenSolaris.

More interestingly, we can use this code as a simple benchmark for CPU performance. So far the code has been tuned only for x86_64 (CMT systems have hardware accelerators, so the software performance of crypto algorithms on them does not interest us much). As an example, here are the results for RSA1024 decrypt on various systems and processors:

System          Processor                                RSA1024 decrypt
Sun Fire X4150  Intel Xeon 5450, 3.16 GHz                493544 nsec
Sun Fire X4600  AMD Opteron(tm) Processor 885, 2.6 GHz   557934 nsec
Sun Fire X4200  AMD Opteron(tm) Processor 280, 2.31 GHz  606147 nsec
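
For readers who want to produce comparable numbers, here is a minimal sketch of how a per-operation time in nanoseconds could be measured. It is not the code from the benchmark itself: the names bench_fn, counting_workload, mean_nsec_per_call, and ops_per_sec are hypothetical, and the use of gettimeofday around a fixed-count loop is an assumption about how such a harness might be written.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical signature for the operation being timed. */
typedef void (*bench_fn)(void *ctx);

/* Hypothetical stand-in workload: only counts how often it was called. */
void
counting_workload(void *ctx)
{
	unsigned *calls = ctx;

	(*calls)++;
}

/*
 * Call fn(ctx) `iterations` times and return the mean wall-clock time
 * of one call in nanoseconds. gettimeofday has microsecond resolution,
 * so the result is only meaningful when averaged over many iterations.
 */
double
mean_nsec_per_call(bench_fn fn, void *ctx, unsigned iterations)
{
	struct timeval start, end;
	double elapsed_usec;

	if (iterations == 0)
		return (0.0);

	gettimeofday(&start, NULL);
	for (unsigned i = 0; i < iterations; i++)
		fn(ctx);
	gettimeofday(&end, NULL);

	elapsed_usec = (double)(end.tv_sec - start.tv_sec) * 1e6 +
	    (double)(end.tv_usec - start.tv_usec);
	return (elapsed_usec * 1000.0 / iterations);
}

/* Convert a per-operation latency in nanoseconds to operations per second. */
double
ops_per_sec(double nsec_per_op)
{
	return (1e9 / nsec_per_op);
}
```

Read this way, the 493544 nsec measured on the X4150 corresponds to roughly 2026 decrypts per second on a single thread.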


The benchmark code is available for download here.

However, our main objective remains to improve this benchmark, get better RSA decrypt performance, and then deliver the improvements to OpenSolaris. This effort has already led to a CR, which will be fixed soon in OpenSolaris. We hope you can contribute further improvements to the code; please send your ideas by e-mail or post a comment on this blog.

Here is how the benchmark performs on an X4150 (2-socket, quad-core Intel Xeon 3.16 GHz processors). The hottest function is big_mul_add_vec, which is already implemented in assembly for x64. The analysis was done with the collect and er_print tools available in Sun Studio 12.

%collect ./bignum_test
Creating experiment database test.1.er ...

%er_print -functions test.1.er
Functions sorted by metric: Exclusive User CPU Time

Excl.     Incl.      Name  
User CPU  User CPU         
  sec.      sec.      
51.956    51.956     <Total>
26.899    26.899     big_mul_add_vec
 9.236    47.623     big_mont_mul
 4.953    11.428     big_sqr_vec
 3.663    18.863     big_mul
 1.601     1.601     big_mul_set_vec
 1.181    49.234     big_modexp_ncp_int
 0.921     0.921     big_sub_vec
 0.570     0.570     big_sub_pos
 0.380     0.380     rand_r
 0.370     0.911     genrandomstring
 0.280     3.613     big_mul_vec
 0.240     0.240     big_cmp_abs
 0.220     1.301     big_div_pos
 0.170     0.170     big_mulhalf_high
 0.160     0.160     _free_unlocked
 0.160     0.160     big_copy
 0.160     0.540     rand
 0.120     0.120     big_mulhalf_low
 0.120     0.140     mutex_unlock
 0.070     0.130     _malloc_unlocked
 0.070     0.640     big_sub_pos_high
 0.050     0.050     big_shiftleft
 0.050     0.050     sigon
 0.040     0.070     mutex_lock_impl
 0.030     0.030     _smalloc
 0.030     0.300     big_init1
 0.030     0.030     cleanfree
 0.030     0.030     gettimeofday
 0.020     0.090     big_cmp_abs_high
 0.020     0.310     big_finish
 0.020     0.020     big_n0
 0.020     0.290     free
 0.020     0.270     malloc
 0.020     0.090     mutex_lock
 0.010     0.010     big_add_abs
 0.010     0.010     big_numbits
 0.010     0.010     big_shiftright
 0.        0.        _init
 0.        0.        _rt_boot
 0.        0.        _setup
 0.       51.956     _start
 0.       51.016     big_modexp_crt_ext
 0.       50.065     big_modexp_ext
 0.        0.200     big_mont_conv
 0.        0.490     big_mont_rr
 0.        0.        call_init
 0.        0.        dtrace_dof_init
 0.        0.        ioctl
 0.       51.956     main
 0.        0.        setup
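
In the profile above, big_mul_add_vec accounts for 26.899 of the 51.956 seconds of user CPU time, or about 52%. To show the kind of work that inner loop does, here is a portable C sketch of a multiply-and-accumulate over a vector of digits. It assumes the routine computes r = r + a * digit and returns the final carry. The name mul_add_vec_sketch and the 32-bit digit size are my choices for illustration; this is not the tuned x64 assembly from bignum.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical reference version of a multiply-accumulate step:
 * r[0..len-1] += a[0..len-1] * digit, returning the carry out of the
 * top digit. A 64-bit accumulator holds each 32x32-bit product plus
 * the existing digit plus the incoming carry without overflowing.
 */
uint32_t
mul_add_vec_sketch(uint32_t *r, const uint32_t *a, size_t len, uint32_t digit)
{
	uint64_t carry = 0;

	for (size_t i = 0; i < len; i++) {
		uint64_t t = (uint64_t)a[i] * digit + r[i] + carry;

		r[i] = (uint32_t)t;	/* low 32 bits stay in place */
		carry = t >> 32;	/* high 32 bits move to the next digit */
	}
	return ((uint32_t)carry);
}
```

Each iteration depends on the carry from the previous one, which is one reason a loop like this is a natural target for hand-written assembly.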
About

This blog discusses my work as a performance engineer at Sun Microsystems. It touches upon key topics of performance issues in operating systems and the Solaris Networking stack.
