RSA Performance of Sun Fire T2000

You might have heard that the UltraSPARC T1 has special hardware circuitry to accelerate certain crypto operations. In this post I will show you which operations it is good at, and how good.

The UltraSPARC T1 comes with a Modular Arithmetic Unit (MAU) per core, which can accelerate the expensive modular arithmetic operations found in public-key crypto algorithms such as RSA, DSA and DH. In Solaris, access to the MAU goes through the Niagara Cryptographic Provider (NCP) within the Solaris Cryptographic Framework (SCF). Currently NCP supports only RSA (up to 2048 bits) and DSA (up to 1024 bits).

On the Sun Fire T2000/T1000 with the UltraSPARC T1 processor, you can readily get a glimpse of the fast RSA operations performed by the MAU. Here is an example for the popular 1024-bit and 2048-bit RSA on a Sun Fire T2000 with an 8-core 1.2 GHz UltraSPARC T1:

[watercloset]~> /usr/sfw/bin/openssl speed rsa1024 rsa2048 -engine pkcs11
engine "pkcs11" set.
Doing 1024 bit private rsa's for 10s: 10332 1024 bit private RSA's in 0.45s
Doing 1024 bit public rsa's for 10s: 25550 1024 bit public RSA's in 0.89s
Doing 2048 bit private rsa's for 10s: 2371 2048 bit private RSA's in 0.11s
Doing 2048 bit public rsa's for 10s: 10308 2048 bit public RSA's in 0.37s
OpenSSL 0.9.7d 17 Mar 2004
built on: date not available
options:bn(64,32) md2(int) rc4(ptr,char) des(ptr,risc1,16,long)
aes(partial) blowfish(ptr) 
compiler: information not available
available timing options: TIMES TIMEB HZ=100 [sysconf value]
timing function used: times
                  sign    verify    sign/s verify/s 
rsa 1024 bits   0.0000s   0.0000s  22960.0  28707.9
rsa 2048 bits   0.0000s   0.0000s  21554.5  27859.5

This invokes the OpenSSL speed test bundled with Solaris. The OpenSSL bundled with Solaris has a built-in PKCS#11 engine, which is necessary to access SCF (and thus the MAU); if you download the OpenSSL package and build it yourself, you will not be able to take advantage of the MAU because that build does not have the PKCS#11 engine.

Let's examine the performance numbers above. What we just did was test single-threaded RSA performance, with each RSA operation run for 10 seconds. However, due to timing errors in the OpenSSL speed test in the single-threaded case, the throughput numbers at the bottom cannot be trusted when the operations are done in hardware. Recalculating from the operation counts and the 10-second run time, we get:

                  sign    verify    sign/s verify/s 
rsa 1024 bits   0.0000s   0.0000s   1033.2   2555.0
rsa 2048 bits   0.0000s   0.0000s    237.1   1030.8
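
The recalculation is plain division: each test runs for 10 seconds of wall-clock time, so the throughput is the operation count divided by 10 rather than by the much smaller time that OpenSSL reports. Here is a minimal Python sketch using the counts from the output above (the variable and function names are my own, purely for illustration):

```python
# Counts and reported times copied from the openssl speed output above.
# All names here are illustrative; none of them come from OpenSSL.
WALL_SECONDS = 10.0  # each test actually runs for 10 s of wall-clock time

runs = {
    "rsa1024 sign":   (10332, 0.45),
    "rsa1024 verify": (25550, 0.89),
    "rsa2048 sign":   (2371, 0.11),
    "rsa2048 verify": (10308, 0.37),
}

def throughput(count, seconds):
    """Operations per second for `count` operations done in `seconds`."""
    return count / seconds

for name, (count, reported_seconds) in runs.items():
    as_printed = throughput(count, reported_seconds)  # what openssl shows
    corrected = throughput(count, WALL_SECONDS)       # what really happened
    print(f"{name:15s} printed {as_printed:8.1f}/s   corrected {corrected:7.1f}/s")
```

The "printed" column reproduces the inflated figures in the summary table of the first run; the "corrected" column is the count divided by the 10-second run time.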

Are these numbers good? They are actually very good. Take the 1024-bit RSA sign figure, 1033.2 signs/s, and compare it with the 843.0 signs/s of a 3.6 GHz Xeon Dell box: the UltraSPARC T1 delivers about 20% more RSA performance at 1/3 of the clock rate, and uses less power. Note that this is a single-threaded test. As shown below, the UltraSPARC T1 really dwarfs the others in the multi-process test.

Now, let's look at multi-process RSA performance. This is where the UltraSPARC T1 really shines. Run the OpenSSL speed test again, this time with the "-multi" option to invoke multiple processes that perform RSA operations concurrently:

[watercloset]~> /usr/sfw/bin/openssl speed rsa1024 rsa2048 -engine pkcs11 -multi 32

[ intermediate output snipped..... ]

                  sign    verify    sign/s verify/s
rsa 1024 bits   0.0001s   0.0000s  12871.3  45148.1
rsa 2048 bits   0.0004s   0.0000s   2299.6  20425.3

We have used 32 processes to fully saturate the 32 hardware threads on UltraSPARC T1 to get the maximum throughput. Compare this with the results on the 2-way 3.6 GHz Xeon Dell PowerEdge 2850 (with hyperthreading on):

wgs93-187:~ openssl speed rsa1024 rsa2048 -multi 4

[ intermediate output snipped..... ]

                  sign    verify    sign/s verify/s
rsa 1024 bits   0.0005s   0.0000s   1943.2  34632.7
rsa 2048 bits   0.0031s   0.0001s    327.4  10891.9

For the 1024-bit RSA sign operation (as commonly used in web server SSL handshaking), the Sun Fire T2000 outperforms the Dell PowerEdge 2850 by a whopping 6x and more (12871.3 vs. 1943.2 signs/s)! The UltraSPARC T1 also excels when compared with the Sun Crypto Accelerator 4000, which can do 8000 1024-bit RSA signs/s. And remember, all this comes with just the Sun Fire T2000/T1000 box; no extra crypto accelerator card is needed.
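
For readers who want to check the comparisons, here is a small Python sketch that computes the ratios from the 1024-bit signs/s figures quoted in this post. The variable names are mine, and the numbers are simply copied from the tables and text above:

```python
# 1024-bit RSA signs/s, copied from the figures quoted in this post.
t2000_single = 1033.2    # T2000, single-threaded (recalculated)
t2000_multi = 12871.3    # T2000, -multi 32
xeon_single = 843.0      # 3.6 GHz Xeon, single-threaded
xeon_multi = 1943.2      # 2-way 3.6 GHz Xeon, -multi 4
sca4000 = 8000.0         # Sun Crypto Accelerator 4000

def ratio(a, b):
    """How many times faster a is than b."""
    return a / b

print(f"T2000 vs Xeon, single-threaded: {ratio(t2000_single, xeon_single):.2f}x")
print(f"T2000 vs Xeon, multi-process:   {ratio(t2000_multi, xeon_multi):.2f}x")
print(f"T2000 vs SCA 4000:              {ratio(t2000_multi, sca4000):.2f}x")
print(f"T2000 multi vs single:          {ratio(t2000_multi, t2000_single):.2f}x")
```

The last line is a rough measure of how well the sign throughput scales when going from one process to 32.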

In summary, if RSA/DSA operations consume a noticeable share of the CPU cycles in your application (e.g. HTTPS), the Sun Fire T2000/T1000 with the UltraSPARC T1 will offer you the biggest bang for the buck with its per-core MAU and unique 8-core CMT architecture.

Comments:

Do you know if the Sun Webserver utilizes this by "default"?

Posted by Kenneth on December 06, 2005 at 05:05 AM PST #

I think the webserver uses the NSS implementation by default. You have to bind the Solaris crypto framework with modutil.

Posted by bbr on December 06, 2005 at 09:07 PM PST #

'bbr' is correct. NCP is not enabled in the Sun webserver config by default. NSS treats NCP as any other hw-crypto device, and modutil can be used to configure NSS/webserver. You can find a short description at http://blogs.sun.com/roller/page/enigma

Posted by chichang_lin on December 07, 2005 at 02:43 AM PST #

Those 1.2 GHz UltraSPARC T1 performance numbers are very interesting, and I have compared them with those produced by a 1.8 GHz Opteron (running openssl 0.9.7d in 64-bit mode).

In the single-threaded case, the T1 is able to do 1033.2 1024-bit RSA signs/s, which is very good and similar to what is accomplished by the Opteron: about 1100 signs/s. However the T1 is only able to do 2550.0 verifies/s, whereas the Opteron does about 18400 verifies/s.

chichang_lin: why is the T1 so slow on the RSA verify operations ? Is it still using the special hardware circuitry in this case ? My 2048-bit RSA tests exhibit the same performance pattern (same sign speed, worse verify speed).

PS: of course in multi-threaded cases the 8-core T1 largely outperforms the single-core Opteron.

Posted by marc on December 14, 2005 at 03:37 PM PST #

What exactly are the timing errors in the openssl speed test?

Posted by rick on December 15, 2005 at 05:19 AM PST #

marc: RSA verify also uses the MAU. The reason we don't see that much of a speedup on the UltraSPARC T1 is that RSA verify requires less computation than sign (one modular exponentiation op for verify vs. two modular exponentiation ops for sign), so the overhead of setting up operations for the MAU becomes more visible. On server systems this should be less of a concern because servers predominantly perform RSA sign/decrypt operations.
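
To make the "one vs. two modular exponentiations" point concrete, here is a toy Python sketch. It assumes the two exponentiations for sign are the usual Chinese Remainder Theorem (CRT) split; this thread does not say how NCP actually implements it. The primes are tiny textbook values and there is no padding, so this is for illustration only:

```python
# Toy RSA with tiny textbook primes: illustration only, no padding, not secure.
# Assumption: "two modular exponentiations" for sign means the CRT split.
p, q = 61, 53
n = p * q                # modulus, 3233
e = 17                   # small public exponent
d = 2753                 # private exponent: (e * d) % ((p - 1) * (q - 1)) == 1

# CRT parameters, precomputed once from the private key.
dp = d % (p - 1)
dq = d % (q - 1)
q_inv = pow(q, -1, p)    # modular inverse of q mod p (needs Python 3.8+)

def sign_crt(m):
    """Private-key operation: two half-size modular exponentiations."""
    m1 = pow(m, dp, p)   # first exponentiation, mod p
    m2 = pow(m, dq, q)   # second exponentiation, mod q
    h = (q_inv * (m1 - m2)) % p
    return m2 + h * q    # recombine the two halves

def verify(s):
    """Public-key operation: a single modular exponentiation with small e."""
    return pow(s, e, n)

message = 65
signature = sign_crt(message)
print("CRT result matches plain pow(m, d, n):", signature == pow(message, d, n))
print("verify recovers the message:", verify(signature) == message)
```

Verify needs only one exponentiation with a short exponent, so any fixed per-operation setup cost weighs more heavily on it than on sign.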

Posted by chichang_lin on December 15, 2005 at 10:32 AM PST #

rick: The timing error I referred to is this. When you issue this command
[watercloset]~> /usr/sfw/bin/openssl speed rsa1024 rsa2048 -engine pkcs11
Take the 2nd line of the intermediate output for example:
Doing 1024 bit private rsa's for 10s: 10332 1024 bit private RSA's in 0.45s
It actually runs the RSA operation for 10 seconds, but it says that it ran for "0.45s". Therefore, in the final summary results it calculates the throughput as 10332/0.45 = 22960 signs/s. This is the timing error I meant. Note that this only happens when "-engine pkcs11" is specified and in the single-threaded case.

Posted by chichang_lin on December 15, 2005 at 10:44 AM PST #

chichang_lin: thanks for the explanation, it makes sense. But then I am wondering, if setting up the MAU is too much overhead, would it be faster to use a software implementation for RSA verify operations ? In other words, does speed(1) report better verifies/s numbers _without_ "-engine pkcs11" ? Of course the disadvantage of not using the MAU is that RSA operations would use traditional CPU execution units which could be more useful for other processes, but I am just asking because I am curious.

Oh and another thing, regarding the timing errors, my guess is that these are due to the fact that by default speed(1) takes the "user time" and not the "real time" (as reported by the kernel) to compute the performance numbers; and since your 'pkcs11' openssl engine actually spends most of its time waiting for the MAU to complete its computations (I think you guys implemented it this way, didn't you ?) then the whole openssl process ends up consuming very little "user time", which is why openssl reports 0.45s instead of 10s. In order to fix this you can instruct openssl to take the "real time" by appending the "-elapsed" argument to your speed(1) command line. Can you try it ?

Damn, the more I look at it, the more I think Niagara is the greatest general purpose processor released this decade...

Posted by marc on December 15, 2005 at 04:45 PM PST #

marc: You have a good point. Actually during the design/implementation phase the engineers had considered this and found that doing RSA verify in MAU still offers better performance: without "-engine pkcs11" 1024-bit RSA verify only has 1068 ops/s compared to 2550 ops/s with MAU.

The reason for the timing error is indeed as you said. With the "-elapsed" option the timings are all correct now.
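
The effect is easy to reproduce outside OpenSSL. In the rough Python sketch below, a sleep stands in for a process waiting on a hardware unit; that stand-in is my assumption for illustration and is not how the pkcs11 engine is written. The user CPU time stays close to zero while the elapsed time covers the whole wait:

```python
import os
import time

def timed(work):
    """Run work() and return (user_cpu_seconds, elapsed_seconds)."""
    user_start = os.times().user       # user CPU time, like times(2)
    wall_start = time.perf_counter()   # wall-clock ("real") time
    work()
    return os.times().user - user_start, time.perf_counter() - wall_start

def wait_for_device():
    # Hypothetical stand-in for blocking on an offload device:
    # the process is idle, so almost no user CPU time is charged.
    time.sleep(0.5)

user_s, elapsed_s = timed(wait_for_device)
print(f"user CPU time: {user_s:.3f}s   elapsed time: {elapsed_s:.3f}s")
```

Dividing an operation count by the first number instead of the second gives exactly the kind of inflated throughput seen without "-elapsed".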

Thanks for all the feedback and comments! Are you planning to purchase a Sun Fire T2000 soon? :-)

Posted by chichang_lin on December 16, 2005 at 06:31 AM PST #

chichang_lin: Sorry I can't :) I don't even have 2 grand to spend on a box right now. And anyway I prefer to wait for the "Rock" and its 8 FPUs.

Posted by marc on December 16, 2005 at 07:46 AM PST #

Hi,
I just ran the code you provided here on a T5140...

I mean, your input is from 2005!

Are there special things with regards to T2 processors?

Thanks
Helmut

Posted by Helmut Kirrmaier on April 02, 2009 at 02:15 AM PDT #

To Helmut,

Everything should be the same on a T2-based system. You should see even higher numbers on T2.

Posted by chichang_lin on May 03, 2009 at 05:17 AM PDT #
