### About RSA decrypt performance

#### By user12608726 on Mar 24, 2009

We started looking at some common cryptography algorithms and investigate their performance on our x86 based servers about two months ago. Crypto performance is important, particularly because x86 processors do not have the crypto hardware accelerator that the CMT systems have.

RSA is the defacto standard for public key cryptography and is widely used for the key-exchange step in the Transport Layer Security (TLS)/ Secure Socket Layer (SSL) protocols. A client which desires a secure connection with the server negotiates a symmetric-key algorithm such as RC4, Triple DES, AES, etc. Then, in order to send the cipher, it encrypts it using the public key of the server. Only the server can decrypt this using its private-key. Following the key exchange, a secure channel is established by encrypting every communication with the cipher.

Turns out that RSA decrypt is one of the most expensive crypto operations that the server performs. RSA decrypt involves modular exponentiation, i.e computing a\^q mod n where a,q, and n - all are huge sized numbers. RSA1024 requires this operation with 512-bit sized a,q and 1024-bit sized n. The operation can be simplified a little bit with a technique known as Montogomery reduction, but that too required 512-bit multiplications. Thus improving RSA decrypt is often the key to improving the scalability for how many sessions a web server can support.

We started out comparing how RSA decypt provided as part of the Solaris Crypto Framework(SCF) with that of OpenSSL, a publicly available SSL library. It turned out that OpenSSL 0.9.8j has much improved RSA-decrypt performance, compared to OpenSSL 0.9.8a, the version shipped with Solaris and some flavours of Linux. A comparison of the two versions using the OpenSSL speed benchmark, compiled with Sun Studio 12 on a Sun Fire X4150 server is shown below. It is very interesting to note that with the same hardware, RSA decrypt performs 3 times faster with a new version of OpenSSL.

So what buys RSA decrypt such a lot of improvement? Looking through OpenSSL changelog, bn_mont_mul(), the montogomery multiplication operation was introduced as an assembly routine for x86_64 in 0.9.8h. Montogomery multiplication is the most vital step in RSA decrypt; it allows for easy modular multiplication (a\*b mod n) arithmetic by multiplications using bit-shifting operations. An optimized assembly implementation of the same in OpenSSL 0.9.8j leads to 3x faster RSA decrypt. This assembly is available at /crypto/bn/asm/x86_64-mont.pl in the OpenSSL source.

When we compared the OpenSSL 0.9.8j number with the speed of RSA decrypt provided by SCF (measured using the pkcs11 engine), we found that the latter was about 2x slower. The pkcs11 engine can be used with OpenSSL with the following command:

/usr/bin/amd64/openssl -engine pkcs11

To measure RSA performance:

/usr/bin/amd64/openssl speed rsa -engine pkcs11

Investigating the performance of the same with the er_kernel/analyzer tools available in Sun Studio 12, we found that the biggest CPU consumer was the big_mul_add_vec32() routine, which does the multiplications. Dan Anderson worked through a number of changes as part of the fix for CR 6799218 in which the bignum data structure was changed to support 64-bit chunks instead of 32-bit chunks. With 64-bit multiplication support in 64-bit processors, the number of multiplicative operations reduces by a factor of 4, and this gave us nearly 1.8x time gain for RSA decrypt. We also found an unnecessary copy operation in the big_mul() routine, and took it out as a fix for CR 6811474. You can check out the new code of big_mul() at OpenSolaris.org. The following figure shows RSA decrypt performance with OpenSolaris pkcs11 library on a X4150 server, before and after the change.

This exercise shows how big a difference a performance tuned software can make. There are usually a few hot spots of code which get executed a lot more times than the others. In crypto operations, hot spots are very prominent. As Amdahl's law suggests, it is important to prioritize performance improvement on areas which would give the best overall system improvement, and targeting hotspots is a great example of this. In my next blog, I will discuss how we tuned another crypto operation, RC4 encrypt for better performance.

RSA is the defacto standard for public key cryptography and is widely used for the key-exchange step in the Transport Layer Security (TLS)/ Secure Socket Layer (SSL) protocols. A client which desires a secure connection with the server negotiates a symmetric-key algorithm such as RC4, Triple DES, AES, etc. Then, in order to send the cipher, it encrypts it using the public key of the server. Only the server can decrypt this using its private-key. Following the key exchange, a secure channel is established by encrypting every communication with the cipher.

Turns out that RSA decrypt is one of the most expensive crypto operations that the server performs. RSA decrypt involves modular exponentiation, i.e computing a\^q mod n where a,q, and n - all are huge sized numbers. RSA1024 requires this operation with 512-bit sized a,q and 1024-bit sized n. The operation can be simplified a little bit with a technique known as Montogomery reduction, but that too required 512-bit multiplications. Thus improving RSA decrypt is often the key to improving the scalability for how many sessions a web server can support.

We started out comparing how RSA decypt provided as part of the Solaris Crypto Framework(SCF) with that of OpenSSL, a publicly available SSL library. It turned out that OpenSSL 0.9.8j has much improved RSA-decrypt performance, compared to OpenSSL 0.9.8a, the version shipped with Solaris and some flavours of Linux. A comparison of the two versions using the OpenSSL speed benchmark, compiled with Sun Studio 12 on a Sun Fire X4150 server is shown below. It is very interesting to note that with the same hardware, RSA decrypt performs 3 times faster with a new version of OpenSSL.

So what buys RSA decrypt such a lot of improvement? Looking through OpenSSL changelog, bn_mont_mul(), the montogomery multiplication operation was introduced as an assembly routine for x86_64 in 0.9.8h. Montogomery multiplication is the most vital step in RSA decrypt; it allows for easy modular multiplication (a\*b mod n) arithmetic by multiplications using bit-shifting operations. An optimized assembly implementation of the same in OpenSSL 0.9.8j leads to 3x faster RSA decrypt. This assembly is available at /crypto/bn/asm/x86_64-mont.pl in the OpenSSL source.

When we compared the OpenSSL 0.9.8j number with the speed of RSA decrypt provided by SCF (measured using the pkcs11 engine), we found that the latter was about 2x slower. The pkcs11 engine can be used with OpenSSL with the following command:

/usr/bin/amd64/openssl -engine pkcs11

To measure RSA performance:

/usr/bin/amd64/openssl speed rsa -engine pkcs11

Investigating the performance of the same with the er_kernel/analyzer tools available in Sun Studio 12, we found that the biggest CPU consumer was the big_mul_add_vec32() routine, which does the multiplications. Dan Anderson worked through a number of changes as part of the fix for CR 6799218 in which the bignum data structure was changed to support 64-bit chunks instead of 32-bit chunks. With 64-bit multiplication support in 64-bit processors, the number of multiplicative operations reduces by a factor of 4, and this gave us nearly 1.8x time gain for RSA decrypt. We also found an unnecessary copy operation in the big_mul() routine, and took it out as a fix for CR 6811474. You can check out the new code of big_mul() at OpenSolaris.org. The following figure shows RSA decrypt performance with OpenSolaris pkcs11 library on a X4150 server, before and after the change.

This exercise shows how big a difference a performance tuned software can make. There are usually a few hot spots of code which get executed a lot more times than the others. In crypto operations, hot spots are very prominent. As Amdahl's law suggests, it is important to prioritize performance improvement on areas which would give the best overall system improvement, and targeting hotspots is a great example of this. In my next blog, I will discuss how we tuned another crypto operation, RC4 encrypt for better performance.