Examining Crypto Performance on Intel Nehalem based Sun Fire X4170
By user12608726 on Apr 13, 2009
I must first commend Dan Anderson for doing an excellent job of incorporating a lot of hand-coded assembly into the Solaris Cryptographic Framework (SCF). These enhancements became available in OpenSolaris 2008.10. Since then we have made the following enhancements, which will be available in OpenSolaris 2009.06. You can also try the preview bits of 2009.06 at genunix.org.
1) CR 6799218: RSA using Solaris Kernel Crypto framework lagging behind OpenSSL. We made changes that made RSA decrypt operations 1.8 times faster. The details are documented here.
2) CR 6811474 and CR 6823192 make a number of changes to the big_mont_mul() and big_mul() routines, which form the core of Montgomery multiplication. These changes improve RSA decrypt operations by 10%.
3) CR 6812615: 64-bit RC4 has poor performance on Intel Nehalem. We made changes to the RC4 encrypt routine which delivered an improvement of 25% on Intel Nehalem. These changes are documented here.
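For readers unfamiliar with the Montgomery multiplication mentioned in item 2, here is a minimal Python sketch of the technique. It is not the big_mont_mul() code from the framework; the function names (montgomery_setup, mont_mul, mont_pow) and the 64-bit word size are assumptions made purely for illustration. The idea is that a modular product can be reduced with shifts and masks instead of a division, which is what makes RSA exponentiation with a fixed odd modulus cheap.

```python
# Illustrative sketch only: names and word size are hypothetical,
# not taken from the Solaris big_mont_mul() implementation.
# Requires Python 3.8+ for pow(n, -1, r).

def montgomery_setup(n, word_bits=64):
    """Precompute the constants needed for an odd modulus n."""
    if n % 2 == 0:
        raise ValueError("modulus must be odd")
    # R = 2^k is a power of two larger than n, rounded up to whole words.
    k = ((n.bit_length() + word_bits - 1) // word_bits) * word_bits
    r = 1 << k
    # n_prime satisfies n * n_prime == -1 (mod R).
    n_prime = (-pow(n, -1, r)) % r
    return k, r, n_prime

def mont_mul(a_bar, b_bar, n, k, n_prime):
    """Return a_bar * b_bar * R^-1 mod n using Montgomery reduction."""
    mask = (1 << k) - 1
    t = a_bar * b_bar
    m = ((t & mask) * n_prime) & mask   # chosen so t + m*n is divisible by R
    u = (t + m * n) >> k                # exact division by R is just a shift
    return u - n if u >= n else u       # one conditional subtraction suffices

def mont_pow(base, exp, n):
    """Left-to-right square-and-multiply built on mont_mul."""
    k, r, n_prime = montgomery_setup(n)
    x_bar = (base * r) % n              # convert base into Montgomery form
    acc = r % n                         # Montgomery form of 1
    for bit in bin(exp)[2:]:
        acc = mont_mul(acc, acc, n, k, n_prime)
        if bit == "1":
            acc = mont_mul(acc, x_bar, n, k, n_prime)
    return mont_mul(acc, 1, n, k, n_prime)  # convert back out of Montgomery form
```

Calling mont_pow(base, exp, n) should agree with Python's built-in pow(base, exp, n) for any odd n greater than 1. A production routine works on arrays of machine words rather than arbitrary-precision integers, and that inner word-by-word loop is where hand-tuned assembly pays off.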
The performance of these and other crypto algorithms may be examined using the PKCS#11-compliant Sun Software Crypto plugin. Applications can be linked against the library /usr/lib/amd64/libpkcs11.so. To benchmark SCF, we patched OpenSSL 0.9.8j (patch available at Jan Pechanec's blog) to use PKCS#11. The OpenSSL speed benchmark gives us the following numbers on a pre-release Sun Fire X4270 system with two-socket Intel(R) Xeon(R) CPU X5560 @ 2.8 GHz processors and HT enabled:
|RSA-1024 encrypt||24.9 K ops/s||199.2 K ops/s|
|RSA-1024 decrypt||1760 ops/s||14048 ops/s|
|RC4 encrypt (8k message)||317 MBytes/s||2265 MBytes/s|
|MD5 Hash (8k message)||531 MBytes/s||6085 MBytes/s|
|SHA-1 Hash (8k message)||356 MBytes/s||2545 MBytes/s|
|AES-256 encryption (8k message)||136.9 MBytes/s||1212.6 MBytes/s|
Please note that these numbers are with Hyper-Threading (HT) enabled on the Nehalem processor, in which two virtual processors share the same execution pipeline. The performance of all algorithms scales nearly linearly from one core to eight cores. Disabling HT did not make much of a difference to the benchmarks. This could be because crypto operations cause few stalls in the execution pipeline, so a second virtual processor has little idle capacity to exploit.
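To make the scaling claim concrete, the following Python sketch computes the ratio between the two published columns for each algorithm. The table does not label its columns, so treating the first as the lower-parallelism run and the second as the higher one is an assumption, as are the names results and speedups.

```python
# Values copied from the benchmark table above.
# Assumption: first value = lower-parallelism run, second = higher one.
results = {
    "RSA-1024 encrypt (ops/s)": (24900, 199200),
    "RSA-1024 decrypt (ops/s)": (1760, 14048),
    "RC4 encrypt (MBytes/s)": (317, 2265),
    "MD5 hash (MBytes/s)": (531, 6085),
    "SHA-1 hash (MBytes/s)": (356, 2545),
    "AES-256 encrypt (MBytes/s)": (136.9, 1212.6),
}

def speedups(table):
    """Return the second-column / first-column ratio for each row."""
    return {name: high / low for name, (low, high) in table.items()}

for name, ratio in speedups(results).items():
    print(f"{name}: {ratio:.2f}x")
```

A ratio near 8 for a row is consistent with near-linear scaling across eight cores; ratios noticeably above or below that are worth a closer look at how the individual runs were configured.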
For further notes on Sun's Intel Nehalem based servers and blades, I recommend reading Heather's blog, which cross-links all Nehalem-based blog entries. And please do leave your comments and feedback.