Monday Apr 13, 2009

Examining Crypto Performance on Intel Nehalem based Sun Fire X4170

Today, Sun is releasing a vast array of servers and blades with Intel's new Xeon 5560 (Nehalem) processor. We have significantly improved the performance of the crypto algorithms in the Solaris Cryptographic Framework (SCF). While some of these changes have been covered in my previous blogs, I would like to summarize them here.

I must first commend Dan Anderson for doing an excellent job of incorporating a lot of hand-coded assembly into the SCF. Those enhancements were available in OpenSolaris 2008.10. Since then we have made the following enhancements, which will be available in OpenSolaris 2009.06. You can also try the preview bits of 2009.06 at

1) CR 6799218: RSA using Solaris Kernel Crypto framework lagging behind OpenSSL. We made changes that made RSA decrypt operations 1.8 times faster. The details are documented here.

2) CR 6811474 and CR 6823192 make a number of changes to the big_mont_mul() and big_mul() routines, which form the essence of Montgomery multiplication. These changes improve RSA decrypt operations by 10%.

3) CR 6812615: 64-bit RC4 has poor performance on Intel Nehalem. We made changes to the RC4 encrypt routine which delivered an improvement of 25% on Intel Nehalem. These changes are documented here.
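
For readers unfamiliar with the Montgomery multiplication mentioned in item 2, the idea is to replace the expensive division in modular reduction with shifts and masks by working modulo a power of two, R. The Python sketch below is purely illustrative. It is not the SCF's big_mont_mul() or big_mul(), which operate on multi-word big numbers in C and assembly. The function names (montgomery_setup, redc, mont_mul, mont_pow) are hypothetical, and the sketch assumes an odd modulus and Python 3.8 or later (for pow(n, -1, r)).

```python
def montgomery_setup(n):
    """Precompute constants for an odd modulus n, with R = 2**k > n."""
    if n < 3 or n % 2 == 0:
        raise ValueError("Montgomery reduction needs an odd modulus > 1")
    k = n.bit_length()               # R = 2**k is the smallest power of two > n
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r   # satisfies n * n_prime == -1 (mod R)
    return k, n_prime


def redc(t, n, k, n_prime):
    """Montgomery reduction: return t * R^-1 mod n, for 0 <= t < n * R."""
    mask = (1 << k) - 1
    m = ((t & mask) * n_prime) & mask   # m = (t mod R) * n' mod R
    u = (t + m * n) >> k                # t + m*n is divisible by R, so this is exact
    return u - n if u >= n else u       # u < 2n, so one subtraction is enough


def mont_mul(a_bar, b_bar, n, k, n_prime):
    """Multiply two values that are already in Montgomery form."""
    return redc(a_bar * b_bar, n, k, n_prime)


def mont_pow(base, exp, n):
    """Square-and-multiply modular exponentiation built on mont_mul."""
    k, n_prime = montgomery_setup(n)
    x = ((base % n) << k) % n           # convert base to Montgomery form
    acc = (1 << k) % n                  # Montgomery form of 1
    while exp:
        if exp & 1:
            acc = mont_mul(acc, x, n, k, n_prime)
        x = mont_mul(x, x, n, k, n_prime)
        exp >>= 1
    return redc(acc, n, k, n_prime)     # convert back to the ordinary form
```

As a quick sanity check, mont_pow(4, 13, 497) should agree with Python's built-in pow(4, 13, 497). RSA spends most of its time in exactly this kind of repeated modular multiplication, which is why tuning the multiply routines shows up directly in RSA throughput.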

The performance of these and other crypto algorithms may be examined using the PKCS#11-compliant Sun Software Crypto plugin. Applications can be linked to the library, /usr/lib/amd64/. For benchmarking the performance of SCF, we patched OpenSSL 0.9.8j (patch available at Jan Pechanec's blog) to use PKCS#11. The OpenSSL speed benchmark gives us the following numbers on a pre-release Sun Fire X4270 system with two Intel(r) Xeon(r) X5560 processors at 2.8 GHz, HT enabled:

Benchmark                         1-thread         16-threads
RSA-1024 encrypt                  24.9 K ops/s     199.2 K ops/s
RSA-1024 decrypt                  1760 ops/s       14048 ops/s
RC4 encrypt (8k message)          317 MBytes/s     2265 MBytes/s
MD5 Hash (8k message)             531 MBytes/s     6085 MBytes/s
SHA-1 Hash (8k message)           356 MBytes/s     2545 MBytes/s
AES-256 encryption (8k message)   136.9 MBytes/s   1212.6 MBytes/s

Please note that these numbers are with Hyper-Threading (HT) enabled on the Nehalem processor, in which two virtual processors share the same execution pipeline. The performance of all algorithms scales close to linearly from one core to eight cores. Disabling HT did not make much of a difference to the benchmarks. This could be because crypto operations do not have many stalls in the execution pipeline, so a second virtual processor finds little idle capacity to use.
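
The scaling claim can be checked with simple arithmetic on the table above: divide each 16-thread figure by the corresponding 1-thread figure. The short Python sketch below does just that. The dictionary simply transcribes the published numbers, and the helper name speedup is hypothetical.

```python
# (1-thread, 16-threads) figures transcribed from the benchmark table.
# RSA rows are in ops/s, the remaining rows in MBytes/s.
results = {
    "RSA-1024 encrypt":     (24.9e3, 199.2e3),
    "RSA-1024 decrypt":     (1760.0, 14048.0),
    "RC4 encrypt (8k)":     (317.0, 2265.0),
    "MD5 hash (8k)":        (531.0, 6085.0),
    "SHA-1 hash (8k)":      (356.0, 2545.0),
    "AES-256 encrypt (8k)": (136.9, 1212.6),
}


def speedup(one_thread, sixteen_threads):
    """Ratio of 16-thread throughput to 1-thread throughput."""
    return sixteen_threads / one_thread


speedups = {name: speedup(*pair) for name, pair in results.items()}

for name, ratio in speedups.items():
    print(f"{name:22s} {ratio:5.2f}x")
```

On a system with eight physical cores, a ratio near 8 corresponds to roughly linear scaling across cores. The two RSA rows come out at about 8x.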

For further notes on Sun's Intel Nehalem based servers and blades, I recommend reading Heather's blog, which cross-links all Nehalem-based blog entries. And please do leave your comments and feedback.

Thursday Aug 28, 2008

Notes from Intel Developer Forum

The Intel Developer Forum (IDF) was held last week, 19th-21st August; I attended Day 2, on 20th August. Intel seems to be focused on three key markets: (i) Mobile Internet Devices (MIDs), (ii) converged internet and multimedia, and (iii) high-end enterprise solutions. Intel is targeting these markets with the following processors respectively: (i) Intel Atom, which is smaller than a quarter coin but has as many transistors as the Pentium 4, (ii) Intel Media Processor CE 3100, and (iii) Intel Nehalem, which is Intel's first foray into a NUMA architecture, with high memory bandwidth using the Intel QuickPath Interconnect (QPI) technology.

The plenary talks revolved around the above. Out of a gamut of applications and gadgets talked about, I found two interesting. The first is Gypsii, a unique application combining mobile computing and social networking. On a mobile device, such as your iPhone, you can locate all your friends on a map and instantly communicate with them, for example to meet up for lunch. The other is a TV internet widget jointly developed by Intel and Yahoo (see press release here). With this widget, you will have a toolbar at the bottom of your television screen with which you can check your mail, stocks, weather, news, and what not.

Many of the technical sessions were based on parallel computing and the Nehalem architecture. Nehalem seems to have improved branch prediction, better handling of unaligned cache accesses, and improved store and load performance, besides significantly higher interconnect bandwidth, which should help in faster memory access and better I/O. That suggests Nehalem will handle 10 GigE network I/O well, so it will be fun to do some performance characterization and analysis on a Nehalem box. There was also a two-hour session on the Intel Advanced Vector Extensions (AVX) instruction set, which is expected in 2010. AVX will operate on 256-bit registers, allowing for increased vectorization and 256-bit add, multiply and shuffle operations.

All presentation materials from IDF are now publicly available here.


This blog discusses my work as a performance engineer at Sun Microsystems. It touches upon key topics of performance issues in operating systems and the Solaris Networking stack.

