Optimizing Crypto for Intel Nehalem
By danx on Jun 04, 2009
Sun Blade X6275 is one example platform, with 2 Intel Nehalem (Xeon 5500) processors with 4 cores and 8 threads each.
Amitabha Banerjee of the Performance Group wrote 4 blog articles about improving OpenSolaris performance on Intel's "Nehalem" processor. Nehalem is the codename for Intel Xeon 5500 series processors. In this article I want to highlight how we improved performance and refer to his blogs.
What's different about Intel Nehalem?
Currently released Intel Nehalem processors ("Intel Core i7") have 4 cores and 8 threads and are built on a 45nm process. More importantly, instructions are executed differently not only from AMD64 (as expected), but also from older Intel 64 processors. Nehalem has lookahead speculative execution optimizations. These optimizations predict which instructions will probably execute in the future ("branch prediction") and execute them ahead of time on a preliminary basis ("speculative execution"). If the instructions are in fact executed, the results are available immediately. If they are not executed, or the data they read is invalidated by a write, the results are thrown out and the instructions are re-executed. One way to optimize for this processor is to arrange instructions carefully so that they don't write to (invalidate) memory soon after it has been read.
Performance Improvement Process
Amitab describes how he optimized OpenSolaris cryptography (and therefore SSL/HTTPS web transactions) for Nehalem processors. Basically the steps are:
- run benchmarks, such as SPECweb Banking,
- identify the bottleneck functions (using tools such as er_kernel(1M) and Sun Studio 12 Analyzer),
- examine the source for the bottleneck functions,
- try different optimizations, and
- repeat the previous step multiple times.
The last step was the most time-consuming, and involved ensuring not only that the code was faster, but also that it still produced correct results. These optimizations were all made in C source: instead of hand-coding assembly, he would re-code the C. The resulting source was more complex, but faster, and certainly not as complex as assembly!
I integrated these optimizations, and added my own ideas to improve performance further, along with suggestions from others during coding, regression testing, and code review.
So here are Amitab's blogs:
About RSA decrypt performance
The first bottleneck was RSA decryption. Almost all of RSA's time is spent in bignum, which performs math on arbitrarily long integers. For example, RSA2048 uses 1024-bit numbers with 2048-bit results. Most of that time, as it turned out, was spent in bignum multiplications. [more]
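To make "bignum multiplication" concrete, here is a minimal sketch of a schoolbook multiply built on a multiply-and-add inner loop. It is not the OpenSolaris source: the names mul_add_vec32() and mul_vec32() are hypothetical, digits are assumed to be 32-bit and stored least-significant first, and the real big_mul_add_vec() may differ in signature and details.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch, not the OpenSolaris code.
 * Computes r[0..len-1] += a[0..len-1] * digit (least-significant digit
 * first) and returns the carry out of the top digit.
 */
uint32_t
mul_add_vec32(uint32_t *r, const uint32_t *a, size_t len, uint32_t digit)
{
    uint64_t carry = 0;

    for (size_t i = 0; i < len; i++) {
        /* 32x32 -> 64-bit product plus two 32-bit addends cannot overflow */
        uint64_t t = (uint64_t)a[i] * digit + r[i] + carry;
        r[i] = (uint32_t)t;     /* low half stays in this digit */
        carry = t >> 32;        /* high half moves to the next digit */
    }
    return (uint32_t)carry;
}

/*
 * Schoolbook multiply: r[0..2*len-1] = a[0..len-1] * b[0..len-1].
 * r must not overlap a or b.
 */
void
mul_vec32(uint32_t *r, const uint32_t *a, const uint32_t *b, size_t len)
{
    for (size_t i = 0; i < 2 * len; i++)
        r[i] = 0;

    /* One row per digit of b; each row is a multiply-and-add pass over a */
    for (size_t j = 0; j < len; j++)
        r[j + len] = mul_add_vec32(r + j, a, len, b[j]);
}
```

The inner loop runs once per pair of digits, which is why a routine of this shape tends to dominate an RSA profile.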
Benchmarking (and improving the benchmark) for RSA decrypt
Simply rearranging the C for loop in Bignum Montgomery Multiplication, big_mont_mul(), improved TLB performance. It's worth looking at exactly how he rearranged the for loop; see CR 6823193, which has the old and improved for loop source code fragments. After optimization, big_mont_mul() dropped to 2nd place and big_mul_add_vec() became the "hot" function. [more]
Improving RC4 encrypt performance
This blog describes how RC4 performance was improved simply by rearranging the order of assignments in a for loop in RC4 function arcfour_crypt(). This is similar in principle to the rearranged for loop mentioned above for big_mont_mul(). This rearranged C code ran faster than the previous hand-coded assembly for Intel Nehalem (but not AMD64 processors). [more]
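As an illustration of what "rearranging the order of assignments" can look like, here is a sketch of the RC4 inner loop written two ways. This is not the actual arcfour_crypt() change: the type rc4_key_t and the functions rc4_init(), rc4_crypt_textbook(), and rc4_crypt_reordered() are hypothetical names, and the particular reordering (load both table entries into locals, then store) is only one plausible arrangement, assumed here for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key schedule state; not the OpenSolaris ARCFour_key layout */
typedef struct {
    uint8_t s[256];
    uint8_t i, j;
} rc4_key_t;

void
rc4_init(rc4_key_t *k, const uint8_t *key, size_t keylen)
{
    uint8_t j = 0;

    for (int n = 0; n < 256; n++)
        k->s[n] = (uint8_t)n;
    for (int n = 0; n < 256; n++) {
        j = (uint8_t)(j + k->s[n] + key[(size_t)n % keylen]);
        uint8_t t = k->s[n];
        k->s[n] = k->s[j];
        k->s[j] = t;
    }
    k->i = k->j = 0;
}

/* Textbook order: swap through memory, then read both entries back */
void
rc4_crypt_textbook(rc4_key_t *k, const uint8_t *in, uint8_t *out, size_t len)
{
    uint8_t i = k->i, j = k->j;

    for (size_t n = 0; n < len; n++) {
        i++;
        j = (uint8_t)(j + k->s[i]);
        uint8_t t = k->s[i];
        k->s[i] = k->s[j];
        k->s[j] = t;
        out[n] = in[n] ^ k->s[(uint8_t)(k->s[i] + k->s[j])];
    }
    k->i = i;
    k->j = j;
}

/* Reordered: do both loads first, keep the values in locals, store last */
void
rc4_crypt_reordered(rc4_key_t *k, const uint8_t *in, uint8_t *out, size_t len)
{
    uint8_t i = k->i, j = k->j;

    for (size_t n = 0; n < len; n++) {
        i++;
        uint8_t ti = k->s[i];           /* load */
        j = (uint8_t)(j + ti);
        uint8_t tj = k->s[j];           /* load */
        k->s[i] = tj;                   /* stores come after both loads */
        k->s[j] = ti;
        /* index computed from locals, not re-read from the table */
        out[n] = in[n] ^ k->s[(uint8_t)(ti + tj)];
    }
    k->i = i;
    k->j = j;
}
```

Both functions produce the same keystream; only the order of loads and stores differs. Whether a given ordering is actually faster depends on the processor, so any variant like this needs to be benchmarked and checked against the original for identical output.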
Examining Crypto Performance on Intel Nehalem based Sun Fire X4170
This blog gives the results of various improvements in RSA (bignum) and RC4, such as:
- converting bignum from 32- to 64-bit bignum "digits" for x86 (SPARC bignum was already 64-bit),
- rearranging the for loop in big_mont_mul() described above,
- removing a call from bignum multiply, big_mul(), to bignum number copy, big_copy(), that was not needed (as the results used were all overwritten before use), and
- rearranging C code in a for loop for RC4 (ARCFOUR) crypto.
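A rough way to see why the digit size in the first item matters: a schoolbook multiply of two n-digit numbers takes n * n digit multiplications, so halving the digit count cuts that work to a quarter. The sketch below only counts operations under that schoolbook assumption; the helper names are hypothetical, and the real Montgomery multiplication code will have different constants.

```c
#include <stddef.h>

/* Digits needed to hold `bits` bits using digits of `digit_bits` bits each */
size_t
bignum_digits(size_t bits, size_t digit_bits)
{
    return (bits + digit_bits - 1) / digit_bits;
}

/* Digit multiplications in a schoolbook n-digit by n-digit multiply */
size_t
schoolbook_mults(size_t bits, size_t digit_bits)
{
    size_t n = bignum_digits(bits, digit_bits);

    return n * n;
}
```

For the 1024-bit numbers mentioned above, schoolbook_mults(1024, 32) gives 1024 digit multiplications, while schoolbook_mults(1024, 64) gives 256.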
Related blogs
While I'm at it, I'd also like to refer to 2 previous blogs about OpenSolaris x86-64 crypto performance:
- Optimizing OpenSolaris with Open Source shows RC4, MD5, SHA1/2, and AES crypto optimizations made by integrating OpenSSL and other open source code into OpenSolaris. I also have a picture of our Golden Retriever, "Fannie Mae."
- Optimizing Byte Swapping for Fun and Profit shows how I optimized 32- and 64-bit byte swapping in OpenSolaris, a necessary and common function with Intel processors. I also have a picture there of a previous Golden Retriever of ours, "Holly," eating pancakes with Anna.
- Another previous Golden Retriever of ours, "Patsy Ann," is in this unrelated blog
Disclaimer: the statements in this blog are my personal views, not those of my employer.