Optimizing Crypto for Intel Nehalem

Sun Blade X6275 with Intel Xeon 5500
The Sun Blade X6275 is one example platform, with 2 Intel Nehalem (Xeon 5500) processors, each with 4 cores and 8 threads.

Amitabha Banerjee of the Performance Group wrote 4 blog articles about improving OpenSolaris performance on Intel's "Nehalem" processors. Nehalem is the codename for the Intel Xeon 5500 series of processors. In this article I want to highlight how we improved performance and point to his blogs.

What's different about Intel Nehalem? Currently-released Intel Nehalem processors ("Intel Core i7") have 4 cores and 8 threads, and are built on a 45 nm process. More importantly, instructions are executed differently than on not only AMD64 processors (as expected), but also older Intel 64 processors. Nehalem has aggressive lookahead optimizations: the processor predicts which instructions will probably execute next ("branch prediction") and executes instructions ahead of time on a preliminary basis ("speculative execution"). If those instructions are in fact reached, their results are available immediately. If they are not reached, or if data they depend on is invalidated by an intervening write, the results are thrown out and the instructions re-executed. One way to optimize for this processor is to arrange code so that memory is not written soon after it is read, since such writes force the speculated work to be discarded and redone.

Performance Improvement Process Amitabha describes how he optimized OpenSolaris cryptography (and therefore SSL/HTTPS web transactions) for Nehalem processors. Basically, the steps are:

  • run benchmarks, such as SPECweb Banking,
  • identify the bottleneck functions (using tools such as er_kernel(1M) and Sun Studio 12 Analyzer),
  • examine the source for the bottleneck functions,
  • try different optimizations, and
  • repeat these steps as needed.

The last step was the most time-consuming, and involved verifying not only speed, but that the code still produced correct results. These optimizations were all in C source: instead of hand-coding assembly, he would re-code the C. The resulting source is more complex, but faster, and certainly not as complex as assembly!

I integrated these optimizations, and added my own ideas to improve performance further, along with suggestions from others during coding, regression testing, and code review.

Here are Amitabha's blogs:

  • About RSA decrypt performance
    The first bottleneck was RSA decryption. Almost all of RSA's time is spent in bignum, which performs math on arbitrarily long integers. For example, RSA-2048 works with 1024-bit numbers and 2048-bit results. Most of RSA's time, as it turned out, was spent in bignum multiplication. [more]
  • Benchmarking (and improving the benchmark) for RSA decrypt
    Simply rearranging the C for loop in the bignum Montgomery multiplication, big_mont_mul(), improved speculative execution. It's worth looking at exactly how he rearranged the for loop: see CR 6823193, which has the old and improved for-loop source code fragments. After optimization, big_mont_mul() dropped to 2nd place and big_mul_add_vec() became the "hot" function. [more]
  • Improving RC4 encrypt performance
    This blog describes how RC4 performance was improved simply by rearranging the order of assignments in a for loop in RC4 function arcfour_crypt(). This is similar in principle to the rearranged for loop mentioned above for big_mont_mul(). This rearranged C code ran faster than the previous hand-coded assembly for Intel Nehalem (but not AMD64 processors). [more]
  • Examining Crypto Performance on Intel Nehalem based Sun Fire X4170
    This blog gives the results of various improvements in RSA (bignum) and RC4, such as:
    • converting bignum from 32- to 64-bit bignum "digits" for x86 (SPARC bignum was already 64-bit)
    • rearranging the for loop in big_mont_mul() described above,
    • removing a call from bignum multiply, big_mul(), to bignum number copy, big_copy(), that was not needed (the copied results were all overwritten before being read), and
    • rearranging C code in a for loop for RC4 (ARCFOUR) crypto.

Related blogs While I'm at it, I'd also like to refer to 2 previous blogs about OpenSolaris x86-64 crypto performance:

Disclaimer: the statements in this blog are my personal views, not that of my employer.


In the new code proposed in CR6823193, the second loop looks valid but strange. It is just the addition of two bignums, but I am sure that if you think of it this way you won't implement it that way...

Also, modifying c[i] in the second loop is strange, it means writing to memory something that doesn't need to be (unless maybe the compiler can detect that the values written in c are never read later?).

Posted by Marc on June 04, 2009 at 04:54 PM PDT #

Yes, the code seems strange. Optimizing code often makes it less understandable. The "for" loop was split into 2 "for" loops because the assignment "c = 1" kept the processor from speculatively executing iterations in advance: c was being modified on every iteration. Creating an array c[] solved that problem, as each loop iteration modifies different memory.

The "c[i] = 1" assignment in the second "for" loop is used above in the "while (rr[j] < c[i])" test.

- Dan

Posted by Dan Anderson on June 05, 2009 at 01:05 AM PDT #

Oh I understand about the optimization. What I am saying (if you reread my post) is that there seems to be room for improvement in the second loop, if you want to optimize the code a bit further.

Posted by Marc on June 05, 2009 at 04:37 AM PDT #

Hmmm. Well, I don't see anything being assigned that's unused: c[i] is used in the while loop immediately after it's set, and rr[] is used later in the function (not shown).

Posted by Daniel Anderson on June 05, 2009 at 05:18 AM PDT #

For c[i], it is used locally, but not later. Storing it in a local variable means it will just be in a register as long as needed. Storing it in c[i] is more likely to mean writing it to memory, which is slower. But this is a minor point.

Most importantly, this second loop looks like an addition between 2 bignums where you add one limb of the second bignum to the first, propagate the carry as far as it will go, then move on to the second limb, add it, propagate as far as it will go, etc. You probably have a bignum_add function somewhere that does something closer to the schoolbook addition (no nested loops, coded in asm to use the adc instruction).

Posted by Marc on June 05, 2009 at 01:33 PM PDT #

OK, now I understand your point. Function big_mul_add_vec() is written in amd64 asm and uses adcq. It is called by the bignum multiply, big_mul(), which the Montgomery multiply uses.

If the entire big_mul() function is recoded in assembly, it will probably improve RSA a lot, since a lot of RSA is spent in bignum Montgomery Multiplication. Assembly recoding is time-consuming, but it may be worth it.
- Dan

Posted by Dan Anderson on June 05, 2009 at 02:34 PM PDT #

Depending on the exact interface of big_mul_add_vec(), my point was that it may be possible to replace the second loop by a single call to that function (without coding any new asm). Anyway, good job on the optimizations that have already been done.

Posted by Marc on June 06, 2009 at 05:43 PM PDT #

Er, ignore my last comment, or at least s/big_mul_add_vec/big_add/, I did not think before posting, sorry.

Posted by Marc on June 06, 2009 at 05:46 PM PDT #
