X

News, tips, partners, and perspectives for the Oracle Solaris operating system

Optimizing Crypto for Intel Nehalem




Sun Blade X6275 with Intel Xeon 5500



Sun Blade X6275
is one example platform
with


2 Intel Nehalem

(Xeon 5500)

processors


with 4 cores, 8 threads each

Amitabha Banerjee of the Performance Group wrote 4 blog articles about improving OpenSolaris performance on Intel's "Nehalem" processor.
Nehalem is the codename for Intel Xenon 5500 series processors.
In this article I want to highlight how we improved performance and refer to his blogs.

What's different about Intel Nehalem?
Currently-released Intel Nehalem processors ("Intel Core i7") have 4 cores and 8 threads using 45nm process. But more importantly, the instructions are executed differently from not only AMD64 (as expected), but also older Intel 64 processors. Nehalem has lookahead speculative execution optimizations. These optimizations predict what instructions will probably execute in the future ("branch prediction") and executes instructions ahead of time on a preliminary basis ("translation lookaside buffer" or TLB). If they are in fact executed, the results are available immediately. If they are not executed, or the onboard data cache is invalidated with writes, the results are thrown out and instructions reexecuted. One way to optimize use of this processor is to carefully arrange instructions that doesn't invalidate (write to) memory soon after it is used (read).

Performance Improvement Process Amitab describes how he optimized OpenSolaris cryptography (and therefore SSL/HTTPS web transactions) for Nehalem processors. Basically the steps are:

  • run benchmarks, such as SPECweb Banking,
  • identify the bottleneck functions (using tools such as er_kernel(1M) and Sun Studio 12 Analyzer),
  • examine the source for the bottleneck functions,
  • try different optimizations, and
  • repeat this step multiple times.

The last step was the most time consuming, and involved ensuring not only speed, but that it still produced correct results.
These optimizations were all with C source. Instead of hand-coding assembly, he would re-code C source. The source would be more complex, but faster, and certainly not as complex as assembly!

I integrated these optimizations, and added my own ideas to improve performance further, along with suggestions from others during coding, regression testing, and code review.

So here's Amitab's blogs:



  • About RSA decrypt performance



    The first bottleneck was RSA decryption. Almost all of RSA's time is spent in bignum. Bignum performs math for arbitrary-long integers. For example RSA2048 uses 1024-bit numbers with 2048-bit results. Most of RSA, as it turned out, was spent in bignum multiplications
    [more]


  • Benchmarking (and improving the benchmark) for RSA decrypt



    Simply rearranging the C for loop in
    Bignum Montgomery Multiplication, big_mont_mul(),
    improved TLB performance.
    It's worth looking at exactly how he rearranged the for loop—see
    see CR 6823193, which has the old and improved for loop source code fragments.
    After optimization big_mont_mul() dropped to 2nd place and big_mul_add_vec() became the "hot" function.
    [more]

  • Improving RC4 encrypt performance


    This blog describes how RC4 performance was improved simply by rearranging the order of assignments in a for loop in RC4 function arcfour_crypt().
    This is similar in principle to the rearranged for loop mentioned above for big_mont_mul().
    This rearranged C code ran faster than the previous hand-coded
    assembly for Intel Nehalem (but not AMD64 processors).
    [more]


  • Examining Crypto Performance on Intel Nehalem based Sun Fire X4170



    This blog gives the results of various improvements in RSA (bignum) and RC4,
    such as:
    • converting bignum from 32- to 64-bit bignum "digits" for x86 (SPARC bignum was already 64-bit)
    • rearranging the for loop in big_mont_mul() described above,
    • removing a call from bignum multiply, big_mul(), to bignum number copy, big_copy(), that was not needed (as the results used were all overwritten before use), and

    • rearranging C code in a for loop for RC4 (ARCFOUR) crypto.

    [more]

Related blogs While I'm at it, I'd also like to refer to 2 previous blogs about OpenSolaris x86-64 crypto performance:

Disclaimer:
the statements in this blog are my personal views, not that of my employer.

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.