Optimizing OpenSolaris With Open Source

[Photo: Our Golden Retriever puppy, Fannie Mae]

For the past year or so I've been optimizing OpenSolaris encryption on 64-bit x86 systems using open source software. The open source code I used is hand-coded amd64 assembly optimized for AMD64 and Intel EM64T processors. It replaces code written in C, which is optimized but still slower than hand-coded assembly. This assembly source comes from OpenSSL and elsewhere. The algorithms I optimized for Solaris x86-64 are the MD5, SHA1, and SHA2 hash algorithms and the ARCFOUR and AES cryptographic algorithms, as I'll show below.

OpenSolaris uses common source to implement both userland libraries (pkcs11_softtoken.so for cryptography and libmd.so for hash algorithms) and kernel modules (/kernel/crypto/amd64/* modules, accessible through pkcs11_kernel.so). All these optimizations apply to both userland and kernel. Replacing C with assembly wasn't a straight drop-in process, mainly due to differences in implementation between OpenSolaris and the open source code used. The main differences are in function definitions and in the data structures for keys and context (state).

Availability: All these optimizations are available in OpenSolaris 2008.11 and Solaris Nevada build 93 or later. Except for AES and SHA2, I backported all optimizations to Solaris 10 10/08 (aka U6). I backported the AES optimization into the next Solaris 10 update, which should be released in 2009.

ARCFOUR Industry-Standard Encryption Algorithm

[Chart: ARCFOUR optimization on AMD64/EM64T, shown by openssl speed]

ARCFOUR is a clone of the RC4™ encryption algorithm trademarked by RSA Data Security, Inc. ARCFOUR is heavily used in secure web transactions (SSL or https), so it's important that it is optimized. I used OpenSSL's hand-coded amd64 assembly written by Andy Polyakov. An earlier version of this code was optimized for AMD64 processors by changing the key from an 8-bit byte array of 256 elements into an aligned 32-bit array of 256 elements (with the upper bits zero). However, on Intel EM64T this actually made it run slower than C, so Polyakov wrote a "hybrid" version that uses 4-byte elements for AMD64 and 1-byte elements for Intel EM64T. I ported the latter version to OpenSolaris.
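To make the key-layout change concrete, here is a minimal Python sketch of ARCFOUR. It is not the OpenSSL or OpenSolaris code, and the function name `arcfour` and its `typecode` parameter are hypothetical, chosen only for illustration. The 256-entry state table can hold 1-byte elements or wider elements whose upper bits stay zero; the algorithm and its output are the same either way, and only the memory layout changes.

```python
from array import array

def arcfour(key: bytes, data: bytes, typecode: str = "B") -> bytes:
    """Encrypt or decrypt data with ARCFOUR (the operation is symmetric).

    typecode selects the element width of the 256-entry state table:
    "B" = 1-byte elements, "I" = wider unsigned ints (typically 4 bytes)
    whose upper bits stay zero. This parameter is illustrative only.
    """
    # Key setup: start from the identity permutation, then mix in the key.
    s = array(typecode, range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]

    # Stream generation: one keystream byte is XORed with each data byte.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) & 0xFF])
    return bytes(out)

narrow = arcfour(b"Key", b"Plaintext", "B")
wide = arcfour(b"Key", b"Plaintext", "I")
print(narrow.hex(), narrow == wide)
```

Both layouts produce identical ciphertext, which is why the choice between them can be made purely on which one a given processor handles faster.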

Performance Gain: The chart shows ARCFOUR performance gains with AMD64 2.2GHz and Intel EM64T 2.1GHz processors running OpenSolaris. ARCFOUR shows a large gain of 2x-4x with amd64 assembly over C. All performance numbers shown here use the same two systems with these two processors. I show the gain from running the benchmark:
/usr/sfw/bin/amd64/openssl speed -evp rc4 -elapsed -engine pkcs11

Running SPECweb2005-banking shows these improvements with ARCFOUR and MD5 optimization:

  • Average response time improved from 1.46 down to 0.96 (35% improvement). This is a big improvement!
  • Total operations improved from 11.8 to 12.6 million (6% improvement)
  • Quality of Service (QoS) improved from 96% to 99% (3 percentage points)
  • CPU user/sys/idle times changed from 21.63/77.79/0.48 to 23.96/75.49/0.55
This run was on an X4440 with 16 cores (AMD64 Opteron) using KSSL, running OpenSolaris build NV80 vs. NV103. Simultaneous User Sessions = 50000.
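The percentages in the list are relative changes. As a small sketch, the arithmetic can be reproduced from the rounded figures quoted above; because those inputs are rounded, the results can differ slightly from percentages computed on the unrounded measurements. The helper name `relative_change` is illustrative.

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from before to after as a percentage of before."""
    return (after - before) / before * 100.0

response_time = relative_change(1.46, 0.96)  # negative: response time went down
total_ops = relative_change(11.8, 12.6)      # positive: more operations completed
qos_points = 99 - 96                         # QoS is already a percentage

print(f"response time:    {response_time:+.1f}%")
print(f"total operations: {total_ops:+.1f}%")
print(f"QoS:              +{qos_points} percentage points")
```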

MD5 Industry-Standard Hash Algorithm

[Photo: Marc Bevand]
[Chart: MD5 optimization on AMD64/EM64T, shown by openssl speed]

MD5 is an old but heavily used hash algorithm. It should be avoided in favor of the newer SHA2 algorithms. For MD5 I used OpenSSL's hand-coded amd64 assembly written by Marc Bevand.

By the way, Bevand also wrote software that runs on the Sony PlayStation 3 to crack UNIX crypt passwords by brute force. (The crypt algorithm is used by default on Solaris, but not on OpenSolaris 2008.11. If you have CRYPT_DEFAULT=__unix__ set in /etc/security/policy.conf, you have this vulnerability.) His software takes advantage of the 128-bit-wide PS3 processor by parallelizing boolean operations on each bit of the 128 bits on each of 7 cores. On average a UNIX crypt password can be cracked in 70 days with one PS3 at a cost of $1100 of electricity (at $0.10/kWh). Multiple PS3s will crack a password faster. But let's get back to our topic, MD5 . . .

Performance Gain: I show the gain in MD5 performance by running the benchmark:
/usr/sfw/bin/amd64/openssl speed -evp md5 -elapsed -engine pkcs11
The gain increases with data size.

SHA1 and SHA2 (SHA256, SHA384, and SHA512) Hash Algorithms

SHA1 and the SHA2 family of hash algorithms are NIST standards. The older SHA1 standard should be avoided in favor of SHA2 because of weaknesses recently found in the algorithm. NIST is working on a replacement for SHA1 and SHA2 (called AHS). Each generation of SHA hashes has a more complex algorithm and a longer hash result, as shown here:

$ digest -a md5 osol-0811-rc2-ai.iso
$ digest -a sha1 osol-0811-rc2-ai.iso
$ digest -a sha256 osol-0811-rc2-ai.iso
$ digest -a sha384 osol-0811-rc2-ai.iso
$ digest -a sha512 osol-0811-rc2-ai.iso
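The growth in hash length can also be checked without the ISO image. As a sketch, Python's standard hashlib module (which is not part of the Solaris tooling discussed here) reports the digest size of each algorithm:

```python
import hashlib

# Digest length in bits for each algorithm, oldest first.
names = ("md5", "sha1", "sha256", "sha384", "sha512")
sizes = {name: hashlib.new(name).digest_size * 8 for name in names}

for name, bits in sizes.items():
    print(f"{name:7s} {bits:3d} bits")
```

This reports 128, 160, 256, 384, and 512 bits respectively, so a SHA512 result is four times as long as an MD5 result.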

For SHA1 and SHA2 I used OpenSSL's hand-coded amd64 assembly written by Andy Polyakov (who also wrote the ARCFOUR assembly, above).

[Chart: SHA2 optimization on AMD64/EM64T, shown by microbenchmarks]
[Chart: SHA1 optimization on AMD64/EM64T, shown by openssl speed]

Performance Gain: I show the gain from running the benchmark:
/usr/sfw/bin/amd64/openssl speed -evp sha1 -elapsed -engine pkcs11
For SHA2, I used internal microbenchmarks that use the pkcs11_softtoken.so library. This is because when I ran the tests, SHA2 wasn't integrated into OpenSSL. SHA2 has since been integrated into OpenSolaris 2008.11.

AES: Advanced Encryption Standard

[Image: AES encryption process, from the AES encryption Flash animation]
Since 2001, AES (aka Rijndael) has been the NIST standard encryption algorithm, replacing the earlier, and weaker, DES standard. AES is part of several data communication standards, such as WPA, IPsec, and SSH 2. AES is a complex algorithm using shifts and lookups on 16-byte blocks. It's best understood, at least for me, with an animated visualization. There's a great AES Flash animation by Enrique Zabala that I think is worth watching.
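As a small illustration of the "shifts" part, here is a sketch of the AES ShiftRows step in Python. The function name `shift_rows` is illustrative, and this is a single step only, not a full AES implementation and not the OpenSSL or Gladman code. The 16-byte block is viewed as a 4x4 grid stored column by column, and row r is rotated left by r positions:

```python
def shift_rows(block: bytes) -> bytes:
    """Apply the AES ShiftRows step to one 16-byte block.

    The block is a 4x4 byte matrix stored column-major: the byte in
    row r, column c lives at index r + 4*c.
    """
    if len(block) != 16:
        raise ValueError("AES operates on 16-byte blocks")
    out = bytearray(16)
    for c in range(4):
        for r in range(4):
            # Row r is rotated left by r columns; row 0 is left unchanged.
            out[r + 4 * c] = block[r + 4 * ((c + r) % 4)]
    return bytes(out)

print(list(shift_rows(bytes(range(16)))))
```

In a table-driven implementation this byte shuffling is usually folded into the lookup-table indexing rather than performed as a separate pass.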

At first I replaced the OpenSolaris C implementation of AES with the OpenSSL assembly implementation, but to my surprise the assembly version ran at about the same speed as the C version compiled with the Sun Studio C compiler using "cc -O" optimization. In fairness, the OpenSSL implementation gains performance by tightly integrating the code for AES and CBC feedback mode. However, with AES alone or AES with other feedback modes, the assembly and C implementations perform about the same.

Next, I tried Dr. Brian Gladman's AES implementation and found it was faster than both the C and the OpenSSL assembly implementations, so I used Gladman's assembly source. The source is written in YASM-style assembly and macro syntax, so I translated it to Solaris assembly language and cpp-style #define/#ifdef macros.

[Chart: IPsec optimization (AES on AMD64/EM64T), shown by netperf]
[Chart: AES optimization on AMD64/EM64T, shown by the encrypt(1) command (lower is better)]
[Chart: AES-128 optimization on AMD64/EM64T, shown by openssl speed]

Performance Gain: I show the gain from running the benchmark:
/usr/sfw/bin/amd64/openssl speed -evp aes128 -elapsed -engine pkcs11
For AES-128, AES-192, and AES-256, I also show performance gain using the /usr/bin/amd64/encrypt command on a 500MB file on swapfs. When I ran the tests, AES-192 and AES-256 were not integrated into OpenSSL, but they have since been integrated into OpenSolaris 2008.11.

AES is used by IPsec, and the optimization has improved IPsec throughput. Dan McDonald ran the netperf benchmark with one pair of connected e1000g Ethernet ports on two Galaxy systems. The throughput on the 56x4 TCP_STREAM tests with just one SA (no parallelism) improved from 362Mbit/sec. to 463Mbit/sec. From these numbers, FTP throughput improves from about 300Mbit/sec. to 444Mbit/sec.


Update (12/2009): Here's an example of sha256 optimization helping ZFS performance (Darren Moffat)


First of all I want to point out that OpenSSL rc4-x86_64 module is *not* "based on an earlier version by Marc Bevand." Well, not to diminish Marc's effort, rc4-x86_64 second optimization round was triggered by his submission, but it was *second* round.

Secondly, original OpenSSL rc4-x86_64 effectively has three code paths: AMD, Intel pre-Core and Intel Core specific. Second one was omitted from OpenSolaris. I'm not judging the decision (though from commentary section it seems that it was done based on wrong analysis), I simply feel that this needs to be said.

As for OpenSSL AES performance. Again, I'm not judging the decision, it just needs to be said. OpenSSL module has a number of countermeasures against timing attacks, which naturally have impact on performance. The number varies from release to release, e.g. in recent 0.9.8 you'll find that the loops are folded to preclude correlation between D- and I-cache timings. Then the last round is properly protected (as far as I can see Brian's code provides this as an option, but it's not utilized in OpenSolaris). Development branch provides even further degree of protection... In other words OpenSSL AES assembler modules are not only about performance, they're as much [and sometimes even more] about security. Cheers. A.

Posted by Andy Polyakov on January 07, 2009 at 04:57 PM PST
