Intel AES-NI Optimization on Solaris

Introduction



Figure: AES encryption process. This AES encryption flash animation (Enrique Zabala, Universidad ORT, Uruguay) is useful for visualizing AES encryption and understanding how AES operates.


AES (Advanced Encryption Standard) is the U.S. government's encryption standard, adopted by the National Institute of Standards and Technology (NIST) in 2001 to replace the older, obsolete Data Encryption Standard (DES). AES is a slightly simplified version of Rijndael (the block size is fixed at 128 bits), which won the NIST contest in 2000 to replace DES. AES is a symmetric-key (private-key) encryption standard.
Symmetric-key algorithms use the same key for encryption and decryption. AES keys can be 128, 192, or 256 bits long.
AES is a block cipher: encryption and decryption are performed one block (16 bytes) at a time.
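
The key and block sizes above pin down some simple arithmetic. As a minimal sketch (the function names are my own; the round counts are the ones FIPS 197 defines for each key length), here is how many 16-byte blocks a message occupies and how many round keys each key size needs:

```python
BLOCK_BYTES = 16  # AES always works on 128-bit (16-byte) blocks

# Round counts defined by FIPS 197 for each key length (in bits).
ROUNDS = {128: 10, 192: 12, 256: 14}


def blocks_needed(message_len: int) -> int:
    """Number of 16-byte blocks needed to hold message_len bytes.

    No particular padding scheme is assumed; this is plain ceiling division.
    """
    return -(-message_len // BLOCK_BYTES)


def round_keys_needed(key_bits: int) -> int:
    """One round key per round, plus one for the initial key addition."""
    return ROUNDS[key_bits] + 1
```

For example, a 1024-byte buffer is 64 blocks, and AES-128 uses 11 round keys in total.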

Since 2001, AES has been widely adopted and is now a part of several data communication standards, such as WPA2 for Wi-Fi, IPsec for secure Internet transmission, SSH 2 for file and terminal access, and SSL/TLS for secure web connections.

To improve performance, Intel added 6 new instructions to the Intel64 instruction set, called AES-NI (for AES New Instructions). The AES-NI instructions first became available on the "Westmere" architecture microprocessors (some low-end Westmere chips for mobile/laptop use don't have AES-NI). Westmere processors are part of the Intel "Core" processor family and include the Xeon 5600 processors introduced in 2010.
Oracle's Sun Fire X4170 M2 and X4270 M2 are two systems that use Xeon 5600 processors.

Previous Work

Previously, for OpenSolaris 2008.11/Solaris Nevada (build 93), I optimized AES by replacing optimized C code with optimized assembly.
The C code used previously was the optimized reference implementation furnished by the authors of AES and first made available in Solaris 10.
The optimized assembly code I used was based on Dr. Brian Gladman's AES implementation, which was also faster than the OpenSSL assembly.
For details see my previous blog post, Optimizing OpenSolaris With Open Source: AES (2008).

Intel AES-NI Instruction Set

Intel AES-NI consists of 6 instructions: AESENC, AESENCLAST, AESDEC, AESDECLAST,
AESKEYGENASSIST, and AESIMC.
AESENC performs one round of encryption (which consists of these steps:
substitute bytes, shift rows, mix columns, and add (xor) round key).
AESENCLAST performs the final encryption round, which doesn't mix columns.
Similarly, AESDEC performs one round of decryption and AESDECLAST performs the final decryption round.

Two more instructions perform key expansion of the user key, formatting it for internal use by the algorithm. The AESKEYGENASSIST instruction helps generate the round keys used for encryption. The AESIMC instruction then converts the encryption round keys, with an operation called Inverse Mix Columns, to a form suitable for decryption.
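
To make the round structure concrete, here is a pure-software model of what AESENC and AESENCLAST compute, plus AES-128 key expansion. This is only an illustrative sketch in Python, not the Solaris or OpenSSL assembly: the function names simply mirror the instruction names, the S-box is derived from its mathematical definition instead of being typed in, and only 128-bit keys are handled.

```python
def xtime(b):
    """Multiply a byte by 2 in GF(2^8), modulo the AES polynomial 0x11B."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def gmul(a, b):
    """Multiply two bytes in GF(2^8)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = xtime(a)
        b >>= 1
    return r


def _sbox_entry(x):
    # Multiplicative inverse in GF(2^8) (0 maps to 0), then the affine transform.
    inv = 0 if x == 0 else next(y for y in range(1, 256) if gmul(x, y) == 1)
    rotl = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
    return inv ^ rotl(inv, 1) ^ rotl(inv, 2) ^ rotl(inv, 3) ^ rotl(inv, 4) ^ 0x63


SBOX = [_sbox_entry(x) for x in range(256)]


def xor(a, b):
    return [x ^ y for x, y in zip(a, b)]


def sub_bytes(s):
    return [SBOX[b] for b in s]


def shift_rows(s):
    # State is column-major: byte index = row + 4 * column.
    return [s[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]


def mix_columns(s):
    out = []
    for c in range(4):
        a = s[4 * c: 4 * c + 4]
        for r in range(4):
            out.append(gmul(a[r], 2) ^ gmul(a[(r + 1) % 4], 3)
                       ^ a[(r + 2) % 4] ^ a[(r + 3) % 4])
    return out


def aesenc(state, round_key):
    """Model of one full round: substitute, shift rows, mix columns, add key."""
    return xor(mix_columns(shift_rows(sub_bytes(state))), round_key)


def aesenclast(state, round_key):
    """Model of the final round: same as aesenc but without mix columns."""
    return xor(shift_rows(sub_bytes(state)), round_key)


def expand_key_128(key):
    """Expand a 16-byte key into 11 round keys (the job AESKEYGENASSIST helps with)."""
    w = [list(key[i:i + 4]) for i in range(0, 16, 4)]
    rcon = 1
    for i in range(4, 44):
        t = w[i - 1]
        if i % 4 == 0:
            t = [SBOX[b] for b in t[1:] + t[:1]]  # rotate word, then substitute
            t[0] ^= rcon
            rcon = xtime(rcon)
        w.append(xor(w[i - 4], t))
    return [sum(w[4 * r: 4 * r + 4], []) for r in range(11)]


def encrypt_block_128(plaintext, key):
    """Encrypt one 16-byte block: initial key xor, 9 aesenc rounds, 1 aesenclast."""
    rk = expand_key_128(key)
    state = xor(plaintext, rk[0])
    for r in range(1, 10):
        state = aesenc(state, rk[r])
    return bytes(aesenclast(state, rk[10]))
```

With the FIPS 197 Appendix C.1 test vector (key 000102...0f, plaintext 00112233445566778899aabbccddeeff), encrypt_block_128 should return 69c4e0d86a7b0430d8cdb78070b4c55a. On real hardware each aesenc call above collapses into a single instruction operating on a 128-bit %xmm register.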

Cache Attack prevention with AES-NI

The most highly optimized AES implementations, including Dr. Gladman's, have a weakness under timing attacks, due to their use of large lookup tables. By pre-loading the microprocessor cache with the AES table entries and measuring the encryption time, one can find which table entries were accessed. This information could be used to help reveal the secret key (although doing so is still difficult).
Current software mitigation techniques against cache attacks carry significant performance penalties.
However, AES-NI prevents such attacks because AES-NI uses no lookup tables in memory, and its instruction latency is fixed and data-independent.
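
To see why table lookups leak, consider the first round of a table-based AES, where each lookup index is a plaintext byte xor'ed with a key byte. The toy model below is a sketch under stated assumptions, not a working attack: it assumes 64-byte cache lines, 4-byte table entries, a table aligned to a cache line, and an attacker who can tell exactly which line was touched. All names are hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumed cache line size
ENTRY_BYTES = 4        # assumed size of one lookup-table entry


def first_round_index(plain_byte, key_byte):
    """Table index used for one byte in the first round of a table-based AES."""
    return plain_byte ^ key_byte


def cache_line_of(index):
    """Which cache line of the (line-aligned) table an index falls in."""
    return (index * ENTRY_BYTES) // CACHE_LINE_BYTES


def key_candidates(plain_byte, observed_line):
    """Key bytes consistent with seeing a given cache line touched."""
    return [k for k in range(256)
            if cache_line_of(first_round_index(plain_byte, k)) == observed_line]
```

Under these assumptions one observation narrows a key byte from 256 possibilities to 16, that is, it leaks the top 4 bits. With AES-NI there is no data-dependent memory access to observe.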

Implementation

Implementing AES-NI required resolving a number of dependencies first. Briefly:


  • getisax(2) and the Kernel's x86_feature/x86_featureset bit array
    needed to be expanded to detect and record
    the presence of Intel AES-NI instructions
    (CR 6750666).
    These bits are set by Solaris from the CPUID instruction.

  • The Solaris amd64 assemblers, as(1) and fbe(1), needed to support the new AES-NI instructions (CR 6740663).
    The disassembler, dis(1), was also extended to display AES-NI instructions (CR 6762031).
    Also, GNU binutils was updated to 2.19 to get the latest version of the GNU assembler, gas(1), with AES-NI support (CR 6644870).
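
As a rough illustration of the detection step in the first item: CPUID leaf 1 reports AES-NI support in bit 25 of the ECX register. The sketch below only decodes an ECX value that has already been read. It does not execute CPUID or call getisax(2), and the sample value is made up for illustration, not taken from a real processor.

```python
CPUID_1_ECX_AESNI = 1 << 25  # CPUID leaf 1, ECX bit 25: AES-NI supported


def has_aesni(ecx: int) -> bool:
    """Return True if a CPUID leaf 1 ECX value advertises AES-NI."""
    return bool(ecx & CPUID_1_ECX_AESNI)


# Illustrative ECX value only (bit 25 happens to be set).
SAMPLE_ECX = 0x0298E3FF
print(has_aesni(SAMPLE_ECX))
```

The kernel records the result of this kind of check in its feature bit array, so that the crypto code can pick the AES-NI path or fall back to the older assembly.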

Intel provided an implementation for OpenSSL that optimizes AES using assembly with the AES-NI instructions and the 128-bit %xmm registers, %xmm0-%xmm15.
The Solaris implementation is basically the same as the one in OpenSSL, with minor differences in the source.
Changes include reordering the function parameters and changing the structure types from those in OpenSSL to those defined in Solaris.
For userland threads, the kernel saves and restores the %xmm registers.
However, these registers are not saved or restored when the kernel switches between kernel threads, so I added code to save and restore these registers on a 16-byte-aligned stack
when necessary (that is, when Intel Control Register CR0.TS isn't set).
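
The alignment and the CR0.TS test come down to simple bit arithmetic. The sketch below models it in Python. It is not the kernel code: the names are hypothetical, and saving all 16 registers is an assumption (the real code may save only the registers it uses). TS is bit 3 of CR0.

```python
XMM_BYTES = 16       # each %xmm register is 128 bits
NUM_XMM = 16         # %xmm0-%xmm15; saving all of them is an assumption
CR0_TS = 1 << 3      # Task Switched flag is bit 3 of CR0


def align_down(addr: int, alignment: int = 16) -> int:
    """Round addr down to a multiple of alignment (a power of two)."""
    return addr & ~(alignment - 1)


def xmm_save_area(stack_pointer: int):
    """Return (base, size) of a 16-byte-aligned save area below stack_pointer."""
    size = XMM_BYTES * NUM_XMM
    base = align_down(stack_pointer - size)
    return base, size


def needs_save(cr0: int) -> bool:
    """Per the text, the registers are saved only when CR0.TS is not set."""
    return not (cr0 & CR0_TS)
```

For a stack pointer of 0x7FFFFFE8, xmm_save_area returns a base of 0x7FFFFEE0, which is a multiple of 16 and leaves room for 256 bytes below the original stack pointer.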

Performance Results


AES-NI optimization in Solaris userland

Everyone likes pretty color charts, so here they are.
I ran these on Solaris 11 running on a Piketon Platform system with an Intel Clarkdale processor, which is part of the Westmere processor architecture family.
The "before" case is Solaris 11, unmodified. Keep in mind that the "before" case already has been optimized with hand-coded amd64 assembly.
The "after" case has AES-NI instructions integrated into the Solaris Crypto Framework, which is the PKCS11 library in userland and the "aes" module in the kernel.

Userland library performance
The first chart compares AES128, AES192, and AES256 before and after the AES-NI optimization using the libpkcs11.so/libsoftcrypto.so libraries. The time shown is user time, in seconds, on a quiet system running an internal micro-benchmark, aesspeed (lower is better).
Runtime improved by 79%, 74%, and 79%, respectively.
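
A percentage drop in runtime can be restated as a speedup factor. The helper below is a small sketch that assumes "improved by 79%" means the user time fell by 79% compared with the "before" case. The function name is my own.

```python
def speedup_from_runtime_reduction(percent: float) -> float:
    """Convert a runtime reduction in percent into a speedup factor."""
    return 1.0 / (1.0 - percent / 100.0)


for pct in (79, 74, 79):
    print(f"{pct}% less runtime is about {speedup_from_runtime_reduction(pct):.2f}x faster")
```

Under that reading, the AES128 and AES256 cases run roughly 4.8 times faster and the AES192 case roughly 3.8 times faster.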

AES-NI optimization in Solaris kernel
Solaris kernel performance
This chart shows Solaris kernel performance using kernel module "aes".
This micro-benchmark runs AES128 and AES256 in 4 threads
for 5000 iterations on 1024 bytes of data.
Numbers are in millions of bytes per second (higher is better).
Performance improved here by 26% and 56%, respectively.



AES-NI optimization in Solaris kernel
Solaris kernel performance
Finally, another Solaris kernel micro-benchmark.
This one is similar to the previous one, except it's running AES128 with 64 bytes of data on 1, 2, 3, and 4 threads.
Performance improved by 50% for the 1-, 3-, and 4-thread cases.
The 2-thread case looks like an outlier.

Availability in Solaris

This feature is available only for Solaris x86 64-bit,
running on an Intel microprocessor that supports the AES-NI instruction set.


  • I integrated AES-NI optimization in Solaris build snv_114
    (see Change Request CR 6767618),
    so it's available in Oracle Solaris 11 Express 2010.11.

  • I also back-ported AES-NI optimization to Solaris 10 10/09 (aka update 8).
  • AES-NI is available by default in Java through Java Cryptography Extension (JCE)'s PKCS#11 extension. PKCS#11 is an industry standard interface supported by Solaris and used by default by JCE. For more information see Ramesh Nagappan's blog Java Cryptography on Intel Westmere: Solaris Advantage.

More Information

Disclaimer:
the statements in this blog are my personal views, not those of my employer.
