alt="AES Encryption process" width="351" height="337" border="0" />
This AES encryption flash animation
is useful to visualize AES encryption
and understand how AES operates
(Enrique Zabala, Universidad ORT, Uruguay)
Since 2001, AES has been widely adopted and is now a part of several data communication standards, such as WPA2 for Wi-Fi, IPsec for secure Internet transmission, SSH 2 for file and terminal access, and TLS for secure web connections.
To improve performance, Intel added 6 new instructions to the Intel64 instruction set, called AES-NI (for AES New Instructions). The AES-NI instructions first became available on the "Westmere" architecture microprocessors (although some low-end Westmere chips for mobile/laptop use don't have AES-NI). Westmere processors are part of the Intel "Core" processor family and include the Xeon 5600 processors introduced in 2010.
Oracle's Sun Fire X4170 M2 and X4270 M2 are two systems that use Xeon 5600 processors.
Previously, for OpenSolaris 2008.11/Solaris Nevada
(build 93), I optimized AES by replacing the C implementation with hand-tuned assembly.
The C code used until then was the optimized reference implementation furnished by the authors of AES and first made available in Solaris 10.
The optimized assembly code I used was based on
Dr. Brian Gladman's AES implementation, which was also faster than the OpenSSL assembly.
For details see my previous blog post,
Optimizing OpenSolaris With Open Source: AES (2008).
Intel AES-NI consists of 6 instructions: AESENC, AESENCLAST, AESDEC, AESDECLAST,
AESKEYGENASSIST, and AESIMC.
AESENC performs one round of encryption, which consists of these steps:
substitute bytes, shift rows, mix columns, and add (XOR) round key.
AESENCLAST performs the final encryption round, which doesn't mix columns.
Similarly, AESDEC performs one round of decryption and AESDECLAST performs the final decryption round.
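To make the round steps concrete, here is a minimal pure-Python sketch of what one AESENC (or, with `last=True`, one AESENCLAST) computes on a 16-byte state. The function names (`gf_mul`, `build_sbox`, `aes_round`) are my own for illustration, not part of any Solaris or Intel API, and the code is meant for reading, not for speed or security. The sample values are the round 1 state and round key from the FIPS-197 Appendix B example.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """Derive the AES S-box: multiplicative inverse, then the affine transform."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):      # x**254 is the inverse of x in GF(2^8)
                inv = gf_mul(inv, x)
        s = inv
        for shift in range(1, 5):     # XOR with inv rotated left by 1..4 bits
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def aes_round(state, round_key, last=False):
    """One AES encryption round on a 16-byte, column-major state."""
    # Substitute bytes
    s = [SBOX[b] for b in state]
    # Shift rows: row r rotates left by r positions
    s = [s[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]
    # Mix columns (skipped in the final round, as AESENCLAST does)
    if not last:
        mixed = []
        for c in range(4):
            a = s[4 * c:4 * c + 4]
            for r in range(4):
                mixed.append(gf_mul(2, a[r]) ^ gf_mul(3, a[(r + 1) % 4])
                             ^ a[(r + 2) % 4] ^ a[(r + 3) % 4])
        s = mixed
    # Add (XOR) round key
    return bytes(x ^ k for x, k in zip(s, round_key))


# Round 1 of the FIPS-197 Appendix B example
state = bytes.fromhex("193de3bea0f4e22b9ac68d2ae9f84808")
round_key = bytes.fromhex("a0fafe1788542cb123a339392a6c7605")
print(aes_round(state, round_key).hex())
```

Each call does in software, with table lookups and loops, what the hardware instruction does in a single operation on an %xmm register.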
Two more instructions perform key expansion of the user key, formatting it for internal use by the algorithm. The AESKEYGENASSIST instruction helps generate the round keys used for encryption. The AESIMC instruction then converts the encryption round keys, with an operation called Inverse Mix Columns, to a form suitable for decryption.
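As an illustration of the Inverse Mix Columns step, here is a small Python sketch of the transformation AESIMC applies to a 128-bit round key. The helper names are hypothetical, and `decryption_round_keys` assumes the usual "equivalent inverse cipher" arrangement from FIPS-197 (keys used in reverse order, with every key except the first and last transformed), which may differ in detail from the actual Solaris code.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def inv_mix_columns(block):
    """Apply Inverse Mix Columns to a 16-byte, column-major block."""
    out = []
    for c in range(4):
        a = block[4 * c:4 * c + 4]
        for r in range(4):
            out.append(gf_mul(0x0E, a[r]) ^ gf_mul(0x0B, a[(r + 1) % 4])
                       ^ gf_mul(0x0D, a[(r + 2) % 4]) ^ gf_mul(0x09, a[(r + 3) % 4]))
    return bytes(out)


def decryption_round_keys(enc_keys):
    """Reverse the encryption round keys, transforming all but the outer two."""
    middle = [inv_mix_columns(k) for k in enc_keys[1:-1]]
    return [enc_keys[-1]] + middle[::-1] + [enc_keys[0]]


# Inverse Mix Columns undoes Mix Columns; these values are the round 1
# Mix Columns output and input from the FIPS-197 Appendix B example.
mixed = bytes.fromhex("046681e5e0cb199a48f8d37a2806264c")
print(inv_mix_columns(mixed).hex())
```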
The most highly optimized software AES implementations, including Dr. Gladman's, are vulnerable to timing attacks because they use large lookup tables. By pre-loading the microprocessor cache with the AES table entries and measuring the encryption time, one can find which table entries were accessed. This information could be used to help reveal the secret key (although doing so is still difficult).
Current software mitigation techniques against cache attacks carry significant performance penalties.
However, AES-NI prevents such attacks because AES-NI instruction latency is fixed and data-independent.
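To show why software mitigations are costly, here is a conceptual Python sketch contrasting a direct table lookup, whose memory access depends on secret data, with a lookup that reads every table entry and selects the wanted one with a mask. This is only one common style of mitigation, the function names are made up for this example, and Python itself gives no constant-time guarantees, so treat it as an illustration of the access pattern and not as a hardened implementation.

```python
def lookup_direct(table, index):
    """One memory access, but the address touched depends on the secret index."""
    return table[index]


def lookup_scan(table, index):
    """Read every entry so the access pattern is the same for any index."""
    result = 0
    for i, entry in enumerate(table):
        mask = -(i == index) & 0xFF   # 0xFF when i == index, otherwise 0x00
        result |= entry & mask
    return result


# A stand-in 256-entry byte table (not the real AES S-box)
table = [(7 * i + 3) % 256 for i in range(256)]
secret_index = 0x5A
assert lookup_scan(table, secret_index) == lookup_direct(table, secret_index)
```

The scanning version performs 256 reads where the direct version performs one, which gives a feel for the performance penalty mentioned above.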
Implementing AES-NI support required a number of dependencies. Briefly:
Intel provided an implementation for OpenSSL to optimize AES using assembly that includes the AES-NI instructions and the 128-bit %xmm registers, %xmm0-%xmm15.
The implementation is basically the same as in OpenSSL with minor differences in source.
Changes include reordering the function parameters and structure types from OpenSSL to those defined in Solaris.
For userland threads, the kernel saves and restores the %xmm registers.
However, these registers are not saved or restored when the kernel switches between kernel threads, so I added code to save and restore these registers on a 16-byte-aligned stack
when necessary (that is, when the TS bit in Intel control register CR0 isn't set).
Everyone likes pretty color charts, so here they are.
I ran these on Solaris 11 running on a Piketon Platform system with an Intel Clarkdale processor, which is part of the Westmere processor architecture family.
The "before" case is Solaris 11, unmodified. Keep in mind that the "before" case already has been optimized with hand-coded amd64 assembly.
The "after" case has AES-NI instructions integrated into the Solaris Crypto Framework, which is the PKCS11 library in userland and the "aes" module in the kernel.
Userland library performance
The first chart compares AES128, AES192, and AES256 before and after the AES-NI optimization using the libpkcs11.so/libsoftcrypto.so libraries. The time shown is user time, in seconds, on a quiet system running an internal micro-benchmark, aesspeed (lower is better).
Runtime improved by 79%, 74%, and 79%, respectively.
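For readers who prefer speedup factors, here is a small Python sketch of the conversion. It assumes "improved by 79%" means the user time dropped by 79% compared with the "before" case; the function name is mine.

```python
def speedup_from_runtime_reduction(percent):
    """Convert a runtime reduction in percent to a speedup factor."""
    remaining = 1.0 - percent / 100.0   # fraction of the original runtime left
    return 1.0 / remaining


for name, percent in [("AES128", 79), ("AES192", 74), ("AES256", 79)]:
    print(f"{name}: {speedup_from_runtime_reduction(percent):.1f}x")
```

Under that assumption, a 79% reduction works out to roughly 4.8 times faster and a 74% reduction to roughly 3.8 times faster.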
Solaris kernel performance
This chart shows Solaris kernel performance using kernel module "aes".
This micro-benchmark runs AES128 and AES256 in 4 threads
for 5000 iterations on 1024 bytes of data.
Numbers are in millions of bytes per second (higher is better).
Performance improved here by 26% and 56%, respectively.
Solaris kernel performance
Finally, another Solaris kernel micro-benchmark.
This one is similar to the previous one, except it's running AES128 with 64 bytes of data on 1, 2, 3, and 4 threads.
Performance improved by 50% for the 1, 3, and 4 thread cases.
The 2 thread case looks like an outlier.
This feature is available only for Solaris x86 64-bit,
running on an Intel microprocessor that supports the AES-NI instruction set.
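Processors advertise AES-NI support through the CPUID instruction: leaf 1 returns a feature word in ECX, and bit 25 of that word is the AES-NI flag. The Python sketch below only decodes an ECX value that has already been obtained some other way (Python cannot execute CPUID directly); the function name and the sample values are hypothetical.

```python
AESNI_BIT = 25   # CPUID leaf 1, ECX bit 25: AES-NI instructions present


def has_aesni(cpuid1_ecx):
    """Return True if the CPUID leaf 1 ECX feature word has the AES-NI bit set."""
    return bool((cpuid1_ecx >> AESNI_BIT) & 1)


# Made-up feature words for illustration only
print(has_aesni(1 << AESNI_BIT))   # AES-NI bit set
print(has_aesni(0))                # AES-NI bit clear
```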
The statements in this blog are my personal views, not those of my employer.