Intel AES-NI Optimization on Solaris


AES Encryption process
This AES encryption Flash animation is useful for visualizing AES encryption and understanding how AES operates (Enrique Zabala, Universidad ORT, Uruguay).

AES (Advanced Encryption Standard) is the U.S. government's encryption standard, adopted by the National Institute of Standards and Technology (NIST) in 2001 to replace the older, obsolete Data Encryption Standard (DES). AES is a restricted subset of Rijndael, the cipher NIST selected in 2000 as the winner of its competition to replace DES: AES fixes the block size at 128 bits, whereas Rijndael allows other block sizes. AES is a symmetric-key encryption standard; symmetric-key algorithms use the same secret key for encryption and decryption. AES keys can be 128, 192, or 256 bits long. AES is a block cipher, so encryption and decryption are performed one block (16 bytes) at a time.

Since 2001, AES has been widely adopted and is now a part of several data communication standards, such as WPA2 for Wi-Fi, IPsec for secure Internet transmission, SSH 2 for file and terminal access, and SSL/TLS for secure web connections.

To improve performance, Intel added 6 new instructions to the Intel 64 instruction set, called AES-NI (for AES New Instructions). The AES-NI instructions first became available on the "Westmere" architecture microprocessors (some low-end Westmere chips for mobile/laptop use don't have AES-NI). Westmere processors are part of the Intel "Core" processor family and include the Xeon 5600 processors introduced in 2010. Oracle's Sun Fire X4170 M2 and X4270 M2 are two systems that use Xeon 5600 processors.

Previous Work

Previously, for OpenSolaris 2008.11/Solaris Nevada (build 93), I optimized AES by replacing optimized C code with optimized assembly. The C code was the optimized reference implementation furnished by the authors of AES and first made available in Solaris 10. The assembly code I used was based on Dr. Brian Gladman's AES implementation, which was also faster than the OpenSSL assembly. For details see my previous blog post, Optimizing OpenSolaris With Open Source: AES (2008).

Intel AES-NI Instruction Set

Intel AES-NI consists of 6 instructions: AESENC, AESENCLAST, AESDEC, AESDECLAST, AESKEYGENASSIST, and AESIMC. AESENC performs one round of encryption, which consists of these steps: substitute bytes, shift rows, mix columns, and add (xor) round key. AESENCLAST performs the final encryption round, which doesn't mix columns. Similarly, AESDEC performs one round of decryption and AESDECLAST performs the final decryption round.

The other two instructions assist with key expansion, which turns the user key into the round keys the algorithm uses internally. The AESKEYGENASSIST instruction helps generate the round keys used for encryption. The AESIMC instruction then converts encryption round keys, with an operation called Inverse Mix Columns, to a form suitable for decryption.
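
To make the round structure concrete, here is a small Python sketch (not the Solaris code) that lists which AES-NI instructions run per block and how many round keys are involved for each key size. The round counts (10, 12, 14) come from the AES standard. The initial XOR with the first round key is shown as PXOR, which is an assumption; any 128-bit XOR instruction would do. The function names are made up for illustration.

```python
# Sketch: which AES-NI instructions run per block, and how many round keys
# are needed, for each AES key size. Round counts are from the AES standard.

ROUNDS = {128: 10, 192: 12, 256: 14}  # key bits -> number of rounds (Nr)


def encrypt_sequence(key_bits):
    """Return the per-block instruction sequence for encryption."""
    nr = ROUNDS[key_bits]
    # Round 0 is a plain XOR with the first round key (assumed PXOR here).
    return ["PXOR"] + ["AESENC"] * (nr - 1) + ["AESENCLAST"]


def decrypt_sequence(key_bits):
    """Return the per-block instruction sequence for decryption."""
    nr = ROUNDS[key_bits]
    return ["PXOR"] + ["AESDEC"] * (nr - 1) + ["AESDECLAST"]


def round_key_count(key_bits):
    """Number of 128-bit round keys produced by key expansion (Nr + 1)."""
    return ROUNDS[key_bits] + 1


def aesimc_count(key_bits):
    """Round keys passed through AESIMC to build the decryption schedule.

    The first and last round keys are used unchanged, so only the
    Nr - 1 middle round keys are transformed.
    """
    return ROUNDS[key_bits] - 1


if __name__ == "__main__":
    for bits in (128, 192, 256):
        seq = encrypt_sequence(bits)
        print(bits, "bits:", seq.count("AESENC"), "x AESENC + 1 x AESENCLAST,",
              round_key_count(bits), "round keys,",
              aesimc_count(bits), "x AESIMC")
```

So an AES128 block takes one XOR, nine AESENC instructions, and one AESENCLAST, using 11 round keys in all.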

Cache Attack prevention with AES-NI

The most highly optimized software AES implementations, including Dr. Gladman's, are vulnerable to timing attacks because they use large lookup tables. By pre-loading the microprocessor cache with the AES table entries and measuring the encryption time, one can find which table entries were accessed. This information could be used to help reveal the secret key (although this is still difficult). Current software mitigation techniques against cache attacks carry significant performance penalties. However, AES-NI prevents such attacks because the instructions use no in-memory lookup tables and their latency is fixed and data-independent.
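
The following Python sketch illustrates, in simplified form, why table lookups can leak key material. It is a toy model, not an attack tool. It assumes a 256-entry table with 4-byte entries, 64-byte cache lines, and a table aligned to a cache line (none of these numbers come from this post), and it considers only a first-round lookup, whose index is a plaintext byte XORed with a key byte. All names and sample values are hypothetical.

```python
# Toy model of a first-round table lookup in a table-based AES.
# Assumptions (not from the article): 4-byte table entries, 64-byte cache
# lines, table aligned on a cache-line boundary.

ENTRY_SIZE = 4
CACHE_LINE_SIZE = 64
ENTRIES_PER_LINE = CACHE_LINE_SIZE // ENTRY_SIZE  # 16 entries share a line


def cache_line_touched(plaintext_byte, key_byte):
    """Cache line (0..15) of a 256-entry table hit by a first-round lookup."""
    index = plaintext_byte ^ key_byte
    return index // ENTRIES_PER_LINE


def leaked_key_bits(plaintext_byte, observed_line):
    """High 4 bits of the key byte, recovered from a known plaintext byte
    and the cache line the attacker observed being accessed."""
    return observed_line ^ (plaintext_byte >> 4)


if __name__ == "__main__":
    secret_key_byte = 0xA7   # hypothetical value, unknown to the attacker
    plaintext_byte = 0x3C    # hypothetical value, known to the attacker
    line = cache_line_touched(plaintext_byte, secret_key_byte)
    print(hex(leaked_key_bits(plaintext_byte, line)))  # high nibble of the key byte
```

In this model, knowing which cache line was touched reveals the upper four bits of the key byte. Real attacks are noisier and need many measurements, which is why the text above calls them difficult.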


Implementing AES-NI support in Solaris required a number of dependencies, briefly:

  • getisax(2) and the kernel's x86_feature/x86_featureset bit array needed to be expanded to detect and record the presence of Intel AES-NI instructions (CR 6750666). Solaris sets these bits from the results of the CPUID instruction.
  • The Solaris amd64 assemblers, as(1) and fbe(1), needed to support the new AES-NI instructions (CR 6740663). The disassembler, dis(1), was also extended to display AES-NI instructions (CR 6762031). Also, GNU binutils was updated to 2.19 to get the latest version of the GNU assembler, gas(1), with AES-NI support (CR 6644870).
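
As an illustration of the CPUID check mentioned in the first item, here is a Python sketch that tests feature bits in the ECX word returned by CPUID leaf 1. This is not how Solaris does it internally; it only shows the bit test. The bit positions (25 for AES-NI, 1 for PCLMULQDQ) are from Intel's CPUID documentation. The two sample ECX words are the values a reader reports in the comments on this post for a Xeon L3406 and an i5-661. The function names are hypothetical.

```python
# Sketch: test feature bits in the ECX word returned by CPUID leaf 1.
# Bit positions are from Intel's CPUID documentation.

AESNI_BIT = 25      # AES-NI instructions
PCLMULQDQ_BIT = 1   # carry-less multiply


def has_feature(ecx, bit):
    """True if the given bit is set in the ECX feature word."""
    return (ecx >> bit) & 1 == 1


def describe(ecx):
    """Summarize the two crypto-related feature bits."""
    return {"aes": has_feature(ecx, AESNI_BIT),
            "pclmulqdq": has_feature(ecx, PCLMULQDQ_BIT)}


if __name__ == "__main__":
    # Sample values taken from a reader comment on this post.
    for label, ecx in (("L3406", 0x0098E3FD), ("i5-661", 0x0298E3BF)):
        print(label, describe(ecx))
```

With these sample values, the first word has neither bit set and the second has both.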

Intel provided an implementation for OpenSSL that optimizes AES using assembly with the AES-NI instructions and the 128-bit %xmm registers, %xmm0-%xmm15. The Solaris implementation is basically the same as the one in OpenSSL, with minor differences in the source. Changes include reordering the function parameters and changing structure types from those in OpenSSL to those defined in Solaris. For userland code, the kernel saves and restores the %xmm registers. However, these registers are not saved or restored when the kernel switches between kernel threads, so I added code to save and restore these registers on a 16-byte-aligned (0 mod 16) stack when necessary (that is, when Intel Control Register CR0.TS isn't set).
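
The alignment arithmetic behind a "0 mod 16" save area can be sketched in a few lines of Python. This is only the address calculation, with addresses as plain integers. The function name and the sample stack pointer are hypothetical, and the real code is assembly. One reason such alignment is commonly needed is that aligned SSE moves such as MOVDQA fault on memory operands that are not 16-byte aligned; whether the Solaris code uses those particular instructions is an assumption here.

```python
# Sketch: carve a 16-byte-aligned save area for n 128-bit registers
# below a downward-growing stack pointer. Addresses are plain integers.

XMM_SIZE = 16  # bytes per %xmm register


def aligned_save_area(stack_pointer, num_registers):
    """Return the lowest address of a 16-byte-aligned save area."""
    base = stack_pointer - num_registers * XMM_SIZE
    return base & ~(XMM_SIZE - 1)  # round down to a multiple of 16


if __name__ == "__main__":
    sp = 0x7FFFFFE8  # hypothetical, deliberately misaligned stack pointer
    area = aligned_save_area(sp, 2)
    print(hex(area), area % 16)
```

Rounding down (rather than up) keeps the whole save area below the original stack pointer.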

Performance Results

AES-NI optimization in Solaris userland

Everyone likes pretty color charts, so here they are. I ran these on Solaris 11 running on a Piketon platform system with an Intel Clarkdale processor, which is part of the Westmere processor architecture family. The "before" case is Solaris 11, unmodified. Keep in mind that the "before" case has already been optimized with hand-coded amd64 assembly. The "after" case has AES-NI instructions integrated into the Solaris Crypto Framework, which is the PKCS#11 library in userland and the "aes" module in the kernel.

Userland library performance: The first chart compares AES128, AES192, and AES256 before and after the AES-NI optimization using the userland libraries. The time shown is user time, in seconds, on a quiet system running an internal micro-benchmark, aesspeed (lower is better). Runtime improved by 79%, 74%, and 79%, respectively.
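
Because this chart reports time while the kernel charts report throughput, it can help to convert a runtime reduction into a speedup factor. The Python sketch below does that arithmetic. It assumes "improved by 79%" means the runtime fell by 79% of its original value; the function name is made up for illustration.

```python
def speedup_from_runtime_reduction(percent_reduction):
    """Convert 'runtime fell by p%' into a speedup factor (old_time / new_time)."""
    remaining = 1.0 - percent_reduction / 100.0
    return 1.0 / remaining


if __name__ == "__main__":
    for label, pct in (("AES128", 79), ("AES192", 74), ("AES256", 79)):
        print(f"{label}: {speedup_from_runtime_reduction(pct):.1f}x")
```

Under that assumption, a 79% runtime reduction corresponds to roughly a 4.8x speedup and 74% to roughly 3.8x.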

AES-NI optimization in Solaris kernel

Solaris kernel performance: This chart shows Solaris kernel performance using the kernel module "aes". This micro-benchmark runs AES128 and AES256 in 4 threads for 5000 iterations on 1024 bytes of data. Numbers are in millions of bytes per second (higher is better). Performance improved here by 26% and 56%, respectively.

AES-NI optimization in Solaris kernel

Solaris kernel performance: Finally, another Solaris kernel micro-benchmark. This one is similar to the previous one, except that it runs AES128 with 64 bytes of data on 1, 2, 3, and 4 threads. Performance improved by 50% for the 1-, 3-, and 4-thread cases. The 2-thread case looks like an outlier.

Availability in Solaris

This feature is available only for Solaris x86 64-bit, running on an Intel microprocessor that supports the AES-NI instruction set.

  • I integrated AES-NI optimization in Solaris build snv_114 (see Change Request CR 6767618), so it's available in Oracle Solaris 11 Express 2010.11.
  • I also back-ported AES-NI optimization to Solaris 10 10/09 (aka update 8).
  • AES-NI is available by default in Java through Java Cryptography Extension (JCE)'s PKCS#11 extension. PKCS#11 is an industry standard interface supported by Solaris and used by default by JCE. For more information see Ramesh Nagappan's blog Java Cryptography on Intel Westmere: Solaris Advantage.

Disclaimer: the statements in this blog are my personal views, not those of my employer.


This probably cannot be a Lynnfield CPU, as none of the Lynnfield line provides AES-NI. See

Posted by kcg on November 24, 2010 at 03:55 AM PST #

You're right. I rechecked the lab notes. It's an Intel Piketon platform system with a Clarkdale processor (with AES-NI). Another Piketon has a Lynnfield processor (no AES-NI), but not this one.
- Dan

Posted by Dan Anderson on November 24, 2010 at 05:23 AM PST #


Your web server seems to be misconfigured. The content-type in the HTTP header differs from the one in the XHTML document.

curl -I 2> /dev/null | grep -i content-type
Content-Type: text/plain
curl 2> /dev/null | grep -i content-type
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Some browsers do not render the HTML then.

Best regards,


Posted by Michael on January 13, 2011 at 08:38 PM PST #

The results are correct for me, text/html (see below). Strange. Perhaps you're using a web proxy server or NAT service that's modifying the header?
- Dan

$ curl -I
HTTP/1.1 200 OK
Server: Sun-Java-System-Web-Server/7.0
. . .
Content-type: text/html;charset=utf-8
. . .

. . .
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
. . .
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
. . .

Posted by Daniel Anderson on January 13, 2011 at 11:42 PM PST #

Hi Dan!

I've got myself a box with a Xeon L3406 that, according to Intel support, should support both AES and PCLMULQDQ. However, the bandwidth I'm seeing for ZFS encryption is way lower than I expected.

I've checked with prtconf for the cpuid-features-ecx data. On the L3406 it reports 0098e3fd, on an i5-661 it reports 0298e3bf. If I understand the CPUID instruction and bits correctly, that means the L3406 lacks exactly AES and PCLMULQDQ compared with the i5.

Is there any way to force the kernel to use the optimized codepaths even if the CPU claims not to support the right instructions? Or any way to see whether they are being used?

The weird thing is that when I swapped the L3406 for an i5-661 I only saw a performance increase proportional to the increase in clock frequency. That suggests to me that the optimized code is either off on both CPUs or on on both...

I'd post some actual numbers, but I'm not sure the OTN license allows me to do so. I'm CPU bound and have IO to spare. Could you tell me what ballpark throughput figures I should see with ZFS using aes-128-ccm AND your optimizations on a 2 GHz Clarkdale?

Posted by Elmar on February 08, 2011 at 08:07 AM PST #
