AES is a strong method of symmetric (secret-key) encryption.
It is a U.S. FIPS-approved cryptographic algorithm (FIPS 197)
that operates on 16-byte blocks.
AES has been available since 2001 and is widely used.
However, AES by itself has a weakness.
AES encryption isn't usually used by itself because identical blocks
of plaintext are always encrypted into identical blocks of ciphertext.
This encryption can be easily attacked with "dictionaries" of common blocks of text and allows one to more-easily discern the content of the unknown cryptotext.
This mode of encryption is called "Electronic Code Book" (ECB), because one in theory can keep a "code book" of all known cryptotext and plaintext results to cipher and decipher AES.
In practice, a complete "code book" is not practical, even in electronic form, but large dictionaries of common plaintext blocks is still possible.
Here's a diagram of encrypting input data using AES ECB mode:
Block 1 Block 2
What's the solution to the same cleartext input producing the same ciphertext output?
The solution is to further process the encrypted or decrypted text in such a way that the same text produces different output.
This usually involves an Initialization Vector (IV) and XORing the decrypted or encrypted text.
As an example, I'll illustrate CBC mode encryption:
Block 1 Block 2
The steps for CBC encryption are:
Another mode besides CBC is Counter Mode (CTR).
As with CBC mode, it also starts with a 16-byte IV.
However, for subsequent blocks, the IV is just incremented by one.
Also, the IV ix XORed with the AES encryption result (not the plain text input).
Here's an illustration:
Block 1 Block 2
Which of these modes can be parallelized?
So, in summary, for ECB, CBC, and CTR modes, encryption and decryption can be parallelized with the exception of CBC encryption.
How do we parallelize encryption?
Usually when reading and writing data there are pipeline "stalls" (idle processor cycles) that result from waiting for memory to be loaded or stored to or from CPU registers.
Since the software is written to encrypt/decrypt the next data block where pipeline stalls usually occurs, we can avoid stalls and crypt with fewer cycles.
This software processes 4 blocks at a time, which ensures virtually no waiting ("stalling") for reading or writing data in memory.
Besides interleaving, other optimizations performed are
Everyone likes pretty color charts, so here they are. I ran these on Solaris 11 running on a Piketon Platform system with a 4-core Intel Clarkdale processor @3.20GHz. Clarkdale which is part of the Westmere processor architecture family.
The "before" case is Solaris 11, unmodified.
Keep in mind that the "before" case already has been optimized with hand-coded Intel AESNI assembly.
The "after" case has combined AES-NI and mode instructions,
interleaved 4 blocks at-a-time.
For the first table, lower is better (milliseconds).
The first table shows the performance improvement using the Solaris encrypt(1) and decrypt(1) CLI commands. I encrypted and decrypted a 1/2 GByte file on /tmp (swap tmpfs).
Encryption improved by about 40% and decryption improved by about 80%.
AES-128 is slighty faster than AES-256, as expected.
This software is in Solaris 11 FCS.
It is available in the 64-bit libcrypto library and the "aes" Solaris kernel module.
You must be running hardware that supports AESNI
(for example, Intel Westmere and Sandy Bridge, microprocessor architectures).
The easiest way to determine if AES-NI is available is with the isainfo(1) command. For example,
$ isainfo -v
No special configuration or setup is needed to take advantage of this software.
Solaris libraries and kernel automatically determine if it's running on AESNI-capable machines
and execute the correctly-tuned software for the current microprocessor.
Maximum throughput of AES cipher modes can be achieved by combining
AES encryption with modes processing,
interleaving encryption of 4 blocks at a time,
and using Intel's wide 128-bit %xmm registers and instructions.