# Carryless Multiplication Optimization for AES GCM Mode in Solaris

## Introduction

 Multiplying 6 x 3 TraditionalMultiplication   110 x11 ---- 110 +110 ----- 10010 =18 decimal CarrylessMultiplication   110 x11 ----- 110 +110 ----- 1010 = 10 decimal

Galois/Counter Mode (GCM) is a mode of operation for symmetric block ciphers that's mainly used with AES (AES-GCM). GCM consists of Counter (CTR) Mode encryption followed by a universal hash function, Galois Hashing, for authentication. GMAC is a variation that provides authentication but not encryption. GCM is used with IPsec, SSH, TLS, and is part of the US government's NSA Suite B Cryptography.

The Galois Hash function (GHASH) consists of two parts. The first part is carryless multiply, and the second part is a reduction modulo of the polynomial g(x) = x128 + x7 + x2 + x + 1.
Carryless multiplication is similar to regular multiplication, except there is no propagation (carries). In other words, multiply uses XOR instead of "+". See the table on the right for a simple example.
GHASH uses carryless multiply because it can be easily parallelized (no carry).
Carryless multiplication of two 128-bit blocks is the major time consumer for GHASH though.

## PCLMULQDQ: Carryless Multiply

To improve GHASH performance Intel added the PCLMULQDQ instruction to the Intel 64 instruction set to speed up carryless multiply.
PCLMULQDQ is available on the same processors as the new AES-NI instructions.
That is, "Westmere" architecture microprocessors. Westmere processors are part of the Intel "Core" processor family and include the Xeon 5600 processors introduced in 2010. Oracle's Sun Fire X4170 M2 and X4270 M2 are two systems that use Xeon 5600 processors.

Intel has provided three implementations of carryless multiply: school book, linear folding, and Karatsuba.
School book is the traditional (naïve) multiplication learned in school (multiply and add).
Linear folding replaces the most significant bits of a quantity with a product times a constant. This reduces the quantity length, and therefore reduces the multiplication expense.
Karatsuba optimizes by trading-off one multiply with multiple XOR operations on partial products.
I ran a microbenchmark on these 3 implementations and the then-existing C implementation used in Solaris.
The Solaris C version did not optimize with lookup tables, but I compiled it with the latest Sun Studio using the highest optimization level (-xO3).
On Solaris, the compiler-optimized C version took 7.83 seconds, school book 1.18, linear folding 1.36, and Karatsuba 1.19 seconds (all user time).
This is illustrated by the chart on the right.
Essentially school book and Karatsuba tied for the fastest times,
but Intel performance results seem to indicate that Karatsuba performs better than school book in some cases, so that's the implementation I used.
These results are from hashing one block at-a-time.
When hashing is parallelized multiple blocks at-a-time, results may differ.

The PCLMULQDQ instruction multiplies two 64-bit numbers, producing a 127-bit product.
This can be used as a building block to multiply two 128-bit numbers for a 256-bit product. Here's an excerpt of the Karatsuba multiply from the
assembly code:

// Multiply %xmm0 by %xmm1
movdqu %xmm0, %xmm3
pclmulqdq \$0, %xmm1, %xmm3 // %xmm3 holds a0 \* b0
movdqu %xmm0, %xmm4
pclmulqdq \$16, %xmm1, %xmm4 // %xmm4 holds a0 \* b1
movdqu %xmm0, %xmm5
pclmulqdq \$1, %xmm1, %xmm5 // %xmm5 holds a1 \* b0
movdqu %xmm0, %xmm6
pclmulqdq \$17, %xmm1, %xmm6 // %xmm6 holds a1 \* b1
pxor %xmm5, %xmm4 // %xmm4 holds a0 \* b1 + a1 \* b0
movdqu %xmm4, %xmm5 // move the contents of %xmm4 to %xmm5
psrldq \$8, %xmm4 // shift by %xmm4 64 bits to the right
pslldq \$8, %xmm5 // shift by %xmm5 64 bits to the left
pxor %xmm5, %xmm3
pxor %xmm4, %xmm6 // Result in register pair <%xmm6:%xmm3>

The code used on Solaris is essentially that provided from Intel with modifications between Intel and ATT/Solaris assembly syntax.
I added code to byte swap the input and output, as Solaris uses the Big Endian representation of the 128-bit blocks, even on Intel.
This is easily done using the PSHUFB SSE2 instruction.

An additional step is needed for the kernel.
When executing user-land code, the kernel saves and restores the %xmm registers. However, these registers are not saved or restored when the kernel swaps kernel threads, so I added code to save and restore these registers on a 0 mod 16-aligned stack, when necessary (that is, when Intel Control Register CR0.TS isn't set).

## Performance Results

I ran these on Solaris 11 running on a Piketon Platform system with an Intel Clarkdale processor, which is part of the Westmere processor architecture family. The "before" case is Solaris 11, unmodified. Keep in mind that the "before" case already has been optimized with hand-coded amd64 assembly. The "after" case has AES-NI instructions integrated into the Solaris Crypto Framework, which is the PKCS11 library in userland and the "aes" module in the kernel.

The chart on the right summarizes performance improvement using PCLMULQDQ to optimize carryless multiplication in GCM's GHASH function.
This is running a kernel microbenchmark for AES-GCM. The time is in Mbytes/second.
The "before" case was a simple C implementation of carryless multiply without lookup tables.
It was optimized by the compiler with the cc -x03 flag.
The "after" case is uses the PCMMULQDQ instruction in hand-coded assembly.
GCM is available in the Solaris kernel, and the chart shows improvement of 229%, 221%, 230%, and 224% using 1 to 4 kernel threads, respectively.

## Availability in Solaris

This feature is available only for Solaris x86 64-bit,
running on an Intel microprocessor that supports the AES-NI instruction set.
I integrated GCM optimization in Solaris build snv_125
(see Change Request CR 6826942),
so GCM optimization is available in Oracle Solaris 11 Express 2010.11.
GCM mode is not (yet) available in the user-land PKCS11 library.

## Further Information

Disclaimer: the statements in this blog are my personal views, not that of my employer.