Carryless Multiplication Optimization for AES GCM Mode in Solaris

Carryless Multiplication Optimization for AES GCM Mode in Solaris

Introduction

Multiplying 6 x 3
Traditional
Multiplication  
   110
   x11
  ----
   110
 +110
 -----
 10010
 =18 decimal
Carryless
Multiplication  
     110
     x11
   -----
     110
   +110
   -----
    1010
 = 10 decimal

Galois/Counter Mode (GCM) is a mode of operation for symmetric block ciphers that's mainly used with AES (AES-GCM). GCM consists of Counter (CTR) Mode encryption followed by a universal hash function, Galois Hashing, for authentication. GMAC is a variation that provides authentication but not encryption. GCM is used with IPsec, SSH, TLS, and is part of the US government's NSA Suite B Cryptography.

The Galois Hash function (GHASH) consists of two parts. The first part is carryless multiply, and the second part is a reduction modulo of the polynomial g(x) = x128 + x7 + x2 + x + 1. Carryless multiplication is similar to regular multiplication, except there is no propagation (carries). In other words, multiply uses XOR instead of "+". See the table on the right for a simple example. GHASH uses carryless multiply because it can be easily parallelized (no carry). Carryless multiplication of two 128-bit blocks is the major time consumer for GHASH though.

PCLMULQDQ: Carryless Multiply

To improve GHASH performance Intel added the PCLMULQDQ instruction to the Intel 64 instruction set to speed up carryless multiply. PCLMULQDQ is available on the same processors as the new AES-NI instructions. That is, "Westmere" architecture microprocessors. Westmere processors are part of the Intel "Core" processor family and include the Xeon 5600 processors introduced in 2010. Oracle's Sun Fire X4170 M2 and X4270 M2 are two systems that use Xeon 5600 processors.

Performance of Carryless Multiply Methods Intel has provided three implementations of carryless multiply: school book, linear folding, and Karatsuba. School book is the traditional (naïve) multiplication learned in school (multiply and add). Linear folding replaces the most significant bits of a quantity with a product times a constant. This reduces the quantity length, and therefore reduces the multiplication expense. Karatsuba optimizes by trading-off one multiply with multiple XOR operations on partial products. I ran a microbenchmark on these 3 implementations and the then-existing C implementation used in Solaris. The Solaris C version did not optimize with lookup tables, but I compiled it with the latest Sun Studio using the highest optimization level (-xO3). On Solaris, the compiler-optimized C version took 7.83 seconds, school book 1.18, linear folding 1.36, and Karatsuba 1.19 seconds (all user time). This is illustrated by the chart on the right. Essentially school book and Karatsuba tied for the fastest times, but Intel performance results seem to indicate that Karatsuba performs better than school book in some cases, so that's the implementation I used. These results are from hashing one block at-a-time. When hashing is parallelized multiple blocks at-a-time, results may differ.

The PCLMULQDQ instruction multiplies two 64-bit numbers, producing a 127-bit product. This can be used as a building block to multiply two 128-bit numbers for a 256-bit product. Here's an excerpt of the Karatsuba multiply from the assembly code:

// Copyright (c) 2009 Intel Corporation. All Rights Reserved.
// Multiply %xmm0 by %xmm1
movdqu     %xmm0, %xmm3
pclmulqdq  $0, %xmm1, %xmm3      // %xmm3 holds a0 \* b0
movdqu     %xmm0, %xmm4
pclmulqdq  $16, %xmm1, %xmm4     // %xmm4 holds a0 \* b1
movdqu     %xmm0, %xmm5
pclmulqdq  $1, %xmm1, %xmm5      // %xmm5 holds a1 \* b0
movdqu     %xmm0, %xmm6
pclmulqdq  $17, %xmm1, %xmm6     // %xmm6 holds a1 \* b1
pxor       %xmm5, %xmm4          // %xmm4 holds a0 \* b1 + a1 \* b0
movdqu     %xmm4, %xmm5          // move the contents of %xmm4 to %xmm5
psrldq     $8, %xmm4             // shift by %xmm4 64 bits to the right
pslldq     $8, %xmm5             // shift by %xmm5 64 bits to the left
pxor       %xmm5, %xmm3
pxor       %xmm4, %xmm6          // Result in register pair <%xmm6:%xmm3>

The code used on Solaris is essentially that provided from Intel with modifications between Intel and ATT/Solaris assembly syntax. I added code to byte swap the input and output, as Solaris uses the Big Endian representation of the 128-bit blocks, even on Intel. This is easily done using the PSHUFB SSE2 instruction.

An additional step is needed for the kernel. When executing user-land code, the kernel saves and restores the %xmm registers. However, these registers are not saved or restored when the kernel swaps kernel threads, so I added code to save and restore these registers on a 0 mod 16-aligned stack, when necessary (that is, when Intel Control Register CR0.TS isn't set).

Performance Results

Carryless Multiply Optimization

I ran these on Solaris 11 running on a Piketon Platform system with an Intel Clarkdale processor, which is part of the Westmere processor architecture family. The "before" case is Solaris 11, unmodified. Keep in mind that the "before" case already has been optimized with hand-coded amd64 assembly. The "after" case has AES-NI instructions integrated into the Solaris Crypto Framework, which is the PKCS11 library in userland and the "aes" module in the kernel.

The chart on the right summarizes performance improvement using PCLMULQDQ to optimize carryless multiplication in GCM's GHASH function. This is running a kernel microbenchmark for AES-GCM. The time is in Mbytes/second. The "before" case was a simple C implementation of carryless multiply without lookup tables. It was optimized by the compiler with the cc -x03 flag. The "after" case is uses the PCMMULQDQ instruction in hand-coded assembly. GCM is available in the Solaris kernel, and the chart shows improvement of 229%, 221%, 230%, and 224% using 1 to 4 kernel threads, respectively.

Availability in Solaris

This feature is available only for Solaris x86 64-bit, running on an Intel microprocessor that supports the AES-NI instruction set. I integrated GCM optimization in Solaris build snv_125 (see Change Request CR 6826942), so GCM optimization is available in Oracle Solaris 11 Express 2010.11. GCM mode is not (yet) available in the user-land PKCS11 library.

Further Information

Disclaimer: the statements in this blog are my personal views, not that of my employer. Trademarks are the property of the respective trademark owners.

<script type="text/javascript" src="http://platform.twitter.com/widgets.js"></script>
<script src="http://connect.facebook.net/en_US/all.js#xfbml=1"></script>
Comments:

I noticed in this article as well as your previous article discussing implementation of AES-NI in OpenSolaris you specifically identify support for Intel's Westmere processor. However, you have not specified whether your implementation of the AES-NI instructions are hard-coded to only support Intel's processors; or if your code will use AES-NI by checking a processors features bit flag using the CPU ID instruction or the appropriate OS API to determine if support for AES-NI is present. Could you please provide more clarity as to whether your current implementation of AES-NI is CPU vendor agnostic?

I mention this as AMD plans to release both client and server processors based on its Bulldozer micro-architecture which support Intel's AES-NI from the 1H thru 2H of 2011. If you have implemented a vendor check for Intel then your improved routines will not run on otherwise capable hardware from AMD.

See John Fruehe's blog titled, "Following Instructions," third paragraph from the top for some details on AMD's support for AES-NI. Mr. Fruehe is AMD's Director of Product Marketing for Server, Embedded and FireStream products.

Here is the link: http://blogs.amd.com/work/2010/11/22/following-instructions/

For a more detailed discussion, you can review AMD Fellow and Senior Architect, Dave Christie's article "Striking a Balance" which discusses AMD's implementation of AVX, AES-NI, non-destructive FMA4, and more. The link is: http://blogs.amd.com/developer/2009/05/06/striking-a-balance/

I look forward to reading your response, thank you.

Best regards,
Daniel R.

Posted by Daniel R. on November 23, 2010 at 05:35 PM PST #

Hi Daniel,
No, the implementation is not hard-coded to support only Intel Westmere. Solaris generally doesn't use the processor-specific CPUID instruction directly. Solaris uses the processor-agnostic function getisax(2) in userland, and x86_featureset or x86_features variable (S11 and S10 kernel) to determine what processor features are available. Support for future processors, such as AMD's Bulldozer (which I understand supports AES-NI) can be easily added. I can't commit to or promise future support though, and, in any case, that work would be done by someone in the Solaris x86 kernel group. Thanks for the pointers though.
- Dan

Posted by Dan Anderson on November 24, 2010 at 12:42 AM PST #

Hi Dan,

I appreciate your reply, thank you.

Best regards,
Daniel R.

Posted by Daniel R. on November 24, 2010 at 02:27 AM PST #

I noticed in the INTEL docs that the implementation of the gcm_mul() func in usr/src/common/crypto/modes/gcm.c is considered "not very efficient". Their words not mine. See "Multiplication Instruction ands its Usage for Computing the GCM Mode" Intel document 323640-001 Rev 2.0 page 9. Personally I am a SPARC fan and wonder how a Sparc T3 processor would hande this with many cores working.

Posted by Dennis Clarke on February 06, 2011 at 09:47 AM PST #

True, the software version that was replaced in Intel is not efficient. GCM mode is newly implemented and optimization is still a work-in-progress. Stay tuned . . .

Posted by Dan Anderson on February 06, 2011 at 02:36 PM PST #

@Dennis,
Now that the SPARC T4 chip has been announced, it has two new instructions to perform carryless multiply, xmulxhi ans xmulx. Although the functionality and hardware implementation differ, the xmulxhi and xmulx SPARC T4 instructions are similar to the Intel Westmere's pclmulqdq in that they optimize the carryless multiply operations by performing it on the processor with dedicated, non-privileged CPU instructions.

Posted by Dan Anderson on September 30, 2011 at 05:52 PM PDT #

Post a Comment:
Comments are closed for this entry.
About

Solaris cryptography and optimization.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today