High performance software crypto on CMT processors

While cryptography is typically viewed as computationally intensive (and so less well suited to CMT processors), software implementations of common cryptographic algorithms can be readily optimized to excel on CMT processors. Current software implementations have been optimized for traditional processors, with multiple lookup tables sized to all fit in a processors small level-1 caches. It is this use of multiple small tables that leads to the high computational overheads associated with most cryptographic implementations -- due to the significant arithmetic operations needed to manipulate access to the tables and recombine the results from the various tables.

For example, consider the Kasumi algorithm, which is essential to 3G mobile telephony. In Kasumi, a block is 8-bytes, the key is 128-bits (although it is expanded to a 1024-bit key schedule before use), and processing consists of 8 rounds per block. While a variety of operations are performed per block, the most costly operation is termed FI and consists of the following (in C notation):

nine = (u16)(in>>7);
seven = (u16)(in&0x7F);
nine = (u16)(S9[nine] \^ seven);
seven = (u16)(S7[seven] \^ (nine & 0x7F));

seven \^= (subkey>>9);
nine \^= (subkey&0x1FF);
nine = (u16)(S9[nine] \^ seven);
seven = (u16)(S7[seven] \^ (nine & 0x7F));
in = (u16)((seven<<9) + nine);
return( in );

where in and subkey are two-byte variables, S9 is a 512-element lookup table and S7 is a 128-element lookup table. This operation is performed three times per round, for a total of 24 times per block. Each FI operation requires 22 instructions (for SPARC), for a total of 576 FI-derived instructions per block. Given the abundance of logical and shift operations, it is apparent that superscalar processors will perform this function very well, with an Instructions Per Cycle (IPC) of 2.5 or more. In contrast, the Niagara single-strand IPC is around 0.65. Further, due to the compute intensive nature of the code, the single Niagara strand uses around two thirds of the processor core's issue resources. As a result, performance does not scale as additional Kasumi threads run on a core.

To overcome this problem, the implementation can be optimized to radically reduce the instruction count. A reduction in instruction count may be achieved by replacing large parts of the FI function using a large lookup table. In the original Kasumi code, the 16-bit elements are divided into two smaller elements, one 7-bits and one 9-bits. These smaller elements are processed independently and the results combined. While this ensures that the lookup tables are small, significant logical and arithmetic operations are required to split the 16-bit elements and later recombine the smaller 7-bit and 9-bit elements back into the 16-bit elements. Significant computational saving may be achieved by processing an entire 16-bit element at once, using large lookup tables, as shown below:

t0 = LT0[in];
t0 = t0 \^ subkey;
in = LT1[t0];

The new lookup tables (LT0 and LT1) are now much larger, each being composed of 65536 2-byte elements. Note that the lookup tables are constant, may be precomputed, and are independent of the keys. However, using this approach, the FI function now only requires five instructions, a four times reduction from the original implementation. Further note that in both the optimized and the original code, the lookup table accesses are dependent and cannot be performed in parallel or prefetched in advance.

The lookup tables that once fitted in the L1 cache are now much larger and will now largely reside in the L2 cache -- this instruction count reduction has been at the expense of increased memory stalls, but here we are laying to the strengths of a CMT processor. As a result, it would appear that the performance of the code will remain largely unchanged, having traded increased instruction count for increased memory stalls. This optimization technique is beneficial for at least two reasons. First, MT (multithreading) performance is improved. For the initial implementation, due to the large computational requirements of the algorithm, as additional strands are leveraged, aggregate core performance improves very little. Given that a single strand is capable of consuming almost all of a processor core’s resources, as additional VT/SMT strands are leveraged, these strands rapidly start to deprive the other strands of resources, and the aggregate core performance is improved very little. In contrast, in the optimized version, the strands spend most of their time stalled waiting for accesses to the lookup tables to complete and consume a much smaller fraction of a processor core’s resources. As a result, as the number of strands is increased, performance scales almost linearly. Indeed, for Niagara, per-core Kasumi performance is around 8 times the performance of a single strand, and per-chip Kasumi performance is close to 64X single-strand performance. Indeed, single-core Kasumi performance is around 1.3X the performance of a single-core of a 3GHz Xeon processor.


[Trackback] Lawrence Spracklen wrote a really interesting article about cryptography done in software on CMT. This sounds counterintuitive at first, as cryptography is considered as a computational intensive task and thus considered as a tasks for fast superscalar...

Posted by c0t0d0s0.org on July 11, 2008 at 06:02 PM PDT #

Really great post ! thanks

Posted by seb on July 15, 2008 at 11:12 PM PDT #

Post a Comment:
Comments are closed for this entry.

Dr. Spracklen is a senior staff engineer in the Architecture Technology Group (Sun Microelectronics), that is focused on architecting and modeling next-generation SPARC processors. His current focus is hardware accelerators.


Top Tags
« June 2016