SHA-1 is a "hash" or "digest" operation that produces a 160 bit (20 byte) checksum value on arbitrary data, such as a file.
It is intended to uniquely identify text and to verify it hasn't been modified.
Max Locktyukhin and others at Intel have improved the performance of the SHA-1 digest algorithm using multiple techniques.
This code has been incorporated into Solaris 11
and is available in the Solaris Crypto Framework
via the libmd(3LIB), the industry-standard libpkcs11(3LIB) library, and Solaris kernel module sha1.
The optimized code is used automatically on systems with a x86 CPU supporting
SSSE3 (Intel Supplemental SSSE3). Intel microprocessor architectures that support SSSE3 include Nehalem, Westmere, Sandy Bridge microprocessor families.
Further optimizations are available for microprocessors that support AVX (such as Sandy Bridge).
Although SHA-1 is considered obsolete because of weaknesses found in the SHA-1 algorithm—NIST recommends using at least SHA-256, SHA-1 is still widely used and will be with us for awhile more.
Collisions (the same SHA-1 result for two different inputs) can be found with moderate effort.
SHA-1 is used heavily though in SSL/TLS, for example.
And SHA-1 is stronger than the older MD5 digest algorithm, another digest option defined in SSL/TLS.
SHA-1 operates by reading an arbitrary amount of data.
The data is read in 512 bit (64 byte) blocks
(the last block is padded in a specific way to ensure it's a full 64 bytes).
Each 64 byte block has 80 "rounds" of calculations (consisting of a mixture of "ROTATE-LEFT", "AND", and "XOR") applied to the block.
Each round produces a 32-bit intermediate result, called W[i].
Here's what each round operates:
W[i] = (W[i-3] XOR W[i-8] XOR W[i-14] XOR W[i-16]) ROTATE-LEFT 1
W[i] = (W[i-6] XOR W[i-16] XOR W[i-28] XOR W[i-32]) ROTATE-LEFT 2
Intel SSSE3 makes use of 16 %xmm registers, each 128 bits wide.
The 4 32-bit inputs to a round, W[i-6], W[i-16], W[i-28], W[i-32], all fit in one %xmm register.
The following code snippet, from Max Locktyukhin's article, converted to ATT assembly syntax, computes 4 rounds in parallel with just a dozen or so SSSE3 instructions:
movdqa W_minus_04, W_TMPpxor W_minus_28, W // W equals W[i-32:i-29] before XOR
A window of the 32 previous results, W[i-1] to W[i-32] is saved in memory on the stack.
This is best illustrated with a chart. Without vectorization, computing the rounds is like this
(each "R" represents 1 round of SHA-1 computation):
With vectorization, 4 rounds can be computed in parallel:
The new "Sandy Bridge" microprocessor architecture, which supports AVX, allows another interesting optimization.
SSSE3 instructions have two operands, a input and an output.
AVX allows three operands, two inputs and an output.
In many cases two SSSE3 instructions can be combined into one AVX instruction.
The difference is best illustrated with an example.
Consider these two instructions from the snippet above:
movdqa W_minus_04, W_TMPpalignr $8, W_minus_08, W_TMP
With AVX they can be combined in one instruction:
vpalignr $8, W_minus_08, W_minus_04, W
This optimization is also in Solaris, although Sandy Bridge-based systems aren't widely available yet.
As an exercise for the reader, AVX also has 256-bit media registers, %ymm0 - %ymm15 (a superset of 128-bit %xmm0 - %xmm15).
Can %ymm registers be used to parallelize the code even more?
In addition to using the Intel code described above, I performed other minor optimizations to the Solaris SHA-1 code:
I measured performance on a Sun Ultra 27 (which has a Nehalem-class Xeon 5500 Intel W3570 microprocessor @3.2GHz). Turbo mode is disabled for consistent performance measurement. Graphs are better than words and numbers, so here they are:
The first graph shows the Solaris digest(1) command before and after the optimizations discussed here, contained in libmd(3LIB).
I ran the digest command on a half GByte file in swapfs (/tmp) and execution time decreased from 1.35 seconds to 0.98 seconds.
The second graph shows the the results of an internal microbenchmark that uses the Solaris libpkcs11(3LIB) library. The operations are on a 128 byte buffer with 10,000 iterations. The results show operations increased from 320,000 to 416,000 operations per second.
Finally the third graph shows the results of an internal kernel microbenchmark that uses the Solaris /kernel/crypto/amd64/sha1 module.
The operations are on a 64Kbyte buffer with 100 iterations. third graph shows the results of an internal kernel microbenchmark that uses the Solaris /kernel/crypto/amd64/sha1 module.
The operations are on a 64Kbyte buffer with 100 iterations.
The results show for 1 kernel thread, operations increased from 410 to 600 MBytes/second.
For 8 kernel threads, operations increase from 1540 to 1940 MBytes/second.
This code is in Solaris 11 FCS.
It is available in the 64-bit libmd(3LIB) library for 64-bit programs and is in the Solaris kernel.
You must be running hardware that supports Intel's SSSE3 instructions
(for example, Intel Nehalem, Westmere, or Sandy Bridge microprocessor architectures).
The easiest way to determine if SSSE3 is available is with the isainfo(1) command. For example,
nehalem $ isainfo -v
If the output also shows "avx", the Solaris executes the even-more optimized 3-operand AVX instructions for SHA-1 mentioned above:
sandybridge $ isainfo -v
No special configuration or setup is needed to take advantage of this code.
Solaris libraries and kernel automatically determine if it's running on SSSE3 or AVX-capable machines
and execute the correctly-tuned code for that microprocessor.
The Solaris 11 Crypto Framework,
via the sha1 kernel module and libmd(3LIB) and libpkcs11(3LIB) libraries,
incorporated a useful SHA-1 optimization from Intel for SSSE3-capable microprocessors.
As with other Solaris optimizations, they come automatically "under the hood"
with the current Solaris release.