By sprack on Jun 23, 2008
In my recent CommunityOne Microparallelism presentation, one of the case studies discusses how to convert high-ILP code for superscalar processors into TLP implementations for CMT processors. The case study is discussed with reference to the SPARC implementation of SHA-1, which I wrote several years ago. The code, tuned for sun4u processors, can actually be found in OpenSolaris here. The message expansion portion of the SHA-1 computation is performed in parallel with the compression function portion using the VIS instructions. The SIMD nature of the VIS instructions is not leveraged, merely the fact that they allow integer operations to be performed on the FP pipelines. As a result, the IPC on an UltraSPARC IV+ processor increases from around 2 to almost 4, improving performance by over 1.7X...
On CMT processors, such as the T2, this approach doesn't deliver optimal performance. However, given the low inter-thread synchronization costs, one can consider performing these two portions of the SHA-1 computation using two threads:
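To make the split concrete, here is a minimal sketch of the idea in Python rather than SPARC assembly or C. It is not the OpenSolaris code: the names `expand`, `compress`, and `sha1_two_threads`, the bounded queue, and its `depth` parameter are all assumptions made for illustration. One thread performs message expansion for each 64-byte block and hands the 80-word schedule to a second thread, which runs the compression function. Python's global interpreter lock means this shows only the structure of the pipeline, not the speedup one would hope for on real hardware threads.

```python
import queue
import struct
import threading

MASK = 0xFFFFFFFF  # keep every intermediate value within 32 bits


def rotl(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK


def pad(message):
    """Apply standard SHA-1 padding: 0x80, zeros, 64-bit big-endian bit length."""
    bit_len = len(message) * 8
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    return padded + struct.pack(">Q", bit_len)


def expand(block):
    """Message expansion: turn one 64-byte block into the 80-word schedule."""
    w = list(struct.unpack(">16I", block))
    for t in range(16, 80):
        w.append(rotl(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    return w


def compress(state, w):
    """Compression function: fold one 80-word schedule into the 5-word state."""
    a, b, c, d, e = state
    for t in range(80):
        if t < 20:
            f, k = (b & c) | (~b & d), 0x5A827999
        elif t < 40:
            f, k = b ^ c ^ d, 0x6ED9EBA1
        elif t < 60:
            f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
        else:
            f, k = b ^ c ^ d, 0xCA62C1D6
        tmp = (rotl(a, 5) + f + e + k + w[t]) & MASK
        a, b, c, d, e = tmp, a, rotl(b, 30), c, d
    return tuple((s + v) & MASK for s, v in zip(state, (a, b, c, d, e)))


def sha1_two_threads(message, depth=4):
    """Hash `message` with expansion and compression in separate threads.

    `depth` (hypothetical) bounds how far expansion may run ahead.
    """
    schedules = queue.Queue(maxsize=depth)

    def expander():
        data = pad(message)
        for off in range(0, len(data), 64):
            schedules.put(expand(data[off:off + 64]))
        schedules.put(None)  # sentinel: no more blocks

    producer = threading.Thread(target=expander)
    producer.start()

    # The calling thread acts as the compression thread.
    state = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)
    while True:
        w = schedules.get()
        if w is None:
            break
        state = compress(state, w)
    producer.join()
    return struct.pack(">5I", *state).hex()
```

The expansion thread never needs the hash state, and the compression thread never touches the raw message, so the only shared data is the schedule for each block. That one-way handoff is what makes the split attractive when inter-thread synchronization is cheap; the queue here is a stand-in for whatever lightweight mechanism the hardware threads would actually use.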