HPC Consortium: Shared Memory Parallelization on Multi-Core Processors
By Josh Simons on Jun 25, 2007
While I enjoyed Barton and Ruud's talks about the Niagara 2 processor yesterday at Sun's HPC Consortium meeting in Dresden, I always get more of a kick from customer presentations. In this case, Dieter an Mey from RWTH Aachen University gave a nice talk about the pitfalls and benefits of multi-core processors for programmers. While it was delivered here at a High Performance Computing event, the observations and lessons are applicable to anyone interested in application performance on multi-core processors. Given the direction of our industry, that encompasses a lot of programmers.
Dieter and his colleague Christian Terboven examined performance on several systems based on a variety of processors: the UltraSPARC IV, the UltraSPARC T2 (Niagara 2), Intel's Woodcrest and Clovertown, and a quad-core AMD Opteron. For each of these systems, they measured achievable aggregate bandwidth across a range of active thread counts, processor bindings, and memory placements. I've included a few of his slides with the Niagara 2 performance results removed (sorry). As Dieter said, the results aren't surprising once you look carefully at the non-uniformities in the underlying system architectures. Programmers who do not understand these issues, however, will likely achieve very suboptimal application performance. My own concern is that as multi-core and multi-threaded processors become the norm across the computer industry, most programmers will not understand these issues the way a seasoned HPC programmer does. This is one small part of the challenge the software industry faces in helping programmers achieve high performance on these new processors.
In addition to the bandwidth tests above, Dieter and Christian ran two applications on each of these systems. The first was a very cache-friendly code used to compute contact interactions between bevel gears. The second was a Navier-Stokes code that puts heavy stress on the memory subsystem due to its manipulation of sparse data structures. They also ran additional throughput tests, launching multiple copies of these applications on each system.
The Niagara 2 results demonstrated the value of the CMT approach in hiding memory latency for throughput workloads, and showed that, thanks to the increased floating-point capability of this new processor, it can do so even for FP-intensive codes.
Oh, and one more thing. The graph below shows performance results for the bevel gear code run with different numbers of threads. Look at Columns 5 and 6. These results were generated on Solaris and Linux using the same compilers and the exact same hardware. Higher is better.