Jack Dongarra: Four Important Concepts to Consider when Using Computing Clusters
By Josh Simons on Jun 17, 2008
My notes from Jack Dongarra's talk in the Cluster session at ISC 2008 in Dresden.
Four (maybe 6) Important Concepts to Consider when Using Computing Clusters. Jack Dongarra, University of Tennessee & Oak Ridge National Laboratory, USA
The concepts are:
- Effective use of Many-Core. Dynamic data driven execution, block data layout
- Exploiting mixed precision in algorithms
- Self adapting and auto tuning of software to underlying architecture
- Fault tolerant algorithms
- Exploit hybrid architectures
- Communication-avoiding algorithms
In the "old days" processors became faster each year. Today the clock speed is fixed or slowing, but transistor counts are still doubling every 18-24 months. The number of cores will double roughly every two years while clock speed decreases rather than increases. We will need to deal with millions of threads in a system.
TOP500 perspective. Regular milestone increases in system peak performance every 11 years: 1 GFLOP 22 years ago (1 thread), 1 TFLOP 11 years ago (10^3 threads), 1 PFLOP now (10^6 threads). Extrapolate to ExaFLOP: 10^9 threads in about 2019.
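The extrapolation above is just the 1000x-per-11-years trend carried forward one step. A minimal sketch of that arithmetic (the function name and milestone table are illustrative, not from the talk):

```python
def extrapolate(year, flops, threads, years_ahead=11, factor=1000):
    """Project the next TOP500 milestone by the observed trend:
    peak performance and thread counts both grow ~1000x every 11 years."""
    return year + years_ahead, flops * factor, threads * factor

# 1 TFLOP on ~10^3 threads in 1997; 1 PFLOP on ~10^6 threads in 2008.
year, flops, threads = extrapolate(2008, 1e15, 1e6)
# -> 2019, 1e18 FLOPS (exascale), ~10^9 threads
```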
How to code for these multicore systems? Fine granularity to support a high level of parallelization. Asynchrony will be important as granularity becomes finer. Must rethink the design of our software. A bigger disruption than the move from shared memory to message passing. Much resistance then and likely more at this transition point.
Steps in a current LAPACK LU decomposition with partial pivoting. Most of the parallelization comes from the matrix-multiplication step. It is bulk synchronous. Scalar step, parallel, scalar, parallel...in synchrony. Very inefficient use of resources. A more event-driven, multithreaded approach is needed based on an analysis of the directed graph of the computation. Adaptive lookahead. Not a new idea, but reviving it to exploit these new architectures.
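The bulk-synchronous pattern being criticized can be seen in a toy right-looking blocked LU (pivoting omitted for clarity; this is a sketch, not LAPACK's actual code): each pass does a mostly sequential panel step, then waits for all the parallel-friendly updates before starting the next panel. A DAG scheduler would instead launch the next panel factorization as soon as its inputs are ready (adaptive lookahead).

```python
import numpy as np

def lu_unblocked(A):
    """Unblocked LU without pivoting, overwriting A (L below, U above)."""
    n = A.shape[0]
    for k in range(n - 1):
        A[k+1:, k] /= A[k, k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return A

def blocked_lu(A, b=2):
    """Right-looking blocked LU. Each loop iteration is a fork-join cycle:
    a scalar step followed by parallelizable updates, all synchronized
    before the next iteration -- the bulk-synchronous structure the talk
    says wastes resources."""
    A = A.astype(float)
    n = A.shape[0]
    for k in range(0, n, b):
        e = min(k + b, n)
        lu_unblocked(A[k:e, k:e])                  # scalar: panel factorization
        if e < n:
            L = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
            U = np.triu(A[k:e, k:e])
            A[k:e, e:] = np.linalg.solve(L, A[k:e, e:])        # parallel: U12
            A[e:, k:e] = np.linalg.solve(U.T, A[e:, k:e].T).T  # parallel: L21
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]               # parallel: GEMM update
    return A
```

The GEMM update on the trailing submatrix is where the parallel work lives, which is why "most of the parallelization comes from the matrix-multiplication step."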
PLASMA is a redesign of LAPACK/ScaLAPACK. Asynchrony, dynamic scheduling, fine granularity, locality of reference. He shows a graph of LU performance of PLASMA against LAPACK and the AMD and Intel implementations -- does better at all or most problem sizes. Same for QR. And Cholesky. Considerably better performance at all problem sizes.
Performance of single precision on conventional processors: roughly a factor of two over double precision, comparing SGEMM/DGEMM. Less compute, less data being moved. Exploit 32-bit as much as possible. Some algorithms can use mixed precision.
Intriguing potential. Automatically switch between SP and DP to match desired accuracy. Potential for GPU and FPGA -- use as little precision as you can get away with.
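The standard way to get double-precision accuracy out of mostly single-precision work is iterative refinement: solve in SP, then correct with residuals computed in DP. A minimal sketch (real implementations reuse one LU factorization across iterations rather than re-solving, and test convergence instead of using a fixed iteration count):

```python
import numpy as np

def mixed_precision_solve(A, b, iters=5):
    """Solve Ax = b doing the expensive solves in float32 and
    accumulating corrections in float64 (iterative refinement)."""
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                    # residual in DP
        d = np.linalg.solve(A32, r.astype(np.float32))   # cheap SP correction
        x += d.astype(np.float64)                        # accumulate in DP
    return x
```

For well-conditioned systems a handful of iterations recovers full double-precision accuracy while the dominant cost stays in the (roughly 2x faster) single-precision solver.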
Performance Optimization. Embed self-tuning optimizations into algorithms so codes can adapt themselves to underlying system architecture. Past successes: ATLAS, FFTW, Spiral, Open MPI (will adapt itself to the characteristics of global communication interconnect.)
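The core move in systems like ATLAS is empirical search: time candidate parameter values on the actual machine and keep the winner. A toy version of that idea for the block size of a blocked matrix multiply (entirely illustrative; real autotuners search far larger parameter spaces, usually at install time):

```python
import time
import numpy as np

def autotune_block_size(n=256, candidates=(16, 32, 64, 128)):
    """Time a blocked matrix multiply at several block sizes on this
    machine and return the fastest -- a toy ATLAS-style empirical search."""
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)

    def blocked_mm(b):
        C = np.zeros((n, n))
        for i in range(0, n, b):
            for j in range(0, n, b):
                for k in range(0, n, b):
                    C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
        return C

    timings = {}
    for b in candidates:
        t0 = time.perf_counter()
        blocked_mm(b)
        timings[b] = time.perf_counter() - t0
    return min(timings, key=timings.get)
```

The winning block size depends on the cache hierarchy of the machine the search runs on, which is exactly the point: the code adapts itself rather than hard-coding an architecture-specific constant.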
Fault Tolerance. MPI not up to the task of dealing with faults. Mismatch between hardware and programming. Two kinds of faults important: Erasures--lose a resource/processor and Errors--detect an error and recover. Much work done looking at these issues -- will become more important.
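One well-known approach to the erasure case is algorithm-based fault tolerance: carry checksum data alongside the computation so a lost piece can be reconstructed from the survivors. A minimal sketch with a single checksum row (illustrative; real schemes use weighted checksums to tolerate multiple failures and also maintain the checksums through the computation):

```python
import numpy as np

def encode_with_checksum(A):
    """Append a checksum row (column sums) to a matrix so that any
    single lost row can later be reconstructed."""
    return np.vstack([A, A.sum(axis=0)])

def recover_row(Ac, lost):
    """Recover data row `lost` from a checksum-encoded matrix:
    checksum row minus the sum of the surviving data rows."""
    data = Ac[:-1]
    survivors = np.delete(data, lost, axis=0)
    return Ac[-1] - survivors.sum(axis=0)
```

If each row lives on a different processor, losing one processor is recoverable from the rest plus the checksum, without restarting from a checkpoint.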
Conclusions. For the last decade, research tilted in favor of hardware. Need to rebalance this -- barriers to progress are increasingly in software. Hardware has a half-life of a few years, software's is decades.
Parallelism is exploding. Performance will be a software problem. Locality will continue to be important. Massive parallelism required, including pipelining and overlap.