How to Observe Performance of OpenMP Codes
By Josh Simons on Nov 17, 2008
A great benefit of the OpenMP standard is that it allows a programmer to specify parallelization strategies, leaving the implementation details to the compiler and its runtime system. The downside is that the programmer loses some understanding of, and visibility into, what is actually happening, which makes it difficult to find and fix performance problems. This is precisely the issue discussed by Professor Barbara Chapman from the University of Houston during her talk at the Sun HPC Consortium Meeting here in Austin today.
Professor Chapman briefly described the work she has been doing using the OpenUH compiler as a research base. The older POMP project had used source-level instrumentation and source-to-source translation to produce codes that allowed some access to performance information, but the approach wasn't very popular. Instead, instrumentation is now implemented directly in the compiler and inserted much later in the compilation process. This has allowed the instrumentation to be improved and also cut down to a more selective set of probe points, greatly reducing its overhead.
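To make the idea of source-level probe points concrete, here is a minimal sketch in C of what hand-inserted instrumentation around a parallel region can look like. It is not the POMP interface or anything OpenUH generates; the names probe_slot, probe_enter, probe_exit, instrumented_sum and probe_total_calls are hypothetical and exist only for this illustration. The sketch assumes at most 256 threads and falls back to a single thread when built without OpenMP support.

```c
#include <assert.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch also builds without OpenMP support. */
static int omp_get_thread_num(void) { return 0; }
#endif

#define MAX_THREADS 256  /* assumption: enough slots for this sketch */

/* Hypothetical per-thread probe record (not part of POMP or OpenUH). */
typedef struct {
    long   calls;    /* how many times this thread entered the region */
    double seconds;  /* wall-clock time accumulated inside the region */
    double start;    /* timestamp of the most recent probe_enter */
} probe_slot;

static probe_slot slots[MAX_THREADS];

static double now_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);  /* C11 wall-clock timestamp */
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Probe placed at the start of the region; each thread writes its own slot. */
static void probe_enter(void) {
    int tid = omp_get_thread_num();
    assert(tid < MAX_THREADS);
    slots[tid].calls++;
    slots[tid].start = now_seconds();
}

/* Probe placed at the end of the region. */
static void probe_exit(void) {
    int tid = omp_get_thread_num();
    slots[tid].seconds += now_seconds() - slots[tid].start;
}

/* Sum an array in parallel with probes inserted by hand. */
double instrumented_sum(const double *a, int n) {
    double total = 0.0;
    #pragma omp parallel reduction(+:total)
    {
        probe_enter();
        #pragma omp for
        for (int i = 0; i < n; i++)
            total += a[i];
        probe_exit();
    }
    return total;
}

/* Total number of region entries recorded across all threads. */
long probe_total_calls(void) {
    long calls = 0;
    for (int t = 0; t < MAX_THREADS; t++)
        calls += slots[t].calls;
    return calls;
}
```

Every probe costs a function call and a timestamp on every thread, which is why keeping the set of probe points small matters for overhead.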
Professor Chapman touched on a few application examples in which this selective instrumentation approach has resulted in significant performance improvements with little work needed to pinpoint the problem areas within the code. In one example, application performance was easily increased by 20 to 25% over a range of problem sizes. In another case involving an untuned OpenMP code, the instrumentation quickly pointed to incorrect usage of shared arrays and initialization problems related to first-touch memory allocation.
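The talk summary does not say how the first-touch problem was fixed, but the general pattern is well known. On systems with a first-touch policy, a memory page is placed near the thread that first writes to it, so an array initialized serially can end up entirely on one node. The sketch below, in C, initializes the array in parallel with the same static schedule as the later compute loop. The function names alloc_first_touch and scale_add are hypothetical, and actual page placement depends on the operating system and on thread binding.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate n doubles and zero them in parallel, so that under a
   first-touch policy each page is first written by the thread that
   will later work on it. Returns NULL if allocation fails. */
double *alloc_first_touch(size_t n) {
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = 0.0;
    return a;
}

/* Compute loop using the same static schedule as the initialization,
   so each thread mostly touches the pages it initialized itself. */
void scale_add(double *a, size_t n, double s) {
    assert(a != NULL);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = s * (double)i + a[i];
}
```

The key design choice is that both loops use schedule(static) over the same index range, so the mapping of iterations to threads is the same in both as long as the thread count does not change.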
A second thrust of this research work is to take advantage of the fact that the OpenMP runtime layer is basically in charge as the application executes. Because it controls execution, it can also be used to gather runtime performance information as part of a performance monitoring system.
Both of these techniques give programmers tools to debug the performance of their codes at the semantic level at which those codes were originally written. This is critically important as more and more HPC (and other) users attempt to extract good parallel performance from existing and future multi-core chips.