DProfile - Filtering Sets
By nk on Mar 27, 2007
Russell Brown and I optimized an HPC genetics code with the set token IN and the tokens STACK and LEAF.
The IN operator tests an event for the existence of a scalar in a list: lval IN rval evaluates to true when lval is present in rval. For example, in my previous blog entry this expression appeared in the filter area:
(US4p_L2CacheLine IN (5483)) && (VA_L2 IN (2155))
The expression checks that event memory object US4p_L2CacheLine is IN set 5483, and that event memory object VA_L2 is IN set 2155. A set can also be a list, such as:
VA_L2 IN (2155,2156,2157)
Two tokens are available to test instruction constructs. The LEAF token returns the identifier of the leaf routine, and the STACK token returns the list of routine identifiers on the stack. When you're in the Functions tab, click the filter button and the identifier of the highlighted function will appear. Two expressions are very useful:
LEAF IN (123)
(123) IN STACK
The first returns metrics accumulated while routine identifier 123 (the function identifier, not the name) is executing, i.e., exclusive metrics. The second returns metrics accumulated while routine identifier 123 is anywhere on the stack, i.e., inclusive metrics.
Russell A. Brown and I used these two expressions to diagnose various phases of Russ's program. He sent me a note with some findings using DProfile within Studio Performance Analyzer:
The following is a report of the progress to date in using DProfile to debug and improve the performance of a parallel application executing on a Niagara 1 system.
The application is a benchmark that performs genome sequence matching. It implements a new algorithm that uses binary trees to represent a sparse matrix. Older algorithms represent a dense matrix and execute several decimal orders of magnitude more slowly.
DProfile has identified several performance bugs in the new algorithm, all but the last of which have been corrected based on DProfile analysis:
1. The multi-threaded application built a binary tree for each thread and called malloc() each time a new tree node was needed. The different threads were forced to synchronize on malloc(), which required many mutex locks, as demonstrated by DProfile. The solution to this problem was for each thread to allocate an array large enough to contain all of the nodes required for a given binary tree. In this manner, each thread called malloc() once instead of many times. Thus the number of mutex locks was greatly reduced and the performance of the algorithm improved.
2. Each thread used some small matrices that were allocated as 2D arrays, i.e., a row vector of pointers was allocated, and each pointer referenced a column vector that was allocated as well. Because each thread needed to allocate a row vector as well as multiple column vectors, and further because the threads executed asynchronously, the column vectors for a particular thread were scattered throughout memory. This situation created many L2 cache misses, as demonstrated by DProfile. The solution to this problem was to allocate the matrices as 1D arrays that were large enough to contain all of the column vectors. Thus the column vectors for a particular thread were no longer scattered throughout memory, which resulted in fewer L2 cache misses and improved performance.
3. The 1D arrays were allocated in row-major order but were accessed sequentially both in row-major and in column-major order. Column-major access created L2 cache misses, as demonstrated by DProfile. The solution to this problem was to restructure the algorithm so as to use both a row vector and a column vector instead of a matrix. This change resulted in fewer L2 cache misses and improved performance.
4. Despite all of the above improvements, which were motivated by DProfile analysis, DProfile now indicates that malloc() incurs substantial [user lock stall penalties], independent of whether -lmtmalloc or -lumem is specified on the link command. To investigate this problem more deeply, it is necessary to add functionality to DProfile.