Gathering unique signatures from ISV applications

In yesterday's entry, I hinted at a way of gathering ISV application data to guide performance tuning. The method itself isn't new, but how the data is used is, as far as I know, unique.

The PLG (Performance Library Group) has created an interpose library with an interposer for every user-callable routine in Perflib. Greg Nakhimovsky wrote an introduction to writing interposers on the developers forum, which is a good place to start if you're unfamiliar with the technique.

This library intercepts every external call to Perflib, records some information about the call (we call this the signature of the call), and then passes control to the actual Perflib routine.
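
To make the mechanics concrete, here is a minimal sketch of an interposer in C. The Fortran-style symbol name daxpy_ and the trace output are assumptions on my part, not the PLG's actual implementation; the shared-object-plus-LD_PRELOAD pattern with dlsym(RTLD_NEXT, ...) is the standard technique Greg's article introduces.

    /* sigtrace.c: hypothetical interposer for the BLAS routine daxpy.
       Build as a shared object and load it ahead of Perflib with
       LD_PRELOAD so this definition is found first. */
    #define _GNU_SOURCE   /* for RTLD_NEXT on Linux; Solaris has it by default */
    #include <dlfcn.h>
    #include <stdio.h>

    void daxpy_(const int *n, const double *alpha, const double *x,
                const int *incx, double *y, const int *incy)
    {
        /* Find the next definition of daxpy_ in link order, i.e. the
           real Perflib routine, on the first call. */
        static void (*real_daxpy)(const int *, const double *, const double *,
                                  const int *, double *, const int *);
        if (!real_daxpy)
            *(void **)(&real_daxpy) = dlsym(RTLD_NEXT, "daxpy_");

        /* Record the call; a real tracer would accumulate unique
           signatures and counts rather than log every call. */
        fprintf(stderr, "daxpy N=%d Incx=%d Incy=%d\n", *n, *incx, *incy);

        /* Pass control to the actual routine. */
        real_daxpy(n, alpha, x, incx, y, incy);
    }

Built with something like cc -shared -fPIC -o libsigtrace.so sigtrace.c -ldl and run via LD_PRELOAD=./libsigtrace.so ./app, the interposer sees every call before the library does.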

Each run produces a trace file containing the list of unique signatures and a frequency count for each signature. A simple example might look like:

    daxpy:146:2:1:1:512
    daxpy:56:2:1:1:1536
    dasum:152:1:2
    dasum:60:1:18

Reading the first line: daxpy was called with N=146, Alpha=(neither 1 nor 0), Incx=1, and Incy=1, and it was called with those exact parameters 512 times. Similarly, dasum was called with N=152 and Incx=1, and it was called with those parameters 2 times.

The reason the exact Alpha value isn't captured for daxpy is that it has no impact on the performance of the routine. This lets us collapse all the calls to daxpy that share common integer parameters into a single unique signature. That isn't true when Alpha is one or zero, since the code special-cases those values.
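
To show how the collapsing might work, here is a hypothetical sketch; the three-way Alpha code and the function names are my own illustration, not the PLG's code.

    #include <stdio.h>

    /* Reduce Alpha to a code. Zero and one take special-cased paths in
       the library, so they stay distinct; every other value performs
       identically and shares the code 2. */
    static int alpha_code(double alpha)
    {
        if (alpha == 0.0) return 0;
        if (alpha == 1.0) return 1;
        return 2;
    }

    /* Build the signature key for one daxpy call. */
    static void daxpy_signature(char *buf, size_t len,
                                int n, double alpha, int incx, int incy)
    {
        snprintf(buf, len, "daxpy:%d:%d:%d:%d",
                 n, alpha_code(alpha), incx, incy);
    }

A tracer would bump a per-key counter each time a signature is built and dump the key:count pairs at exit, which yields lines like daxpy:146:2:1:1:512 above.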

This information allows us to look deeply into the way an application is actually using the library. Most tuning tools (à la prof, analyzer, etc.) will tell you how much time you're spending in a particular routine, but that's about it. In a recent example, we were benchmarking the Nastran application against several competitors on the Opteron boxes. In most cases we were doing fine, but for one data set we were getting beaten quite soundly. We used the interpose library to find that dgemv was being called with M=1.5 million and N=7. It just so happens that our code was blocked by 8 elements, so the entire time was being spent in the cleanup code! After handling this case better, we were able to more than double the performance of the routine for that particular class of cases (tall, skinny A matrices).
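
A schematic of why N=7 was pathological for code blocked by 8 (purely illustrative; the real kernel is an unrolled, software-pipelined affair, shown here as plain loops):

    #include <stddef.h>

    /* Column-blocked y += A*x over an M-by-N column-major A. */
    void dgemv_like(int m, int n, const double *a, int lda,
                    const double *x, double *y)
    {
        int j = 0;
        /* Fast path: columns in blocks of 8. With N=7, the loop body
           never executes. */
        for (; j + 8 <= n; j += 8)
            for (int jj = j; jj < j + 8; jj++)
                for (int i = 0; i < m; i++)
                    y[i] += a[i + (size_t)jj * lda] * x[jj];
        /* Cleanup path: one column at a time. With N=7 and
           M=1.5 million, all of the work lands here. */
        for (; j < n; j++)
            for (int i = 0; i < m; i++)
                y[i] += a[i + (size_t)j * lda] * x[j];
    }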

Some readers (I kid myself into thinking anyone has read this far) might wonder why on earth I'm writing about such wonkish details. The reason is simple: in upcoming entries I'm going to present performance comparisons between Perflib and competing libraries on applications that one or both libraries may never have actually been used to run. I'm able to do this (to a certain extent) because once I have the signature profile of an application, I can project the impact a particular library will have on its performance. Can I do it perfectly? No, but I can make some supportable generalizations about how Perflib would compare to a competing product on a particular application. As always, thanks for your attention.
