The Sun Studio Performance Analyzer is one of the two tools that I use most frequently (the other is spot - which is now in SS12U1!). It's a very powerful tool, but a lot of that power is not immediately visible to users. I'm going to discuss a couple of ways I've used the analyzer to view parallel applications.
The most common first step for looking at the performance of parallel apps is to use the timeline. However, the timeline can look a bit cluttered with all of the call stack data. Often you are really just interested in the leaf node. Fortunately this can be configured from the data presentation dialog box. To get the view I want I'm only showing the top leaf in the call stack:
This results in a display of the samples in each routine. By default this can look very colourful, but you can make it easier on the eye by selecting the colours used to display the graphic. In the following graphic I've picked green for one parallel routine that I'm interested in, blue for another, and yellow for all the time spent waiting for more work to be assigned:
The graphic shows that the work is not evenly spread across all threads. The first few threads spend more time in the hot routines than the later threads. We can see this much more clearly using the 'threads' view of the data. To get this view you need to go back to the data presentation dialog and select the threads tab. It's also useful to select the 'cpus' tab at the same time.
The threads tab shows the activity of each thread for the currently displayed metrics. This is useful for seeing whether one thread is doing more work than another. The cpus tab shows the time that the app spends on each CPU in the machine, which can indicate whether a particular CPU is oversubscribed. The thread activity looks like:
This confirms what we thought earlier: some of the threads are much more active than others. The top chart shows the user time, which indicates that all the threads spent the same amount of time running 'stuff'. The middle chart shows the time that each thread spent doing useful work, and the lower chart shows the time spent in overhead. The exercise now is to try to improve the distribution of work across the threads.
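To give a feel for what "improving the distribution" might involve, here is a minimal Python sketch. It is a toy simulation, not the application I profiled: I'm assuming a loop whose iterations get steadily cheaper, and comparing one contiguous block per thread against handing out small chunks to whichever thread is free first. The function names, the workload and the chunk size are all made up for illustration; the real cause of the imbalance in any given app needs to be found from the profile.

```python
import heapq

def static_blocks(costs, n_threads):
    """Give each thread one contiguous block of iterations."""
    n = len(costs)
    loads = []
    for t in range(n_threads):
        start = t * n // n_threads
        end = (t + 1) * n // n_threads
        loads.append(sum(costs[start:end]))
    return loads

def dynamic_chunks(costs, n_threads, chunk):
    """Hand out fixed-size chunks, in order, to whichever thread is free first."""
    # Heap of (time at which the thread becomes free, thread id).
    free_at = [(0, t) for t in range(n_threads)]
    heapq.heapify(free_at)
    loads = [0] * n_threads
    for start in range(0, len(costs), chunk):
        work = sum(costs[start:start + chunk])
        busy_until, t = heapq.heappop(free_at)
        loads[t] += work
        heapq.heappush(free_at, (busy_until + work, t))
    return loads

def imbalance(loads):
    """Busiest thread's load divided by the mean load; 1.0 is perfectly even."""
    return max(loads) / (sum(loads) / len(loads))

# Hypothetical workload: 64 iterations whose cost falls from 64 units to 1.
costs = list(range(64, 0, -1))
static_loads = static_blocks(costs, 4)
dynamic_loads = dynamic_chunks(costs, 4, chunk=2)
print("static :", static_loads, round(imbalance(static_loads), 2))
print("dynamic:", dynamic_loads, round(imbalance(dynamic_loads), 2))
```

With this made-up workload the block split gives the first thread 904 units of work and the last thread only 136, which is the same shape as the profile above, where the first few threads are the busy ones. Handing out small chunks evens this out, at the cost of more trips to fetch work, so the chunk size is a trade-off between balance and overhead.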