Friday May 23, 2014

Generic hardware counter events

A while back, Solaris introduced support for PAPI, which is probably as close as we can get to a de facto standard for performance counter naming. For performance counter geeks like me, this is not quite enough information: I actually want to know the names of the raw counters that are used. Fortunately, these are listed in the generic_events man page:

$ man generic_events
Reformatting page.  Please Wait... done

CPU Performance Counters Library Functions   generic_events(3CPC)

     generic_events - generic performance counter events

     The Solaris  cpc(3CPC)  subsystem  implements  a  number  of
     predefined, generic performance counter events. Each generic
     ...

   Intel Pentium Pro/II/III Processor
       Generic Event          Platform Event          Event Mask
     PAPI_ca_shr          l2_ifetch                 0xf
     PAPI_ca_cln          bus_tran_rfo              0x0
     PAPI_ca_itv          bus_tran_inval            0x0
     PAPI_tlb_im          itlb_miss                 0x0
     PAPI_btac_m          btb_misses                0x0
     PAPI_hw_int          hw_int_rx                 0x0
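If you have the man page text saved somewhere, a short awk filter is enough to pull out just the generic-to-platform mapping. This is a minimal sketch, assuming the column layout shown above; the here-document stands in for the real (much longer) man page output:

```shell
# Minimal sketch: extract the generic->platform event mapping from
# generic_events(3CPC) output. The here-document below stands in for
# the real man page text, which has many more entries.
awk '/^ *PAPI_/ { print $1, "->", $2 }' <<'EOF'
     PAPI_tlb_im          itlb_miss                 0x0
     PAPI_btac_m          btb_misses                0x0
EOF
```

The same filter works on the live output of `man generic_events` piped through `col -b` to strip the formatting.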

Thursday Apr 28, 2011

Exploring Performance Analyzer experiments

I was recently profiling a script to see where the time went, and I ended up wanting to extract the profiles for just a single component. The structure of an Analyzer experiment is that there's a single root directory, and inside that there's a single level of subdirectories representing all the child processes. Each subdirectory is a profile of a single process, and these can all be loaded as individual experiments. Inside every experiment directory there is a log.xml file, which contains a summary of what the experiment contains.

The name of the executable that was run is held on an xml "process" line. So the following script can extract a list of all the profiles of a particular application.

$ grep myapp `find . -name 'log.xml'` | grep process | sed 's/:.*//' > myapp_profiles

Once I have a list of every profile of my application, I can extract the exit times from those experiments and sort them using the following line:

$ grep exit `cat myapp_profiles` | sed 's/.*tstamp="//' | sed 's/".*//' | sort -n

Once I have the list of times, I can then locate an experiment with a particular runtime; it's probably going to be the one with the longest runtime:

$ grep exit `cat myapp_profiles` | grep 75.9
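The steps above can be combined into a single loop that prints each experiment alongside its exit timestamp, slowest run last. This is a hypothetical sketch: the two fake log.xml files below are stand-ins for real experiment directories, and only the `exit`/`tstamp="..."` parts of their content are taken from the commands above.

```shell
# Hypothetical sketch: print "<exit-time> <log.xml-path>" for every
# profile listed in myapp_profiles, slowest run last. The fake log.xml
# files are stand-ins; only the exit event's tstamp attribute matters.
mkdir -p demo/test.1.er demo/test.2.er
printf '<event kind="exit" tstamp="75.9"/>\n' > demo/test.1.er/log.xml
printf '<event kind="exit" tstamp="12.3"/>\n' > demo/test.2.er/log.xml
find demo -name 'log.xml' > myapp_profiles

while read f
do
  t=$(grep exit "$f" | sed 's/.*tstamp="//; s/".*//')
  echo "$t $f"
done < myapp_profiles | sort -n
```

The last line of the output names the experiment with the longest runtime, which is usually the one worth loading first.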

Monday Jul 26, 2010

What does take_deferred_signal() mean in my profile?

Every so often you'll see take_deferred_signal() appear in the profile of an application, sometimes as quite a time-consuming function. So, what does it mean?

It actually comes from signal handling code in libc. If a signal comes in while the application is in a critical section, the signal gets deferred until the critical section is complete. When the application exits the critical section, all the deferred signals get taken.

Typically, this function becomes hot due to mutex locks in malloc and free, but other library calls can also cause it. The way to diagnose what is happening is to examine the call stack, so let's run through an example. Here is some multithreaded, malloc/free-heavy code.

#include <stdlib.h>
#include <pthread.h>

void *work( void *param )
{
  while ( 1 ) { free( malloc( 100 ) ); }
  return 0;
}

int main()
{
  pthread_t thread;
  pthread_create( &thread, 0, work, 0 );
  for ( int i = 0; i < 10000000; i++ )
  {
    free( malloc( 100 ) );
  }
}

Profiling, we can see that take_deferred_signal() is the hottest function. In real code the other hot functions would probably give us a clue as to the problem, but here they are an artifact of the rather simple demonstration code.

Excl.     Incl.      Name
User CPU  User CPU
  sec.      sec.
36.456    36.456     <Total>
14.210    14.210     take_deferred_signal
 4.203    21.265     mutex_lock_impl
 3.082     3.082     clear_lockbyte
 2.872    17.062     mutex_trylock_adaptive

The next thing to look at is the call stack for take_deferred_signal() as this will tell us who is calling the function.

Attr.      Name
User CPU
14.210     do_exit_critical
14.210    *take_deferred_signal

do_exit_critical() doesn't tell us anything; we already know that it is called when the code exits a critical section. Continuing up the call stack we find:

Attr.      Name
User CPU
14.190     mutex_trylock_adaptive
 0.020     mutex_unlock
 0.       *do_exit_critical
14.210     take_deferred_signal

This is more useful: we now know that the time is spent in mutex locks, but we don't yet know which code is using those mutex locks. In this case the bulk of the time comes from mutex_trylock_adaptive(), so that is the routine to investigate:

Attr.      Name
User CPU
17.062     mutex_lock_impl
 2.872    *mutex_trylock_adaptive
14.190     do_exit_critical

So we're still in the mutex lock code; we need to find who is calling the mutex locks:

Attr.      Name
User CPU
11.938     free
 9.327     malloc
 4.203    *mutex_lock_impl
17.062     mutex_trylock_adaptive

So we finally discover that the time is due to calls to mutex locks in malloc() and free().

Monday Sep 21, 2009

Profiling scripts

If you try to use the Sun Studio Performance Analyzer on something that is not an executable, you'll end up with an error message:

$ collect kstat
Target `kstat' is not a valid ELF executable

The most reliable workaround for this that I've discovered is as follows. First of all, make up a shell script that executes the command passed into it:

$ more <script>
#!/bin/sh
$@

Then run the collect command as:

$ collect -F on /bin/sh <script> <params>

The -F on is required so that collect follows forked processes; otherwise collect will just profile the top-level /bin/sh, which will do minimal work before forking off the actual command.

When loading the resulting experiment into the Analyzer you have to load all the descendant processes. You can do this by going to the filter dialog box and selecting all the processes, or you can take the easier route of placing en_desc on into the .er.rc file in your home directory. This tells the Analyzer to always load the descendant processes, which might make loading experiments slower, but guarantees that you actually load all the data, and not just the top-level code.
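For reference, the directive is a single line in the ~/.er.rc file, as described above:

```
en_desc on
```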

One other thing to note is that each new process can contribute wall and wait time, so the wall time shown in the analyzer can be a multiple of the actual wall time. To see this in action do:

$ collect -F on /bin/sh kstat

The wall time on this will be a multiple of the actual runtime because each shell script contributes wall time while it waits for the kstat command to complete.

Thursday Sep 13, 2007

Dwarf debug format

Just been looking at the DWARF debug format. In Sun Studio 12, all the compilers switched to using this format instead of the older stabs format. Chris Quenelle has put a useful guide to debug information up on the Sun Wiki. The DWARF format overview document is probably the easiest to read; the version 3 specification is more detailed. The tool to extract the DWARF information from an object file is dwarfdump. For example, the -l flag will extract the line number information.

Friday Dec 15, 2006

Analyzer probe effects (part 2)

A while back I looked at the probe effect when running the Performance Analyzer on a multi-threaded code. The results showed that the probe effect, in terms of difference in runtime, was most pronounced when using multiple threads and recording the experiment to remote disk.

An alternative metric for the probe effect is the amount of distortion in the results. To measure this I took the same code and observed the user time recorded by the Analyzer when the experiments were recorded to remote disk or to /tmp. I also compared this to the user time reported by the time command. The results are shown in the following figure.

The results show that the runtime when recording the experiment to remote disk increased more significantly than in my previous experiment. There is also one data point where the runtime was substantially perturbed.

The interesting part of the graph is the recorded user time. For both situations, the recorded user time is pretty much identical to that reported by the time command. Possibly most critically, this is true even when the runtime is hugely disturbed. Anecdotally, this is what we've been assuming for a while: that the Analyzer data is largely correct even when there is some activity on the machine. (I'm sure that this would not be the case if the machine were placed under an extreme load, for example running two MT benchmarks.)

There is one rather odd observation: when recording to the remote file system, the user time is slightly underestimated. But it is apparent from the graph that this is not a large effect.

The conclusion is that recording to a remote file system does not cause large distortion to the collected data, although it does cause the application to run more slowly.


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge

