Friday Apr 01, 2011

Profiling scripts

One feature that crept into the Oracle Solaris Studio 12.2 release was the ability for the performance analyzer to follow scripts. It is necessary to set the environment variable SP_COLLECTOR_SKIP_CHECKEXEC to use this feature - as shown below.

bash-3.00$ file `which which`
/bin/which:     executable /usr/bin/csh script
bash-3.00$ collect which
Target `which' is not a valid ELF executable
bash-3.00$ collect which
Creating experiment database ...

Monday Jul 26, 2010

What does take_deferred_signal() mean in my profile?

Every so often you'll see take_deferred_signal() appear in the profile of an application. Sometimes as quite a time consuming function. So, what does it mean?

It actually comes from signal handling code in libc. If a signal comes in while the application is in a critical section, the signal gets deferred until the critical section is complete. When the application exits the critical section, all the deferred signals get taken.

Typically, this function becomes hot due to mutex locks in malloc and free, but other library calls can also cause it. The way to diagnose what is happening is to examine the call stack. So let's run through an example. Here is some multithreaded malloc/free heavy code.

#include <stdlib.h>
#include <pthread.h>

void \*work( void\* param )
  while ( 1 ) { free( malloc(100) ); }

int main()
  pthread_t thread;
  pthread_create( &thread, 0, work, 0 );
  for ( int i=0; i<10000000; i++ )
    free ( malloc (100) );

Profiling, we can see that take_deferred_signal() is the hottest function. The other hot functions would probably give us a clue as to the problem, but that is an artifact of the rather simple demonstration code.

Excl.     Incl.      Name
User CPU  User CPU
  sec.      sec.
36.456    36.456     <Total>
14.210    14.210     take_deferred_signal
 4.203    21.265     mutex_lock_impl
 3.082     3.082     clear_lockbyte
 2.872    17.062     mutex_trylock_adaptive

The next thing to look at is the call stack for take_deferred_signal() as this will tell us who is calling the function.

Attr.      Name
User CPU
14.210     do_exit_critical
14.210    \*take_deferred_signal

do_exit_critical() doesn't tell us anything, we already know that it is called when the code exits a critical section. Continuing up the call stack we find:

Attr.      Name
User CPU
14.190     mutex_trylock_adaptive
 0.020     mutex_unlock
 0.       \*do_exit_critical
14.210     take_deferred_signal

Which is more useful, we now know that the time is spent in mutex locks, but we don't know the user of those mutex locks. In this case the bulk of the time comes from mutex_trylock_adaptive(), so that is the routine to investigate:

Attr.      Name
User CPU
17.062     mutex_lock_impl
 2.872    \*mutex_trylock_adaptive
14.190     do_exit_critical

So we're still in the mutex lock code, we need to find who is calling the mutex locks:

Attr.      Name
User CPU
11.938     free
 9.327     malloc
 4.203    \*mutex_lock_impl
17.062     mutex_trylock_adaptive

So we finally discover that the time is due to calls to mutex locks in malloc() and free().

Friday Apr 16, 2010

Kernel and user profiling with dtrace

Just put together a short dtrace script for profiling both userland and kernel activity.

#!/usr/sbin/dtrace -s
#pragma D option quiet



  printf("%5s %20s %20s %10s\\n","PID","EXECNAME","FUNC","COUNT");
  printa("%5d %20s %20A %10@d\\n",@);
  printa("%5d %20s %#20a %10@d\\n",@k);

The script samples the current pc for both user land and kernel about 100x per second. There's some risk of over counting since there's one probe for user and one probe for kernel. Every second the code prints out the top 25 user and kernel routines, broken down by pid and executable name. The output looks like:

  PID             EXECNAME                 FUNC                  COUNT
  556                 Xorg`pixman_image_unref   1
  556                 Xorg`pixman_fill          1
  556                 Xorg`memcpy                      1
    0                sched unix`dispatch_softint                 1
    0                sched unix`dispatch_hardint                 2
    0                sched unix`mach_cpu_idle                    91

Monday Oct 27, 2008

Tutorial on writing a hardware performance counter profiler

This post looks like a very useful tutorial on writing a hardware performance counter profiler. Probably a good read for anyone who fancied writing their own Performance Analyzer.

Monday Nov 26, 2007

Multi-threading webcast

A long while back I was asked to contribute a video that talked about parallelising applications. The final format is a webcast (audio and slides) rather than the expected video. This choice ended up being made to provide the clearest visuals of the slides, plus the smallest download.

I did get the opportunity to do the entire presentation on video - which was an interesting experience. I found it surprisingly hard to present to just a camera - I think the contrast with presenting to an audience is that you can look around the room and get feedback as to the appropriate level of energy to project. A video camera gives you no such feedback, and worse, there's no other place to look. Still I was quite pleased with the final video. The change to a webcast was made after this, so the audio from the video was carried over, and you still get to see about 3 seconds of the original film, but the rest has gone. I also ended up reworking quite a few of the slides - adding animation to clarify some of the topics.

The topics covered at a break-neck pace are, parallelising using Pthreads and OpenMP. Autoparallelisation by the compiler. Profiling parallel applications. Finally, detecting data races using the thread analyzer.

Wednesday Aug 15, 2007

Comparing analyzer experiments

When performance tuning an application it is really helpful to be able to compare the performance of the current version of the code with an older version. This was one of the motivators for the performance reporting tool spot. spot takes great care to capture information about the system that the code was run on, the flags used to build the application, together with the obvious things like the profile of the application. The tool spot_diff, which is included in the latest release of spot, pulls out the data from multiple experiments and produces a very detailed comparison between them - indicating if, for example, one version had more TLB misses than another version.

However, there are situations where it's necessary to compare two analyzer experiments, and er_xcmp is a tool which does just that.

er_xcmp extracts the time spent in each routine for the input data that is passed to it, and presents this as a list of functions together with the time spent in each function from each data set. er_xcmp handles an arbitrary number of input files, so it's just as happy comparing three profiles as it is two. It's also able to handle data from bit, so comparisons of instruction count as well as user time are possible.

The input formats can be Analyzer experiments, fsummary output from er_print, or output directories from er_html - all three formats get boiled down to the same thing and handled in the same way by the script.

Here's some example output:

% er_xcmp
    2.8     8.8 <Total>
    1.8     2.6 foo
    1.0     6.2 main
    N/A     N/A _start

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge


« April 2014
The Developer's Edge
Solaris Application Programming
OpenSPARC Book
Multicore Application Programming