Friday Oct 21, 2011


SPARC and x86 processors have different endianness. SPARC is big-endian and x86 is little-endian. Big-endian means that numbers are stored with the most significant data earlier in memory. Conversely little-endian means that numbers are stored with the least significant data earlier in memory.

Think of big endian as writing numbers as we would normally do. For example one thousand, one hundred and twenty would be written as 1120 using a big-endian format. However, writing as little endian it would be 0211 - the least significant digits would be recorded first.

For machines, this relates to which bytes are stored first. To make data portable between machines, a format needs to be agreed. For example in networking, data is defined as being big-endian. So to handle network packets, little-endian machines need to convert the data before using it.

Converting the bytes is a trivial matter, but it has some performance pitfalls. Let's start with a simple way of doing the conversion.

template <class T>
T swapslow(T in)
  T out;
  char * pcin = (char*)∈
  char * pcout = (char*)&out;

  for (int i=0; i<sizeof(T); i++)
    pcout[i] = pcin[sizeof(T)-i];
  return out;

The code uses templates to generalise it to different sizes of integers. But the following observations hold even if you use a C version for a particular size of input.

First thing to look at is instruction count. Assume I'm dealing with ints. I store the input to memory, then I access the input one byte at a time, storing each byte to a new location in memory, before finally loading the result. So for an int, I've got 10 memory operations.

Memory operations can be costly. Processors may be limited to only issuing one per cycle. In comparison most processors can issue two or more logical or integer arithmetic instructions per cycle. Loads are also costly as they have to access the cache, which takes a few cycles.

The other issue is more subtle, and I've discussed it in the past. There are RAW issues in this code. I'm storing an int, but loading it as four bytes. Then I'm storing four bytes, and loading them as an int.

A RAW hazard is a read-after-write hazard. The processor sees data being stored, but cannot convert that stored data into the format that the subsequent load requires. Hence the load has to wait until the result of the store reaches the cache before the load can complete. This can be multiple cycles of wait.

With endianness conversion, the data is already in the registers, so we can use logical operations to perform the conversion. This approach is shown in the next code snippet.

template <class T>
T swap(T in)
  T out=0;
  for (int i=0; i<sizeof(T); i++)
  return out;

In this case, we avoid the stores and loads, but instead we perform four logical operations per byte. This is higher cost than the load and store per byte. However, we can usually do more logical operations per cycle and the operations normally take a single cycle to complete. Overall, this is probably slightly faster than loads and stores.

However, you will usually see a greater performance gain from avoiding the RAW hazards. Obviously RAW hazards are hardware dependent - some processors may be engineered to avoid them. In which case you will only see a problem on some particular hardware. Which means that your application will run well on one machine, but poorly on another.

Wednesday Jul 13, 2011

Presenting at Oracle Open World

I'll be presenting at Oracle Open World in October. The full searchable catalog is on-line, or you can browse the speakers by name.

Wednesday May 18, 2011

Profiling running applications

Sometimes you want to profile an application, but you either want to profile it after it has started running, or you want to profile it for part of a run. There are a couple of approaches that enable you to do this.

If you want to profile a running application, then there is the option (-P <pid>) for collect to attach to a PID:

$ collect -P <pid>

Behind the scenes this generates a script and passes the script to dbx, which attaches to the process, starts profiling, and then stops profiling after about 5 minutes. If your application is sensitive to being stopped for dbx to attach, then this is not the best way to go. The alternative approach is to start the application under collect, then collect the profile over the period of interest.

The flag -y <signal> will run the application under collect, but collect will not gather any data until profiling is enabled by sending the selected signal to the application. Here's an example of doing this:

First of all we need an application that runs for a bit of time. Since the compiler doesn't optimise out floating point operations unless the flag -fsimple is used, we can quickly write an app that spends a long time doing nothing:

$ more slow.c
int main()
  double d=0.0;
  for (int i=0;i<10000000000; i++) {d+=d;}

$ cc -g slow.c

The next step is to run the application under collect with the option -y SIGUSR1 to indicate that collect should not start collecting data until it receives the signal USR1.

$ collect -y SIGUSR1 ./a.out &
[1] 1187
Creating experiment database ...

If we look at the generated experiment we can see that it exists, but it contains no data.

$ er_print -func
Functions sorted by metric: Exclusive User CPU Time

Excl.     Incl.      Name
User CPU  User CPU
 sec.      sec.
0.        0.         

To start gathering data we send SIGUSR1 to the application, sending the signal again stops data collection. Sending the signal twice we can collect two seconds of data:

$ kill -SIGUSR1 1187;sleep 2;kill -SIGUSR1 1187
$ er_print -func
Functions sorted by metric: Exclusive User CPU Time

Excl.     Incl.      Name
User CPU  User CPU
 sec.      sec.
2.001     2.001      
2.001     2.001      main
0.        2.001      _start

Friday May 06, 2011

Calling functions

I was looking at some code today and it reminded me of a very common performance issue - reloading data around calls. Suppose I have some code like:

int variable;

void function(int *array)
  for (i=0; i<1000; i++)
     if (variable==1) 

You might be surprised to find that "variable" is reloaded very iteration of the loop. The reason for this is that the loop calls another function - either func1() or func2() and the compiler knows that the function might change the value of "variable" - so to be correct it needs to be reloaded.

This problem can be fixed by caching a local copy of the variable. The compiler "knows" that local (or stack based) variables don't get modified by function calls.

However, the problem is more general than this, in C++ you might observe a reloading of variables that are members of objects - for similar reasons. The general rule for avoiding this is to examine every load or store in the hot region of code to check whether it is necessary, or whether it has been introduced because of a function call.

Friday Apr 01, 2011

Profiling scripts

One feature that crept into the Oracle Solaris Studio 12.2 release was the ability for the performance analyzer to follow scripts. It is necessary to set the environment variable SP_COLLECTOR_SKIP_CHECKEXEC to use this feature - as shown below.

bash-3.00$ file `which which`
/bin/which:     executable /usr/bin/csh script
bash-3.00$ collect which
Target `which' is not a valid ELF executable
bash-3.00$ collect which
Creating experiment database ...

Thursday Jan 27, 2011

Don't initialise local strings

Consider the following code:

void s(int i)
  char string[2048]="";
  sprinf(string,"Value = %i",i);
  printf("String = %s\\n",string);

The C standards require that if any elements of the character array string are initialised, then all of them should be. We can demonstrate this by compiling with gcc:

$ gcc -O -S f.c
$ more f.s
        .file   "f.c"
        .type   s, #function
        .proc   020
        save    %sp, -2160, %sp
        stx     %g0, [%fp-2064]
        add     %fp, -2056, %o0
        mov     0, %o1
        call    memset, 0
        mov    2040, %o2

You can see that explicitly initialising string caused all elements of string to be initialised with a call to memset(). Removing the explicit initialisation of string (the ="") avoids the call to memset().

Tuesday Jan 11, 2011

RAW pipeline hazards

When a processor stores an item of data back to memory it actually goes through quite a complex set of operations. A sketch of the activities is as follows. The first thing that needs to be done is that the cache line containing the target address of the store needs to be fetched from memory. While this is happening, the data to be stored there is placed on a store queue. When the store is the oldest item in the queue, and the cache line has been successfully fetched from memory, the data can be placed into the cache line and removed from the queue.

This works very well if data is stored and either never reused, or reused after a relatively long delay. Unfortunately it is common for data to be needed almost immediately. There are plenty of reasons why this is the case. If parameters are passed through the stack, then they will be stored to the stack, and then immediately reloaded. If a register is spilled to the stack, then the data will be reloaded from the stack shortly afterwards.

It could take some considerable number of cycles if the loads had to wait for the stores to exit the queue before they could fetch the data. So many processors implement some kind of bypassing. If a load finds the data it needs in the store queue, then it can fetch it from there. There are often some caveats associated with this bypass. For example, the store and load often have to be of the same size to the same address. i.e. you cannot bypass a byte from a store of a word. If the bypass fails, then the situation is referred to as a "RAW" hazard, meaning "Read-After-Write". If the bypass fails, then the load has to wait until the store has completed before it can retrieve the new value - this can take many cycles.

As a general rule it is best to avoid potential RAWs. It is hardware, and runtime situation dependent whether there will be a RAW hazard or not, so avoiding the possibility is the best defense. Consider the following code which uses loads and stores of bytes to construct an integer.

#include <stdio.h>
#include <sys/time.h>

void tick()
  hrtime_t now = gethrtime();
  static hrtime_t then = 0;
  if (then>0) printf("Elapsed = %f\\n", 1.0\*(now-then)/100000000.0);
  then = now;

int func(char \* value)
  int temp;
  ((char\*)&temp)[0] = value[3];
  ((char\*)&temp)[1] = value[2];
  ((char\*)&temp)[2] = value[1];
  ((char\*)&temp)[3] = value[0];
  return temp;

int main()
  int value = 0x01020304;
  for (int i=0; i<100000000; i++) func((char\*)&value);

In the above code we're reversing the byte order by loading the bytes one-by-one, and storing them into an integer in the correct position, then loading the integer. Running this code on a test machine it reports 12ns per iteration.

However, it is possible to perform the same reordering using logical operations (shifts and ORs) as follows:

int func2(char\* value)
  return (value[0]<<24) | (value[1]<<16) | (value[2]<<8) | value[0];

This modified routine takes about 8ns per iteration. Which is significantly faster than the original code.

The actual speed up observed will depend on many factors, the most obvious being how often the code is encountered. The more observation is that the speed up depends on the platform. Some platforms will be more sensitive to the impact of RAWs than others. So the best advice is, whereever possible, to avoid passing data through the stack.

Sunday Nov 28, 2010

Multicore Application Programming: Source code available

I've just uploaded all the source code to the examples in Multicore Application Programming. About 160 files.

Saturday Nov 13, 2010

Multicore Application Programming arrived!

It was an exciting morning - my copy of Multicore Application Programming was delivered. After reading the text countless times, it's great to actually see it as a finished article. It starting to become generally available. Amazon lists it as being available on Wednesday, although the Kindle version seems to be available already. It's also available on Safari books on-line. Even turned up at Tesco!

Tuesday Nov 09, 2010

Multicore application programming: sample chapter

No sign of the actual books yet - I expect to see them any day now - but there's a sample chapter up on the informit site. There's also a pdf version which includes preface and table of contents.

This is chapter 3 "Identifying opportunities for parallelism". These range from the various OS-level approaches, through virtualisation, and into multithread/multiprocess. It's this flexibility that makes multicore processors so appealing. You have the choice of whether you take advantage of them through some consolidation of existing applications, or whether you take advantage of them, as a developer, through scaling a single application.

mtmalloc performance

A while back I discussed how the performance of mtmalloc could be improved. Well Rick Weisner was working on this, so I provided him with a fix for my hot issue. So I'm very pleased to see, from the bug status, that this code was integrated last month!

Wednesday Sep 08, 2010

Oracle Solaris Studio 12.2

It's been just over a year since the release of Studio 12 Update 1, today we releasing the first Oracle branded Studio release - Oracle Solaris Studio 12.2. For the previous release I wrote a post for the AMD site looking at the growth in multicore processors. It seemed appropriate to take another look at this.

The graph in the chart below shows the cumulative number of SPECint2006 results broken down by the number of cores for each processor. This data does not represent the number of different types of processor that are available, since the same processor can be used in many different results. It is closer to a snapshot of how the market for multicore processors is growing. Each data point represents a system, so the curve approximates the number of different systems that are being released.

It's perhaps more dramatic to demonstrate the change using a stacked area chart. The chart perhaps overplays the number of single core results, but this is probably fair as "single core" represents pretty much all the results prior to the launch of CPU2006. So what is readily apparent is the rapid decline in the number of single core results, the spread of dual, and then quad core. It's also interesting to note the beginning of a spread of more than quad core chips.

If we look at what is happening with multicore processors in the context of what we are releasing with Solaris Studio, there's a very nice fit of features. We continue to refine our support for OpenMP and automatic parallelisation. We've been providing data race (and deadlock) detection through the Thread Analyzer for a couple of releases. The debugger and the performance analyzer have been fine with threads for a long time. The performance analyzer has the time line view which is wonderful for examining multithreaded (or multiprocess) applications.

In addition to these fundamentals Studio 12.2 introduces a bunch of new features. I discussed some of these when the express release came out:

  • For those who use the IDE, integration of support for the analysis of the runtime behaviour of applications has been very useful. It both provides more information directly back to the developer, and raises awareness of the available tools.
  • Understanding call trees is often an important part of interpreting the performance of the application. Being able to drill down the call tree has been a very useful extension to the Performance Analyzer.
  • Memory error checking is critical for all applications. The trouble with memory access errors is that, like data races, the "problem" is visible arbitrarily far from the point where the error occurred.

The release of a new version of a product is always an exciting time. It's a culmination of a huge amount of analysis, development, and testing, and it's wonderful to finally see it available for others to use. So download it and let us know what you think!

Footnote: SPEC, SPECint, reg tm of Standard Performance Evaluation Corporation. Results from as of 6 September 2010 and this report.

Parallelisation white paper

An interesting white paper on various approaches to writing parallel programs has just been released. Covers OpenMP, Threading Building Blocks, MPI, and a bunch of others.

Sunday Aug 01, 2010

Multicore application programming: podcast

A few weeks back I had the pleasure of being interviewed by Allan Packer as part of the Oracle author podcast series. The podcast is a brief introduction to the book.

Wednesday Jul 28, 2010

Multicore application programming on Safari books

A roughcut of Multicore Application Programming has been uploaded to Safari books. If you have access you can read it, and provide feedback or comments. If you don't have access to Safari, you can still see the table of contents, read the preface, and view the start of each chapter.

Monday Jul 26, 2010

What does take_deferred_signal() mean in my profile?

Every so often you'll see take_deferred_signal() appear in the profile of an application. Sometimes as quite a time consuming function. So, what does it mean?

It actually comes from signal handling code in libc. If a signal comes in while the application is in a critical section, the signal gets deferred until the critical section is complete. When the application exits the critical section, all the deferred signals get taken.

Typically, this function becomes hot due to mutex locks in malloc and free, but other library calls can also cause it. The way to diagnose what is happening is to examine the call stack. So let's run through an example. Here is some multithreaded malloc/free heavy code.

#include <stdlib.h>
#include <pthread.h>

void \*work( void\* param )
  while ( 1 ) { free( malloc(100) ); }

int main()
  pthread_t thread;
  pthread_create( &thread, 0, work, 0 );
  for ( int i=0; i<10000000; i++ )
    free ( malloc (100) );

Profiling, we can see that take_deferred_signal() is the hottest function. The other hot functions would probably give us a clue as to the problem, but that is an artifact of the rather simple demonstration code.

Excl.     Incl.      Name
User CPU  User CPU
  sec.      sec.
36.456    36.456     <Total>
14.210    14.210     take_deferred_signal
 4.203    21.265     mutex_lock_impl
 3.082     3.082     clear_lockbyte
 2.872    17.062     mutex_trylock_adaptive

The next thing to look at is the call stack for take_deferred_signal() as this will tell us who is calling the function.

Attr.      Name
User CPU
14.210     do_exit_critical
14.210    \*take_deferred_signal

do_exit_critical() doesn't tell us anything, we already know that it is called when the code exits a critical section. Continuing up the call stack we find:

Attr.      Name
User CPU
14.190     mutex_trylock_adaptive
 0.020     mutex_unlock
 0.       \*do_exit_critical
14.210     take_deferred_signal

Which is more useful, we now know that the time is spent in mutex locks, but we don't know the user of those mutex locks. In this case the bulk of the time comes from mutex_trylock_adaptive(), so that is the routine to investigate:

Attr.      Name
User CPU
17.062     mutex_lock_impl
 2.872    \*mutex_trylock_adaptive
14.190     do_exit_critical

So we're still in the mutex lock code, we need to find who is calling the mutex locks:

Attr.      Name
User CPU
11.938     free
 9.327     malloc
 4.203    \*mutex_lock_impl
17.062     mutex_trylock_adaptive

So we finally discover that the time is due to calls to mutex locks in malloc() and free().

Friday Jul 09, 2010

White paper on using Oracle Solaris Studio

I contributed a fair amount of material to a recent white paper about Oracle Solaris Studio. The paper is available for download from the developer portal.

Optimizing Applications with Oracle Solaris Studio Compilers and Tools

Oracle Solaris Studio delivers a fully integrated development platform for generating robust high-performance applications for the latest Oracle Sun systems (SPARC and x86). In order to take full advantage of the latest multicore systems, applications must be compiled for optimal performance and tuned to exploit the capabilities of the hardware. Learn how Oracle Solaris Studio helps you generate the highest performance applications for your target platform, from selecting the right compiler flags and optimization techniques to simplifying development with advanced multicore tools.

Tuesday Jun 08, 2010

Calltrees in analyzer

The Performance Analyzer has also had a number of new features and improvements. The most obvious one of these is the new call tree tab. This allows you to drill down into the call tree for an application and see exactly how the time is divided between the various call stacks.

Friday Apr 16, 2010

Kernel and user profiling with dtrace

Just put together a short dtrace script for profiling both userland and kernel activity.

#!/usr/sbin/dtrace -s
#pragma D option quiet



  printf("%5s %20s %20s %10s\\n","PID","EXECNAME","FUNC","COUNT");
  printa("%5d %20s %20A %10@d\\n",@);
  printa("%5d %20s %#20a %10@d\\n",@k);

The script samples the current pc for both user land and kernel about 100x per second. There's some risk of over counting since there's one probe for user and one probe for kernel. Every second the code prints out the top 25 user and kernel routines, broken down by pid and executable name. The output looks like:

  PID             EXECNAME                 FUNC                  COUNT
  556                 Xorg`pixman_image_unref   1
  556                 Xorg`pixman_fill          1
  556                 Xorg`memcpy                      1
    0                sched unix`dispatch_softint                 1
    0                sched unix`dispatch_hardint                 2
    0                sched unix`mach_cpu_idle                    91

Tuesday Feb 23, 2010

Presenting at the SVOSUG on Thursday

I'm presenting at the Silicon Valley OpenSolaris Users Group on Thursday evening. I was only asked today, so I'm putting together some slides this evening on "Multicore Application Programming". The talk is going to be a relatively high level presentation on writing parallel applications, and how the advent of multicore or CMT processors changes the dynamics.

Monday Feb 15, 2010

x86 performance tuning documents

Interesting set of x86 performance tuning documents.

Wednesday Nov 25, 2009

Viewing thread activity in the Performance Analyzer

The Sun Studio Performance Analyzer is one of the two tools that I use most frequently (the other is spot - which is now in SS12U1!). It's a very powerful tool, but a lot of that power is not immediately visible to users. I'm going to discuss a couple of ways I've used the analyzer to view parallel applications.

The most common first step for looking at the performance of parallel apps is to use the timeline. However, the timeline can look a bit cluttered with all of the call stack data. Often you are really just interested in the leaf node. Fortunately this can be configured from the data presentation dialog box. To get the view I want I'm only showing the top leaf in the call stack:

This results in a display of the samples in each routine, by default this can look very colourful. You can make it easier on the eye by selecting the colours used to display the graphic. In the following graphic I've picked green for one parallel routine that I'm interested in, and blue for another, then used a yellow to colour all the time waiting for more work to be assigned:

The graphic shows that the work is not evenly spread across all threads. The first few threads spend more time in the hot routines than the later threads. We can see this much more clearly using the 'threads' view of the data. To get this view you need to go back to the data presentation dialog and select the threads tab, it's also useful to select the 'cpus' tab at the same time.

The threads tab shows the activity of each thread for the currently displayed metrics. This is useful to see if one thread is doing more work than another. The cpus tab shows time that the app spends on each CPU in the machine - this can indicate whether a particular CPU is over subscribed. The thread activity looks like:

This confirms what we thought earlier that some of the threads are much more active than other threads. The top chart shows the user time, which indicates that all the threads spent the same amount of time running 'stuff', the middle chart shows the time that each thread spent running useful work, the lower chart shows the time spent in overhead. The exercise now is to try and improve the distribution of work across the threads......

Monday Oct 19, 2009

Fishing with cputrack

I'm a great fan of the hardware performance counters that you find on most processors. Often you can look at the profile and instantly identify what the issue is. Sometimes though, it is not obvious, and that's where the performance counters can really help out.

I was looking at one such issue last week, the performance of the application was showing some variation, and it wasn't immediately obvious what the issue was. The usual suspects in these cases are:

  • Excessive system time
  • Process migration
  • Memory placement
  • Page size
  • etc.

Unfortunately, none of these seemed to explain the issue. So I hacked together the following script cputrackall which ran the test code under cputrack for all the possible performance counters. Dumped the output into a spreadsheet, and compared the fast and slow runs of the app. This is something of a "fishing trip" script, just gathering as much data as possible in the hope that something leaps out, but sometimes that's exactly what's needed. I regularly get to sit in front of a new chip before the tools like ripc have been ported, and in those situations the easiest thing to do is to look for hardware counter events that might explain the runtime performance. In this particular instance, it helped me to confirm my suspicion that there was a difference in branch misprediction rates that was causing the issue.

Tuesday Oct 13, 2009

Surprisingly slow compile time

I had an e-mail which told the sorry tale of a new system which tool longer to build a project than an older system, of theoretically similar performance. The system showed low utilisation when doing the build indicating that it was probably spending a lot of time waiting for something.

The first thing to look at was a profile of the build process using `collect -F on`, which produced the interesting result that the build was taking just over 2 minutes of user time, a few seconds of system time, and thousands of seconds of "Other Wait" time.

"Other wait" often means waiting for network, or disk, or just sleeping. The other thing to realise about profiling multiple processes is that all the times are cumulative, so all the processes that are waiting accumulate "other wait" time. Hence it will be a rather large number if multiple processes are doing it. So this confirmed and half explained the performance issue. The build was slow because it was waiting for something.

Sorting the profile by "other wait" indicated two places that the wait was coming from, one was waitpid - meaning that the time was due to a process waiting for another process, well we knew that! The other was a door call. Tracing up the call stack eventually lead into the C and C++ compiler, which were calling gethostbyname. The routine doing the calling was "generate_prefix" which is the routine responsible for generating a random prefix for function names - the IP address of the machine was used as one of the inputs for the generation of a prefix.

The performance problem was due to gethostbyname timing out, common reasons for this are missed configurations in the /etc/hosts and /etc/nsswitch.conf files. In this example adding the host name to the hosts file cured the problem.

Monday Oct 12, 2009

An aliasing example

The compiler flag -xalias_level allows a user to assert the degree of aliasing that exists within the source code of an application. If the assertion is not true, then the behaviour of the application is undefined. It is definitely worth looking at the examples given in the user's guide, although they can be a bit "dry" to read. So here's an example which illustrates what can happen:

struct stuff{
 int value1;
 int value2;

void fill(struct stuff \*x)
  x->value1=0;      // Clear value1 
  int \* r=(int\*)x;  // Take the address of the structure
  int var = \*r;     // Take the value from value1
  x->value1=var;    // And store it back into value1

The above code will clear value1 and then load and store this value back. So for correctly working code value1 should exit the function containing zero. However, if -xalias_level=basic is used to build the application, then this tells the compiler that no two pointers to variables of different types will alias. So pointer to an int will never alias with an int. So the read from \*r does not alias with x.value1.

So with this knowledge the compiler is free to remove the original store to x.value1, because it has been told that nothing will alias with it, and there is a later store to the same address. The later store will overwrite the initial store.

Fortunately it the lint utility can pick up these issues:

$ lint -Xalias_level=basic alias.c
(9) warning: cast of nonscalar pointer to scalar pointer is valid only at -xalias_level=any

For the example above the compiler does the correct thing and eliminates all the instructions but the store to value1. For more complex examples there is no guarantee that the code will be correct if it violates the -xalias_level setting.

Friday Oct 09, 2009

Webcast: Improving the performance of parallel codes using the Performance Analyzer

Earlier in the summer I recorded a slidecast on using the Performance Analyzer on parallel codes, it's just come out on the HPC portal.

Monday Sep 28, 2009

Updated compiler flags article

Just updated the Selecting The Best Compiler Options article for the developer portal. Minor changes, mainly a bit more clarification on floating point optimisations.

Monday Sep 21, 2009

Profiling scripts

If you try to use the Sun Studio Performance Analyzer on something that is not an executable, you'll end up with an error message:

$ collect kstat
Target `kstat' is not a valid ELF executable

The most reliable workaround for this that I've discovered is as follows. First of all make up shell script that executes the command passed into it:

$ more

Then run the collect command as:

$ collect -F on /bin/sh <script> <params>

The -F on is required so that collect follows forked processes, otherwise collect will just profile the top /bin/sh which will do minimal work before forking off the actual command.

When loading the resulting experiment into the Analyzer you have to load all the descendant processes. You can do this by going to the filter dialog box and selecting all the processes, or you can take the easier route of placing en_desc on into your .er.rc file in your home directory (this will tell the analyzer to always load the descendant processes, which might make loading experiments slower, but will guarantee that you actually load all the data, and not just the top-level code).

One other thing to note is that each new process can contribute wall and wait time, so the wall time shown in the analyzer can be a multiple of the actual wall time. To see this in action do:

$ collect -F on /bin/sh kstat

The wall time on this will be a multiple of the actual runtime because each shell script contributes wall time while it waits for the kstat command to complete.

Wednesday Sep 02, 2009

Profiling a rate

Sometimes it's the rate of doing something which is the target that needs to be improved through optimisation. ie increase the widgets per second of some application. I've just been looking at a code that estimated performance by counting the number of computations completed in a known constant length of time. The code was showing a performance regression, and I wanted to find out what changed. The analysis is kind of counter intuitive, so I thought I'd share an example with you.

Here's an example code that does a computation for a fixed length of time, in this case about 30 seconds:

#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>

double f1(int i)
  double t=0;
  while (i-->0) {t+=t;}
  return t;

double f2(int i)
  double t=0;
  while (i-->0) {t+=t;}
  return t;

void main(int argc,char\*\*args)
  struct timeval now;
  long startsecs;
  long count=0;
  int vcount;
  if (argc!=2){ printf("Needs a number to be passed in\\n"); exit(0);}


  } while (now.tv_sec<startsecs+30);

  printf("Iterations %i duration %i rate %f\\n",count, now.tv_sec-startsecs, 1.0\*count/(now.tv_sec-startsecs));


The code takes a command line parameter to indicate the number of iterations to do in function f2, function f1 always does 100 iterations.

If I compile and run this code under the performance analyzer with 50 and 70 as the commandline parameters I get the following profile:

Description50 Iterations70 Iterations
Total time26.6s25.87s
Total iterations942,684841,921

We can make the following observation when we go from 70 down to 50 for parameter passed to f2, we see a 12% gain in the total rate. This is to be expected as we are reducing the total number of iterations of the pair of loops in f1 and f2 will reduce from 170 down to 150, which is the same ~12% gain.

Where it gets counter intuitive is that for the run which achieves the higher rate, the time spent in the routines f1 and gettimeofday increases - by the same 12%. This is counter intuitive because increased time in a routine normally indicates that the routine is the one to be investigated, but for a 'rate' situation the opposite is true. These routines are being well behaved. The way to think about it is that each unit of work needs a smidgeon of time in both of these routines, if the number of units of work increases, then the absolute amount of time in these two routines needs to increase linearly with the increase in rate.

However, the time in routine f2 decreases as the rate increases. This is the routine which has been "improved" to get the better rate. The other thing to note is that the time went from ~6s to ~4.5s, but the rate went from 841k to 941k, so the time per unit work dropped further than that - this makes comparing the profiles of the two runs more tricky.

Note that Amdahl's law would still tell us that the routines that need to be optimised are the ones where the time is spent - so in one sense nothing has changed. But my particular scenario today is figuring out what has changed in the executable when compiled in two different ways that leads to the performance gain. In this context, I now know the routine, and I can dig into the assembly code to figure out the why.

Friday Aug 28, 2009

Maps in the STL

I was looking at some code with a colleague and we observed a bunch of time in some code which used the std::map to set up mappings between strings. The source code looked rather like the following:

#include <map>
#include <string>

using namespace std;

int func(map<string,string>&mymap, string &s1, string &s2)
  return 0;

When compiled with CC -O -c -library=stlport4 this expands to a horrendous set of calls, here's the first few:

$ er_src -dis func map.o|grep call 
        [?]     188d:  call    std::basic_string...::basic_string
        [?]     189f:  call    std::basic_string...::basic_string
        [?]     18b2:  call    std::basic_string...::basic_string
        [?]     18c2:  call    std::basic_string...::basic_string
        [?]     18d8:  call    std::_Rb_tree...::insert_unique
        [?]     18f8:  call    std::__node_alloc...::_M_deallocate
        [?]     190c:  call std::_STLP_alloc_proxy...::~_STLP_alloc_proxy

What's happening is that the act of making a pair object is causing copies to be made of the two strings that are passed into the pair constructor. Then the pair object is passed into the insert method of std::map and this results in two more copies of the strings being made. There's a bunch of other stuff going on, and the resulting code is a mess.

There's an alternative way of assigning the mapping:

#include <map>
#include <string>

using namespace std;

int func(map<string,string>&mymap, string &s1, string &s2)
  return 0;

When compiled the resulting code looks a lot neater:

$ er_src -dis func map.o|grep call
        [?]     28e6:  call    std::map...::operator[]
        [?]     2903:  call    std::basic_string...::_M_assign_dispatch

Of course a neater chunk of code is nice, but the question is whether the code for ::operator[] contains the same ugly mess. Rather than disassembling to find out, it's simpler to time the two versions and see which does better. A simple test harness looks like:

int main()
  string s1,s2;
  long long i;
  for (i=0; i<100000000; i++)

It's a less than ideal harness since it uses constant strings, and one version of the code might end up bailing early because of this. The performance of the two codes is quite surprising:

real           6.79
user           6.77
sys            0.00

real        1:03.53
user        1:03.26
sys            0.01 

So the version that creates the pair object is about 10x slower!


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
Free Download


« July 2016
The Developer's Edge
Solaris Application Programming
OpenSPARC Book
Multicore Application Programming