Tuesday Feb 23, 2010

Presenting at the SVOSUG on Thursday

I'm presenting at the Silicon Valley OpenSolaris Users Group on Thursday evening. I was only asked today, so I'm putting together some slides this evening on "Multicore Application Programming". The talk is going to be a relatively high level presentation on writing parallel applications, and how the advent of multicore or CMT processors changes the dynamics.

Wednesday Nov 25, 2009

Viewing thread activity in the Performance Analyzer

The Sun Studio Performance Analyzer is one of the two tools that I use most frequently (the other is spot - which is now in SS12U1!). It's a very powerful tool, but a lot of that power is not immediately visible to users. I'm going to discuss a couple of ways I've used the analyzer to view parallel applications.

The most common first step for looking at the performance of parallel apps is to use the timeline. However, the timeline can look a bit cluttered with all of the call stack data. Often you are really just interested in the leaf node. Fortunately this can be configured from the data presentation dialog box. To get the view I want I'm only showing the top leaf in the call stack:

This results in a display of the samples in each routine, by default this can look very colourful. You can make it easier on the eye by selecting the colours used to display the graphic. In the following graphic I've picked green for one parallel routine that I'm interested in, and blue for another, then used a yellow to colour all the time waiting for more work to be assigned:

The graphic shows that the work is not evenly spread across all threads. The first few threads spend more time in the hot routines than the later threads. We can see this much more clearly using the 'threads' view of the data. To get this view you need to go back to the data presentation dialog and select the threads tab, it's also useful to select the 'cpus' tab at the same time.

The threads tab shows the activity of each thread for the currently displayed metrics. This is useful to see if one thread is doing more work than another. The cpus tab shows time that the app spends on each CPU in the machine - this can indicate whether a particular CPU is over subscribed. The thread activity looks like:

This confirms what we thought earlier that some of the threads are much more active than other threads. The top chart shows the user time, which indicates that all the threads spent the same amount of time running 'stuff', the middle chart shows the time that each thread spent running useful work, the lower chart shows the time spent in overhead. The exercise now is to try and improve the distribution of work across the threads......

Monday Nov 23, 2009

When threads go bad

When a thread hits an error in a multithreaded application, that error will take out the entire app. Here's some example code:

#include <pthread.h>
#include <stdio.h>

void \*work(void \* param)
  printf("Child thread exit\\n");

void main()
  pthread_t thread;
  printf("Main thread exit\\n");

Compiling and running this produces:

% cc -O -mt pthread_error.c
% ./a.out
Segmentation Fault (core dumped)

Not entirely unexpected, that. The app died without the main thread having the chance to clear up resources etc. This is probably not ideal. However, it is possible to write a signal handler to capture the segmentation fault, and terminate the child thread without causing the main thread to terminate. It's important to realise that there's probably little chance of actually recovering from the unspecified error, but this at least might give the app the chance to report the symptoms of its demise.

#include <pthread.h>
#include <stdio.h>
#include <signal.h>

void \*work(void \* param)
  printf("Child thread exit\\n");

void hsignal(int i)
  printf("Signal %i\\n",i);

void main()
  pthread_t thread;
  printf("Main thread exit\\n");

Which produces the output:

% cc -O -mt pthread_error.c
% ./a.out
Signal 11
Main thread exit

Tuesday Oct 13, 2009

Surprisingly slow compile time

I had an e-mail which told the sorry tale of a new system which tool longer to build a project than an older system, of theoretically similar performance. The system showed low utilisation when doing the build indicating that it was probably spending a lot of time waiting for something.

The first thing to look at was a profile of the build process using `collect -F on`, which produced the interesting result that the build was taking just over 2 minutes of user time, a few seconds of system time, and thousands of seconds of "Other Wait" time.

"Other wait" often means waiting for network, or disk, or just sleeping. The other thing to realise about profiling multiple processes is that all the times are cumulative, so all the processes that are waiting accumulate "other wait" time. Hence it will be a rather large number if multiple processes are doing it. So this confirmed and half explained the performance issue. The build was slow because it was waiting for something.

Sorting the profile by "other wait" indicated two places that the wait was coming from, one was waitpid - meaning that the time was due to a process waiting for another process, well we knew that! The other was a door call. Tracing up the call stack eventually lead into the C and C++ compiler, which were calling gethostbyname. The routine doing the calling was "generate_prefix" which is the routine responsible for generating a random prefix for function names - the IP address of the machine was used as one of the inputs for the generation of a prefix.

The performance problem was due to gethostbyname timing out, common reasons for this are missed configurations in the /etc/hosts and /etc/nsswitch.conf files. In this example adding the host name to the hosts file cured the problem.

Monday Oct 12, 2009

An aliasing example

The compiler flag -xalias_level allows a user to assert the degree of aliasing that exists within the source code of an application. If the assertion is not true, then the behaviour of the application is undefined. It is definitely worth looking at the examples given in the user's guide, although they can be a bit "dry" to read. So here's an example which illustrates what can happen:

struct stuff{
 int value1;
 int value2;

void fill(struct stuff \*x)
  x->value1=0;      // Clear value1 
  int \* r=(int\*)x;  // Take the address of the structure
  int var = \*r;     // Take the value from value1
  x->value1=var;    // And store it back into value1

The above code will clear value1 and then load and store this value back. So for correctly working code value1 should exit the function containing zero. However, if -xalias_level=basic is used to build the application, then this tells the compiler that no two pointers to variables of different types will alias. So pointer to an int will never alias with an int. So the read from \*r does not alias with x.value1.

So with this knowledge the compiler is free to remove the original store to x.value1, because it has been told that nothing will alias with it, and there is a later store to the same address. The later store will overwrite the initial store.

Fortunately it the lint utility can pick up these issues:

$ lint -Xalias_level=basic alias.c
(9) warning: cast of nonscalar pointer to scalar pointer is valid only at -xalias_level=any

For the example above the compiler does the correct thing and eliminates all the instructions but the store to value1. For more complex examples there is no guarantee that the code will be correct if it violates the -xalias_level setting.

Friday Oct 09, 2009

Webcast: Improving the performance of parallel codes using the Performance Analyzer

Earlier in the summer I recorded a slidecast on using the Performance Analyzer on parallel codes, it's just come out on the HPC portal.

Monday Sep 28, 2009

Updated compiler flags article

Just updated the Selecting The Best Compiler Options article for the developer portal. Minor changes, mainly a bit more clarification on floating point optimisations.

Monday Sep 21, 2009

Profiling scripts

If you try to use the Sun Studio Performance Analyzer on something that is not an executable, you'll end up with an error message:

$ collect kstat
Target `kstat' is not a valid ELF executable

The most reliable workaround for this that I've discovered is as follows. First of all make up shell script that executes the command passed into it:

$ more shell.sh

Then run the collect command as:

$ collect -F on /bin/sh shell.sh <script> <params>

The -F on is required so that collect follows forked processes, otherwise collect will just profile the top /bin/sh which will do minimal work before forking off the actual command.

When loading the resulting experiment into the Analyzer you have to load all the descendant processes. You can do this by going to the filter dialog box and selecting all the processes, or you can take the easier route of placing en_desc on into your .er.rc file in your home directory (this will tell the analyzer to always load the descendant processes, which might make loading experiments slower, but will guarantee that you actually load all the data, and not just the top-level code).

One other thing to note is that each new process can contribute wall and wait time, so the wall time shown in the analyzer can be a multiple of the actual wall time. To see this in action do:

$ collect -F on /bin/sh shell.sh shell.sh shell.sh shell.sh kstat

The wall time on this will be a multiple of the actual runtime because each shell script contributes wall time while it waits for the kstat command to complete.

Tuesday Sep 08, 2009

Performance tuning webcast

I wrote one of the TechDays 2008-2009 sessions on application performance tuning. Unfortunately I never actually got to give it to alive audience, but I did get this version recorded. Thanks to the HPC Watercooler for pointing it out to me.

Wednesday Sep 02, 2009

Profiling a rate

Sometimes it's the rate of doing something which is the target that needs to be improved through optimisation. ie increase the widgets per second of some application. I've just been looking at a code that estimated performance by counting the number of computations completed in a known constant length of time. The code was showing a performance regression, and I wanted to find out what changed. The analysis is kind of counter intuitive, so I thought I'd share an example with you.

Here's an example code that does a computation for a fixed length of time, in this case about 30 seconds:

#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>

double f1(int i)
  double t=0;
  while (i-->0) {t+=t;}
  return t;

double f2(int i)
  double t=0;
  while (i-->0) {t+=t;}
  return t;

void main(int argc,char\*\*args)
  struct timeval now;
  long startsecs;
  long count=0;
  int vcount;
  if (argc!=2){ printf("Needs a number to be passed in\\n"); exit(0);}


  } while (now.tv_sec<startsecs+30);

  printf("Iterations %i duration %i rate %f\\n",count, now.tv_sec-startsecs, 1.0\*count/(now.tv_sec-startsecs));


The code takes a command line parameter to indicate the number of iterations to do in function f2, function f1 always does 100 iterations.

If I compile and run this code under the performance analyzer with 50 and 70 as the commandline parameters I get the following profile:

Description50 Iterations70 Iterations
Total time26.6s25.87s
Total iterations942,684841,921

We can make the following observation when we go from 70 down to 50 for parameter passed to f2, we see a 12% gain in the total rate. This is to be expected as we are reducing the total number of iterations of the pair of loops in f1 and f2 will reduce from 170 down to 150, which is the same ~12% gain.

Where it gets counter intuitive is that for the run which achieves the higher rate, the time spent in the routines f1 and gettimeofday increases - by the same 12%. This is counter intuitive because increased time in a routine normally indicates that the routine is the one to be investigated, but for a 'rate' situation the opposite is true. These routines are being well behaved. The way to think about it is that each unit of work needs a smidgeon of time in both of these routines, if the number of units of work increases, then the absolute amount of time in these two routines needs to increase linearly with the increase in rate.

However, the time in routine f2 decreases as the rate increases. This is the routine which has been "improved" to get the better rate. The other thing to note is that the time went from ~6s to ~4.5s, but the rate went from 841k to 941k, so the time per unit work dropped further than that - this makes comparing the profiles of the two runs more tricky.

Note that Amdahl's law would still tell us that the routines that need to be optimised are the ones where the time is spent - so in one sense nothing has changed. But my particular scenario today is figuring out what has changed in the executable when compiled in two different ways that leads to the performance gain. In this context, I now know the routine, and I can dig into the assembly code to figure out the why.

Friday Aug 28, 2009

Maps in the STL

I was looking at some code with a colleague and we observed a bunch of time in some code which used the std::map to set up mappings between strings. The source code looked rather like the following:

#include <map>
#include <string>

using namespace std;

int func(map<string,string>&mymap, string &s1, string &s2)
  return 0;

When compiled with CC -O -c -library=stlport4 map.cc this expands to a horrendous set of calls, here's the first few:

$ er_src -dis func map.o|grep call 
        [?]     188d:  call    std::basic_string...::basic_string
        [?]     189f:  call    std::basic_string...::basic_string
        [?]     18b2:  call    std::basic_string...::basic_string
        [?]     18c2:  call    std::basic_string...::basic_string
        [?]     18d8:  call    std::_Rb_tree...::insert_unique
        [?]     18f8:  call    std::__node_alloc...::_M_deallocate
        [?]     190c:  call std::_STLP_alloc_proxy...::~_STLP_alloc_proxy

What's happening is that the act of making a pair object is causing copies to be made of the two strings that are passed into the pair constructor. Then the pair object is passed into the insert method of std::map and this results in two more copies of the strings being made. There's a bunch of other stuff going on, and the resulting code is a mess.

There's an alternative way of assigning the mapping:

#include <map>
#include <string>

using namespace std;

int func(map<string,string>&mymap, string &s1, string &s2)
  return 0;

When compiled the resulting code looks a lot neater:

$ er_src -dis func map.o|grep call
        [?]     28e6:  call    std::map...::operator[]
        [?]     2903:  call    std::basic_string...::_M_assign_dispatch

Of course a neater chunk of code is nice, but the question is whether the code for ::operator[] contains the same ugly mess. Rather than disassembling to find out, it's simpler to time the two versions and see which does better. A simple test harness looks like:

int main()
  string s1,s2;
  long long i;
  for (i=0; i<100000000; i++)

It's a less than ideal harness since it uses constant strings, and one version of the code might end up bailing early because of this. The performance of the two codes is quite surprising:

real           6.79
user           6.77
sys            0.00

real        1:03.53
user        1:03.26
sys            0.01 

So the version that creates the pair object is about 10x slower!

Friday Jul 17, 2009

Sun Studio 12 Update 1 blog entry live on AMD site

Just had a blog entry about Sun Studio 12 Update 1 posted to the AMD forums site.

Wednesday Jul 01, 2009

Introduction to parallel programming

My colleague, Ruud van der Pas, has recorded a series of seven webcasts on parallel programming which will be released on the HPC portal. Ruud is an expert on parallel programming, and one of the authors of the book "Using OpenMP".

Friday Jun 26, 2009

mtmalloc vs umem

A little while back I was looking at the performance of the STL with multithreaded code. I got the opportunity to try this on a particular code, and rather shockingly to me performance was absolutely terrible! I'd linked the code with mtmalloc, and the hottest function in the profile was malloc_internal. I've put together a fake code, and here's the profile from that:

Excl.     Incl.      Name
User CPU  User CPU
   sec.      sec.
266.446   266.446    
258.301   263.084    malloc_internal
  1.661     1.951    free
  1.401     1.401    mutex_lock_impl
  0.961     0.961    mutex_unlock

We can dig into the disassembly of malloc_internal to find out what's going on:

    73.201    73.201            [?]     1724:  cmp         %o5, 0
     1.981     1.981            [?]     1728:  bne         0x1740
     0.320     0.320            [?]     172c:  nop
     0.490     0.490            [?]     1730:  ld          [%i2 + 44], %i2
     1.191     1.191            [?]     1734:  cmp         %i2, 0
     0.901     0.901            [?]     1738:  bne,a       0x1724
## 176.443   176.443            [?]     173c:  ld          [%i2 + 32], %o5

It's not hard to visualise what the original C code would look like:

  while ((ptr->value==0) && (ptr->next!=0)) { ptr=ptr->next; }

Fortunately the source code is searchable and the above loop looks sufficiently similar to line 1032 of mtmalloc.c:

   1032 	while (thiscache != NULL && thiscache->mt_nfree == 0)
   1033 		thiscache = thiscache->mt_next;

So what's going on?

Reading through the source of malloc_internal, it appears that mtmalloc builds up a linked list of chunks of memory for each size of memory request. The size of the chunks of memory is 8KB\*requestsize, and requestsize is 9. So basically each chunk of memory is 72KB in size. So when a memory request comes in, malloc_internal looks at the current chunk, and if memory can be allocated from there, then it returns memory from that chunk. If not it goes to the next chunk and so on. This works very well when memory is allocated at once, but as memory gets freed, these chunks of memory become like Swiss-cheese, with lots of holes in them. If a lot of memory of a particular size is requested, then freed, there can be a large number of these chunks in the linked list, and scanning through the chunks to find one with free space can be time consuming. And that is the condition that my test code exercises.

It's probably worth revealing the test code, at this point, so that you can see what it does:

#include <stdlib.h>
typedef struct s
  struct s \* next;
  char padding[508];
} S;

void main()
  struct s \* head;
  struct s \* keep;
  struct s \* current;
  for (int j=0; j<100; j++)
    for (int i=0; i<100000; i++)
      current=(struct s\*)malloc(sizeof(struct s));
      if (i&1)
    current = head;
    while (current!=0)
      struct s \* tmp = current;
      current = current -> next;
    head = 0;

The code maintains two lists, one that it places memory onto for a long duration, and another list that holds memory for only a short duration. The memory footprint of the code keeps increasing, so more chunks are added to the lists, and holding on to the memory for a long period of time ensures that the chunks end up with lots of gaps in them. The runtime of this code is as follows:

% cc -O mtm.c -lmtmalloc
% timex a.out
real        4:44.18
user        4:33.80
sys            8.70

However there is an API to libmtmalloc that allows us to adjust the size of the chunks. The following changes increase the requestsize from 9 to 20:


The performance reduces from nearly 5 minutes to about 1 minute:

% cc -O mtm.c -lmtmalloc
% timex a.out
real        1:09.10
user        1:01.09
sys            6.53

If we increase the requestsize to 30, performance improves still further:

% cc -O mtm.c -lmtmalloc
% timex a.out
real          38.36
user          31.41
sys            4.96

Of course, libmtmalloc is not the only memory allocator that is optimised for multi-threaded allocation. We also have libumem, compiling the original code to use this results in the following performance:

% cc -O mtm.c -lumem
% timex a.out
real          31.06
user          18.10
sys           10.95

So this is probably a good indication that you will get better performance from libumem if your application allocates and deallocates lots of memory. If you are using libmtmalloc in this role, then you may need to tune the requestsize to a greater number than the default - although this will increase the memory footprint of your application.

Tuesday Jun 23, 2009

Sun Studio 12 Update 1

Sun Studio 12 Update 1 went live yesterday. It's still a free download, and it's got a raft of new features. Many people will have been using the express releases, so they will already be familiar with the improvements.

It's been about two years since Sun Studio 12 came out, and the most obvious change in that time is the prevalence of multicore processors. I figured the easiest way to discern this would be to look at the submissions of SPEC CPU2006 results in that time period. The following chart shows the cummulative number of SPEC CPU2006 Integer speed results over that time broken down by the number of threads that the chip was capable of supporting.

Ok, the first surprising thing about the chart is that there's very few single threaded chips. There were a few results when the suite was launched back in 2006, but nothing much since. What is more apparent is the number of dual-thread chips, that was where the majority of the market was. There were also a number of quad-thread chips at that point. If we fast-forward to the situation today, we can see that the number of dual-thread chips has pretty much leveled off, the bulk of the chips are capable of supporting four threads. But you can see the start of a ramp of chips that are capable of supporting 6 or 8 simultaneous threads.

The relevance of this chart to Sun Studio is that Sun Studio has always been a tool that supports the development of multi-threaded applications. Every release of the product improves on the support in the previous release. Sun Studio 12 Update 1 includes improvements in the compiler's ability to automatically parallelise codes - afterall the easiest way to develop parallel applications is if the compiler can do it for you; improvements to the support of parallelisation specifications like OpenMP, this release includes support for the latest OpenMP 3.0 specification; and improvements in the tools and their ability to provide the developer meaningful feedback about parallel code, for example the ability of the Performance Analyzer to profile MPI code.

Footnote SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Benchmark results stated above reflect results posted on www.spec.org as of 15 June 2009.

Sunday Jun 14, 2009

Audio for JavaOne interview available

A couple of weeks back I recorded an interview where I discussed The Developer's Edge. I've just found the audio up at BlogTalkRadio, it's about 15 minutes in duration.

Friday Jun 12, 2009

Stlport4 and multithreaded code

I finally resolved a problem that's been annoying me for about 3 years. Codes that use the Standard Template Library don't scale to multiple threads.

First off, it's probably good to take a look at a code that illustrates the problem:

#include <vector>

int main()
  #pragma omp parallel for default (__auto)
  for (int i=0; i<10000000; i++)
    std::vector<int> v;

The first comparison is between the serial performance of the Solaris default STL and stlport4 which is provided with the compiler.

$ CC -O t1.cc
$ timex a.out
real          15.85
user          15.64
sys            0.01
$ CC -O -library=stlport4 t1.cc
$ timex a.out
real           7.87
user           7.78
sys            0.01

This doesn't tell me anything that I didn't already know. stlport4 is (as far as I know) always faster than the STL provided by Solaris. Hence if you use C++, then you should use stlport4 in preference to the Solaris default. The constraint is that each application (libraries and all) can only use one version of the STL. So if a library that is outside your control uses the Solaris default, then the entire app must use it.

The next thing to investigate is scaling when there are multiple threads:

$ CC -O -xopenmp -library=stlport4 t1.cc
$ timex a.out
real           7.00
user           6.96
sys            0.01
$ export OMP_NUM_THREADS=2
$ timex a.out
real           7.18
user          14.28
sys            0.01

So compiling the code to use OpenMP caused no performance overhead, but running with two threads had the same runtime as a run with a single thread. We can profile the code to see what's happening:

Excl.     Incl.      Name  
User CPU  User CPU         
 sec.      sec.       
8.076     8.076      
1.571     2.272      mutex_lock_impl
1.501     1.971      mutex_unlock
1.051     4.573      std::vector >::_M_insert_overflow(int\*,const int&,const std::__true_type&,unsigned,bool)
0.871     8.076      _$d1A5.main
0.871     3.272      std::__node_alloc<true,0>::_M_allocate(unsigned)
0.560     1.721      std::__node_alloc<true,0>::_M_deallocate(void\*,unsigned)
0.480     0.480      sigon
0.440     0.440      mutex_trylock_adaptive
0.250     0.470      mutex_unlock_queue

So the lost time is due to mutex locks, if you dig through the source you'll find that node_alloc has a single mutex lock that only allows a single thread to allocate or deallocate memory. Which is why the code shows no scaling.

This test code is basically creating and destroying vector objects, so it hits the allocate and deallocate routines very hard. Which is why I picked it. Real codes are much less likely to have this problem at quite the same level. It is not unusual to want to create and destroy objects within a loop. One workaround is to hoist the objects out of the hot loops. This works for some instances, but is not a great solution, as even in the best case it makes the code more complex.

The solution I ended up using was to build the Apache STL. It turned out to be a relatively straightforward experience. The compile line is a bit cryptic, I wanted the optimised, multithreaded, 64-bit version and this translates to:

$ gmake BUILDTYPE=12D CONFIG=sunpro.config 

Once I had it built, I could install it with:

$ gmake BUILDTYPE=12D CONFIG=sunpro.config install PREFIX=`pwd`/install

The steps necessary to use a different STL than the ones supplied with the compiler are documented here. The compile line for the test code was:

CC -m64  -O -xopenmp -library=no%Cstd \\
   -I ./stdcxx-4.2.1/install/include/ \\
   -L ./stdcxx-4.2.1/install/lib/     \\
   -R ./stdcxx-4.2.1/install/lib/ -lstd12D t1.cc 

So we can build the test and look at the scaling between one and two threads:

$ export OMP_NUM_THREADS=1
$ timex a.out
real          18.98
user          18.93
sys            0.01
$ export OMP_NUM_THREADS=2
$ timex a.out
real          18.42
user          36.73
sys            0.01

Which is not, to be honest, a great start, the runtime is slower, and the code still fails to scale. However, the profile is different:

Excl.     Incl.      Name  
User CPU  User CPU         
  sec.      sec.      
21.145    21.145     
 2.572    16.411     std::vector<int,std::allocator<int> >::_C_insert_n(int\*const&,unsigned long,const int&)
 2.402     4.293     mutex_unlock
 2.342     3.613     mutex_lock_impl
 1.961    10.697     std::vector<int,std::allocator<int> >::_C_realloc(unsigned long)
 1.681     5.634     free
 1.341     1.891     mutex_unlock_queue
 1.271     1.271     _free_unlocked
 0.991     0.991     sigon

So we still see a lot of mutex activity. Looking at where the mutex activity comes from provides an interesting insight:

(er_print) csingle mutex_lock
Attr.    Excl.     Incl.      Name  
User CPU  User CPU  User CPU         
 sec.      sec.      sec.       
0.170     1.681     5.634      free
0.020     0.690     4.623      malloc
0.190     0.190     0.190     \*mutex_lock

So the mutex activity is coming from malloc and free. Which are parts of the default Solaris memory allocator. The default memory allocator is thread safe, but does not give good performance for MT codes. There are two usual alternatives, mtmalloc and libumem. I've usually found mtmalloc to be good enough for me:

CC -m64  -O -xopenmp -library=no%Cstd \\
   -I ./stdcxx-4.2.1/install/include/ \\
   -L ./stdcxx-4.2.1/install/lib/     \\
   -R ./stdcxx-4.2.1/install/lib/ -lstd12D t1.cc -lmtmalloc

Then we can try the timing tests again:

$ export OMP_NUM_THREADS=1
$ timex a.out
real          18.02
user          17.98
sys            0.01
$ export OMP_NUM_THREADS=2
real          13.76
user          27.05
sys            0.01
$ export OMP_NUM_THREADS=4
$ timex a.out
real           6.92
user          26.97
sys            0.02
$ export OMP_NUM_THREADS=8
$ timex a.out
real           3.51
user          26.99
sys            0.02

So the code is now scaling to multiple threads, which was the original problem. We have lost some serial performance, which is perhaps a concern, but that performance loss may be only for a particular code path, and depending on the usage of the library, we might even see gains in some of the algorithms. So depending on the situation, this might be a good enough solution. [FWIW, I also tested with libumem and did not see a significant difference in performance between the two libraries.]

Wednesday Jun 03, 2009

The Developer's Edge talk in Second Life

Just finished talking in Second Life. The slides from the talk are available from SLX. I've got into the habit of writing a transcript for my SL presentations - basically in case the audio fails for some reason.

The talk focuses a bit more on the way that people now get information (through blog posts, articles, indexed by search engines) and the Q&A after the talk was more about that than the technical content of the book. This is a domain that I've given a fair amount of thought to. When writing technical books there is a challenge to balance the information so that it includes the necessary details without writing material that will be out of date by the time that the book hits the press. Fortunately a large amount of the information that developers need is relatively long lived. The challenges come when describing a particular revision of the software, or a particular processor - details which can be very useful for people, but also details which may not age gracefully!

Wednesday May 20, 2009

Libraries (4) - Runtime costs - Procedure Lookup Table (PLT)

Most applications spend the majority of their time running - rather than starting up. So it's useful to look at the costs of using libraries at runtime.

The most apparent cost of using libraries is that calls to routines now go indirectly to the target routine through the procedure look up table (PLT). Unless the developer explicitly limits the scope of a function, it is exported from the library as a global function, which means that even calls within the library will go through the PLT. Consider the following code snippet:

void func2()

void func1()

If this is compiled into an executable the assembly code will look like:

        11104:  82 10 00 0f  mov        %o7, %g1
        11108:  7f ff ff f8  call       func2   ! 0x110e8
        1110c:  9e 10 00 01  mov        %g1, %o7

However, if this is compiled as part of a library then the code looks like:

         664:  82 10 00 0f  mov         %o7, %g1
         668:  40 00 40 b9  call        .plt+0x3c       ! 0x1094c
         66c:  9e 10 00 01  mov         %g1, %o7

This is a doubling of the cost of the call.

In C it's possible to limit the scope of the function using the static keyword. Declaring func1 as static will cause the compiler to generate a direct call to that routine. The downside is that the routine will only be visible within the source file that defines it. It is also possible to use other methods to limit the visibility of symbols.

Libraries (3) - Application startup costs

As can be seen from the previous graphs, even a simple application (like ssh) can pull in a fair number of libraries. Whenever a library is pulled in, the linker has to request memory, load the image from disk, and then link in all the routines. This effort takes time - it's basically a large chunk of the start up time of an application. If you profile the start up of an application, you'll probably not see much because much of this time is basically the OS/disk activity of mapping the libraries into memory.

Of course applications also have start up costs associated with initialising data structures etc. However, the biggest risk is that applications will pull in libraries that they don't need, or perhaps do need, but don't need yet. The best work-around for this is to lazy load the libraries. Of course it's fairly easy to write code that either breaks under lazy loading or breaks lazy loading. It's not hard to work around these issues with care, and doing so can have a substantial impact on start up time.

Tuesday May 19, 2009

Developer's Edge safari rerelease

We've just pushed a new version of The Developer's Edge to safari. The original version didn't show any of the text from each section of the book unless you logged into the safari site. The new version shows the snippet from each section even if you're not a subscriber.

I was pleased to see that the book is featured on the Sun Studio developer portal.

I'm also scheduled to give a second life presentation during JavaOne at 9am PST on the 3rd June.

Thursday May 07, 2009

The perils of strlen

Just been looking at an interesting bit of code. Here's a suitably benign version of it:

#include <string.h>
#include <stdio.h>

void main()
  char string[50];
  int i;
  int j=0;
  for (i=0; i<strlen(string); i++)
   if (string[i]=='1') {j=i;}

Compiling this bit of code leads to a loop that looks like:

/\* 0x002c         12 \*/         cmp     %i5,49
/\* 0x0030         10 \*/         add     %i3,1,%i3
/\* 0x0034         12 \*/         move    %icc,%i4,%i2
/\* 0x0038         10 \*/         call    strlen  ! params =  %o0 ! Result =  %o0
/\* 0x003c            \*/         or      %g0,%i1,%o0
/\* 0x0040            \*/         add     %i4,1,%i4
/\* 0x0044            \*/         cmp     %i4,%o0
/\* 0x0048            \*/         bcs,a,pt        %icc,.L900000109
/\* 0x004c         12 \*/         ldsb    [%i3],%i5

The problem being that for each character tested there's also a call to strlen! The reason for this is that the compiler cannot be sure what the call to strlen actually returns. The return value might depend on some external variable that could change as the loop progresses.

There's a lot of functions defined in the libraries that the compiler could optimise, if it was certain that it recognised them. The compiler flag that enables recognition of the "builtin" functions is -xbuiltin (which is included in -fast. This enables the compiler to do things like recognise calls to memcpy or memset and in some instances produce more optimal code. However, it doesn't recognise the call the strlen.

In terms of solving the problem, there are two approaches. The most portable approach is to hold the length of the string in a temporary variable:

  int length=strlen(string);
  for (i=0; i<length; i++)

Another, less portable approach, is to use #pragma no_side_effect. This pragma means the return value of the function depends only on the parameters passed into the function. So the result of calling strlen only depends on the value of the constant string that is passed in. The modified code looks like:

#include <string.h>
#include <stdio.h>
#pragma no_side_effect(strlen)

void main()
  char string[50];
  int i;
  int j=0;
  for (i=0; i<strlen(string); i++)
   if (string[i]=='1') {j=i;}

And more importantly, the resulting disassembly looks like:

/\* 0x0028          0 \*/         sub     %i1,49,%o7
/\* 0x002c         11 \*/         add     %i3,1,%i3
/\* 0x0030          0 \*/         sra     %o7,0,%o5
/\* 0x0034         13 \*/         movrz   %o5,%i4,%i2
/\* 0x0038         11 \*/         add     %i4,1,%i4
/\* 0x003c            \*/         cmp     %i4,%i5
/\* 0x0040            \*/         bcs,a,pt        %icc,.L900000109
/\* 0x0044         13 \*/         ldsb    [%i3],%i1

Friday Apr 17, 2009

Compiling for Nehalem and other processors

First thing I'd suggest is to make sure that you're using a recent compiler. Practically that means Sun Studio 12 (which is quite old now), or the Sun Studio Express releases (aka Sun Studio 12 Update 1 Early Access). Obviously the latest features are only found in the latest compilers.

In terms of flags, the starting point should be -fast, you can later trim that if you need to remove floating point simplification, or are not happy with one or other of the flags that it includes.

The next flags to use are:

  • -xipo=2 This flag enables crossfile optimisation and tracking of allocated memory. I find crossfile optimisation useful because it limits the impact of the source code structure on the final application.
  • -xarch=sse4_2 This flag is included in -fast (assuming that the build system supports it). However, if you later plan to fine tune the compiler flags, it's best to start off with it explicitly. This allows the compiler to use the SSE4 instruction set. Probably -xarch=sse2 will be sufficient in most circumstances - it's a call depending on the system that the application will be deployed on.
  • -xvector=simd This flag tells the compiler to generate SIMD (single instruction multiple data) instructions - basically the combination of this and the architecture flag enables the compiler to generate applications that use SSE instructions. These instructions can lead to substantial performance gains in some floating point applications.
  • -m64 On x86 there's performance to be gained from using the 64-bit instruction set extensions. The code gets more registers and a better calling convention. These tend to outweigh the costs of the larger memory footprint of 64-bit applications.
  • -xpagesize=2M This tells the operating system to provide large pages to the application.
  • The other optimisation that I find very useful is profile feedback. This does complicate and lengthen the build process, but is often the most effective way of getting performance gains for codes dominated by branches and conditional code.
  • The other flags to consider are the aliasing flags -xalias_level=std for C, -xalias_level=compatible for C++, and -xrestrict. These flags do lead to performance gains, but require the developer to be comfortable that their code does conform to the requirements of the flags. (IMO, most code does.)

All this talk about flags should not be a replacement for what I consider to be the basic first step in optimising the performance of an application: to take a profile. Compiler flags tell the compiler how to do a good job of producing the code, but the compiler can't do much about the algorithms used. Profiling the application will often give a clue as to a way that the performance of the application can be improved by a change of algorithm - something that even the best compiler flags can't always do.

Friday Mar 27, 2009

The Developer's Edge available in hardcopy

The Developer's Edge is now available as hardcopy!

It is being made available as print-on-demand. You can either order through the publisher Vervante, or you can order through Amazon.

However, I suggest you wait until next week before ordering as the current cover art is not the final version (you can play spot the difference between the image on the left and the one on the Vervante website). I'll post when it gets fixed. Of course, you can order the "limited-edition" version if you want :)

I introduced the book in a previous post. I'll reproduce a bit more of the details in this post. The brief summary is:

The Developer's Edge: Selected Blog Posts and Articles focuses on articles in the following areas:

  • Native language issues
  • Performance and improving performance
  • Specific features of the x86 and SPARC processors
  • Tools that are available in the Solaris OS

The articles provide a broad overview on a topic, or an in-depth discussion. The texts should provide insights into new tools or new methods, or perhaps the opportunity to review a known domain in a new light.

You can get more details than this from the Safari site, either reading the preface or skimming the table of contents

I would love to hear feedback on this book, feel free to e-mail me directly, leave comments on amazon, or leave comments on this blog, or on the blogs of the other contributors.

Thursday Mar 26, 2009

OpenSPARC workshop - Brussels

April 24th and 25th I'm going to be in Brussels rerunning the OpenSPARC workshop. The workshop leverages the material in the OpenSPARC Internals book, together with the OpenSPARC presentations. There's a poster with all the details (for those with acute eyesight, the poster on the left is from the December workshop!).

Friday Mar 20, 2009

University of Washington Presentation

I was presenting at the University of Washington, Seattle, on Wednesday on Solaris and Sun Studio. The talk covers the tools that are available in Solaris and Sun Studio. This is my "Grand Unified" presentation, it covers tools, compilers, optimisation, parallelisation, and debug.

Volatile and Mutexes

A rough-and-ready guide to when to use the volatile keyword and when to use mutexes.

When you declare a variable to be volatile it ensures that the compiler always loads that variable from memory and immediately stores it back to memory after any operation on it.

For example:

int flag;

while (flag){}

In the absence of the volatile keyword the compiler will optimise this to:

if (!flag) while (1) {}

[If the flag is not zero to start with then the compiler assumes that there's nothing that can make it zero, so there is no exit condition on the loop.]

Not all shared data needs to be declared volatile. Only if you want one thread to see the effect of another thread.

[Example, if one thread populates a buffer, and another thread will later read from that buffer, then you don't need to declare the contents of the buffer as volatile. The important word being later, if you expect the two threads to access the buffer at the same time, then you would probably need the volatile keyword]

Mutexes are there to ensure exclusive access:

You will typically need to use them if you are updating a variable, or if you are performing a complex operation that should appear 'atomic'

For example:

volatile int total;


You need to do this to avoid a data race, where another thread could also be updating total:

Here's the situation without the mutex:

Thread 1    Thread 2
Read total  Read total
Add 5       Add 5
Write total Write total

So total would be incremented by 5 rather than 10.

An example of a complex operation would be:

My_account = My_account - bill;
Their_account = Their_account + bill;

You could use two separate mutexes, but then there would be a state where the amount bill would have been removed from My_account, but not yet placed into Their_account (this may or may not be a problem).

Wednesday Mar 18, 2009

Sun Studio 12 Update 1 Early Access programme

We've just released Sun Studio Express 03/09. We're also using this release to start the Sun Studio 12 Update 1 Early Access Programme.

There are a bunch of new features in Sun Studio 12 Update 1 - you should be familiar with these if you've already been using the Express releases, but if you're coming from Sun Studio 12, there's much to recommend the new release. Top of my list are the following:

  • spot is finally integrated into the suite - please try it out and let me know how it works out for you.
  • The Performance Analyzer has support for MPI profiling
  • Full support for OpenMP 3.0

If you join the Early Access programme, we'll be listening out in case you hit any bugs, or have any suggestions for improvements. There's a forum for posting questions for the duration of the EA programme.

There's also a number of incentives for registering.

Tuesday Mar 17, 2009

Welcome to "The Developer's Edge"

Late Sunday night I got news that my new book "The Developer's Edge" was available on Safari. This has been a project I've worked on for about six months - so a much quicker turnaround than "Solaris Application Programming".

The Developer's Edge is a collection of blog posts and articles. My contribution to the material is from either this blog, or the articles that I link from the left column. However, it also contains a lot of material that other people have written, and that I have found interesting. The original plan came out of some informal discussions with the docs folks around this time last year. We decided to try and capture some of the vast stream of material that goes out through the various sun.com websites (blogs.sun.com, developers.sun.com, wikis.sun.com etc.) - these are kind-of "Edge" publications - hence the title.

The book will also be available as print-on-demand. We're pushing the material through the channels as fast as we can, so the "launch" is asynchronous! I'll announce the print version when it happens.

Thursday Mar 05, 2009

Peak rate of window spill and fill traps

I've been looking at the performance of a code recently. Written in C++ using many threads. One of the things with C++ is that as a language it encourages developers to have lots of small routines. Small routines lead to many calls and returns; and in particular they lead to register window spills and fills. Read more about register windows. Anyway, I wondered what the peak rate of issuing register window spills and fills was?

I'm going to use some old code that I used to examine the cost of library calls a while back. The code uses recursion to reach a deep call depth. First off, I define a couple of library routines:

extern int jump2(int count);

int jump1(int count)
  if (count==0)
    return 1;
    return 1+jump2(count);
int jump1(int count);

int jump2(int count)
  if (count==0)
    return 1;
    return 1+jump1(count);

I can then turn these into libraries:

$ cc -O -G -o libjump1.so jump1.c
$ cc -O -G -o libjump2.so jump2.c

Done in this way, both libraries have hanging dependencies:

$ ldd -d libjump1.so
        symbol not found: jump2         (./libjump1.so)

So the main executable will have to resolve these. The main executable looks like:

#include <stdio.h>

#define RPT 100
#define SIZE 600

extern int jump1(int count);

int main()
  int index,count,links,tmp;
  #pragma omp parallel for default(none) private(count,index,tmp,links)
  for(links=1; links<10000; links++)
   for (count=0; count<RPT; count++)
     for (index=0;index<SIZE;index++) 
       if (tmp!=links) {printf("mismatch\\n");}

This needs to be compiled in such a way as to resolve the unresolved dependencies

$ cc -xopenmp=noopt -o par par.c -L. -R. -ljump1 -ljump2

Note that I'm breaking rules again by making the runtime linker look for the dependent libraries in the current directory rather than use $ORIGIN to locate them relative to the executable. Oops.

I'm also using OpenMP in the code. The directive tells the compiler to make the outer loop run in parallel over multiple processors. I picked 10,000 for the trip count so that the code would run for a bit of time, so I could look at the activity on the system. Also note that the outer loop defines the depth of the call stack, so this code will probably cause stack overflows at some point, if not before. Err...

I'm compiling with -xopenmp=noopt since I want the OpenMP directive to be recognised, but I don't want the compiler to use optimisation, since if the compiler saw the code it would probably eliminate most of it, and that would leave me with nothing much to test.

The first thing to test is whether this generates spill fill traps at all. So we run the application and use trapstat to look at trap activity:

vct name           |   cpu13  
 84 spill-user-32       |  3005899  
 c4 fill-user-32        |  3005779 

So on this 1.2GHz UltraSPARC T1 system, we're getting 3,000,000 traps/second. The generated code is pretty plain except for the save and restore instructions:

         21c:  9d e3 bf a0  save        %sp, -96, %sp
         220:  90 86 3f ff  addcc       %i0, -1, %o0
         224:  12 40 00 04  bne,pn      %icc,jump1+0x18 ! 0x234
         228:  01 00 00 00  nop       

         22c:  81 c7 e0 08  ret       
         230:  91 e8 20 01  restore     %g0, 1, %o0

         234:  40 00 00 00  call        jump2
         238:  01 00 00 00  nop       
         23c:  81 c7 e0 08  ret       
         240:  91 ea 20 01  restore     %o0, 1, %o0

So you can come up with an estimate of 300 ns/trap.

The reason for using OpenMP is to enable us to scale the number of active threads. Rerunning with 32 threads, by setting the environment variable OMP_NUM_THREADS to be 32, we get the following output from trapstat:

vct name                |    cpu21    cpu22    cpu23    cpu24    cpu25    cpu26 
 84 spill-user-32       |  1024589  1028081  1027596  1174373  1029954  1028695
 c4 fill-user-32        |   996739   989598   955669  1169058  1020349  1021877

So we're getting 1M traps per thread, with 32 threads running. Let's take a look at system activity using vmstat.

 vmstat 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s1 s2 s3 s4   in   sy   cs us sy id
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 3022  427  812 100 0  0
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 3020  428  797 100 0  0
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 2945  457  760 100 0  0
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 3147  429 1025 99  1  0
 0 0 0 64800040 504168 0 15  0  0  0  0  0  0  0  0  0 3049  666  820 99  1  0
 0 0 0 64800040 504168 0  1  0  0  0  0  0  0  0  0  0 3044  543  866 100 0  0
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 3021  422  798 100 0  0

So there's no system time being recorded - the register spill and fill traps are fast traps, so that's not a surprise.

One final thing to look at is the instruction issue rate. We can use cpustat to do this:

  2.009   6  tick  63117611 
  2.009  12  tick  69622769 
  2.009   7  tick  62118451 
  2.009   5  tick  64784126 
  2.009   0  tick  67341237 
  2.019  17  tick  62836527 

As might be expected from the cost of each trap, and the sparse number of instructions between traps, the issue rate of the instructions is quite low. Each of the four threads on a core is issuing about 65M instructions per second. So the core is issuing about 260M instructions per second - that's about 20% of the peak issue rate for the core.

If this were a real application, what could be done? Well, obviously the trick would be to reduce the number of calls and returns. At a compiler level, that would using flags that enable inlining - so an optimisation level of at least -xO4; adding -xipo to get cross-file inlining; using -g0 in C++ rather than -g (which disables front-end inlining). At a more structural level, perhaps the way the application is broken into libraries might be changed so that routines that are frequently called could be inlined.

The other thing to bear in mind, is that this code was designed to max out the register window spill/fill traps. Most codes will get nowhere near this level of window spills and fills. Most codes will probably max out at about a tenth of this level, so the impact from register window spill fill traps at that point will be substantially reduced.


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge


« April 2014
The Developer's Edge
Solaris Application Programming
OpenSPARC Book
Multicore Application Programming