Saturday Nov 13, 2010

Multicore Application Programming arrived!

It was an exciting morning - my copy of Multicore Application Programming was delivered. After reading the text countless times, it's great to actually see it as a finished article. It starting to become generally available. Amazon lists it as being available on Wednesday, although the Kindle version seems to be available already. It's also available on Safari books on-line. Even turned up at Tesco!

Thursday Nov 11, 2010

Partitioning work over multiple threads

A few weeks back I was looking at some code that divided work across multiple threads. The code looked something like the following:

void \* dowork(void \* param)
  int threadid  = (int) param;
  int chunksize = totalwork / nthreads;
  int start     = chunksize \* threadid;
  int end       = start + chunksize;
  for (int iteration = start; iteration < end; iteration++ )

So there was a small error in the code. If the total work was not a multiple of the number of threads, then some of the work didn't get done. For example, if you had 7 iterations (0..6) to do, and two threads, then the chunksize would be 7/2 = 3. The first thread would do 0, 1, 2. The second thread would do 3, 4, 5. And neither thread would do iteration 6 - which is probably not the desired behaviour.

However, the fix is pretty easy. The final thread does what ever is left over:

void \* dowork(void \* param)
  int threadid  = (int) param;
  int chunksize = totalwork / nthreads;
  int start     = chunksize \* threadid;
  int end       = start + chunksize;
  if ( threadid + 1 == nthreads) { end = totalwork; }
  for (int iteration = start; iteration < end; iteration++ )

Redoing our previous example, the second thread would get to do 3, 4, 5, and 6. This works pretty well for small numbers of threads, and large iteration counts. The final thread at most does nthreads - 1 additional iterations. So long as there's a bundle of iterations to go around, the additional work is close to noise.

But.... if you look at something like a SPARC T3 system, you have 128 threads. Suppose I have 11,000 iterations to complete, I divide these between all the threads. Each thread gets 11,000 / 128 = 85 iterations. Except for the final thread which gets 85 + 120 iterations. So the final thread gets more than twice as much work as all the other threads do.

So we need a better approach for distributing work across threads. We want each thread to so a portion of the remaining work rather than having the final thread do all of it. There's various ways of doing this, one approach is as follows:

void \* dowork(void \* param)
  int threadid  = (int) param;
  int chunksize = totalwork / nthreads;
  int remainder = totalwork - (chunksize \* nthreads); // What's left over

  int start     = chunksize \* threadid;
  if ( threadid < remainder ) // Check whether this thread needs to do extra work
    chunksize++;              // Yes. Lengthen chunk
    start += threadid;        // Start from corrected position
    start += remainder;       // No. Just start from corrected position
  int end       = start + chunksize; // End after completing chunk

  for (int iteration = start; iteration < end; iteration++ )

If, like me, you feel that all this hacking around with the distribution of work is a bit of a pain, then you really should look at using OpenMP. The OpenMP library takes care of the work distribution. It even allows dynamic distribution to deal with the situation where the time it takes to complete each iteration is non-uniform. The equivalent OpenMP code would look like:

void \* dowork(void \*param)
  #pragma omp parallel for
  for (int iteration = 0; iteration < totalwork; iteration++ )

Wednesday Nov 10, 2010

Introduction to parallel programming

My colleague, Ruud van der Pas, recorded a number of lectures on parallel programming.

Tuesday Nov 09, 2010

Multicore application programming: sample chapter

No sign of the actual books yet - I expect to see them any day now - but there's a sample chapter up on the informit site. There's also a pdf version which includes preface and table of contents.

This is chapter 3 "Identifying opportunities for parallelism". These range from the various OS-level approaches, through virtualisation, and into multithread/multiprocess. It's this flexibility that makes multicore processors so appealing. You have the choice of whether you take advantage of them through some consolidation of existing applications, or whether you take advantage of them, as a developer, through scaling a single application.

Thursday Sep 09, 2010

Book update

I've just handed over the final set of edits to the manuscript. These are edits to the laid-out pages. Some are those last few grammatical errors you only catch after reading a sentence twenty times. Others are tweaks to the figures. There's still a fair amount of production work to do, but my final input will be a review of the indexing - probably next week.

So it's probably a good time to talk about the cover. This is a picture that my wife took last year. It's a picture of the globe at the cliff tops at Durlston Head near Swanage in England. It's 40 tonnes and over 100 years old. It's also surrounded by stone tablets some containing contemporary educational information, and a couple of blank ones that are there just so people can draw on them.

Wednesday Sep 08, 2010

Oracle Solaris Studio 12.2

It's been just over a year since the release of Studio 12 Update 1, today we releasing the first Oracle branded Studio release - Oracle Solaris Studio 12.2. For the previous release I wrote a post for the AMD site looking at the growth in multicore processors. It seemed appropriate to take another look at this.

The graph in the chart below shows the cumulative number of SPECint2006 results broken down by the number of cores for each processor. This data does not represent the number of different types of processor that are available, since the same processor can be used in many different results. It is closer to a snapshot of how the market for multicore processors is growing. Each data point represents a system, so the curve approximates the number of different systems that are being released.

It's perhaps more dramatic to demonstrate the change using a stacked area chart. The chart perhaps overplays the number of single core results, but this is probably fair as "single core" represents pretty much all the results prior to the launch of CPU2006. So what is readily apparent is the rapid decline in the number of single core results, the spread of dual, and then quad core. It's also interesting to note the beginning of a spread of more than quad core chips.

If we look at what is happening with multicore processors in the context of what we are releasing with Solaris Studio, there's a very nice fit of features. We continue to refine our support for OpenMP and automatic parallelisation. We've been providing data race (and deadlock) detection through the Thread Analyzer for a couple of releases. The debugger and the performance analyzer have been fine with threads for a long time. The performance analyzer has the time line view which is wonderful for examining multithreaded (or multiprocess) applications.

In addition to these fundamentals Studio 12.2 introduces a bunch of new features. I discussed some of these when the express release came out:

  • For those who use the IDE, integration of support for the analysis of the runtime behaviour of applications has been very useful. It both provides more information directly back to the developer, and raises awareness of the available tools.
  • Understanding call trees is often an important part of interpreting the performance of the application. Being able to drill down the call tree has been a very useful extension to the Performance Analyzer.
  • Memory error checking is critical for all applications. The trouble with memory access errors is that, like data races, the "problem" is visible arbitrarily far from the point where the error occurred.

The release of a new version of a product is always an exciting time. It's a culmination of a huge amount of analysis, development, and testing, and it's wonderful to finally see it available for others to use. So download it and let us know what you think!

Footnote: SPEC, SPECint, reg tm of Standard Performance Evaluation Corporation. Results from as of 6 September 2010 and this report.

Saturday Aug 07, 2010

I want my CMT

One problem with many parallel applications is that they don't scale to large numbers of threads. There are plenty of reasons why this might be the case. Perhaps the amount of work that needs to be done is insufficient for the number of threads being used.

On the other hand there are plenty of examples where codes scale to huge numbers of threads. Often they are called 'embarrassingly parallel codes', as if writing scaling codes is something to be ashamed of. The other term for this is 'delightfully parallel' which I don't really find any better!

So we have some codes that scale, and some that don't. Why don't all codes scale? There's a whole bunch of reasons:

  • Hitting some hardware constraint, like bandwidth. Adding more cores doesn't remove the constraint - although adding more processors, or systems might
  • Insufficient work. If the problem is too small, there just is not enough work to justify multiple threads
  • Algorithmic constraints or dependencies. If the code needs to calculate A and then B, there is no way that A and B can be calculated simultaneously.

These are all good reasons for poor scaling. But I think there's also another one that is, perhaps, less obvious. And that is access to machines with large numbers of cores.

Perhaps five years ago it was pretty hard to get time on a multicore system. The situation has completely reversed now. Obviously if a code is developed on a system with a single CPU, then it will run best on that kind of system. Over time applications are being "tuned" for multicore, but we're still looking at the 4-8 thread range in general. I would expect that to continue to change as access to large systems becomes more common place.

I'm convinced that as access to systems with large numbers of threads becomes easier, the ability of applications to utilise those threads will also increase. So all those applications that currently max out at eight threads, will be made to scale to sixteen, and beyond.

This is at the root of my optimism about multicore in general. Like many things, applications "evolve" to exploit the resources that are provided. You can see this in other domains like video games, where something new comes out of nowhere, and you are left wondering "How did they make the hardware do that?".

I also think that this will change attitudes to parallel programming. It has a reputation of being difficult. Whilst I agree that it's not a walk in the park, not all parallel programming is equally difficult. As developers become more familiar with it, coding style will evolve to avoid the common problems. Hence as it enters the mainstream, its complexity will be more realistically evaluated.

Sunday Aug 01, 2010

Multicore application programming: podcast

A few weeks back I had the pleasure of being interviewed by Allan Packer as part of the Oracle author podcast series. The podcast is a brief introduction to the book.

Wednesday Jul 28, 2010

Multicore application programming on Safari books

A roughcut of Multicore Application Programming has been uploaded to Safari books. If you have access you can read it, and provide feedback or comments. If you don't have access to Safari, you can still see the table of contents, read the preface, and view the start of each chapter.

Wednesday Jul 07, 2010

Multicore application programming: update

It's 2am and I've just handed over the final manuscript for Multicore Application Programming. Those who know publishing will realise that this is not the final step. The publishers will layout my text and send it back to me for a final review before it goes to press. It will probably take a few weeks to complete the process.

I've also uploaded the final version of the table of contents. I've written the book using It's almost certain not to be a one-to-one mapping of pages in my draft to pages in the finished book. But I expect the page count to be roughly the same - somewhere around 370 pages of text. It will be interesting to see what happens when it is properly typeset.

Monday May 17, 2010

Multicore application programming: Table of contents

I've uploaded the current table of contents for Multicore Application Programming. You can find all the detail in there, but I think it's appropriate to talk about how the book is structured.

Chapter 1. The design of any processor has a massive impact on its performance. This is particularly true for multicore processors since multiple software threads will be sharing hardware resources. Hence the first chapter provides a whistle-stop tour of the critical features of hardware. It is important to do this up front as the terminology will be used later in the book when discussing how hardware and software interact.

Chapter 2. Serial performance remains important, even for multicore processors. There's two main reasons for this. The first is that a parallel program is really a bunch of serial threads working together, so improving the performance of the serial code will improve the performance of the parallel program. The second reason is that even a parallel program will have serial sections of code. The performance of the serial code will limit the maximum performance that the parallel program can attain.

Chapter 3. One of important aspects of using multicore processors is identifying where the parallelism is going to come from. If you look at any system today, there are likely to be many active processes. So at one level no change is necessary, systems will automatically use multiple cores. However, we want to get beyond that, and so the chapter discusses approaches like virtualisation as well as discussing the more obvious approach of multi-thread or multi-process programming. One message that needs to be broadcast is that multicore processors do not need a rewrite of existing applications. However, getting the most from a multicore processor may well require that.

Chapter 4. The book discusses Windows native threading, OpenMP, automatic parallelisation, as well as the POSIX threads that are available on OS-X, Linux, and Solaris. Although the details do sometimes change across platforms, the concepts do not. This chapter discusses synchronisation primitives like mutex locks and so on, this enables the chapters which avoids having to repeat information in the implementation chapters.

Chapter 5. This chapter covers POSIX threads (pthreads), which are available on Linux, OS-X, and Solaris, as well as other platforms not covered in the book. The chapter covers multithreaded as well as multiprocess programming, together with methods of communicating between threads and processes.

Chapter 6. This chapter covers Windows native threading. The function names and the parameters that need to be passed to them are different to the POSIX API, but the functionality is the same. This chapter provides the same coverage for Windows native threads that chapter 5 provides for pthreads.

Chapter 7. The previous two chapters provide a low level API for threading. This gives very great control, but provides more opportunities for errors, and requires considerable lines of code to be written for even the most basic parallel code. Automatic parallelisation and OpenMP place more of the burden of parallelisation on the compiler, less on the developer. Automatic parallelisation is the ideal situation, where the compiler does all the work. However, there are limitations to this approach, and this chapter discusses the current limitations and how to make changes to the code that will enable the compiler to do a better job. OpenMP is a very flexible technology for writing parallel applications. It is widely supported and provides support for a number of different approaches to parallelism.

Chapter 8. Synchronisation primitives provided by the operating system or compiler can have high overheads. So it is tempting to write replacements. This chapter covers some of the potential problems that need to be avoided. Most applications will be adequately served by the synchronisation primitives already provided, the discussion in the chapter provides insight about how hardware, compilers, and software can cause bugs in parallel applications.

Chapter 9. The difference between a multicore system and a single core system is in its ability to simultaneously handle multiple active threads. The difference between a multicore system and a multiprocessor system is in the sharing of processor resources between threads. Fundamentally, the key attribute of a multicore system is how it scales to multiple threads, and how the characteristics of the application affect that scaling. This chapter discusses what factors impact scaling on multicore processors, and also what the benefits multicore processors bring to parallel applications.

Chapter 10. Writing parallel programs is a growing and challenging field. The challenges come from producing correct code and getting the code to scale to large numbers of cores. There are some approaches that provide high numbers of cores, there are other approaches which address issues of producing correct code. This chapter discusses a large number of other approaches to programming parallelism.

Chapter 11. The concluding chapter of the book reprises some of the key points of the previous chapters, and tackles the question of how to write correct, scalable, parallel applications.

Monday Oct 19, 2009

Fishing with cputrack

I'm a great fan of the hardware performance counters that you find on most processors. Often you can look at the profile and instantly identify what the issue is. Sometimes though, it is not obvious, and that's where the performance counters can really help out.

I was looking at one such issue last week, the performance of the application was showing some variation, and it wasn't immediately obvious what the issue was. The usual suspects in these cases are:

  • Excessive system time
  • Process migration
  • Memory placement
  • Page size
  • etc.

Unfortunately, none of these seemed to explain the issue. So I hacked together the following script cputrackall which ran the test code under cputrack for all the possible performance counters. Dumped the output into a spreadsheet, and compared the fast and slow runs of the app. This is something of a "fishing trip" script, just gathering as much data as possible in the hope that something leaps out, but sometimes that's exactly what's needed. I regularly get to sit in front of a new chip before the tools like ripc have been ported, and in those situations the easiest thing to do is to look for hardware counter events that might explain the runtime performance. In this particular instance, it helped me to confirm my suspicion that there was a difference in branch misprediction rates that was causing the issue.

Friday Oct 09, 2009

Webcast: Improving the performance of parallel codes using the Performance Analyzer

Earlier in the summer I recorded a slidecast on using the Performance Analyzer on parallel codes, it's just come out on the HPC portal.

Monday Sep 28, 2009

Updated compiler flags article

Just updated the Selecting The Best Compiler Options article for the developer portal. Minor changes, mainly a bit more clarification on floating point optimisations.

Monday Sep 21, 2009

Haskell (GHC) on UltraSPARC T2

Ben Lippmeier gave an excellent presentation at the recent Haskell conference in Edinburgh on his work on porting the Glasgow Haskell Compiler (GHC) back to SPARC. A video of the talk is available.

Update:Link to slides

Wednesday Jul 01, 2009

Introduction to parallel programming

My colleague, Ruud van der Pas, has recorded a series of seven webcasts on parallel programming which will be released on the HPC portal. Ruud is an expert on parallel programming, and one of the authors of the book "Using OpenMP".

Friday Jun 26, 2009

mtmalloc vs umem

A little while back I was looking at the performance of the STL with multithreaded code. I got the opportunity to try this on a particular code, and rather shockingly to me performance was absolutely terrible! I'd linked the code with mtmalloc, and the hottest function in the profile was malloc_internal. I've put together a fake code, and here's the profile from that:

Excl.     Incl.      Name
User CPU  User CPU
   sec.      sec.
266.446   266.446    
258.301   263.084    malloc_internal
  1.661     1.951    free
  1.401     1.401    mutex_lock_impl
  0.961     0.961    mutex_unlock

We can dig into the disassembly of malloc_internal to find out what's going on:

    73.201    73.201            [?]     1724:  cmp         %o5, 0
     1.981     1.981            [?]     1728:  bne         0x1740
     0.320     0.320            [?]     172c:  nop
     0.490     0.490            [?]     1730:  ld          [%i2 + 44], %i2
     1.191     1.191            [?]     1734:  cmp         %i2, 0
     0.901     0.901            [?]     1738:  bne,a       0x1724
## 176.443   176.443            [?]     173c:  ld          [%i2 + 32], %o5

It's not hard to visualise what the original C code would look like:

  while ((ptr->value==0) && (ptr->next!=0)) { ptr=ptr->next; }

Fortunately the source code is searchable and the above loop looks sufficiently similar to line 1032 of mtmalloc.c:

   1032 	while (thiscache != NULL && thiscache->mt_nfree == 0)
   1033 		thiscache = thiscache->mt_next;

So what's going on?

Reading through the source of malloc_internal, it appears that mtmalloc builds up a linked list of chunks of memory for each size of memory request. The size of the chunks of memory is 8KB\*requestsize, and requestsize is 9. So basically each chunk of memory is 72KB in size. So when a memory request comes in, malloc_internal looks at the current chunk, and if memory can be allocated from there, then it returns memory from that chunk. If not it goes to the next chunk and so on. This works very well when memory is allocated at once, but as memory gets freed, these chunks of memory become like Swiss-cheese, with lots of holes in them. If a lot of memory of a particular size is requested, then freed, there can be a large number of these chunks in the linked list, and scanning through the chunks to find one with free space can be time consuming. And that is the condition that my test code exercises.

It's probably worth revealing the test code, at this point, so that you can see what it does:

#include <stdlib.h>
typedef struct s
  struct s \* next;
  char padding[508];
} S;

void main()
  struct s \* head;
  struct s \* keep;
  struct s \* current;
  for (int j=0; j<100; j++)
    for (int i=0; i<100000; i++)
      current=(struct s\*)malloc(sizeof(struct s));
      if (i&1)
    current = head;
    while (current!=0)
      struct s \* tmp = current;
      current = current -> next;
    head = 0;

The code maintains two lists, one that it places memory onto for a long duration, and another list that holds memory for only a short duration. The memory footprint of the code keeps increasing, so more chunks are added to the lists, and holding on to the memory for a long period of time ensures that the chunks end up with lots of gaps in them. The runtime of this code is as follows:

% cc -O mtm.c -lmtmalloc
% timex a.out
real        4:44.18
user        4:33.80
sys            8.70

However there is an API to libmtmalloc that allows us to adjust the size of the chunks. The following changes increase the requestsize from 9 to 20:


The performance reduces from nearly 5 minutes to about 1 minute:

% cc -O mtm.c -lmtmalloc
% timex a.out
real        1:09.10
user        1:01.09
sys            6.53

If we increase the requestsize to 30, performance improves still further:

% cc -O mtm.c -lmtmalloc
% timex a.out
real          38.36
user          31.41
sys            4.96

Of course, libmtmalloc is not the only memory allocator that is optimised for multi-threaded allocation. We also have libumem, compiling the original code to use this results in the following performance:

% cc -O mtm.c -lumem
% timex a.out
real          31.06
user          18.10
sys           10.95

So this is probably a good indication that you will get better performance from libumem if your application allocates and deallocates lots of memory. If you are using libmtmalloc in this role, then you may need to tune the requestsize to a greater number than the default - although this will increase the memory footprint of your application.

Tuesday Jun 23, 2009

Sun Studio 12 Update 1

Sun Studio 12 Update 1 went live yesterday. It's still a free download, and it's got a raft of new features. Many people will have been using the express releases, so they will already be familiar with the improvements.

It's been about two years since Sun Studio 12 came out, and the most obvious change in that time is the prevalence of multicore processors. I figured the easiest way to discern this would be to look at the submissions of SPEC CPU2006 results in that time period. The following chart shows the cummulative number of SPEC CPU2006 Integer speed results over that time broken down by the number of threads that the chip was capable of supporting.

Ok, the first surprising thing about the chart is that there's very few single threaded chips. There were a few results when the suite was launched back in 2006, but nothing much since. What is more apparent is the number of dual-thread chips, that was where the majority of the market was. There were also a number of quad-thread chips at that point. If we fast-forward to the situation today, we can see that the number of dual-thread chips has pretty much leveled off, the bulk of the chips are capable of supporting four threads. But you can see the start of a ramp of chips that are capable of supporting 6 or 8 simultaneous threads.

The relevance of this chart to Sun Studio is that Sun Studio has always been a tool that supports the development of multi-threaded applications. Every release of the product improves on the support in the previous release. Sun Studio 12 Update 1 includes improvements in the compiler's ability to automatically parallelise codes - afterall the easiest way to develop parallel applications is if the compiler can do it for you; improvements to the support of parallelisation specifications like OpenMP, this release includes support for the latest OpenMP 3.0 specification; and improvements in the tools and their ability to provide the developer meaningful feedback about parallel code, for example the ability of the Performance Analyzer to profile MPI code.

Footnote SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Benchmark results stated above reflect results posted on as of 15 June 2009.

Monday Jun 15, 2009

Glasgow Haskell Compiler successfully ported to OpenSPARC

Ben Lippmeier has been working on the native port of the Glasgow Haskell Compiler (GHC) to SPARC. The port was completed a few months back, and since then he's been using an UltraSPARC T2 system to look at thread activity and scaling as the number of threads is increased. The full details of the project are on his GHC on SPARC blog. The latest SPARC/Solaris binary can be downloaded here, although the full set of changes probably won't be available for a couple of months.

Sunday Jun 14, 2009

Audio for JavaOne interview available

A couple of weeks back I recorded an interview where I discussed The Developer's Edge. I've just found the audio up at BlogTalkRadio, it's about 15 minutes in duration.

Friday Jun 12, 2009

Stlport4 and multithreaded code

I finally resolved a problem that's been annoying me for about 3 years. Codes that use the Standard Template Library don't scale to multiple threads.

First off, it's probably good to take a look at a code that illustrates the problem:

#include <vector>

int main()
  #pragma omp parallel for default (__auto)
  for (int i=0; i<10000000; i++)
    std::vector<int> v;

The first comparison is between the serial performance of the Solaris default STL and stlport4 which is provided with the compiler.

$ CC -O
$ timex a.out
real          15.85
user          15.64
sys            0.01
$ CC -O -library=stlport4
$ timex a.out
real           7.87
user           7.78
sys            0.01

This doesn't tell me anything that I didn't already know. stlport4 is (as far as I know) always faster than the STL provided by Solaris. Hence if you use C++, then you should use stlport4 in preference to the Solaris default. The constraint is that each application (libraries and all) can only use one version of the STL. So if a library that is outside your control uses the Solaris default, then the entire app must use it.

The next thing to investigate is scaling when there are multiple threads:

$ CC -O -xopenmp -library=stlport4
$ timex a.out
real           7.00
user           6.96
sys            0.01
$ export OMP_NUM_THREADS=2
$ timex a.out
real           7.18
user          14.28
sys            0.01

So compiling the code to use OpenMP caused no performance overhead, but running with two threads had the same runtime as a run with a single thread. We can profile the code to see what's happening:

Excl.     Incl.      Name  
User CPU  User CPU         
 sec.      sec.       
8.076     8.076      
1.571     2.272      mutex_lock_impl
1.501     1.971      mutex_unlock
1.051     4.573      std::vector >::_M_insert_overflow(int\*,const int&,const std::__true_type&,unsigned,bool)
0.871     8.076      _$d1A5.main
0.871     3.272      std::__node_alloc<true,0>::_M_allocate(unsigned)
0.560     1.721      std::__node_alloc<true,0>::_M_deallocate(void\*,unsigned)
0.480     0.480      sigon
0.440     0.440      mutex_trylock_adaptive
0.250     0.470      mutex_unlock_queue

So the lost time is due to mutex locks, if you dig through the source you'll find that node_alloc has a single mutex lock that only allows a single thread to allocate or deallocate memory. Which is why the code shows no scaling.

This test code is basically creating and destroying vector objects, so it hits the allocate and deallocate routines very hard. Which is why I picked it. Real codes are much less likely to have this problem at quite the same level. It is not unusual to want to create and destroy objects within a loop. One workaround is to hoist the objects out of the hot loops. This works for some instances, but is not a great solution, as even in the best case it makes the code more complex.

The solution I ended up using was to build the Apache STL. It turned out to be a relatively straightforward experience. The compile line is a bit cryptic, I wanted the optimised, multithreaded, 64-bit version and this translates to:

$ gmake BUILDTYPE=12D CONFIG=sunpro.config 

Once I had it built, I could install it with:

$ gmake BUILDTYPE=12D CONFIG=sunpro.config install PREFIX=`pwd`/install

The steps necessary to use a different STL than the ones supplied with the compiler are documented here. The compile line for the test code was:

CC -m64  -O -xopenmp -library=no%Cstd \\
   -I ./stdcxx-4.2.1/install/include/ \\
   -L ./stdcxx-4.2.1/install/lib/     \\
   -R ./stdcxx-4.2.1/install/lib/ -lstd12D 

So we can build the test and look at the scaling between one and two threads:

$ export OMP_NUM_THREADS=1
$ timex a.out
real          18.98
user          18.93
sys            0.01
$ export OMP_NUM_THREADS=2
$ timex a.out
real          18.42
user          36.73
sys            0.01

Which is not, to be honest, a great start, the runtime is slower, and the code still fails to scale. However, the profile is different:

Excl.     Incl.      Name  
User CPU  User CPU         
  sec.      sec.      
21.145    21.145     
 2.572    16.411     std::vector<int,std::allocator<int> >::_C_insert_n(int\*const&,unsigned long,const int&)
 2.402     4.293     mutex_unlock
 2.342     3.613     mutex_lock_impl
 1.961    10.697     std::vector<int,std::allocator<int> >::_C_realloc(unsigned long)
 1.681     5.634     free
 1.341     1.891     mutex_unlock_queue
 1.271     1.271     _free_unlocked
 0.991     0.991     sigon

So we still see a lot of mutex activity. Looking at where the mutex activity comes from provides an interesting insight:

(er_print) csingle mutex_lock
Attr.    Excl.     Incl.      Name  
User CPU  User CPU  User CPU         
 sec.      sec.      sec.       
0.170     1.681     5.634      free
0.020     0.690     4.623      malloc
0.190     0.190     0.190     \*mutex_lock

So the mutex activity is coming from malloc and free. Which are parts of the default Solaris memory allocator. The default memory allocator is thread safe, but does not give good performance for MT codes. There are two usual alternatives, mtmalloc and libumem. I've usually found mtmalloc to be good enough for me:

CC -m64  -O -xopenmp -library=no%Cstd \\
   -I ./stdcxx-4.2.1/install/include/ \\
   -L ./stdcxx-4.2.1/install/lib/     \\
   -R ./stdcxx-4.2.1/install/lib/ -lstd12D -lmtmalloc

Then we can try the timing tests again:

$ export OMP_NUM_THREADS=1
$ timex a.out
real          18.02
user          17.98
sys            0.01
$ export OMP_NUM_THREADS=2
real          13.76
user          27.05
sys            0.01
$ export OMP_NUM_THREADS=4
$ timex a.out
real           6.92
user          26.97
sys            0.02
$ export OMP_NUM_THREADS=8
$ timex a.out
real           3.51
user          26.99
sys            0.02

So the code is now scaling to multiple threads, which was the original problem. We have lost some serial performance, which is perhaps a concern, but that performance loss may be only for a particular code path, and depending on the usage of the library, we might even see gains in some of the algorithms. So depending on the situation, this might be a good enough solution. [FWIW, I also tested with libumem and did not see a significant difference in performance between the two libraries.]

Wednesday Jun 10, 2009

Utilising a CMT system

I got asked about how to utilise a CMT system, it's probably not an uncommon question, so I'll post my (somewhat brief) answer here:

The CMT processor appears as a system with many CPUs. These virtual CPUs can be provisioned in the same way as you would with any multiprocessor system:

  • The OS will naturally handle positioning multiple active threads so as to get the optimal performance.
  • If you wish to manually tune this then you can use Solaris tools like processor binding (pbind, or processor_bind) to statically allocate a particular thread or process to a particular core. You can use processor sets (psrset) to restrict a set of processes to a particular set of processors (or to exclude particular processes from using these processors).
  • The machine can be divided into multiple virtual machines either through Solaris Zones, where all zones run the same version of the Solaris operating system. Or through logical domains where multiple different operating systems can be installed onto the same machine.
    • The optimal configuration will depend on the problem to be solved.

      I've actually heard someone argue that multicore processors require a redesign of applications. Um, no. Applications will work just fine. However, multicore processors do give you opportunities to throw more threads at a problem - which can be very useful.

Wednesday May 20, 2009

Libraries (4) - Runtime costs - Procedure Lookup Table (PLT)

Most applications spend the majority of their time running - rather than starting up. So it's useful to look at the costs of using libraries at runtime.

The most apparent cost of using libraries is that calls to routines now go indirectly to the target routine through the procedure look up table (PLT). Unless the developer explicitly limits the scope of a function, it is exported from the library as a global function, which means that even calls within the library will go through the PLT. Consider the following code snippet:

void func2()

void func1()

If this is compiled into an executable the assembly code will look like:

        11104:  82 10 00 0f  mov        %o7, %g1
        11108:  7f ff ff f8  call       func2   ! 0x110e8
        1110c:  9e 10 00 01  mov        %g1, %o7

However, if this is compiled as part of a library then the code looks like:

         664:  82 10 00 0f  mov         %o7, %g1
         668:  40 00 40 b9  call        .plt+0x3c       ! 0x1094c
         66c:  9e 10 00 01  mov         %g1, %o7

This is a doubling of the cost of the call.

In C it's possible to limit the scope of the function using the static keyword. Declaring func1 as static will cause the compiler to generate a direct call to that routine. The downside is that the routine will only be visible within the source file that defines it. It is also possible to use other methods to limit the visibility of symbols.

Tuesday May 19, 2009

Developer's Edge safari rerelease

We've just pushed a new version of The Developer's Edge to safari. The original version didn't show any of the text from each section of the book unless you logged into the safari site. The new version shows the snippet from each section even if you're not a subscriber.

I was pleased to see that the book is featured on the Sun Studio developer portal.

I'm also scheduled to give a second life presentation during JavaOne at 9am PST on the 3rd June.

Thursday Apr 30, 2009

Computer organization and design

Just returned from Europe - customer visits and the OpenSPARC workshop in Brussels. Since I was doing a fair amount of air travel I took a number of books. With good timing, I'd just got a copy of Patterson & Hennessy's Computer Organisation and Design.

The book is certainly an interesting read. Although there are various ways you might read the book - or various lessons you might extract from it, I took it as basically a book that describes how to build a MIPS processor. Much of the work I do is at the hardware-software border, so I found it useful to read around the domain a bit more. The book is a relatively easy read, lots of detail (which I didn't need to memorise), and a nice pace that builds up from a simple core into a more complete implementation of a processor. The book has a chapter on multicore, but this was not treated to the same depth. There's also some material about GPUs, and that also didn't fit very well.

Friday Mar 27, 2009

The Developer's Edge available in hardcopy

The Developer's Edge is now available as hardcopy!

It is being made available as print-on-demand. You can either order through the publisher Vervante, or you can order through Amazon.

However, I suggest you wait until next week before ordering as the current cover art is not the final version (you can play spot the difference between the image on the left and the one on the Vervante website). I'll post when it gets fixed. Of course, you can order the "limited-edition" version if you want :)

I introduced the book in a previous post. I'll reproduce a bit more of the details in this post. The brief summary is:

The Developer's Edge: Selected Blog Posts and Articles focuses on articles in the following areas:

  • Native language issues
  • Performance and improving performance
  • Specific features of the x86 and SPARC processors
  • Tools that are available in the Solaris OS

The articles provide a broad overview on a topic, or an in-depth discussion. The texts should provide insights into new tools or new methods, or perhaps the opportunity to review a known domain in a new light.

You can get more details than this from the Safari site, either reading the preface or skimming the table of contents

I would love to hear feedback on this book, feel free to e-mail me directly, leave comments on amazon, or leave comments on this blog, or on the blogs of the other contributors.

Thursday Mar 26, 2009

OpenSPARC workshop - Brussels

April 24th and 25th I'm going to be in Brussels rerunning the OpenSPARC workshop. The workshop leverages the material in the OpenSPARC Internals book, together with the OpenSPARC presentations. There's a poster with all the details (for those with acute eyesight, the poster on the left is from the December workshop!).

Friday Mar 20, 2009

University of Washington Presentation

I was presenting at the University of Washington, Seattle, on Wednesday on Solaris and Sun Studio. The talk covers the tools that are available in Solaris and Sun Studio. This is my "Grand Unified" presentation, it covers tools, compilers, optimisation, parallelisation, and debug.

Thursday Mar 05, 2009

Peak rate of window spill and fill traps

I've been looking at the performance of a code recently. Written in C++ using many threads. One of the things with C++ is that as a language it encourages developers to have lots of small routines. Small routines lead to many calls and returns; and in particular they lead to register window spills and fills. Read more about register windows. Anyway, I wondered what the peak rate of issuing register window spills and fills was?

I'm going to use some old code that I used to examine the cost of library calls a while back. The code uses recursion to reach a deep call depth. First off, I define a couple of library routines:

extern int jump2(int count);

int jump1(int count)
  if (count==0)
    return 1;
    return 1+jump2(count);
int jump1(int count);

int jump2(int count)
  if (count==0)
    return 1;
    return 1+jump1(count);

I can then turn these into libraries:

$ cc -O -G -o jump1.c
$ cc -O -G -o jump2.c

Done in this way, both libraries have hanging dependencies:

$ ldd -d
        symbol not found: jump2         (./

So the main executable will have to resolve these. The main executable looks like:

#include <stdio.h>

#define RPT 100
#define SIZE 600

extern int jump1(int count);

int main()
  int index,count,links,tmp;
  #pragma omp parallel for default(none) private(count,index,tmp,links)
  for(links=1; links<10000; links++)
   for (count=0; count<RPT; count++)
     for (index=0;index<SIZE;index++) 
       if (tmp!=links) {printf("mismatch\\n");}

This needs to be compiled in such a way as to resolve the unresolved dependencies

$ cc -xopenmp=noopt -o par par.c -L. -R. -ljump1 -ljump2

Note that I'm breaking rules again by making the runtime linker look for the dependent libraries in the current directory rather than use $ORIGIN to locate them relative to the executable. Oops.

I'm also using OpenMP in the code. The directive tells the compiler to make the outer loop run in parallel over multiple processors. I picked 10,000 for the trip count so that the code would run for a bit of time, so I could look at the activity on the system. Also note that the outer loop defines the depth of the call stack, so this code will probably cause stack overflows at some point, if not before. Err...

I'm compiling with -xopenmp=noopt since I want the OpenMP directive to be recognised, but I don't want the compiler to use optimisation, since if the compiler saw the code it would probably eliminate most of it, and that would leave me with nothing much to test.

The first thing to test is whether this generates spill fill traps at all. So we run the application and use trapstat to look at trap activity:

vct name           |   cpu13  
 84 spill-user-32       |  3005899  
 c4 fill-user-32        |  3005779 

So on this 1.2GHz UltraSPARC T1 system, we're getting 3,000,000 traps/second. The generated code is pretty plain except for the save and restore instructions:

         21c:  9d e3 bf a0  save        %sp, -96, %sp
         220:  90 86 3f ff  addcc       %i0, -1, %o0
         224:  12 40 00 04  bne,pn      %icc,jump1+0x18 ! 0x234
         228:  01 00 00 00  nop       

         22c:  81 c7 e0 08  ret       
         230:  91 e8 20 01  restore     %g0, 1, %o0

         234:  40 00 00 00  call        jump2
         238:  01 00 00 00  nop       
         23c:  81 c7 e0 08  ret       
         240:  91 ea 20 01  restore     %o0, 1, %o0

So you can come up with an estimate of 300 ns/trap.

The reason for using OpenMP is to enable us to scale the number of active threads. Rerunning with 32 threads, by setting the environment variable OMP_NUM_THREADS to be 32, we get the following output from trapstat:

vct name                |    cpu21    cpu22    cpu23    cpu24    cpu25    cpu26 
 84 spill-user-32       |  1024589  1028081  1027596  1174373  1029954  1028695
 c4 fill-user-32        |   996739   989598   955669  1169058  1020349  1021877

So we're getting 1M traps per thread, with 32 threads running. Let's take a look at system activity using vmstat.

 vmstat 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s1 s2 s3 s4   in   sy   cs us sy id
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 3022  427  812 100 0  0
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 3020  428  797 100 0  0
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 2945  457  760 100 0  0
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 3147  429 1025 99  1  0
 0 0 0 64800040 504168 0 15  0  0  0  0  0  0  0  0  0 3049  666  820 99  1  0
 0 0 0 64800040 504168 0  1  0  0  0  0  0  0  0  0  0 3044  543  866 100 0  0
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 3021  422  798 100 0  0

So there's no system time being recorded - the register spill and fill traps are fast traps, so that's not a surprise.

One final thing to look at is the instruction issue rate. We can use cpustat to do this:

  2.009   6  tick  63117611 
  2.009  12  tick  69622769 
  2.009   7  tick  62118451 
  2.009   5  tick  64784126 
  2.009   0  tick  67341237 
  2.019  17  tick  62836527 

As might be expected from the cost of each trap, and the sparse number of instructions between traps, the issue rate of the instructions is quite low. Each of the four threads on a core is issuing about 65M instructions per second. So the core is issuing about 260M instructions per second - that's about 20% of the peak issue rate for the core.

If this were a real application, what could be done? Well, obviously the trick would be to reduce the number of calls and returns. At a compiler level, that would using flags that enable inlining - so an optimisation level of at least -xO4; adding -xipo to get cross-file inlining; using -g0 in C++ rather than -g (which disables front-end inlining). At a more structural level, perhaps the way the application is broken into libraries might be changed so that routines that are frequently called could be inlined.

The other thing to bear in mind, is that this code was designed to max out the register window spill/fill traps. Most codes will get nowhere near this level of window spills and fills. Most codes will probably max out at about a tenth of this level, so the impact from register window spill fill traps at that point will be substantially reduced.

Friday Jan 23, 2009

OSUM presentation on multi-threaded coding

I'll be giving a presentation titled "Multi-threaded coding for CMT processors" to OSUM members next friday (8am PST). If you are an OSUM member you can read the details here. OSUM stands for Open Source University Meetup - the definition is

"OSUM (pronounced "awesome") is a global community of students that are passionate about Free and Open Source Software (FOSS) and how it is Changing (Y)Our World. We call it a "Meetup" to encourage collaboration between student groups to create an even stronger open source community.".

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
Free Download


« February 2015
The Developer's Edge
Solaris Application Programming
OpenSPARC Book
Multicore Application Programming