Monday Apr 01, 2013

OpenMP and language level parallelisation

The C11 and C++11 standards introduced some very useful features into the language. In particular they provided language-level access to threading and synchronisation primitives. So using the new standards we can write multithreaded code that compiles and runs on standard compliant platforms. I've tackled translating Windows and POSIX threads before, but not having to use a shim is fantastic news.

There's some ideas afoot to do something similar for higher level parallelism. I have a proposal for consideration at the April meetings - leveraging the existing OpenMP infrastructure.

Pretty much all compilers use OpenMP, a large chunk of shared memory parallel programs are written using OpenMP. So, to me, it seems a good idea to leverage the existing OpenMP library code, and existing developer knowledge. The paper is not arguing that we need take the OpenMP syntax - that is something that can be altered to fit the requirements of the language.

What do you think?

Saturday Nov 13, 2010

Multicore Application Programming arrived!

It was an exciting morning - my copy of Multicore Application Programming was delivered. After reading the text countless times, it's great to actually see it as a finished article. It starting to become generally available. Amazon lists it as being available on Wednesday, although the Kindle version seems to be available already. It's also available on Safari books on-line. Even turned up at Tesco!

Wednesday Jul 07, 2010

Multicore application programming: update

It's 2am and I've just handed over the final manuscript for Multicore Application Programming. Those who know publishing will realise that this is not the final step. The publishers will layout my text and send it back to me for a final review before it goes to press. It will probably take a few weeks to complete the process.

I've also uploaded the final version of the table of contents. I've written the book using It's almost certain not to be a one-to-one mapping of pages in my draft to pages in the finished book. But I expect the page count to be roughly the same - somewhere around 370 pages of text. It will be interesting to see what happens when it is properly typeset.

Monday May 17, 2010

Multicore application programming: Table of contents

I've uploaded the current table of contents for Multicore Application Programming. You can find all the detail in there, but I think it's appropriate to talk about how the book is structured.

Chapter 1. The design of any processor has a massive impact on its performance. This is particularly true for multicore processors since multiple software threads will be sharing hardware resources. Hence the first chapter provides a whistle-stop tour of the critical features of hardware. It is important to do this up front as the terminology will be used later in the book when discussing how hardware and software interact.

Chapter 2. Serial performance remains important, even for multicore processors. There's two main reasons for this. The first is that a parallel program is really a bunch of serial threads working together, so improving the performance of the serial code will improve the performance of the parallel program. The second reason is that even a parallel program will have serial sections of code. The performance of the serial code will limit the maximum performance that the parallel program can attain.

Chapter 3. One of important aspects of using multicore processors is identifying where the parallelism is going to come from. If you look at any system today, there are likely to be many active processes. So at one level no change is necessary, systems will automatically use multiple cores. However, we want to get beyond that, and so the chapter discusses approaches like virtualisation as well as discussing the more obvious approach of multi-thread or multi-process programming. One message that needs to be broadcast is that multicore processors do not need a rewrite of existing applications. However, getting the most from a multicore processor may well require that.

Chapter 4. The book discusses Windows native threading, OpenMP, automatic parallelisation, as well as the POSIX threads that are available on OS-X, Linux, and Solaris. Although the details do sometimes change across platforms, the concepts do not. This chapter discusses synchronisation primitives like mutex locks and so on, this enables the chapters which avoids having to repeat information in the implementation chapters.

Chapter 5. This chapter covers POSIX threads (pthreads), which are available on Linux, OS-X, and Solaris, as well as other platforms not covered in the book. The chapter covers multithreaded as well as multiprocess programming, together with methods of communicating between threads and processes.

Chapter 6. This chapter covers Windows native threading. The function names and the parameters that need to be passed to them are different to the POSIX API, but the functionality is the same. This chapter provides the same coverage for Windows native threads that chapter 5 provides for pthreads.

Chapter 7. The previous two chapters provide a low level API for threading. This gives very great control, but provides more opportunities for errors, and requires considerable lines of code to be written for even the most basic parallel code. Automatic parallelisation and OpenMP place more of the burden of parallelisation on the compiler, less on the developer. Automatic parallelisation is the ideal situation, where the compiler does all the work. However, there are limitations to this approach, and this chapter discusses the current limitations and how to make changes to the code that will enable the compiler to do a better job. OpenMP is a very flexible technology for writing parallel applications. It is widely supported and provides support for a number of different approaches to parallelism.

Chapter 8. Synchronisation primitives provided by the operating system or compiler can have high overheads. So it is tempting to write replacements. This chapter covers some of the potential problems that need to be avoided. Most applications will be adequately served by the synchronisation primitives already provided, the discussion in the chapter provides insight about how hardware, compilers, and software can cause bugs in parallel applications.

Chapter 9. The difference between a multicore system and a single core system is in its ability to simultaneously handle multiple active threads. The difference between a multicore system and a multiprocessor system is in the sharing of processor resources between threads. Fundamentally, the key attribute of a multicore system is how it scales to multiple threads, and how the characteristics of the application affect that scaling. This chapter discusses what factors impact scaling on multicore processors, and also what the benefits multicore processors bring to parallel applications.

Chapter 10. Writing parallel programs is a growing and challenging field. The challenges come from producing correct code and getting the code to scale to large numbers of cores. There are some approaches that provide high numbers of cores, there are other approaches which address issues of producing correct code. This chapter discusses a large number of other approaches to programming parallelism.

Chapter 11. The concluding chapter of the book reprises some of the key points of the previous chapters, and tackles the question of how to write correct, scalable, parallel applications.

Tuesday May 11, 2010

New Book: Multicore application programming

I'm very pleased to be able to talk about my next book Multicore Application Programming. I've been working on this for some time, and it's a great relief to be able to finally point to a webpage indicating that it really exists!

The release date is sometime around September/October. Amazon has it as the 11th October, which is probably about right. It takes a chunk of time for the text to go through editing, typesetting, and printing, before it's finally out in the shops. The current status is that it's a set of documents with a fair number of virtual sticky tags attached indicating points which need to be refined.

One thing that should immediately jump out from the subtitle is that the book (currently) covers Windows, Linux, and Solaris. In writing the book I felt it was critical to try and bridge the gaps between operating systems, and avoid writing it about only one.

Obviously the difference between Solaris and Linux is pretty minimal. The differences with Windows are much greater, but, when writing to the Windows native threading API, the actual differences are more syntactic than functional.

By this I mean that the name of the function changes, the parameters change a bit, but the meaning of the function call does not change. For example, you might call pthread_create(), on Windows you might call _beginthreadex(); the name of the function changes, there are a few different parameters, but both calls create a new thread.

I'll write a follow up post containing more details about the contents of the book.

Tuesday Feb 23, 2010

Presenting at the SVOSUG on Thursday

I'm presenting at the Silicon Valley OpenSolaris Users Group on Thursday evening. I was only asked today, so I'm putting together some slides this evening on "Multicore Application Programming". The talk is going to be a relatively high level presentation on writing parallel applications, and how the advent of multicore or CMT processors changes the dynamics.

Wednesday Feb 03, 2010

Little book of semaphores

Interesting read that demonstrates that there's much more to semaphores than might be expected.

Tuesday Jun 23, 2009

Sun Studio 12 Update 1

Sun Studio 12 Update 1 went live yesterday. It's still a free download, and it's got a raft of new features. Many people will have been using the express releases, so they will already be familiar with the improvements.

It's been about two years since Sun Studio 12 came out, and the most obvious change in that time is the prevalence of multicore processors. I figured the easiest way to discern this would be to look at the submissions of SPEC CPU2006 results in that time period. The following chart shows the cummulative number of SPEC CPU2006 Integer speed results over that time broken down by the number of threads that the chip was capable of supporting.

Ok, the first surprising thing about the chart is that there's very few single threaded chips. There were a few results when the suite was launched back in 2006, but nothing much since. What is more apparent is the number of dual-thread chips, that was where the majority of the market was. There were also a number of quad-thread chips at that point. If we fast-forward to the situation today, we can see that the number of dual-thread chips has pretty much leveled off, the bulk of the chips are capable of supporting four threads. But you can see the start of a ramp of chips that are capable of supporting 6 or 8 simultaneous threads.

The relevance of this chart to Sun Studio is that Sun Studio has always been a tool that supports the development of multi-threaded applications. Every release of the product improves on the support in the previous release. Sun Studio 12 Update 1 includes improvements in the compiler's ability to automatically parallelise codes - afterall the easiest way to develop parallel applications is if the compiler can do it for you; improvements to the support of parallelisation specifications like OpenMP, this release includes support for the latest OpenMP 3.0 specification; and improvements in the tools and their ability to provide the developer meaningful feedback about parallel code, for example the ability of the Performance Analyzer to profile MPI code.

Footnote SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Benchmark results stated above reflect results posted on as of 15 June 2009.

Friday Jun 12, 2009

Stlport4 and multithreaded code

I finally resolved a problem that's been annoying me for about 3 years. Codes that use the Standard Template Library don't scale to multiple threads.

First off, it's probably good to take a look at a code that illustrates the problem:

#include <vector>

int main()
  #pragma omp parallel for default (__auto)
  for (int i=0; i<10000000; i++)
    std::vector<int> v;

The first comparison is between the serial performance of the Solaris default STL and stlport4 which is provided with the compiler.

$ CC -O
$ timex a.out
real          15.85
user          15.64
sys            0.01
$ CC -O -library=stlport4
$ timex a.out
real           7.87
user           7.78
sys            0.01

This doesn't tell me anything that I didn't already know. stlport4 is (as far as I know) always faster than the STL provided by Solaris. Hence if you use C++, then you should use stlport4 in preference to the Solaris default. The constraint is that each application (libraries and all) can only use one version of the STL. So if a library that is outside your control uses the Solaris default, then the entire app must use it.

The next thing to investigate is scaling when there are multiple threads:

$ CC -O -xopenmp -library=stlport4
$ timex a.out
real           7.00
user           6.96
sys            0.01
$ export OMP_NUM_THREADS=2
$ timex a.out
real           7.18
user          14.28
sys            0.01

So compiling the code to use OpenMP caused no performance overhead, but running with two threads had the same runtime as a run with a single thread. We can profile the code to see what's happening:

Excl.     Incl.      Name  
User CPU  User CPU         
 sec.      sec.       
8.076     8.076      
1.571     2.272      mutex_lock_impl
1.501     1.971      mutex_unlock
1.051     4.573      std::vector >::_M_insert_overflow(int\*,const int&,const std::__true_type&,unsigned,bool)
0.871     8.076      _$d1A5.main
0.871     3.272      std::__node_alloc<true,0>::_M_allocate(unsigned)
0.560     1.721      std::__node_alloc<true,0>::_M_deallocate(void\*,unsigned)
0.480     0.480      sigon
0.440     0.440      mutex_trylock_adaptive
0.250     0.470      mutex_unlock_queue

So the lost time is due to mutex locks, if you dig through the source you'll find that node_alloc has a single mutex lock that only allows a single thread to allocate or deallocate memory. Which is why the code shows no scaling.

This test code is basically creating and destroying vector objects, so it hits the allocate and deallocate routines very hard. Which is why I picked it. Real codes are much less likely to have this problem at quite the same level. It is not unusual to want to create and destroy objects within a loop. One workaround is to hoist the objects out of the hot loops. This works for some instances, but is not a great solution, as even in the best case it makes the code more complex.

The solution I ended up using was to build the Apache STL. It turned out to be a relatively straightforward experience. The compile line is a bit cryptic, I wanted the optimised, multithreaded, 64-bit version and this translates to:

$ gmake BUILDTYPE=12D CONFIG=sunpro.config 

Once I had it built, I could install it with:

$ gmake BUILDTYPE=12D CONFIG=sunpro.config install PREFIX=`pwd`/install

The steps necessary to use a different STL than the ones supplied with the compiler are documented here. The compile line for the test code was:

CC -m64  -O -xopenmp -library=no%Cstd \\
   -I ./stdcxx-4.2.1/install/include/ \\
   -L ./stdcxx-4.2.1/install/lib/     \\
   -R ./stdcxx-4.2.1/install/lib/ -lstd12D 

So we can build the test and look at the scaling between one and two threads:

$ export OMP_NUM_THREADS=1
$ timex a.out
real          18.98
user          18.93
sys            0.01
$ export OMP_NUM_THREADS=2
$ timex a.out
real          18.42
user          36.73
sys            0.01

Which is not, to be honest, a great start, the runtime is slower, and the code still fails to scale. However, the profile is different:

Excl.     Incl.      Name  
User CPU  User CPU         
  sec.      sec.      
21.145    21.145     
 2.572    16.411     std::vector<int,std::allocator<int> >::_C_insert_n(int\*const&,unsigned long,const int&)
 2.402     4.293     mutex_unlock
 2.342     3.613     mutex_lock_impl
 1.961    10.697     std::vector<int,std::allocator<int> >::_C_realloc(unsigned long)
 1.681     5.634     free
 1.341     1.891     mutex_unlock_queue
 1.271     1.271     _free_unlocked
 0.991     0.991     sigon

So we still see a lot of mutex activity. Looking at where the mutex activity comes from provides an interesting insight:

(er_print) csingle mutex_lock
Attr.    Excl.     Incl.      Name  
User CPU  User CPU  User CPU         
 sec.      sec.      sec.       
0.170     1.681     5.634      free
0.020     0.690     4.623      malloc
0.190     0.190     0.190     \*mutex_lock

So the mutex activity is coming from malloc and free. Which are parts of the default Solaris memory allocator. The default memory allocator is thread safe, but does not give good performance for MT codes. There are two usual alternatives, mtmalloc and libumem. I've usually found mtmalloc to be good enough for me:

CC -m64  -O -xopenmp -library=no%Cstd \\
   -I ./stdcxx-4.2.1/install/include/ \\
   -L ./stdcxx-4.2.1/install/lib/     \\
   -R ./stdcxx-4.2.1/install/lib/ -lstd12D -lmtmalloc

Then we can try the timing tests again:

$ export OMP_NUM_THREADS=1
$ timex a.out
real          18.02
user          17.98
sys            0.01
$ export OMP_NUM_THREADS=2
real          13.76
user          27.05
sys            0.01
$ export OMP_NUM_THREADS=4
$ timex a.out
real           6.92
user          26.97
sys            0.02
$ export OMP_NUM_THREADS=8
$ timex a.out
real           3.51
user          26.99
sys            0.02

So the code is now scaling to multiple threads, which was the original problem. We have lost some serial performance, which is perhaps a concern, but that performance loss may be only for a particular code path, and depending on the usage of the library, we might even see gains in some of the algorithms. So depending on the situation, this might be a good enough solution. [FWIW, I also tested with libumem and did not see a significant difference in performance between the two libraries.]

Tuesday Mar 31, 2009

NUMA, binding, and OpenMP

One of my colleagues did an excellent bit of analysis recently, it pulls together a fair number of related topics, so I hope you'll find it interesting.

We'll start with NUMA. Non-Uniform Memory Access. This is in contrast to UMA - Uniform Memory Access. This relates to memory latency - how long does it take to get data from memory to the processor. If you take a single CPU box, the memory latency is basically a measurement of the wires between the processor and the memory chips, it typically is about 90ns, can be as low as 60ns. For a 3GHz chip this is from around 200 to 300 cycles, which is a fair length of time.

Suppose we add a second chip into the system. The memory latency increases because there's now a bunch of communication that needs to happen between the two chips. The communication consists of things like checking that more recent data is not in the cache of the other chip, co-ordinating access to the same memory bank, accessing memory that is controlled by the other processor. The upshot of all this is that memory latency increases. However, that's not all.

If you have two chips together with a bunch of memory, you can have various configurations. The most likely one is that each chip gets half the memory. If one chip has to access memory that the other chip owns, this is going to take longer than if the memory is attached to that chip. Typically you might find that local memory takes 90ns to access, and remote memory 120ns.

One way of dealing with this disparity is to interleave the memory, so one cacheline will be local, the next remote. Doing this you'll see an average memory latency of 105ns. Although the memory latency is longer than the optimal, there's nothing a programmer (or an operating system) can do about it.

However, those of use who care about performance will jump through hoops of fire to get that lower memory latency. Plus as the disparity in memory latency grows larger, it makes less and less sense to average the cost. Imagine a situation on a large multi-board system where the on-board memory latency might be 150ns, but the cross-board latency would be closer to 300ns (I should point out that I'm using top-of-the-head numbers for all latencies, I'm not measuring them on any systems). The impact of doing this averaging could be a substantial slow-down in performance for any application that doesn't fit into cache (which is most apps). (There are other reasons for not doing this, such as limiting the amount of traffic that needs to go across the various busses.)

So most systems with more than one CPU will see some element of NUMA. Solaris has contained some memory placement optimisations MPO since Solaris 9. These optimisations attempt to allocate memory locally to the processor that is running the application. OpenSolaris has the lgrpinfo command that provides an interface to see the levels of memory locality in the system.

Solaris will attempt to schedule threads so that they remain in their locality group - taking advantage of the local memory. Another way of controlling performance is to use binding to keep processes, or threads on a particular processor. This can be done through the pbind command. Processor sets can performance a similar job (as can zones, or even logical domains), or directly through processor_bind.

Binding can be a tricky thing to get right. For example in a system where there are multiple active users, it is quite possible to end up in a situation where one virtual processor is oversubscribed with processes, whilst another is completely idle. However, in situations where this level of control enables better performance then binding can be hugely helpful.

One situation where binding is commonly used is for running OpenMP programs. In fact, it is so common that the OpenMP library has built in support for binding through the environment variable SUNW_MP_PROCBIND. This variable enables the user to specify which threads are bound to which logical processors.

It is worth pointing out that binding does not just help memory locality issues. Another situation where binding helps is thread migrations. This is the situation where an interrupt, or another thread requires attention and this causes the thread currently running on the processor to be descheduled. In some situations the descheduled thread will get scheduled onto another virtual processor. In some instances that may be the correct decision. In other instances it may result in lower than expected performance because the data that the thread needs is still in the cache on the old processor, and also because the migration of that thread may cause a cascade of migrations of other threads.

The particular situation we hit was that one code when bound showed bimodal distributions of runtimes. It had a 50% chance of running fast or slow. We were using OpenMP as well as the SUNW_MP_PROCBIND environment variable, so in theory we'd controlled for everything. However, the program didn't hit the parallel section until after a few minutes of running, and examining what was happening using both pbind and also the Performance Analyzer indicated what the problem was.

The environment variable SUNW_MP_PROCBIND currently binds threads once the code reaches the parallel region. Until that point the process is unbound. Since the process is unbound, Solaris can schedule it to any available virtual CPU. During the unbound time, the process allocated the memory that it needed, and the MPO feature of Solaris ensured that the memory was allocated locally. I'm sure you can see where this is heading.... Now, once the code hit the parallel region, the binding occurred, and the main thread was bound to a particular locality group, half the time this group was the same group where it had been running before, and half the time it was a different locality group. If the locality group was the same, then memory would be local, otherwise memory would be remote (and performance slower).

We put together a LD_PRELOAD library to prove it. The following code has a parallel section in it which gets called during initialisation. This ensures that binding has already taken place by the time the master thread starts.

#include <stdio.h>
#pragma init(s)
void s()
  #pragma omp parallel sections
  #pragma omp section

The code is compiled and used with:

$ cc -O -xopenmp -G -Kpic -o par.c
$ LD_PRELOAD=./ ./a.out

Friday Mar 20, 2009

University of Washington Presentation

I was presenting at the University of Washington, Seattle, on Wednesday on Solaris and Sun Studio. The talk covers the tools that are available in Solaris and Sun Studio. This is my "Grand Unified" presentation, it covers tools, compilers, optimisation, parallelisation, and debug.

Tuesday May 13, 2008

OpenMP 3.0 specification released

The specification for OpenMP 3.0 has been put up on the website. Using the previous OpenMP 2.5 standard, there's basically two supported modes of parallelisation:

  • Splitting a loop over multiple threads - each thread is responsible for a range of the iterations.
  • Splitting a serial code into sections - each thread executes a section of code.

The large change with OpenMP 3.0 is the introduction of tasks, where a thread can spawn a task to be completed by another thread at an unspecified point in the future. This should make OpenMP amenable to many more situations. An example of using tasks looks like:

  node \* p = head;
  while (p)
    #pragma omp task
    p = p->next;

The master thread iterates the linked list generating tasks for processing each element in the list. The brackets around the call to process(p) are unnecessary, but hopefully clarify what's happening.

Monday May 12, 2008

Slides for CommunityOne

All the slides for last week's CommunityOne conference are available for download. I was presenting in the CMT stream, you can find my slides here. Note that to download the slides, you'll need to use the username and password shown on the page.

My talk was on parallelisation. What's supported by the compiler, the steps to do it, and the tools that support that. I ended with an overview of microparallelisation.

Friday May 02, 2008

Official reschedule notice for CommunityOne

Session ID: S297077
Session Title: Techniques for Utilizing CMT
Track: Chip Multithreading (CMT): OpenSPARCâ„¢
Room: Esplanade 302
Date: 2008-05-05
Start Time: 13:30 

The official timetable has also been updated

Embedded Systems Conference Presentation

I got the opportunity to present at the embedded systems conference in San Jose a couple of weeks back. My presentation covered parallelising a serial application, a quick tour of what to do, together with an overview of the tools that Sun Studio provides to help out. The presentation is now available on the OpenSPARC website.

Tuesday Apr 29, 2008

Multicore expo available - Microparallelisation

My presentation "Strategies for improving the performance of single threaded codes on a CMT system" has been made available on the OpenSPARC site.

The presentation discusses "microparallelisation", in the the context of parallelising an example loop. Microparallelisation is the aim of obtaining parallelism through assigning small chunks of work to discrete processors. Taking a step back...

With traditional parallelisation the idea is to identify large chunks of work that can be split between multiple processors. The chunks of work need to be large to amortise the synchronisation costs. This usually means that the loops have a huge trip count.

The synchronisation costs are derived from the time it takes to signal that a core has completed its work. The lower the synchronisation costs, the smaller amount of work is needed to make parallelisation profitable.

Now, a CMT processor has two big advantages here. First of all it has many threads. Secondly these threads have low latency access to a shared level of cache. The result of this is that the cost of synchronisation between threads is greatly reduced, and therefore each thread is free to do a smaller chunk of work in a parallel region.

All that's great in theory, the presentation uses some example code to try this out, and discovers, rather fortunately, that the idea also works in practice!

The presentation also covers using atomic operations rather that microparallelisation.

In summary the presentation is more research than solid science, but I hoped that presenting it would get some people thinking about non-traditional ways to extract parallelism from applications. I'm not alone in this area of work, Lawrence Spracklen is also working on it. We're both at presenting CommunityOne next week.

Thursday Jan 31, 2008

Win $20,000!

Sun has announced a Community Innovation Awards Programme - basically a $1M of prize money available for various Sun-sponsored open source projects. There is an OpenSPARC programme, and the one that catches my eye is $20k for:

vi. Best Adaptation of a single-thread application to a multi-thread CMT (Chip Multi Threaded) environment

My guess is that they will expect more than the use of -xautopar -xreduction or a few OpenMP directives :) If I were allowed to enter (unfortunately Sun Employees are not) I'd be looking to exploit the features of the T1 or T2:

  • The threads can synchronise at the L2 cache level - so synchronisation costs are low
  • Memory latency is low

The upshot of this should be that it is possible to parallelise applications which traditionally have not been parallelisable because of synchronisation costs.

Funnily enough this is an area that I'm currently working in, and I do hope to have a paper accepted for the MultiExpo.

Friday Sep 28, 2007

Multi-threading resources on the developer portal

Found a page of links to useful resources about multi-threading on the developer portal.


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
Free Download


« July 2016
The Developer's Edge
Solaris Application Programming
OpenSPARC Book
Multicore Application Programming