Friday Mar 20, 2009

Volatile and Mutexes

A rough-and-ready guide to when to use the volatile keyword and when to use mutexes.

When you declare a variable to be volatile, the compiler must load that variable from memory every time it is read, and store it back to memory immediately after any operation on it, rather than caching it in a register.

For example:

int flag;

while (flag){}

In the absence of the volatile keyword the compiler will optimise this to:

if (!flag) while (1) {}

[If the flag is not zero to start with then the compiler assumes that there's nothing that can make it zero, so there is no exit condition on the loop.]

Not all shared data needs to be declared volatile - only data where you want one thread to see the effect of another thread.

[For example, if one thread populates a buffer, and another thread will later read from that buffer, then you don't need to declare the contents of the buffer as volatile. The important word is later: if you expect the two threads to access the buffer at the same time, then you would probably need the volatile keyword.]

Mutexes are there to ensure exclusive access:

You will typically need to use them if you are updating a variable, or if you are performing a complex operation that should appear 'atomic'.

For example:

volatile int total;
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

pthread_mutex_lock(&m);
total = total + 5;
pthread_mutex_unlock(&m);

You need to do this to avoid a data race, where another thread could also be updating total.

Here's the situation without the mutex:

Thread 1    Thread 2
Read total  Read total
Add 5       Add 5
Write total Write total

So total would be incremented by 5 rather than 10.

An example of a complex operation would be:

My_account = My_account - bill;
Their_account = Their_account + bill;

You could use two separate mutexes, but then there would be a state where the amount of the bill had been removed from My_account, but not yet placed into Their_account (this may or may not be a problem).

Friday Oct 31, 2008

The limits of parallelism

We all know Amdahl's law. The way I tend to think of it is if you reduce the time spent in the hot region of code, the most benefit you can get is the total time that you initially spent there. However, the original setting for the 'law' was parallelisation - the runtime improvement depends on the proportion of the code that can be made to run in parallel.

Aside: When I'm looking at profiles of applications, and I see a hot region of code, I typically consider what the improvement in runtime would be if I entirely eliminated the time spent there, or if I halved it. Then use this as a guide as to whether it's worth the effort of changing the code.

The issue with Amdahl is that it's completely unrealistic to consider parallelisation without considering issues of the synchronisation overhead introduced when you use multiple threads. So let's do that and see what happens. Assume that:

P = parallel runtime
S = serial runtime
N = number of threads
Z = synchronisation cost

Amdahl would give you:

P = S / N

The flaw is that I can keep adding processors, and the parallel runtime keeps getting smaller and smaller - why would I ever stop? A more accurate equation would be something like:

P = S / N + Z*log(N)

This is probably a fair approximation to the cost of synchronisation using some kind of binary-tree structure that synchronises all the threads. So we can differentiate that:

dP/dN = -S / N^2 + Z / N

And then solve:

0 = -S / N^2 + Z / N
S / N^2 = Z / N
S / N = Z
N = S / Z

Ok, it's a bit disappointing: you start off with some nasty-looking equation, and end up with a ratio. But let's take a look at what the ratio actually means. Let's suppose I reduce the synchronisation cost (Z). If I keep the work constant, then I can scale to a greater number of threads on the system with the lower synchronisation cost. Or, if I keep the number of threads constant, I can make a smaller chunk of work run in parallel.

Let's take a practical example. If I do synchronisation between threads on a traditional SMP system, then communication between cores occurs at memory latency. Let's say that's ~200ns. Now compare that with a CMT system, where the synchronisation between threads can occur through the second level cache, with a latency of ~20ns. That's a 10x reduction in latency, which means that I can either use 10x the threads on the same chunk of work, or I can run in parallel a chunk of work that is 10x smaller.

The logical conclusion is that CMT is a perfect enabler of Microparallelism. You have both a system with huge numbers of threads, and the synchronisation costs between threads are potentially very low.

Now, that's exciting!

Thursday Oct 30, 2008

The multi-core is complex meme (but it's not)

Hidden amongst the interesting stories (Gaming museum opening, Tennant stepping down) on the BBC was this little gem from Andrew Herbert, the head of Microsoft research in the UK.

The article describes how multi-core computing is hard. Here's a snippet of it:

"For exciting, also read 'complicated'; this presents huge programming challenges as we have to address the immensely complex interplay between multiple processors (think of a juggler riding a unicycle on a high wire, and you're starting to get the idea)."

Now, I just happened to see this particular article, but there's plenty of other places where the same meme appears. And yes, writing a multi-threaded application can be very complex, but probably only if you do it badly :) I mean just how complex is:

#pragma omp parallel for

Ok, so it's not fair comparing using OpenMP to parallelise a loop with writing some convoluted juggler riding unicycle application, but let's take a look at the example he uses:

"Handwriting recognition systems, for example, work by either identifying pen movement, or by recognising the written image itself. Currently, each approach works, but each has certain drawbacks that hinder adoption as a serious interface.

Now, with a different processor focusing on each recognition approach, learning our handwriting style and combining results, multi-core PCs will dramatically increase the accuracy of handwriting recognition."

This sounds like a good example (better than the old examples of using a virus scanner on one core whilst you worked on the other), but to me it implies two independent tasks. Or two independent threads. Yes, having a multicore chip means that the two tasks can execute in parallel, but assuming a sufficiently fast single core processor we could use the same approach.

So yes, to get the best from multi-core, you need to use multi-threaded programming (or multi-process, or virtualisation, or consolidation, but that's not the current discussion). But multi-threaded programming, whilst it can be tricky, is pretty well understood, and more importantly, for quite a large range of codes easy to do using OpenMP.

So I'm going to put it the other way around. It's easy to find parallelism in today's environment, from the desktop, through gaming, or numeric codes. There's abundant examples of where many threads are simultaneously active. Where it gets really exciting (for exciting read "fun") is if you start looking at using CMT processors to parallelise things that previously were not practical to run with multiple threads.

Monday Jan 07, 2008

putc in a multithreaded context

Just answering a question from a colleague. The application was running significantly slower when compiled as a multithreaded app compared to the original serial app. The profile showed mutex_unlock as being hot, but going up the callstack the routine that called mutex_unlock was putc.

This is the OpenSolaris source for putc, which shows a call to FLOCKFILE, which is defined in this file for MT programs. So for MT programs, a lock needs to be acquired before the character can be output.

Fortunately it is possible to avoid the locking using putc_unlocked. This call should not be used as a drop-in replacement for putc, but used after the appropriate mutex has been acquired. The details are in the Solaris Multi-threaded programming guide.

A test program that demonstrates this problem is:

#include <stdio.h>
#include <pthread.h>
#include <sys/time.h>

static double s_time;

void starttime()
{
  s_time=1.0*gethrtime();
}

void endtime(long its)
{
  double e_time=1.0*gethrtime();
  printf("Time per iteration %5.2f ns\n", (e_time-s_time)/(1.0*its));
}

void *dowork(void *params)
{
  FILE *s=fopen("/tmp/dldldldldld","w");
  starttime();
  for (int i=0; i<100000000; i++) { putc(0,s); }
  endtime(100000000);
  fclose(s);
  return 0;
}

void main()
{
  /* First run: the process is still single-threaded, so putc is cheap. */
  FILE *s=fopen("/tmp/dldldldldld","w");
  starttime();
  for (int i=0; i<100000000; i++) { putc(0,s); }
  endtime(100000000);
  fclose(s);
  /* Second run: once a thread exists, putc pays for the stream lock. */
  pthread_t threads[1];
  pthread_create(&threads[0],0,dowork,0);
  pthread_join(threads[0],0);
}

Here's the results of running the code on Solaris 10:

$ cc -mt putc.c
$ a.out
Time per iteration 30.55 ns
Time per iteration 165.76 ns

The situation on Solaris 10 is better than Solaris 9, since on Solaris 9 the cost of the mutex was incurred by the -mt compiler flag rather than whether there were actually multiple threads active.


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge

