Thursday Mar 26, 2009

When to add a membar (an example)

I was recently having a discussion on one of the OpenSolaris lists on the topic of when to use the volatile keyword, and when it was necessary to use membars.

So volatile is a clue to the compiler to load the variable from memory and immediately store it back to memory. What it does not do is tell the hardware anything. So the application can perform the store, but that store may not be immediately visible to the rest of the system. Most of the time this is not a problem - so long as the store is visible to the processor on which the thread is executing, it's fine. Variability in when the store becomes visible to other processors may also be fine. There is one clear situation where the ordering of store operations could be a problem - and that's unlocking mutexes.

The problem here is best illustrated by the following scenario. I lock some data structure, then store new values into it, then unlock the structure. Immediately another thread comes along and uses the values in that structure. Not an uncommon situation. Unlocking a mutex is often just a case of storing a value (of zero) into the mutex structure. And here's the potential problem. In some weaker ordering architectures there is no guarantee that other processors see the stores in the same order that they are performed. So if you have Store A followed by Store B it may be possible for other processors to observe the change in the value of B before they see the change in the value of A. In the case of mutex unlock, the store of B would be the action that unlocked the mutex, enabling other threads to access the variable A... and there could be problems if they see the old value of A.

The solution to this is to put a membar in before the store that unlocks the mutex. You can see this happening in the OpenSolaris code:

/*
 * lock_clear(lp)
 *	- clear lock.
 */
	ENTRY(_lock_clear)
	membar	#LoadStore|#StoreStore
	retl
	  clrb	[%o0]
	SET_SIZE(_lock_clear)

The membar ensures that all the pending stores are visible to other processors before the store that releases the lock becomes visible to them.

Friday Mar 20, 2009

Volatile and Mutexes

A rough-and-ready guide to when to use the volatile keyword and when to use mutexes.

When you declare a variable to be volatile it ensures that the compiler always loads that variable from memory and immediately stores it back to memory after any operation on it.

For example:

int flag;

while (flag){}

In the absence of the volatile keyword the compiler will optimise this to:

if (!flag) while (1) {}

[If the flag is not zero to start with then the compiler assumes that there's nothing that can make it zero, so there is no exit condition on the loop.]

Not all shared data needs to be declared volatile - only data where one thread needs to see the effect of a store made by another thread.

[For example, if one thread populates a buffer, and another thread will later read from that buffer, then you don't need to declare the contents of the buffer as volatile. The important word is later: if you expect the two threads to access the buffer at the same time, then you would probably need the volatile keyword.]

Mutexes are there to ensure exclusive access:

You will typically need to use them if you are updating a variable, or if you are performing a complex operation that should appear 'atomic'.

For example:

volatile int total;

pthread_mutex_lock(&mutex);
total += 5;
pthread_mutex_unlock(&mutex);

You need to do this to avoid a data race, where another thread could also be updating total.

Here's the situation without the mutex:

Thread 1    Thread 2
Read total  Read total
Add 5       Add 5
Write total Write total

So total would be incremented by 5 rather than 10.

An example of a complex operation would be:

My_account = My_account - bill;
Their_account = Their_account + bill;

You could use two separate mutexes, one per account, but then there would be a state where the amount bill had been removed from My_account but not yet placed into Their_account (this may or may not be a problem).

Tuesday Jul 29, 2008

The cost of mutexes

One of the questions that came up last week was about the performance of mutex locks compared to the performance of atomic operations. The basic cost is that to acquire and free a mutex takes two operations whereas an atomic operation is a single call. There's also a bit more code required when using mutexes. The following code demonstrates the cost of calling mutexes, atomic operations, and also the inline template.

#include <pthread.h>
#include <atomic.h>
#include "timing.h"

#define SIZE 10000000

pthread_mutex_t mutex;
pthread_t thread;
volatile unsigned int counter;

void atomic_add(volatile unsigned int *, int);  /* provided by an inline template */

void * count(void * value)
{
  /* Count up to SIZE using the approach selected by value */
  if (value == (void*)0)         /* mutex lock/unlock around each increment */
  {
    while (counter<SIZE) { pthread_mutex_lock(&mutex); counter++; pthread_mutex_unlock(&mutex); }
  }
  else if (value == (void*)1)    /* atomic operation from libc */
  {
    while (counter<SIZE) { atomic_add_32(&counter, 1); }
  }
  else                           /* atomic operation from the inline template */
  {
    while (counter<SIZE) { atomic_add(&counter, 1); }
  }
  return 0;
}

void main()
{
  /* starttime()/endtime() come from timing.h */
  for (long i = 0; i < 3; i++)
  {
    counter = 0;
    starttime();
    pthread_create(&thread, 0, count, (void*)i);
    pthread_join(thread, 0);
    endtime(SIZE);
  }
}

Compiling and running on an UltraSPARC T1 gives results like:

% cc test.c
% a.out
Time per iteration 250.61 ns
Time per iteration 75.85 ns
Time per iteration 65.21 ns

So the mutex calls are about 3x slower than atomic operations. Calling libc is about 10ns slower than using an inline template (not a bad difference in return for not having to write the inline template code).

It's interesting to see where the time goes. So here's the profile of the application:

Excl.     Incl.      Name  
User CPU  User CPU         
 sec.      sec.       
3.973     3.973      
1.341     3.973      count
1.331     1.331      mutex_unlock
0.781     0.781      mutex_lock_impl
0.490     0.490      atomic_add_32
0.030     0.030      mutex_lock

The routine mutex_lock tail calls mutex_lock_impl, which does the work of locking the mutex. The heart of mutex_unlock looks like:

   0.        0.                 [?]    beff8:  mov         %o1, %o3
   0.020     0.020              [?]    beffc:  cas         [%o0] , %o2, %o3
## 0.560     0.560              [?]    bf000:  cmp         %o2, %o3
   0.        0.                 [?]    bf004:  bne,a       0xbeff8
   0.        0.                 [?]    bf008:  mov         %o3, %o2

The core of mutex_lock_impl is not too dissimilar, so basically the mutex lock code contains two atomic operation loops, plus a bundle of other instructions that make up the rest of the cost of the calls.

Looking at where the time is spent for the atomic_add_32 call:

   0.010     0.010              [?]    2ecb8:  ld          [%o0], %o2
   0.040     0.040              [?]    2ecbc:  add         %o2, %o1, %o3
   0.010     0.010              [?]    2ecc0:  cas         [%o0] , %o2, %o3
## 0.370     0.370              [?]    2ecc4:  cmp         %o2, %o3
   0.        0.                 [?]    2ecc8:  bne,a,pn    %icc,0x2ecbc
   0.        0.                 [?]    2eccc:  mov         %o3, %o2
   0.050     0.050              [?]    2ecd0:  retl        
   0.010     0.010              [?]    2ecd4:  add         %o2, %o1, %o0

Which is, again, a very similar loop, but with little overhead around it. And it pretty much matches the code that came from the inline template:

   0.040     0.040              [?]    110ec:  add         %o2, %o1, %o3
   0.010     0.010              [?]    110f0:  cas         [%o0] , %o2, %o3
## 0.360     0.360              [?]    110f4:  cmp         %o2, %o3
   0.        0.                 [?]    110f8:  bne         0x110ec
   0.040     0.040              [?]    110fc:  mov         %o3, %o2

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge