    July 29, 2008

The cost of mutexes

Guest Author

One of the questions that came up last week was about the performance of mutex locks compared to the performance of atomic operations. The basic difference in cost is that acquiring and releasing a mutex takes two operations, whereas an atomic operation is a single call. There's also a bit more code required when using mutexes. The following code demonstrates the cost of calling mutexes, atomic operations, and also the inline template.

#include <pthread.h>
#include <atomic.h>   /* Solaris atomic_add_32() */
#include "timing.h"   /* assumed to provide starttime()/endtime() */
#define SIZE 10000000
pthread_mutex_t mutex;
pthread_t thread;
volatile unsigned int counter;
void atomic_add(volatile unsigned int *, int);  /* inline template in add.il */

void * count(void * value)
{
  counter = 0; starttime();
  while (counter < SIZE)  /* mutex-protected increment */
    { pthread_mutex_lock(&mutex); counter++; pthread_mutex_unlock(&mutex); }
  endtime(SIZE);
  counter = 0; starttime();
  while (counter < SIZE)  /* libc atomic operation */
    { atomic_add_32(&counter, 1); }
  endtime(SIZE);
  counter = 0; starttime();
  while (counter < SIZE)  /* inline template */
    { atomic_add(&counter, 1); }
  endtime(SIZE);
  return 0;
}

void main()
{
  pthread_create(&thread, 0, count, 0);
  pthread_join(thread, 0);
}

Compiling and running on an UltraSPARC T1 gives results like these (first the mutex loop, then the libc atomic operation, then the inline template):

% cc test.c add.il
% a.out
Time per iteration 250.61 ns
Time per iteration 75.85 ns
Time per iteration 65.21 ns

So the mutex calls are about 3x slower than atomic operations. Calling libc is about 10ns slower than using an inline template (not a bad difference in return for not having to write the inline template code).

It's interesting to see where the time goes. So here's the profile of the application:

Excl.     Incl.      Name
User CPU  User CPU
 sec.      sec.
3.973     3.973      <Total>
1.341     3.973      count
1.331     1.331      mutex_unlock
0.781     0.781      mutex_lock_impl
0.490     0.490      atomic_add_32
0.030     0.030      mutex_lock

The routine mutex_lock tail calls mutex_lock_impl, which does the work of locking the mutex. The heart of mutex_unlock looks like:

   0.        0.                 [?]    beff8:  mov         %o1, %o3
   0.020     0.020              [?]    beffc:  cas         [%o0], %o2, %o3
## 0.560     0.560              [?]    bf000:  cmp         %o2, %o3
   0.        0.                 [?]    bf004:  bne,a       0xbeff8
   0.        0.                 [?]    bf008:  mov         %o3, %o2

The core of mutex_lock_impl is not too dissimilar, so the mutex lock/unlock pair contains two of these atomic operation loops, plus a bundle of other instructions that make up the rest of the cost of the calls.

Looking at where the time is spent for the atomic_add_32 call:

   0.010     0.010              [?]    2ecb8:  ld          [%o0], %o2
   0.040     0.040              [?]    2ecbc:  add         %o2, %o1, %o3
   0.010     0.010              [?]    2ecc0:  cas         [%o0], %o2, %o3
## 0.370     0.370              [?]    2ecc4:  cmp         %o2, %o3
   0.        0.                 [?]    2ecc8:  bne,a,pn    %icc, 0x2ecbc
   0.        0.                 [?]    2eccc:  mov         %o3, %o2
   0.050     0.050              [?]    2ecd0:  retl
   0.010     0.010              [?]    2ecd4:  add         %o2, %o1, %o0

This is, again, a very similar loop, but with little overhead around it. And it pretty much matches the code that came from the inline template:

   0.040     0.040              [?]    110ec:  add         %o2, %o1, %o3
   0.010     0.010              [?]    110f0:  cas         [%o0], %o2, %o3
## 0.360     0.360              [?]    110f4:  cmp         %o2, %o3
   0.        0.                 [?]    110f8:  bne         0x110ec
   0.040     0.040              [?]    110fc:  mov         %o3, %o2
