The cost of mutexes

One of the questions that came up last week was about the performance of mutex locks compared to the performance of atomic operations. The basic cost is that to acquire and free a mutex takes two operations whereas an atomic operation is a single call. There's also a bit more code required when using mutexes. The following code demonstrates the cost of calling mutexes, atomic operations, and also the inline template.

#include <pthread.h>
#include <atomic.h>
#include "timing.h"

#define SIZE 10000000

pthread_mutex_t mutex;
pthread_t thread;
volatile unsigned int counter;

void atomic_add(volatile unsigned int \*,int);

void \* count(void\* value)
{
  counter=0;
  starttime();
  while (counter<SIZE)
  {
    pthread_mutex_lock(&mutex);
    counter++;
    pthread_mutex_unlock(&mutex);
  }
  endtime(SIZE);
  counter=0;
  starttime();
  while (counter<SIZE)
  {
    atomic_add_int(&counter,1);
  }
  endtime(SIZE);
  counter=0;
  starttime();
  while (counter<SIZE)
  {
    atomic_add(&counter,1);
  }
  endtime(SIZE);
}

void main()
{
  pthread_mutex_init(&mutex,0);
  counter=0;
  pthread_create(&thread,0,count,0);
  
  pthread_join(thread,0);
  pthread_mutex_destroy(&mutex);
}

Compiling and running an UltraSPARC T1 gives results like:

% cc test.c add.il
% a.out
Time per iteration 250.61 ns
Time per iteration 75.85 ns
Time per iteration 65.21 ns

So the mutex calls are about 3x slower than atomic operations. Calling libc is about 10ns slower than using an inline template (not a bad difference in return for not having to write the inline template code).

It's interesting to see where the time goes. So here's the profile of the application:

Excl.     Incl.      Name  
User CPU  User CPU         
 sec.      sec.       
3.973     3.973      
1.341     3.973      count
1.331     1.331      mutex_unlock
0.781     0.781      mutex_lock_impl
0.490     0.490      atomic_add_32
0.030     0.030      mutex_lock

The routine mutex_lock tail calls mutex_lock_impl, which does the work of locking the mutex. The heart of mutex_unlock looks like:

   0.        0.                 [?]    beff8:  mov         %o1, %o3
   0.020     0.020              [?]    beffc:  cas         [%o0] , %o2, %o3
## 0.560     0.560              [?]    bf000:  cmp         %o2, %o3
   0.        0.                 [?]    bf004:  bne,a       0xbeff8
   0.        0.                 [?]    bf008:  mov         %o3, %o2

The core of mutex_lock_impl is not too dissimilar, so basically the mutex lock code contains two atomic operation loops, plus a bundle of other instructions that make up the rest of the cost of the calls.

Looking at where the time is spent for the atomic_add_32 call:

                                
   0.010     0.010              [?]    2ecb8:  ld          [%o0], %o2
   0.040     0.040              [?]    2ecbc:  add         %o2, %o1, %o3
   0.010     0.010              [?]    2ecc0:  cas         [%o0] , %o2, %o3
## 0.370     0.370              [?]    2ecc4:  cmp         %o2, %o3
   0.        0.                 [?]    2ecc8:  bne,a,pn    %icc,0x2ecbc
   0.        0.                 [?]    2eccc:  mov         %o3, %o2
   0.050     0.050              [?]    2ecd0:  retl        
   0.010     0.010              [?]    2ecd4:  add         %o2, %o1, %o0

Which is again, a very similar loop, but with little overhead around it. And it pretty much matches the code that came from the inline template:

   0.040     0.040              [?]    110ec:  add         %o2, %o1, %o3
   0.010     0.010              [?]    110f0:  cas         [%o0] , %o2, %o3
## 0.360     0.360              [?]    110f4:  cmp         %o2, %o3
   0.        0.                 [?]    110f8:  bne         0x110ec
   0.040     0.040              [?]    110fc:  mov         %o3, %o2
Comments:

Post a Comment:
Comments are closed for this entry.
About

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
5
6
8
9
10
12
13
14
15
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today
Bookmarks
The Developer's Edge
Solaris Application Programming
Publications
Webcasts
Presentations
OpenSPARC Book
Multicore Application Programming
Docs