Tuesday Jul 29, 2008

The cost of mutexes

One of the questions that came up last week was about the performance of mutex locks compared to the performance of atomic operations. The basic cost is that acquiring and then releasing a mutex takes two calls, whereas an atomic operation is a single call. There's also a bit more code required when using mutexes. The following code demonstrates the cost of calling mutexes, atomic operations, and also the inline template.

#include <pthread.h>
#include <atomic.h>
#include "timing.h"

#define SIZE 10000000

pthread_mutex_t mutex;
pthread_t thread;
volatile unsigned int counter;

void atomic_add(volatile unsigned int *, int);

void * count(void * value)
{
  counter=0;
  starttime();
  while (counter<SIZE)
  {
    pthread_mutex_lock(&mutex);
    counter++;
    pthread_mutex_unlock(&mutex);
  }
  endtime(SIZE);
  counter=0;
  starttime();
  while (counter<SIZE)
  {
    atomic_add_int(&counter,1);
  }
  endtime(SIZE);
  counter=0;
  starttime();
  while (counter<SIZE)
  {
    atomic_add(&counter,1);
  }
  endtime(SIZE);
  return 0;
}

int main()
{
  pthread_mutex_init(&mutex,0);
  counter=0;
  pthread_create(&thread,0,count,0);
  
  pthread_join(thread,0);
  pthread_mutex_destroy(&mutex);
  return 0;
}
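
The timing.h header isn't shown here; a minimal version, assuming Solaris's gethrtime() high-resolution timer, might look like the following sketch (the actual file used for the test may differ):

#include <stdio.h>
#include <sys/time.h>

static hrtime_t start;

void starttime()
{
  start = gethrtime();   /* High-resolution timestamp in nanoseconds */
}

void endtime(long iterations)
{
  hrtime_t end = gethrtime();
  printf("Time per iteration %.2f ns\n",
         (double)(end - start) / iterations);
}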

Compiling and running on an UltraSPARC T1 gives results like:

% cc test.c add.il
% a.out
Time per iteration 250.61 ns
Time per iteration 75.85 ns
Time per iteration 65.21 ns

So the mutex calls are about 3x slower than the atomic operations. Calling the libc atomic routine is about 10ns slower than using an inline template (not a bad difference in return for not having to write the inline template code).

It's interesting to see where the time goes, so here's the profile of the application:

Excl.     Incl.      Name
User CPU  User CPU
 sec.      sec.
3.973     3.973      <Total>
1.341     3.973      count
1.331     1.331      mutex_unlock
0.781     0.781      mutex_lock_impl
0.490     0.490      atomic_add_32
0.030     0.030      mutex_lock

The routine mutex_lock tail calls mutex_lock_impl, which does the work of locking the mutex. The heart of mutex_unlock looks like:

   0.        0.                 [?]    beff8:  mov         %o1, %o3
   0.020     0.020              [?]    beffc:  cas         [%o0] , %o2, %o3
## 0.560     0.560              [?]    bf000:  cmp         %o2, %o3
   0.        0.                 [?]    bf004:  bne,a       0xbeff8
   0.        0.                 [?]    bf008:  mov         %o3, %o2

The core of mutex_lock_impl is not too dissimilar, so between them the mutex lock and unlock calls contain two atomic operation loops, plus a bundle of other instructions that make up the rest of the cost of the calls.
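
To get a feel for why a lock/unlock pair costs two atomic operations, here is a minimal sketch of a CAS-based lock built on Solaris 10's atomic_cas_32; this is purely an illustration, not the real libc implementation, which also has to deal with contention, waiting threads, and memory ordering:

#include <atomic.h>
#include <inttypes.h>

/* Minimal spin mutex, for illustration only. */
typedef struct { volatile uint32_t held; } spin_mutex_t;

void spin_lock(spin_mutex_t *m)
{
  /* One atomic operation loop to acquire: the swap from
     0 to 1 only succeeds when the lock is currently free. */
  while (atomic_cas_32(&m->held, 0, 1) != 0) {}
}

void spin_unlock(spin_mutex_t *m)
{
  /* A second atomic operation to release the lock. */
  atomic_cas_32(&m->held, 1, 0);
}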

Looking at where the time is spent for the atomic_add_32 call:

   0.010     0.010              [?]    2ecb8:  ld          [%o0], %o2
   0.040     0.040              [?]    2ecbc:  add         %o2, %o1, %o3
   0.010     0.010              [?]    2ecc0:  cas         [%o0] , %o2, %o3
## 0.370     0.370              [?]    2ecc4:  cmp         %o2, %o3
   0.        0.                 [?]    2ecc8:  bne,a,pn    %icc,0x2ecbc
   0.        0.                 [?]    2eccc:  mov         %o3, %o2
   0.050     0.050              [?]    2ecd0:  retl        
   0.010     0.010              [?]    2ecd4:  add         %o2, %o1, %o0

This is again a very similar loop, but with little overhead around it, and it pretty much matches the code that came from the inline template:

   0.040     0.040              [?]    110ec:  add         %o2, %o1, %o3
   0.010     0.010              [?]    110f0:  cas         [%o0] , %o2, %o3
## 0.360     0.360              [?]    110f4:  cmp         %o2, %o3
   0.        0.                 [?]    110f8:  bne         0x110ec
   0.040     0.040              [?]    110fc:  mov         %o3, %o2

Monday Jul 28, 2008

Atomic operations

Solaris 10 provides atomic operations in libc. Atomic operations are short sequences of instructions that behave as if they were a single instruction that no other thread can interrupt. On SPARC the operations are implemented using the compare-and-swap instruction (cas). A rough outline is as follows:

do
{
  Load existing value from memory
  New value = existing value + increment
  Return value = compare and swap(address, existing value, new value)
}
while (return value != existing value)

The compare-and-swap instruction atomically swaps the value in a register with the value held in memory, but only if the value held in memory equals the value held in another register. The return value is the value that was held in memory. The pseudocode uses this so that the incremented value is stored back to memory only if the value currently in memory equals the value read before the increment. If the value held in memory is not the expected one, the code retries the operation until it succeeds.

Breaking the compare-and-swap instruction into pseudocode looks like:

  CAS (Value held in memory, Old value, New value)
  {
    Existing value = *Value held in memory;
    if (Existing value == Old value)
    {
       *Value held in memory = New value;
       return Old value;
    }
    else
    {
       return Existing value;
    }
  }
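
Putting the two pieces of pseudocode together, here is a sketch in C of an atomic add built from compare-and-swap, using Solaris 10's atomic_cas_32 routine to stand in for the cas instruction (the name my_atomic_add_32 is just for illustration):

#include <atomic.h>
#include <inttypes.h>

uint32_t my_atomic_add_32(volatile uint32_t *target, int32_t delta)
{
  uint32_t oldval, newval;
  do
  {
    oldval = *target;          /* Load existing value            */
    newval = oldval + delta;   /* New value = existing + delta   */
    /* Swap the new value into memory only if memory still holds
       the old value; atomic_cas_32 returns the value it found. */
  } while (atomic_cas_32(target, oldval, newval) != oldval);
  return newval;
}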

One of the questions from last week was how to use atomic operations on Solaris 9, given that libc only provides them on Solaris 10. The answer is to write your own. The following code snippet demonstrates how to write an atomic add operation:

.inline atomic_add,8
  ld [%o0],%o2         /* Load existing value            */
1:
  add %o2, %o1, %o3    /* Generate new value             */
  cas [%o0],%o2,%o3    /* Swap memory contents           */
  cmp %o2, %o3         /* Compare old value with return  */
  bne 1b               /* Retry if the old value does    */
                       /* not equal the value returned   */
                       /* from memory                    */
  mov %o3,%o2          /* Retry using latest value from  */
                       /* memory (branch delay slot)     */
.end
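
To use the template, declare the routine in the C source and pass the .il file to the compiler alongside it. Assuming the template is saved as add.il, usage looks something like:

/* Declaration matching the inline template; the compiler
   replaces calls to atomic_add with the template's body. */
void atomic_add(volatile unsigned int *, int);

volatile unsigned int counter;

void increment()
{
  atomic_add(&counter, 1);
}

% cc test.c add.il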

It's probably a good idea to read this article on atomic operations on SPARC, and it may also be useful to read up on inline templates.

About

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
