Wednesday Feb 26, 2014

Multicore Application Programming available in Chinese!

This was a complete surprise to me. A box arrived on my doorstep, and inside were copies of Multicore Application Programming in Chinese. They look good, and have a glossy cover rather than the matte cover of the English version.

Friday Aug 09, 2013

How to use a lot of threads ...

The SPARC M5-32 has 1,536 virtual CPUs. There is a lot that you can do with that many resources, so a bunch of us sat down to write a white paper discussing the options.

There are a couple of key observations in there. First of all, it is quite likely that such a large system will not end up running a single instance of a single application, so it is useful to understand the various options for virtualising the system. Secondly, there are a number of details to take care of when writing an application that is expected to scale to large numbers of threads.

Anyhow, I hope you find it a useful read.

Tuesday May 11, 2010

New Book: Multicore application programming

I'm very pleased to be able to talk about my next book Multicore Application Programming. I've been working on this for some time, and it's a great relief to be able to finally point to a webpage indicating that it really exists!

The release date is sometime around September/October. Amazon has it as the 11th October, which is probably about right. It takes a chunk of time for the text to go through editing, typesetting, and printing, before it's finally out in the shops. The current status is that it's a set of documents with a fair number of virtual sticky tags attached indicating points which need to be refined.

One thing that should immediately jump out from the subtitle is that the book (currently) covers Windows, Linux, and Solaris. In writing the book I felt it was critical to try to bridge the gaps between the operating systems, rather than writing about only one.

Obviously the difference between Solaris and Linux is pretty minimal. The differences with Windows are much greater, but, when writing to the Windows native threading API, the actual differences are more syntactic than functional.

By this I mean that the name of the function changes and the parameters change a bit, but the meaning of the function call does not. For example, on POSIX systems you might call pthread_create(), while on Windows you might call _beginthreadex(); the names and some of the parameters differ, but both calls create a new thread.
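As a minimal sketch (my illustration here, not an excerpt from the book), this is the same "create a thread and wait for it to finish" operation written against both APIs:

#ifdef _WIN32
#include <windows.h>
#include <process.h>

/* Windows native thread functions return unsigned and use __stdcall */
unsigned __stdcall work(void *param)
{
  return 0;
}

int main(void)
{
  /* _beginthreadex returns a handle to the new thread, or 0 on failure */
  HANDLE h = (HANDLE)_beginthreadex(0, 0, work, 0, 0, 0);
  WaitForSingleObject(h, INFINITE);  /* the equivalent of pthread_join() */
  CloseHandle(h);
  return 0;
}
#else
#include <pthread.h>

/* POSIX thread functions take and return a void * */
void *work(void *param)
{
  return 0;
}

int main(void)
{
  pthread_t thread;
  pthread_create(&thread, 0, work, 0);
  pthread_join(thread, 0);
  return 0;
}
#endif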

I'll write a follow up post containing more details about the contents of the book.

Monday Nov 23, 2009

When threads go bad

When a thread hits a fatal error in a multithreaded application, that error takes out the entire app. Here's some example code:

#include <pthread.h>
#include <stdio.h>

void *work(void *param)
{
  int *a;
  a = (int *)(1024 * 1024);  /* An address that is almost certainly not mapped */
  (*a)++;                    /* Writing through it triggers a segmentation fault */
  printf("Child thread exit\n");
  return 0;
}

int main(void)
{
  pthread_t thread;
  pthread_create(&thread, 0, work, 0);
  pthread_join(thread, 0);
  printf("Main thread exit\n");
  return 0;
}

Compiling and running this produces:

% cc -O -mt pthread_error.c
% ./a.out
Segmentation Fault (core dumped)

Not entirely unexpected, that. The app died without the main thread having the chance to clean up resources, etc. This is probably not ideal. However, it is possible to write a signal handler that captures the segmentation fault and terminates the child thread without causing the main thread to terminate. It's important to realise that there's probably little chance of actually recovering from whatever caused the error, but this at least gives the app the chance to report the symptoms of its demise.

#include <pthread.h>
#include <stdio.h>
#include <signal.h>

void *work(void *param)
{
  int *a;
  a = (int *)(1024 * 1024);  /* An address that is almost certainly not mapped */
  (*a)++;
  printf("Child thread exit\n");
  return 0;
}

void hsignal(int i)
{
  /* Report the signal, then terminate only the faulting thread */
  printf("Signal %i\n", i);
  pthread_exit(0);
}

int main(void)
{
  pthread_t thread;
  sigset(SIGSEGV, hsignal);
  pthread_create(&thread, 0, work, 0);
  pthread_join(thread, 0);
  printf("Main thread exit\n");
  return 0;
}

Which produces the output:

% cc -O -mt pthread_error.c
% ./a.out
Signal 11
Main thread exit

Friday Jan 23, 2009

OSUM presentation on multi-threaded coding

I'll be giving a presentation titled "Multi-threaded coding for CMT processors" to OSUM members next Friday (8am PST). If you are an OSUM member, you can read the details here. OSUM stands for Open Source University Meetup; the definition is:

"OSUM (pronounced "awesome") is a global community of students that are passionate about Free and Open Source Software (FOSS) and how it is Changing (Y)Our World. We call it a "Meetup" to encourage collaboration between student groups to create an even stronger open source community.".

Tuesday Jul 29, 2008

The cost of mutexes

One of the questions that came up last week was about the performance of mutex locks compared to the performance of atomic operations. The basic difference is that acquiring and releasing a mutex takes two calls, whereas an atomic operation is a single call. There's also a bit more code required when using mutexes. The following code measures the cost of mutexes, atomic operations, and an inline template.

#include <pthread.h>
#include <atomic.h>
#include "timing.h"

#define SIZE 10000000

pthread_mutex_t mutex;
pthread_t thread;
volatile unsigned int counter;

/* Provided by the inline template in add.il */
void atomic_add(volatile unsigned int *, int);

void *count(void *value)
{
  /* Increment the counter under a mutex */
  counter = 0;
  starttime();
  while (counter < SIZE)
  {
    pthread_mutex_lock(&mutex);
    counter++;
    pthread_mutex_unlock(&mutex);
  }
  endtime(SIZE);
  /* Increment the counter using the libc atomic operation */
  counter = 0;
  starttime();
  while (counter < SIZE)
  {
    atomic_add_int(&counter, 1);
  }
  endtime(SIZE);
  /* Increment the counter using the inline template */
  counter = 0;
  starttime();
  while (counter < SIZE)
  {
    atomic_add(&counter, 1);
  }
  endtime(SIZE);
  return 0;
}

int main(void)
{
  pthread_mutex_init(&mutex, 0);
  counter = 0;
  pthread_create(&thread, 0, count, 0);

  pthread_join(thread, 0);
  pthread_mutex_destroy(&mutex);
  return 0;
}

Compiling and running on an UltraSPARC T1 gives results like:

% cc test.c add.il
% a.out
Time per iteration 250.61 ns
Time per iteration 75.85 ns
Time per iteration 65.21 ns

So the mutex calls are about 3x slower than atomic operations. Calling libc is about 10ns slower than using an inline template (not a bad difference in return for not having to write the inline template code).

It's interesting to see where the time goes. So here's the profile of the application:

Excl.     Incl.      Name  
User CPU  User CPU         
 sec.      sec.       
3.973     3.973      <Total>
1.341     3.973      count
1.331     1.331      mutex_unlock
0.781     0.781      mutex_lock_impl
0.490     0.490      atomic_add_32
0.030     0.030      mutex_lock

The routine mutex_lock tail calls mutex_lock_impl, which does the work of locking the mutex. The heart of mutex_unlock looks like:

   0.        0.                 [?]    beff8:  mov         %o1, %o3
   0.020     0.020              [?]    beffc:  cas         [%o0] , %o2, %o3
## 0.560     0.560              [?]    bf000:  cmp         %o2, %o3
   0.        0.                 [?]    bf004:  bne,a       0xbeff8
   0.        0.                 [?]    bf008:  mov         %o3, %o2

The core of mutex_lock_impl is not too dissimilar, so basically the mutex lock code contains two atomic operation loops, plus a bundle of other instructions that make up the rest of the cost of the calls.

Looking at where the time is spent for the atomic_add_32 call:

                                
   0.010     0.010              [?]    2ecb8:  ld          [%o0], %o2
   0.040     0.040              [?]    2ecbc:  add         %o2, %o1, %o3
   0.010     0.010              [?]    2ecc0:  cas         [%o0] , %o2, %o3
## 0.370     0.370              [?]    2ecc4:  cmp         %o2, %o3
   0.        0.                 [?]    2ecc8:  bne,a,pn    %icc,0x2ecbc
   0.        0.                 [?]    2eccc:  mov         %o3, %o2
   0.050     0.050              [?]    2ecd0:  retl        
   0.010     0.010              [?]    2ecd4:  add         %o2, %o1, %o0

This is, again, a very similar loop, but with little overhead around it. And it pretty much matches the code that came from the inline template:

   0.040     0.040              [?]    110ec:  add         %o2, %o1, %o3
   0.010     0.010              [?]    110f0:  cas         [%o0] , %o2, %o3
## 0.360     0.360              [?]    110f4:  cmp         %o2, %o3
   0.        0.                 [?]    110f8:  bne         0x110ec
   0.040     0.040              [?]    110fc:  mov         %o3, %o2
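For completeness, a SPARC inline template generating that loop might look something like the following. This is a sketch reconstructed from the disassembly above, not necessarily the exact add.il used in the measurements; the compiler passes the pointer in through %o0 and the increment through %o1:

! Sketch of an inline template for atomic_add(volatile unsigned int *, int)
.inline atomic_add,8
        ld      [%o0],%o2        ! read the current value
1:
        add     %o2,%o1,%o3      ! compute the new value
        cas     [%o0],%o2,%o3    ! store it if memory still holds %o2
        cmp     %o2,%o3          ! did another thread get in first?
        bne     1b               ! if so, retry...
        mov     %o3,%o2          ! ...using the value cas returned (delay slot)
.end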

Tuesday Apr 29, 2008

Multicore expo available - Microparallelisation

My presentation "Strategies for improving the performance of single threaded codes on a CMT system" has been made available on the OpenSPARC site.

The presentation discusses "microparallelisation" in the context of parallelising an example loop. Microparallelisation aims to obtain parallelism by assigning small chunks of work to discrete processors. Taking a step back...

With traditional parallelisation the idea is to identify large chunks of work that can be split between multiple processors. The chunks of work need to be large to amortise the synchronisation costs. This usually means that the loops have a huge trip count.
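As a concrete (hypothetical) sketch of the traditional approach, the following splits a loop into one large contiguous chunk per thread, so the only synchronisation is the join at the end:

#include <pthread.h>
#include <stdio.h>

#define SIZE (1 << 20)
#define NTHREADS 4

static double data[SIZE];

typedef struct { int start; int end; } chunk_t;

/* Each thread handles one large contiguous chunk of the iteration space */
static void *work(void *param)
{
  chunk_t *c = (chunk_t *)param;
  for (int i = c->start; i < c->end; i++)
  {
    data[i] = data[i] * 2.0 + 1.0;
  }
  return 0;
}

int main(void)
{
  pthread_t threads[NTHREADS];
  chunk_t chunks[NTHREADS];
  for (int t = 0; t < NTHREADS; t++)
  {
    chunks[t].start =  t      * (SIZE / NTHREADS);
    chunks[t].end   = (t + 1) * (SIZE / NTHREADS);
    pthread_create(&threads[t], 0, work, &chunks[t]);
  }
  for (int t = 0; t < NTHREADS; t++)
  {
    pthread_join(threads[t], 0);  /* the only synchronisation point */
  }
  printf("data[0] = %f\n", data[0]);
  return 0;
}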

The synchronisation costs derive from the time it takes to signal that a core has completed its work. The lower the synchronisation costs, the smaller the amount of work needed to make parallelisation profitable.

Now, a CMT processor has two big advantages here. First of all, it has many threads. Secondly, these threads have low-latency access to a shared level of cache. The result is that the cost of synchronisation between threads is greatly reduced, and therefore each thread is free to do a smaller chunk of work in a parallel region.
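To illustrate the idea (again a hypothetical sketch, not the code from the presentation), the worker below hands out much smaller chunks on demand, paying one atomic operation per chunk; this only pays off when synchronisation is as cheap as it is on a CMT processor:

#include <pthread.h>
#include <atomic.h>   /* Solaris atomic operations */

#define SIZE  (1 << 20)
#define CHUNK 64      /* a deliberately small unit of work */
#define NTHREADS 32

static double data[SIZE];
static volatile unsigned int next_chunk = 0;

/* Threads claim small chunks with a single atomic add per chunk */
static void *work(void *param)
{
  unsigned int start;
  /* atomic_add_int_nv returns the new value; back off by CHUNK for our base */
  while ((start = atomic_add_int_nv(&next_chunk, CHUNK) - CHUNK) < SIZE)
  {
    unsigned int end = (start + CHUNK < SIZE) ? start + CHUNK : SIZE;
    for (unsigned int i = start; i < end; i++)
    {
      data[i] = data[i] * 2.0 + 1.0;
    }
  }
  return 0;
}

int main(void)
{
  pthread_t threads[NTHREADS];
  for (int t = 0; t < NTHREADS; t++)
  {
    pthread_create(&threads[t], 0, work, 0);
  }
  for (int t = 0; t < NTHREADS; t++)
  {
    pthread_join(threads[t], 0);
  }
  return 0;
}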

All that's great in theory; the presentation uses some example code to try this out, and discovers, rather fortunately, that the idea also works in practice!

The presentation also covers using atomic operations rather than microparallelisation.

In summary, the presentation is more research than solid science, but I hoped that presenting it would get some people thinking about non-traditional ways to extract parallelism from applications. I'm not alone in this area of work; Lawrence Spracklen is also working on it. We're both presenting at CommunityOne next week.

About

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
