Thursday Nov 11, 2010

A few weeks back I was looking at some code that divided work across multiple threads. The code looked something like the following:

```
void * dowork(void * param)
{
  int chunksize = totalwork / nthreads;
  int start     = chunksize * threadid;
  int end       = start + chunksize;
  for (int iteration = start; iteration < end; iteration++ )
  {
  ...
```

So there was a small error in the code. If the total work was not a multiple of the number of threads, then some of the work didn't get done. For example, if you had 7 iterations (0..6) to do, and two threads, then the chunksize would be 7/2 = 3. The first thread would do 0, 1, 2. The second thread would do 3, 4, 5. And neither thread would do iteration 6 - which is probably not the desired behaviour.

However, the fix is pretty easy. The final thread does whatever is left over:

```
void * dowork(void * param)
{
  int chunksize = totalwork / nthreads;
  int start     = chunksize * threadid;
  int end       = start + chunksize;
  if ( threadid + 1 == nthreads) { end = totalwork; }
  for (int iteration = start; iteration < end; iteration++ )
  {
  ...
```

Redoing our previous example, the second thread would get to do 3, 4, 5, and 6. This works pretty well for small numbers of threads, and large iteration counts. The final thread at most does nthreads - 1 additional iterations. So long as there's a bundle of iterations to go around, the additional work is close to noise.

But... if you look at something like a SPARC T3 system, you have 128 threads. Suppose I have 11,000 iterations to complete, and I divide these between all the threads. Each thread gets 11,000 / 128 = 85 iterations. Except for the final thread, which gets 85 + 120 = 205 iterations. So the final thread gets more than twice as much work as all the other threads do.

So we need a better approach for distributing work across threads. We want each thread to do a portion of the remaining work rather than having the final thread do all of it. There are various ways of doing this; one approach is as follows:

```
void * dowork(void * param)
{
  int chunksize = totalwork / nthreads;
  int remainder = totalwork - (chunksize * nthreads); // What's left over

  int start     = chunksize * threadid;

  if ( threadid < remainder ) // Check whether this thread needs to do extra work
  {
    chunksize++;              // Yes. Lengthen chunk
    start += threadid;        // Start from corrected position
  }
  else
  {
    start += remainder;       // No. Just start from corrected position
  }

  int end       = start + chunksize; // End after completing chunk

  for (int iteration = start; iteration < end; iteration++ )
  {
  ...
```

If, like me, you feel that all this hacking around with the distribution of work is a bit of a pain, then you really should look at using OpenMP. The OpenMP library takes care of the work distribution. It even allows dynamic distribution to deal with the situation where the time it takes to complete each iteration is non-uniform. The equivalent OpenMP code would look like:

```
void * dowork(void * param)
{
  #pragma omp parallel for
  for (int iteration = 0; iteration < totalwork; iteration++ )
  {
  ...
```

Tuesday Nov 09, 2010

Multicore application programming: sample chapter

No sign of the actual books yet - I expect to see them any day now - but there's a sample chapter up on the InformIT site. There's also a PDF version which includes the preface and table of contents.

This is chapter 3, "Identifying opportunities for parallelism". The opportunities range from the various OS-level approaches, through virtualisation, and into multithreaded and multiprocess applications. It's this flexibility that makes multicore processors so appealing. You have the choice of taking advantage of them through consolidation of existing applications, or, as a developer, through scaling a single application.

Friday Mar 20, 2009

Always use the latest firmware

Steve Sistare has an excellent write up of a scaling issue that we hit last year. The issue was frustrating because all the tools seemed to indicate a healthy system, but we were just not getting the scaling that we expected. The solution, as Steve writes, was a firmware update. Which was great - the problem was solved - but frustrating because we could have just started by updating the firmware.... but it's not something you always think of!

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
