Monday Feb 14, 2011

Interview with Jim Mauro

I was really pleased that Jim Mauro agreed to interview about developing for multicore processors. The interview has just gone live on the informit site.

Thursday Jan 20, 2011

Interview about Multicore book

An interview I did with Christy Confetti-Higgins has just gone up on OTN.

Saturday Nov 13, 2010

Multicore Application Programming arrived!

It was an exciting morning - my copy of Multicore Application Programming was delivered. After reading the text countless times, it's great to actually see it as a finished article. It starting to become generally available. Amazon lists it as being available on Wednesday, although the Kindle version seems to be available already. It's also available on Safari books on-line. Even turned up at Tesco!

Tuesday Nov 09, 2010

Multicore application programming: sample chapter

No sign of the actual books yet - I expect to see them any day now - but there's a sample chapter up on the informit site. There's also a pdf version which includes preface and table of contents.

This is chapter 3 "Identifying opportunities for parallelism". These range from the various OS-level approaches, through virtualisation, and into multithread/multiprocess. It's this flexibility that makes multicore processors so appealing. You have the choice of whether you take advantage of them through some consolidation of existing applications, or whether you take advantage of them, as a developer, through scaling a single application.

Thursday Sep 09, 2010

Book update

I've just handed over the final set of edits to the manuscript. These are edits to the laid-out pages. Some are those last few grammatical errors you only catch after reading a sentence twenty times. Others are tweaks to the figures. There's still a fair amount of production work to do, but my final input will be a review of the indexing - probably next week.

So it's probably a good time to talk about the cover. This is a picture that my wife took last year. It's a picture of the globe at the cliff tops at Durlston Head near Swanage in England. It's 40 tonnes and over 100 years old. It's also surrounded by stone tablets some containing contemporary educational information, and a couple of blank ones that are there just so people can draw on them.

Saturday Aug 07, 2010

I want my CMT

One problem with many parallel applications is that they don't scale to large numbers of threads. There are plenty of reasons why this might be the case. Perhaps the amount of work that needs to be done is insufficient for the number of threads being used.

On the other hand there are plenty of examples where codes scale to huge numbers of threads. Often they are called 'embarrassingly parallel codes', as if writing scaling codes is something to be ashamed of. The other term for this is 'delightfully parallel' which I don't really find any better!

So we have some codes that scale, and some that don't. Why don't all codes scale? There's a whole bunch of reasons:

  • Hitting some hardware constraint, like bandwidth. Adding more cores doesn't remove the constraint - although adding more processors, or systems might
  • Insufficient work. If the problem is too small, there just is not enough work to justify multiple threads
  • Algorithmic constraints or dependencies. If the code needs to calculate A and then B, there is no way that A and B can be calculated simultaneously.

These are all good reasons for poor scaling. But I think there's also another one that is, perhaps, less obvious. And that is access to machines with large numbers of cores.

Perhaps five years ago it was pretty hard to get time on a multicore system. The situation has completely reversed now. Obviously if a code is developed on a system with a single CPU, then it will run best on that kind of system. Over time applications are being "tuned" for multicore, but we're still looking at the 4-8 thread range in general. I would expect that to continue to change as access to large systems becomes more common place.

I'm convinced that as access to systems with large numbers of threads becomes easier, the ability of applications to utilise those threads will also increase. So all those applications that currently max out at eight threads, will be made to scale to sixteen, and beyond.

This is at the root of my optimism about multicore in general. Like many things, applications "evolve" to exploit the resources that are provided. You can see this in other domains like video games, where something new comes out of nowhere, and you are left wondering "How did they make the hardware do that?".

I also think that this will change attitudes to parallel programming. It has a reputation of being difficult. Whilst I agree that it's not a walk in the park, not all parallel programming is equally difficult. As developers become more familiar with it, coding style will evolve to avoid the common problems. Hence as it enters the mainstream, its complexity will be more realistically evaluated.

Sunday Aug 01, 2010

Multicore application programming: podcast

A few weeks back I had the pleasure of being interviewed by Allan Packer as part of the Oracle author podcast series. The podcast is a brief introduction to the book.

Wednesday Jul 07, 2010

Multicore application programming: update

It's 2am and I've just handed over the final manuscript for Multicore Application Programming. Those who know publishing will realise that this is not the final step. The publishers will layout my text and send it back to me for a final review before it goes to press. It will probably take a few weeks to complete the process.

I've also uploaded the final version of the table of contents. I've written the book using It's almost certain not to be a one-to-one mapping of pages in my draft to pages in the finished book. But I expect the page count to be roughly the same - somewhere around 370 pages of text. It will be interesting to see what happens when it is properly typeset.

Friday Aug 28, 2009

Maps in the STL

I was looking at some code with a colleague and we observed a bunch of time in some code which used the std::map to set up mappings between strings. The source code looked rather like the following:

#include <map>
#include <string>

using namespace std;

int func(map<string,string>&mymap, string &s1, string &s2)
  return 0;

When compiled with CC -O -c -library=stlport4 this expands to a horrendous set of calls, here's the first few:

$ er_src -dis func map.o|grep call 
        [?]     188d:  call    std::basic_string...::basic_string
        [?]     189f:  call    std::basic_string...::basic_string
        [?]     18b2:  call    std::basic_string...::basic_string
        [?]     18c2:  call    std::basic_string...::basic_string
        [?]     18d8:  call    std::_Rb_tree...::insert_unique
        [?]     18f8:  call    std::__node_alloc...::_M_deallocate
        [?]     190c:  call std::_STLP_alloc_proxy...::~_STLP_alloc_proxy

What's happening is that the act of making a pair object is causing copies to be made of the two strings that are passed into the pair constructor. Then the pair object is passed into the insert method of std::map and this results in two more copies of the strings being made. There's a bunch of other stuff going on, and the resulting code is a mess.

There's an alternative way of assigning the mapping:

#include <map>
#include <string>

using namespace std;

int func(map<string,string>&mymap, string &s1, string &s2)
  return 0;

When compiled the resulting code looks a lot neater:

$ er_src -dis func map.o|grep call
        [?]     28e6:  call    std::map...::operator[]
        [?]     2903:  call    std::basic_string...::_M_assign_dispatch

Of course a neater chunk of code is nice, but the question is whether the code for ::operator[] contains the same ugly mess. Rather than disassembling to find out, it's simpler to time the two versions and see which does better. A simple test harness looks like:

int main()
  string s1,s2;
  long long i;
  for (i=0; i<100000000; i++)

It's a less than ideal harness since it uses constant strings, and one version of the code might end up bailing early because of this. The performance of the two codes is quite surprising:

real           6.79
user           6.77
sys            0.00

real        1:03.53
user        1:03.26
sys            0.01 

So the version that creates the pair object is about 10x slower!

Wednesday May 20, 2009

Libraries (2)

Just updated the ld_dot script to include filter libraries. Added a profile for ssh logging into a system, rather than just showing the help message (click the image for the full size version).


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
Free Download


« July 2016
The Developer's Edge
Solaris Application Programming
OpenSPARC Book
Multicore Application Programming