Wednesday Feb 26, 2014

Multicore Application Programming available in Chinese!

This was a complete surprise to me. A box arrived on my doorstep, and inside were copies of Multicore Application Programming in Chinese. They look good, and have a glossy cover rather than the matte cover of the English version.

Wednesday Sep 11, 2013

Presenting at UK Oracle User Group meeting

I'm very excited to have been invited to present at the UK Oracle User Group conference in Manchester, UK on 1-4 December.

Currently I'm down for two presentations:

As you might expect, I'm very excited to be over there; I've not visited Manchester in about 20 years!

Friday Aug 09, 2013

How to use a lot of threads ....

The SPARC M5-32 has 1,536 virtual CPUs. There is a lot that you can do with that much resource and a bunch of us sat down to write a white paper discussing the options.

There are a couple of key observations in there. First of all it is quite likely that such a large system will not end up running a single instance of a single application. Therefore it is useful to understand the various options for virtualising the system. The second observation is that there are a number of details to take care of when writing an application that is expected to scale to large numbers of threads.

Anyhow, I hope you find it a useful read.

Monday Feb 14, 2011

Interview with Jim Mauro

I was really pleased that Jim Mauro agreed to interview me about developing for multicore processors. The interview has just gone live on the informit site.

Sunday Nov 28, 2010

Multicore Application Programming: Source code available

I've just uploaded all the source code to the examples in Multicore Application Programming. About 160 files.

Saturday Nov 13, 2010

Multicore Application Programming arrived!

It was an exciting morning - my copy of Multicore Application Programming was delivered. After reading the text countless times, it's great to actually see it as a finished article. It's starting to become generally available. Amazon lists it as being available on Wednesday, although the Kindle version seems to be available already. It's also available on Safari books on-line. It has even turned up at Tesco!

Thursday Nov 11, 2010

Partitioning work over multiple threads

A few weeks back I was looking at some code that divided work across multiple threads. The code looked something like the following:

void * dowork(void * param)
{
  int threadid  = (int) param;
  int chunksize = totalwork / nthreads;
  int start     = chunksize * threadid;
  int end       = start + chunksize;
  for (int iteration = start; iteration < end; iteration++ )
  {
...

So there was a small error in the code. If the total work was not a multiple of the number of threads, then some of the work didn't get done. For example, if you had 7 iterations (0..6) to do, and two threads, then the chunksize would be 7/2 = 3. The first thread would do 0, 1, 2. The second thread would do 3, 4, 5. And neither thread would do iteration 6 - which is probably not the desired behaviour.
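To make the lost iteration concrete, here is a minimal sketch that counts how many iterations the naive split actually covers. The helper name naive_covered is my own for illustration; it is not part of the original code.

```c
#include <assert.h>

/* Hypothetical helper: counts how many iterations get executed when
   every thread takes exactly totalwork / nthreads iterations. */
int naive_covered(int totalwork, int nthreads)
{
  assert(nthreads > 0);
  int chunksize = totalwork / nthreads;   /* integer division truncates */
  int covered   = 0;
  for (int threadid = 0; threadid < nthreads; threadid++)
  {
    int start = chunksize * threadid;
    int end   = start + chunksize;
    covered  += end - start;              /* iterations this thread runs */
  }
  return covered;
}
```

Calling naive_covered(7, 2) returns 6, so one of the seven iterations is never executed. When the total work is an exact multiple of the thread count, nothing is lost.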

However, the fix is pretty easy. The final thread does whatever is left over:

void * dowork(void * param)
{
  int threadid  = (int) param;
  int chunksize = totalwork / nthreads;
  int start     = chunksize * threadid;
  int end       = start + chunksize;
  if ( threadid + 1 == nthreads) { end = totalwork; }
  for (int iteration = start; iteration < end; iteration++ )
  {
...

Redoing our previous example, the second thread would get to do 3, 4, 5, and 6. This works pretty well for small numbers of threads, and large iteration counts. The final thread at most does nthreads - 1 additional iterations. So long as there's a bundle of iterations to go around, the additional work is close to noise.

But... if you look at something like a SPARC T3 system, you have 128 threads. Suppose I have 11,000 iterations to complete, and I divide these between all the threads. Each thread gets 11,000 / 128 = 85 iterations, except for the final thread, which gets 85 + 120 = 205 iterations. So the final thread gets more than twice as much work as each of the other threads.
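The arithmetic is easy to check with a couple of lines of code. This is only a sketch, and the function name last_thread_iterations is an assumption of mine:

```c
#include <assert.h>

/* Hypothetical helper: iterations done by the final thread when it
   picks up everything left over after an even split. */
int last_thread_iterations(int totalwork, int nthreads)
{
  assert(nthreads > 0);
  int chunksize = totalwork / nthreads;             /* 11000 / 128 = 85 */
  int leftover  = totalwork - chunksize * nthreads; /* 11000 - 10880 = 120 */
  return chunksize + leftover;
}
```

For 11,000 iterations and 128 threads this returns 205, compared with 85 for every other thread. The application cannot finish until that last thread does, so the other 127 threads spend more than half the run time idle.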

So we need a better approach for distributing work across threads. We want each thread to do a portion of the remaining work rather than having the final thread do all of it. There are various ways of doing this; one approach is as follows:

void * dowork(void * param)
{
  int threadid  = (int) param;
  int chunksize = totalwork / nthreads;
  int remainder = totalwork - (chunksize * nthreads); // What's left over

  int start     = chunksize \* threadid;
  
  if ( threadid < remainder ) // Check whether this thread needs to do extra work
  { 
    chunksize++;              // Yes. Lengthen chunk
    start += threadid;        // Start from corrected position
  }
  else
  {
    start += remainder;       // No. Just start from corrected position
  }
    
  int end       = start + chunksize; // End after completing chunk

  for (int iteration = start; iteration < end; iteration++ )
  {
...
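One way to convince yourself that this distribution covers every iteration exactly once is to pull the bounds calculation out into small functions and check where one thread stops and the next starts. The following is a sketch of the same logic; the names chunk_start and chunk_end are hypothetical, and the thread count and total work are passed as parameters rather than read from globals.

```c
#include <assert.h>

/* Hypothetical helper: first iteration for this thread. The first
   `remainder` threads each take one extra iteration. */
int chunk_start(int threadid, int nthreads, int totalwork)
{
  assert(nthreads > 0 && threadid >= 0 && threadid < nthreads);
  int chunksize = totalwork / nthreads;
  int remainder = totalwork - (chunksize * nthreads);
  int start     = chunksize * threadid;
  if (threadid < remainder) { start += threadid;  } /* one extra per earlier thread */
  else                      { start += remainder; } /* all extras already handed out */
  return start;
}

/* Hypothetical helper: one past the last iteration for this thread. */
int chunk_end(int threadid, int nthreads, int totalwork)
{
  int chunksize = totalwork / nthreads;
  int remainder = totalwork - (chunksize * nthreads);
  if (threadid < remainder) { chunksize++; }        /* this thread does extra work */
  return chunk_start(threadid, nthreads, totalwork) + chunksize;
}
```

With 11,000 iterations over 128 threads, threads 0 to 119 each get 86 iterations and threads 120 to 127 each get 85, so no thread does more than one iteration beyond the minimum.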

If, like me, you feel that all this hacking around with the distribution of work is a bit of a pain, then you really should look at using OpenMP. The OpenMP library takes care of the work distribution. It even allows dynamic distribution to deal with the situation where the time it takes to complete each iteration is non-uniform. The equivalent OpenMP code would look like:

void * dowork(void *param)
{
  #pragma omp parallel for
  for (int iteration = 0; iteration < totalwork; iteration++ )
  {
...
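For the non-uniform case, a fuller sketch might look like the following. The function name run_all and the use of the iteration index as stand-in work are assumptions for illustration; the schedule(dynamic) and reduction clauses are standard OpenMP.

```c
#include <assert.h>

/* Hypothetical example: sums the iteration indices as stand-in work.
   schedule(dynamic) hands out iterations to threads as they become
   free, which helps when iterations take different amounts of time.
   reduction(+:total) gives each thread a private partial sum and
   combines them at the end, avoiding a data race on total. */
long run_all(int totalwork)
{
  assert(totalwork >= 0);
  long total = 0;
  #pragma omp parallel for schedule(dynamic) reduction(+:total)
  for (int iteration = 0; iteration < totalwork; iteration++)
  {
    total += iteration;   /* placeholder for the real work */
  }
  return total;
}
```

The code needs to be built with the compiler's OpenMP flag (for example -fopenmp with GCC) to run in parallel. Without the flag the pragma is ignored and the loop runs serially, producing the same result.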

Wednesday Nov 10, 2010

Introduction to parallel programming

My colleague, Ruud van der Pas, recorded a number of lectures on parallel programming.

Tuesday Nov 09, 2010

Multicore application programming: sample chapter

No sign of the actual books yet - I expect to see them any day now - but there's a sample chapter up on the informit site. There's also a pdf version which includes preface and table of contents.

This is chapter 3 "Identifying opportunities for parallelism". These range from the various OS-level approaches, through virtualisation, and into multithread/multiprocess. It's this flexibility that makes multicore processors so appealing. You have the choice of whether you take advantage of them through some consolidation of existing applications, or whether you take advantage of them, as a developer, through scaling a single application.

Wednesday Sep 15, 2010

Details for Oracle Open World Presentation

I'm presenting at the Develop conference in San Francisco next week. I'll be in the Nikko Ballroom I at the Hotel Nikko (map), at 4pm on Monday. The title of the talk is "Multicore Application Programming with Oracle Solaris Studio 12.2". The abstract is:

Writing correct and fast parallel applications is often considered a hard problem. However, it doesn't need to be that way. This session will describe how Oracle Solaris Studio can be used to produce applications that are both fast and correct. The talk will cover parallelization strategies, implementation details, and common pitfalls, as well as describing how the tools provided by Oracle Solaris Studio can identify coding errors and performance opportunities in the application.

There are three talks from the Studio team. Details are:

Date | Details | Location
Monday 20th, 4:00pm | S317573: Multicore Application Programming with Oracle Solaris Studio (Darryl Gove) | Hotel Nikko, Nikko Ballroom I
Tuesday 21st, 11:30am | S317590: Performance Measurement with Oracle Solaris Studio Performance Tools (Marty Itzkowitz) | Hotel Nikko, Peninsula
Wednesday 22nd, 1:00pm | S317585: Building High-Quality C/C++ Applications (Don Kretsch) | Hotel Nikko, Nikko Ballroom II

There appears to be no way to link directly to the talk details, but they are available if you search the entire programme.

Thursday Sep 09, 2010

Book update

I've just handed over the final set of edits to the manuscript. These are edits to the laid-out pages. Some are those last few grammatical errors you only catch after reading a sentence twenty times. Others are tweaks to the figures. There's still a fair amount of production work to do, but my final input will be a review of the indexing - probably next week.

So it's probably a good time to talk about the cover. This is a picture that my wife took last year. It's a picture of the globe at the cliff tops at Durlston Head near Swanage in England. It's 40 tonnes and over 100 years old. It's also surrounded by stone tablets some containing contemporary educational information, and a couple of blank ones that are there just so people can draw on them.

Wednesday Sep 08, 2010

Oracle Solaris Studio 12.2

It's been just over a year since the release of Studio 12 Update 1, and today we are releasing the first Oracle-branded Studio release - Oracle Solaris Studio 12.2. For the previous release I wrote a post for the AMD site looking at the growth in multicore processors. It seemed appropriate to take another look at this.

The chart below shows the cumulative number of SPECint2006 results broken down by the number of cores for each processor. This data does not represent the number of different types of processor that are available, since the same processor can be used in many different results. It is closer to a snapshot of how the market for multicore processors is growing. Each data point represents a system, so the curve approximates the number of different systems that are being released.

It's perhaps more dramatic to demonstrate the change using a stacked area chart. The chart perhaps overplays the number of single core results, but this is probably fair as "single core" represents pretty much all the results prior to the launch of CPU2006. So what is readily apparent is the rapid decline in the number of single core results, the spread of dual, and then quad core. It's also interesting to note the beginning of a spread of more than quad core chips.

If we look at what is happening with multicore processors in the context of what we are releasing with Solaris Studio, there's a very nice fit of features. We continue to refine our support for OpenMP and automatic parallelisation. We've been providing data race (and deadlock) detection through the Thread Analyzer for a couple of releases. The debugger and the performance analyzer have been fine with threads for a long time. The performance analyzer has the time line view which is wonderful for examining multithreaded (or multiprocess) applications.

In addition to these fundamentals Studio 12.2 introduces a bunch of new features. I discussed some of these when the express release came out:

  • For those who use the IDE, integration of support for the analysis of the runtime behaviour of applications has been very useful. It both provides more information directly back to the developer, and raises awareness of the available tools.
  • Understanding call trees is often an important part of interpreting the performance of the application. Being able to drill down the call tree has been a very useful extension to the Performance Analyzer.
  • Memory error checking is critical for all applications. The trouble with memory access errors is that, like data races, the "problem" is visible arbitrarily far from the point where the error occurred.

The release of a new version of a product is always an exciting time. It's a culmination of a huge amount of analysis, development, and testing, and it's wonderful to finally see it available for others to use. So download it and let us know what you think!

Footnote: SPEC and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Results from www.spec.org as of 6 September 2010 and this report.

Saturday Aug 07, 2010

I want my CMT

One problem with many parallel applications is that they don't scale to large numbers of threads. There are plenty of reasons why this might be the case. Perhaps the amount of work that needs to be done is insufficient for the number of threads being used.

On the other hand there are plenty of examples where codes scale to huge numbers of threads. Often they are called 'embarrassingly parallel codes', as if writing scaling codes is something to be ashamed of. The other term for this is 'delightfully parallel' which I don't really find any better!

So we have some codes that scale, and some that don't. Why don't all codes scale? There's a whole bunch of reasons:

  • Hitting some hardware constraint, like bandwidth. Adding more cores doesn't remove the constraint - although adding more processors, or systems might
  • Insufficient work. If the problem is too small, there just is not enough work to justify multiple threads
  • Algorithmic constraints or dependencies. If the code needs to calculate A and then B, there is no way that A and B can be calculated simultaneously.

These are all good reasons for poor scaling. But I think there's also another one that is, perhaps, less obvious. And that is access to machines with large numbers of cores.

Perhaps five years ago it was pretty hard to get time on a multicore system. The situation has completely reversed now. Obviously if a code is developed on a system with a single CPU, then it will run best on that kind of system. Over time applications are being "tuned" for multicore, but we're still looking at the 4-8 thread range in general. I would expect that to continue to change as access to large systems becomes more commonplace.

I'm convinced that as access to systems with large numbers of threads becomes easier, the ability of applications to utilise those threads will also increase. So all those applications that currently max out at eight threads, will be made to scale to sixteen, and beyond.

This is at the root of my optimism about multicore in general. Like many things, applications "evolve" to exploit the resources that are provided. You can see this in other domains like video games, where something new comes out of nowhere, and you are left wondering "How did they make the hardware do that?".

I also think that this will change attitudes to parallel programming. It has a reputation of being difficult. Whilst I agree that it's not a walk in the park, not all parallel programming is equally difficult. As developers become more familiar with it, coding style will evolve to avoid the common problems. Hence as it enters the mainstream, its complexity will be more realistically evaluated.

Sunday Aug 01, 2010

Multicore application programming: podcast

A few weeks back I had the pleasure of being interviewed by Allan Packer as part of the Oracle author podcast series. The podcast is a brief introduction to the book.

Wednesday Jul 28, 2010

Multicore application programming on Safari books

A roughcut of Multicore Application Programming has been uploaded to Safari books. If you have access you can read it, and provide feedback or comments. If you don't have access to Safari, you can still see the table of contents, read the preface, and view the start of each chapter.

Thursday Jul 08, 2010

Presenting at Oracle Develop

As part of Oracle Develop, I'll be presenting at the Hotel Nikko, in San Francisco on 20th September at 4pm. The session is S317573 titled "Multicore Application Programming with Oracle Solaris Studio". The abstract reads as follows:

Writing correct and fast parallel applications is often considered a hard problem. However, it doesn't need to be that way. This session will describe how Oracle Solaris Studio can be used to produce applications that are both fast and correct. The talk will cover parallelization strategies, implementation details, and common pitfalls, as well as describing how the tools provided by Oracle Solaris Studio can identify coding errors and performance opportunities in the application.

Wednesday Jul 07, 2010

Multicore application programming: update

It's 2am and I've just handed over the final manuscript for Multicore Application Programming. Those who know publishing will realise that this is not the final step. The publishers will layout my text and send it back to me for a final review before it goes to press. It will probably take a few weeks to complete the process.

I've also uploaded the final version of the table of contents. I've written the book using OpenOffice.org. It's almost certain not to be a one-to-one mapping of pages in my draft to pages in the finished book. But I expect the page count to be roughly the same - somewhere around 370 pages of text. It will be interesting to see what happens when it is properly typeset.

Wednesday Jun 30, 2010

The solution is multicore

Professor David Patterson wrote an interesting article in IEEE Spectrum, 'The trouble with multicore'. The tag line is "Chipmakers are busy designing microprocessors that most programmers can't handle". The thrust of the article is that multicore processors are a hardware development that software is poorly equipped to utilise.

There are two main arguments made in the article. The first is that programming languages are very poor at describing parallelism. There has been a long list of languages that were either designed to tackle parallelism or have had parallelism imposed upon them. To be fair parallel programming is littered with the ill-conceived corpses of languages that were meant to solve the problem. So his view is correct, but perhaps this is not relevant.

The second point he makes is that not all tasks break down to independent work. His example is that of ten reporters writing the same story, and not being able to write the story ten times faster because each section of text has to build on the previous sections. Again, this is true. There are some tasks that have implicit (or explicit) dependencies, but perhaps this is not relevant.

The example in his paper that best illustrates how multicore is the solution, and not the problem, is that of cloud computing. As he says "Expert programmers can take advantage of the task-level parallelism inherent in cloud computing.". Are you an expert programmer when you type a search term into Google? A lot of computation goes into finding the results for you, but they appear nearly instantly. It could be argued that Google put a considerable amount of effort into designing a system that produced results so quickly. Of course they did. However, they did it once, and it's used for millions of search queries every day.

Observation 1: Many problems just need parallelising once. Or conversely, not every developer needs to worry about the parallelism – in the same way as not every developer on a project needs to worry about the GUI.

But this only addresses part of the argument. It is all very well using an anecdotal example to demonstrate that it is possible to utilise multiple cores, but that does not disprove Professor Patterson's argument.

Let's return to the example of the reporters. The way the reporters are working is perhaps not the best use of their resources. Much of the work of reporting is fact checking, talking to people, and gathering data. The writing part of this is only the final step in a long pipeline. Perhaps a better way of utilising the ten reporters would be during the data gathering stages: multiple people could be interviewed simultaneously, and multiple sources consulted at the same time. On the other hand, a newspaper would rarely allocate more than a single reporter to a single story. More progress would be made if each reporter was working on a different story. So perhaps the critical observation is that dependencies within a task are an indication that parallelism needs to be discovered outside that task.

Observation 2: It is rare that there are no other ways of productively utilising compute resources. Meaning that given a number of cores, it is almost always possible to find work to keep them busy. For example, rendering a movie could have cores working on separate frames, or separate segments of the same frame. Sequencing genes could have multiple genes being examined simultaneously. Simulation models of different scenarios could be completed in parallel.

But, it can be argued that there are times when you need to do a single task, and you care how long that task takes to complete. So, let's consider exactly what problems we encounter during our day where we would benefit from a faster processor.

  • "I waited for my PC to boot.". Well booting a PC is pretty much a serial process, however, the boot time is largely dominated by disk access time rather than processor speed.
  • "I waited for my e-mail to download". Any downloading activity, be it e-mail or webpages is going to be dominated by network latency or bandwidth issues. There is undoubtedly some processor activity in the mix, but it is unlikely that a fast processor would make a noticeable difference to performance.
  • "I was watching a video when my virus scanner kicked in and caused the movie to stutter." Assuming it wasn't caused by disk activity, this is a great example of where having multiple cores will help rather than hinder. Two cores would allow the video to continue playing while the virus scanner did its work. This was, of course, the frequently given example of why multicore processors were a good thing – as if virus scanning were a desirable use of processor time!
  • "I was compiling an application and it took all afternoon." Some stages of compilation, like linking or crossfile optimisation, are inherently serial. But, unless the entire source code was placed into a single file, most projects have multiple source files, so these could be compiled in parallel. Again, the performance can be dominated by disk or network performance, so it is not entirely a processor performance issue.

These are a few situations where you might possibly feel frustration at the length of time a task takes. You may have plenty more. The point is that it is rare that there is no parallelism available, and no opportunity to make parallel progress on some other task.

Observation 3: There are very few day to day tasks that are actually limited by processor performance. Most tasks have substantial bottlenecks in other parts of the system (disk, network, speed of devices). If anything having multiple cores enables a system to remain useful while other compute tasks are completed.

All this discussion has not truly refuted Professor Patterson's observation that there exist problems which are inherently serial, or fiendishly difficult to parallelise. But that's ok. Most commonly encountered computational activities are either easy to parallelise, or there are ways of extracting parallelism at other levels.

But what of software? There is great allure to using threads on a multicore processor to deliver many times the performance of a single core processor. And this is the crux of the matter. Advances in computer languages haven't 'solved' this problem for us. It can still be hard, for some problems, to write parallel programs that are both functionally correct and scale well.

However, we don't all need to solve the hard problems. There are plenty of opportunities for exploiting parallelism in a large number of common problems, and in other situations there are opportunities for task level parallelism. This combination should cover 90+% of the problem space.

Perhaps there are 10% of problems that don't map well to multicore processors, but why focus on those when the other 90% do?

Monday May 17, 2010

Multicore application programming: Table of contents

I've uploaded the current table of contents for Multicore Application Programming. You can find all the detail in there, but I think it's appropriate to talk about how the book is structured.

Chapter 1. The design of any processor has a massive impact on its performance. This is particularly true for multicore processors since multiple software threads will be sharing hardware resources. Hence the first chapter provides a whistle-stop tour of the critical features of hardware. It is important to do this up front as the terminology will be used later in the book when discussing how hardware and software interact.

Chapter 2. Serial performance remains important, even for multicore processors. There are two main reasons for this. The first is that a parallel program is really a bunch of serial threads working together, so improving the performance of the serial code will improve the performance of the parallel program. The second reason is that even a parallel program will have serial sections of code. The performance of the serial code will limit the maximum performance that the parallel program can attain.
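The limit imposed by serial code is usually expressed as Amdahl's law. As a rough sketch (the function name amdahl_speedup is my own, and it assumes the parallel part scales perfectly with the number of threads):

```c
#include <assert.h>

/* Hypothetical helper: best-case speedup under Amdahl's law.
   serial_fraction is the share of the runtime that cannot be
   parallelised; the rest is assumed to scale perfectly. */
double amdahl_speedup(double serial_fraction, int nthreads)
{
  assert(nthreads > 0);
  assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
  return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nthreads);
}
```

With 10% of the runtime serial, eight threads give a speedup of about 4.7 rather than 8, and no number of threads can take the speedup beyond 10.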

Chapter 3. One of the important aspects of using multicore processors is identifying where the parallelism is going to come from. If you look at any system today, there are likely to be many active processes. So at one level no change is necessary; systems will automatically use multiple cores. However, we want to get beyond that, and so the chapter discusses approaches like virtualisation as well as discussing the more obvious approach of multi-thread or multi-process programming. One message that needs to be broadcast is that multicore processors do not need a rewrite of existing applications. However, getting the most from a multicore processor may well require that.

Chapter 4. The book discusses Windows native threading, OpenMP, automatic parallelisation, as well as the POSIX threads that are available on OS-X, Linux, and Solaris. Although the details do sometimes change across platforms, the concepts do not. This chapter discusses synchronisation primitives like mutex locks and so on, which avoids having to repeat that information in the implementation chapters.

Chapter 5. This chapter covers POSIX threads (pthreads), which are available on Linux, OS-X, and Solaris, as well as other platforms not covered in the book. The chapter covers multithreaded as well as multiprocess programming, together with methods of communicating between threads and processes.

Chapter 6. This chapter covers Windows native threading. The function names and the parameters that need to be passed to them are different to the POSIX API, but the functionality is the same. This chapter provides the same coverage for Windows native threads that chapter 5 provides for pthreads.

Chapter 7. The previous two chapters provide a low level API for threading. This gives very great control, but provides more opportunities for errors, and requires considerable lines of code to be written for even the most basic parallel code. Automatic parallelisation and OpenMP place more of the burden of parallelisation on the compiler, less on the developer. Automatic parallelisation is the ideal situation, where the compiler does all the work. However, there are limitations to this approach, and this chapter discusses the current limitations and how to make changes to the code that will enable the compiler to do a better job. OpenMP is a very flexible technology for writing parallel applications. It is widely supported and provides support for a number of different approaches to parallelism.

Chapter 8. Synchronisation primitives provided by the operating system or compiler can have high overheads. So it is tempting to write replacements. This chapter covers some of the potential problems that need to be avoided. Most applications will be adequately served by the synchronisation primitives already provided, but the discussion in the chapter provides insight into how hardware, compilers, and software can cause bugs in parallel applications.

Chapter 9. The difference between a multicore system and a single core system is in its ability to simultaneously handle multiple active threads. The difference between a multicore system and a multiprocessor system is in the sharing of processor resources between threads. Fundamentally, the key attribute of a multicore system is how it scales to multiple threads, and how the characteristics of the application affect that scaling. This chapter discusses what factors impact scaling on multicore processors, and also what benefits multicore processors bring to parallel applications.

Chapter 10. Writing parallel programs is a growing and challenging field. The challenges come from producing correct code and getting the code to scale to large numbers of cores. There are some approaches that provide high numbers of cores, and there are other approaches which address issues of producing correct code. This chapter discusses a large number of other approaches to programming parallelism.

Chapter 11. The concluding chapter of the book reprises some of the key points of the previous chapters, and tackles the question of how to write correct, scalable, parallel applications.

Tuesday May 11, 2010

New Book: Multicore application programming

I'm very pleased to be able to talk about my next book Multicore Application Programming. I've been working on this for some time, and it's a great relief to be able to finally point to a webpage indicating that it really exists!

The release date is sometime around September/October. Amazon has it as the 11th October, which is probably about right. It takes a chunk of time for the text to go through editing, typesetting, and printing, before it's finally out in the shops. The current status is that it's a set of documents with a fair number of virtual sticky tags attached indicating points which need to be refined.

One thing that should immediately jump out from the subtitle is that the book (currently) covers Windows, Linux, and Solaris. In writing the book I felt it was critical to try and bridge the gaps between operating systems, and avoid writing it about only one.

Obviously the difference between Solaris and Linux is pretty minimal. The differences with Windows are much greater, but, when writing to the Windows native threading API, the actual differences are more syntactic than functional.

By this I mean that the name of the function changes, the parameters change a bit, but the meaning of the function call does not change. For example, with POSIX threads you might call pthread_create(), whereas on Windows you might call _beginthreadex(); the name of the function changes, there are a few different parameters, but both calls create a new thread.
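As a rough illustration of how similar the two APIs are, here is a sketch that creates one thread and waits for it on either platform. The names work and run_in_thread are hypothetical, and this is not taken from the book.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#include <process.h>
#else
#include <pthread.h>
#endif

static int result = 0;  /* written by the thread, read after the join */

#ifdef _WIN32
/* Windows thread entry point: returns unsigned and uses __stdcall */
static unsigned __stdcall work(void *param)
{
  result = *(int *)param;
  return 0;
}
#else
/* POSIX thread entry point: returns void* */
static void *work(void *param)
{
  result = *(int *)param;
  return NULL;
}
#endif

/* Hypothetical helper: runs work() in a new thread and waits for it. */
int run_in_thread(int value)
{
#ifdef _WIN32
  uintptr_t handle = _beginthreadex(NULL, 0, work, &value, 0, NULL);
  assert(handle != 0);
  WaitForSingleObject((HANDLE)handle, INFINITE);  /* equivalent of a join */
  CloseHandle((HANDLE)handle);
#else
  pthread_t thread;
  int rc = pthread_create(&thread, NULL, work, &value);
  assert(rc == 0);
  pthread_join(thread, NULL);
#endif
  return result;
}
```

The two branches have the same shape: create a thread with a start routine and an argument, then wait for it to finish. Only the names, return types, and a few extra parameters differ.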

I'll write a follow up post containing more details about the contents of the book.

Tuesday Feb 23, 2010

Presenting at the SVOSUG on Thursday

I'm presenting at the Silicon Valley OpenSolaris Users Group on Thursday evening. I was only asked today, so I'm putting together some slides this evening on "Multicore Application Programming". The talk is going to be a relatively high level presentation on writing parallel applications, and how the advent of multicore or CMT processors changes the dynamics.

Wednesday Feb 03, 2010

Little book of semaphores

Interesting read that demonstrates that there's much more to semaphores than might be expected.

Wednesday Jul 01, 2009

Introduction to parallel programming

My colleague, Ruud van der Pas, has recorded a series of seven webcasts on parallel programming which will be released on the HPC portal. Ruud is an expert on parallel programming, and one of the authors of the book "Using OpenMP".

Tuesday Jun 23, 2009

Sun Studio 12 Update 1

Sun Studio 12 Update 1 went live yesterday. It's still a free download, and it's got a raft of new features. Many people will have been using the express releases, so they will already be familiar with the improvements.

It's been about two years since Sun Studio 12 came out, and the most obvious change in that time is the prevalence of multicore processors. I figured the easiest way to discern this would be to look at the submissions of SPEC CPU2006 results in that time period. The following chart shows the cumulative number of SPEC CPU2006 Integer speed results over that time broken down by the number of threads that the chip was capable of supporting.

Ok, the first surprising thing about the chart is that there's very few single threaded chips. There were a few results when the suite was launched back in 2006, but nothing much since. What is more apparent is the number of dual-thread chips, that was where the majority of the market was. There were also a number of quad-thread chips at that point. If we fast-forward to the situation today, we can see that the number of dual-thread chips has pretty much leveled off, the bulk of the chips are capable of supporting four threads. But you can see the start of a ramp of chips that are capable of supporting 6 or 8 simultaneous threads.

The relevance of this chart to Sun Studio is that Sun Studio has always been a tool that supports the development of multi-threaded applications. Every release of the product improves on the support in the previous release. Sun Studio 12 Update 1 includes improvements in three areas: the compiler's ability to automatically parallelise codes (after all, the easiest way to develop parallel applications is to have the compiler do it for you); support for parallelisation specifications like OpenMP, with this release supporting the latest OpenMP 3.0 specification; and the tools' ability to give the developer meaningful feedback about parallel code, for example the ability of the Performance Analyzer to profile MPI code.

Footnote: SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Benchmark results stated above reflect results posted on www.spec.org as of 15 June 2009.

Friday May 02, 2008

Embedded Systems Conference Presentation

I got the opportunity to present at the Embedded Systems Conference in San Jose a couple of weeks back. My presentation covered parallelising a serial application: a quick tour of what to do, together with an overview of the tools that Sun Studio provides to help out. The presentation is now available on the OpenSPARC website.

Tuesday Apr 29, 2008

Multicore expo available - Microparallelisation

My presentation "Strategies for improving the performance of single threaded codes on a CMT system" has been made available on the OpenSPARC site.

The presentation discusses "microparallelisation" in the context of parallelising an example loop. Microparallelisation aims to obtain parallelism by assigning small chunks of work to discrete processors. Taking a step back...

With traditional parallelisation the idea is to identify large chunks of work that can be split between multiple processors. The chunks of work need to be large to amortise the synchronisation costs. This usually means that the loops have a huge trip count.

The synchronisation costs are derived from the time it takes to signal that a core has completed its work. The lower the synchronisation costs, the smaller the amount of work needed to make parallelisation profitable.

Now, a CMT processor has two big advantages here. First of all it has many threads. Secondly these threads have low latency access to a shared level of cache. The result of this is that the cost of synchronisation between threads is greatly reduced, and therefore each thread is free to do a smaller chunk of work in a parallel region.

All that's great in theory. The presentation uses some example code to try this out, and discovers, rather fortunately, that the idea also works in practice!
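The example code from the presentation isn't reproduced in this post, so the following is only a minimal sketch of the kind of loop being discussed, assuming OpenMP as the parallelisation mechanism. The function name scale_small and its parameters are hypothetical, chosen purely for illustration. The point is that the trip count is small, so the loop is only worth parallelising when synchronisation between threads is cheap.

```c
/* Hypothetical example, not the code from the presentation.
 * Scales a short vector. With a trip count this small, traditional
 * parallelisation rarely pays off because synchronisation costs dominate;
 * microparallelisation relies on those costs being low. */
void scale_small(double *dst, const double *src, double factor, int n)
{
    /* schedule(static) gives each thread one contiguous chunk of iterations.
     * If the compiler is not asked to honour OpenMP, the pragma is ignored
     * and the loop simply runs serially. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        dst[i] = src[i] * factor;
    }
}
```

Calling scale_small(dst, src, 2.0, 8) on an eight-element array gives each thread only a handful of iterations. Whether that is profitable depends entirely on how cheaply the threads can synchronise at the end of the loop, which is exactly the property a CMT processor is expected to improve.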

The presentation also covers using atomic operations rather than microparallelisation.
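As a rough illustration of what an atomic operation buys you, here is a small sketch, again assuming OpenMP and not taken from the presentation. The function count_above is a hypothetical name. Several threads update one shared counter, and the atomic update keeps the increments from being lost without needing a lock.

```c
/* Hypothetical example, not the code from the presentation.
 * Counts how many values exceed a threshold. Every thread updates the same
 * shared counter, and the atomic directive makes each increment indivisible. */
long count_above(const int *values, int n, int threshold)
{
    long count = 0;

    #pragma omp parallel for shared(count)
    for (int i = 0; i < n; i++) {
        if (values[i] > threshold) {
            /* Without this directive, concurrent increments could be lost. */
            #pragma omp atomic
            count++;
        }
    }
    return count;
}
```

An atomic update is typically much cheaper than acquiring a mutex, but all threads still contend for the same cache line, so this works best when updates to the shared variable are relatively infrequent.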

In summary, the presentation is more research than solid science, but I hoped that presenting it would get some people thinking about non-traditional ways to extract parallelism from applications. I'm not alone in this area of work; Lawrence Spracklen is also working on it. We're both presenting at CommunityOne next week.

Tuesday Mar 25, 2008

Conference schedule

The next two months are likely to be a bit hectic for me. I'm presenting at three different conferences, as well as a chat session in Second Life. So I figured I'd put the information up in case anyone reading this is also going to one or other of the events. So in date order:

I'll be talking about parallelisation at the various conferences, but the talks will be different. The Multicore Expo talk focuses on microparallelisation. The ESC talk will probably be higher level, and the CommunityOne talk will probably be wider ranging and, I hope, more interactive.

In the Second Life event I'll be talking about the book, although the whole idea of appearing is to do Q&A, so I hope that will be more of a discussion.

About

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
