Multicore? Threads? Parallel Programming Models? OpenMP? MPI? Parallel Architectures? Amdahl's Law? Efficiency? Overhead?
If you're interested in what these mean, plus other topics fundamental to parallel programming, read on!
With great pleasure I announce the availability of a comprehensive technical white paper, titled "Parallel Programming with Oracle Developer Tools". It targets the developer new to parallel programming. No background in this topic is assumed. The paper is available through the Oracle Solaris Studio web page and can also be downloaded directly here.
Quite often I get asked how to get started with parallel programming. There is a lot of training material in the form of books, online tutorials and blogs available, but most of these focus on a few specific and specialized topics only. Where and how to begin can therefore be an overwhelming problem to the developer who is interested to apply parallelization as a way to further enhance the performance of his or her application.
For a number of years I've given talks that cover the various aspects of parallel programming, targeting the developer who wants to learn more about this topic. What was missing was a write up of these talks. To address this gap, I started working on a comprehensive technical white paper on the basics of parallel programming and am very glad it is out now. The paper will help you to get started with parallel programming and along the way you'll learn how to use the Oracle Solaris Studio Compilers and Tools to get the job done.
I would like to encourage you to download and read the paper, but for those that like to get more detail on the contents first, a fairly extensive summary of the contents can be found below.
Enjoy the paper and I welcome your feedback!
Summary of "Parallel Programming with Oracle Developer Tools"
The paper starts with a brief overview of multicore technology. This is after all what drives the increased and more widespread interest in parallel computing.
In the next chapter, some important terminology is explained. Since it plays such a crucial role in parallel programming, the concept of a "thread" is covered first. The goal of parallelization is to reduce the execution time of an application. It is the next topic and may seem trivial, but I found that not everybody is aware of the fact that a performance improvement is not a given. Here, the stage is set for a more extensive discussion on parallel performance in a later chapter. The chapter concludes with a definition of parallelization.
The chapter following is about parallel architectures. One can develop and run a parallel program on any computer, even on one with a single core only, but clearly multiple cores are needed if a performance gain is to be expected. Here an overview is given of the types of basic parallel architectures available today.
The choice of a specific platform not only affects the performance, but to some extent is also determined by the parallel programming model chosen to implement the parallelism. That is the topic of the next chapter.
There are many ways to implement parallelism in an application. In order to do so, one has to select a parallel programming model. This choice is driven by several factors, including the programming language used, portability, the type of application and parallelism, the target architecture(s) and personal preferences.
An important distinction is whether the parallel application is to be run on a single parallel computer system ("shared memory'), or across a cluster of systems ("distributed memory"). This choice has a profound impact on the choice of a programming model, since only a few models support a cluster architecture.
In the chapter, several programming models for both types of architectures are presented and discussed, but by no means is this an extensive overview of the entire field. That is a big topic in itself and beyond the scope of the paper.
The more in-depth part of the chapter starts with Automatic Parallelization by the compiler. Through a compiler switch, the user requests the compiler to identify those parts in the program that can be parallelized. If such an opportunity is found, the compiler generates the parallel code for the user and no extra effort is needed. The Oracle Solaris Studio compilers support this feature.
We then zoom in on OpenMP for shared memory and MPI for distributed memory programming. These are explicit programming models to parallelize an application and have been selected because they are the dominant parallel programming models in technical computing. They are however strong candidates to parallelize other types of applications too.
The chapter concludes with the Hybrid programming model, combining two parallel programming models. For example, MPI is used to parallelize the application at a fairly high level. The more fine grained parts are then further parallelized with OpenMP, or another shared memory model. In certain cases this is a natural way to parallelize an application. The Hybrid model also provides a natural fit for today's systems, since many consist of a cluster with multicore nodes.
The next chapter is very extensive and covers an example in great detail. The computation of the average of a set of numbers was chosen, since this is a real world type of operation and parallelizing it is not entirely straightforward.
In the first section, Automatic Parallelization by the Oracle Solaris Studio compilers is introduced and demonstrated on this example. The compiler is able to identify the parallelism in this computation and generates a parallel binary without the need for the user to do anything, other than using some compiler switches (-xautopar and -xreduction to be more precise).
Next, a general strategy how to explicitly parallelize this computation is given. This provides the general framework for the various parallel implementations.
Using this framework, the parallel version of the computation is then implemented using OpenMP, MPI and the MPI+OpenMP Hybrid model. Full source code for all 3 implementations is shown and discussed in great detail. Throughout, it is demonstrated how the Oracle Solaris Studio compilers and the Oracle Message Passing Toolkit can be used to compile and run these parallel versions.
Now that the parallel program has been designed and implemented, it is time to consider other, more advanced, aspects of parallel computing. These topics are covered in the next chapter. They are not needed to get started, but are important enough to read up on. For example, how parallelization may affect the round off behavior in case floating-point numbers are used.
The majority of the chapter is however dedicated to performance, since that is after all the goal of parallelization. In addition to parallel overheads, parallel speed up and efficiency, Amdahl's Law is derived and discussed in quite some detail. This formula plays a very important role to understand the measured performance of a parallel application. As shown, it can also be used to assert how well the application has been parallelized and what performance to expect when increasing the number of threads.
The last chapter presents performance results obtained with the 3 different parallel versions discussed earlier. It is demonstrated how the performance of these various implementations depends on the number of threads.