Monday Sep 02, 2013

READ_ME_FIRST: What Do I Do With All Of Those SPARC Threads?

This is a technical white paper with a fairly challenging title, but it actually describes the contents quite well.

We wrote this paper because the new Oracle SPARC T5 and M5 servers provide so much main memory, and so many cores and threads, that one may wonder how best to manage and deploy such a system. They are quite unique in the market.

This is why we joined forces and set ourselves the goal of providing a holistic view on this topic.

The paper is written in a modular way and readers can select the individual topics they're interested in, but of course we hope you'll read it front to back and find it useful. Perhaps the Glossary at the end comes in handy too. 

The first part covers the processor and system architectures, but only to the extent we felt is needed for the remainder of the paper. There are several other white papers that go into an awful lot more detail on this.

The next part targets those developing (or considering developing) parallel applications and looking for tips and tricks on what choices need to be made and how to make real-world codes scale. That is no mean feat, but it is rewarding and long lasting. Think about it: the trend is upward, and the size of systems can be expected to continue to scale up. Any investment made today in improving scalability is going to help. There is a learning curve as well, and the sooner one begins, the better.

We feel these chapters are, however, also of use to those not directly involved in writing parallel code. They help one understand what happens under the hood and may explain things one has observed experimentally, for example why there may be diminishing returns when adding more and more threads.

The third part covers the virtualization features available and how these can be used to configure the system to one's needs, perhaps to run legacy applications that require an older software environment side by side with applications running in a more modern environment. On top of that, each such application can be multi-threaded, providing the optimal configuration per application.

The paper concludes with a brief coverage of key Solaris features. Not everybody realizes the importance of a scalable OS and how much on-going engineering investment is needed to continue to improve the scalability of Solaris.

We hope you find this technical white paper useful and of course feedback is encouraged! 

Sunday Aug 05, 2012

IWOMP 2012

IWOMP stands for "International Workshop on OpenMP". It is held once a year, rotating across the US, Asia and Europe, and has taken place every year since it started in 2005. It is the place to be for those interested in OpenMP. The talks cover usage of OpenMP, performance, suggestions for new features and updates on upcoming features. There is usually also a tutorial day prior to the workshop.

June 11-13, 2012, CASPUR in Rome, Italy, hosted IWOMP 2012. This was the 8th workshop and, as always, the event was very well organized, with a variety of high-quality talks and two interesting keynote speakers. The beautiful and very special Horti Sallustiani had been chosen as the workshop venue.

This year was rather special since Bjarne Stroustrup, the creator of C++, had accepted the invitation to give the opening keynote talk. The focus of his presentation was how to use C++ to write more reliable and robust code.

Most presentations are online and can be found here. Several of the tutorial presentations can be downloaded from the tutorial page. The tutorial material includes a full overview of OpenMP.

The proceedings are also available. They are published by Springer and are part of the LNCS series. More information can be found here.

The precise details of IWOMP 2013 are not known yet, but the candidate location is definitely very interesting. Stay tuned for more details on this! 

Sunday Dec 05, 2010

Video Lectures on Parallel Programming

A series of 7 short video introductions to parallel programming is now available online. They can be viewed here. Each video is about 10-15 minutes long and covers a specific topic, or topics. Although it is recommended to view them in sequence, this is not required.

The slides are available as well and can be downloaded from the same web page.

Readers interested in these topics may also want to take a look at the "Parallel Programming with Oracle Developer Tools" white paper. There is little overlap between the two. The videos cover the big picture, whereas the paper goes into much more detail on the programming aspects.

The first video sets the stage. It covers the more general topic of application performance tuning, summarizing the different ways to optimize an application, with parallelization being one of them.

The second video is about multicore architectures. The main purpose of this talk is to underline that multicore is here today and supported by all major microprocessor manufacturers. Admittedly, the processors covered here have been replaced by their respective successors, but that is in the nature of the beast. This topic is after all a moving target, since new microprocessors appear on the horizon all the time. For the purpose of the talks given here, the processor details are not relevant however. Of course there is an impact on the performance, but when parallelizing an application, the details of the processor architecture need not be taken into account.

The most important thing to realize is that there are three trends that are going to stay for a long time to come. First of all, multicore is not going to go away; developers can count on any new general-purpose microprocessor to support parallel execution in hardware. Secondly, the number of threads continues to increase. Thirdly, the cost of a thread continues to come down.

The next video covers parallel architectures. There is even more variety in system architectures than in processors. This is why the systems described here are generic and covered at the block-diagram level. The goal is to help understand, at a high level, the differences between an SMP system, a single/multicore cluster and a cc-NUMA system.

Video number 4 is another conceptual talk, but with an exclusive focus on parallel programming. This is where you'll find more about topics like threads, parallel overheads and Amdahl's law.

By then, all necessary information to get started writing a parallel program has been covered and it is high time to dig deeper into parallel programming.

The fifth video covers the Message Passing Interface (MPI) in some detail. This is a widely used distributed memory model, targeting a cluster of systems. MPI has been around for quite some time, but is still alive and kicking. Many other recent distributed memory programming models (e.g. CUDA) rely on the same principles. This is why the information presented here is of more general use than to MPI users alone.

The next video is about shared memory parallel programming, starting with automatic parallelization. It is often overlooked, but absolutely worth a try: the mechanism is activated simply by using the appropriate compiler option. Success or failure depends on many factors, but compiler technology continues to improve and can handle increasingly complex code structures.

OpenMP is the second shared memory model covered in this video. It is a mature and easy-to-use directive-based model to explicitly parallelize applications for multicore-based systems. Due to the multicore revolution, interest in OpenMP has never been as strong as it is today.

Clusters of multicore-based systems are more and more common. The question is how to program them. This is where the Hybrid parallel programming model comes into the picture. It is the topic of the short seventh video. With a Hybrid model, a distributed memory model (like MPI) is used to parallelize the application across the nodes of the cluster. Within the node, one can use MPI (or a similar model) again, but since the overhead of such models tends to be relatively high, a more lightweight model like OpenMP (or a native threading model like POSIX threads) is often more efficient.

This last video only touches upon this important and interesting topic. Those interested to learn much more about it may want to read the appropriate chapters in my white paper on parallel programming.

The title of this seventh talk includes "What's Next?". It is very hard to predict what's down the road two or more years from now, but it is very safe to assume that parallel computing is here to stay and will only get more interesting. Anybody developing software is strongly advised to look into parallel programming as a way to enhance the performance of his or her application.

I would like to acknowledge Richard Friedman and Steve Kimmey at Oracle, as well as Deirdré Straughan (now at Joyent), for their support in creating these videos and making them available to a wide audience.

Sunday May 09, 2010

Parallel Programming with Oracle Developer Tools

Multicore? Threads? Parallel Programming Models? OpenMP? MPI? Parallel Architectures? Amdahl's Law? Efficiency? Overhead?

If you're interested in what these mean, plus other topics fundamental to parallel programming, read on!

With great pleasure I announce the availability of a comprehensive technical white paper, titled "Parallel Programming with Oracle Developer Tools". It targets the developer new to parallel programming; no background in this topic is assumed. The paper is available through the Oracle Solaris Studio web page and can also be downloaded directly here.

Quite often I get asked how to get started with parallel programming. There is a lot of training material available in the form of books, online tutorials and blogs, but most of these focus on a few specific and specialized topics only. Where and how to begin can therefore be an overwhelming question for the developer who is interested in applying parallelization as a way to further enhance the performance of his or her application.

For a number of years I've given talks that cover the various aspects of parallel programming, targeting the developer who wants to learn more about this topic. What was missing was a write-up of these talks. To address this gap, I started working on a comprehensive technical white paper on the basics of parallel programming, and I am very glad it is out now. The paper will help you get started with parallel programming, and along the way you'll learn how to use the Oracle Solaris Studio compilers and tools to get the job done.

I would like to encourage you to download and read the paper, but for those who would like more detail on the contents first, a fairly extensive summary can be found below.

Enjoy the paper and I welcome your feedback!

Summary of "Parallel Programming with Oracle Developer Tools"

The paper starts with a brief overview of multicore technology. This is after all what drives the increased and more widespread interest in parallel computing.

In the next chapter, some important terminology is explained. Since it plays such a crucial role in parallel programming, the concept of a "thread" is covered first. The goal of parallelization, namely to reduce the execution time of an application, is the next topic. It may seem trivial, but I found that not everybody is aware of the fact that a performance improvement is not a given. Here, the stage is set for a more extensive discussion on parallel performance in a later chapter. The chapter concludes with a definition of parallelization.

The chapter following is about parallel architectures. One can develop and run a parallel program on any computer, even on one with a single core only, but clearly multiple cores are needed if a performance gain is to be expected. Here an overview is given of the types of basic parallel architectures available today.

The choice of a specific platform not only affects the performance, but to some extent is also determined by the parallel programming model chosen to implement the parallelism. That is the topic of the next chapter.

There are many ways to implement parallelism in an application. In order to do so, one has to select a parallel programming model. This choice is driven by several factors, including the programming language used, portability, the type of application and parallelism, the target architecture(s) and personal preferences.

An important distinction is whether the parallel application is to be run on a single parallel computer system ("shared memory"), or across a cluster of systems ("distributed memory"). This choice has a profound impact on the choice of a programming model, since only a few models support a cluster architecture.

In the chapter, several programming models for both types of architectures are presented and discussed, but by no means is this an extensive overview of the entire field. That is a big topic in itself and beyond the scope of the paper. 

The more in-depth part of the chapter starts with Automatic Parallelization by the compiler. Through a compiler switch, the user requests the compiler to identify those parts in the program that can be parallelized. If such an opportunity is found, the compiler generates the parallel code for the user and no extra effort is needed. The Oracle Solaris Studio compilers support this feature.

We then zoom in on OpenMP for shared memory and MPI for distributed memory programming. These are explicit programming models to parallelize an application and have been selected because they are the dominant parallel programming models in technical computing. They are however strong candidates to parallelize other types of applications too. 

The chapter concludes with the Hybrid programming model, combining two parallel programming models. For example, MPI is used to parallelize the application at a fairly high level. The more fine grained parts are then further parallelized with OpenMP, or another shared memory model. In certain cases this is a natural way to parallelize an application. The Hybrid model also provides a natural fit for today's systems, since many consist of a cluster with multicore nodes.

The next chapter is very extensive and covers an example in great detail. The computation of the average of a set of numbers was chosen, since this is a real world type of operation and parallelizing it is not entirely straightforward.

In the first section, Automatic Parallelization by the Oracle Solaris Studio compilers is introduced and demonstrated on this example. The compiler is able to identify the parallelism in this computation and generates a parallel binary without the need for the user to do anything, other than using some compiler switches (-xautopar and -xreduction to be more precise).

Next, a general strategy for explicitly parallelizing this computation is given. This provides the general framework for the various parallel implementations.

Using this framework, the parallel version of the computation is then implemented using OpenMP, MPI and the MPI+OpenMP Hybrid model. Full source code for all 3 implementations is shown and discussed in great detail. Throughout, it is demonstrated how the Oracle Solaris Studio compilers and the Oracle Message Passing Toolkit can be used to compile and run these parallel versions.

Now that the parallel program has been designed and implemented, it is time to consider other, more advanced, aspects of parallel computing. These topics are covered in the next chapter. They are not needed to get started, but are important enough to read up on. For example, how parallelization may affect the round-off behavior in case floating-point numbers are used.

The majority of the chapter is however dedicated to performance, since that is after all the goal of parallelization. In addition to parallel overheads, parallel speed-up and efficiency, Amdahl's Law is derived and discussed in quite some detail. This formula plays a very important role in understanding the measured performance of a parallel application. As shown, it can also be used to assess how well the application has been parallelized and what performance to expect when increasing the number of threads.
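For reference, the standard formulation of Amdahl's Law (not quoted from the paper) is as follows, where f is the fraction of the execution time that is parallelized and P is the number of threads:

```latex
S(P) = \frac{1}{(1 - f) + \dfrac{f}{P}}, \qquad
\lim_{P \to \infty} S(P) = \frac{1}{1 - f}
```

Even a modest serial fraction caps the speed-up: with f = 0.95, the speed-up can never exceed 20, no matter how many threads are used.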

The last chapter presents performance results obtained with the 3 different parallel versions discussed earlier. It is demonstrated how the performance of these various implementations depends on the number of threads.

Thursday Apr 02, 2009

Using OpenMP - The Examples

It is with great pleasure that I announce the availability of 41 OpenMP example programs. All these examples are introduced and discussed in the book "Using OpenMP". 

 Cover page of the book "Using OpenMP"

The sources are available as a free download under the BSD license. Each source comes with a copy of the license. Please do not remove this.

The zip file that contains everything needed (sources, example make files for various compilers, and a brief user's guide) can be downloaded here.

Although it was tempting at times to deviate, the core part of each example has not been changed from what is shown in the book.

You are encouraged to try out these examples and perhaps use them as a starting point to better understand and possibly further explore OpenMP. 

Simplicity was an important goal for this project. For example, each source file constitutes a full working program. Other than a compiler and run-time environment that support OpenMP, nothing else is needed to compile and run the program(s).

With the exception of one example, there are no source code comments. Not only are these examples very straightforward, they are also discussed in the above mentioned book. 

As a courtesy, each source directory has a straightforward make file called "Makefile". This file can be used to build and run the examples in the specific directory. Before doing so, you need to activate the appropriate include line in file Makefile. There are include files for several compilers on Unix based Operating Systems (Linux, Solaris and Mac OS to be precise). These files have been put together on a best effort basis.

The User's Guide that is bundled with the examples explains the compiler specific include file for "make" in more detail.

I would like to encourage you to submit your feedback, questions, suggestions, etc. to the forum "Using OpenMP - The Book and Examples".


Tuesday Mar 31, 2009

PPCES 2009, Aachen, Germany, March 23-27, 2009

The RWTH Aachen University in Aachen, Germany, organized and hosted the first "Parallel Programming in Computational Science and Engineering" (PPCES) HPC tutorial series, held March 23-27, 2009. I participated throughout the week and presented several times.

This tutorial was a natural follow-on to the "SunHPC" workshops held from 2001-2007 and the combined SunHPC 2008 and VI-HPS event in 2008. 

This first PPCES tutorial week was very well attended and the group actively participated. Many of the talks have been recorded and will appear online. I'll add a link when they are available.

Below I include some pictures of Aachen, with its beautiful historic old city.

Parts of the outer wall are still present. This is the Pontwall, near the Pontstraße, where one enters the city from the A4 (Aachen-Laurensberg exit). The second picture was taken from the back side, facing the city.

Pontwall, Aachen

Pontwall, Aachen

The market square is the most prominent place in the old city. There are several restaurants and shops, but literally the most visible building is the huge and beautiful town hall. Below is a picture of this building as seen from the square, plus a shot taken at the back side.

Aachen Town Hall

Aachen Town Hall, view from the back side

The huge cathedral is a true landmark and is on the UNESCO World Heritage List. It is also the burial site of Charlemagne. In the first picture below it can be seen on the left. The second picture gives a closer view. The third picture was taken from the other side, from the small square between the town hall and the cathedral.

Cathedral is seen on the left Aachen cathedral up close

Aachen cathedral as seen from the other side

The picture below was taken in one of the small streets near the big market square, on the only day the weather was relatively good while I was there. It was somewhat chilly and windy, but shielded from the wind one could sit outside, as shown by the people at the end of this street.

Small street near the market square

This fountain is very funny to see and a great attraction, for children in particular. It is fascinating to them that you can turn the hands around. The fountain is in another fairly narrow street, connecting the market square and the cathedral. On busy days it can be really crowded here.

Fountain near the cathedral

The picture below was taken while I stood on the small balcony in front of the main entrance to the town hall. On the left you can see one of my favorite places there. It is a fixed stop each day I walk to the RWTH.

View on the market square, standing on the stairs of the town hall

One of the other nice things about Aachen is the choice of restaurants. One of my favorites is the "Best Friends" restaurant in the Pontstraße. It offers a variety of Asian dishes and I really enjoyed the Bento Box. The picture below was taken when I went there with a couple of friends. No comments necessary, I think.

Best Friends restaurant in the Pontstrasse

Dieter an Mey and his team at the Computer Centre of the RWTH always do a great job in general, but they also select really good places for the social dinner. We've been to the Kazan restaurant a couple of times before and have never been disappointed with the food and the service. Below is a picture of the restaurant, followed by a live action shot taken by Agnes Mendes from the RWTH.

The Kazan restaurant

The social dinner

Tuesday Mar 17, 2009

Houston, March 7-14, 2009

I was in Houston, Texas, the week of March 7-14, 2009. The purpose of this trip was twofold: to visit Barbara Chapman's Computer Science group at the University of Houston, and to give an OpenMP class at Texas Instruments. In this blog I would like to share my impressions and pictures taken during my stay there.

I've been in Houston several times now, but I continue to be amazed to see the indoor ice skating rink in the middle of the huge Galleria shopping mall. Given how hot and humid it typically is outside, it is fascinating to see people skating inside. In good US tradition there is a food court around the rink. It actually makes for an entertaining view while eating.

Ice skating rink at the Galleria Houston

I stayed in a nice hotel in Sugar Land on the southwest side of Houston, along the Southwest Freeway. The name of this part of Houston is kind of charming; it dates back to the days of the sugarcane plantations. Below are two pictures of the hotel, as well as two of the town hall.

Lobby of my hotel

Lobby of my hotel

Sugar Land Town Hall

Sugar Land Town Hall

It is always a pleasure to visit the University of Houston campus. I like the way it has been organized, as well as the lawns, the trees and the walkways. Below are some impressions. The second picture shows Philip Guthrie Hoffman Hall, home of the Computer Science department. The third picture shows the library.

Main entrance UH campus

PGH Building

UH Library

Main entrance to the UH campus

Barbara and I met several times, and I had an interesting discussion session with her students and staff members. The social side of this visit was not overlooked either. Below are pictures of the joint dinners we had. The first one was taken at a dinner in a local pub called "Cafe Adobe" in Sugar Land. The second one was taken in the "Mo Mong" Vietnamese restaurant on Westheimer.

Dinner at a pub in Sugar Land Dinner at a Vietnamese restaurant

Wednesday evening we went to the Houston Livestock Show and Rodeo. This is a very big and lively event that was held March 3-22, 2009, in Reliant Stadium. Prior to going there we went for dinner at a Texas barbecue place on Kirby Street. Below is a picture of the restaurant, as well as a group picture. 

Goode Company Texas Barbeque on Kirby Street

Group picture at Texas BBQ

Next to the BBQ restaurant was a place I recognized from a previous visit to Houston, called "Goode's Armadillo Palace". There is a huge armadillo in front of it and I could not resist taking a picture of it.

Goode's Armadillo Palace in Houston

This was my first rodeo experience. Quite entertaining and very efficiently organized. There were short sessions with various contests. For some of those you had to be really quick to catch the action; it could be over in a few seconds, but that is after all the whole purpose of these contests.

Below are a picture of the stadium, plus an early evening view of Houston. Our seats were very high up; the upside was that the outside view from there was really nice. The boots shown in the third picture attracted a lot of attention from people who wanted to have their picture taken with them.

Reliant Stadium in Houston

View on Houston from the stadium

Texas boots

Below are four pictures taken at the show. In the first one you can clearly see how the bull is mastered by the cowboy. The second one is not so fortunate, as he is about to fall off his horse. Luckily he did not seem to have any serious injuries. The third picture shows the wagon races. Quite spectacular; it reminded me of what the term "horsepower" stands for. In the fourth picture the podium for the concert is wheeled in. 

Cowboy wins

Cowboy loses Wagon races

Preparations for the concert

The last picture shows the Texas Instruments building where the OpenMP training was held. The turnout was really impressive and I very much enjoyed the discussions, as well as conversations with the attendees. They had really good and detailed questions.

TI building 


Saturday Feb 28, 2009

Demystifying Persistent OpenMP Myths - Part I

Unfortunately, the September 5, 2008 blog titled "The OpenMP Concurrency Platform" written by Charles Leiserson from Cilk Arts repeats some of the persistent myths regarding OpenMP.

Certain comments made may also give rise to a somewhat distorted view of OpenMP for those readers who are less familiar with the details of parallel programming. One example is the statement that OpenMP is most suitable for loops only. This has never been the case, and certainly the introduction of the flexible and powerful tasking concept in OpenMP 3.0 (released May 2008) is a big step forward.

In this article I would like to respond to this blog and share my view on the claims made. The format chosen is to give a quote, followed by my comment(s).

"OpenMP does not attempt to determine whether there are dependencies between loop iterations." 

This is correct, but there are two additional and important comments to be made.

OpenMP is a high-level, but explicit, programming model. The developer specifies the parallelism; the compiler and run-time system translate this into a corresponding parallel execution model. The task of the programmer is to correctly identify the parallelism. As far as I can tell, this is the case for all parallel programming models, so in that sense OpenMP is no different from other models, and it is not clear what the intention of this comment is. (The exception is automatic parallelization, where it is the responsibility of the compiler to identify those portions of the program that can be executed in parallel, as well as to generate the correct underlying infrastructure.)

Another aspect not mentioned is that one of the strengths of OpenMP is that the directive based model allows compilers to check for possible semantic errors made by the developer. For example, several compilers perform a static dependence analysis to warn against possible data races. Such errors are much harder to detect if function calls are used to specify the parallelism (e.g. POSIX threads).

"Unfortunately, some of the OpenMP directives for managing memory consistency and local copies of variables affect the semantics of the serial code, compromising this desirable property unless the code avoids these pragmas." 

I don't think OpenMP directives affect the semantics of the serial code, so how can this be the case? An OpenMP program can always be compiled in such a way that the directives are ignored, effectively compiling the serial code.

I suspect the author refers to the "#pragma omp flush" directive and the "private" clause. These affect the semantics of the parallel version, not the serial code, but either or both could be required to ensure correct parallel execution. The need for this depends on the specific situation.

We can only further guess what is meant here, but it is worth doing so. 

Part of the design of a shared memory computer system is to define the memory consistency model. Several choices are possible and have indeed been implemented in commercially available parallel systems, as well as in more research oriented architectures.

As suggested by the name, memory consistency defines what memory state the various threads of execution observe. They need not have the same view at a specific point in time.

The problem with this is that at certain points in a shared memory parallel program, the developer may want to enforce a consistent view to ensure that modifications made to shared data are visible to all threads, not only the thread(s) that modified this data.

This has however nothing to do with OpenMP. It is something that comes with shared memory programming and needs to be dealt with.

Ensuring a consistent view of memory is exactly what the "#pragma omp flush" directive does, and this is guaranteed by the OpenMP implementation. The developer therefore has a powerful yet portable mechanism to achieve this. In other words, it is a strength, not a weakness. Also, for ease of development, many OpenMP constructs already have a flush implied. This dramatically reduces the need to use the flush directive explicitly, but when it is still needed, it is a nice feature to have.

Given what it achieves, this directive does not impact correct execution of the serial or single threaded version of an OpenMP program. Therefore this can also not explain the claim made in this blog.

The second item mentioned ("local copies of variables") is also not applicable to the serial version of the program, nor to single-threaded execution of the parallel version. The "private" clause allows the programmer to specify which variables are local to the thread (there are also default rules for this, by the way). As a result of this clause, each thread has its own unique instance of the variable(s) specified. This is not only a very natural feature to wish for, it also has no impact on the serial code.

Perhaps the author refers to the "firstprivate" and "lastprivate" clauses, but these can be used to preserve the sequential semantics in the parallel program, not the other way around. Their use is rare, but when needed, they are very convenient to have.

"Since work-sharing induces communication and synchronization overhead whenever a parallel loop is encountered, the loop must contain many iterations in order to be worth parallelizing."

Again, some important details are left out. OpenMP provides several strategies to assign loop iterations to threads. If none is specified explicitly, the compiler provides one. 

There is a good reason to offer the developer multiple choices. The most efficient strategy is the static distribution of iterations over the threads. Contrary to the claim made above, the overhead for this is close to zero. It is also the reason why many compilers use it as the default in the absence of an explicit specification by the user.

This choice may however not be optimal if the workload is unevenly balanced. For such cases, OpenMP provides several alternatives, such as the "dynamic" and "guided" workload distribution schemes. It is true that these require more synchronization, but that is understandable: the run time system needs to decide how to distribute the work, which is not needed with static scheduling.

Of course the developer can always implement a specific scheme manually, but these two schemes go a long way to accommodate many real world situations. Moreover, an implementor will try very hard to provide the most efficient implementation of these constructs, relieving the developer of this task.

"Although OpenMP was designed primarily to support a single level of loop parallelization"

I'm not sure what this comment is based on, because nested parallelism has been supported since the very first 1.0 release of OpenMP, which came out in 1997. It is true that it took some time for compilers and run time systems to support it, but it is a widely available feature these days.

"The work-sharing scheduling strategy is often not up to the task, however, because it is sometimes difficult to determine at the start of a nested loop how to allocate thread resources."

OpenMP has a fairly high level of abstraction. I fail to see what is meant by "allocate thread resources"; there is no such concept available to the user, other than the various data-sharing attributes like "private" and "shared". It is also not clear what is really meant here. Nested parallelism works: each thread becomes the master thread of a new team of threads, and resources are available whenever needed.

The next line gives us somewhat more of a clue as to what is really meant here.

"In particular, when nested parallelism is turned on, it is common for OpenMP applications to "blow out" memory at runtime because of the inordinate space demands."

In my experience this has not been an issue, but of course one cannot exclude that an (initial) implementation of nested parallelism for a specific platform suffered from certain deficiencies. Even so, that is a Quality of Implementation (QoI) issue and has nothing to do with the programming model. Shared data is obviously not copied, so it requires no additional memory, and by design (and desire) each additional thread gets a copy of its private data.

That this is really a QoI issue seems to be confirmed by the next statement.

"The latest generation OpenMP compilers have started to address this problem, however."

In other words, if there was a problem at all, it is being addressed.

"In summary, if your code looks like a sequence of parallelizable Fortran-style loops, OpenMP will likely give good speedups."

This is one of those persistent myths. OpenMP has always been more flexible than "just" parallelizing loops. As shown, for example, in the book "Using OpenMP" (by Chapman, Jost and van der Pas), the sections construct can be used to overlap input, processing and output in a pipelined manner.

"If your control structures are more involved, in particular, involving nested parallelism, you may find that OpenMP isn't quite up to the job."

This is a surprisingly bold and general claim, and some more specific information would be helpful. As already mentioned above, it is not at all clear why nested parallelism should not be suitable and performant. It is, and it is successfully used for certain kinds of algorithms.

Regrettably, the author of this blog also does not seem to be aware of the huge leap forward made with OpenMP 3.0. The specifications were released in May 2008 and are already supported by several major compilers.

The main new functionality added is the concept of tasking. A task can be any block of code. The developer is responsible for ensuring that different tasks can be executed concurrently. The run time system generates and executes the tasks, either at implicit synchronization points in the program, or under explicit control of the programmer.

This adds tremendous flexibility to OpenMP and raises the level of abstraction. The claim that OpenMP is only suitable for a loop level style of parallelization was never true in the past, and is certainly far too restrictive now.

In addition to tasking, numerous other features have been added, including enhanced support for nested parallelism and C++.

Last, but certainly not least, I can strongly recommend that anybody interested in OpenMP visit the main web site



Ruud van der Pas is a Senior Principal Software Engineer in the SPARC Microelectronics organization at Oracle. His focus is on application performance, both for single threaded and for multi-threaded programs. He is also co-author of the book Using OpenMP.


