Demystifying Persistent OpenMP Myths - Part I

Unfortunately, the September 5, 2008 blog post titled "The OpenMP Concurrency Platform", written by Charles Leiserson of Cilk Arts, repeats some persistent myths about OpenMP.

Certain comments may also give readers less familiar with parallel programming a somewhat distorted view of OpenMP. One example is the claim that OpenMP is only suitable for loops. This has never been the case, and the introduction of the flexible and powerful tasking concept in OpenMP 3.0 (released May 2008) is certainly a big step forward.

In this article I would like to respond to this post and share my view on the claims made. The format is a quote, followed by my comment(s).

"OpenMP does not attempt to determine whether there are dependencies between loop iterations." 

This is correct, but there are two additional and important comments to be made.

OpenMP is a high-level, but explicit, programming model: the developer specifies the parallelism, and the compiler and run time system translate it into a corresponding parallel execution model. The task of the programmer is to correctly identify the parallelism. As far as I can tell, this is the case for all parallel programming models, so in that sense OpenMP is no different from the others, and it is not clear what the intention of this comment is. (The exception is automatic parallelization, where it is the responsibility of the compiler to identify the portions of the program that can be executed in parallel, as well as to generate the correct underlying infrastructure.)

Another aspect not mentioned is that the directive-based model is one of the strengths of OpenMP: it allows compilers to check for possible semantic errors made by the developer. For example, several compilers perform a static dependence analysis to warn against possible data races. Such errors are much harder to detect if function calls are used to specify the parallelism (e.g. POSIX threads).

"Unfortunately, some of the OpenMP directives for managing memory consistency and local copies of variables affect the semantics of the serial code, compromising this desirable property unless the code avoids these pragmas." 

I don't think OpenMP directives affect the semantics of the serial code, so how can this be the case? An OpenMP program can always be compiled in such a way that the directives are ignored, effectively compiling the serial code.

I suspect the author refers to the "#pragma omp flush" directive and the "private" clause. These affect the semantics of the parallel version, not the serial code, but either or both could be required to ensure correct parallel execution. The need for this depends on the specific situation.

We can only further guess what is meant here, but it is worth doing so. 

Part of the design of a shared memory computer system is to define the memory consistency model. Several choices are possible and have indeed been implemented in commercially available parallel systems, as well as in more research oriented architectures.

As suggested by the name, memory consistency defines what memory state the various threads of execution observe. They need not have the same view at a specific point in time.

The problem with this is that at certain points in a shared memory parallel program, the developer may want to enforce a consistent view to ensure that modifications made to shared data are visible to all threads, not only the thread(s) that modified this data.

This has however nothing to do with OpenMP. It is something that comes with shared memory programming and needs to be dealt with.

Ensuring a consistent view of memory is exactly what the "#pragma omp flush" directive does, and this is guaranteed by the OpenMP implementation. The developer therefore has a powerful yet portable mechanism to achieve this. In other words, it is a strength, not a weakness. Also, for ease of development, many OpenMP constructs already have a flush implied. This dramatically reduces the need to use the flush directive explicitly, but when it is still needed, it is a nice feature to have.

Given what it achieves, this directive does not impact correct execution of the serial or single-threaded version of an OpenMP program. It therefore cannot explain the claim made in this blog post either.

The second item mentioned ("local copies of variables") is also not applicable to the serial version of the program, nor to single-threaded execution of the parallel version. The "private" clause allows the programmer to specify which variables are local to a thread (there are also default rules for this, by the way). As a result, each thread has its own unique instance of the variable(s) specified. This is not only a very natural feature to wish for, it also has no impact on the serial code.

Perhaps the author refers to the "firstprivate" and "lastprivate" clauses, but these can be used to preserve the sequential semantics in the parallel program, not the other way round. Their use is rare, but if needed, very convenient to have.

"Since work-sharing induces communication and synchronization overhead whenever a parallel loop is encountered, the loop must contain many iterations in order to be worth parallelizing."

Again, some important details are left out. OpenMP provides several strategies to assign loop iterations to threads. If none is specified explicitly, the compiler chooses one.

There is a good reason to offer the developer multiple choices. The most efficient strategy is the static distribution of iterations over the threads. Contrary to the claim above, the overhead for this is close to zero. It is also the reason why many compilers use it as the default in the absence of an explicit specification by the user.

This choice may however not be optimal if the workload is less balanced. For this, OpenMP provides several alternatives, like the "dynamic" and "guided" workload distribution schemes. It is true that more synchronization is needed then, but this is understandable: the run time system needs to decide how to distribute the work, which is not necessary with static scheduling.

Of course the developer can always implement a specific scheme manually, but these alternatives go a long way toward accommodating many real-world situations. Moreover, an implementor will try very hard to provide the most efficient implementation of these constructs, relieving the developer of this task.

"Although OpenMP was designed primarily to support a single level of loop parallelization"

I'm not sure what this comment is based on, because nested parallelism has been supported since the very first 1.0 release of OpenMP, which came out in 1997. It has admittedly taken some time for compilers and run time systems to support it, but it is a widely available feature these days.

"The work-sharing scheduling strategy is often not up to the task, however, because it is sometimes difficult to determine at the start of a nested loop how to allocate thread resources."

OpenMP has a fairly high level of abstraction, and I fail to see what is meant by "allocate thread resources". There is no such concept available to the user, other than the various data-sharing attributes like "private" and "shared". It is also not clear what is really meant here. Nested parallelism works: each thread becomes the master thread of a new team of threads, and resources are available whenever needed.

The next line gives us somewhat more of a clue as to what is really meant here.

"In particular, when nested parallelism is turned on, it is common for OpenMP applications to "blow out" memory at runtime because of the inordinate space demands."

In my experience this has not been an issue, but of course one cannot exclude that an (initial) implementation of nested parallelism for a specific platform suffered from certain deficiencies. Even if so, that is a Quality of Implementation (QoI) issue and has nothing to do with the programming model. Shared data is obviously not copied, so it requires no additional memory, and by design (and desire) each (additional) thread gets a copy of its private data.

The fact this is really a QoI issue seems to be confirmed by the next statement.

"The latest generation OpenMP compilers have started to address this problem, however."

In other words, if there was a problem at all, it is being addressed.

 "In summary, if your code looks like a sequence of parallelizable Fortran-style loops, OpenMP will likely give good speedups."

This is one of those persistent myths. OpenMP has always been more flexible than "just" parallelizing loops. As shown in the book "Using OpenMP" (by Chapman, Jost and van der Pas), for example, the sections concept can be used to overlap input, processing and output in a pipelined manner.

"If your control structures are more involved, in particular, involving nested parallelism, you may find that OpenMP isn't quite up to the job."

Not only is this a surprisingly bold and general claim, some more specific information would also be helpful. As already mentioned above, it is not at all clear why nested parallelism should not be suitable and performant. It actually is, and it is successfully used for certain kinds of algorithms.

Regrettably, the author of this blog post also does not seem to be aware of the huge leap forward made with OpenMP 3.0. The specifications were released in May 2008 and are already supported by several major compilers.

The main new functionality added is the concept of tasking. A task can be any block of code. The developer has the responsibility to ensure that different tasks can be executed concurrently. The run time system generates and executes the tasks, either at implicit synchronization points in the program, or under explicit control of the programmer.

This adds tremendous flexibility to OpenMP. It also raises the level of abstraction. Although it was never true in the past either, the claim that OpenMP is only suitable for a loop-level style of parallelization is certainly far too restrictive.

In addition to tasking, numerous other features have been added, including enhanced support for nested parallelism and C++.

Last, but certainly not least, I strongly recommend that anybody interested in OpenMP visit the main web site


Just how good are the implementations of tasks these days? For example, suppose I use the task model to implement doubly recursive Fibonacci in parallel; how good is it? In MIT Cilk it's
cilk int fib (int n) {
    if (n < 2) return n;
    else {
        int a = spawn fib(n - 1);
        int b = spawn fib(n - 2);
        sync;
        return a + b;
    }
}
In OpenMP it's very similar.

Last time I measured an implementation (icc), the overhead of a parallelizable function call (a task) was an order of magnitude higher than for Cilk. You may argue that this is a quality-of-implementation issue, but I would counter that the semantics of OpenMP make it difficult to implement high-performance task parallelism.

Disclaimer: I am one of the developers of MIT Cilk, but not of Cilk++.

Posted by Bradley C. Kuszmaul on March 09, 2009 at 12:42 AM PDT #

Intel has freely admitted that their taskq implementation was not optimized. Unless Mr. Kuszmaul measured the task implementation using icc Version 11, I am afraid that his comparison doesn't mean much.

Posted by Eric Duncan on March 09, 2009 at 11:41 PM PDT #

Hi Bradley,

Thank you for your feedback. I'm afraid I fail to understand why the semantics of OpenMP inhibit an efficient implementation of tasking. Can you please elaborate?

Thanks. Ruud.

Posted by Ruud van der Pas on March 10, 2009 at 07:20 AM PDT #

Dear Mr. van der Pas:

I consider many of the criticisms in your article to be unfair and inaccurate. For a full response, please see

Charles E. Leiserson

Posted by Charles E. Leiserson on March 13, 2009 at 02:03 PM PDT #

Dear Mr. Leiserson, I have read your response. I don't think it is very productive if I were to post an elaborate answer to this, but I would like to make a few comments.

You have a valid point regarding preserving the semantics of the serial code versus the OpenMP version executed on a single thread. I should have been more careful in my wording. One can construct situations like the one shown in your example, but not only are these rare, the difference in behaviour can also be avoided by using local variables inside the code block.

I think the example given to measure the scheduling overhead is flawed. I maintain my point that static scheduling is very efficient; the reduction clause in your example implies a critical section, and this of course significantly affects the performance.

I'm afraid I also do not find your example on nested parallelism very convincing, as there is no computational work performed.

Your comment on OpenMP 3.0 tasking support seems to be largely based on gcc. As far as I know, tasking is supported in several other widely used compilers.

Kind regards, Ruud van der Pas

Posted by Ruud van der Pas on March 19, 2009 at 07:48 AM PDT #

Please find my comment in the OpenMP Forum "Using OpenMP"
"OpenMP versus Cilk++ ?"

Posted by Dieter an Mey on March 20, 2009 at 08:48 AM PDT #


I have started using OpenMP recently. I have been using sections and find them quite useful. I have a naive question: what is the difference between sections and tasks?


Posted by shankha on March 14, 2010 at 12:13 AM PST #

@ shankha:

Q: What is the difference between using an OpenMP task versus using omp parallel sections calling a task?

A: Calling a task within a section just creates extra overhead, and the tasks cannot be controlled and synchronized, since each parallel section is independent of the others. OpenMP 3.0 tasking is more flexible and efficient than parallel sections. With parallel sections there is no way to coordinate the work in the individual sections, so it is not possible to determine whether one section will be executed before another, regardless of which section comes first in the program source. The task directive, on the other hand, can take an "if" clause to have the task executed immediately or deferred, and a task can be "hard-wired" to a thread (called "tied") or be "untied", which allows any available thread in the team to start executing it. Tasking also has much better performance and scalability for nested parallel and recursive algorithms than parallel sections: there is far more overhead in creating and destroying nested parallel regions (parallel sections) than in executing tasks that are all created by a single parallel region containing a task directive. Finally, with tasking you control the total number of threads (OMP_NUM_THREADS), whereas with nested parallel regions (sections) each newly created region gets OMP_NUM_THREADS new threads, which can easily oversubscribe the host.

From here:

Posted by sd9 on June 15, 2010 at 06:35 PM PDT #



Ruud van der Pas is a Senior Principal Software Engineer in the SPARC Microelectronics organization at Oracle. His focus is on application performance, both for single threaded, as well as for multi-threaded programs. He is also co-author on the book Using OpenMP
