Saturday Feb 28, 2009

Demystifying Persistent OpenMP Myths - Part I

Unfortunately, the September 5, 2008 blog titled "The OpenMP Concurrency Platform" written by Charles Leiserson from Cilk Arts repeats some of the persistent myths regarding OpenMP.

Certain comments made also may give rise to a somewhat distorted view on OpenMP for those readers that are less into the aspects of parallel programming. For example, the statement that OpenMP is most suitable for loops only. This has never been the case and certainly the introduction of the flexible and powerful tasking concept in OpenMP 3.0 (released May 2008) is a big step forward.

In this article I would like to respond to this blog and share my view on the claims made. The format chosen is to give a quote, followed by my comment(s).

"OpenMP does not attempt to determine whether there are dependencies between loop iterations." 

This is correct, but there are two additional and important comments to be made.

OpenMP is a high level, but explicit programming model. The developer specifies the parallelism. The compiler and run time system translate this into a corresponding parallel execution model. The task of the programmer is to correctly identify the parallelism. As far as I can tell, this is the case for all parallel programming models. In that sense, OpenMP is not any different than other models. It is therefore not clear what the intention of this comment is (note that the exception is automatic parallelization. In this case it is the responsibility of the compiler to identify those portions of the program that can be executed in parallel, as well as generate the correct underlying infrastructure).

Another aspect not mentioned is that one of the strengths of OpenMP is that the directive based model allows compilers to check for possible semantic errors made by the developer. For example, several compilers perform a static dependence analysis to warn against possible data races. Such errors are much harder to detect if function calls are used to specify the parallelism (e.g. POSIX threads).

"Unfortunately, some of the OpenMP directives for managing memory consistency and local copies of variables affect the semantics of the serial code, compromising this desirable property unless the code avoids these pragmas." 

I don't think OpenMP directives affect the semantics of the serial code, so how can this be the case? An OpenMP program can always be compiled in such a way that the directives are ignored, effectively compiling the serial code.

I suspect the author refers to the "#pragma omp flush" and "#pragma omp private" directives. These affect the semantics of the parallel version, not the serial code, but either or both could be required to ensure correct parallel execution. The need for this depends on the specific situation.

We can only further guess what is meant here, but it is worth doing so. 

Part of the design of a shared memory computer system is to define the memory consistency model. Several choices are possible and have indeed been implemented in commercially available parallel systems, as well as in more research oriented architectures.

As suggested by the name, memory consistency defines what memory state the various threads of execution observe. They need not have the same view at a specific point in time.

The problem with this is that at certain points in a shared memory parallel program, the developer may want to enforce a consistent view to ensure that modifications made to shared data are visible to all threads, not only the thread(s) that modified this data.

This has however nothing to do with OpenMP. It is something that comes with shared memory programming and needs to be dealt with.

Ensuring a consistent view on memory is exactly what the "#pragma omp flush" directive does. This is guaranteed by the OpenMP implementation,. Therefore, the developer has a powerful yet portable mechanism to achieve this. In other words, it is a strength, not a weakness. Also, for ease of development, many OpenMP constructs already have this construct implied. This dramatically reduces the need to explicitly use the flush directive, but if it is needed still, this construct is a nice feature to have.

Given what it achieves, this directive does not impact correct execution of the serial or single threaded version of an OpenMP program. Therefore this can also not explain the claim made in this blog.

The second item mentioned ("local copies of variables") is also not applicable to the serial version of the program, nor single threaded execution of the parallel version. The "#pragma omp private" directive allows the programmer to specify what variables are local to the thread. There are also default rules for this by the way. As a result of this directive, each thread has its unique instance of the variable(s) specified. This is not only a very natural feature to wish for, it also has no impact on the serial code.

Perhaps the author refers to the "firstprivate" and "lastprivate" clauses, but these can be used to preserve the sequential semantics in the parallel program, not the other way round. Their use is rare, but if needed, very convenient to have.

"Since work-sharing induces communication and synchronization overhead whenever a parallel loop is encountered, the loop must contain many iterations in order to be worth parallelizing."

Again some important details are left out. OpenMP provides several strategies to assign loop iterations to threads. If not specified explicitly, the compiler provides for one. 

There is a good reason to provide multiple choices to the developer. The most efficient strategy is the static distribution of iterations over the threads. In contrast with the claim made above, the overhead for this is close to zero. It is also the reason why many compilers use this as the default in absence of an explicit specification by the user.

This choice may however not be optimal if the workload is less balanced. For this, OpenMP provides several alternatives like the "dynamic" and "guided" workload distribution schemes. It is true that more synchronization is needed then, but this is understandable. The run time system needs to make choices how to distribute the work. This is not needed with the static scheduling.

Of course the developer can always implement a specific scheme manually, but these 2 choices come a long way to accommodate many real world situations. Moreover, an implementor will try very hard to provide the most efficient implementation of these constructs, relieving the developer from this task.

"Although OpenMP was designed primarily to support a single level of loop parallelization"

I'm not sure what this comment is based on, because nested parallelism has been supported since the very first 1.0 release of OpenMP that came out in 1997. The one thing is that it has taken some time for compilers and run time systems to support this, but it is a widely available feature these days.

"The work-sharing scheduling strategy is often not up to the task, however, because it is sometimes difficult to determine at the start of a nested loop how to allocate thread resources."

OpenMP has a fairly high level of abstraction. I fail to see what is meant with "allocate thread resources". Actually, there is no such concept available to the user, other than various types of data-sharing attributes like "private" or "shared". It is also not clear what is really meant here. Nested parallelism works, each thread becomes the master thread of a new pool of threads, and resources are available whenever needed.

The next line gives us somewhat more of a clue as to what is really meant here.

"In particular, when nested parallelism is turned on, it is common for OpenMP applications to "blow out" memory at runtime because of the inordinate space demands."

In my experience this has not been an issue, but of course one can not exclude that an (initial) implementation of nested parallelism for a specific platform suffered from certain deficiencies. Even if so, that is a Quality Of Implementation (QoI) issue and has nothing to do with the programming model. Shared data is obviously not copied, so there are no additional memory resources, and by design (and desire) each (additional) thread gets a copy of its private data.

The fact this is really a QoI issue seems to be confirmed by the next statement.

"The latest generation OpenMP compilers have started to address this problem, however."

In other words, if there was a problem at all, it is being addressed.

 "In summary, if your code looks like a sequence of parallelizable Fortran-style loops, OpenMP will likely give good speedups."

This is one of those persistent myths. OpenMP has always been more flexible than for "just" parallelizing loops. As for example shown in the book "Using OpenMP" (by Chapman, Jost and van der Pas), the sections concept can be used to overlap input, processing and output in a pipelined manner. 

"If your control structures are more involved, in particular, involving nested parallelism, you may find that OpenMP isn't quite up to the job."

This is not only a surprisingly bold and general claim, some more specific information would be helpful. As already mentioned above, it is not at all clear why nested parallelism should not be suitable and performant. It actually is and is successfully used for certain kinds of algorithms.

Regrettably the author of this blog also does not seem to be aware of the huge leap forward made with OpenMP 3.0. The specifications have been released in May 2008 and are supported by several major compilers already.

The main new functionality added is the concept of tasking. A task can be any block of code. The developer has the responsibility to ensure that different tasks can be executed concurrently. The run time system generates and executes the tasks, either at implicit synchronization points in the program, or under explicit control of the programmer.

This adds a tremendous flexibility to OpenMP. It also uplifts the level of abstraction. Although never true in the past either, a claim that OpenMP is only suitable for a loop level style of parallelization is certainly way too restrictive.

In addition to tasking, numerous other features have been added, including enhanced support for nested parallelism and C++.

 Last, but certainly not least, I can strongly recommend anybody interested in OpenMP to visit the main web site


Picture of Ruud

Ruud van der Pas is a Senior Staff Engineer in the Microelectronics organization at Oracle. His focus is on application performance, both for single threaded, as well as for multi-threaded programs. He is also co-author on the book Using OpenMP

Cover of the Using OpenMP book


« April 2014