Friday Nov 21, 2008

Extending the OpenMP profiling API for OpenMP 3.0

Last Tuesday at the OpenMP BOF of SC08, Oleg Mazurov presented our work on extending the OpenMP profiling API for OpenMP 3.0 (pdf slides).

The existing API was first published in 2006 and last updated in 2007. Since then, two developments have come to call for another update - one is support for the new OpenMP tasking feature, and the other is support for vendor-specific extensions.

The extension for tasking support is straightforward. A few events corresponding to the creation, execution, and termination of tasks are added. Also added are a few requests to get the task ID and other properties.

Vendor-specific extensions are implemented essentially by having the collector tool send an establish-extension request, carrying a unique vendor ID, to the OpenMP runtime library. The runtime library accepts the request if it supports that vendor and rejects it otherwise. After a successful rendezvous, the request establishes a new name space for subsequent requests and events.

One pending issue is how to support multiple vendor agents in one session. It is not that a solution cannot be engineered; we are simply waiting for a use case to emerge.

During the execution of an OpenMP program, any arbitrary program event can be associated with

  • an OpenMP state,
  • a user callstack,
  • a node in the thread tree with parallel region ID's and OpenMP thread ID's along the path, and
  • a node in the task tree with task ID's along the path.

Because the execution of an OpenMP task may be asynchronous, and the executing thread may be different from the encountering thread, getting the user callstack of an event that happens within a task becomes tricky.

At our Sun booth in SC08, we demoed a prototype Performance Analyzer that can present user callstacks in a cool way when OpenMP tasks are involved.

Take a simple quick sort code as an example.

        void quick_sort ( int lt,  int rt,  float *data )  {
            int md = partition( lt,  rt,  data );
            #pragma omp task
            quick_sort( lt,  md - 1,  data );
            #pragma omp task
            quick_sort( md + 1,  rt,  data );
        }

The following figure shows the time line display of one execution of the program. The same original data are sorted three times: once sequentially, once using two threads, and once using four threads.

The spikes in the callstacks of the sequential sort show the recursive nature of quick sort. But when you look at the parallel sort, the callstacks are flat. That's because each call to quick_sort() is now a task, and tasking essentially turns the recursive execution into a work-list execution. The low-level callstack in the above figure shows close to what actually happens in one thread.

While these pieces of information are useful in showing the execution details, they do not help answer the question of which tasks are actually being executed. Where was the currently executing task created? In the end, the user needs to debug the performance problem in his/her code (not in the OpenMP runtime). Presenting information close to the user's program logic is crucial.

The following figure shows the time line and user callstacks in the user view constructed by our prototype tool. Notice that the callstacks in the parallel run are almost the same as in the sequential run. In the time line, it looks as if the work of the sequential run were simply distributed among the threads in the parallel run. Isn't this what intuitively happens when you parallelize a code using OpenMP? :)

Wednesday Jul 30, 2008

New LEGO in Store: Sun Studio Express 07/08 with OpenMP 3.0 Support

Today, we are making available, as a free download, Sun Studio Express 07/08 Release. One of the most exciting things about this release is the beta-level support for OpenMP 3.0 in our C/C++/Fortran Compilers.

I feel really excited about this. One of the major 3.0 features supported is tasking, which was finalized in the language specification after a looooong labor. It opens up a whole new dimension of what OpenMP can do. It is like a new piece of LEGO. We are looking forward to seeing innovative (or not :)) ways of using this new feature.

This is a functional beta release. We are still working on fixing a few bugs and improving performance. One of the best ways to give us feedback is using our online forum.

Here are two short articles that may help users jump-start using the tasking feature.

Saturday Nov 17, 2007

Non-concurrency Analysis Used in PTP

When visiting the IBM booth at SC07, I was a little surprised to learn that the non-concurrency analysis technology for OpenMP programs had also been adopted and implemented in the Parallel Tools Platform.

Beth Tibbitts from IBM has kindly sent me the reference details: STMCS'07 program, paper, and presentation.

The technology is used by Sun Studio Compilers to do static error checking of OpenMP programs.

Thursday Sep 20, 2007

Common Mistakes in Using OpenMP 5: Assuming Non-existing Synchronization Before Entering Worksharing Construct

There is no synchronization between the threads in a team when they enter a worksharing construct. Many people assume there is a barrier before the threads enter a worksharing construct, especially when a FIRSTPRIVATE clause is used on the construct. This is a common mistake.

For example, in the following code, assume two threads - thread 1 and thread 2 are in the team, and Read1 is executed by thread 1 and Read2 is executed by thread 2.

  #pragma omp parallel
  {
     if (omp_get_thread_num()==0)
        z = 1;
     else
        z = 2;

     #pragma omp sections firstprivate(z)
     {
        #pragma omp section
           ... = z;      // Read1
        #pragma omp section
           ... = z;      // Read2
     }
  }

What are the values of z at Read1 and Read2? All of the following three combinations are possible:

  1. Read1:1 Read2:1
  2. Read1:1 Read2:2
  3. Read1:2 Read2:2

If there were a synchronization before the worksharing construct, then the second combination above (Read1:1, Read2:2) would not be possible.

Now, look at the following example, which has both FIRSTPRIVATE and LASTPRIVATE:

  #pragma omp parallel
  {
     z = 1;
     #pragma omp for firstprivate(z) lastprivate(z) nowait
     for (i=0; i<n; i++) {
          ... = z;      // Read1
          z = 2;        // Write1
     }
  }
What could be the value of z at Read1? Could it be 2? The OpenMP 3.0 draft has clarified this situation. It says:

If a list item appears in both firstprivate and lastprivate clauses, the update required for lastprivate occurs after all initializations for firstprivate.

So, the value of z at Read1 cannot be 2.

Thursday Jun 29, 2006

The idea behind environment variable "SUNW_MP_MAX_POOL_THREADS"

Sun's OpenMP implementation supports true nested parallel regions - when nested parallelism is enabled, the inner parallel region can be executed by multiple threads concurrently.

We provide an environment variable called SUNW_MP_MAX_POOL_THREADS for users to control the total number of OpenMP slave threads in a process.

For example, if you want a maximum of 16 threads to be used for a nest of parallel regions in your program, you can set SUNW_MP_MAX_POOL_THREADS to 15. That's 15 slave threads (some of which may become masters of inner parallel regions) plus one user thread, which is the master thread of the outermost parallel region.

Why didn't we design an environment variable like SUNW_MP_MAX_NUM_THREADS, which a user could set to 16 in the above example? Intel's implementation has KMP_ALL_THREADS and KMP_MAX_THREADS, which do exactly that.

Well, we were trying to have a scheme that works in more general cases, not just pure OpenMP codes. In particular, we think our scheme works better than others for code that mixes pthreads and OpenMP. The pool defines a set of threads that can be used as OpenMP slave threads. If the program has two pthreads and both create a team, then both will try to grab slave threads from the same pool. The env var SUNW_MP_MAX_POOL_THREADS was NOT designed for controlling the total number of threads in a process - we cannot control that, because of the use of pthreads. It is designed for controlling the total number of OpenMP slave threads.
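Picking up the 16-thread example above, the setting might look like this in a shell; the program name is hypothetical.

```shell
# 15 slave threads in the pool + 1 master thread = at most 16 threads
SUNW_MP_MAX_POOL_THREADS=15
export SUNW_MP_MAX_POOL_THREADS

# enable nested parallelism so inner regions can get their own teams
OMP_NESTED=TRUE
export OMP_NESTED

./a.out    # your OpenMP program
```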

The env var SUNW_MP_MAX_POOL_THREADS is documented here. We also have a short article, "How Many Threads Does It Take?", if you want to understand it better.

Sunday Jun 11, 2006

Common Mistakes in Using OpenMP 4: Orphaned Worksharing Constructs

More precisely, this mistake should be classified as a common misunderstanding of OpenMP.

When a worksharing construct, such as omp for or omp sections, is encountered outside of any explicit parallel region, the resulting worksharing region is called an orphaned worksharing region. A common misunderstanding is that in this case the worksharing construct is simply ignored and the region is executed sequentially.

Orphaned worksharing constructs are not ignored. All the data-sharing attribute clauses are honored. The worksharing region is executed as if a team of only one thread were executing the region.

For example, in the following C++ code,

         class_type_1  a;
         #pragma omp for private(a) schedule(dynamic)
         for (i=1; i<100; i++) {
             printf("%d\n", i);
         }

the default constructor of class_type_1 will be called, and a conforming implementation is not forced to execute the loop in the order 1, 2, 3, ..., 99.

Wednesday Jun 07, 2006

Common Mistakes in Using OpenMP 3: Fifteen Cases from an IWOMP 2006 Paper by Michael Süß and Claudia Leopold

The upcoming International Workshop on OpenMP (IWOMP 2006) has a paper titled "Common Mistakes in OpenMP and How to Avoid Them" written by Michael Süß and Claudia Leopold (University of Kassel, Germany).

The results are based on a survey of two undergraduate courses. The authors of the paper have kindly allowed me to list the 15 common mistakes presented in their paper here:

  1. (Correctness) Access to shared variables not protected
  2. (Correctness) Use of locks without flush
  3. (Correctness) Read of shared variable without flush
  4. (Correctness) Forget to mark private variables as such
  5. (Correctness) Use of ordered clause without ordered construct
  6. (Correctness) Declare loop variable in #pragma omp parallel for as shared
  7. (Correctness) Forget to put down for in #pragma omp parallel for
  8. (Correctness) Try to change num. of thr. in parallel reg. after start of reg.
  9. (Correctness) omp_unset_lock() called from non-owner thread
  10. (Correctness) Attempt to change loop variable while in #pragma omp for
  11. (Performance) Use of critical when atomic would be sufficient
  12. (Performance) Put too much work inside critical region
  13. (Performance) Use of orphaned construct outside parallel region
  14. (Performance) Use of unnecessary flush
  15. (Performance) Use of unnecessary critical

For details, please read the full paper.

Monday Feb 20, 2006

Common Mistakes in Using OpenMP 2: Atomic

The following code finds good members in array member[] and stores the indices of the good members in array good_members[].

  #define N 1000

  struct data member[N];
  int good_members[N];
  int pos = 0;

  void find_good_members()
  {
      int i;
      for (i = 0; i < N; i++) {
          if (is_good(member[i])) {
              good_members[pos] = i;
              pos++;
          }
      }
  }

The following is a naive way of parallelizing the above code:

  #define N 1000

  struct data member[N];
  int good_members[N];
  int pos = 0;

  void find_good_members()
  {
      int i;
      #pragma omp parallel for
      for (i = 0; i < N; i++) {
          if (is_good(member[i])) {
              good_members[pos] = i;   // line a
              #pragma omp atomic
              pos++;                   // line b
          }
      }
  }

In order to avoid data races between different updates of the global variable pos, the code puts the increment (at line b) in an atomic construct. However, the code does not work, because there is a data race between the read of pos at line a and the write of pos at line b.

Changing the body of the if statement to the following gives the correct result.

      int mypos;
      #pragma omp critical
      {
          mypos = pos;
          pos++;
      }
      good_members[mypos] = i;

In OpenMP 2.5 (the latest specification), inside a parallel region, the only place where you can safely get the value of a variable that is updated in an atomic region is another atomic region. Since the atomic construct in 2.5 supports only updates, not reads, the fix above moves both the read and the update into a critical region instead.

Friday Dec 30, 2005

Common Mistakes in Using OpenMP 1: Incorrect Directive Format

In C/C++, OpenMP directives are specified by using the #pragma mechanism; and in Fortran, they are specified by using special comments that are identified by unique sentinels.

This design allows users to write OpenMP programs that can be compiled with compilers that do not support OpenMP, or with OpenMP compilers that have OpenMP support disabled.

However, if you do not follow the directive format, you might get a program that compiles and runs but gives unexpected results, because the compiler does not recognize your OpenMP directives and thinks they are non-OpenMP related pragmas (C/C++) or regular comments (Fortran).


How many "me"s does the following code print? Assume a team of 4 threads is executing the parallel region.

    #pragma omp parallel
    {
        #pragma single
        printf("me\n");
    }

The answer is 4, not 1: the second directive is missing the omp keyword, so the compiler treats #pragma single as an unknown pragma and ignores it, and every thread in the team prints "me".

Common Mistakes in Using OpenMP

I will post a list of common mistakes found in parallel programs written using OpenMP.

Although it is always true that users of a language need to spend effort to understand the language so as to avoid mistakes, I wonder what it means for the language designers if many users keep making the same set of mistakes again and again.



