Saturday Jan 03, 2009

New Post @ touchdreams.net/blog: Thread Cleanup Handlers, exit(), _exit(), atexit(), and pthread_exit()

This new post at touchdreams.net/blog describes what happens when a multi-threaded process shuts down: the differences between exit(), _exit() and pthread_exit(), and when and which cleanup routines are called. Shutting down a multi-threaded application gracefully and cleanly is a challenging task. Sometimes you do not even want to do that, but you still need to know what happens during the shutdown.

Monday Dec 29, 2008

Moving to touchdreams.net/blog

I am consolidating my blogs. Most of my entries here have been copied to touchdreams.net/blog.

I will continue writing about parallel programming at touchdreams.net/blog.

Sunday Dec 21, 2008

More on Concurrency vs Parallelism

A reader asked why concurrent programming is not a super-set of parallel programming, since parallel entities are also concurrent. Well, it is just like black-and-white vs. color photography. Though black and white are colors too, the techniques for taking good black-and-white pictures are different from those for color pictures. One needs to think and see differently in terms of contrast, texture, lighting and even composition.

Now back to our programming world. Recently, while working on OpenMP profiling, I fixed a concurrency bug that was related to asynchronous signals and had nothing to do with parallelism. I used a data structure to store the OpenMP context of a thread. Since an OpenMP context can be described as a tuple <current parallel region, current task region, OpenMP state, user callstack>, the data structure has several 64-bit fields. One challenge is to update the context data structure atomically, i.e. when my program needs to report the OpenMP context, it should report a consistent context. For example, it should not report that a thread is in a new parallel region but still in an old task region. The atomicity here has nothing to do with parallelism - the context data is thread private, so there is no sharing between threads and no data race. The atomicity issue arises when a profiling signal (SIGPROF) arrives while the program is in the middle of updating the fields of the context data structure. In the signal handler, the program needs to report the context, and report it consistently. In the end, I had to craft a way to update all the fields atomically (async-signal-safely) without masking out SIGPROF.

Here is another interesting discussion on concurrency vs parallelism. I checked the manual. The exact wording used is "The maximum number of active threads per multiprocessor is 768".

Friday Nov 21, 2008

Extending the OpenMP profiling API for OpenMP 3.0

Last Tuesday at the OpenMP BOF of SC08, Oleg Mazurov presented our work on extending the OpenMP profiling API for OpenMP 3.0 (pdf slides).

The existing API was first published in 2006 and last updated in 2007. Since then, two developments have begged for another update: one is support for the new OpenMP tasking feature, and the other is support for vendor-specific extensions.

The extension for tasking support is straightforward. A few events corresponding to the creation, execution, and termination of tasks are added, along with a few requests to get the task ID and other properties.

Vendor-specific extensions are implemented essentially by sending an establish-extension request with a vendor-unique ID from the collector tool to the OpenMP runtime library. The OpenMP runtime library accepts the request if it supports the vendor, and rejects it otherwise. After a successful rendezvous, the request establishes a new name space for subsequent requests and events.

One pending issue is how to support multiple vendor agents in one session. It is not that a solution cannot be engineered; we are waiting for a use case to emerge.

During the execution of an OpenMP program, any program event can be associated with

  • an OpenMP state,
  • a user callstack,
  • a node in the thread tree with parallel region ID's and OpenMP thread ID's along the path, and
  • a node in the task tree with task ID's along the path.

Because the execution of an OpenMP task may be asynchronous, and the executing thread may be different from the encountering thread, getting the user callstack of an event that happens within a task becomes tricky.

At our Sun booth in SC08, we demoed a prototype Performance Analyzer that can present user callstacks in a cool way when OpenMP tasks are involved.

Take a simple quick sort code for an example.

        void quick_sort ( int lt,  int rt,  float *data )  {
            if ( lt >= rt )  return;    /* nothing to sort */
            int md = partition( lt,  rt,  data );
            #pragma omp task
            quick_sort( lt,  md - 1,  data );
            #pragma omp task
            quick_sort( md + 1,  rt,  data );
        }

The following figure shows the time line display of one execution of the program. The same original data are sorted three times, once sequential, once using two threads, and once using four threads.

The spikes in the callstacks in the sequential sort show the recursive nature of quick sort. But when you look at the parallel sort, the callstacks are flat. That is because each call to quick_sort() is now a task, and tasking execution essentially turns the recursive execution into a work-list execution. The low-level callstack in the above figure shows close to what actually happens in one thread.

While these pieces of information are useful in showing the execution details, they do not help answer which tasks are actually being executed. Where was the currently executing task created? In the end, the user needs to debug the performance problem in his/her code (not in the OpenMP runtime). Representing information close to the user's program logic is crucial.

The following figure shows the time line and user callstacks in the user view constructed by our prototype tool. Notice that the callstacks in the parallel run are almost the same as in the sequential run. In the time line, it is as if the work in the sequential run is being distributed among the threads in the parallel run. Isn't this what intuitively happens when you parallelize a code using OpenMP? :)

Tuesday Nov 18, 2008

Sun Studio Express 11/08

Sun Studio Express 11/08 is now out and can be downloaded for free.

Among the many interesting and important features it provides, here are a few I would like to highlight:

  • It now supports (besides Solaris) RHEL 5, SuSE 10, Ubuntu 8.04 and CentOS 5.1.
  • It has full OpenMP 3.0 compiler support.
  • Performance of OpenMP tasking has been improved.
  • It was used to deliver a new World Record SPECompM2001 score for all 16-thread x86 systems [Sun Blade X6440 (4 x AMD Opteron "Shanghai" 8384 chips, 16 cores, 4 cores/chip, 16 threads) SPECompM2001 - 35,896].

Wednesday Jul 30, 2008

New LEGO in Store: Sun Studio Express 07/08 with OpenMP 3.0 Support

Today, we are making available, as a free download, Sun Studio Express 07/08 Release. One of the most exciting things about this release is the beta-level support for OpenMP 3.0 in our C/C++/Fortran Compilers.

I am really excited about this. One of the major 3.0 features supported is tasking, which was finalized in the language specification after a looooong labor. It adds a whole new dimension to what OpenMP can do. It is like a new piece of LEGO. We are looking forward to seeing innovative (or not :)) ways of using this new feature.

This is a functional beta release. We are still working on fixing a few bugs and improving performance. One of the best ways to give us feedback is using our online forum.

Here are two short articles that may help users jump-start using the tasking feature.

Tuesday Jan 08, 2008

Gulf of Execution

Gulf of Execution is a term used to describe the difference between the steps one actually needs to take to achieve a goal and the steps that one perceives.

After learning this term, the example that quickly comes to mind is setting up wifi-enabled devices, like the Wii, PSP, NDS, wireless gateways, etc. In my experience, the one with the narrowest gap is the iPhone. The worst one is, well, a certain operating system.

Michael G Schwern had a blog entry about this on Perl.

How wide is the Gulf in your favorite parallel programming language/model/scheme/library?

Tuesday Nov 20, 2007

Think in Parallel or Not

Dr. John Shalf posted his view on Prof. Wen-mei Hwu's IEEE MICRO-39 paper. Dr. Shalf discusses the importance of and advances in parallel algorithms, and his view on the programming model for parallel algorithms.

Saturday Nov 17, 2007

Non-concurrency Analysis Used in PTP

When visiting the IBM booth at SC07, I was a little surprised to learn that the non-concurrency analysis technology for OpenMP programs had also been adopted and implemented in the Parallel Tools Platform.

Beth Tibbitts from IBM has kindly sent me the reference details: STMCS'07 program, paper, and presentation.

The technology is used by the Sun Studio compilers to do static error checking of OpenMP programs.

Friday Nov 09, 2007

Maximum Automation for Mundane Tasks

Adam Kolawa (Parasoft) said in his recent article on DDJ,

"Many people ... want tools to find these bugs automatically. After 20 years of examining how and why errors occur, I believe this is the wrong response. Only a small class of errors can be found automatically; most bugs are related to functionality and requirements, and cannot be identified with just the click of a button."

and

"Our current mission is to address this problem by inventing technologies and strategies to support the brain as it performs this evaluation. We are building automated infrastructures that provide maximum automation for mundane tasks (compiling code, building/running regression test suites, checking adherence to policies, supporting code reviews, and so on) in such a way that each day the brain is presented with the minimal information needed to determine if yesterday's code modifications negatively impacted the application."

There is probably no magic button one can push to turn a piece of legacy code that is not thread-safe into thread-safe code. A tool should offload the mundane tasks from the human brain, which can then be set free to apply the magic touch.

Saturday Sep 22, 2007

Some History and Background Information about Threading on Unix-like Systems

The C10K problem refers to the problem of serving ten thousand clients simultaneously on a web server. This article written by Daniel Kegel contains some history and background information (and lots of links) about threading on Linux, Solaris, BSD, Mac OS X, etc.

Thursday Sep 20, 2007

Common Mistakes in Using OpenMP 5: Assuming Non-existing Synchronization Before Entering Worksharing Construct

There is no synchronization among the threads in a team when they enter a worksharing construct. Many people assume there is a barrier before the threads enter a worksharing construct, especially when a FIRSTPRIVATE is used in the worksharing construct. This is a common mistake.

For example, in the following code, assume two threads - thread 1 and thread 2 are in the team, and Read1 is executed by thread 1 and Read2 is executed by thread 2.

  #pragma omp parallel
  {
     if (omp_get_thread_num()==0)
        z = 1;
     else
        z = 2;
     #pragma omp sections firstprivate(z)
     {
       #pragma omp section
       {
          ... = z;      // Read1
       }
       #pragma omp section
       {
          ... = z;      // Read2
       }
     }
  }

What are the values of z at Read1 and Read2? All three of the following combinations are possible:

  1. Read1:1 Read2:1
  2. Read1:1 Read2:2
  3. Read1:2 Read2:2

If there were a synchronization before the worksharing construct, then combination 2 (Read1:1, Read2:2) would not be possible.

Now, look at the following example which has both FIRSTPRIVATE and LASTPRIVATE,

  #pragma omp parallel
  {
     z = 1;
     #pragma omp for firstprivate(z) lastprivate(z) nowait
     for (i=0; i<n; i++) {
          ... = z;      // Read1
          z = 2;        // Write1
     }
  }

What could the value of z be at Read1? Could it be 2? The OpenMP 3.0 draft has clarified this situation. It says:

If a list item appears in both firstprivate and lastprivate clauses, the update required for lastprivate occurs after all initializations for firstprivate.

So, the value of z at Read1 cannot be 2.

Friday Oct 13, 2006

A Presentation of DRDT at SVOSUG 9/28/2006

Here is a short presentation of DRDT at the Silicon Valley OpenSolaris Users Group (SVOSUG) meeting on Sept 28, 2006.

Friday Aug 04, 2006

Understanding Data Races 3: Several currently available tools (2/3)

Runtime Checking (simulation based) Tool

Tool 3: Helgrind from Valgrind

In this blog entry, I will describe my experiments with the test cases under Helgrind. Helgrind is the data race detection module of Valgrind, a quite successful framework and tool suite for debugging and profiling Linux programs.

Unlike the other runtime checking tools I will describe later (e.g. Intel's Thread Checker and Sun's DRDT), Valgrind is simulation based. One advantage of the simulation based approach is that the two active entities - the target application and the detection module - are in different processes. They have different address spaces and name spaces. Therefore this approach can avoid many conflicts between the two entities. For example, the detection module can call any library routine that it monitors without worrying about re-entry problems. [Update: Valgrind actually runs in the same namespace as the target application, and the target application and the detection module are part of the same process. The detection module (core and tools) is designed carefully to avoid depending on glibc.so.] One challenge of the simulation based approach is dealing with system calls: it simulates only the execution of the user process, NOT the OS. An even bigger challenge is dealing with threading calls. Valgrind is not multi-threaded itself, and all threaded execution is serialized. I have not had a chance to study how this works; it must be very interesting. Valgrind's manual claims it works with NPTL or LinuxThreads "well enough for significant threaded applications".

Helgrind is based on the famous Eraser method, enhanced with detection of thread creation and thread join. The method is very similar to that used in Compaq/HP's Visual Threads (as described in Harrow's paper). Lockset based methods (such as Eraser) tend to produce a lot of false positives.

Currently Valgrind is at release 3.2.0, but the latest version in which Helgrind works is 2.2.0. When I ran Helgrind in 3.2.0, I got

Helgrind is currently not working, because:
 (a) it is not yet ready to handle the Vex IR and the use with 64-bit
     platforms introduced in Valgrind 3.0.0
 (b) we need to get thread operation tracking working again after
     the changes added in Valgrind 2.4.0
 If you want to use Helgrind, you'll have to use Valgrind 2.2.0, which is
 the most recent Valgrind release that contains a working Helgrind.

Sorry for the inconvenience.  Let us know if this is a problem for you.

Then I switched to 2.2.0. First I tried pthr_prime.c.

$ cc -g pthr_prime.c -lm -lpthread -o pthr_prime

$ valgrind --tool=helgrind ./pthr_prime

==32368== Helgrind, a data race detector for x86-linux.
==32368== Copyright (C) 2002-2004, and GNU GPL'd, by Nicholas Nethercote et al.
==32368== Using valgrind-2.2.0, a program supervision framework for x86-linux.
==32368== Copyright (C) 2000-2004, and GNU GPL'd, by Julian Seward et al.
==32368== For more details, rerun with: -v
==32368== 
==32368== Thread 2:
==32368== Possible data race writing variable at 0x80498B0 (total)
==32368==    at 0x80486BD: work (pthr_prime.c:51)
==32368==    by 0x1D4AFE79: thread_wrapper (vg_libpthread.c:867)
==32368==    by 0xB0010EF3: (within /home/yl140942/vg2/lib/valgrind/stage2)
==32368==  Address 0x80498B0 is in data section of /home/yl140942/tmp/vg/a.out
==32368==  Previous state: shared RO, no locks
==32368== 
==32368== Possible data race writing variable at 0x57EFE95C 
==32368==    at 0x804877D: main (pthr_prime.c:75)
==32368==  Address 0x57EFE95C == &(i) at pthr_prime.c:75
==32368==  Previous state: shared RO, no locks
==32368== 
==32368== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 2 from 2)
==32368== 4 possible data races found; 0 lock order problems

Helgrind finds the race access of total at line 51 and the race access of i at line 75. Note that a data race is caused by a pair of accesses, while Helgrind reports only one access of the pair. For the first one, the report is OK because line 51 reads and updates total, so it is fairly easy to guess what the racing access pairs are. For the second one (i at line 75), I imagine it would take a fairly large amount of time to figure out that the other access of the pair is at line 46. Helgrind also misses several data races (e.g. the write-write race at line 50, and the write-read race between lines 50 and 76) due to the heuristic it uses.

Next, I tried with pthr_prime_fixed.c.

$ cc -g pthr_prime_fixed.c -lm -lpthread -o pthr_prime_fixed
$ valgrind --tool=helgrind ./pthr_prime_fixed

==21596== Helgrind, a data race detector for x86-linux.
==21596== Copyright (C) 2002-2004, and GNU GPL'd, by Nicholas Nethercote et al.
==21596== Using valgrind-2.2.0, a program supervision framework for x86-linux.
==21596== Copyright (C) 2000-2004, and GNU GPL'd, by Julian Seward et al.
==21596== For more details, rerun with: -v
==21596==
==21596== Thread 2:
==21596== Possible data race writing variable at 0x804CA10 (pflag+16)
==21596==    at 0x80486DC: is_prime (pthr_prime_fixed.c:34)
==21596==    by 0x8048756: work (pthr_prime_fixed.c:50)
==21596==    by 0x1D4EAE79: thread_wrapper (vg_libpthread.c:867)
==21596==    by 0xB0010EF3: (within /home/yl140942/vg2/lib/valgrind/stage2)
==21596==  Address 0x804CA10 is in BSS section of /home/yl140942/tmp/vg/pthr_prime_fixed
==21596==  Previous state: exclusively owned by thread 1
==21596==
==21596== Thread 2:
==21596== Possible data race writing variable at 0x804CA18 (pflag+24)
==21596==    at 0x80486DC: is_prime (pthr_prime_fixed.c:34)
==21596==    by 0x8048756: work (pthr_prime_fixed.c:50)
==21596==    by 0x1D4EAE79: thread_wrapper (vg_libpthread.c:867)
==21596==    by 0xB0010EF3: (within /home/yl140942/vg2/lib/valgrind/stage2)
==21596==  Address 0x804CA18 is in BSS section of /home/yl140942/tmp/vg/pthr_prime_fixed
==21596==  Previous state: exclusively owned by thread 1

<similar messages repeated for various pflag+offset>

==21596== Thread 2:
==21596== Possible data race writing variable at 0x804CAC0 (pflag+192)
==21596==    at 0x80486DC: is_prime (pthr_prime_fixed.c:34)
==21596==    by 0x8048756: work (pthr_prime_fixed.c:50)
==21596==    by 0x1D4EAE79: thread_wrapper (vg_libpthread.c:867)
==21596==    by 0xB0010EF3: (within /home/yl140942/vg2/lib/valgrind/stage2)
==21596==  Address 0x804CAC0 is in BSS section of /home/yl140942/tmp/vg/pthr_prime_fixed
==21596==  Previous state: exclusively owned by thread 1
==21596==
==21596==
==21596== Possible data race reading variable at 0x80499B0 (total)
==21596==    at 0x8048873: main (pthr_prime_fixed.c:80)
==21596==  Address 0x80499B0 is in data section of /home/yl140942/tmp/vg/pthr_prime_fixed
==21596==  Previous state: shared RW, locked by:0x80499B4(mutex)
==21596==
==21596== ERROR SUMMARY: 33 errors from 33 contexts (suppressed: 2 from 2)
==21596== 35 possible data races found; 0 lock order problems

This time Helgrind reports 32 race accesses of pflag[] at line 34. As explained in the DRDT tutorial, these are benign data races. Helgrind also reports a false positive: a race involving the access of total at line 80.

Helgrind does a good job of reporting the names of the variables involved in the data races (e.g. total, pflag[] and i) and the lock variables (e.g. mutex). The Previous state line gives a hint as to why Helgrind thinks an access might cause a data race. For example, in the above experiment with pthr_prime_fixed.c, for the access of total at line 80 it says "Previous state: shared RW, locked by:0x80499B4(mutex)". The accesses of total at lines 52-53 are protected by the mutex lock. When Helgrind finds that the read access of total is not protected by the same lock (or any lock in this case), it reports a possible data race. The detection of pthread_join sometimes did not work to get rid of this false positive, though.

Thursday Jul 06, 2006

Understanding Data Races 2: Several currently available tools (1/3)

Introduction

This blog entry begins to describe several currently available tools that detect data races in multi-threaded C/C++/Fortran programs. These tools, and the categories they can be roughly put into, are

  1. Static Checking
    • LockLint from Sun
    • vpara compile time check for OpenMP programs from Sun
  2. Runtime Checking - simulation based
    • Helgrind from Valgrind
  3. Runtime Checking - execution based
    • Visual Threads from HP
    • Thread Checker from Intel
    • Data Race Detection Tool from Sun

Not covered here are tools from research work. Some of them use combined static and runtime methods, and some use post-mortem based approaches.

Code Examples

I will reuse the following four code examples from the Tutorial of Using Sun Data Race Detection Tool. If you have downloaded and installed the Sun Studio Express June 2006, you should be able to find the example codes under

{installed-directory}/opt/SUNWspro/examples/rdt/prime.

All four codes find the prime numbers between 2 and 3000 using 4 threads. An OpenMP version and a Pthread version are provided,

  1. omp_prime.c: OpenMP version, contains data races
  2. omp_prime_fixed.c: OpenMP version, bugs fixed
  3. pthr_prime.c: Pthread version, contains data races and bugs
  4. pthr_prime_fixed.c: Pthread version, bugs fixed

Read the Tutorial to find out what the data races are and how the bugs are fixed.

omp_prime.c
    ...
    12	#include <stdio.h>
    13	#include <math.h>
    14	#include <omp.h>
    15	
    16	#define THREADS 4
    17	#define N 3000
    18	
    19	int primes[N];
    20	int pflag[N];
    21	
    22	int is_prime(int v)
    23	{
    24	    int i;
    25	    int bound = floor(sqrt(v)) + 1;
    26	    
    27	    for (i = 2; i < bound; i++) {
    28	        /* no need to check against known composites */
    29	        if (!pflag[i]) 
    30	            continue;
    31	        if (v % i == 0) { 
    32	            pflag[v] = 0;
    33	            return 0;
    34	        }
    35	    }
    36	    return (v > 1); 
    37	}
    38	
    39	int main(int argn, char **argv)
    40	{
    41	    int i;
    42	    int total = 0;
    43	
    44	#ifdef _OPENMP
    45	    omp_set_num_threads(THREADS);
    46	    omp_set_dynamic(0);
    47	#endif
    48	
    49	    for (i = 0; i < N; i++) {
    50	        pflag[i] = 1; 
    51	    }
    52	    
    53	    #pragma omp parallel for
    54	    for (i = 2; i < N; i++) {
    55	        if ( is_prime(i) ) {
    56	            primes[total] = i;
    57	            total++;
    58	        }
    59	    }
    60	    printf("Number of prime numbers between 2 and %d: %d\n",
    61	           N, total);
    62	    for (i = 0; i < total; i++) {
    63	        printf("%d\n", primes[i]);
    64	    }
    65	}
pthr_prime.c
    ...
    12	#include <stdio.h>
    13	#include <math.h>
    14	#include <pthread.h>
    15	
    16	#define THREADS 4
    17	#define N 3000
    18	
    19	int primes[N];
    20	int pflag[N];
    21	int total = 0;
    22	
    23	int is_prime(int v)
    24	{
    25	    int i;
    26	    int bound = floor(sqrt(v)) + 1;
    27	
    28	    for (i = 2; i < bound; i++) {
    29	        /* no need to check against known composites */
    30	        if (!pflag[i])
    31	            continue;
    32	        if (v % i == 0) {
    33	            pflag[v] = 0;
    34	            return 0;
    35	        }
    36	    }
    37	    return (v > 1); 
    38	}
    39	
    40	void *work(void *arg)
    41	{
    42	    int start;
    43	    int end;
    44	    int i;
    45	
    46	    start = (N/THREADS) * (*(int *)arg);
    47	    end = start + N/THREADS;
    48	    for (i = start; i < end; i++) {
    49	        if ( is_prime(i) ) {
    50	            primes[total] = i;
    51	            total++;        
    52	        }
    53	    }
    54	    return NULL;
    55	}
    56	
    57	int main(int argn, char **argv)
    58	{
    59	    int i;
    60	    pthread_t tids[THREADS-1];
    61	
    62	    for (i = 0; i < N; i++) {
    63	        pflag[i] = 1; 
    64	    }
    65	
    66	    for (i = 0; i < THREADS-1; i++) {
    67	        pthread_create(&tids[i], NULL, work, (void *)&i);
    68	    }
    69	
    70	    i = THREADS-1;
    71	    work((void *)&i);
    72	    
    73	    printf("Number of prime numbers between 2 and %d: %d\n",
    74	           N, total);
    75	    for (i = 0; i < total; i++) {
    76	        printf("%d\n", primes[i]);
    77	    }
    78	}
omp_prime_fixed.c
    ...
    12	#include <stdio.h>
    13	#include <math.h>
    14	#include <omp.h>
    15	
    16	#define THREADS 4
    17	#define N 3000
    18	
    19	int primes[N];
    20	int pflag[N];
    21	
    22	int is_prime(int v)
    23	{
    24	    int i;
    25	    int bound = floor(sqrt(v)) + 1;
    26	
    27	    for (i = 2; i < bound; i++) {
    28	        /* no need to check against known composites */
    29	        if (!pflag[i])
    30	            continue;
    31	        if (v % i == 0) {
    32	            pflag[v] = 0;
    33	            return 0;
    34	        }
    35	    }
    36	    return (v > 1);
    37	}
    38	
    39	int main(int argn, char **argv)
    40	{
    41	    int i;
    42	    int total = 0;
    43	
    44	#ifdef _OPENMP
    45	    omp_set_num_threads(THREADS);
    46	    omp_set_dynamic(0);
    47	#endif
    48	
    49	    for (i = 0; i < N; i++) {
    50	        pflag[i] = 1;
    51	    }
    52	
    53	    #pragma omp parallel for
    54	    for (i = 2; i < N; i++) {
    55	        if ( is_prime(i) ) {
    56	            #pragma omp critical
    57	            {
    58	                primes[total] = i;
    59	                total++;
    60	            }
    61	        }
    62	    }
    63	    printf("Number of prime numbers between 2 and %d: %d\n",
    64	           N, total);
    65	    for (i = 0; i < total; i++) {
    66	        printf("%d\n", primes[i]);
    67	    }
    68	}
pthr_prime_fixed.c
    ...
    12	#include <stdio.h>
    13	#include <math.h>
    14	#include <pthread.h>
    15	
    16	#define THREADS 4
    17	#define N 3000
    18	
    19	int primes[N];
    20	int pflag[N];
    21	int total = 0;
    22	pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    23	
    24	int is_prime(int v)
    25	{
    26	    int i;
    27	    int bound = floor(sqrt(v)) + 1;
    28	
    29	    for (i = 2; i < bound; i++) {
    30	        /* no need to check against known composites */
    31	        if (!pflag[i])
    32	            continue;
    33	        if (v % i == 0) {
    34	            pflag[v] = 0;
    35	            return 0;
    36	        }
    37	    }
    38	    return (v > 1); 
    39	}
    40	
    41	void *work(void *arg)
    42	{
    43	    int start;
    44	    int end;
    45	    int i;
    46	    
    47	    start = (N/THREADS) * ((int)arg);
    48	    end = start + N/THREADS;
    49	    for (i = start; i < end; i++) {
    50	        if ( is_prime(i) ) {
    51	            pthread_mutex_lock(&mutex);
    52	            primes[total] = i;
    53	            total++;        
    54	            pthread_mutex_unlock(&mutex);
    55	        }
    56	    }
    57	    return NULL;
    58	}
    59	
    60	int main(int argn, char **argv)
    61	{
    62	    int i;
    63	    pthread_t tids[THREADS-1];
    64	
    65	    for (i = 0; i < N; i++) {
    66	        pflag[i] = 1; 
    67	    }
    68	
    69	    for (i = 0; i < THREADS-1; i++) {
    70	        pthread_create(&tids[i], NULL, work, (void *)i);
    71	    }
    72	
    73	    i = THREADS-1;
    74	    work((void *)i);
    75	    
    76	    for (i = 0; i < THREADS-1; i++) {
    77	        pthread_join(tids[i], NULL);
    78	    }
    79	
    80	    printf("Number of prime numbers between 2 and %d: %d\n",
    81	           N, total);
    82	    for (i = 0; i < total; i++) {
    83	        printf("%d\n", primes[i]);
    84	    }
    85	}

Static Checking Tools

Static checking tools find data races in a program without actually executing the program.

The static checking approach has three advantages compared with runtime based approaches.

  1. It can be very fast and consume little memory.
  2. The analysis does not affect the behavior of program because it is performed offline.
  3. It can detect potential data races that do not happen in a particular run with a particular input data set.

Because of the above advantages, static checking can be used in situations where it is very difficult or impossible to get a runtime experiment, or to get a precise runtime experiment without altering the runtime behavior, such as with OS kernels and device drivers.

The biggest disadvantage of static checking is the large number of false positives it may generate. Static checking is always hampered by imprecise information due to pointer aliasing and uncertain execution paths.

Tool 1: LockLint from Sun

Sun Studio provides a utility called LockLint, which analyzes the use of mutex and reader/writer locks, and reports data races and deadlocks due to inconsistent use of locking techniques.

LockLint reports a data race when accesses to a variable are not consistently protected by at least one lock, or accesses violate assertions about which locks protect the variable.

LockLint originated from WARLOCK, which was designed to detect data races and deadlocks in Solaris kernels and device drivers. Search for warlock on opensolaris.org, and you can still find it in use there.

The following shows the result of using LockLint on pthr_prime.c. Notice the false positive at line 63 and the false negative with respect to variable i.

$ cc -mt -Zll pthr_prime.c
$ lock_lint start
$ lock_lint load pthr_prime.ll
$ lock_lint analyze -v

* Warning: A main function was loaded with no annotations to indicate the
      presence or absence of concurrency. Lock_lint will assume concurrency.
  Please annotate source with:
      NOTE(COMPETING_THREADS_NOW) or NOTE(NO_COMPETING_THREADS_NOW)

* Writable variable read while no locks held!
  variable = :pflag
     where = :is_prime [pthr_prime.c,30]

* Variable written while no locks held!
  variable = :pflag
     where = :is_prime [pthr_prime.c,33]

* Variable written while no locks held!
  variable = :pflag
     where = :main [pthr_prime.c,63]

* Writable variable read while no locks held!
  variable = :total
     where = :main [pthr_prime.c,74]

* Writable variable read while no locks held!
  variable = :total
     where = :main [pthr_prime.c,75]

* Writable variable read while no locks held!
  variable = :primes
     where = :main [pthr_prime.c,76]

* Writable variable read while no locks held!
  variable = :total
     where = :main [pthr_prime.c,77]

* Writable variable read while no locks held!
  variable = :total
     where = :work [pthr_prime.c,50]

* Variable written while no locks held!
  variable = :primes
     where = :work [pthr_prime.c,50]

* Variable written while no locks held!
  variable = :total
     where = :work [pthr_prime.c,51]

The following shows the result of using LockLint on pthr_prime_fixed.c. Notice that the data races in routine work() are now gone, but the false positives and the false negatives in the previous experiment with pthr_prime.c are still there.

$ cc -mt -Zll pthr_prime_fixed.c
$ lock_lint start
$ lock_lint load pthr_prime_fixed.ll
$ lock_lint analyze -v

* Warning: A main function was loaded with no annotations to indicate the
      presence or absence of concurrency. Lock_lint will assume concurrency.
  Please annotate source with:
      NOTE(COMPETING_THREADS_NOW) or NOTE(NO_COMPETING_THREADS_NOW)

* Writable variable read while no locks held!
  variable = :pflag
     where = :is_prime [pthr_prime_fixed.c,31]

* Variable written while no locks held!
  variable = :pflag
     where = :is_prime [pthr_prime_fixed.c,34]

* Variable written while no locks held!
  variable = :pflag
     where = :main [pthr_prime_fixed.c,66]

* Writable variable read while no locks held!
  variable = :total
     where = :main [pthr_prime_fixed.c,81]

* Writable variable read while no locks held!
  variable = :total
     where = :main [pthr_prime_fixed.c,82]

* Writable variable read while no locks held!
  variable = :primes
     where = :main [pthr_prime_fixed.c,83]

* Writable variable read while no locks held!
  variable = :total
     where = :main [pthr_prime_fixed.c,84]

LockLint provides a rich set of source code annotations and interactive subcommands that can be used to give LockLint more precise information and so improve the analysis.

Tool 2: vpara option in Sun Studio Fortran/C compilers

Strictly speaking, this is not a tool. It is a compile-time checking option provided in the Sun Studio Fortran and C compilers. The following is from the man page of the cc command.

     -xvpara
          Show parallelization warning messages

          Issues warnings about potential parallel programming
          related problems that may cause incorrect results when
          using OpenMP or Sun/Cray parallel directives and prag-
          mas.

          Use with -xopenmp and OpenMP API directives, or with
          -explictpar and MP parallelization directives.

          Warnings are issued when the compiler detects the fol-
          lowing situations:

          o Loops that are parallelized using MP directives when
          there are data dependencies between different loop
          iterations

          o Problematic use of OpenMP data sharing attributes
          clauses, such as declaring a variable "shared" whose
          accesses in an OpenMP parallel region may cause data
          race, or declaring a variable "private" whose value in
          a parallel region is used after the parallel region.

In short, when -xvpara is used to compile an OpenMP program, the compiler is able to report problems in the source code caused by incorrect use of data sharing attribute clauses. One typical problem is a data race introduced by incorrectly declaring a variable "shared".

When using vpara checking on omp_prime.c, the compiler finds the data race between the write accesses to variable total at line 57 by different threads, as illustrated below. The checking analyzes only the code lexically enclosed inside an OpenMP parallel region, therefore it does not find the data races in routine is_prime(). The checking also misses the data race on array primes[] because of a technique used to reduce false positives. Unfortunately, that technique introduces a false negative here.

$ cc -xopenmp -xO3 -xvpara omp_prime.c -lm
"omp_prime.c", line 53: Warning: inappropriate scoping
        variable 'total' may be scoped inappropriately as 'shared'
        . write at line 57 and write at line 57 may cause data race

$ cc -xopenmp -xO3 -xvpara omp_prime_fixed.c -lm
$

The vpara compile-time checking is based on the static non-concurrency analysis techniques for OpenMP programs, which are also used by the OpenMP autoscoping feature provided in Sun Studio compilers.

Friday Jun 30, 2006

Understanding Data Races 1: the Role of Data Race Detection Tools

This is the first of a series of blogs on understanding data races I am going to post.


With the release of Sun Studio Express (June 2006 Build), we are offering a run-time data race detection tool (DRDT) for developers on Sun's platforms for FREE. It complements the other data race detection tools Sun already offers.

If you have been bugged by data race problems in the past, you should give it a try. Go here (scroll to 'How to get started') to download it. And here is the page dedicated to the DRDT project.


I would like to start the series with understanding the role data race detection tools first.

Many multi-threaded (mt) programs have race conditions, the existence of which makes debugging mt programs very hard. One class of race conditions is the data race condition, or data race. (The difference between a general race condition and a data race condition will be explained in another blog.)

A data race is a condition that happens in a program. People often think a data race is always a bug. This is not true. A data race could be the root cause of a bug; it could be caused by a bug; or it could be there because the programmer wants it there.

If a data race is the root cause of a bug, we want to find it. If a data race is caused by a bug, showing where the data race is can help the programmer locate the real bug. If a data race is there by design, we want to make sure it is there and we also want to make sure there is no unexpected data race.

The role of a data race detection tool is to check whether a program contains data races and pin-point the locations of them if there is any.

There are many ways of using a data race detection tool. Some use it as a debugging tool: run it when there is a bug in the program. Some use it as a sanity checking tool: run it as part of regression tests. And some use it as a programming assistance tool when parallelizing sequential programs: find thread-unsafe routines and global variables that should be private to threads.

Thursday Jun 29, 2006

The idea behind environment variable "SUNW_MP_MAX_POOL_THREADS"

Sun's OpenMP implementation supports true nested parallel regions - when nested parallelism is enabled, the inner parallel region can be executed by multiple threads concurrently.

We provide an environment variable called SUNW_MP_MAX_POOL_THREADS for users to control the total number of OpenMP slave threads in a process.

For example, if you want a maximum of 16 threads to be used for a nest of parallel regions in your program, you can set SUNW_MP_MAX_POOL_THREADS to 15. That's 15 slave threads (some of them may become masters of inner parallel regions) plus one user thread, which is the master thread of the outermost parallel region.

Why didn't we design an environment variable like SUNW_MP_MAX_NUM_THREADS so that a user could simply set it to 16 in the above example? Intel's implementation has KMP_ALL_THREADS and KMP_MAX_THREADS, which do that.

Well, we were trying to have a scheme that works in more general cases, not just pure OpenMP codes. In particular, we think our scheme works better than others for code that mixes pthreads and OpenMP threads. The pool defines a set of threads that can be used as OpenMP slave threads. If the program has two pthreads and both create a team, then both will try to grab slave threads from the same pool. The env var SUNW_MP_MAX_POOL_THREADS was NOT designed for users to control the total number of threads in a process. We cannot control that because of the use of pthreads. The env var is designed for users to control the total number of OpenMP slave threads.
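As a usage sketch (assuming a program a.out built with a Sun Studio compiler), the pool size is set in the environment before launching the program:

```shell
# Allow at most 15 OpenMP slave threads in the pool; together with the
# initial master thread, that is 16 threads total for a nest of
# parallel regions. OMP_NESTED enables nested parallelism.
SUNW_MP_MAX_POOL_THREADS=15 OMP_NESTED=TRUE ./a.out
```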

The env var SUNW_MP_MAX_POOL_THREADS is documented here. We also have a short article "How Many Threads Does It Take?" if you want to understand it better.

Sunday Jun 11, 2006

Common Mistakes in Using OpenMP 4: Orphaned Worksharing Constructs

More precisely, this mistake should be classified as a common misunderstanding of OpenMP.

When a worksharing construct, such as omp for or omp sections, is encountered outside any explicit parallel region, the resulting worksharing region is called an orphaned worksharing region. A common misunderstanding is that in this case the worksharing construct is simply ignored and the region is executed sequentially.

Orphaned worksharing constructs are not ignored. All the data sharing attribute clauses are honored. The worksharing region is executed as if a team of only one thread were executing the region.

For example, in the following C++ code,

     int main() 
     {
         int i;
         class_type_1  a;
         #pragma omp for private(a) schedule(dynamic)
         for (i = 1; i < 100; i++) {
             printf("%d\n", i);
         } 
     } 

the default constructor for class_type_1 will be called (because a is private), and a conforming implementation is not forced to execute the loop iterations in the order 1, 2, 3, ..., 99.

Concurrency vs Parallelism, Concurrent Programming vs Parallel Programming

THIS BLOG HAS BEEN MOVED TO touchdreams.net/blog.

At the risk of hairsplitting, ...

Concurrency and parallelism are NOT the same thing. Two tasks T1 and T2 are concurrent if the order in which the two tasks are executed in time is not predetermined,

  • T1 may be executed and finished before T2,
  • T2 may be executed and finished before T1,
  • T1 and T2 may be executed simultaneously at the same instant of time (parallelism),
  • T1 and T2 may be executed alternately,
  • ...

If two concurrent threads are scheduled by the OS to run on one single-core non-SMT non-CMP processor, you may get concurrency but not parallelism. Parallelism is possible on multi-core, multi-processor or distributed systems.

Concurrency is often referred to as a property of a program, and is a concept more general than parallelism.

Interestingly, we cannot say the same thing for concurrent programming and parallel programming. They overlap, but neither is a superset of the other. The difference comes from the sets of topics the two areas cover. For example, concurrent programming includes topics like signal handling, while parallel programming includes topics like memory consistency models. The difference reflects the different original hardware and software backgrounds of the two programming practices.

Update: More on Concurrency vs Parallelism THIS BLOG HAS BEEN MOVED TO touchdreams.net/blog.

Wednesday Jun 07, 2006

Common Mistakes in Using OpenMP 3: Fifteen Cases from an IWOMP 2006 paper by Michael Süß and Claudia Leopold

The coming International Workshop on OpenMP (IWOMP 2006) has a paper titled "Common Mistakes in OpenMP and How to Avoid Them" written by Michael Süß and Claudia Leopold (University of Kassel, Germany).

The result is based on a survey of two undergraduate courses. The authors of the paper kindly allowed me to list the 15 common mistakes presented in their paper here,

  1. (Correctness) Access to shared variables not protected
  2. (Correctness) Use of locks without flush
  3. (Correctness) Read of shared variable without flush
  4. (Correctness) Forget to mark private variables as such
  5. (Correctness) Use of ordered clause without ordered construct
  6. (Correctness) Declare loop variable in #pragma omp parallel for as shared
  7. (Correctness) Forget to put down for in #pragma omp parallel for
  8. (Correctness) Try to change num. of thr. in parallel reg. after start of reg.
  9. (Correctness) omp_unset_lock() called from non-owner thread
  10. (Correctness) Attempt to change loop variable while in #pragma omp for
  11. (Performance) Use of critical when atomic would be sufficient
  12. (Performance) Put too much work inside critical region
  13. (Performance) Use of orphaned construct outside parallel region
  14. (Performance) Use of unnecessary flush
  15. (Performance) Use of unnecessary critical

For details, please read the full paper.

Sunday Jun 04, 2006

Read: "The Rise and Fall of CORBA"

The June 2006 issue (Vol 4, No 5) of ACM Queue features an article by Michi Henning of ZeroC on the rise and fall of CORBA.

Both technical issues and procedural issues contributed to the fall of CORBA, and the procedural problems are the root cause of the technical problems. Many of the issues the article points out are alarmingly familiar!

The following is a list of lessons learned about how to have a better standards process,

  • Standards consortia need iron-clad rules to ensure that they standardize existing best practice.
  • No standard should be approved without a reference implementation.
  • No standard should be approved without having been used to implement a few projects of realistic complexity.
  • Open source innovation usually is subject to a Darwinian selection process.
  • To create quality software, the ability to say "no" is usually far more important than the ability to say "yes".

Read the whole article.

Sunday May 28, 2006

Static Code Analysis Tools

News.com recently ran an article on companies making commercial static code analysis tools for checking security flaws.

Companies and products to watch: 

Most of them use context-sensitive, interprocedural, cross-module, and mixed-language analysis. A major difference between the analysis used in static error detection and the one used in compiler optimization is that the former can be incomplete and unsound.


Here is a link to a site that lists a collection of static analysis tools for C code.


Monday Feb 20, 2006

Common Mistakes in Using OpenMP 2: Atomic

The following code finds good members in array member[] and stores the indices of the good members in array good_members[].

#define N 1000

struct data member[N];

int good_members[N];

int pos = 0;

void find_good_members()
{
    int i;
    for (i = 0; i < N; i++) {
        if (is_good(member[i])) {
            good_members[pos] = i;
            pos++;
        }
    }
}

The following is a naive way of parallelizing the above code,


#define N 1000

struct data member[N];

int good_members[N];

int pos = 0;

void find_good_members()
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        if (is_good(member[i])) {
            good_members[pos] = i;   // line a
            #pragma omp atomic
            pos++;                   // line b
        }
    }
}

In order to avoid data races between different updates of the global variable pos, the code puts the increment (at line b) in an atomic construct. However, the code does not work, because there is a data race between the read of pos at line a and the write of pos at line b.

Changing the body of the if statement to the following gives the correct result.

    int mypos;
    #pragma omp critical
    {
        mypos = pos;
        pos++;
    }
    good_members[mypos] = i;

In OpenMP 2.5 (the latest Specification), inside a parallel region, the only place where you can safely get the value of a variable that is updated in an atomic region is another atomic region.

Friday Dec 30, 2005

Common Mistakes in Using OpenMP 1: Incorrect Directive Format

In C/C++, OpenMP directives are specified by using the #pragma mechanism; and in Fortran, they are specified by using special comments that are identified by unique sentinels.

This design allows users to write OpenMP programs that can be compiled with compilers that do not support OpenMP, or with OpenMP compilers that have OpenMP support disabled.

However, if you do not follow the directive format, you might get a program that compiles and runs but gives unexpected results, because the compiler does not recognize your OpenMP directives and thinks they are non-OpenMP related pragmas (C/C++) or regular comments (Fortran).

Quiz:

How many "me"s does the following code print? Assume a team of 4 threads is executing the parallel region.

foo() 
{
    #pragma omp parallel
    {
        #pragma single
        {
            printf("me\n");
        }
    }
}

Common Mistakes in Using OpenMP

I will post a list of common mistakes found in parallel programs written using OpenMP.

Although it is always true that users of a language need to spend effort to understand the language so as to avoid mistakes, I wonder what it means to the language designers if many, many users keep making the same set of mistakes again and again.

Sunday Dec 25, 2005

Must Read: ACM Queue Microprocessors issue (9-2005)

The following articles from the ACM Queue Microprocessors issue (vol. 3, no. 7 - September 2005) are must reads.

Multicore CPUs for the Masses
Mache Creeger, Emergent Technology Associates

Software and the Concurrency Revolution
Herb Sutter and James Larus, Microsoft

The Price of Performance
Luiz André Barroso, Google

Extreme Software Scaling
Richard McDougall, Sun Microsystems

The Future of Microprocessors
Kunle Olukotun and Lance Hammond, Stanford University

About

yuanlin
