Saturday Jan 03, 2009

New Post @ Thread Cleanup Handlers, exit(), _exit(), atexit(), and pthread_exit()

This new post describes what happens when a multi-threaded process shuts down: the differences between exit(), _exit() and pthread_exit(), and when and which cleanup routines are called. Shutting down a multi-threaded application gracefully and cleanly is a challenging task. Sometimes you do not even want to do that, but you still need to know what happens during the shutdown.
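As a minimal illustration (a sketch, not the full post), the following program runs each shutdown path in a forked child and checks the child's exit status to see whether the atexit() handler ran:

```c
#include <stdlib.h>
#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run f() in a child process; return the child's exit status. */
static int run_child(void (*f)(void)) {
    pid_t pid = fork();
    if (pid == 0) { f(); _exit(99); }   /* 99: f() returned unexpectedly */
    int status = 0;
    waitpid(pid, &status, 0);
    return WEXITSTATUS(status);
}

/* If this atexit() handler runs, it overrides the exit status with 42. */
static void handler(void) { _exit(42); }

static void via_exit(void)    { atexit(handler); exit(7); }   /* handlers run */
static void via_uexit(void)   { atexit(handler); _exit(7); }  /* handlers skipped */
static void via_pth_exit(void){ atexit(handler); pthread_exit(NULL); }
/* pthread_exit() in the last remaining thread ends the process as if by
   exit(0), so the atexit() handlers still run -- but only then. */
```

So exit() (status 42 here) runs the atexit() handlers, _exit() (status 7) bypasses them, and pthread_exit() from the last thread behaves like exit(0).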

Monday Dec 29, 2008

Moving to

I am consolidating my blogs. Most of my entries here have been copied to

I will continue writing about parallel programming at

Sunday Dec 21, 2008

More on Concurrency vs Parallelism

A reader asked why concurrent programming is not a super-set of parallel programming, since parallel entities are also concurrent. Well, it is just like black-and-white vs color photography. Though black and white are colors too, the techniques for taking good black-and-white pictures are different from those for color pictures. One needs to think and see differently in terms of contrast, texture, lighting and even composition.

Now back to our programming world. Recently, while working on OpenMP profiling, I fixed a concurrency bug that was related to asynchronous signals and had nothing to do with parallelism. I used a data structure to store the OpenMP context of a thread. Since an OpenMP context can be described as a tuple <current parallel region, current task region, OpenMP state, user callstack>, the data structure has several 64-bit long fields.

One challenge is to update the context data structure atomically, i.e. when my program needs to report the OpenMP context, it should report a consistent context. For example, it should not report that a thread is in a new parallel region but still in an old task region. The atomicity here has nothing to do with parallelism - the context data is thread private, so there is no sharing between threads and no data race. The atomicity issue arises when a profiling signal (SIGPROF) comes while the program is in the middle of updating the fields of the context data structure. In the signal handler, the program needs to report the context, and it needs to report it consistently. In the end, I had to craft a way to update all the fields atomically (in an async-signal-safe way) without masking out SIGPROF.
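A sketch of one async-signal-safe pattern for this situation (double buffering with a single index flip; names invented for illustration, not necessarily what I did): the updater writes the whole tuple into the inactive buffer, then publishes it with one aligned store, so a SIGPROF handler that interrupts the update always reads a fully written tuple.

```c
#include <signal.h>
#include <stdint.h>

/* Hypothetical context tuple: <parallel region, task region, state, callstack> */
typedef struct {
    uint64_t parallel_region_id;
    uint64_t task_region_id;
    uint64_t omp_state;
    uint64_t callstack_id;
} omp_context_t;

static omp_context_t ctx_buf[2];
static volatile sig_atomic_t ctx_cur = 0;   /* index of the consistent buffer */

/* Writer: fill the inactive buffer, then flip the index.
   The single store to ctx_cur is the async-signal-safe commit point. */
void update_context(const omp_context_t *c) {
    int next = 1 - ctx_cur;
    ctx_buf[next] = *c;
    ctx_cur = next;
}

/* Called from the SIGPROF handler: sees either the old or the new tuple,
   never a half-updated mix of the two. */
omp_context_t read_context(void) {
    return ctx_buf[ctx_cur];
}
```

This works because the data is thread private: the handler interrupts the writer on the same thread, so it can never observe the buffer that is currently being filled.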

Here is another interesting discussion on concurrency vs parallelism. I checked the manual. The exact wording used is "The maximum number of active threads per multiprocessor is 768".

Friday Nov 21, 2008

Extending the OpenMP profiling API for OpenMP 3.0

Last Tuesday at the OpenMP BOF of SC08, Oleg Mazurov presented our work on extending the OpenMP profiling API for OpenMP 3.0 (pdf slides).

The existing API was first published in 2006 and was last updated in 2007. Since then, two developments have come to beg for another update - one is supporting the new OpenMP tasking feature, and the other is supporting vendor-specific extensions.

The extension for tasking support is straightforward. A few events corresponding to the creation, execution, and termination of tasks are added, as are a few requests to get the task ID and other properties.

Vendor-specific extensions are implemented essentially by sending an establish-extension request with a unique vendor ID from the collector tool to the OpenMP runtime library. The OpenMP runtime library accepts the request if it supports the vendor, and rejects it otherwise. After a successful rendezvous, the request establishes a new name space for subsequent requests and events.
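As a toy model of this rendezvous (all names here are invented for illustration; this is not the actual API): the runtime keeps a table of vendors it supports, and a successful request returns the base of a fresh name space for that vendor's subsequent request and event codes.

```c
#include <string.h>

#define NS_NONE 0   /* request rejected: no name space established */

/* Vendors this (imaginary) runtime knows how to support. */
static const char *supported_vendors[] = { "SUN", NULL };

/* Collector -> runtime: establish an extension for vendor_id.
   On success, return the base of a new name space; on failure, NS_NONE. */
int establish_extension(const char *vendor_id) {
    for (int i = 0; supported_vendors[i] != NULL; i++)
        if (strcmp(supported_vendors[i], vendor_id) == 0)
            return 0x1000 * (i + 1);   /* disjoint range per vendor */
    return NS_NONE;
}
```

The point of the design is simply that unrecognized vendors are rejected cleanly, while accepted ones get a range of codes that cannot collide with the standard ones.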

One pending issue is how to support multiple vendor agents in one session. It is not that a solution cannot be engineered; we are waiting for a use case to emerge.

During the execution of an OpenMP program, any arbitrary program event can be associated with

  • an OpenMP state,
  • a user callstack,
  • a node in the thread tree with parallel region ID's and OpenMP thread ID's along the path, and
  • a node in the task tree with task ID's along the path.

Because the execution of an OpenMP task may be asynchronous, and the executing thread may be different from the encountering thread, getting the user callstack of an event that happens within a task becomes tricky.

At our Sun booth in SC08, we demoed a prototype Performance Analyzer that can present user callstacks in a cool way when OpenMP tasks are involved.

Take a simple quick sort code as an example.

        void quick_sort ( int lt,  int rt,  float *data )  { 
            if ( lt >= rt ) return; 
            int md = partition( lt,  rt,  data ); 
            #pragma omp task 
            quick_sort( lt,  md - 1,  data ); 
            #pragma omp task 
            quick_sort( md + 1,  rt,  data ); 
        } 
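The excerpt leaves out the partition routine and the driver. A complete, runnable version (with an invented Lomuto-style partition(); the names and recursion cutoff are mine) would seed the recursion from a single thread inside a parallel region, whose implicit barrier guarantees all tasks have finished:

```c
/* Compile with an OpenMP-capable compiler (e.g. cc -xopenmp, gcc -fopenmp);
   without OpenMP the pragmas are ignored and the sort runs sequentially. */

static int partition(int lt, int rt, float *data) {
    float pivot = data[rt], t;
    int i = lt;
    for (int j = lt; j < rt; j++)
        if (data[j] < pivot) { t = data[i]; data[i] = data[j]; data[j] = t; i++; }
    t = data[i]; data[i] = data[rt]; data[rt] = t;  /* place the pivot */
    return i;
}

void quick_sort(int lt, int rt, float *data) {
    if (lt >= rt) return;                    /* recursion cutoff */
    int md = partition(lt, rt, data);
    #pragma omp task
    quick_sort(lt, md - 1, data);
    #pragma omp task
    quick_sort(md + 1, rt, data);
}

void sort(float *data, int n) {
    #pragma omp parallel
    #pragma omp single nowait
    quick_sort(0, n - 1, data);
    /* implicit barrier here: all outstanding tasks are done */
}
```

In a real code one would stop creating tasks below some subarray size and sort serially instead, since tiny tasks cost more to schedule than to run.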

The following figure shows the time line display of one execution of the program. The same original data are sorted three times: once sequentially, once using two threads, and once using four threads.

The spikes in the callstacks in the sequential sort show the recursive nature of the quick sort. But when you look at the parallel sort, the callstacks are flat. That's because each call to quick_sort() is now a task, and the tasking execution essentially changes the recursive execution into a work-list execution. The low-level callstack in the above figure shows close to what actually happens in one thread.

While these pieces of information are useful in showing the execution details, they do not help answer the question of which tasks are actually being executed. Where was the currently executing task created? In the end, the user needs to debug the performance problem in his/her code (not in the OpenMP runtime). Presenting information close to the user program logic is crucial.

The following figure shows the time line and user callstacks in the user view constructed by our prototype tool. Notice the callstacks in the parallel run are almost the same as in the sequential run. In the time line, it is just as if the work in the sequential run were being distributed among the threads in the parallel run. Isn't this what intuitively happens when you parallelize a code using OpenMP? :)

Tuesday Nov 18, 2008

Sun Studio Express 11/08

Sun Studio Express 11/08 is out now and can be downloaded for free.

Among the many interesting and important features it provides, here are a few I would like to highlight:

  • It now supports (besides Solaris) RHEL 5, SuSE 10, Ubuntu 8.04 and CentOS 5.1.
  • It has full OpenMP 3.0 compiler support.
  • Performance of OpenMP tasking has been improved.
  • It was used to deliver a new World Record SPECompM2001 score for all 16-thread x86 systems [Sun Blade X6440 (4 x AMD Opteron "Shanghai" 8384 chips, 16 cores, 4 cores/chip, 16 threads) SPECompM2001 - 35,896].

Wednesday Jul 30, 2008

New LEGO in Store: Sun Studio Express 07/08 with OpenMP 3.0 Support

Today, we are making available, as a free download, Sun Studio Express 07/08 Release. One of the most exciting things about this release is the beta-level support for OpenMP 3.0 in our C/C++/Fortran Compilers.

I feel really excited about this. One of the major 3.0 features supported is tasking, which was finalized in the language specification after a looooong labor. It opens up a whole new dimension of what OpenMP can do. It is like a new piece of LEGO. We are looking forward to seeing innovative (or not :)) ways of using this new feature.

This is a functional beta release. We are still working on fixing a few bugs and improving performance. One of the best ways to give us feedback is using our online forum.

Here are two short articles that may help users jump-start using the tasking feature.

Tuesday Jan 08, 2008

Gulf of Execution

Gulf of Execution is a term used to describe the difference between the steps one actually needs to take to achieve a goal and the steps that one perceives.

After learning this term, the example that quickly jumps into my mind is setting up those wifi-enabled devices, like Wii, PSP, NDS, Wireless gateway, etc. In my experience, the one with the narrowest gap is iPhone. The worst one is, well, some operating system.

Michael G Schwern had a blog entry about this on Perl.

How wide is the Gulf in your favorite parallel programming language/model/scheme/library?

Tuesday Nov 20, 2007

Think in Parallel or Not

Dr. John Shalf posted his view on Prof. Wen-mei Hwu's IEEE MICRO-39 paper. Dr. Shalf discusses the importance and advancement of parallel algorithms, and gives his view on the programming model for them.

Saturday Nov 17, 2007

Non-concurrency Analysis Used in PTP

When visiting the IBM booth at SC07, I was a little surprised to learn that the non-concurrency analysis technology for OpenMP programs had also been adopted and implemented in the Parallel Tools Platform.

Beth Tibbitts from IBM has kindly sent me the reference details: STMCS'07 program, paper, and presentation.

The technology is used by Sun Studio Compilers to do static error checking of OpenMP programs.

Friday Nov 09, 2007

Maximum Automation for Mundane Tasks

Adam Kolawa (Parasoft) said in his recent article on DDJ,

"Many people ... want tools to find these bugs automatically. After 20 years of examining how and why errors occur, I believe this is the wrong response. Only a small class of errors can be found automatically; most bugs are related to functionality and requirements, and cannot be identified with just the click of a button."


"Our current mission is to address this problem by inventing technologies and strategies to support the brain as it performs this evaluation. We are building automated infrastructures that provide maximum automation for mundane tasks (compiling code, building/running regression test suites, checking adherence to policies, supporting code reviews, and so on) in such a way that each day the brain is presented with the minimal information needed to determine if yesterday's code modifications negatively impacted the application."

There is probably no magic button one can push to turn a piece of legacy code that is not thread-safe into thread-safe code. A tool should offload the mundane tasks from the human brain, which can then be set free to apply the magic touch.

Sunday Jun 11, 2006

Concurrency vs Parallelism, Concurrent Programming vs Parallel Programming


At the risk of hairsplitting, ...

Concurrency and parallelism are NOT the same thing. Two tasks T1 and T2 are concurrent if the order in which they are executed in time is not predetermined:

  • T1 may be executed and finished before T2,
  • T2 may be executed and finished before T1,
  • T1 and T2 may be executed simultaneously at the same instant of time (parallelism),
  • T1 and T2 may be executed alternately (interleaved),
  • ...

If two concurrent threads are scheduled by the OS to run on a single-core, non-SMT, non-CMP processor, you get concurrency but not parallelism. Parallelism is possible only on multi-core, multi-processor, or distributed systems.
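To make the definition concrete, here is a minimal pthread sketch (illustrative only): which task increments the counter first is not predetermined, so the two tasks are concurrent on any machine; whether they ever run at the same instant depends on the hardware.

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int finished = 0;

static void *task(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    finished++;          /* the order in which T1 and T2 get here is not fixed */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Run two concurrent tasks; returns how many completed (always 2, but
   in an unpredictable order -- that unpredictability IS the concurrency). */
int run_tasks(void) {
    pthread_t t1, t2;
    finished = 0;
    pthread_create(&t1, NULL, task, NULL);
    pthread_create(&t2, NULL, task, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return finished;
}
```

On a single core the two tasks merely interleave; on two cores the same program may also be parallel. Nothing in the source changes.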

Concurrency is often referred to as a property of a program, and is a concept more general than parallelism.

Interestingly, we cannot say the same thing about concurrent programming and parallel programming. They overlap, but neither is a superset of the other. The difference comes from the sets of topics the two areas cover. For example, concurrent programming includes topics such as signal handling, while parallel programming includes topics such as memory consistency models. The difference reflects the different original hardware and software backgrounds of the two programming practices.

Update: More on Concurrency vs Parallelism

Sunday Jun 04, 2006

Read: "The Rise and Fall of CORBA"

The June 2006 issue (Vol 4, No 5) of ACM Queue features an article by Michi Henning of ZeroC on the rise and fall of CORBA.

Both technical issues and procedural issues contributed to the fall of CORBA, and the procedural problems are the root cause of the technical ones. Many of the issues the article points out are alarmingly familiar!

The following is a list of lessons learned about how to have a better standards process:

  • Standards consortia need iron-clad rules to ensure that they standardize existing best practice.
  • No standard should be approved without a reference implementation.
  • No standard should be approved without having been used to implement a few projects of realistic complexity.
  • Open source innovation usually is subject to a Darwinian selection process.
  • To create quality software, the ability to say "no" is usually far more important than the ability to say "yes".

Read the whole article.

Sunday Dec 25, 2005

Must Read: ACM Queue Microprocessors issue (9-2005)

The following articles from the ACM Queue Microprocessors issue (vol. 3, no. 7 - September 2005) are must reads.

Multicore CPUs for the Masses
Mache Creeger, Emergent Technology Associates

Software and the Concurrency Revolution
Herb Sutter and James Larus, Microsoft

The Price of Performance
Luiz André Barroso, Google

Extreme Software Scaling
Richard McDougall, Sun Microsystems

The Future of Microprocessors
Kunle Olukotun and Lance Hammond, Stanford University



