A Brief Explanation Of The Race Condition Problem In Parallel Programming

Introduction

A race condition is a programming fault that produces undetermined program state and behavior due to unsynchronized parallel execution. It is the fault experienced programmers worry about most in the parallel programming space, yet race condition issues have many subtle aspects. A race condition problem is often caused by access to common data, but it can also occur in a sequence of operations that requires protection, such as an atomic transaction, to ensure overall state integrity. Not every data race is a programming bug: there are cases where a race condition is deliberately allowed in a parallel program as a compromise for performance. Last but not least, a very subtle and unexpected programming error may lie underneath a race condition symptom.

Here a simple and popular parallel partitioning example is used to explain these various aspects of race condition issues. Partitioning is a common task in HPC application programs that deal with huge numbers of objects. The partitioning example shown below creates N Pthreads to sort and collect the objects into N group containers, according to the object attributes, simultaneously. During the collection, each thread counts its group's objects and adds the group count to a common total sum, to check whether any object was missed in the collection.


partition_main.cpp

// global declarations
#include <stdio.h>
#include <math.h>
#include <pthread.h>
#include "element.h"
#include "container.h"

#define NGRPS 30        // number of groups
#define NTHRS NGRPS     // one collection thread per group

int object_count = 0;
element* object_array;
container group_array[NGRPS];
int total_count = 0;

void* collect(void* arg)
{
    int j;
    int group_id = *((int *) arg);  // arg points at main's loop index i -- a data race discussed later
    int group_count = 0;
    attribute group_attribute = get_group_attribute(group_id);
    container& group_container = group_array[group_id];  // reference, so additions reach the shared container

    for (j = 0; j < object_count; j++) {
        element& current_object = object_array[j];  // reference, so the flag update is visible to other threads
        if (current_object.collectFlag == true) continue;  // this flag is initialized to false
        if (current_object.matchAttribute(group_attribute)) {
            current_object.collectFlag = true;
            group_container.add(current_object);
            group_count++;
        }
    }
    total_count += group_count;  // unsynchronized update of a shared global -- the data race discussed below
    return NULL;
}

int main(int argc, char** argv)
{
    int i;
    pthread_t pids[NTHRS];

    object_count = process_input_data(argv[1], &object_array);
    for (i = 0; i < NTHRS; i++) {
        pthread_create(&pids[i], NULL, collect, (void*) &i);
    }
    for (i = 0; i < NTHRS; i++) {
        pthread_join(pids[i], NULL);  // wait for all collection threads before checking the count
    }
    if (total_count != object_count) {
        printf("the collected object count %d doesn't match the original object count %d\n",
               total_count, object_count);
    }
    return 0;
}


Data Race Condition Problem

Data races occur very often in shared-memory parallel programming models such as Pthreads and OpenMP. A data race occurs when multiple threads access a shared memory location in an undetermined order and at least one of the accesses writes a new value to that location.

In the program example above, a data race occurs at the second-to-last statement of the collect routine, total_count += group_count. total_count is a global variable shared by all the collection threads. While one thread is in the middle of reading the value of total_count and adding its group count to it, another thread may step in and read the old value of total_count simultaneously. Right after the first thread writes its new value to total_count, the second thread may overwrite it with its own new value and wipe out the first thread's computation.
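One common repair is to make the read-modify-write of total_count a critical section. Below is a minimal sketch, assuming a process-wide Pthread mutex is acceptable here; the mutex name is my own, not part of the original program.

static pthread_mutex_t total_count_mutex = PTHREAD_MUTEX_INITIALIZER;

// at the end of collect(), replacing the bare addition:
pthread_mutex_lock(&total_count_mutex);
total_count += group_count;    // the read-modify-write is now atomic with respect to other threads
pthread_mutex_unlock(&total_count_mutex);

Since each thread takes the lock only once, after its collection loop finishes, the serialization cost of this particular fix is negligible.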

The effect of a data race is quite subtle and hard to locate. In my own experience, I once encountered a data race almost identical to the one above. Because the symptom showed up only once in perhaps a hundred runs, I was puzzled by the program's behavior and suspected many things, including unstable memory chips. I was overconfident in my programming skill and never thought of a race condition in my own code until a colleague pointed it out.

Benign Data Race Condition

Not every data race condition is a harmful problem. In the program example above, another data race occurs in the collect routine: at the check of the current object's collectFlag and at the update of that flag when a thread collects an object into its group container. In this example an object attribute uniquely matches only one group attribute, so the collectFlag check seems both redundant and troublesome because of the data race. But if you analyze the program further, you will find that the check eliminates unnecessary computation: a thread can skip matching an already-collected object's attribute against its group attribute. This second data race only affects how much of that matching computation is eliminated; it causes no harmful result. On the contrary, it is worth keeping this kind of data race for performance reasons.

In the shared-memory parallel programming model, the threads must communicate critical data among one another. In principle all the threads need to read this critical data to be aware of the current overall program state and decide how to proceed, and at least one thread needs to produce a new value of the critical data and update it. This kind of check-and-update pattern is quite common in parallel programming, but by nature it is a data race. If every update action had to halt and synchronize all the participating threads, parallel computing efficiency would drop significantly. Therefore a data race is a necessary compromise in this case.
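For contrast, the sketch below shows what the collect loop body would look like if the flag check were fully synchronized; the mutex name is my own. Every thread would serialize on the lock for every object it inspects, even objects it will never collect, which is exactly the synchronization cost the benign race avoids.

static pthread_mutex_t flag_mutex = PTHREAD_MUTEX_INITIALIZER;

// fully synchronized variant of the collect loop body: correct, but the
// per-object lock serializes all the collection threads
pthread_mutex_lock(&flag_mutex);
if (current_object.collectFlag == false &&
    current_object.matchAttribute(group_attribute)) {
    current_object.collectFlag = true;
    group_container.add(current_object);
    group_count++;
}
pthread_mutex_unlock(&flag_mutex);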

General Race Condition

A general race condition is caused by an undetermined sequence of program executions that violates the integrity of the program state, and it cannot be attributed to a single memory-location access. The paper “What Are Race Conditions?” by Robert H. B. Netzer and Barton P. Miller of the University of Wisconsin formally defines a general race condition. This is a much harder problem than a data race, and its challenging nature has prompted many computer science researchers to study the transactional memory approach.

Here we continue with the partition program example to explain what a general race problem is. Let's say that after the first phase of collection, the partitioning program needs to fine-tune the result and shuffle some objects from one group container to another, as shown in the code below.


partition_shuffle.cpp

static pthread_mutex_t partition_mutex = PTHREAD_MUTEX_INITIALIZER;

void shuffle_objects(container* source, container* destination, element* target_objects)
{
    // remove target objects from the source container
    pthread_mutex_lock(&partition_mutex);
    source->remove_array(target_objects);
    pthread_mutex_unlock(&partition_mutex);

    // Here is a transitory state which may cause a general race condition:
    // the target objects belong to neither container

    // add target objects to the destination container
    pthread_mutex_lock(&partition_mutex);
    destination->add_array(target_objects);
    pthread_mutex_unlock(&partition_mutex);
}


The program looks clean and simple enough. However, a transitory state exists between the remove action on one group container and the add action on the other. If another thread is working on the objects in parallel, performing some other computing task, it may find the target objects are homeless at that critical moment. The obvious answer seems to be to wrap the entire shuffle_objects method in one atomic section and not allow any other thread or process to interfere in the middle. However this fix may not be a complete solution for the application's partitioning requirements. For example, if the partitioning program deals with electronic component objects and their child pin objects, the grouping of the components and their pins must together be atomic to keep the parent and child objects in a consistent state.
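Before looking at that larger case, here is what the obvious single-lock fix for shuffle_objects would look like, holding the same partition_mutex across both container operations so the transitory state is invisible to any other thread that takes the lock. This is a sketch; the function name is my own.

// the "obvious" fix: hold the mutex across both the remove and the add,
// so no lock-holding thread can observe homeless objects
void shuffle_objects_atomic(container* source, container* destination, element* target_objects)
{
    pthread_mutex_lock(&partition_mutex);
    source->remove_array(target_objects);
    destination->add_array(target_objects);
    pthread_mutex_unlock(&partition_mutex);
}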

The program code below shows such a case: even with each container shuffle made atomic, a transitory state remains between moving the components and moving their child pins, and it may produce a general race condition fault.


partition_shuffle.cpp (continued)

void shuffle_components(container* component_source, container* component_destination, element* target_components)
{
    shuffle_objects(component_source, component_destination, target_components);

    // Here is another transitory state which may cause a general race condition:
    // the components have moved but their child pins have not

    element* target_pins = get_child_pins(target_components);
    container* pin_source = get_pin_container(component_source);
    container* pin_destination = get_pin_container(component_destination);
    shuffle_objects(pin_source, pin_destination, target_pins);
}


Therefore the atomic-transaction requirement may extend over a long sequence of operations. But when the atomic sequence of operations becomes long and complex, it defeats the original intention of parallel programming. The big issue with general races is that they are very subtle to avoid in the first place and very hard to fix even if you are lucky enough to discover the problem. Furthermore, unlike data races, general races can occur in distributed-memory parallel programming, such as MPI programs, as well as in shared-memory parallel programming.
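For the component/pin case above, making the whole shuffle atomic means holding one lock across both container shuffles. The sketch below assumes unlocked helper variants of shuffle_objects (the names are my own); it shows how the critical section grows, and with it the time every other partitioning thread spends blocked.

// holding one lock across the component shuffle and the pin shuffle keeps
// parents and children consistent, but every other thread blocks for the
// whole sequence
void shuffle_components_atomic(container* component_source, container* component_destination, element* target_components)
{
    pthread_mutex_lock(&partition_mutex);
    // the unlocked variants must be used inside the critical section, or the
    // thread would deadlock re-acquiring the non-recursive partition_mutex
    shuffle_objects_unlocked(component_source, component_destination, target_components);
    element* target_pins = get_child_pins(target_components);
    shuffle_objects_unlocked(get_pin_container(component_source),
                             get_pin_container(component_destination), target_pins);
    pthread_mutex_unlock(&partition_mutex);
}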

Understanding The Cause Of A Race Condition Problem

Let's come back to the collect routine in the partitioning example at the beginning. There is a third data race in this routine, one that is quite subtle and whose cause is hard to understand without serious investigation: the statement group_container.add(current_object) has a data race problem.

Almost no programmer would believe, just from reading the code, that there is a data race at this statement. As a matter of fact, this data race is caused by yet another data race, the fourth one in such a simple program.

It is easier to explain the fourth data race first. The group ID comes from the routine argument arg, a pointer to the loop index i in the main program. The main thread advances the loop index i, writing a new value to the very memory location that arg points to. Therefore reading this loop index and converting it to a group ID in the collect routine is a data race. Because of it, two different threads running the collect routine may get the same group ID value, and they then produce the data race at the statement group_container.add(current_object) by adding to the same group container concurrently.
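One standard repair, sketched below, is to give each thread its own immutable copy of the group ID instead of a pointer into main's loop; the group_ids array is my own addition.

int group_ids[NTHRS];

for (i = 0; i < NTHRS; i++) {
    group_ids[i] = i;   // written once, before the thread starts, and never changed
    pthread_create(&pids[i], NULL, collect, (void*) &group_ids[i]);
}

With this change each collect thread reads a distinct, stable group ID, so no two threads share a group container and the race at group_container.add(current_object) disappears as well.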

Summary

Parallel programming is a new world for most software developers, and its subtlety and complexity go far beyond sequential programming. Software developers need the right tools in this new world, and a race detection tool is without doubt a critical one that parallel software developers need to learn and use regularly. Although the current state of parallel developer tools is still not mature, some early tools are available today and more engineering resources are being invested in them. The current Sun Studio Express June 2006 build features a Data Race Detection Tool; you can go to http://developers.sun.com/prodtech/cc/downloads/express.jsp for more information and a free download if you are interested in exploring race condition problems.
