Thursday Aug 09, 2007

One More Reason the C++ Preprocessor is Dangerous

A co-worker recently asked me to look at an interesting C++ compilation problem. The constructor for one of her classes refused to compile until she simply changed one of the parameter names! Why should the parameter name matter?

Here's a much-simplified version of the class, with a trivial main(), removing proprietary information (this code isn't yet open-source).


class myclass {
   myclass(char \*s_net);

myclass::myclass(char \*s_net)
   // do stuff

int main()
   // do stuff

The Sun Studio 10 compiler gives the following errors when attempting to compile the translation unit:

"", line 7: Error: No direct declarator preceding "(".
"", line 10: Error: myclass is not a static data member.
"", line 10: Error: Badly formed expression.

Changing the name of the only parameter to the myclass constructor to something else like net allows the code to compile with no errors.

What's going on?

It turns out that s_net is defined in netinet/in.h as follows

#define s_net _S_un._S_un_b.s_b1 /\* OBSOLETE: network \*/

So when the preprocessor makes its substitutions prior to the actual compilation, the symbol s_net is replaced with something like _S_un._S_un_b.s_b1. (There are actually a few more substitutions, as we'll see momentarily). We can see this ourselves by using the -P option to CC to run the test program through just the preprocessor. This gives, in part:

class myclass {
public :
myclass ( char \* S_un . S_un_b . s_b1 ) ;
} ;

myclass :: myclass ( char \* S_un . S_un_b . s_b1 )


That obviously won't compile!

This example demonstrates two problems. The first is the use of globals, specifically #defines. Who would expect a generic symbol like s_net to be #defined in a standard header file? It would be bad enough if it were simply a global variable, but that at least would probably not have caused a problem in this case, because the use of s_net in the constructor would hide the global s_net. However, the preprocessor actually search/replaces the #define term so that the compiler never gets a chance to see s_net as the parameter name in the constructor.

The second problem is that the preprocessor does this substitution and other modifications such that the code for which the compiler is giving a warning is not the code that you see when you look at your source file. This can make debugging difficult.

Moral of the story: Use the preprocessor judiciously!

Monday Apr 23, 2007

Debugging a Multi-Process Deadlock

I thought deadlocks within a process were hard enough to debug until I ran into a deadlock between three different processes.

The observed misbehavior on the system was a user-initiated command that hung. This command doesn't do much except make an inter-process call to a daemon that handles the commands, so I immediately assumed some sort of deadlock in that daemon. Call it process A.

I was able to get a good stack of all the threads in process A with pstack(1). As expected the thread handling the hung command was waiting for a lock. It turned out that the thread holding the lock was waiting on an inter-process call to a second daemon, process B. A pstack of process B showed the thread handling the call from process A was waiting for a lock. The thread in process B that was holding that lock was waiting on a third process, C. Three guesses what process C was waiting for... Yup, an inter-process call to process A. So, going back to process A, the thread handling the call from process C was waiting on the same lock that was held by the thread waiting on the call to process B. Deadlock!

How did this situation happen? I think there are a few factors that contributed to the bug:

  1. Both daemons A and B use single global locks to protect all their data. Coarse-grained locks give less opportunity for deadlock within the process, and help prevent accidental unprotected access to critical sections. However, they also reduce opportunities for parallelism and, as evidenced here, make inter-process deadlocks more likely. More fine-grained locking could be one solution to this bug.

  2. Inter-process calls are made cyclically between these three processes. A good guiding principle for inter-process calls is to make them in only one direction (from the client to the server). In this case, process A is the client of process B, and process B never makes calls directly to process A. However, it does so indirectly via process C. The code path involving process C was added much later than the original development of these daemons, by a different engineering team. That brings me to the next point...

  3. Every system needs a well-documented locking strategy. There may have been an implicit assumption that B would never initiate a connection to A that required A to acquire a lock, but I don't think it's documented anywhere.

It would be interesting to try some deadlock-detecting static analysis tools on this system to see what they uncover. If anyone has any suggestions of good tools in this area, let me know!


Nick Solter is a software engineer and author living in Colorado.


« August 2016