By Nicholas Solter on Apr 23, 2007
I thought deadlocks within a process were hard enough to debug until I ran into a deadlock between three different processes.
The observed misbehavior on the system was a user-initiated command that hung. This command doesn't do much except make an inter-process call to a daemon that handles the commands, so I immediately assumed some sort of deadlock in that daemon. Call it process A.
I was able to get a good stack of all the threads in process A with pstack(1). As expected the thread handling the hung command was waiting for a lock. It turned out that the thread holding the lock was waiting on an inter-process call to a second daemon, process B. A pstack of process B showed the thread handling the call from process A was waiting for a lock. The thread in process B that was holding that lock was waiting on a third process, C. Three guesses what process C was waiting for... Yup, an inter-process call to process A. So, going back to process A, the thread handling the call from process C was waiting on the same lock that was held by the thread waiting on the call to process B. Deadlock!
How did this situation happen? I think there are a few factors that contributed to the bug:
- Both daemons A and B use single global locks to protect all their data. Coarse-grained locks give less opportunity for deadlock within the process, and help prevent accidental unprotected access to critical sections. However, they also reduce opportunities for parallelism and, as evidenced here, make inter-process deadlocks more likely. More fine-grained locking could be one solution to this bug.
- Inter-process calls are made cyclically between these three processes. A good guiding principle for inter-process calls is to make them in only one direction (from the client to the server). In this case, process A is the client of process B, and process B never makes calls directly to process A. However, it does so indirectly via process C. The code path involving process C was added much later than the original development of these daemons, by a different engineering team. That brings me to the next point...
- Every system needs a well-documented locking strategy. There may have been an implicit assumption that B would never initiate a connection to A that required A to acquire a lock, but I don't think it's documented anywhere.
It would be interesting to try some deadlock-detecting static analysis tools on this system to see what they uncover. If anyone has any suggestions of good tools in this area, let me know!