system-administration | Tuesday, July 13, 2010

Source diving for sysadmins

As a system administrator, I work with dozens of large systems every day--Apache, MySQL, Postfix, Dovecot, and the list goes on from there. While I have a good idea of how to configure all of these pieces of software, I'm not intimately familiar with all of their code bases. And every so often, I'll run into a problem which I can't configure around.

When I'm lucky, I can reproduce the bug in a testing environment. I can then drop in arbitrary print statements, recompile with debugging flags, or otherwise modify my application to give me useful data. But all too often, I find that either the bug vanishes when it's not in my production environment, or it would simply take too much time or resources to even set up a testing deployment. When this happens, I find myself left with no alternative but to sift through the source code of the failing system, hoping to find clues as to the cause of the bug of the day. Doing so is never painless, but over time I've developed a set of techniques to make the source diving experience as focused and productive and possible.

To illustrate these techniques, I'll walk you through a real-world debugging experience I had a few weeks ago. I am a maintainer of the XVM project, an MIT-internal VPS service. We keep the disks of our virtual servers in shared storage, and we use clustering software to coordinate changes to the disks.

For a long time, we had happily run a cluster of four nodes. After receiving a grant for new hardware, we attempted to increase the size of our cluster from four nodes to eight nodes. But once we added the new nodes to the cluster, disaster struck. With five or more nodes in the cluster, no matter what we tried, we received the same error message:

root@babylon-four:~# lvs
 cluster request failed: Cannot allocate memory
 Can't get lock for xenvg
 Skipping volume group xenvg
 cluster request failed: Cannot allocate memory
 Can't get lock for babylon-four

And to make matters even more exciting, by the time we observed the problem, users had already booted enough virtual servers that we did not have the RAM to go back to four nodes. So there we were, with a broken cluster to debug.

Tip 1: Check the likely causes of failure first.

It can be quite tempting to believe that a given problem is caused by a bug in someone else's code rather than your own error in configuration. In reality, the common bugs in large, widely-used projects have already been squashed, meaning the most likely cause of error is something that you are doing wrong. I've lost track of the number of times I was sure I encountered a bug in some software, only to later discover that I had forgotten to set a configuration variable. So when you encounter a failure of some kind, make sure that your environment is not obviously at fault. Check your configuration files, check resource usage, check log files.

In the case of XVM, after seeing memory errors, we naturally figured we were out of memory--but free -m showed plenty of spare RAM. Thinking a rogue process might be to blame, we ran ps aux and top, but no process was consuming abnormal amounts of RAM or CPU. We consulted man pages, we scoured the relevant configuration files in /etc, and we even emailed the clustering software's user list, trying to determine if we were doing something wrong. Our efforts failed to uncover any problems on our end.

Tip 2: Gather as much debugging output as you can. You're going to need it.

Once you're sure you actually need to do a source dive, you should make sure you have all the information you can get about what your program is doing wrong. See if your program has a "debugging" or "verbosity" level you can turn up. Check /var/log/ for dedicated log files for the software under consideration, or perhaps check a standard log such as syslog. If your program does not provide enough output on its own, try using strace -p to dump the system calls it's issuing.

Before doing our clustering-software source dive, we cranked debugging as high as it would go to get the following output:

Got new connection on fd 5
Read on local socket 5, len = 28
creating pipe, [9, 10]
Creating pre&post thread
in sub thread: client = 0x69f010
Sub thread ready for work.
doing PRE command LOCK_VG 'V_xenvg' at 1 (client=0x69f010)
lock_resource 'V_xenvg-1', flags=0, mode=1
Created pre&post thread, state = 0
Writing status 12 down pipe 10
Waiting for next pre command
read on PIPE 9: 4 bytes: status: 12
background routine status was 12, sock_client=0x69f010
Send local reply

Note that this spew does not contain an obvious error message. Still, it had enough information for us to ultimately track down and fix the problem that beset us.

Tip 3: Use the right tools for the job

Perhaps the worst part of living in a world with many debugging tools is that it's easy to waste time using the wrong ones. If you are seeing a segmentation fault or an apparent deadlock, then your first instinct should be to reach for gdb. gdb has all sorts of nifty capabilities, including the ability to attach to an already-running process. But if you don't have an obvious crash site, often the information you glean from dynamic debugging is too narrow or voluminous to be helpful. Some, such as Linus Torvalds, have even vehemently opposed debuggers in general.

Sometimes the simplest tools are the best: together grep and find can help you navigate an entire codebase knowing only fragments of text or filenames (or guesses thereof). It can also be helpful to use a language-specific tool. For C, I recommend cscope, a tool which lets you find symbol usages or definitions.

XVM's clustering problem was with a multiprocess network-based program, and we had no idea where the failure was originating. Both properties would have made the use of a dynamic debugger quite onerous. Thus we elected to dive into the source code using nothing but our familiar command-line utilities.

Tip 4: Know your error.

Examine your system's failure mode. Is it printing an error message? If so, where is that error message originating? What can possibly cause that error? If you don't understand the symptoms of a failure, you certainly won't be able to diagnose its cause.

Often, grep as you might, you won't find the text of the error message in the codebase under consideration. Rather, a standard UNIX error-reporting mechanism is to internally set the global variable errno, which is converted to a string using strerror.

Here's a Python script that I've found useful for converting the output of strerror to the corresponding symbolic error name. (Just pass the script any substring of your error as an argument.)

#!/usr/bin/env python
import errno, os, sys
msg = ' '.join(sys.argv[1:]).lower()
for i in xrange(256):err = os.strerror(i)if msg in err.lower():print '%s [errno %d]: %s' % (errno.errorcode.get(i, '(unknown)'), i, err)

This script shows that the "Cannot allocate memory" message we had seen was caused by errno being set to the code ENOMEM.

Tip 5: Map lines of output to lines of code.

You can learn a lot about the state of a program by determining which lines of code it is executing. First, fetch the source code for the version of the software you are running (generally one of apt-get source and yumdownloader --source). Using your handy command-line tools, you should then be able to trace lines of debugging output back to the lines of code that emitted them. You can thus determine a set of lines that are definitively being executed.

Returning to the XVM example, we used apt-get source to fetch the relevant source code and dpkg -l to verify we were running the same version. We then ran a grep for each line of debugging output we had obtained. One such invocation,
grep -r "lock_resource '.*'" .,

showed that the corresponding log entry was emitted by a line in the middle of a function entitled _lock_resource.

Tip 6: Be systematic.

If you've followed the preceding tips, you'll know what parts of the code the program is executing and how it's erroring out. From there, you should work systematically, eliminating parts of the code that you can prove are not contributing to your error. Be sure you have actual evidence for your conclusions--the existence of a bug indicates that the program is in an unexpected state, and thus the core assumptions of the code may be violated.

At this point in the XVM debugging, we examined the _lock_resource function. After the debugging message we had in our logs, all paths of control flow except one printed a message we had not seen. That path terminated with an error from a function called saLckResourceLock. Hence we had found the source of our error.

We also noticed that _lock_resource transforms error values returned by the function it calls using using ais_to_errno. Reading the body of ais_to_errno, we noted it just maps internal error values to standard UNIX error codes. So instead of ENOMEM, the real culprit was one of SA_AIS_ERR_NO_MEMORY, SA_AIS_ERR_NO_RESOURCES, or SA_AIS_ERR_NO_SECTIONS. This certainly explained why we could see this error message even on machines with tens of gigabytes of free memory!

Ultimately, our debugging process brought us to the following block of code:

if (global_lock_count == LCK_MAX_NUM_LOCKS)     {
goto error_exit;

This chunk of code felt exactly right. It was bound by some hard-coded limit (namely, LCK_MAX_NUM_LOCKS, the maximum number of locks) and hitting it returned one of the error codes we were seeking. We bumped the value of the constant and have been running smoothly ever since.

Tip 7: Make sure you really fixed it.

How many times have you been certain you finally found an elusive bug, spent hours recompiling and redeploying, and then found that the bug was actually still there? Or even better, the bug simply changed when it appears, and you failed to find this out before telling everyone that you fixed it?

It's important that after squashing a bug, you examine, test, and sanity-check your changes. Perhaps explain your reasoning to someone else. It's all too easy to "fix" code that isn't broken, only cover a subset of the relevant cases, or introduce a new bug in your patch.

After bumping the value of LCK_MAX_NUM_LOCKS, we checked the project's changelog. We found a commit increasing the maximum number of locks without any changes to code, so our patch seemed safe. We explained our reasoning and findings to the other project developers, quietly deployed our patched version, and then after a week of stability sent an announce email proclaiming that we had fixed the cluster.

Your turn

What techniques have you found useful for debugging unfamiliar code?


Join the discussion

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.Captcha

Visit the Oracle Blog


Contact Us


Integrated Cloud Applications & Platform Services