Broken allocators and paleolithic debugging strategies
By wesolows on Aug 16, 2005
Not so long ago I was looking through Solaris's shells for memory allocators - functions that perform tasks similar to malloc(3C). These functions often store the size of the allocated block at the beginning of each block; if that size is stored as a 4-byte value, the return value from the allocator may not be aligned on an 8-byte boundary. This is a major problem on SPARC, because it's not uncommon to allocate structs or unions containing types that require 8-byte alignment, especially long long. As it turns out, gcc correctly assumes that long long variables are aligned on 8-byte boundaries and uses the std instruction to access them. Our Studio compiler doesn't; it always issues two st instructions. The result is that programs using this kind of allocator can crash when built with gcc but not with Studio, which is not a pleasant condition.
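To make the alignment problem concrete, here is a minimal sketch of the pattern described above. It is not the code from any of the shells: hdr4_alloc(), hdr8_alloc(), is_aligned8() and demo_alignment() are hypothetical names invented for illustration, and the sketch assumes the underlying malloc() returns memory aligned to at least 8 bytes, as it does on the usual platforms. The first allocator puts a 4-byte size in front of the payload and so hands back an address that is 4 bytes past an 8-byte boundary; the second pads the header to 8 bytes and keeps the payload aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical header sizes, chosen only for this illustration. */
#define HDR4_SIZE sizeof (uint32_t)   /* 4-byte size header */
#define HDR8_SIZE 8                   /* header padded to 8 bytes */

/* Returns nonzero if p sits on an 8-byte boundary. */
int is_aligned8(const void *p)
{
    return ((uintptr_t)p % 8) == 0;
}

/*
 * Broken pattern: store the block size as a 4-byte value at the start of
 * the block and return the address just past it. Assuming malloc() gives
 * back 8-byte-aligned memory, the result is only 4-byte aligned.
 */
void *hdr4_alloc(size_t size)
{
    unsigned char *base = malloc(HDR4_SIZE + size);

    if (base == NULL)
        return NULL;
    *(uint32_t *)base = (uint32_t)size;
    return base + HDR4_SIZE;
}

void hdr4_free(void *p)
{
    if (p != NULL)
        free((unsigned char *)p - HDR4_SIZE);
}

/*
 * One possible fix: keep the same 4-byte size field but pad the header
 * to 8 bytes, so the payload stays 8-byte aligned.
 */
void *hdr8_alloc(size_t size)
{
    unsigned char *base = malloc(HDR8_SIZE + size);

    if (base == NULL)
        return NULL;
    *(uint32_t *)base = (uint32_t)size;
    return base + HDR8_SIZE;
}

void hdr8_free(void *p)
{
    if (p != NULL)
        free((unsigned char *)p - HDR8_SIZE);
}

/*
 * Returns 1 if the 4-byte-header allocator misaligned its result and the
 * padded one did not. Nothing is stored through the misaligned pointer,
 * since doing so would be undefined behaviour.
 */
int demo_alignment(void)
{
    void *bad = hdr4_alloc(sizeof (long long));
    void *good = hdr8_alloc(sizeof (long long));
    int result;

    assert(bad != NULL && good != NULL);
    result = !is_aligned8(bad) && is_aligned8(good);
    hdr4_free(bad);
    hdr8_free(good);
    return result;
}
```

Storing a long long through the pointer returned by hdr4_alloc() is exactly the case where a single doubleword store would fault on SPARC while a pair of 4-byte stores would quietly succeed.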
As part of my search, I found that, indeed, the Bourne and Korn shells have some alignment problems. Though these are bugs, we've decided that there's no reliable way to find all possible bugs of this type, so we worked around them in the compiler as well as fixed the ones we've found. This is, if nothing else, a good argument against compilers that "help" programmers by covering up this kind of error. But the best prize of all wasn't the kind of problem I was looking for, but rather this gem from the C shell:
showall(av);
printf("i=%d: Out of memory\n", i);
chdir("/usr/bill/cshcore");
abort();
This is the systems programming equivalent of finding a live wooly mammoth contentedly smoking a cigar in your recliner. Unfortunately, there's no way to trigger this behaviour, as it's protected by the "debug" preprocessor symbol, which we never set in a normal build. Nevertheless, thanks to OpenSolaris, you can see it for yourself.
We harp incessantly on the need to be able to debug production code, with no recompilation needed; there are a number of better ways to debug this particular condition. For example, you could use the DTrace pid provider to stop a csh process when nomem() is called, and even provide a backtrace. If that weren't enough, you could then use mdb(1) to debug the problem in greater detail, or gcore(1) to produce a core dump. But the best part, the real joy, if you'll pardon the pun, is the chdir call. Clearly the purpose was to drop core in a predictable location for later analysis by the author. I think you'll find that coreadm(1M), along with other corefile improvements, offers a far more flexible and powerful way to accomplish this - and it complements nicely the other debugging strategies I mentioned above.