Broken allocators and paleolithic debugging strategies

Not so long ago I was looking through Solaris's shells for memory allocators - functions that perform tasks similar to malloc(3c). These functions often store the size of the allocated block at the beginning of each block; if that size is stored as a 4-byte value, the pointer returned by the allocator may not be aligned on an 8-byte boundary. This is a major problem on SPARC, because it's not uncommon to allocate structs or unions containing types that require 8-byte alignment, especially long long. As it turns out, gcc assumes, correctly, that long long variables are aligned on 8-byte boundaries and uses the ldd and std instructions to access them; those instructions trap if the address is not 8-byte aligned. Our Studio compiler makes no such assumption; it always issues two ld or st instructions, which need only 4-byte alignment. The result is that programs using this kind of allocator can crash when built with gcc but not with Studio, which is not a pleasant condition.
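
To make the failure mode concrete, here is a minimal sketch of the pattern, not code taken from any of the shells. The names bad_alloc, good_alloc, good_size, good_free, is_aligned and blk_hdr_t are all hypothetical, and the sketch assumes a C11 compiler and a malloc that returns storage aligned to at least 8 bytes. The broken version prefixes the block with a 4-byte size; the repaired version pads the header to the strictest alignment the platform requires.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical block header, for illustration only. The max_align_t
 * member forces sizeof (blk_hdr_t) to be a multiple of the strictest
 * fundamental alignment, so the payload after it stays aligned.
 */
typedef union {
	size_t		size;	/* size of the caller's block */
	max_align_t	align;	/* padding only; never read */
} blk_hdr_t;

static_assert(sizeof (blk_hdr_t) % alignof (max_align_t) == 0,
    "header must preserve payload alignment");

/* Returns nonzero if p is a multiple of a. */
int
is_aligned(const void *p, size_t a)
{
	return (((uintptr_t)p % a) == 0);
}

/* Broken: the 4-byte size prefix leaves the payload at base + 4. */
void *
bad_alloc(size_t n)
{
	uint32_t *p = malloc(sizeof (uint32_t) + n);

	if (p == NULL)
		return (NULL);
	*p = (uint32_t)n;
	return (p + 1);
}

/* Repaired: the payload starts a whole aligned header past the base. */
void *
good_alloc(size_t n)
{
	blk_hdr_t *h = malloc(sizeof (blk_hdr_t) + n);

	if (h == NULL)
		return (NULL);
	h->size = n;
	return (h + 1);
}

/* Reads back the size recorded by good_alloc(). */
size_t
good_size(const void *p)
{
	return (((const blk_hdr_t *)p - 1)->size);
}

/* Steps back over the header before handing the block to free(). */
void
good_free(void *p)
{
	if (p != NULL)
		free((blk_hdr_t *)p - 1);
}
```

Storing a long long through the pointer returned by bad_alloc() is exactly the kind of access that gcc compiles to std on SPARC. The cost of the repair is a few wasted bytes per block, which is presumably why the original authors skipped it.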

As part of my search, I found that the Bourne and Korn shells do indeed have some alignment problems. Though these are bugs, we've decided that there's no reliable way to find every bug of this type, so we've worked around them in the compiler as well as fixing the ones we found. This is, if nothing else, a good argument against compilers that "help" programmers by covering up this kind of error. But the best prize of all wasn't the kind of problem I was looking for, but rather this gem from the C shell:

        showall(av);
        printf("i=%d: Out of memory\n", i);
        chdir("/usr/bill/cshcore");
        abort();

This is the systems programming equivalent of finding a live wooly mammoth contentedly smoking a cigar in your recliner. Unfortunately, there's no way to trigger this behaviour, as it's protected by the "debug" preprocessor symbol, which we never set in a normal build. Nevertheless, thanks to OpenSolaris, you can see it for yourself.

We harp incessantly on the need to be able to debug production code, with no recompilation needed; there are a number of better ways to debug this particular condition. For example, you could use the DTrace pid provider to stop a csh process when nomem() is called, and even provide a backtrace. If that weren't enough, you could then use mdb(1) to debug the problem in greater detail, or gcore(1) to produce a core dump. But the best part, the real joy, if you'll pardon the pun, is the chdir call. Clearly the purpose was to drop core in a predictable location for later analysis by the author. I think you'll find that coreadm(1m), along with other corefile improvements, offers a far more flexible and powerful way to accomplish this - and it complements nicely the other debugging strategies I mentioned above.

Comments:

That's definitely one old mammoth: PDP-11/Trees/2.11BSD/usr/src/bin/csh/sh.misc.c

Posted by Ivan R. on August 16, 2005 at 05:31 PM UTC

I stumbled on that same gem quite a few years ago, and at a Sun Labs social, had an opportunity to ask Bill Joy about it -- he claimed to have no memory of the code, and my description of the code evoked a stone-faced reaction. Certainly a let-down -- if I'd dropped something that foul in the punchbowl, the pangs of guilt and shame would still ring true even decades later.

Posted by meem on October 17, 2005 at 09:20 PM UTC

About

wesolows