Beautiful Code

So my copy of Beautiful Code showed up last week. Although I am one of the (many) authors and I have thus had access to the entire book online for some time, I do all of my pleasure reading in venues that need the printed page (e.g. the J Church) and have therefore waited for the printed copy to start reading.

Although I have only read the first twelve chapters or so, it's already clear (and perhaps not at all surprising) that there are starkly different definitions of beauty here: the book's greatest strength -- and, frankly, its greatest weakness -- is that the chapters are so incredibly varied. For one chapter, beauty is a small and titillating act of recursion; for the next, it's that a massive and complicated integrated system could be delivered quickly and cheaply. (I might add that the definition of beauty in my own chapter draws something from both of these poles: that in software, the smallest and most devilish details can affect the system at the largest and most basic levels.)

If one can deal with the fact that the chapters are widely divergent, and that there is not even a token attempt to weave them together into a larger tapestry, this book (at least so far, anyway) is (if nothing else) exceptionally thought-provoking; if Oprah were a code-cranking propeller-head, this would be the ideal choice for her book club.

Now in terms of some of my specific thoughts that have been provoked: as I mentioned, quite a few of my coauthors are enamored with the elegance of recursion. While I confess that I like writing a neatly recursive routine, I also find that I frequently end up having to unroll the recursion when I discover that I must deal with data structures that are bigger than I anticipated -- and that my beautiful code is resulting (or can result) in a stack overflow. (Indeed, I spent several unpleasant days last week doing exactly this when I discovered that pathologically bad input could cause blown stacks in some software that I'm working on.)

To take a concrete example, Brian Kernighan has a great chapter in Beautiful Code about some tight, crisp code written by Rob Pike to perform basic globbing. And the code is indeed beautiful. But it's also (at least in a way) busted: it overflows the stack on some categories of bad input. Admittedly, one is talking about very bad input here -- strings that consist of hundreds of thousands of stars in this case -- but this highlights exactly the problem I have with recursion: it leaves you with edge conditions that on the one hand really are edge conditions (deeply pathological input), but with a failure mode (a stack overflow) that's just too nasty to ignore.
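For reference, here is the matcher in question, reproduced from memory (consult the chapter for the authoritative version). Note the shape of the recursion: matchhere calls both itself and matchstar, and matchstar calls back into matchhere from within its loop, so the recursion can run as deep as the input is long:

int matchhere(char *, char *);
int matchstar(int, char *, char *);

/* match: search for regexp anywhere in text */
int
match(char *regexp, char *text)
{
        if (regexp[0] == '^')
                return (matchhere(regexp + 1, text));
        do {    /* must look even if string is empty */
                if (matchhere(regexp, text))
                        return (1);
        } while (*text++ != '\0');
        return (0);
}

/* matchhere: search for regexp at beginning of text */
int
matchhere(char *regexp, char *text)
{
        if (regexp[0] == '\0')
                return (1);
        if (regexp[1] == '*')
                return (matchstar(regexp[0], regexp + 2, text));
        if (regexp[0] == '$' && regexp[1] == '\0')
                return (*text == '\0');
        if (*text != '\0' && (regexp[0] == '.' || regexp[0] == *text))
                return (matchhere(regexp + 1, text + 1));
        return (0);
}

/* matchstar: search for c*regexp at beginning of text */
int
matchstar(int c, char *regexp, char *text)
{
        do {    /* a * matches zero or more instances */
                if (matchhere(regexp, text))
                        return (1);
        } while (*text != '\0' && (*text++ == c || c == '.'));
        return (0);
}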

Now, there are ways to deal with this. If one can stomach it, the simplest is to set up a sigaltstack and then siglongjmp out of a SIGSEGV/SIGBUS signal handler. You have to be very careful about doing this: the signal handler should look at the si_addr field in the siginfo and compare it to the stack bounds to confirm that it's a stack overflow, lest it end up siglongjmp'ing out of a non-recursion-induced SIGSEGV (which, needless to say, would make a bad problem much worse). While an alternative signal stack solution may sound hideous to some, at least the recursion doesn't have to go under the knife in this approach. If having a SIGSEGV handler to catch this condition feels uncomfortably brittle (as well it might), or if one's state cannot be neatly unwound after an arbitrary siglongjmp (as well it might not), the code will have to change: either a depth counter will have to be passed down and failure propagated when depth exceeds a reasonable maximum, or the recursion will have to be unrolled into iteration. For most aesthetic senses, none of these options is going to make the code more beautiful -- but they will make it indisputably more correct.
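To make the first of those approaches concrete, here is a rough sketch -- illustrative only, and not the program linked below; the stack-bounds check in particular is deliberately crude (it assumes a finite RLIMIT_STACK, where a real program on Solaris would use something like stack_getbounds(3C)):

#include <signal.h>
#include <setjmp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

static sigjmp_buf env;
static char *stack_lo, *stack_hi;       /* approximate stack bounds */

static void
on_sigsegv(int sig, siginfo_t *sip, void *ucp)
{
        char *addr = (char *)sip->si_addr;

        /*
         * Recover only if the fault lies within (or just beyond) the
         * stack; any other SIGSEGV is a genuine bug that must not be
         * swallowed.  (A real program would catch SIGBUS the same way.)
         */
        if (addr > stack_hi || addr < stack_lo - 65536)
                abort();

        siglongjmp(env, 1);
}

static int
recurse(int depth)
{
        volatile char pad[512];         /* defeat frame elision */

        pad[0] = (char)depth;
        return (recurse(depth + 1) + pad[0]);
}

int
main(void)
{
        stack_t alt;
        struct sigaction act;
        struct rlimit rl;
        char anchor;

        /*
         * Crude bounds: from an address near the top of the stack down
         * by RLIMIT_STACK.  (Assumes a finite limit.)
         */
        (void) getrlimit(RLIMIT_STACK, &rl);
        stack_hi = &anchor;
        stack_lo = stack_hi - rl.rlim_cur;

        /* the handler can't run on the stack that just overflowed */
        if ((alt.ss_sp = malloc(SIGSTKSZ)) == NULL)
                abort();
        alt.ss_size = SIGSTKSZ;
        alt.ss_flags = 0;
        (void) sigaltstack(&alt, NULL);

        act.sa_sigaction = on_sigsegv;
        act.sa_flags = SA_SIGINFO | SA_ONSTACK;
        (void) sigemptyset(&act.sa_mask);
        (void) sigaction(SIGSEGV, &act, NULL);

        if (sigsetjmp(env, 1) == 0)
                (void) printf("returned %d\n", recurse(0));
        else
                (void) printf("recovered from stack overflow\n");

        return (0);
}

The depth-counter alternative is far less exotic: pass a depth argument down the recursion and propagate failure once it crosses a conservative maximum -- uglier to thread through the code, but with a failure mode that is explicit rather than delivered by signal.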

I was actually curious about where exactly the Pike/Kernighan code would blow up, so I threw together a little program that uses sigaltstack along with sigsetjmp/siglongjmp to binary-search for the shortest input that induces the failure. My program, which (naturally) includes the Pike/Kernighan code, is here.

Here are the results of running my program on a variety of Solaris platforms, with each number denoting the maximum string length that can be processed by the Pike/Kernighan code without the possibility of stack overflow.

                         x86              SPARC
                     32-bit  64-bit   32-bit  64-bit
Sun cc, unoptimized  403265  187225    77649   38821
gcc, unoptimized     327651  218429    69883   40315
Sun cc, optimized    327651  327645   174723   95303
gcc, optimized       582489  524227   149769   87367

As can be seen, there is a tremendous range here, even across just two different ISAs, two different data models and two different compilers: from 38,821 on 64-bit SPARC using Sun cc without optimization to 582,489 on 32-bit x86 using gcc with optimization -- an order of magnitude difference. So while recursion is a beautiful technique, it is one that ends up with the ugliest of implicit dependencies: on the CPU architecture, on the data model and on the compiler. And while recursion is still beautiful to me personally, it will always be a beauty that is more superficial than profound...

Comments:

This is the principal reason why I have always shunned, and will continue to shun, recursion.

Coming from an assembler background, it didn't take very long for me to recognize the fallacy of recursion -- problems can be solved "elegantly", but, apart from designing a special harness to "babysit" the recursion loop, the whole thing can overflow the stack, which makes it inherently unreliable.

Elegance of code is more than just beauty: it need not necessarily be short/small code, but it should be as simple as possible, because, as I often teach, simple is robust; and only when one has robust building blocks as foundations can one hope to build more complex robust solutions. Recursion, alas, does not belong in that category because of its conventional implementation.

Finally, I'll leave off by quoting a famous programmer:

"Programming perfection is not when there's nothing more to add to a program, but when there is nothing left to take away."

Posted by UX-admin on July 10, 2007 at 07:27 PM PDT #

As my computer science teacher in school taught us: "Use as much recursion as needed and as little as possible."

Posted by Chris on July 11, 2007 at 01:30 AM PDT #

What about tail call elimination by the compiler? All the programmer needs to do is make sure the recursion is a tail call.
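A minimal illustration of the suggestion (hypothetical code, not from the book): an accumulator-passing sum in which the recursive call is the very last thing the function does.

#include <stddef.h>

/* sum_tail: sum n ints, with the recursion in tail position */
static long
sum_tail(const int *a, size_t n, long acc)
{
        if (n == 0)
                return (acc);
        return (sum_tail(a + 1, n - 1, acc + a[0]));    /* tail call */
}

With gcc -O2 (which enables -foptimize-sibling-calls), the recursive call compiles to a jump and stack usage stays constant; without optimization, every element costs a stack frame -- a caveat that comes up in the replies below.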

Posted by Rafael de F. Ferreira on July 12, 2007 at 01:22 PM PDT #

We found a nice example of this problem with recursion in the Solaris NFS mount code some years back. rpcgen was used to create the XDR encoding & decoding functions. When some poor customers crossed some random boundary of about 20,000 (sic) mounts, various related programs started falling over with extremely hard-to-debug stack overflows. The solution, of course, was to use manually written iterative XDR routines. Not much fun either, that :)

Posted by Calum Mackay on July 12, 2007 at 09:04 PM PDT #

Rafael, tail call elimination obviously eliminates the stack overflow pathology, but (1) it assumes that one has recursion which can be tail-call optimized (which is not true of the Pike/Kernighan code) and (2) it relies on a compiler optimization for correctness. I have personally been bitten by relying on the compiler to tail-call optimize, only to hit situations where it didn't (for whatever reason). In my experience, one must be very cautious when relying on not just correctness but optimality from a lower layer of software...

Posted by Bryan Cantrill on July 13, 2007 at 05:01 AM PDT #

There are some peculiarities about Pike's matching engine. One is that the given algorithm is non-greedy for the star quantifier -- that is, a star will try to match as few characters as possible. This is the result of matchstar using a do loop to attempt to match the rest of the regex first before letting the quantifier consume an input character. This is significant insofar as this non-greedy algorithm requires backtracking.

But if greediness is acceptable, then expressions in a grammar as simple as the one this matcher accepts can be tested without any backtracking at all! In this grammar, atoms can only ever consist of single characters, which means the length of an atom is constant. Therefore the regex itself, as given, is sufficient as a full state machine. (If quantifiers applied to groups of atoms, you would have to compile a state machine from the regex; this is what DFA matchers do. You still don't need backtracking then, though; it only becomes truly necessary once you introduce backreferences, which make the expressions non-regular.) So you can write an engine that is driven by the input string, as opposed to having it driven by the regex à la Pike. Such an engine would simply use the position in the regex as its state, and would consume the input string character by character, checking whether the current character satisfies the current state's assertion.

This sounds very iterative. Where is the beauty in that?

Well, personally, I don't find recursion all that beautiful, nor have I found it very natural (in general). This may seem surprising if you consider my propensity for functional programming; but to me, elegance is rather found in treating input as an immutable sequence to be *mapped* to another immutable sequence. Some other primitives are trivially derivable from this: filtering is simply the act of mapping a sequence to another sequence that omits some of the input elements but is otherwise identical; reducing maps a sequence of arbitrary length to a sequence of length one.

Matching an input against a regex, in these terms, means reducing an input given as a sequence of characters to a boolean.

Note that if you write this purely functionally, recursion still enters the picture, because changing state when values are not mutable involves a call. However, this can be written purely tail-recursively.

(Disclaimer: this is just dashed off, so I might have mixed up some classes of languages and how much computational complexity they entail.)

Posted by Aristoteles Pagaltzis on July 15, 2007 at 07:27 AM PDT #

I agree that the recursion here *could* blow up, but really, who is going to pass a 300K regular expression? In engineering, parts have tolerances. The standard for the space shuttle is different from the standard for a bike. We can look at Pike's code as having a tolerance too. In an application, I'd probably have a check before the algorithm that disallowed any regular expression larger than, say, 1024 characters, and told the user that they should create a different one. And, chances are, that check would never fire. I think we can get ourselves in trouble when we assume that functions have to work across all inputs. And, we're seduced into that because so much of what we do is like mathematics. But, really, I think that we should think about tolerances. And, tolerances can be different depending upon the type of software we're writing.

Posted by Michael Feathers on July 16, 2007 at 04:00 PM PDT #

Michael, surely you see the security problem with user input having the ability to blow the stack?

Posted by Geoff on July 16, 2007 at 07:49 PM PDT #

If you overflow your stack, the OS terminates the process. There is no security problem.

Posted by josh on July 17, 2007 at 12:14 AM PDT #

Uh... what? Recursion is almost *always* the right answer. That way a generation of programmers won't be trying to make sense of silly, state-dependent spaghetti. Try using a language that does recursion right (i.e., guaranteed tail-call optimization), like Scheme. Try rereading SICP.

Posted by Klein on July 17, 2007 at 01:27 AM PDT #

Aristoteles: interesting thoughts, and I hope that you will blog both your detailed thoughts on the Pike/Kernighan code, and more generally other thoughts from reading Beautiful Code. What makes the book interesting is that there are too many definitions of "beauty" to (in my opinion, anyway) fit any one person's definition. That is, for some chapter or chapters, what the author finds beautiful, you will find downright fugly -- and I would love to read your rants on what you deem to be the ugly ones.

Michael, as Geoff mentioned, if input is not checked (and I hasten to add that the length-of-death is much lower than 300K -- as low as 37K) this is a potential security problem. Now, as I believe Josh is trying to say, this does not constitute a buffer overflow attack -- an attacker could not use this vector to execute arbitrary code on the stack. But contrary to Josh's assertion that there is "no security problem", this does allow for a potential DoS attack.

That said, to me the issue is less about security and more about failure mode. Michael, while your assertion about tolerances is interesting, the physical world is ultimately a poor analogy for software: this is not physical failure, it is logical failure -- and there's quite a difference. To me, the better analogy is with mathematics: if you prove a theorem for all n less than 38,821, is that proof correct? It depends, of course, on how that proof is labelled: if one asserts that one has proved the theorem for all n when, in fact, it's only proved for n less than 38,821, than the proof is incorrect. If, however, one accurately claims that one has only proved the theorem for n less than 38,821, then the proof is correct -- but it is also clearly a proof of limited consequence, and certainly not one for Erdös's Book. My argument is that it is better to correctly solve a smaller problem (i.e. limited input) than to incorrectly solve a more ambitious problem (i.e. arbitrary input).

Finally, as for the assertion that we "get ourselves in trouble when we assume that functions have to work across all inputs", I could not disagree more -- indeed, I believe that we get ourselves into trouble when we make implicit assumptions that our input will fit some (ill-defined) notion of "reasonable." In my experience, software that makes such implicit assumptions becomes especially problematic as its used as foundation software, introducing new, cascading failure modes into the system. Certainly if one fancies writing software that will itself be used by higher layers of software, one must be paranoid about input and explicit about failure modes.

Posted by Bryan Cantrill on July 17, 2007 at 01:56 AM PDT #

Klein, tail-call optimization won't do the trick here -- the recursion in the Pike/Kernighan code is not a tail-call, and no amount of compiler craftiness can eliminate it. Now, in a stack-based language, presumably much more sophisticated stack management is possible (including much crisper failure modes), but such management would require a lot of nasty trickery if one wanted to both (1) use the stack provided by the machine architecture and (2) provide multiple threads of control. (And considering that using the machine's notion of the stack is very much necessary for performance, and that a compelling feature of stack-based languages is the ease with which they accommodate threads, it's a fair assumption that one does indeed want both.)

Posted by Bryan Cantrill on July 17, 2007 at 02:06 AM PDT #

Bryan, we can go back and forth about whether it's a logical or a physical problem... but those are just labels. To me, the crux of this is the fact that the algorithm has certain computational characteristics that place constraints on the environment. If the environment doesn't meet them, we probably shouldn't run it.

Your proof analogy is fine as it stands, but I think we have to question the assumption. Nobody is making the assertion that the algorithm is correct for all N input sequences. We might be able to construct a proof, but that proof would say something about the algorithm on an ideal machine (infinite stack). It wouldn't say anything about what you can expect in the face of physical (whoops, I said it) resource constraints.

I think the biggest beef I have with your position is that there is no single standard for correctness or appropriateness of coding style in the industry. What you do in software that you release to thousands of users can be different from what you do in a tool for a small workgroup. Safety-critical software is another area; so is teaching software. As well, not all software is in the middle of an application stack. Not all software is distributed either. Some has trusted users.

FIT is a good example. It's heavily recursive, and I imagine it could overflow the stack if the tables it processes are too large. I don't think it's a serious problem, though. FIT is used in work groups where developers are always present. If it happened, the size of the tables would be the obvious culprit. Would I want flight control software to be that trusting? No, but that's a completely different domain. There is no "one size fits all" standard for code.

Posted by Michael Feathers on July 17, 2007 at 06:12 AM PDT #

In terms of the machine, the stack overflow is not a physical problem -- it is (quite literally) a virtual problem. (The problem is that you've run out of the VA range allocated for your stack, not that you've run out of physical memory.) And while I think there's only one definition for correctness, I certainly acknowledge that different domains care less about correctness than other factors like the cost of development or the time it takes to bring software to market...

Posted by Bryan Cantrill on July 17, 2007 at 08:03 AM PDT #

What am I missing? The code you provided looks like it can be tail-call optimized (match, matchstar, matchhere). All of the calls are in the tail position, and no instructions are left to compute in the frame after the call. While I have neither a reference nor the ability to prove it mathematically on the spot, an iterative algorithm can always be rewritten as a tail-recursive algorithm.

Posted by PJW on July 17, 2007 at 12:40 PM PDT #

matchstar's call to matchhere is not a tail-call -- it takes conditional action based on the return from matchhere.

Posted by Bryan Cantrill on July 17, 2007 at 02:19 PM PDT #

I think a lot of people are missing the point. This is a very concise and elegant solution to a simple pattern matcher. It will work on all practical examples. If you want to treat some arbitrary input from an untrusted user as a match pattern, it's your own fault if you don't verify and place limits on that input first.

A robust regexp library would compile simple patterns like this to a DFA. Or, more likely, include more features such as alternations and remembering submatches, and compile some or all patterns to NFAs. This will be a lot more complicated. For something as commonly used as regexps it may be worth having such a full library, but on small and embedded systems, or for less common tasks for which there isn't likely to be a pre-existing library, it's a great advantage to be able to quickly put together simple, clean code like this.
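To make the no-backtracking idea concrete for this same grammar, here is a rough sketch (dashed off for this comment, not from any library; allocation-failure checks omitted): it keeps a set of active positions in the pattern -- effectively simulating the NFA directly -- and advances the whole set one input character at a time.

#include <stdlib.h>
#include <string.h>

/*
 * addstate: activate pattern position i, plus every position reachable
 * without consuming input (i.e., by letting starred atoms match zero
 * characters).
 */
static void
addstate(const char *re, size_t i, unsigned char *set)
{
        set[i] = 1;
        while (re[i] != '\0' && re[i + 1] == '*') {
                i += 2;
                set[i] = 1;
        }
}

/* nfamatch: match re against text, with no backtracking */
static int
nfamatch(const char *re, const char *text)
{
        int anchored = (re[0] == '^');
        size_t i, n;
        unsigned char *cur, *next, *tmp;

        if (anchored)
                re++;
        n = strlen(re);
        cur = calloc(n + 1, 1);
        next = calloc(n + 1, 1);
        addstate(re, 0, cur);

        for (;; text++) {
                /* accept if any active position has consumed the pattern */
                for (i = 0; i <= n; i++) {
                        if (cur[i] && (re[i] == '\0' || (re[i] == '$' &&
                            re[i + 1] == '\0' && *text == '\0'))) {
                                free(cur);
                                free(next);
                                return (1);
                        }
                }
                if (*text == '\0')
                        break;
                (void) memset(next, 0, n + 1);
                for (i = 0; i < n; i++) {
                        if (!cur[i] || re[i] == '$')
                                continue;
                        if (re[i] == '.' || re[i] == *text)
                                addstate(re,
                                    re[i + 1] == '*' ? i : i + 1, next);
                }
                if (!anchored)
                        addstate(re, 0, next);  /* match may start later */
                tmp = cur;
                cur = next;
                next = tmp;
        }
        free(cur);
        free(next);
        return (0);
}

The work per input character is bounded by the pattern length (no exponential backtracking), and the stack stays flat regardless of input; the same idea scales up to the compiled NFA/DFA matchers mentioned above.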

Regarding tail-call optimization, the reason the example has non-tail calls is because it's using backtracking, and it needs somewhere to store that information. The stack is the easiest place to do this, but you can always move the data into the heap (a technique Scheme programmers are used to).

Below is a straightforward translation of the same code to use the heap instead of the stack, and thus never blows out the stack when compiled with gcc -O2. It's basically the same thing, except you pass around an extra "stack" parameter. matchstar is made simpler by just pushing a backtrack point onto the stack and recursing (and could just as easily do greedy matching).

For technical reasons, matchhere needs an unused extra "int c" parameter to force tail-call optimizations. See http://www.ddj.com/dept/cpp/184401756 for the explanation. This and explicit malloc/free make the code a lot more clumsy than the equivalent in Scheme, where you have GC and tail-calls are guaranteed to be optimized.

-- Alex


#include <stdlib.h>

typedef struct _matchstack {
        int c;
        char *regexp;
        char *text;
        struct _matchstack *next;
} *matchstack;

int matchhere(int, char *, char *, matchstack);
int matchstar(int, char *, char *, matchstack);
int matchfail(matchstack);
matchstack matchpushstack(int, char *, char *, matchstack);

int
match(char *regexp, char *text)
{
        if (regexp[0] == '^')
                return (matchhere('\0', regexp + 1, text, NULL));
        else
                return (matchstar('.', regexp, text, NULL));
}

int
matchhere(int c, char *regexp, char *text, matchstack stack)
{
        if (regexp[0] == '\0')
                return (1);     /* note: remaining backtrack points leak */

        if (regexp[1] == '*')
                return (matchstar(regexp[0], regexp + 2, text, stack));

        if (regexp[0] == '$' && regexp[1] == '\0')
                return (*text == '\0' ? 1 : matchfail(stack));

        if (*text != '\0' && (regexp[0] == '.' || regexp[0] == *text))
                return (matchhere(c, regexp + 1, text + 1, stack));

        /* no match at this position: pop a backtrack point and retry */
        return (matchfail(stack));
}

int
matchstar(int c, char *regexp, char *text, matchstack stack)
{
        if (*text != '\0' && (*text == c || c == '.'))
                stack = matchpushstack(c, regexp, text + 1, stack);
        return (matchhere(c, regexp, text, stack));
}

int
matchfail(matchstack stack)
{
        int c;
        char *regexp, *text;
        matchstack next;

        if (stack) {
                c = stack->c;
                regexp = stack->regexp;
                text = stack->text;
                next = stack->next;
                free(stack);
                return (matchstar(c, regexp, text, next));
        } else {
                return (0);
        }
}

matchstack
matchpushstack(int c, char *regexp, char *text, matchstack stack)
{
        matchstack newstack = (matchstack) malloc(sizeof(struct _matchstack));

        newstack->c = c;
        newstack->regexp = regexp;
        newstack->text = text;
        newstack->next = stack;
        return (newstack);
}

Posted by Alex Shinn on July 18, 2007 at 02:00 PM PDT #

Alex, I hasten to add that your new code -- while fine in some respects -- does not check or propagate the potential failure condition from malloc(3C). You may view such error handling as sullying up your code, and that was actually the original point of my blog entry: code that is clean but not robust may be attractive, but to me it will always lack something in terms of beauty.

Posted by Bryan Cantrill on July 19, 2007 at 05:33 AM PDT #

Bryan,

There's only a single malloc, which can easily be wrapped in an

if (! ... ) fatal("out of memory");

That one extra line doesn't exactly sully up the code :) More likely, you'd define a safemalloc utility which did this automatically, and the code is then neither longer nor less clear.

Sure, in general robust, carefully tested code will be longer and more complex than simple code, but this is completely orthogonal to recursion. Recursion is a useful and powerful tool that can lead to cleaner code -- especially if your language gives you a tail-call optimization guarantee.

Posted by Alex Shinn on July 19, 2007 at 05:33 AM PDT #
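A sketch of that safemalloc wrapper (hypothetical; a real one would report more context) might look like:

#include <stdio.h>
#include <stdlib.h>

/* safemalloc: as described above -- die loudly rather than return NULL */
static void *
safemalloc(size_t size)
{
        void *p = malloc(size);

        if (p == NULL) {
                (void) fprintf(stderr, "safemalloc: out of memory "
                    "(%lu bytes)\n", (unsigned long)size);
                abort();
        }
        return (p);
}

Using abort() rather than exit() at least leaves a core dump behind, which bears on the debuggability concern raised in the reply that follows.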

Your solution is fine if you're just writing a one-off program, but if this is software that is to be used by other software, your solution is horrifically bad: at least if you died on the NULL pointer dereference, I would be able to look at the resulting core dump and immediately know that it was your busted code that took me down on an allocation failure; with this "solution" I get practically nothing -- just your (highly overloaded) "out of memory" message in my log files. If I have a body of software that has a complicated nest of dependencies, I would be resorting to grep'ing through source to try to find your broken code. If you fancy yourself a software dilettante, fine -- but if you want to write industrial-strength code that others can depend on, you'll need to treat failure semantics much more seriously.

This all brings to mind a conversation that Bill Moore once relayed to me: he was talking to a friend of his who is a professional drummer. The friend said that the primary difference between the amateur drummer and the professional drummer is that the professional can recover when things go wrong. In short, the difference between the amateur and the professional is error handling -- and I think this holds true for software as well.

Posted by Bryan Cantrill on July 19, 2007 at 05:33 AM PDT #

I just threw up a 5 minute transformation of the code to show how you can use recursion without blowing the stack. It was a quick blog comment and I didn't realize you were going to hold me accountable to write "industrial-strength" code. At any rate it's no reason to start attacking me and calling me an amateur.

Posted by Alex Shinn on July 19, 2007 at 09:00 AM PDT #

With no offense intended, it's not as if making this algorithm iterative is terribly subtle -- indeed, as I mentioned in the original post, I have had to unroll my own recursion many times over the years. The point was the failure semantics of the recursion, not the recursion per se -- so fixing one failure mode while introducing another very much missed the point I was trying to make. But apologies if you felt that I put too sharp a point on my rebuttal.

Posted by Bryan Cantrill on July 19, 2007 at 03:04 PM PDT #

So, are relational databases (or any other databases) a failure ("busted") because they couldn't handle the amount of data that Google deals with? And so Google had to build their own: BigTable.

Posted by sri on July 20, 2007 at 10:31 AM PDT #
