Is your software crash-only?
By val on Jul 12, 2004
This is a short workshop paper that appeared in HotOS IX. From the abstract: "Crash-only programs crash safely and recover quickly. There is only one way to stop such software - by crashing it - and only one way to bring it up - by initiating recovery." As motivation, the authors show that with a system running Red Hat 8.0 with ext3 as the root file system, it is faster to crash and recover (as in a power outage) than to cleanly shutdown and restart the system - 75 seconds to crash and recover versus 104 seconds for a clean reboot, with no "important" data loss in either case. (The irony of this result should be apparent to any systems person.) The authors argue that crash-only systems, which are made up exclusively of crash-only components, are a good choice for certain classes of problems, where the best way to deal with bugs is to simply restart (crash) the component behaving badly.
The crash-only philosophy is already widespread, most notably to myself as a file systems developer in Google's clustered file system, GFS (indeed, in all of Google's software), NetApp's WAFL file system, used internally in their filers, and ZFS, in which the on-disk state is always self-consistent. This paper simply clarifies the tradeoffs and properties of crash-only software, and, most importantly, introduces a nifty name for the concept. Recommended reading for all programmers.