Is your software crash-only?

I've resisted starting a weblog for all the usual reasons - lack of time, neophobia, anti-herding instinct - but mostly because the only thing I really wanted to write was a rant about Sun's marketing strategy for ZFS (that's the Zettabyte File System, not Dynamic File Service, DFS, or DynFS), and management doesn't want us writing rants about Sun's marketing, no matter how entertaining they are. Finally, I had a halfway decent idea for a blog topic while talking to a presenter at USENIX '04: my favorite systems papers.

Crash-only Software

This is a short workshop paper that appeared in HotOS IX. From the abstract: "Crash-only programs crash safely and recover quickly. There is only one way to stop such software - by crashing it - and only one way to bring it up - by initiating recovery." As motivation, the authors show that on a system running Red Hat 8.0 with ext3 as the root file system, it is faster to crash and recover (as in a power outage) than to cleanly shut down and restart the system - 75 seconds to crash and recover versus 104 seconds for a clean reboot, with no "important" data loss in either case. (The irony of this result should be apparent to any systems person.) The authors argue that crash-only systems, which are made up exclusively of crash-only components, are a good choice for certain classes of problems, where the best way to deal with bugs is to simply restart (crash) the component behaving badly.
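To make "only one way to stop, only one way to start" concrete, here is a minimal sketch of a crash-only component in Python. It is my own illustration, not code from the paper: the class name `CrashOnlyLog`, the record layout (length plus CRC32 header), and the choice to fsync on every append are all assumptions. The point is that there is no shutdown method at all, and opening the log always goes through recovery, which drops any half-written record left by a crash.

```python
import os
import struct
import zlib

# Hypothetical record header: payload length, then CRC32 of the payload.
HEADER = struct.Struct("<II")


class CrashOnlyLog:
    """Append-only log with no shutdown path: opening it always runs recovery."""

    def __init__(self, path):
        self.path = path
        self.records = self._recover()   # the only way to start
        self._file = open(path, "ab")

    def _recover(self):
        """Replay every intact record and truncate a torn tail, if any."""
        records = []
        if not os.path.exists(self.path):
            return records
        with open(self.path, "r+b") as f:
            data = f.read()
            offset = 0
            good_end = 0
            while offset + HEADER.size <= len(data):
                length, crc = HEADER.unpack_from(data, offset)
                start = offset + HEADER.size
                payload = data[start:start + length]
                # A short or corrupt record means the writer crashed mid-append.
                if len(payload) < length or zlib.crc32(payload) != crc:
                    break
                records.append(payload)
                offset = start + length
                good_end = offset
            f.truncate(good_end)
        return records

    def append(self, payload):
        """Write one record and force it to disk before acknowledging it."""
        self._file.write(HEADER.pack(len(payload), zlib.crc32(payload)) + payload)
        self._file.flush()
        os.fsync(self._file.fileno())
        self.records.append(payload)
```

"Stopping" this component means killing the process; the next `CrashOnlyLog(path)` call picks up from whatever made it to disk.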

The crash-only philosophy is already widespread. The examples most notable to me, as a file systems developer, are Google's clustered file system, GFS (indeed, all of Google's software); NetApp's WAFL file system, used internally in their filers; and ZFS, in which the on-disk state is always self-consistent. This paper simply clarifies the tradeoffs and properties of crash-only software, and, most importantly, introduces a nifty name for the concept. Recommended reading for all programmers.
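For application programmers, the closest everyday analogue to "on-disk state is always self-consistent" is the write-then-rename idiom. The sketch below is a file-level illustration only, not how GFS, WAFL, or ZFS work internally; the function name `atomic_write` is made up, and the directory fsync assumes a POSIX system.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` so a crash leaves the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new contents to a temporary file in the same directory,
    # so the final rename stays within one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Atomically swap the new file into place.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Assumption: POSIX, where fsync on the directory persists the rename itself.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A crash at any point leaves either the previous file or the complete new one under `path`; the worst case is a stray temporary file, which a recovery pass at startup can delete.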

Comments:

Again, I'm pretty sure zfs is a useless piece of shit. I still think I've crapped better ideas out of my rectal cavity.

Posted by joe on July 19, 2004 at 06:48 PM PDT #

Crash-only is the only way to build robust systems, IMO. The notion of programs carrying on in the face of bugs and proceeding to corrupt data seems fine in an isolated non-production environment, but data corruption costs big time, often enough to break the application permanently. These costs are rarely seen by the original developer, only when on site and dealing with the results. The other idea that is critical is storing data in a fashion that permits later data recovery. http://www.financialcryptography.com/cgi-bin/mt/mt-tb.cgi/58

Posted by Iang on July 19, 2004 at 11:56 PM PDT #

<quote>I've resisted starting a weblog for all the usual reasons - lack of time, neophobia, anti-herding instinct</quote> I started it because I wanted to get into the Blog Scene, and I only write about OSS stuffs. Nothing about my personal life.

Posted by lotso on August 20, 2004 at 07:49 PM PDT #

About

val
