Designing for Failure

In the last few weeks, I've been completely redesigning the ZFS commands from the ground up[1]. When I stood back and looked at the current state of the utilities, several glaring deficiencies jumped out at me[2]. I thought I'd use this blog entry to focus on one that is near and dear to me. Having spent a great deal of time with the debugging and observability tools, I've invariably focused on answering the question "How do I diagnose and fix a problem when something goes wrong?" When it comes to command line utilities, the core of this problem lies in well-designed error messages. To wit, running the following (former) ZFS command demonstrates the number one mistake when reporting error messages:

    # zfs create -c pool/foo pool/bar
    zfs: Can't create pool/bar: Invalid argument
    #

The words "Invalid argument" should never appear as an error message. This means that at some point in the software stack, you were able to determine there was a specific problem with an argument. But in the course of passing that error up the stack, any semantic information about the exact nature of the problem has been reduced to simply EINVAL. In the above case, all we know is that one of the two arguments was invalid for some unknown reason, and we have no way of knowing how to fix it. When choosing to display an error message, you should always take the following into account:

An error message must clearly identify the source of the problem in a way that the user can understand.

An error message must suggest what the user can do to fix the problem.

If you print an error message that the administrator can't understand or doesn't suggest what to do, then you have failed and your design is fundamentally broken. All too often, error semantics are given a back seat during the design process. When approaching the ZFS user interface, I made sure that error semantics were a fundamental part of the design document. Every command has complete usage documentation, examples, and every possible error message that can be emitted. By making this part of the design process, I was forced to examine every possible error scenario from the perspective of an administrator.
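As a minimal sketch of what keeping that semantic information might look like, here is a name validator that reports why a name was rejected and where, instead of collapsing everything to EINVAL. The type, the function, the allowed character set, and the length limit are all hypothetical and chosen for illustration; none of them come from the actual ZFS code.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical failure reasons; none of these names come from ZFS. */
typedef enum {
    NAME_OK = 0,
    NAME_EMPTY,            /* the name is the empty string */
    NAME_EMPTY_COMPONENT,  /* leading, trailing, or doubled '/' */
    NAME_BAD_CHAR,         /* a character outside the allowed set */
    NAME_TOO_LONG          /* longer than NAME_MAX_LEN */
} name_err_t;

#define NAME_MAX_LEN 255   /* illustrative limit, not the real one */

/*
 * Validate a dataset-style name such as "pool/foo". On failure, return the
 * specific reason and store the offset of the offending character in *pos,
 * so the caller can still tell the user exactly what was wrong.
 */
name_err_t
validate_name(const char *name, size_t *pos)
{
    size_t i, len;

    assert(name != NULL && pos != NULL);
    len = strlen(name);
    *pos = 0;

    if (len == 0)
        return NAME_EMPTY;
    if (len > NAME_MAX_LEN) {
        *pos = NAME_MAX_LEN;
        return NAME_TOO_LONG;
    }
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)name[i];

        *pos = i;
        if (c == '/') {
            /* A '/' may not start or end the name, or follow another '/'. */
            if (i == 0 || i == len - 1 || name[i - 1] == '/')
                return NAME_EMPTY_COMPONENT;
        } else if (!isalnum(c) && c != '_' && c != '-' && c != '.') {
            return NAME_BAD_CHAR;
        }
    }
    *pos = 0;
    return NAME_OK;
}
```

A caller that receives NAME_BAD_CHAR together with an offset can print the name, point at the offending character, and list the characters that are allowed, which is exactly the information that "Invalid argument" throws away.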

A grand vision of proper failure analysis can be seen in the Fault Management Architecture in Solaris 10, part of Predictive Self Healing. A complete explanation of FMA and its ramifications is beyond the scope of a single blog entry, but the basic premise is to move from a series of unrelated error messages to a unified framework of fault diagnosis. Historically, when hardware errors occurred, an arbitrary error message may or may not have been sent to the system log. The error may have been transient (such as an isolated memory CE), or the result of some other fault. Administrators were forced to make costly decisions based on a vague understanding of our hardware failure semantics. When error messages did succeed in describing the problem sufficiently, they invariably failed to suggest how to fix it. With FMA, the sequence of errors is instead fed to a diagnosis engine that is intimately familiar with the characteristics of the hardware, and is able to produce a fault message that both adequately describes the real problem and explains how to fix it (when it cannot be automatically repaired by FMA).

Such a wide-ranging problem doesn't necessarily compare to a simple set of command line utilities. A smaller scale example can be seen with the Service Management Facility (SMF). When SMF first integrated, it was incredibly difficult to diagnose problems when they occurred[3]. The result, after a few weeks of struggle, was one of the best tools to come out of SMF, svcs -x. If you haven't tried this command on your Solaris 10 box, you should give it a shot. It automatically gathers error information and combines it into output that is specific, intelligible, and repair-focused. During development of the final ZFS command line interface, I've taken a great deal of inspiration from both svcs -x and FMA. I hope that this is reflected in the final product.

So what does this mean for you? First of all, if there's any Solaris error message that is unclear or uninformative, that is a bug. There are some rare cases when we have no other choice (because we're relying on an arbitrary subsystem that can only communicate via errno values), but 90% of the time it's because the system hasn't been sufficiently designed with failure in mind.

I'll also leave you with a few cardinal[4] rules of proper error design beyond the two principles above:

  1. Never distill multiple faults into a single error code. Any error that gets passed between functions or subsystems must be traceable back to a single specific failure.
  2. Stay away from strerror(3C) at all costs. Unless you are truly interfacing with an arbitrary UNIX system, the errno values are rarely sufficient.
  3. Design your error reporting at the same time you design the interface. Put all possible error messages in a single document and make sure they are both consistent and effective.
  4. When possible, perform automated diagnosis to reduce the amount of unimportant data or give the user more specific data to work with.
  5. Distance yourself from the implementation and make sure that any error message makes sense to the average user.
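One way to read rules 1 and 3 together is as a single error catalog in which every failure has its own code, a description of the problem, and a suggested fix. The sketch below is only an illustration of that idea: the error codes, the message text, and the function names are made up for this example and are not taken from ZFS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical error codes: one per distinct failure (rule 1). */
typedef enum {
    ERR_NO_SUCH_POOL = 0,
    ERR_DATASET_EXISTS,
    ERR_NAME_BAD_CHAR,
    ERR_COUNT
} err_code_t;

typedef struct {
    err_code_t code;
    const char *problem;  /* what went wrong, in the user's terms */
    const char *action;   /* what the user can do about it */
} err_entry_t;

/* Every message lives in one table so it can be reviewed as a whole (rule 3). */
static const err_entry_t err_catalog[ERR_COUNT] = {
    { ERR_NO_SUCH_POOL, "the pool does not exist",
      "check the spelling of the pool name" },
    { ERR_DATASET_EXISTS, "a dataset with that name already exists",
      "choose a different name, or destroy the existing dataset first" },
    { ERR_NAME_BAD_CHAR, "the name contains an invalid character",
      "use only letters, digits, '_', '-', '.', and '/' in the name" },
};

/* Return 1 if every code has both a problem description and an action. */
int
catalog_is_complete(void)
{
    int i;

    for (i = 0; i < ERR_COUNT; i++) {
        const err_entry_t *e = &err_catalog[i];

        if ((int)e->code != i || e->problem == NULL || e->action == NULL)
            return 0;
        if (strlen(e->problem) == 0 || strlen(e->action) == 0)
            return 0;
    }
    return 1;
}

/*
 * Format the full message for 'code' into buf. Returns the message length,
 * or -1 if the code is unknown or the buffer is too small.
 */
int
format_error(char *buf, size_t size, err_code_t code, const char *name)
{
    const err_entry_t *e;
    int n;

    assert(buf != NULL && name != NULL);
    if ((int)code < 0 || (int)code >= ERR_COUNT)
        return -1;
    e = &err_catalog[code];
    n = snprintf(buf, size, "cannot create '%s': %s\naction: %s",
        name, e->problem, e->action);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

Calling format_error() with ERR_DATASET_EXISTS and the name "pool/bar" yields a two-line message: the first line names the dataset and the problem, the second tells the user what to do. A check like catalog_is_complete() can run in a test suite so that a newly added code cannot ship without both halves.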

[1] No, I cannot tell you when ZFS will integrate, or when it will be available. Sorry.

[2] This is not intended as a jab at the ZFS team. They have been working full steam on the (significantly more complicated) implementation. The commands have grown organically over time, and are beginning to show their age.

[3] Again, this is not meant to disparage the SMF team. There were many more factors here, and all the problems have since been fixed.

[4] "Cardinal" might be a stretch here. A better phrase is probably "random list of rules I came up with on the spot".

Comments:

I'd like to add to your list of rules for error handling, Eric. This is an extension of your rule 1:

1a. Never hide the source of an error.

Too often, I see Java applications which go something like this:

   try {
      // do something which throws an exception
      ...
   }
   catch (RuntimeException ux) {
      throw new ExceptionHiddenException("Something bad happened");
   }

Solving a problem when all you know is that a problem occurred, but not where, is a nightmare, and not something that should be inflicted on even outsourced maintainers ;).

Posted by Trevor Watson on April 14, 2005 at 06:05 PM PDT #

Eric, I like what you've written here. And for those reading along, I've been lucky enough to review the aforementioned design document several times. I'm very excited about the step forward it represents.

While I'm mostly in agreement, I suggest that the directive about not utilizing strerror() may muddy the waters for programmers learning to do UNIX development. Maybe this is me being an old fart, set in my ways. I'm willing to be convinced :)

There is a lot of code in the world like this:

    fprintf(stderr, "open failed");

A slightly improved version is:

    fprintf(stderr, "open %s failed", filename);

The problem here is that the programmer is simply throwing valuable information (the datum represented in the errno) away. There is also a lot of code in the world like this:

    fprintf(stderr, "open %s failed with errno=%d", filename, errno);

This is even worse because it yields program behavior which appears to the user to be the work of psychotic nerds. At a minimum, strerror() can sometimes point one in the right direction; messages such as 'No such file or directory' or 'Permission denied' still seem valuable to me. This points up an inherent flaw in the way we use errno: some errno values represent programmer errors (EINVAL, usually, and always EFAULT), some represent user error or programmer error (EACCES, EROFS), and some represent transient or permanent system problems (ENOSPC, EAGAIN). It's also the case that system calls can sometimes return errno values which are wholly unexpected. Having that strerror message as a part of a larger error message could speed the time to resolution for a service person. In your example:

    # zfs create -c pool/foo pool/bar
    zfs: Can't create pool/bar: Invalid argument
    #

I'd like to see:

    # zfs create -c pool/foo pool/bar
    zfs: Unexpected error creating pool/bar: Invalid argument

This accomplishes two things: we preserved that scrap of data from errno, and we indicated to the administrator that the subsystem was not designed to cope with this error. The path to resolution is, by necessity: call service personnel.
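A minimal C sketch of this approach might look like the following. The function names and the exact wording are hypothetical, and the grouping of errno values loosely follows the rough categories above rather than any official classification; anything the code does not anticipate is reported as unexpected, with the strerror() text kept as a clue.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Rough, illustrative grouping of errno values; not an official taxonomy. */
typedef enum {
    ERRCLASS_USER,        /* something the user can likely correct */
    ERRCLASS_SYSTEM,      /* transient or permanent system problem */
    ERRCLASS_UNEXPECTED   /* programmer error or anything unanticipated */
} errclass_t;

errclass_t
classify_errno(int err)
{
    switch (err) {
    case ENOENT:
    case EACCES:
    case EROFS:
        return ERRCLASS_USER;
    case ENOSPC:
    case EAGAIN:
        return ERRCLASS_SYSTEM;
    default:
        /* EINVAL, EFAULT, and every value we did not plan for. */
        return ERRCLASS_UNEXPECTED;
    }
}

/*
 * Build a message that names the file and keeps the strerror() text.
 * 'err' must be the errno value saved immediately after the failed call.
 * Returns the message length, or -1 if buf is too small.
 */
int
describe_open_failure(char *buf, size_t size, const char *filename, int err)
{
    const char *prefix;
    int n;

    assert(buf != NULL && filename != NULL);
    prefix = (classify_errno(err) == ERRCLASS_UNEXPECTED) ?
        "unexpected error opening" : "cannot open";
    n = snprintf(buf, size, "%s %s: %s", prefix, filename, strerror(err));
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

The caller would save errno right after open() fails, pass it in, and print the result to stderr. The "unexpected error" prefix signals that the program was not designed to cope with this failure, while the errno text still survives for whoever ends up diagnosing it.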

Not to be crass or condescending, but I'd rather the average programmer use strerror/perror than try to roll their own complex, unique, and probably broken error subsystem. Of course, you're not an average programmer, and ZFS is not the average subsystem.

All of that aside, do you think we could begin to encode some of the practices you're developing into a library of some sort?

Posted by Dan Price on April 14, 2005 at 07:19 PM PDT #
