Designing for Failure
By eschrock on Apr 14, 2005
In the last few weeks, I've been completely redesigning the ZFS commands from the ground up[1]. When I stood back and looked at the current state of the utilities, several glaring deficiencies jumped out at me[2]. I thought I'd use this blog entry to focus on one that is near and dear to me. Having spent a great deal of time with the debugging and observability tools, I've invariably focused on answering the question "How do I diagnose and fix a problem when something goes wrong?" When it comes to command line utilities, the core of this problem is well-designed error messages. To wit, running the following (former) ZFS command demonstrates the number one mistake when reporting error messages:
    # zfs create -c pool/foo pool/bar
    zfs: Can't create pool/bar: Invalid argument
    #
The words "Invalid argument" should never appear as an error message. This means that at some point in the software stack, you were able to determine there was a specific problem with an argument. But in the course of passing that error up the stack, any semantic information about the exact nature of the problem has been reduced to simply EINVAL. In the above case, all we know is that one of the two arguments was invalid for some unknown reason, and we have no way of knowing how to fix it. When choosing to display an error message, you should always take the following into account:
- An error message must clearly identify the source of the problem in a way that the user can understand.
- An error message must suggest what the user can do to fix the problem.
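To make the contrast concrete, here is a minimal sketch in Python (chosen for brevity; it is not how the ZFS commands are implemented). The naming rules, the `CommandError` type, and both `check_name` functions are hypothetical and exist only for illustration. The first checker collapses three different problems into EINVAL, while the second keeps the cause and a suggested fix attached to the error as it travels up the stack.

```python
import errno
import os


class CommandError(Exception):
    """Hypothetical error type carrying the cause and a suggested fix."""

    def __init__(self, subject, problem, action):
        super().__init__(problem)
        self.subject = subject   # the argument at fault
        self.problem = problem   # what is wrong, in the user's terms
        self.action = action     # what the user can do about it

    def __str__(self):
        return f"cannot create '{self.subject}': {self.problem}\n{self.action}"


def check_name_errno(name):
    """Lossy style: three distinct problems all collapse into EINVAL."""
    if not name or "/" not in name or " " in name:
        return errno.EINVAL
    return 0


def check_name(name):
    """Same checks, but each failure keeps its own description and remedy."""
    if not name:
        raise CommandError(name, "no name was given",
                           "specify a name of the form 'pool/name'")
    if "/" not in name:
        raise CommandError(name, "missing pool component",
                           "use a name of the form 'pool/name'")
    if " " in name:
        raise CommandError(name, "name contains a space",
                           "remove the space or replace it with '_'")


if __name__ == "__main__":
    bad = "pool/my data"
    # All the caller can print from an errno is the generic system text.
    print(f"Can't create {bad}: {os.strerror(check_name_errno(bad))}")
    try:
        check_name(bad)
    except CommandError as err:
        # The structured error names the argument, the problem, and the fix.
        print(err)
```

The point of the second version is not the exception mechanism itself but that nothing between the check and the user is allowed to throw away the description or the remedy.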
If you print an error message that the administrator can't understand or doesn't suggest what to do, then you have failed and your design is fundamentally broken. All too often, error semantics are given a back seat during the design process. When approaching the ZFS user interface, I made sure that error semantics were a fundamental part of the design document. Every command has complete usage documentation, examples, and every possible error message that can be emitted. By making this part of the design process, I was forced to examine every possible error scenario from the perspective of an administrator.
A grand vision of proper failure analysis can be seen in the Fault Management Architecture in Solaris 10, part of Predictive Self Healing. A complete explanation of FMA and its ramifications is beyond the scope of a single blog entry, but the basic premise is to move from a series of unrelated error messages to a unified framework of fault diagnosis. Historically, when a hardware error occurred, an arbitrary error message may or may not have been sent to the system log. The error may have been transient (such as an isolated memory correctable error, or CE), or the result of some other fault. Administrators were forced to make costly decisions based on a vague understanding of our hardware failure semantics. When error messages did succeed in describing the problem sufficiently, they invariably failed to suggest how to fix it. With FMA, the sequence of errors is instead fed to a diagnosis engine that is intimately familiar with the characteristics of the hardware, and is able to produce a fault message that both adequately describes the real problem and explains how to fix it (when it cannot be automatically repaired by FMA).
Such a wide-ranging problem doesn't necessarily compare to a simple set of command line utilities. A smaller scale example can be seen with the Service Management Facility. When SMF first integrated, it was incredibly difficult to diagnose problems when they occurred[3]. The result, after a few weeks of struggle, was one of the best tools to come out of SMF, svcs -x. If you haven't tried this command on your Solaris 10 box, you should give it a shot. It automatically gathers error information and combines it into output that is specific, intelligible, and repair-focused. During development of the final ZFS command line interface, I've taken a great deal of inspiration from both svcs -x and FMA. I hope that this is reflected in the final product.
So what does this mean for you? First of all, if there's any Solaris error message that is unclear or uninformative, that is a bug. There are some rare cases when we have no other choice (because we're relying on an arbitrary subsystem that can only communicate via errno values), but 90% of the time it's because the system hasn't been sufficiently designed with failure in mind.
I'll also leave you with a few cardinal[4] rules of proper error design beyond the two principles above:
- Never distill multiple faults into a single error code. Any error that gets passed between functions or subsystems must be traceable back to a single specific failure.
- Stay away from strerror(3c) at all costs. Unless you are truly interfacing with an arbitrary UNIX system, the errno values are rarely sufficient.
- Design your error reporting at the same time you design the interface. Put all possible error messages in a single document and make sure they are both consistent and effective.
- When possible, perform automated diagnosis to reduce the amount of unimportant data or give the user more specific data to work with.
- Distance yourself from the implementation and make sure that any error message makes sense to the average user.
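One way to read the first and third rules together is as a single catalog of failures that can be checked mechanically. The sketch below is in Python and rests on assumptions: the fault names, the message wording, and the `check_catalog` and `render` helpers are all hypothetical, not taken from any ZFS design document. Each entry stands for exactly one failure and carries both a description and a suggested action, and the checker flags entries that are incomplete or that share a message with another entry.

```python
# Hypothetical catalog: every failure the command can report, in one place.
# Each entry maps a fault name to (what went wrong, what the user can do).
CATALOG = {
    "NO_SUCH_POOL": ("pool '{pool}' does not exist",
                     "check the pool name for typos"),
    "NAME_EXISTS": ("'{name}' already exists",
                    "choose a different name or remove the existing one"),
    "NAME_HAS_SPACE": ("'{name}' contains a space",
                       "remove the space from the name"),
}


def check_catalog(catalog):
    """Return a list of problems with the catalog; empty means consistent."""
    problems = []
    seen = {}  # message text -> first fault name that used it
    for code, (message, action) in catalog.items():
        if not message.strip() or not action.strip():
            problems.append(f"{code}: message and action must both be non-empty")
        if message in seen:
            # Two faults with one message means the user cannot tell them apart.
            problems.append(f"{code}: same message as {seen[message]}")
        seen.setdefault(message, code)
    return problems


def render(code, **details):
    """Format one catalog entry as the two lines shown to the user."""
    message, action = CATALOG[code]
    return f"{message.format(**details)}\n{action.format(**details)}"


if __name__ == "__main__":
    print(check_catalog(CATALOG))
    print(render("NAME_EXISTS", name="pool/bar"))
```

Running a check like this as part of the build is one possible way to keep the messages from drifting apart as commands are added.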
[1] No, I cannot tell you when ZFS will integrate, or when it will be available. Sorry.
[2] This is not intended as a jab at the ZFS team. They have been working full steam on the (significantly more complicated) implementation. The commands have grown organically over time, and are beginning to show their age.
[3] Again, this is not meant to disparage the SMF team. There were many more factors here, and all the problems have since been fixed.
[4] "Cardinal" might be a stretch here. A better phrase is probably "random list of rules I came up with on the spot".