ZFS and FMA
By eschrock on Nov 21, 2005
ZFS Today (phase zero)
The FMA support in ZFS today is what we like to call "phase zero". It's basically the minimal amount of integration needed in order to leverage (arguably) the most useful feature in FMA: knowledge articles. One of the key FMA concepts is to present faults in a readable, consistent manner across all subsystems. Error messages are human readable, contain precise descriptions of what's wrong and how to fix it, and point the user to a website that can be updated more frequently with the latest information.
ZFS makes use of this strategy in the 'zpool status' command:
$ zpool status pool: tank state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool online' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: none requested config: NAME STATE READ WRITE CKSUM tank ONLINE 0 0 0 mirror ONLINE 0 0 0 c1d0s0 ONLINE 0 0 3 c0d0s0 ONLINE 0 0 0
In this example, one of our disks has experienced multiple checksum errors (because I dd(1)ed over most of the disk), but it was automatically corrected thanks to the mirrored configuration. The error message described exactly what has happened (we tried to self-heal the data from the other side of the mirror) and the impact (none - applications are unaffected). It also directs the user to the appropriate repair procedure, which is either to clear the errors (if they are not indicative of hardware fault) or replace the device.
It's worth noting here the ambiguity of the fault. We don't actually know if the errors are due to bad hardware, transient phenomena, or administrator error. More on this later.
ZFS tomorrow (phase one)
If we look at the implementation of 'zpool status', we see that there is no actual interaction with the FMA subsystem apart from the link to the knowledge article. There are no generated events or faults, and no diagnosis engine or agent to subscribe to the events. The implementation is entirely static, and contained within libzfs_status.c.
This is obviously not an ideal solution. It doesn't give the administrator any notification of errors as they occur, nor does it allow them to leverage other pieces of the FMA framework (such as upcoming SNMP trap support). This is going to be addressed in the near term by the first "real" phase of FMA integration. The goal of this phase, in addition to a number of other fault capabilities under the hood, is the following:
- Event Generation - I/O errors and vdev transitions will result in true FMA ereports. These will be generated by the SPA and fed through the FMA framework for further analysis.
- Simple Diagnosis Engine - An extremely dumb diagnosis engine will be provided to consume these ereports. It will not perform any sort of predictive analysis, but will be able to keep track of whether these errors have been seen before, and pass them off to the appropriate agent.
- Syslog agent - The results from the DE will be fed to an agent that simply forwards the faults to syslog for the administrator's benefit. This will give the same messages as seen in 'zpool status' (with slightly less information) synchronous with a fault event. Future work to generate SNMP traps will allow the administrator to be email, paged, or implement a poor man's hot spare system.
ZFS in the future (phase next)
Where we go from here is rather an open road. The careful observer will notice that ZFS never makes any determination that a device is faulted due to errors. If the device fails to reopen after an I/O error, we will mark it as faulted, but this only catches cases where a device has gotten completely out of whack. Even if your device is experiencing a hundred uncorrectable I/O errors per second, ZFS will continue along its merry way, notifying the administrator but otherwise doing nothing. This is not because we don't want to take the device offline; it's just that getting the behavior right is hard.
What we'd like to see is some kind of predictive analysis of the error rates, in an attempt to determine if a device is truly damaged, or whether it was just a random event. The diagnosis engines provided by FMA are designed for exactly this, though the hard part is determining the algorithms for making this distinction. ZFS is both a consumer of FMA faults (in order to take proactive action) as well as a producer of ereports (detecting checksum errors). To be done right, we need to harden all of our I/O drivers to generate proper FMA ereports, implement a generic I/O retire mechanism, and link it in with all the additional data from ZFS. We also want to gather SMART data from the drives to notice correctible errors fixed by the firmware, as well doing experiments to determine the failure rates and pathologies of common storage drives.
As you can tell, this is not easy. I/O fault diagnosis is much more complicated than CPU and memory diagnosis. There are simply more components, as well as more changes for administrative error. But between FMA and ZFS, we have laid the groundwork for a truly fault-tolerant system capable of predictive self healing and automated recovery. Once we get past the initial phase above, we'll start to think about this in more detail, and make our ideas public as well.