Fault management

Guest Author

The Fishworks storage appliance stands on the shoulders of giants. Many of the most exciting features -- Analytics, the hybrid storage pool, and integrated fault management,
for example -- are built upon existing technologies in OpenSolaris
(DTrace, ZFS, and FMA, respectively). The first two of these have been
covered extensively elsewhere, but I'd like to discuss our integrated
fault management, a central piece of our RAS
(reliability/availability/serviceability) architecture.

Let's start with a concrete example: suppose hard disk #4 is acting up in your
new 7000 series server. Rather than returning user data, it's giving back garbage.
ZFS checksums the garbage, immediately detects the corruption, reconstructs the
correct data from the redundant disks, and writes the correct block back to
disk #4. This is great, but if the disk is really going south, such failures will keep
happening, and the system will generate a fault and take the disk out of

Faults represent active problems on the system, usually hardware failures. OpenSolaris users are familiar with observing and managing faults through fmadm(1M). The appliance integrates fault management in several ways: as alerts, which allow administrators to configure automated responses to these events of interest; in
the active problems screen, which provides a summary view of the current faults
on the system; and through the maintenance view, which correlates faults with
the actual failed hardware components. Let's look at each of these in turn.


Faults are part of a broader group of events we call alerts.
Alerts represent events of interest to appliance administrators, ranging from
hardware failures to backup job notifications. When one of these events occurs,
the system posts an alert, taking whatever action has been configured
for it. Most commonly, administrators configure the appliance to send mail or
trigger an SNMP trap in response to certain alerts:

Alerts configuration

Managing faults

In our example, you'd probably discover the failed hard disk because you
previously configured the appliance to send mail on hardware failure (or
hot spare activation, or resilvering completion...). Once you get the mail,
you'd log into the appliance web UI (BUI) and navigate to the active

Active problems screen

The above screen presents all current faults on the system, summarizing each
failure, its impact on the system, and suggested actions for the administrator. You might next click the "more info" button (next to the "out of service" text), which
would bring you to the maintenance screen for the faulted chassis, highlighting
the broken disk both in the diagram and the component list:

Faulted chassis

This screen connects the fault with the actual physical component that's
failed. From here you could also activate the locator LED (which is
no simple task behind the scenes) and have a service technician go replace the
blinking disk. Of course, once they do, you'll get another mail saying that ZFS has finished resilvering the new disk.

Beyond disks

Disks are interesting examples because they are the heart of the storage server. Moreover, disks are often some of the first components to fail (in part because there are so many of them). But FMA allows us to diagnose many other kinds of components. For example, here are the same screens on a machine with a broken CPU cache:

Failed CPU

Failed CPU (chassis)

Under the hood

This complete story -- from hardware failure to replaced disk, for example -- is built on foundational technologies in OpenSolaris like FMA. Schrock has described much of the additional work that makes this simple but powerful user experience possible for the appliance. Best of all, little of the code is specific to our NAS appliances - we could conceivably leverage the same infrastructure to manage faults on other kinds of systems.

If you want to see more, download our VMware simulator and try it out for

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.