Saturday Apr 25, 2009

Getting email reports of hardware faults with OpenSolaris

I woke up today to find that one of the disks in my home server had failed overnight. I was actually able to work this out while still in bed shortly after waking up, because I could hear it clicking and whirring pathetically as it tried to spin up - not a nice way to start your day. As I write this I'm in the process of filing an RMA to get the disk replaced, which promises to be a painful, drawn-out process, but hey - at least my data is still safe thanks to ZFS (so long as none of my other disks decide to break - not inconceivable, seeing as they're all identical...).

However, hardware faults aren't always audible, so I was pleased to see that my script for detecting hardware faults and then emailing me had triggered. Here's what I got sent:

-------- Original Message --------
Subject: Hardware failed on zebedee
Date:    Sat, 25 Apr 2009 13:54:02 +0200
From:    lamsey@zebedee

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Apr 25 09:32:07 43d4b6e4-1219-e9d5-bac5-f829b8fb2f2a  ZFS-8000-D3    Major    

Fault class : fault.fs.zfs.device

Description : A ZFS device failed.  Refer to for
              more information.

Response    : No automated response will occur.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.

Logging into my server and running zpool status -x showed me which disk was at fault (c4d0), and a bit of searching in the output of prtconf -v allowed me to work out the serial number of the affected disk (more specifically, it allowed me to work out the serial numbers of the disks which were still working, meaning I could work out which physical disk was broken by a process of elimination after cracking the box open).

So, how do I achieve the above? The answer is actually incredibly simple. The content of the email is just the output of fmadm faulty, a command which interrogates Solaris' FMA (Fault Management Architecture) feature to see if there are any hardware issues on a system. Wrap it up in a script (the below is based on one I found on the 'net eons ago and can no longer find), and you end up with something like:

lamsey@zebedee:bin$ cat check_hardware.ksh
# Public domain. Use as you wish.

TMPFILE=/tmp/fmadm.output.$$

# run fmadm and cut away the first two lines (headers)
/usr/bin/pfexec /usr/sbin/fmadm faulty | /usr/bin/sed 1,2d > $TMPFILE

# Check if the file size is greater than zero. This means we got
# some output from fmadm and therefore some hardware may be bad.
# Using HTML here means we can use <pre> to preserve formatting.
if [ -s $TMPFILE ]; then
    (
    /usr/bin/echo "Subject: Hardware failed on `hostname`"
    /usr/bin/echo "From: lamsey@zebedee"
    /usr/bin/echo "MIME-Version: 1.0"
    /usr/bin/echo "Content-Type: text/html"
    /usr/bin/echo "Content-Disposition: inline"
    /usr/bin/echo
    /usr/bin/echo '<pre>'
    # don't just use the temp file, it's missing headers
    /usr/bin/pfexec /usr/sbin/fmadm faulty
    /usr/bin/echo '</pre>'
    ) | /usr/local/bin/msmtp -a 1and1 $EMAIL
fi

# clean up the temp file
/usr/bin/rm -f $TMPFILE
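The detection step in the script boils down to "drop the two header lines and see if anything is left". As a rough sketch, that check can be pulled out into a small function that reads fmadm-style output on standard input, which makes it easy to try out without having a broken disk to hand. The name has_faults is made up for illustration and isn't part of the original script, and the two-line header is an assumption carried over from the script above. Note it's also slightly stricter than the file-size test, since it ignores output consisting only of blank lines.

```shell
#!/bin/sh
# has_faults: hypothetical helper, not part of the original script.
# Reads 'fmadm faulty'-style output on stdin, discards the first two
# lines (assumed to be headers, as in the script above) and succeeds
# only if at least one non-empty line remains.
has_faults() {
    sed 1,2d | grep . > /dev/null
}

# Possible use (commented out so the sketch has no side effects):
#   if /usr/bin/pfexec /usr/sbin/fmadm faulty | has_faults; then
#       echo "fmadm reported something"
#   fi
```

Because the function only looks at its input, you can feed it a saved copy of an old fault report to check that the rest of your alerting still works.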

Simply slap a call to the above script into your crontab, ideally running at least once a day, and you're good to go. Note that I use msmtp for sending emails automatically as it's a heck of a lot easier to configure than sendmail (which is important if you use an ISP like o2 which blocks outgoing SMTP traffic, preventing you from using sendmail in its out-of-the-box configuration). It doesn't come with Solaris though, so you'll need to compile it if you want to do the same (very simple, works fine with configure / make / make install).
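For reference, a crontab entry along these lines should do the job (add it with crontab -e). Treat it as a sketch rather than gospel: the script path and the email address are placeholders I've made up, so substitute your own. The script expects $EMAIL to be set and cron runs jobs with a very minimal environment, so this example sets it on the command line.

```
# Run the hardware check every day at 07:00.
# Path and address below are placeholders, not real values.
0 7 * * * EMAIL=you@example.com /export/home/lamsey/bin/check_hardware.ksh
```

The five fields are minute, hour, day of month, month and day of week, so changing the first two is all you need to pick a different time of day.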

Edit (01/5/09): I received the replacement disk today (took them long enough...). Slammed it into the server, issued a quick zpool replace shared c4d0 command, and all is good with the world again :-)

lamsey@zebedee:~$ zpool status shared
  pool: shared
 state: ONLINE
 scrub: resilver completed after 3h13m with 0 errors on Fri May  1 17:12:23 2009

        NAME        STATE     READ WRITE CKSUM
        shared      ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c3d0    ONLINE       0     0     0  217M resilvered
            c3d1    ONLINE       0     0     0  217M resilvered
            c4d0    ONLINE       0     0     0  236G resilvered
            c4d1    ONLINE       0     0     0  217M resilvered
            c6d1    ONLINE       0     0     0  217M resilvered

errors: No known data errors


The blog of Liam McBrien, a Sun Microsystems Campus Ambassador promoting and demonstrating the latest and greatest Sun technology at Strathclyde University in Scotland.

