Solaris platform integration - disk monitoring

Two weeks ago I putback PSARC 2007/202, the second step in generalizing the x4500 disk monitor. As explained in my previous blog post, one of the tasks of the original sfx4500-disk module was reading SMART data from disks and generating associated FMA faults. This platform-specific functionality needed to be generalized to effectively support future Sun platforms.

This putback did not add any new user-visible features to Solaris, but it did refactor the code in the following ways:

  • A new private library, libdiskstatus, was added. This generic library uses uSCSI to read data from SCSI (or SATA via emulation) devices. It is not a generic SMART monitoring library; it focuses only on the three generally available disk faults: over temperature, predictive failure, and self-test failure. There is a single function, disk_status_get(), that returns an nvlist describing the current parameters reported by the drive and whether any faults are present.

  • This library is used by the SATA libtopo module to export a generic TOPO_METH_DISK_STATUS method. This method keeps all the implementation details within libtopo and exports a generic interface for consumers.

  • A new fmd module, disk-transport, periodically iterates over libtopo nodes and invokes the TOPO_METH_DISK_STATUS method on any supported nodes. The module generates FMA ereports for any detected errors.

  • These ereports are translated to faults by a simple eversholt DE. These are the same faults that were originally generated by the sfx4500-disk module, so the code that consumes them remains unchanged.
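
To make the flow through these four pieces concrete, here is a minimal Python sketch of it. This is a toy model, not the actual implementation: the real code is C against libtopo, libnvpair, and fmd. The DiskNode class, the dictionary layout, and the "ereport.example..." and "fault.example..." class strings are all made up for illustration; only the three fault kinds and the overall status, ereport, fault pipeline come from the description above.

```python
from dataclasses import dataclass

# The three fault kinds the post says libdiskstatus reports.
FAULT_KINDS = ("over-temperature", "predictive-failure", "self-test-failure")


@dataclass
class DiskNode:
    """Hypothetical stand-in for a libtopo disk node (not the real API)."""
    name: str
    temperature_c: int
    threshold_c: int
    predictive_failure: bool = False
    self_test_failed: bool = False
    supports_status: bool = True

    def disk_status(self):
        """Plays the role of TOPO_METH_DISK_STATUS; a dict replaces the nvlist."""
        return {
            "parameters": {
                "temperature_c": self.temperature_c,
                "threshold_c": self.threshold_c,
            },
            "faults": {
                "over-temperature": self.temperature_c > self.threshold_c,
                "predictive-failure": self.predictive_failure,
                "self-test-failure": self.self_test_failed,
            },
        }


def poll(nodes):
    """One pass of the transport: one ereport per fault on each supported node."""
    ereports = []
    for node in nodes:
        if not node.supports_status:
            continue  # nodes without the status method are skipped
        status = node.disk_status()
        for kind in FAULT_KINDS:
            if status["faults"][kind]:
                ereports.append({
                    # Illustrative class name, not a real FMA event class.
                    "class": f"ereport.example.disk.{kind}",
                    "detector": node.name,
                })
    return ereports


def diagnose(ereports):
    """Stand-in for the eversholt rules: one fault per (disk, kind) pair."""
    faults = set()
    for ereport in ereports:
        kind = ereport["class"].rsplit(".", 1)[1]
        faults.add((ereport["detector"], f"fault.example.disk.{kind}"))
    return sorted(faults)


if __name__ == "__main__":
    disks = [
        DiskNode("HDD 0", temperature_c=41, threshold_c=60),
        DiskNode("HDD 1", temperature_c=65, threshold_c=60,
                 predictive_failure=True),
    ]
    for disk, fault in diagnose(poll(disks)):
        print(disk, fault)
```

The point of the split is that poll() knows nothing about how the status is obtained and diagnose() knows nothing about disks, which mirrors why the consumers of the original sfx4500-disk faults did not need to change.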

These changes form the foundation that will allow future Sun platforms to detect and react to disk failures, eliminating 5200 lines of platform-specific code in the process. The next major steps are currently in progress:

The FMA team, as part of the sensor framework, is expanding libtopo to include the ability to represent indicators (LEDs) in a generic fashion. This will replace the x4500-specific properties and associated machinery with generic code.

The SCSI FMA team is finalizing the libtopo enumeration work that will allow arbitrary SCSI devices (not just SATA) to be enumerated under libtopo and therefore be monitored by the disk-transport module. The first phase will simply replicate the existing sfx4500-disk functionality, but will enable us to model future non-SATA platforms as well as external storage devices.

Finally, I am finishing up my long-overdue ZFS FMA work, a necessary step towards connecting ZFS and disk diagnosis. Stay tuned for more info.

Comments:

This is awesome news. Will we be able to configure the disk-transport DE to kick off drive self tests at configurable intervals (say every 24 hours)? Thanks, - Ryan

Posted by Matty on May 27, 2007 at 04:34 AM PDT #

What about monitoring of SMART-capable devices on generic, non-Sun i86pc platforms? Will that work too?

Posted by UX-admin on May 27, 2007 at 05:28 AM PDT #

Ryan -

No, there is no way to configure proactive self-tests (via SEND DIAGNOSTIC). This seems like a reasonable extension, though ensuring it doesn't conflict with ongoing activity could be tricky.

UX-admin -

Yes, this currently monitors all SATA disks, regardless of platform. With the upcoming SCSI FMA work, this will be expanded to include all SCSI/SATA devices. Some platform-specific operations, such as lighting LEDs via IPMI or correlating the device to a physical slot ("HDD 0") will be limited to those platforms that provide the necessary libtopo .xml annotations. We hope that these annotations won't be necessary for external enclosures, where SES should be sufficient.

Posted by Eric Schrock on May 27, 2007 at 06:48 AM PDT #
