SES Sensors and Indicators

Last week, Rob Johnston and I coordinated two putbacks to Solaris to further the cause of Solaris platform integration, this time focusing on sensors and indicators. Rob has a great blog post with an overview of the new sensor abstraction layer in libtopo. Rob did most of the hard work- my contribution consisted only of extending the SES enumerator to support the new facility infrastructure.

You can find a detailed description of the changes in the original FMA portfolio here, but it's much easier to understand via demonstration. This is the fmtopo output for a fan node in a J4400 JBOD:

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0
    label             string    Cooling Fan  0
    FRU               fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f
    target-path       string    /dev/es/ses3

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=ident
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=ident
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: facility                       version: 1   stability: Private/Private
    type              uint32    0x1 (LOCATE)
    mode              uint32    0x0 (OFF)
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=fail
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=fail
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: facility                       version: 1   stability: Private/Private
    type              uint32    0x0 (SERVICE)
    mode              uint32    0x0 (OFF)
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=speed
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=speed
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    threshold
    type              uint32    0x4 (FAN)
    units             uint32    0x12 (RPM)
    reading           double    3490.000000
    state             uint32    0x0 (0x00)
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=fault
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=fault
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    discrete
    type              uint32    0x103 (GENERIC_STATE)
    state             uint32    0x1 (DEASSERTED)
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f

Here you can see the available indicators (locate and service), the fan speed (3490 RPM) and if the fan is faulted. Right now this is just interesting data for savvy administrators to play with, as it's not used by any software. But that will change shortly, as we work on the next phases:

  • Monitoring of sensors to detect failure in external components which have no visibility in Solaris outside libtopo, such as power supplies and fans. This will allow us to generate an FMA fault when a power supply or fan fails, regardless of whether it's in the system chassis or an external enclosure.
  • Generalization of the disk-monitor fmd plugin to support arbitrary disks. This will control the failure indicator in response to FMA-diagnosed faults.
  • Correlation of ZFS faults with the associated physical disk. Currently, ZFS faults are against a "vdev" - a ZFS-specific construct. The user is forced to translate from this vdev to a device name, and then use the normal (i.e. painful) methods to figure out which physical disk was affected. With a little work it's possible to include the physical disk in the FMA fault to avoid this step, and also allow the fault LED to be controlled in response to ZFS-detected faults.
  • Expansion of the SCSI framework to support native diagnosis of faults, instead of a stream of syslog messages. This involves generating telemetry in a way that can be consumed by FMA, as well as a diagnosis engine to correlate these ereports with an associated fault.

Even after we finish all of these tasks and reach the nirvana of a unified storage management framework, there will still be lots of open questions about how to leverage the sensor framework in interesting ways, such as a prtdiag-like tool for assembling sensor information, or threshold alerts for non-critical warning states. But with these latest putbacks, it feels like our goals from two years ago are actually within reach, and that I will finally be able to turn on that elusive LED.

Comments:

Controlling LEDs? So I will finally be able to turn those blue LED in Thumper into scrolling 4x12 display? :-D

Posted by Tomasz on August 07, 2008 at 04:12 AM PDT #

Post a Comment:
Comments are closed for this entry.
About

Musings about Fishworks, Operating Systems, and the software that runs on them.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today