Recent FMA Work

It's been over 15 months since my last blog entry, so I thought I'd take the time to provide an update on some of the FMA work I've been doing in the interim, broken down by the Solaris Nevada build into which each piece of work was integrated:

Build 98
6735704 deadlock in topo_node_facility()
6735691 deadlock in topo_node_facbind()
6743070 libtopo: need FRU labels for chip nodes on doradi

The first two changes were bug fixes to address a pair of potential deadlocks in libtopo introduced in the Sensor Abstraction Layer Phase 2 integration.

Doradi was the internal code name for the Sun Fire X4150 and X4250 systems.  The third change was a small fix that allows the FMA command line tools to reference the FRU label when referring to a faulty CPU module, which is friendlier than using an FMRI to refer to the CPU module.

Build 100
6683671 txml_print_prop() leaks memory, segfaults on failure
6710915 typo in topo_method.h
6740819 typo in ses "contains" topo method

Three minor bug fixes to libtopo.

Build 106
PSARC 2008/753 Reflecting Fan/Power Supply Diagnosis in Solaris
6641745 diagnosis of power supply and fan failures via IPMI
6768720 disk-monitor: small leak in dm_process_sysevent() when handling
6769133 libtopo: hc_is_replaced() can leak memory
6765830 libtopo: need to enumerate sensors/indicators on fan/psu nodes
        on X4600
6773926 libipmi: ipmi_sdr_get sometimes bites off more than it can chew
6780080 libtopo: should optimize lookups for propmethod-backed properties
        if propvals are non-volatile
6781654 libtopo: completely bogus, but harmless logic in topo_snap_hold
        could be removed

This integration represents the first consumer for the Sensor Abstraction Layer that we introduced in build 96.  Why is this important to the user?

Modern servers are equipped with a wide variety of hardware sensors.  Our common service processor firmware (ILOM) monitors the fans and power supplies, and can respond to failures and present them via IPMI or in its browser user interface.  However, any diagnosis made by ILOM was not visible to the Fault Manager running on Solaris.  This created a sub-optimal usage model for admins and service personnel: they were forced to consult multiple sources to get a complete and accurate picture of a system's health status, because some fault conditions were visible only from ILOM while others were visible only from the host OS.

This integration builds upon the previous Sensor Abstraction Layer work described here to allow fan and power supply faults that are diagnosed by the service processor to be reflected in the Solaris Fault Manager.  In the event of a fan or power supply failure, this integration allows us to provide a consistent Fault Management experience to service personnel, including:

    - producing a localized diagnosis message on the console
    - referring the admin to a relevant knowledge article
    - including fans/PSUs faulted by the SP in "fmadm faulty" output

This work was based on code originally developed by Eric Schrock for the FishWorks product line.  I took this code and generalized it so that it could work across Sun's x64 server line and even, theoretically, on non-Sun systems that support IPMI.

This work also includes an important performance optimization for libtopo dynamic properties.  On AMD64 systems, this greatly speeds up lookups of DIMM serial numbers.  For more on how we use DIMM serial numbers in FMA see my earlier blog entry here.
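The idea behind that optimization can be illustrated with a minimal sketch: if a property is computed by a method and its value is known never to change, compute it once and serve later lookups from a cached copy.  Everything named below (dyn_prop_t, dyn_prop_get, fake_serial_method) is hypothetical and invented for illustration; this is not the libtopo implementation or its API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical types for illustration only; not libtopo's structures. */
typedef int (*prop_method_f)(char *buf, size_t len);

typedef struct dyn_prop {
	prop_method_f dp_method;	/* computes the value on demand */
	int dp_nonvolatile;		/* nonzero: value never changes */
	int dp_cached;			/* nonzero: dp_value is valid */
	char dp_value[64];		/* cached result */
} dyn_prop_t;

static int method_calls;	/* counts expensive invocations */

/* Stand-in for an expensive lookup, such as fetching a serial number. */
static int
fake_serial_method(char *buf, size_t len)
{
	method_calls++;
	(void) snprintf(buf, len, "SN-%04d", 1234);
	return (0);
}

/*
 * Look up a method-backed property.  Volatile properties always invoke
 * the method; non-volatile ones invoke it once and then reuse the cache.
 */
static int
dyn_prop_get(dyn_prop_t *dp, char *buf, size_t len)
{
	assert(dp != NULL && buf != NULL && len > 0);

	if (!dp->dp_nonvolatile)
		return (dp->dp_method(buf, len));

	if (!dp->dp_cached) {
		if (dp->dp_method(dp->dp_value, sizeof (dp->dp_value)) != 0)
			return (-1);
		dp->dp_cached = 1;
	}
	(void) snprintf(buf, len, "%s", dp->dp_value);
	return (0);
}
```

With a property marked non-volatile, two calls to dyn_prop_get() invoke the underlying method only once; a volatile property pays the full cost on every lookup.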

Finally, the integration included a handful of bug fixes.

Build 109
6802701 libtopo: need to enable the sensor-transport module on the Sun-Fire-X4140
6793549 libtopo: Need to enumerate sensors on duradi platforms
6793478 libtopo: Need to statically enumerate fans on duradi platforms
6793468 libtopo: Need to enumerate disk bays on the Sun-Fire-X4600 and
6805886 libipmi: handling of EIPMI_INVALID_RESERVATION errors is broken in

This integration enabled fan and power supply fault diagnosis on the Sun Fire X4140/X4240 and disk fault diagnosis on the Sun Fire X4600/X4600M2 platforms.

This integration also fixed a minor bug in the private library libipmi.

Build 115
PSARC 2009/265 fmdump -m
6810965 port fmdump -m to ON
6802474 Port libfmd_msg to ON
6805723 libtopo: port fmtopo -m to ON

The key piece in this integration is the introduction of a new private library, libfmd_msg, to Solaris.  This library is used by the FMA userland components to look up and format localized message content before emitting it to the user (in the form of a console message, SNMP trap, email, etc.).  Encapsulating this functionality in a library allowed us to remove a ton of duplicated code from various userland FMA components.  Additionally, this library allows us to insert expansion macros into the message content contained in the portable object files delivered with Solaris.  These expansion macros can reference elements in the payload of FMA protocol events which, in turn, allows us to emit messages that are more customized to the system.  This library was originally developed by Mike Shapiro for the FishWorks product line.  This integration ports all this goodness to Solaris Nevada/OpenSolaris so that it can be leveraged by other platforms.
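To make the expansion-macro idea concrete, here is a rough sketch of substituting event payload values into a message template.  It is not libfmd_msg code: the %<name> token syntax, the flat name/value payload, and every function and type name below are assumptions chosen for illustration (real FMA event payloads are name-value lists, not string arrays).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flattened payload member; illustration only. */
typedef struct payload_member {
	const char *pm_name;
	const char *pm_value;
} payload_member_t;

/* Find the value for a name given as a (pointer, length) slice. */
static const char *
payload_lookup(const payload_member_t *pl, size_t n, const char *name,
    size_t namelen)
{
	for (size_t i = 0; i < n; i++) {
		if (strlen(pl[i].pm_name) == namelen &&
		    strncmp(pl[i].pm_name, name, namelen) == 0)
			return (pl[i].pm_value);
	}
	return (NULL);
}

/* Append up to slen bytes of s, truncating to fit; out stays terminated. */
static void
append(char *out, size_t outlen, const char *s, size_t slen)
{
	size_t used = strlen(out);
	size_t room = outlen - used - 1;

	if (slen > room)
		slen = room;
	(void) memcpy(out + used, s, slen);
	out[used + slen] = '\0';
}

/*
 * Copy tmpl to out, replacing each %<name> token (assumed syntax) with
 * the matching payload value.  Unknown tokens are copied through as-is.
 */
static void
msg_expand(const char *tmpl, const payload_member_t *pl, size_t n,
    char *out, size_t outlen)
{
	assert(tmpl != NULL && out != NULL && outlen > 0);
	out[0] = '\0';

	while (*tmpl != '\0') {
		const char *end;

		if (tmpl[0] == '%' && tmpl[1] == '<' &&
		    (end = strchr(tmpl + 2, '>')) != NULL) {
			size_t namelen = (size_t)(end - (tmpl + 2));
			const char *val = payload_lookup(pl, n, tmpl + 2,
			    namelen);

			if (val != NULL)
				append(out, outlen, val, strlen(val));
			else
				append(out, outlen, tmpl,
				    (size_t)(end - tmpl) + 1);
			tmpl = end + 1;
		} else {
			append(out, outlen, tmpl, 1);
			tmpl++;
		}
	}
}
```

Given a payload with a hypothetical member named fru-label set to PS0, the template "Power supply %<fru-label> has failed." expands to "Power supply PS0 has failed."  Leaving unknown tokens untouched, rather than dropping them, makes a missing payload member easy to spot in the output.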

Build 116
6841968 syslog-msgs inadvertently uses wrong msg priority, causing msgs to
        be output to all windows

A very small integration to fix an annoying bug in the syslog-msgs fmd plugin.

Build 121
6839705 libtopo needs updates in order to cope with ILOM 3
6840169 libtopo: topo xml schema and parsing code needs to be extended to
        support defining array propvals
6840764 fmtopo can't print TOPO_TYPE_INT32_ARRAY and TOPO_TYPE_UINT64_ARRAY
6844530 dimm/cs serial propmethods in chip enumerator needlessly recompute
        IPMI entity name
6836314 add support for sensor-transport module on ILOM-based X4450 platforms
6844635 libtopo: pull chassis-specific xml out of i86pc-hc-topology.xml into
        seperate map
6844639 libtopo: add DIMM serial to chip-select nodes on X4140/4240/4440
6845699 libipmi: implementation of ipmi_sunoem_led_get/set interfaces needs
        to be updated for ILOM 3
6677012 libtopo: small leaks on snapshot creation
6535637 Add Severity level to payload of list.suspects event
6850083 libtopo: need to add JEDEC id for Hyundai Electronics to jedec_tbl in
        the chip enumerator
6844145 sys/bmc_intf.h should be delivered
6855750 fmadm faulty will fail to expand message tokens that reference event
6862378 libtopo: need to register TOPO_METH_SENSOR_FAILURE on ses nodes

This was a fairly large integration.  The bulk of these changes allow the FMA infrastructure to work correctly on Sun platforms that use the new ILOM 3 service processor firmware, while still maintaining compatibility with ILOM 2-based platforms.  Our FMA infrastructure leverages ILOM to do things like light/unlight chassis LEDs and detect fan and power supply failures.

Build 124
6875268 missing power supplies may be reported as faulted
6874918 sensor-transport produces ereports too aggressively
6877019 topo_node_facility tries to release lock it doesn't own

This integration fixes some minor bugs in the sensor-transport module.

