A louder voice for the fault manager

The Solaris reference implementation of the fault manager recently got a boost in its ability to report faults with the introduction of a two-part SNMP agent. This agent makes it easy to integrate the Solaris fault manager into existing SNMP-based monitoring infrastructure.

Background

The fault manager has always been able to report faults to the system log and console(s), and to provide a wealth of status information via fmadm(1M) and fmdump(1M). But these reporting mechanisms leave much to be desired; syslog messages must be parsed, and a busy central log host can easily lose important messages in the noise. Worse still, a privileged user must log into the affected system and run administrative commands to get information they need that isn't contained in the message.

SNMP is a natural choice for extending the reach of the fault manager's voice; it's widely used to facilitate centralised monitoring of events throughout and even across administrative domains. The basic model is simple and extensible; information can be pushed from any device to one or more network management stations (NMSs), or pulled by an administrator or automated utility from a particular device of interest. Managed devices - in this case, a Solaris system - signify events using traps (also called notifications in SNMPv2), which provide a limited amount of information to designated NMSs. They also provide access to a management information base (MIB) on demand. Generally, the MIB provides access to a much greater breadth and depth of information than is transmitted with a trap or notification. An NMS can be configured to retrieve additional data from the MIB upon receipt of a trap if desired.

Availability

The technology described here is available in Solaris Nevada builds 33 and later. OpenSolaris offers access to the sources. A prerequisite for building or using these applications is the installation of the SMA packages provided by the SFW consolidation; BFUing newer ON bits is not sufficient. If you have SWAN access, you can run /ws/onnv-gate/public/bin/update_sma to get the necessary packages; otherwise see the OpenSolaris download center for the packages.

A Note on NMS Configuration

If you use the Net-SNMP-based NMS software delivered in Solaris, as I do below, you will want to tell the client utilities to use the fault management MIB to encode and decode OIDs. The easiest way to do this is to add MIBS=+ALL to your environment. You can also make this permanent by adding the following line to /etc/sma/snmp/snmp.conf (creating the file if it does not exist):

    mibs +ALL

See snmp.conf(4) for more information on MIB searching and importing. If you use a different NMS, consult your vendor's documentation to learn how to import a new MIB.
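Before relying on symbolic names in the commands that follow, it can be worth confirming that the client tools actually see the MIB. The following is a minimal sketch, assuming the Net-SNMP snmptranslate(1M) utility is on your PATH and exits non-zero when it cannot resolve a name; the function name fm_mib_visible is made up for illustration.

```shell
# Hypothetical helper: succeeds only if the Net-SNMP client tools can
# resolve a name from SUN-FM-MIB. Assumes snmptranslate is on PATH and
# that it exits non-zero when the name is unknown.
fm_mib_visible() {
    MIBS=+ALL snmptranslate -On SUN-FM-MIB::sunFmProblemTrap >/dev/null 2>&1
}

# Example use:
#   fm_mib_visible || echo "SUN-FM-MIB not found; check snmp.conf" >&2
```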

snmp-trapgen: an SNMP plugin for fmd(1M)

The trap or notification generator component is snmp-trapgen. This is a very simple fault manager plugin similar to that which logs fault information to the system log and console. Instead of writing formatted text to a log device, however, this plugin generates SNMPv1 traps and/or SNMPv2 notifications, one for each destination configured in the systemwide snmpd.conf(4). No additional configuration is required; if you have already configured a system to send traps to one or more NMSs, you don't need to do anything else to be notified upon fault diagnosis. If not, you'll want to add v1 or v2 trap destinations to /etc/sma/snmp/snmpd.conf. The hostnames or addresses you use will need to be configured to receive and act upon SNMP traps or notifications. If you don't have an NMS on your network, you can use the snmptrapd(1M) server included with Solaris.
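If no trap destinations are configured yet, entries along the following lines could be added. This is a sketch only: trapsink and trap2sink are the standard Net-SNMP directives for SNMPv1 and SNMPv2c destinations, while the hostname nms.example.com and the community string public are placeholders to replace with your own.

```
# /etc/sma/snmp/snmpd.conf (fragment; hostname and community are placeholders)

# Send SNMPv1 traps to this NMS:
trapsink   nms.example.com public

# Send SNMPv2c notifications to this NMS:
trap2sink  nms.example.com public
```

Listing the same host under both directives would deliver each event twice, once in each format, so you would normally pick one directive per destination.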

A fault diagnosis trap (sunFmProblemTrap) includes a limited subset of the information contained in the syslog message associated with the fault. Specifically, the diagnosis's UUID, diagnostic code, and reference URL are included. The object identifiers (OIDs) for these data are defined by the fault management MIB, SUN-FM-MIB, installed in /etc/sma/snmp/mibs/. The same information is delivered to both SNMPv1 and SNMPv2 trap sinks. At present, this is the only trap defined by the fault management MIB, but others may be generated in the future. Here's an example of an SNMPv2 notification as decoded by snmptrapd(1M):

    2006-02-07 16:36:34 stomper [192.xx.xx.xx]:
        DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (2266748911) 262 days, 8:31:29.11
        SNMPv2-MIB::snmpTrapOID.0 = OID: SUN-FM-MIB::sunFmProblemTrap
        SUN-FM-MIB::sunFmProblemUUID."a58aa105-4fab-6e16-8557-ab7687113de7" = STRING: "a58aa105-4fab-6e16-8557-ab7687113de7"
        SUN-FM-MIB::sunFmProblemCode."a58aa105-4fab-6e16-8557-ab7687113de7" = STRING: SUN4U-8000-KA
        SUN-FM-MIB::sunFmProblemURL."a58aa105-4fab-6e16-8557-ab7687113de7" = STRING: http://sun.com/msg/SUN4U-8000-KA

The diagnostic code and URL can be used to find knowledge base articles describing the fault and suggested corrective action. The diagnosis UUID can be used to get further detail from fmdump(1M), or from the MIB, as seen in the next section.
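For example, an NMS-side script might pull the UUID out of the decoded notification text so that it can be handed to fmdump(1M) on the affected system. The sketch below assumes the snmptrapd(1M) output format shown above; the function name trap_uuid and the file name trap.txt are hypothetical.

```shell
# Hypothetical helper: read decoded snmptrapd output on stdin and print
# the diagnosis UUID from the first sunFmProblemUUID varbind found.
# Assumes the "name = STRING: value" layout shown in the example above.
trap_uuid() {
    sed -n 's/.*sunFmProblemUUID\.[^=]*= STRING: "\{0,1\}\([0-9a-fA-F-]\{36\}\)"\{0,1\}.*/\1/p' |
        head -n 1
}

# Example use on the NMS:
#   uuid=$(trap_uuid < trap.txt)
# Then, as a privileged user on the affected system:
#   fmdump -v -u "$uuid"
```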

libfmd_snmp: a MIB plugin for the System Management Agent (SMA)

Knowing that a fault has been diagnosed is important, but the amount of information delivered with the trap or notification may not be enough to provide an administrator with a complete understanding of the problem. The fault management MIB defines a wealth of detail, and this detail is made available via SMA by libfmd_snmp. In addition to fault diagnosis detail, this MIB also offers information about faulty components and the configuration of the fault manager itself, similar to that offered by fmadm(1M).

Enabling the plugin requires configuring the master SNMP agent on each server you wish to query. Adding the architecture-dependent line (shown here for 64-bit SPARC)

    dlmod sunFM /usr/lib/fm/sparcv9/libfmd_snmp.so.1

to /etc/sma/snmp/snmpd.conf will cause the MIB plugin to be automatically loaded and initialised the next time the master agent is started, such as via /etc/init.d/init.sma. In the future, SMA will be managed via SMF; see 6349499[0].

No further configuration is necessary, although the usual snmpd.conf(4) directives will allow you to restrict access to the MIB, which may be important to you since some of the information it provides is ordinarily restricted to privileged users.
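As one illustration of such a restriction, the standard Net-SNMP rocommunity directive accepts an optional source after the community string, so read-only queries can be limited to a single management station. This is a sketch, not a recommended policy: the community string fmread and the address 192.0.2.10 are placeholders, and the directive governs access to the whole agent rather than only the fault management MIB.

```
# /etc/sma/snmp/snmpd.conf (fragment; community and address are placeholders)

# Allow read-only SNMPv1/v2c queries using community "fmread",
# but only from the NMS at 192.0.2.10:
rocommunity fmread 192.0.2.10
```

The same directive also takes an optional OID as a third argument to confine access to one subtree; see snmpd.conf(4) for that and for the finer-grained view and access directives.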

The fault management MIB provides four tables and a single scalar, in addition to the trap/notification described above. sunFmProblemTable and sunFmFaultEventTable are logically two pieces of the same table; they are separated only because MIBs do not support nested tables. The problem table contains the scalar information about each diagnosis, while the fault event table contains the list of events associated with each diagnosis. Both tables are indexed by diagnosis UUID; the fault event table uses a second, numeric index to distinguish between multiple events associated with a diagnosis.

In response to the trap above, you might want to know which Automated System Reconfiguration Unit(s) (ASRU(s)) the fault manager believes may have caused the fault. This is just a fancy way of saying we want to know what broke to trigger the diagnosis. Because each ASRU is associated with a fault event, we first need to know how many fault events are associated with this diagnosis; we can then look up each one's ASRU in the fault event table. To do this, we'll use snmpget(1M), delivered by Solaris in /usr/sfw/bin. Of course, you can use any NMS software.

    nms$ snmpget -c public -v 2c stomper \
        sunFmProblemSuspectCount.\"a58aa105-4fab-6e16-8557-ab7687113de7\"
    SUN-FM-MIB::sunFmProblemSuspectCount."a58aa105-4fab-6e16-8557-ab7687113de7" = Gauge32: 1

This diagnosis has only one fault event associated with it. To find the ASRU, we look at the fault event table entry indexed by the UUID and the fault index. Since fault events are indexed starting from 1, the query is:

    nms$ snmpget -c public -v 2c stomper \
        sunFmFaultEventASRU.\"a58aa105-4fab-6e16-8557-ab7687113de7\".1
    SUN-FM-MIB::sunFmFaultEventASRU."a58aa105-4fab-6e16-8557-ab7687113de7".1
    = STRING: cpu:///cpuid=4/serial=23EBEC1505
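When a diagnosis has more than one suspect, the two lookups above generalise to a loop: fetch the suspect count, then fetch the ASRU at each index from 1 up to that count. The following is a minimal sketch, assuming the Net-SNMP snmpget(1M) client with the MIB loaded as described earlier; the function name list_asrus is hypothetical, and -Oqv is the Net-SNMP output option that prints only the value.

```shell
# Hypothetical helper: list_asrus HOST COMMUNITY UUID
# Prints the ASRU of every fault event attached to one diagnosis.
# Assumes snmpget is on PATH and SUN-FM-MIB names resolve (MIBS=+ALL).
list_asrus() {
    host=$1 community=$2 uuid=$3

    # How many fault events does this diagnosis have?
    count=$(snmpget -Oqv -c "$community" -v 2c "$host" \
        "sunFmProblemSuspectCount.\"$uuid\"") || return 1

    # Refuse to loop on anything that is not a plain non-negative integer.
    case $count in
        ''|*[!0-9]*) return 1 ;;
    esac

    # Fault events are indexed starting from 1.
    i=1
    while [ "$i" -le "$count" ]; do
        snmpget -Oqv -c "$community" -v 2c "$host" \
            "sunFmFaultEventASRU.\"$uuid\".$i" || return 1
        i=$((i + 1))
    done
}

# Example use:
#   list_asrus stomper public a58aa105-4fab-6e16-8557-ab7687113de7
```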
Most NMSs offer scripting facilities that allow you to perform actions similar to these in response to a trap. Alternatively, you could poll the data on a regular basis. Many implementations do both, using polling to offset the risk of losing traps, which, like all SNMP datagrams, are not delivered reliably. Informs, also known as acknowledged notifications and available in SNMPv2c and SNMPv3, offer only a partial remedy to this problem, and are not supported by snmp-trapgen at this time.

A polling NMS may wish to poll the systemwide faulty component count, provided by the MIB as sunFmFaultCount. An increase in this gauge without a corresponding problem trap is a good indication that the trap has been lost. More detail about devices the fault manager believes to be in degraded or faulted states is available via the sunFmResourceTable; walking this table provides a ready - and remote - answer to the common question "What's broken on that machine?" For this, we use the snmpwalk(1M) utility:

    nms$ snmpwalk -c public -v 2c stomper sunFmResourceTable
    SUN-FM-MIB::sunFmResourceFMRI.1 = STRING: cpu:///cpuid=4/serial=23EBEC1505
    SUN-FM-MIB::sunFmResourceStatus.1 = INTEGER: degraded(3)
    SUN-FM-MIB::sunFmResourceDiagnosisUUID.1 = STRING:
        "a58aa105-4fab-6e16-8557-ab7687113de7"
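The lost-trap check on sunFmFaultCount mentioned earlier can be sketched as a small polling function that remembers the last value it saw. This is an illustration under stated assumptions rather than a complete monitor: the function name poll_fault_count and the state file are hypothetical, the scalar is read at instance .0, and correlating an increase with the traps actually received is left to the caller.

```shell
# Hypothetical helper: poll_fault_count HOST COMMUNITY STATEFILE
# Prints "increase OLD NEW" when sunFmFaultCount has grown since the
# previous poll recorded in STATEFILE; prints nothing otherwise.
# A missing or unreadable state file is treated as a previous count of 0.
poll_fault_count() {
    host=$1 community=$2 statefile=$3

    new=$(snmpget -Oqv -c "$community" -v 2c "$host" sunFmFaultCount.0) || return 1
    case $new in
        ''|*[!0-9]*) return 1 ;;   # not a plain integer; do not record it
    esac

    old=0
    [ -r "$statefile" ] && old=$(cat "$statefile")
    case $old in
        ''|*[!0-9]*) old=0 ;;      # ignore a corrupt state file
    esac

    echo "$new" > "$statefile"

    if [ "$new" -gt "$old" ]; then
        echo "increase $old $new"
    fi
}

# Example use (for instance from cron on the NMS):
#   poll_fault_count stomper public /var/tmp/stomper.faultcount
```

On an increase, a caller could then walk sunFmResourceTable as shown above to see which resources are affected.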
Finally, the sunFmConfigTable offers remote access to the same information provided by fmadm(1M)'s config subcommand; like the other tables, it can be accessed using snmpget(1M), snmpwalk(1M), or any other SNMP-compatible NMS implementation. You can find the complete fault management MIB at the Fault Management community site, and in build 33 and later at /etc/sma/snmp/mibs/SUN-FM-MIB.mib.

[0] The bug should be visible, but it isn't. This is itself a bug, which the SFW team is working to fix.

About

wesolows
