Fun with the FMA and SNMP

I've been working back in the Auto Service Request space for the past few weeks, doing some education as we build and test various products, so it seemed like a great opportunity to put out a quick and dirty "SNMP with the FMA" blog post. There are some excellent previous materials out there that are extremely applicable and that you could probably get by with on their own; I will point those out as we go (A Louder Voice for the Fault Manager, referenced in the setup section below, is the main one).

With those articles and a little help from your nearby SNMP guru, you are pretty much good to go. I've extended the information in those papers a tiny bit by showing how to link the information back to Sun's message repository. The great thing about the FMA is that we have "special sauce" around it that can create a full service cycle, including a Remote Shell application that service can use, with your permission and oversight, to further diagnose problems from a remote location, as well as tools to deliver faults back to Sun Microsystems, including our own Common Array Manager.

Introduction


This post guides you through a two-system setup:

  • system1 - enabled to send traps when a fault is detected by the FMA
  • system2 - enabled to receive traps from system1

Throughout this write-up, a roughly equivalent FMA command is given that can be run on the originating host so you can see what the SNMP trap is delivering. In general, I prefer the FMA command on the originating host, since it usually provides additional information and formatting that may not be available in the SNMP trap or walk.

Setup of Trap Sender


system1 must be set up to deliver traps when FMA events occur.

To do this, follow the instructions provided at A Louder Voice for the Fault Manager, summarized here:


  • Create the file /etc/sma/snmp/snmp.conf (if it doesn't exist) and add the line "mibs +ALL"
  • Add this line to /etc/sma/snmp/snmpd.conf: trap2sink system2
  • Add one more line to snmpd.conf to support deeper information queries: dlmod sunFM /usr/lib/fm/amd64/libfmd_snmp.so.1

Assuming you have made no other changes prior to this, your system should be ready to go. You do have to restart the SNMP service at this point (I always reboot... it's a Windows habit).
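If it helps to see it all in one place, here's a minimal sketch of the resulting files on system1 and the SMF restart that avoids the reboot. The service name is an assumption based on the standard Solaris 10 SMA instance, so verify it with svcs on your system, and use the libfmd_snmp.so.1 path that matches your architecture.

# /etc/sma/snmp/snmp.conf
mibs +ALL

# additions to /etc/sma/snmp/snmpd.conf
trap2sink system2
dlmod sunFM /usr/lib/fm/amd64/libfmd_snmp.so.1

# restart the SNMP agent so the changes take effect (no reboot required)
bash-3.00# svcadm restart svc:/application/management/sma:default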

Setup of Trap Receiver


system2 must be set up to receive traps; for demo purposes this is simple:

  • Run /usr/sfw/sbin/snmptrapd -P (see the notes on flags and community authorization just below)
  • Watch for traps
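Two notes on the receiver, both assumptions to verify against your own net-snmp version rather than gospel: -P is a deprecated spelling of "stay in the foreground and log to stderr", so newer snmptrapd builds prefer the -f and -L flags, and recent net-snmp releases also refuse to log traps for a community that hasn't been authorized in snmptrapd.conf.

# keep snmptrapd in the foreground and log incoming traps to stdout
/usr/sfw/sbin/snmptrapd -f -Lo

# on newer net-snmp releases, authorize the community first, e.g. in
# /etc/sma/snmp/snmptrapd.conf (the path is an assumption for the SMA build)
authCommunity log public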

Receiving FMA traps


When system2 receives FMA traps from system1, they will look like this (formatting rearranged slightly for readability):


DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (1190862249) 137 days, 19:57:02.49
SNMPv2-MIB::snmpTrapOID.0 = OID: SUN-FM-MIB::sunFmProblemTrap
SUN-FM-MIB::sunFmProblemUUID."33570acb-e108-4ca8-8276-c67aeecf2043" = STRING: "33570acb-e108-4ca8-8276-c67aeecf2043"
SUN-FM-MIB::sunFmProblemCode."33570acb-e108-4ca8-8276-c67aeecf2043" = STRING: ZFS-8000-CS
SUN-FM-MIB::sunFmProblemURL."33570acb-e108-4ca8-8276-c67aeecf2043" = STRING: http://sun.com/msg/ZFS-8000-CS

Check out the URL in sunFmProblemURL, http://sun.com/msg/ZFS-8000-CS. You can actually go there and get the System Administrator Actions that should be taken as well as an extended view of the problem (not contextualized to your particular fault, but it is the general fault information you would see if you were on the system itself).

This trap is roughly equivalent to the information you would receive from running the
basic fmdump command on the system with the fault. You could also run the "fmdump -v -u 33570acb-e108-4ca8-8276-c67aeecf2043" command on the trap originator to get a bit more information:

bash-3.00# fmdump -v -u 33570acb-e108-4ca8-8276-c67aeecf2043
TIME                 UUID                                 SUNW-MSG-ID
Jun 26 10:57:45.3602 33570acb-e108-4ca8-8276-c67aeecf2043 ZFS-8000-CS
  100%  fault.fs.zfs.pool

        Problem in: zfs://pool=data
           Affects: zfs://pool=data
               FRU: -
          Location: -

Navigating Problems on the Remote System

Let's start at the top now. We can quickly navigate all known, unresolved problems on the remote system by walking the sunFmProblemTable:

system2:/root-> /usr/sfw/bin/snmpwalk -v2c -c public -t 20 system1 sunFmProblemTable

This results in a dump of all problems, with UUIDs that can be used for deeper queries. The following printout shows only a single problem on the system; additional problems are listed in blocks by attribute (so all sunFmProblemUUIDs are lumped together, followed by all sunFmProblemCodes, and so on).

SUN-FM-MIB::sunFmProblemUUID."0f3dcdf3-f85b-c091-8f1f-ce2164976cda" = STRING: "0f3dcdf3-f85b-c091-8f1f-ce2164976cda"
SUN-FM-MIB::sunFmProblemCode."0f3dcdf3-f85b-c091-8f1f-ce2164976cda" = STRING: ZFS-8000-CS
SUN-FM-MIB::sunFmProblemURL."0f3dcdf3-f85b-c091-8f1f-ce2164976cda" = STRING: http://sun.com/msg/ZFS-8000-CS
SUN-FM-MIB::sunFmProblemDiagEngine."0f3dcdf3-f85b-c091-8f1f-ce2164976cda" = STRING: fmd:///module/fmd
SUN-FM-MIB::sunFmProblemDiagTime."0f3dcdf3-f85b-c091-8f1f-ce2164976cda" = STRING: 2008-6-18,10:7:4.0,-6:0
SUN-FM-MIB::sunFmProblemSuspectCount."0f3dcdf3-f85b-c091-8f1f-ce2164976cda" = Gauge32: 1

This is roughly equivalent to the fmdump command on a system, though basic fmdump gives only the UUID, code, and MSG-ID.
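If you already have a UUID (from the trap or from a previous walk), you don't have to walk the whole table; snmpget can fetch individual columns directly. A sketch, using the UUID from above (the shell quoting around the string index is the only fiddly part):

system2:/root-> /usr/sfw/bin/snmpget -v2c -c public -t 20 system1 \
    'SUN-FM-MIB::sunFmProblemCode."0f3dcdf3-f85b-c091-8f1f-ce2164976cda"' \
    'SUN-FM-MIB::sunFmProblemURL."0f3dcdf3-f85b-c091-8f1f-ce2164976cda"'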

Based on this information, we can look at the ZFS-8000-CS message at sun.com and determine what our next steps should be. It indicates that running zpool status -x on the system with the fault would be useful. Going to the originating system and running it returns:


bash-3.00# zpool status -x
  pool: data
 state: UNAVAIL
status: One or more devices could not be opened.  There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        data        UNAVAIL      0     0     0  insufficient replicas
          mirror    UNAVAIL      0     0     0  insufficient replicas
            c2t0d0  UNAVAIL      0     0     0  cannot open
            c3t1d0  UNAVAIL      0     0     0  cannot open
          mirror    UNAVAIL      0     0     0  insufficient replicas
            c2t2d0  UNAVAIL      0     0     0  cannot open
            c3t3d0  UNAVAIL      0     0     0  cannot open
          mirror    UNAVAIL      0     0     0  insufficient replicas
            c2t4d0  UNAVAIL      0     0     0  cannot open
            c3t5d0  UNAVAIL      0     0     0  cannot open
          mirror    UNAVAIL      0     0     0  insufficient replicas
            c2t6d0  UNAVAIL      0     0     0  cannot open
            c3t7d0  UNAVAIL      0     0     0  cannot open
          mirror    UNAVAIL      0     0     0  insufficient replicas
            c2t8d0  UNAVAIL      0     0     0  cannot open
            c3t9d0  UNAVAIL      0     0     0  cannot open
          mirror    UNAVAIL      0     0     0  insufficient replicas
            c2t10d0 UNAVAIL      0     0     0  cannot open
            c3t11d0 UNAVAIL      0     0     0  cannot open
          mirror    UNAVAIL      0     0     0  insufficient replicas
            c2t12d0 UNAVAIL      0     0     0  cannot open
            c3t13d0 UNAVAIL      0     0     0  cannot open
          mirror    UNAVAIL      0     0     0  insufficient replicas
            c2t14d0 UNAVAIL      0     0     0  cannot open
            c3t15d0 UNAVAIL      0     0     0  cannot open
          mirror    UNAVAIL      0     0     0  insufficient replicas
            c2t16d0 UNAVAIL      0     0     0  cannot open
            c3t17d0 UNAVAIL      0     0     0  cannot open
          mirror    UNAVAIL      0     0     0  insufficient replicas
            c2t18d0 UNAVAIL      0     0     0  cannot open
            c3t19d0 UNAVAIL      0     0     0  cannot open
          mirror    UNAVAIL      0     0     0  insufficient replicas
            c2t20d0 UNAVAIL      0     0     0  cannot open
            c3t21d0 UNAVAIL      0     0     0  cannot open
          mirror    UNAVAIL      0     0     0  insufficient replicas
            c2t22d0 UNAVAIL      0     0     0  cannot open
            c3t23d0 UNAVAIL      0     0     0  cannot open
bash-3.00#

For some history on this particular problem: we disconnected a JBOD that the "data" pool was built on, so none of the devices are available... ouch.

You can look more deeply at the events that resulted in the problem by walking the sunFmFaultEventTable:

system2:/root-> /usr/sfw/bin/snmpwalk -v2c -c public -t 20 system1 sunFmFaultEventTable
SUN-FM-MIB::sunFmFaultEventProblemUUID."0f3dcdf3-f85b-c091-8f1f-ce2164976cda".1 = STRING: "0f3dcdf3-f85b-c091-8f1f-ce2164976cda"
SUN-FM-MIB::sunFmFaultEventProblemUUID."33570acb-e108-4ca8-8276-c67aeecf2043".1 = STRING: "33570acb-e108-4ca8-8276-c67aeecf2043"
SUN-FM-MIB::sunFmFaultEventProblemUUID."3600a05e-acc1-cae2-c185-f50852156777".1 = STRING: "3600a05e-acc1-cae2-c185-f50852156777"
SUN-FM-MIB::sunFmFaultEventProblemUUID."97bfeb63-7b02-c2b6-c51f-c451a9f760c5".1 = STRING: "97bfeb63-7b02-c2b6-c51f-c451a9f760c5"
SUN-FM-MIB::sunFmFaultEventClass."0f3dcdf3-f85b-c091-8f1f-ce2164976cda".1 = STRING: fault.fs.zfs.pool
SUN-FM-MIB::sunFmFaultEventClass."33570acb-e108-4ca8-8276-c67aeecf2043".1 = STRING: fault.fs.zfs.pool
SUN-FM-MIB::sunFmFaultEventClass."3600a05e-acc1-cae2-c185-f50852156777".1 = STRING: fault.fs.zfs.pool
SUN-FM-MIB::sunFmFaultEventClass."97bfeb63-7b02-c2b6-c51f-c451a9f760c5".1 = STRING: fault.fs.zfs.pool
SUN-FM-MIB::sunFmFaultEventCertainty."0f3dcdf3-f85b-c091-8f1f-ce2164976cda".1 = Gauge32: 100
SUN-FM-MIB::sunFmFaultEventCertainty."33570acb-e108-4ca8-8276-c67aeecf2043".1 = Gauge32: 100
SUN-FM-MIB::sunFmFaultEventCertainty."3600a05e-acc1-cae2-c185-f50852156777".1 = Gauge32: 100
SUN-FM-MIB::sunFmFaultEventCertainty."97bfeb63-7b02-c2b6-c51f-c451a9f760c5".1 = Gauge32: 100
SUN-FM-MIB::sunFmFaultEventASRU."0f3dcdf3-f85b-c091-8f1f-ce2164976cda".1 = STRING: zfs://pool=1ca09fa50e7ca8c7
SUN-FM-MIB::sunFmFaultEventASRU."33570acb-e108-4ca8-8276-c67aeecf2043".1 = STRING: zfs://pool=data
SUN-FM-MIB::sunFmFaultEventASRU."3600a05e-acc1-cae2-c185-f50852156777".1 = STRING: zfs://pool=6f658a6c4b99b18b
SUN-FM-MIB::sunFmFaultEventASRU."97bfeb63-7b02-c2b6-c51f-c451a9f760c5".1 = STRING: zfs://pool=data
SUN-FM-MIB::sunFmFaultEventFRU."0f3dcdf3-f85b-c091-8f1f-ce2164976cda".1 = STRING: -
SUN-FM-MIB::sunFmFaultEventFRU."33570acb-e108-4ca8-8276-c67aeecf2043".1 = STRING: -
SUN-FM-MIB::sunFmFaultEventFRU."3600a05e-acc1-cae2-c185-f50852156777".1 = STRING: -
SUN-FM-MIB::sunFmFaultEventFRU."97bfeb63-7b02-c2b6-c51f-c451a9f760c5".1 = STRING: -
SUN-FM-MIB::sunFmFaultEventResource."0f3dcdf3-f85b-c091-8f1f-ce2164976cda".1 = STRING: zfs://pool=1ca09fa50e7ca8c7
SUN-FM-MIB::sunFmFaultEventResource."33570acb-e108-4ca8-8276-c67aeecf2043".1 = STRING: zfs://pool=data
SUN-FM-MIB::sunFmFaultEventResource."3600a05e-acc1-cae2-c185-f50852156777".1 = STRING: zfs://pool=6f658a6c4b99b18b
SUN-FM-MIB::sunFmFaultEventResource."97bfeb63-7b02-c2b6-c51f-c451a9f760c5".1 = STRING: zfs://pool=data

This is roughly equivalent to the fmdump -v command. The fmdump -V command cannot be duplicated over SNMP, though it is useful to run on the host side: fmdump -V can provide product, chassis, and server IDs as well as a more complete list of the faults and ereports that resulted in the diagnosis.
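For example, against the UUID from the trap above (output omitted here, since it is lengthy and system-specific):

bash-3.00# fmdump -V -u 33570acb-e108-4ca8-8276-c67aeecf2043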

You can also view the fault management information by resource, rather than by fault or event, over SNMP:

system2:/root-> /usr/sfw/bin/snmpwalk -v2c -c public -t 20 system1 sunFmResourceTable
SUN-FM-MIB::sunFmResourceFMRI.1 = STRING: zfs://pool=data
SUN-FM-MIB::sunFmResourceStatus.1 = INTEGER: faulted(5)
SUN-FM-MIB::sunFmResourceDiagnosisUUID.1 = STRING: "33570acb-e108-4ca8-8276-c67aeecf2043"

This is similar to the "fmadm faulty" command that can be run on the system. The faulty parameter results in some additional information and text, though that text can also be retrieved at the event URL identified earlier.

While this is the "SPAM" approach to inquiring about a system, you could also react to each problem as it comes in.
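If you'd rather react to each trap as it arrives instead of polling, snmptrapd can hand every matching trap to a program via its traphandle directive. A sketch only; the handler path and script name here are hypothetical:

# in snmptrapd.conf on system2
traphandle SUN-FM-MIB::sunFmProblemTrap /usr/local/bin/fma-trap-handler.sh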

As was mentioned previously, there isn't a "ton" of information available within the SNMP MIB itself. Receiving a particular trap often means doing additional diagnosis on the system with the fault, so some ssh work may be necessary.
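A hypothetical fma-trap-handler.sh along those lines might simply turn around and gather that extra detail over ssh. This is a sketch only, assuming key-based root login to the sending host; snmptrapd feeds the handler the sender's hostname, its address, and one "OID value" line per varbind on stdin:

#!/bin/sh
# hypothetical handler invoked by snmptrapd's traphandle directive
read hostname            # first line of stdin: sending host
read address             # second line: transport address
# remaining lines are the varbinds; pull the problem UUID out of them
uuid=`grep sunFmProblemUUID | head -1 | sed 's/.*"\([^"]*\)".*/\1/'`
# go back to the sender for the detail the trap itself doesn't carry
# (system1 is hard-coded here to match the demo setup)
ssh root@system1 "fmdump -v -u $uuid; zpool status -x" >> /var/tmp/fma-traps.log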

Additional things to do with FMA


You can dump the topology of a system!

Go to /usr/lib/fm/fmd and run "fmtopo"; you should get the complete topology as it is recognized by the fault management infrastructure.


bash-3.00# ./fmtopo
TIME UUID
Jun 26 13:20:16 ddba8792-0166-6fb4-a81b-db3de9622649

hc://:product-id=Sun-Fire-X4150:chassis-id=09999:server-id=server1/motherboard=0
hc://:product-id=Sun-Fire-X4150:chassis-id=09999:server-id=server1/motherboard=0/chip=0
hc://:product-id=Sun-Fire-X4150:chassis-id=09999:server-id=server1/motherboard=0/chip=0/cpu=0
hc://:product-id=Sun-Fire-X4150:chassis-id=09999:server-id=server1/motherboard=0/chip=0/cpu=1
hc://:product-id=Sun-Fire-X4150:chassis-id=0811QAR189:server-id=server1/motherboard=0/chip=0/cpu=2
hc://:product-id=Sun-Fire-X4150:chassis-id=09999:server-id=server1/motherboard=0/chip=0/cpu=3
hc://:product-id=Sun-Fire-X4150:chassis-id=09999:server-id=server1/motherboard=0/memory-controller=0
hc://:product-id=Sun-Fire-X4150:chassis-id=09999:server-id=server1/motherboard=0/memory-controller=0/dram-channel=0
hc://:product-id=Sun-Fire-X4150:chassis-id=09999:server-id=server1:serial=518545073102016426:part=7777777:revision/motherboard=0/memory-controller=0/dram-channel=0/dimm=0
hc://:product-id=Sun-Fire-X4150:chassis-id=09999:server-id=server1:serial=518545073102016426:part=7777777:revision/motherboard=0/memory-controller=0/dram-channel=0/dimm=0/rank=0
hc://:product-id=Sun-Fire-X4150:chassis-id=09999:server-id=server1:serial=518545073102016426:part=7777777:revision/motherboard=0/memory-controller=0/dram-channel=0/dimm=0/rank=1
hc://:product-id=Sun-Fire-X4150:chassis-id=09999:server-id=server1/motherboard=0/memory-controller=0/dram-channel=1
hc://:product-id=Sun-Fire-X4150:chassis-id=09999:server-id=server1:serial=555555555:part=777777777:revision/motherboard=0/memory-controller=0/dram-channel=1/dimm=0
... (and on and on)

Artificial Fault Injection


There isn't a lot of documentation that I've found for fminject, but it looks like it can be quite useful. At its simplest, it lets you replay a set of ereports and faults that previously occurred on a system. For example, we can replay the error log that resulted in the FMA trap for the ZFS pool problem above. In one window on the system that sent the trap, start up the simulator:

cd /usr/lib/fm/fmd
./fmsim -i

In another window, replay an error log:

cd /usr/lib/fm/fmd
./fminject -v /var/fm/fmd/errlog

You will see the SNMP trap re-appear in the trap receiver. Note that I haven't found a good way to tell, from the trap receiver's window, that this is a replayed trap rather than a new one, so use this with caution.
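If you want to see what a given error log actually contains before (or after) you replay it, fmdump can read the ereports out of it on the sending system; for example:

bash-3.00# fmdump -e -v /var/fm/fmd/errlog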

Summary


While an SNMP trap rarely contains enough information to completely diagnose a problem on its own, the traps are a great opportunity for integration with higher-level management tools. More often than not, additional commands need to be run and information gathered on the system with the fault (be it logs or the output of various commands, especially in the ZFS space), but the messages themselves typically point to the next steps for diagnosis.
