Thursday Dec 17, 2009

Tweeting your Sun Storage 7000 Appliance Alerts

Twitter, Instant Messages, Mobile Alerts have always fascinated me. I truly believe that a Storage Administrator should not have to leave the comfort of their iPhone, Droid or Palm Pre to do 90% of their day to day management tasks. As I was scanning the headlines of blogs.sun.com I saw this great article on Tweeting from Command Line using Python.

So, leading into how to manage a Sun Storage 7000 Appliance using Alerts (the next article in my series) I thought I would take some time and adapt this script to tweet my received Sun Storage 7000 Appliance Traps. I am going to use the AK MIB traps (to be explained in more detail in the next article) to achieve this.

Writing the Python Trap Handler

First, create the trap handler (this is based on the Python Script presented in the Blog Article: Tweeting from Command Line using Python).

Here is the Python Script:

#!/usr/bin/python
import sys

from os import popen

def tweet(user,password,message):
print 'Hold on there %s....Your message %s is getting posted....' % (message, user)

url = 'http://twitter.com/statuses/update.xml'

curl = '/usr/dist/local/sei/tools/SunOS-sparc/curl -s -u %s:%s -d status="%s" %s' % (user,password,message,url)

pipe = popen(curl, 'r')
print 'Done...awesome'

if __name__ == '__main__':
host = sys.stdin.readline()
ip = sys.stdin.readline()
uptime = sys.stdin.readline()
uuid = sys.stdin.readline()
alertclass = sys.stdin.readline()
alertcount = sys.stdin.readline()
alerttype = sys.stdin.readline()
alertseverity = sys.stdin.readline()
alertsresponse = sys.stdin.readline()
messageArray = [host,ip,alerttype]
t=","
message = t.join(messageArray)
message = message[0:140]

user = "yourtwitter" #put your username inside these quotes

password = "yourpassword" #put your password inside these quotes
tweet(user,password,message)

You will have to make the following changes at a minimum


  • Re-insert the missing tabs based on Python formatting
  • Ensure the path to CURL is appropriate
  • Change the user and password variables to your Twitter account

Once that is done you should be set.

Adding the Trap Handler to SNMP

Next, set up your snmptrapd.conf to handle traps from the AK MIB by invoking the Python Script above. My /etc/sma/snmp/snmptrapd.conf looks something like this:

traphandle .1.3.6.1.4.1.42.2.225.1.3.0.1 /export/home/oracle/ak-tweet.py

The OID .1.3.6.1.4.1.42.2.225.1.3.0.1 identifies the sunAkTraps portion of the AK-MIB delivered with the Sun Storage 7000 Appliance.

Now, invoke snmptrapd using the above configuration file (there are many ways to do this but I am doing the way I know will pick up my config file :-)

/usr/sfw/sbin/snmptrapd -c /etc/sma/snmp/snmptrapd.conf -P

Sending Alerts from the Sun Storage 7000

Using the article I posted yesterday, ensure SNMP is enabled with a trapsink identifying the system where your trap receiver is running. Now we have to enable an alert to be sent via SNMP from your Sun Storage 7000 Appliance (this is different from the default Fault Management Traps I discussed yesterday).

For now, trust me on this, I will explain more in my next article, let's enable a simple ARC size threshold to be violated. Go into the Browser User Interface for your system (or the simulator) and go into the Configuration -> Alerts screen. Click through to the "Thresholds" and add one that you know will be violated, like this one:

Each alert that is sent gets posted to the twitter account identified within the Python Script! And as my friends quickly noted, from there it can go to your Facebook account where you can quickly declare "King of the Lab"!

Thanks Sandip for your inspiring post this morning :-)

Wednesday Dec 16, 2009

The SNMP Service on a Sun Storage 7000 Appliance

Without a doubt, SNMP rules the playground in terms of monitoring hardware assets, and many software assets, in a data center monitoring ecosystem. It is the single biggest integration technology I'm asked about and that I've encountered when discussing monitoring with customers.

Why does SNMP have such amazing staying power?


  • It's extensible (vendors can provide MIBs and extend existing MIBs)
  • It's simple (hierarchical data rules and really it boils down to GET, SET, TRAP)
  • It's ubiquitous (monitoring tools accept SNMP, systems deliver SNMP)
  • It operates on two models, real time (traps) and polling (get)
  • It has aged gracefully (security extensions in v4 did not destroy it's propagation)

To keep the SNMP support in the Sun Storage 7000 Appliances relatively succinct, I am going to tackle this in two separate posts. This first post shows how to enable SNMP and what you get "out of the box" once it's enabled. The next post discusses how to deliver more information via SNMP (alerts with more information and threshold violations).

To get more information on SNMP on the Sun Storage 7000 and to download the MIBs that will be discussed here, go to the Help Wiki on a Sun Storage 7000 Appliance (or the simulator):


  • SNMP - https://[hostname]:215/wiki/index.php/Configuration:Services:SNMP

Also, as I work at Sun Microsystems, Inc., all of my examples of walking MIBs on a Sun Storage 7000 Appliance or receiving traps will be from a Solaris-based system. There are plenty of free / open source / trial packages for other Operating System platforms so you will have to adapt this content appropriately for your platform.

One more note as I progress in this series, all of my examples are from the CLI or from scripts, so you won't find many pretty pictures in the series :-)

Enabling SNMP on the Sun Storage 7000 Appliance gives you the ability to:


  • Receive traps (delivered via Sun's Fault Manager (FM) MIB)
  • GET system information (MIB-II System, MIB-II Interfaces, Sun Enterprise MIB)
  • GET information customized to the appliance (using the Sun Storage AK MIB)

Enabling alerts (covered in the next article) extends the SNMP support by delivering targeted alerts via the AK MIB itself.

Enable SNMP


The first thing we'll want to do is log into a target Sun Storage 7000 Appliance via SSH and check if SNMP is enabled.


aie-7110j:> configuration services snmp
aie-7110j:>configuration services snmp> ls
Properties:
<status> = disabled
community = public
network =
syscontact =
trapsinks =

aie-7110j:configuration services snmp>

Here you can see it is currently disabled and that we have to set up all of the SNMP parameters. The most common community string to this day is "public" and as we will not be changing system information via SNMP we will keep it. The "network" parameter to use for us is 0.0.0.0/0, this allows access to the MIB from any network. Finally, I will add a single trapsink so that any traps get sent to my management host. The last step shown is to enable the service once the parameters are committed.


aie-7110j:configuration services snmp> set network=0.0.0.0/0
network = 0.0.0.0/0 (uncommitted)
aie-7110j:configuration services snmp> set syscontact="Paul Monday"
syscontact = Paul Monday (uncommitted)
aie-7110j:configuration services snmp> set trapsinks=10.9.166.33
trapsinks = 10.9.166.33 (uncommitted)
aie-7110j:configuration services snmp> commit
aie-7110j:configuration services snmp> enable
aie-7110j:configuration services snmp> show
Properties:
<status> = online
community = public
network = 0.0.0.0/0
syscontact = Paul Monday
trapsinks = 10.9.166.33

From the appliance perspective we are now up and running!

Get the MIBs and Install Them


As previously mentioned, all of the MIBs that are unique to the Sun Storage 7000 Appliance are also distributed with the appliance. Go to the Help Wiki and download them, then move them to the appropriate location for monitoring.

On the Solaris system I'm using, that location is /etc/sma/snmp/mibs. Be sure to browse the MIB for appropriate tables or continue to look at the Help Wiki as it identifies relevant OIDs that we'll be using below.

Walking and GETting Information via the MIBs


Using standard SNMP operations, you can retrieve quite a bit of information. As an example from the management station, we will retrieve a list of shares available from the system using snmpwalk:


-bash-3.00# ./snmpwalk -c public -v 2c isv-7110h sunAkShareName
SUN-AK-MIB::sunAkShareName.1 = STRING: pool-0/MMC/deleteme
SUN-AK-MIB::sunAkShareName.2 = STRING: pool-0/MMC/data
SUN-AK-MIB::sunAkShareName.3 = STRING: pool-0/TestVarious/filesystem1
SUN-AK-MIB::sunAkShareName.4 = STRING: pool-0/oracle_embench/oralog
SUN-AK-MIB::sunAkShareName.5 = STRING: pool-0/oracle_embench/oraarchive
SUN-AK-MIB::sunAkShareName.6 = STRING: pool-0/oracle_embench/oradata
SUN-AK-MIB::sunAkShareName.7 = STRING: pool-0/AnotherProject/NoCacheFileSystem
SUN-AK-MIB::sunAkShareName.8 = STRING: pool-0/AnotherProject/simpleFilesystem
SUN-AK-MIB::sunAkShareName.9 = STRING: pool-0/default/test
SUN-AK-MIB::sunAkShareName.10 = STRING: pool-0/default/test2
SUN-AK-MIB::sunAkShareName.11 = STRING: pool-0/EC/tradetest
SUN-AK-MIB::sunAkShareName.12 = STRING: pool-0/OracleWork/simpleExport

Next, I can use snmpget to obtain a mount point for the first share:

-bash-3.00# ./snmpget -c public -v 2c isv-7110h sunAkShareMountpoint.1
SUN-AK-MIB::sunAkShareMountpoint.1 = STRING: /export/deleteme

It is also possible to get a list of problems on the system identified by problem code:

-bash-3.00# ./snmpwalk -c public -v 2c isv-7110h sunFmProblemUUID
SUN-FM-MIB::sunFmProblemUUID."91e97860-f1d1-40ef-8668-dc8fb85679bb" = STRING: "91e97860-f1d1-40ef-8668-dc8fb85679bb"

And then turn around and retrieve the associated knowledge article identifier:

-bash-3.00# ./snmpget -c public -v 2c isv-7110h sunFmProblemCode.\\"91e97860-f1d1-40ef-8668-dc8fb85679bb\\"
SUN-FM-MIB::sunFmProblemCode."91e97860-f1d1-40ef-8668-dc8fb85679bb" = STRING: AK-8000-86

The FM-MIB does not contain information on severity, but using the problem code I can SSH into the system and retrieve that information:

isv-7110h:> maintenance logs select fltlog select uuid="91e97860-f1d1-40ef-8668-dc8fb85679bb"
isv-7110h:maintenance logs fltlog entry-005> ls
Properties:
timestamp = 2009-12-15 05:55:37
uuid = 91e97860-f1d1-40ef-8668-dc8fb85679bb
desc = The service processor needs to be reset to ensure proper functioning.
type = Major Defect

isv-7110h:maintenance logs fltlog entry-005>

Take time to inspect the MIBs through your MIB Browser to understand all of the information available. I tend to shy away from using SNMP for getting system information and instead write scripts and workflows as much more information is available directly on the system, I'll cover this in a later article.

Receive the Traps


Trap receiving on Solaris is a piece of cake, at least for demonstration purposes. What you choose to do with the traps is a whole different process. Each tool has it's own trap monitoring facilities that will hand you the fields in different ways. For this example, Solaris just dumps the traps to the console.

Locate the "snmptrapd" binary on your Solaris system and start monitoring:


-bash-3.00# cd /usr/sfw/sbin
-bash-3.00# ./snmptrapd -P
2009-12-16 09:27:47 NET-SNMP version 5.0.9 Started.

From there you can wait for something bad to go wrong with your system or you can provoke it yourself. Fault Management can be a bit difficult to provoke intentionally since things one thinks would provoke a fault are actually administrator activites. Pulling a disk drive is very different from a SMART drive error on a disk drive. Similarly, pulling a Power Supply is different from tripping over a power cord and yanking it out. The former is not a fault since it is a complex operation requiring an administrator to unseat the power supply (or disk) whereas the latter occurs out in the wild all the time.

Here are some examples of FM traps I've received through this technique using various "malicious" techniques on a lab system ;-)

Here is an FM Trap when I "accidentally" tripped over a power cord in the lab. Be careful when you do this so you don't pull the system off the shelf if it is not racked properly (note that I formatted this a little bit from the raw output):


2009-11-16 12:25:34 isv-7110h [172.20.67.78]:
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (1285895753) 148 days, 19:55:57.53
SNMPv2-MIB::snmpTrapOID.0 = OID: SUN-FM-MIB::sunFmProblemTrap
SUN-FM-MIB::sunFmProblemUUID."2c7ff987-6248-6f40-8dbc-f77f22ce3752" = STRING: "2c7ff987-6248-6f40-8dbc-f77f22ce3752"
SUN-FM-MIB::sunFmProblemCode."2c7ff987-6248-6f40-8dbc-f77f22ce3752" = STRING: SENSOR-8000-3T
SUN-FM-MIB::sunFmProblemURL."2c7ff987-6248-6f40-8dbc-f77f22ce3752" = STRING: http://sun.com/msg/SENSOR-8000-3T

Notice again that I have a SunFmProblemUUID that I can turn around and shell into the system to obtain more details (similarly to what was shown in the last section). Again, the next article will contain an explanation of Alerts. Using the AK MIB and Alerts, we can get many more details pushed out to us via an SNMP Trap, and we have finer granularity as to the alerts that get pushed.

Here, I purchased a very expensive fan stopper-upper device from a fellow tester. It was quite pricey, it turns out it is also known as a "Twist Tie". Do NOT do this at home, seriously, the decreased air flow through the system can cause hiccups in your system.


DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (1285889746) 148 days, 19:54:57.46
SNMPv2-MIB::snmpTrapOID.0 = OID: SUN-FM-MIB::sunFmProblemTrap
SUN-FM-MIB::sunFmProblemUUID."cf480476-51b7-c53a-bd07-c4df59030284" = STRING: "cf480476-51b7-c53a-bd07-c4df59030284"
SUN-FM-MIB::sunFmProblemCode."cf480476-51b7-c53a-bd07-c4df59030284" = STRING: SENSOR-8000-26
SUN-FM-MIB::sunFmProblemURL."cf480476-51b7-c53a-bd07-c4df59030284" = STRING: http://sun.com/msg/SENSOR-8000-26

You will receive many, many other traps throughout the day including the Enterprise MIB letting us know when the system starts up or any other activities.

Wrap it Up


In this article, I illustrated enabling the SNMP Service on the Sun Storage 7000 Appliance via an SSH session. I also showed some basic MIB walking and traps that you'll receive once SNMP is enabled.

This is really simply the "start" of the information we can push through the SNMP pipe from a system. In the next article I'll show how to use Alerts on the system with the SNMP pipe so you can have more control over the events on a system that you wish to be notified about.

Thursday Jun 26, 2008

Fun with the FMA and SNMP

I've been working back in the Auto Service Request space for the past few weeks and doing some education and stuff as we build and test this product or that, it seemed like a great opportunity to put out a quick and dirty "SNMP with the FMA" blog post. There are some excellent previous materials out there that are extremely applicable and you could probably get by with, so first I will point those out:

With those articles and a little help from your nearby SNMP guru, you are pretty much good to go. I've extended the information in those papers a tiny, tiny bit with how to link the information back to Sun's message repository. The great thing about the FMA is that we have "special sauce" around that can create a full-service cycle (including a Remote Shell application that service can use with your permission and oversight to further diagnose problems from a remote location as well as tools to deliver faults back to Sun Microsystems, including our own Common Array Manager).

Introduction


This guides a user through a 2 system setup with

  • system1 - enabled to send traps when a fault is detected by the FMA
  • system 2 - enabled to receive traps from system1

Throughout this write-up, a roughly equivalent FMA command is given that can be run on the originating host so you can follow what the SNMP trap is delivering. To me, the FMA command on the originating host is definitely the preference though, since it usually provides additional information and formatting that may not be available in the SNMP trap or walk.

Setup of Trap Sender


system1 must be setup to deliver traps when FMA events occur.

To do this, follow the instructions provided at A Louder Voice for the Fault Manager, summarized here:


  • Create the file /etc/sma/snmp/snmp.conf (if it doesn't exist) and add the line "mibs +ALL"
  • Add the line to /etc/sma/snmp/snmpd.conf: trap2sink system2
  • Add the additional line to support deeper information query: dlmod sunFM /usr/lib/fm/amd64/libfmd_snmp.so.1

Assuming you have made no other changes prior to this, your system should be ready to go. You do have to restart the SNMP service at this point (I always reboot...its a Windows habit).

Setup of Trap Receiver


system2 must be setup to receive traps, this is simple for demo purposes:

  • Run /usr/sfw/sbin/snmptrapd -P
  • Watch for traps

Receiving FMA traps


When system2 receives FMA traps from system1, they will look like this (formatting appropriately rearranged)


DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (1190862249) 137 days, 19:57:02.49
SNMPv2-MIB::snmpTrapOID.0 = OID: SUN-FM-MIB::sunFmProblemTrap
SUN-FM-MIB::sunFmProblemUUID."33570acb-e108-4ca8-8276-c67aeecf2043" = STRING: "33570acb-e108-4ca8-8276-c67aeecf2043"
SUN-FM-MIB::sunFmProblemCode."33570acb-e108-4ca8-8276-c67aeecf2043" = STRING: ZFS-8000-CS
SUN-FM-MIB::sunFmProblemURL."33570acb-e108-4ca8-8276-c67aeecf2043" = STRING: http://sun.com/msg/ZFS-8000-CS

Check out that URL in the sunFmProblemURL, http://sun.com/msg/ZFS-8000-CS. You can actually go there and get the System Administrator Actions that should be taken as well as an extended view of the problem (not contextualized to your view, but this is the general fault information that you would see if you were on the system itself).

This trap is roughly equivalent to the information you would receive from running the
basic fmdump command on the system with the fault. You could also run the "fmdump -v -u 33570acb-e108-4ca8-8276-c67aeecf2043" command on the trap originator to get a bit more information:

bash-3.00# fmdump -v -u 33570acb-e108-4ca8-8276-c67aeecf2043
TIME UUID SUNW-MSG-ID
Jun 26 10:57:45.3602 33570acb-e108-4ca8-8276-c67aeecf2043 ZFS-8000-CS
100% fault.fs.zfs.pool

Problem in: zfs://pool=data
Affects: zfs://pool=data
FRU: -
Location: -

Navigating Problems on the Remote System

Let's start at the top now. We can quickly navigate all known, unresolved problems on the remote system by walking the sunFmProblemTable:

system2:/root-> /usr/sfw/bin/snmpwalk -v2c -c public -t 20 system1 sunFmProblemTable

This results in a dump of all problems with the UUIDs that can be used for deeper queries. The following printout is only a single problem on the system, additional problems are listed in blocks by the attribute (so all sunFmProblemUUIDs are lumped together followed by all ProblemCodes, and so on).

SUN-FM-MIB::sunFmProblemUUID."0f3dcdf3-f85b-c091-8f1f-ce2164976cda" = STRING: "0f3dcdf3-f85b-c091-8f1f-ce2164976cda"
SUN-FM-MIB::sunFmProblemCode."0f3dcdf3-f85b-c091-8f1f-ce2164976cda" = STRING: ZFS-8000-CS
SUN-FM-MIB::sunFmProblemURL."0f3dcdf3-f85b-c091-8f1f-ce2164976cda" = STRING: http://sun.com/msg/ZFS-8000-CS
SUN-FM-MIB::sunFmProblemDiagEngine."0f3dcdf3-f85b-c091-8f1f-ce2164976cda" = STRING: fmd:///module/fmd
SUN-FM-MIB::sunFmProblemDiagTime."0f3dcdf3-f85b-c091-8f1f-ce2164976cda" = STRING: 2008-6-18,10:7:4.0,-6:0
SUN-FM-MIB::sunFmProblemSuspectCount."0f3dcdf3-f85b-c091-8f1f-ce2164976cda" = Gauge32: 1

This is roughly equivalent to the fmdump command on a system, though with the basic fmdump only the UUID, Code and a MSG-ID are given.

Based on this information, we can look at the ZFS-8000-CS message at sun.com and determine what our next steps should be. It indicates that using the zpool status -x on the system with the fault would be useful. Going to the originating system and running it returns:


bash-3.00# zpool status -x
pool: data
state: UNAVAIL
status: One or more devices could not be opened. There are insufficient
replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
see: http://www.sun.com/msg/ZFS-8000-D3
scrub: none requested
config:

NAME STATE READ WRITE CKSUM
data UNAVAIL 0 0 0 insufficient replicas
mirror UNAVAIL 0 0 0 insufficient replicas
c2t0d0 UNAVAIL 0 0 0 cannot open
c3t1d0 UNAVAIL 0 0 0 cannot open
mirror UNAVAIL 0 0 0 insufficient replicas
c2t2d0 UNAVAIL 0 0 0 cannot open
c3t3d0 UNAVAIL 0 0 0 cannot open
mirror UNAVAIL 0 0 0 insufficient replicas
c2t4d0 UNAVAIL 0 0 0 cannot open
c3t5d0 UNAVAIL 0 0 0 cannot open
mirror UNAVAIL 0 0 0 insufficient replicas
c2t6d0 UNAVAIL 0 0 0 cannot open
c3t7d0 UNAVAIL 0 0 0 cannot open
mirror UNAVAIL 0 0 0 insufficient replicas
c2t8d0 UNAVAIL 0 0 0 cannot open
c3t9d0 UNAVAIL 0 0 0 cannot open
mirror UNAVAIL 0 0 0 insufficient replicas
c2t10d0 UNAVAIL 0 0 0 cannot open
c3t11d0 UNAVAIL 0 0 0 cannot open
mirror UNAVAIL 0 0 0 insufficient replicas
c2t12d0 UNAVAIL 0 0 0 cannot open
c3t13d0 UNAVAIL 0 0 0 cannot open
mirror UNAVAIL 0 0 0 insufficient replicas
c2t14d0 UNAVAIL 0 0 0 cannot open
c3t15d0 UNAVAIL 0 0 0 cannot open
mirror UNAVAIL 0 0 0 insufficient replicas
c2t16d0 UNAVAIL 0 0 0 cannot open
c3t17d0 UNAVAIL 0 0 0 cannot open
mirror UNAVAIL 0 0 0 insufficient replicas
c2t18d0 UNAVAIL 0 0 0 cannot open
c3t19d0 UNAVAIL 0 0 0 cannot open
mirror UNAVAIL 0 0 0 insufficient replicas
c2t20d0 UNAVAIL 0 0 0 cannot open
c3t21d0 UNAVAIL 0 0 0 cannot open
mirror UNAVAIL 0 0 0 insufficient replicas
c2t22d0 UNAVAIL 0 0 0 cannot open
c3t23d0 UNAVAIL 0 0 0 cannot open
bash-3.00#

For some history on this particular problem, we disconnected a JBOD that had the "data" pool built, so none of the devices are available...ouch.

You can look more deeply at the events that resulted in the problem by walking the sunFmFaultEventTable:

system2:/root-> /usr/sfw/bin/snmpwalk -v2c -c public -t 20 system1 sunFmFaultEventTable
SUN-FM-MIB::sunFmFaultEventProblemUUID."0f3dcdf3-f85b-c091-8f1f-ce2164976cda".1 = STRING: "0f3dcdf3-f85b-c091-8f1f-ce2164976cda"
SUN-FM-MIB::sunFmFaultEventProblemUUID."33570acb-e108-4ca8-8276-c67aeecf2043".1 = STRING: "33570acb-e108-4ca8-8276-c67aeecf2043"
SUN-FM-MIB::sunFmFaultEventProblemUUID."3600a05e-acc1-cae2-c185-f50852156777".1 = STRING: "3600a05e-acc1-cae2-c185-f50852156777"
SUN-FM-MIB::sunFmFaultEventProblemUUID."97bfeb63-7b02-c2b6-c51f-c451a9f760c5".1 = STRING: "97bfeb63-7b02-c2b6-c51f-c451a9f760c5"
SUN-FM-MIB::sunFmFaultEventClass."0f3dcdf3-f85b-c091-8f1f-ce2164976cda".1 = STRING: fault.fs.zfs.pool
SUN-FM-MIB::sunFmFaultEventClass."33570acb-e108-4ca8-8276-c67aeecf2043".1 = STRING: fault.fs.zfs.pool
SUN-FM-MIB::sunFmFaultEventClass."3600a05e-acc1-cae2-c185-f50852156777".1 = STRING: fault.fs.zfs.pool
SUN-FM-MIB::sunFmFaultEventClass."97bfeb63-7b02-c2b6-c51f-c451a9f760c5".1 = STRING: fault.fs.zfs.pool
SUN-FM-MIB::sunFmFaultEventCertainty."0f3dcdf3-f85b-c091-8f1f-ce2164976cda".1 = Gauge32: 100
SUN-FM-MIB::sunFmFaultEventCertainty."33570acb-e108-4ca8-8276-c67aeecf2043".1 = Gauge32: 100
SUN-FM-MIB::sunFmFaultEventCertainty."3600a05e-acc1-cae2-c185-f50852156777".1 = Gauge32: 100
SUN-FM-MIB::sunFmFaultEventCertainty."97bfeb63-7b02-c2b6-c51f-c451a9f760c5".1 = Gauge32: 100
SUN-FM-MIB::sunFmFaultEventASRU."0f3dcdf3-f85b-c091-8f1f-ce2164976cda".1 = STRING: zfs://pool=1ca09fa50e7ca8c7
SUN-FM-MIB::sunFmFaultEventASRU."33570acb-e108-4ca8-8276-c67aeecf2043".1 = STRING: zfs://pool=data
SUN-FM-MIB::sunFmFaultEventASRU."3600a05e-acc1-cae2-c185-f50852156777".1 = STRING: zfs://pool=6f658a6c4b99b18b
SUN-FM-MIB::sunFmFaultEventASRU."97bfeb63-7b02-c2b6-c51f-c451a9f760c5".1 = STRING: zfs://pool=data
SUN-FM-MIB::sunFmFaultEventFRU."0f3dcdf3-f85b-c091-8f1f-ce2164976cda".1 = STRING: -
SUN-FM-MIB::sunFmFaultEventFRU."33570acb-e108-4ca8-8276-c67aeecf2043".1 = STRING: -
SUN-FM-MIB::sunFmFaultEventFRU."3600a05e-acc1-cae2-c185-f50852156777".1 = STRING: -
SUN-FM-MIB::sunFmFaultEventFRU."97bfeb63-7b02-c2b6-c51f-c451a9f760c5".1 = STRING: -
SUN-FM-MIB::sunFmFaultEventResource."0f3dcdf3-f85b-c091-8f1f-ce2164976cda".1 = STRING: zfs://pool=1ca09fa50e7ca8c7
SUN-FM-MIB::sunFmFaultEventResource."33570acb-e108-4ca8-8276-c67aeecf2043".1 = STRING: zfs://pool=data
SUN-FM-MIB::sunFmFaultEventResource."3600a05e-acc1-cae2-c185-f50852156777".1 = STRING: zfs://pool=6f658a6c4b99b18b
SUN-FM-MIB::sunFmFaultEventResource."97bfeb63-7b02-c2b6-c51f-c451a9f760c5".1 = STRING: zfs://pool=data

This is roughly equivalent to the fmdump -v command. The fmdump -V command cannot be duplicated over SNMP, though it can be useful to run on the host side. fmdump -V can provide product, chassis and server IDs as well as a more complete list of faults and ereports that resulted in the diagnosis.

You could also view the fault management information by resource rather than by fault or event over SNMP

system2:/root-> /usr/sfw/bin/snmpwalk -v2c -c public -t 20 system1 sunFmResourceTable
SUN-FM-MIB::sunFmResourceFMRI.1 = STRING: zfs://pool=data
SUN-FM-MIB::sunFmResourceStatus.1 = INTEGER: faulted(5)
SUN-FM-MIB::sunFmResourceDiagnosisUUID.1 = STRING: "33570acb-e108-4ca8-8276-c67aeecf2043"

This is similar to the "fmadm faulty" command that can be run from a system. The faulty parameter results in the some additional information and text, though that text can also be retrieved at the event URL that was earlier identified.

While this is the "SPAM" approach to inquiring about a system, you could also walk each problem as they come in.

As was mentioned previously, there isn't a "ton" of information available within the SNMP MIB itself. The result of receiving a particular trap is often to do additional diagnosis on the system with the fault. So some ssh work may be necessary.

Additional things to do with FMA


You can dump the topology of a system!

Go to /usr/lib/fm/fmd and type "fmtopo", you should get a complete topology as it is recognized by the fault management infrastructure.


bash-3.00# ./fmtopo
TIME UUID
Jun 26 13:20:16 ddba8792-0166-6fb4-a81b-db3de9622649

hc://:product-id=Sun-Fire-X4150:chassis-id=09999:server-id=server1/motherboard=0

hc://:product-id=Sun-Fire-X4150:chassis-id=09999:server-id=server1/motherboard=0/chip=0

hc://:product-id=Sun-Fire-X4150:chassis-id=09999:server-id=server1/motherboard=0/chip=0/cpu=0

hc://:product-id=Sun-Fire-X4150:chassis-id=09999:server-id=server1/motherboard=0/chip=0/cpu=1

hc://:product-id=Sun-Fire-X4150:chassis-id=0811QAR189:server-id=server1/motherboard=0/chip=0/cpu=2

hc://:product-id=Sun-Fire-X4150:chassis-id=09999:server-id=server1/motherboard=0/chip=0/cpu=3

hc://:product-id=Sun-Fire-X4150:chassis-id=09999:server-id=server1/motherboard=0/memory-controller=0

hc://:product-id=Sun-Fire-X4150:chassis-id=09999:server-id=server1/motherboard=0/memory-controller=0/dram-channel=0

hc://:product-id=Sun-Fire-X4150:chassis-id=09999:server-id=server1:serial=518545073102016426:part=7777777:revision/motherboard=0/memory-controller=0/dram-channel=0/dimm=0

hc://:product-id=Sun-Fire-X4150:chassis-id=09999:server-id=server1:serial=518545073102016426:part=7777777:revision/motherboard=0/memory-controller=0/dram-channel=0/dimm=0/rank=0

hc://:product-id=Sun-Fire-X4150:chassis-id=09999:server-id=server1:serial=518545073102016426:part=7777777:revision/motherboard=0/memory-controller=0/dram-channel=0/dimm=0/rank=1

hc://:product-id=Sun-Fire-X4150:chassis-id=09999:server-id=server1/motherboard=0/memory-controller=0/dram-channel=1

hc://:product-id=Sun-Fire-X4150:chassis-id=09999:server-id=server1:serial=555555555:part=777777777 :revision/motherboard=0/memory-controller=0/dram-channel=1/dimm=0

... (and on and on)

Artificial Fault Injection


There isn't a lot of documentation I've found for fminject, but it looks like it can be quite useful. It will, at its easiest, allow you to replay a set of ereports and faults that previously occurred in a system. For example, we can replay the error log that resulted in the FMA trap for the ZFS pool problem that the trap was sent out for. In one window on the system that sent the trap, start up the simulator

cd /usr/lib/fm/fmd
./fmsim -i

In another window, replay an error log

cd /usr/lib/fm/fmd
./fminject -v /var/fm/fmd/errlog

You will see the SNMP Trap re-appear in the trap receiver. Note there is no good way to determine that this is a replayed trap in the trap receiver's window that I've determined, so use this with caution.

Summary


While there is enough information to completely diagnose a problem from an SNMP trap, the traps are a great opportunity for integration with higher level management tools. More often than not, additional commands and information are required to be run on the system with the fault (be it logs or output of various commands, especially in the ZFS space), but the messages themselves typically have the next steps for diagnosis.

About

pmonday

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today