Thursday Feb 09, 2012

Email notification of FMA events

One of the projects I worked on for Solaris 11 was to record some information on system panics in FMA events. Now I want to make it easier to gather this information and map it to known problems. So, starting internally, I plan to utilise another feature which we developed as part of the same effort: the email notifications framework. Rob Johnston described this feature in his blog here.

The nice feature I want to utilise is custom message templates, so I thought I'd share how to do this. It's pretty simple, but I got burnt by a couple of slight oddities, which we can probably fix.

First off I needed to create a template. There are a number of committed expansion tokens, which expand information from the FMA event into meaningful text in the email. The ones I care about this time are:

%<HOSTNAME> : Hostname of the system which had the event
%<UUID> : UUID of the event - so you can mine more information
%<URL> : URL of the knowledge doc describing the problem

In addition I want to get some data that is panic specific. As yet these are uncommitted interfaces and shouldn't be relied upon, but for my reference they can be accessed like this:

Panic String of the dump is %<fault-list[0].panicstr>
Stack trace to put in to MOS is %<fault-list[0].panicstack>

These names are visible in the panic event, so I don't feel bad about revealing them, but I stress they shouldn't be relied upon.

So create a template which contains the text you want. Make sure it's readable by the noaccess user (i.e. don't put it under /root).

The one I created for now looks like this

# cat /usr/lib/fm/notify/panic_template
%<HOSTNAME> Panicked

For more information log in to %<HOSTNAME> and run the command

fmdump -Vu %<UUID>

Please look at %<URL> for more information

Crash dump is available on %<HOSTNAME> in %<fault-list[0].dump-dir>

Panic String of the dump is %<fault-list[0].panicstr>

Stack trace to put in to MOS is  %<fault-list[0].panicstack>
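
To make the token syntax a bit more concrete, here is a rough Python sketch of how %<...> tokens could be matched against an event. This is purely illustrative and is not how the notification daemon is implemented: the function names (lookup, expand), the dict-based event and the "builtins" mapping for HOSTNAME, UUID and URL are all assumptions made for the example.

```python
import re

# Matches a token such as %<HOSTNAME> or %<fault-list[0].panicstr>
TOKEN = re.compile(r"%<([^>]+)>")
# Splits a path into member names and [N] array indexes
STEP = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def lookup(event, path):
    """Walk a path like 'fault-list[0].panicstr' through nested dicts/lists."""
    node = event
    for name, index in STEP.findall(path):
        node = node[name] if name else node[int(index)]
    return node


def expand(template, event, builtins):
    """Replace each %<...> token; leave unknown tokens untouched."""
    def replace(match):
        token = match.group(1)
        if token in builtins:          # hypothetical table for HOSTNAME, UUID, URL
            return str(builtins[token])
        try:
            return str(lookup(event, token))
        except (KeyError, IndexError, TypeError):
            return match.group(0)      # unknown member: keep the raw token
    return TOKEN.sub(replace, template)


if __name__ == "__main__":
    event = {"fault-list": [{"panicstr": "BAD TRAP", "dump-dir": "/var/crash/tetrad"}]}
    builtins = {"HOSTNAME": "tetrad"}
    print(expand("Crash dump is available on %<HOSTNAME> in %<fault-list[0].dump-dir>",
                 event, builtins))
```

Leaving unknown tokens in place is a choice made for this sketch only; it makes a mistyped member name easy to spot in the resulting text.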

I then need to add this to the notification for the "problem-diagnosed" event class. This is done with the svccfg command


# svccfg setnotify problem-diagnosed \"mailto:someone@somehost?msg_template=/usr/lib/fm/notify/panic_template\"

(Note the backslashes and quotes - they're important to get the parser to recognise the "=" correctly.)

It would be nice to tie it specifically to a panic event, but that needs a bit of plumbing to make it happen.

You can verify it is configured correctly with the command


# svccfg listnotify problem-diagnosed
    Event: problem-diagnosed (source: svc:/system/fm/notify-params:default)
        Notification Type: smtp
            Active: true
            reply-to: root@localhost
            msg_template: /usr/lib/fm/notify/panic_template
            to: someone@somehost
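
If you would rather check this from a script than by eye, something like the following Python sketch could do it. It assumes the listnotify output keeps the "key: value" layout shown above, which as far as I know is not a committed format, and the function name parse_listnotify is made up for the example.

```python
def parse_listnotify(text):
    """Turn 'key: value' lines from svccfg listnotify output into a dict."""
    settings = {}
    for line in text.splitlines():
        # Split on the first ': ' only, so values containing colons survive
        key, sep, value = line.strip().partition(": ")
        if sep:
            settings[key] = value
    return settings


if __name__ == "__main__":
    sample = """\
    Event: problem-diagnosed (source: svc:/system/fm/notify-params:default)
        Notification Type: smtp
            Active: true
            reply-to: root@localhost
            msg_template: /usr/lib/fm/notify/panic_template
            to: someone@somehost
"""
    settings = parse_listnotify(sample)
    print(settings["msg_template"], settings["to"])
```

In practice the sample text would come from capturing the output of the svccfg command above.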

Now when I get a panic, I get an email with some useful information I can use to start diagnosing the problem.

So what next? I think I'll try to firm up the stability of the useful members of the event, and maybe create a new event we can subscribe to for panics only, then make this template an "extended support" option for panic events, and make it easily configurable.

Please do leave comments if you have any opinions on this and where to take it next.


Tuesday Mar 15, 2011

Modeling Panic event in FMA

I haven't blogged in ages, in fact since Sun was taken over by Oracle. However I've not been idle, far from it, just working on product to get it out to market as soon as possible.

However, with the release of Solaris 11 Express 2010.11 (yes, I've been so busy I haven't even got round to writing this entry for 4 months!), I can tell you about one thing I've been working on with members of the FMA and SMF teams. It's part of a larger effort to more tightly integrate software "troubles" into FMA. This includes modeling SMF state changes in FMA and, my favorite, modeling system panic events in FMA.

I won't go into the details, but in summary, when a system reboots after a panic, savecore is run (even if dumpadm -n is in effect) to check if a dump is present on the dump device. If there is, it raises an "Information Report" (ireport) for FMA to process. This becomes an FMA MSGID of SUNOS-8000-KL. You should see a message on the console if you're looking, giving instructions on what to do next. There is a small amount of data about the crash (panic string, stack, date etc.) embedded in the report. Once savecore is run to extract the dump from the dump device, another information report is raised, which FMA ties to the first event, and the case is solved.

One of the nice things that can then happen is that the FMA notification capabilities are open to us, so you could set up an SNMP trap or email notification for such a panic. A small thing, but it might help some sysadmins in the middle of the night.

One final thing. That small amount of information in the ireport can be accessed using fmdump with the -V flag for the UUID of the fault (as reported in the messages on the console or by fmadm faulty). For example, this was from a panic I induced by clearing the root vnode pointer.

# fmdump -Vu b2e3080a-5a85-eda0-eabe-e5fa2359f3d0 
TIME                           UUID                                 SUNW-MSG-ID
Jan 13 2011 13:39:17.364216000 b2e3080a-5a85-eda0-eabe-e5fa2359f3d0 SUNOS-8000-KL

  TIME                 CLASS                                 ENA
  Jan 13 13:39:17.1064 ireport.os.sunos.panic.dump_available 0x0000000000000000
  Jan 13 13:33:19.7888 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000

nvlist version: 0
        version = 0x0
        class = list.suspect
        uuid = b2e3080a-5a85-eda0-eabe-e5fa2359f3d0
        code = SUNOS-8000-KL
        diag-time = 1294925957 157194
        de = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = fmd
                authority = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        product-id = CELSIUS-W360
                        chassis-id = YK7K081269
                        server-id = tetrad
                (end authority)

                mod-name = software-diagnosis
                mod-version = 0.1
        (end de)

        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = defect.sunos.kernel.panic
                certainty = 0x64
                asru = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        scheme = sw
                        object = (embedded nvlist)
                        nvlist version: 0
                                path = /var/crash/<host>/.b2e3080a-5a85-eda0-eabe-e5fa2359f3d0
                        (end object)

                (end asru)

                resource = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        scheme = sw
                        object = (embedded nvlist)
                        nvlist version: 0
                                path = /var/crash/<host>/.b2e3080a-5a85-eda0-eabe-e5fa2359f3d0
                        (end object)

                (end resource)

                savecore-succcess = 1
                dump-dir = /var/crash/tetrad
                dump-files = vmdump.1
                os-instance-uuid = b2e3080a-5a85-eda0-eabe-e5fa2359f3d0
                panicstr = BAD TRAP: type=e (#pf Page fault) rp=ffffff0015a865b0 addr=ffffff0200000000
                panicstack = unix:die+10f () | unix:trap+1799 () | unix:cmntrap+e6 () | unix:mutex_enter+b () | genunix:lookupnameatcred+97 () | genunix:lookupname+5c () | elfexec:elf32exec+a5c () | genunix:gexec+6d7 () | genunix:exec_common+4e8 () | genunix:exece+1f () | unix:brand_sys_syscall+1f5 () | 
                crashtime = 1294925154
                panic-time = January 13, 2011 01:25:54 PM GMT GMT
        (end fault-list[0])

        fault-status = 0x1
        severity = Major
        __ttl = 0x1
        __tod = 0x4d2f0085 0x15b57ec0
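
As a rough idea of how these members might be picked apart for data mining, here is a small Python sketch. It assumes panicstack stays a list of "module:function+offset ()" frames separated by "|" and that crashtime is seconds since the epoch, as in the output above; neither is a committed interface, and the function names are invented for the example.

```python
from datetime import datetime, timezone


def parse_panicstack(panicstack):
    """Split a panicstack string into (module, function, hex offset) tuples."""
    frames = []
    for part in panicstack.split("|"):
        part = part.strip()
        if not part:                    # the string ends with a trailing "|"
            continue
        symbol = part.split()[0]        # drop the trailing "()"
        module, _, rest = symbol.partition(":")
        function, _, offset = rest.partition("+")
        frames.append((module, function, offset))
    return frames


def crash_time_utc(crashtime):
    """Convert the crashtime member (epoch seconds) to an aware UTC datetime."""
    return datetime.fromtimestamp(int(crashtime), tz=timezone.utc)


if __name__ == "__main__":
    stack = "unix:die+10f () | unix:trap+1799 () | unix:cmntrap+e6 () | "
    for module, function, offset in parse_panicstack(stack):
        print(module, function, offset)
    print(crash_time_utc(1294925154).isoformat())
```

Keeping the module and function separate from the offset should make it easier to group panics whose stacks differ only in offsets between builds.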

Anyway, I hope you find this feature useful. I'm hoping to use the data embedded in the event for data mining and problem resolution. If you have any ideas for other information that could realistically be added to the ireport, then please let me know. You do have to bear in mind that this information is written while the system is panicking, so what can be reliably gathered is somewhat limited.


Chris W Beal

