Modeling Panic Events in FMA
By Chris W Beal on Mar 15, 2011
I haven't blogged in ages, in fact not since Sun was taken over by Oracle. However, I've not been idle, far from it; I've just been working on product to get it out to market as soon as possible.
However, with the release of Solaris 11 Express 2010.11 (yes, I've been so busy I haven't even got round to writing this entry for 4 months!), I can tell you about one thing I've been working on with members of the FMA and SMF teams. It's part of a larger effort to more tightly integrate software "troubles" into FMA. This includes modeling SMF state changes in FMA and, my favorite, modeling system panic events in FMA.
I won't go into the details, but in summary: when a system reboots after a panic, savecore is run (even if dumpadm -n is in effect) to check whether a dump is present on the dump device. If there is, it raises an "Information Report" (ireport) for FMA to process. This becomes an FMA MSGID of SUNOS-8000-KL. You should see a message on the console, if you're looking, giving instructions on what to do next. A small amount of data about the crash (panic string, stack, date, etc.) is embedded in the report. Once savecore is run to extract the dump from the dump device, another information report is raised, which FMA ties to the first event, solving the case.
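To picture that two-step flow, here is a toy model in Python. It is only a sketch and not fmd's real diagnosis code: the PanicCase and PanicDiagnosis names are made up for illustration, and matching the two reports on a shared UUID is my assumption about how they could be tied together. The two event class names are the ones that appear in the fmdump output later in this post.

```python
from dataclasses import dataclass, field

# Event classes as they appear in the fmdump output later in this post.
PENDING = "ireport.os.sunos.panic.dump_pending_on_device"
AVAILABLE = "ireport.os.sunos.panic.dump_available"


@dataclass
class PanicCase:
    """Hypothetical record of one panic; not an fmd data structure."""
    uuid: str
    events: list = field(default_factory=list)
    solved: bool = False


class PanicDiagnosis:
    """Toy stand-in for the diagnosis logic, for illustration only."""

    def __init__(self):
        self.cases = {}

    def receive(self, event_class, uuid):
        # First ireport: a dump is sitting on the dump device, so open a case.
        if event_class == PENDING:
            case = self.cases.setdefault(uuid, PanicCase(uuid))
            case.events.append(event_class)
        # Second ireport: savecore extracted the dump. Tie it to the open
        # case (assumed here to be matched by UUID) and mark the case solved.
        elif event_class == AVAILABLE and uuid in self.cases:
            case = self.cases[uuid]
            case.events.append(event_class)
            case.solved = True
        return self.cases.get(uuid)


if __name__ == "__main__":
    diag = PanicDiagnosis()
    uuid = "b2e3080a-5a85-eda0-eabe-e5fa2359f3d0"
    print(diag.receive(PENDING, uuid).solved)
    print(diag.receive(AVAILABLE, uuid).solved)
```

The point of the model is just that the case stays open between the two reports and is only solved once the dump has actually been extracted.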
One of the nice things that can then happen is that the FMA notification capabilities are open to us, so you could set up an SNMP trap or email notification for such a panic. A small thing, but it might help some sysadmins in the middle of the night.
One final thing: that small amount of information in the ireport can be accessed using fmdump with the -V flag, passing the UUID of the fault with -u (the UUID is reported in the messages on the console or by fmadm faulty). For example, this was from a panic I induced by clearing the root vnode pointer.
# fmdump -Vu b2e3080a-5a85-eda0-eabe-e5fa2359f3d0
TIME                           UUID                                 SUNW-MSG-ID
Jan 13 2011 13:39:17.364216000 b2e3080a-5a85-eda0-eabe-e5fa2359f3d0 SUNOS-8000-KL

  TIME                 CLASS                                         ENA
  Jan 13 13:39:17.1064 ireport.os.sunos.panic.dump_available         0x0000000000000000
  Jan 13 13:33:19.7888 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000

nvlist version: 0
    version = 0x0
    class = list.suspect
    uuid = b2e3080a-5a85-eda0-eabe-e5fa2359f3d0
    code = SUNOS-8000-KL
    diag-time = 1294925957 157194
    de = (embedded nvlist)
    nvlist version: 0
        version = 0x0
        scheme = fmd
        authority = (embedded nvlist)
        nvlist version: 0
            version = 0x0
            product-id = CELSIUS-W360
            chassis-id = YK7K081269
            server-id = tetrad
        (end authority)
        mod-name = software-diagnosis
        mod-version = 0.1
    (end de)
    fault-list-sz = 0x1
    fault-list = (array of embedded nvlists)
    (start fault-list)
    nvlist version: 0
        version = 0x0
        class = defect.sunos.kernel.panic
        certainty = 0x64
        asru = (embedded nvlist)
        nvlist version: 0
            version = 0x0
            scheme = sw
            object = (embedded nvlist)
            nvlist version: 0
                path = /var/crash/<host>/.b2e3080a-5a85-eda0-eabe-e5fa2359f3d0
            (end object)
        (end asru)
        resource = (embedded nvlist)
        nvlist version: 0
            version = 0x0
            scheme = sw
            object = (embedded nvlist)
            nvlist version: 0
                path = /var/crash/<host>/.b2e3080a-5a85-eda0-eabe-e5fa2359f3d0
            (end object)
        (end resource)
        savecore-succcess = 1
        dump-dir = /var/crash/tetrad
        dump-files = vmdump.1
        os-instance-uuid = b2e3080a-5a85-eda0-eabe-e5fa2359f3d0
        panicstr = BAD TRAP: type=e (#pf Page fault) rp=ffffff0015a865b0 addr=ffffff0200000000
        panicstack = unix:die+10f () | unix:trap+1799 () | unix:cmntrap+e6 () | unix:mutex_enter+b () | genunix:lookupnameatcred+97 () | genunix:lookupname+5c () | elfexec:elf32exec+a5c () | genunix:gexec+6d7 () | genunix:exec_common+4e8 () | genunix:exece+1f () | unix:brand_sys_syscall+1f5 () |
        crashtime = 1294925154
        panic-time = January 13, 2011 01:25:54 PM GMT GMT
    (end fault-list)
    fault-status = 0x1
    severity = Major
    __ttl = 0x1
    __tod = 0x4d2f0085 0x15b57ec0
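If you want to pull the panic details out of that output for your own tooling, something like the following Python sketch could be a starting point. It is not part of FMA: the extract_panic_fields name is made up, and it simply assumes that panicstr, panicstack and crashtime appear in that order, as they do in the output above. It collapses whitespace first, so it should not matter how the text was wrapped when you captured it.

```python
import re
from datetime import datetime, timezone

# Panic fields copied from the fmdump -V output shown above.
SAMPLE = """
panicstr = BAD TRAP: type=e (#pf Page fault) rp=ffffff0015a865b0 addr=ffffff0200000000
panicstack = unix:die+10f () | unix:trap+1799 () | unix:cmntrap+e6 () | unix:mutex_enter+b () | genunix:lookupnameatcred+97 () | genunix:lookupname+5c () | elfexec:elf32exec+a5c () | genunix:gexec+6d7 () | genunix:exec_common+4e8 () | genunix:exece+1f () | unix:brand_sys_syscall+1f5 () |
crashtime = 1294925154
"""

# Assumes the three fields appear in this order, as in the sample.
PANIC_RE = re.compile(r"panicstr = (.*?) panicstack = (.*?) crashtime = (\d+)")


def extract_panic_fields(text):
    """Return panic string, stack frames and crash time, or None if absent."""
    # Collapse all runs of whitespace so line wrapping does not matter.
    flat = " ".join(text.split())
    match = PANIC_RE.search(flat)
    if match is None:
        return None
    panicstr, stack, crashtime = match.groups()
    # Frames are separated by '|'; drop the trailing "()" from each one.
    frames = [
        re.sub(r"\s*\(\)$", "", frame.strip())
        for frame in stack.split("|")
        if frame.strip()
    ]
    return {
        "panicstr": panicstr,
        "frames": frames,
        # crashtime is seconds since the epoch.
        "crashtime": datetime.fromtimestamp(int(crashtime), tz=timezone.utc),
    }


if __name__ == "__main__":
    info = extract_panic_fields(SAMPLE)
    print(info["panicstr"])
    print(info["crashtime"].isoformat())
    for frame in info["frames"]:
        print("  " + frame)
```

For the sample above, the decoded crashtime comes out as 13:25:54 UTC on January 13, 2011, which agrees with the panic-time field in the same report.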
Anyway, I hope you find this feature useful. I'm hoping to use the data embedded in the event for data mining and problem resolution. If you have any ideas for other information that could realistically be added to the ireport, then please let me know. Bear in mind, though, that this information is written while the system is panicking, so what can be reliably gathered is somewhat limited.