Practical debugging with fmstat

fmstat(1M) presents a variety of statistical information about FMD and any active modules registered with the daemon. I've used fmstat a bunch of times, primarily using the -m option to look at the detailed statistics of an individual module when debugging code. The statistics are meaningful for folks familiar with the source, but I'd never given much thought to how fmstat may be useful for a system administrator. Until last Thursday.

I did a "guru" talk at LISA '08 on FMA, which was (a lot of fun and) an informal session to talk about FMA, educate those less familiar with it, or dive into more detailed areas for those working with FMA on their systems. During the session, we ended up doing a little live debugging of FMA on a system. And fmstat played a central role. As I was talking and not taking notes (nor did I get to capture the actual screen output), the material here is summarized and displayed output contrived, but I think you'll get the picture.

The scenario was this: system had a component reported faulty by FMA, the component was replaced, and ever since FMD was taking ~50% of the system time. fmadm faulty reported no faults.

The first thing we did was run fmstat. Without options, fmstat reports some basic statistics on each active module, including the number of events received (ev_recv) and the time spent servicing those events (svc_t):

[output fudged for illustration] # fmstat module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz cpumem-retire 0 0 0.0 0.0 0 0 0 0 0 0 disk-transport 0 0 0.0 317.7 0 0 0 0 32b 0 eft 173 0 0.0 9384.4 0 0 0 0 4.2M 843b fabric-xlate 0 0 0.0 0.3 0 0 0 0 0 0 fmd-self-diagnosis 18 0 0.0 0.1 0 0 0 0 0 0 io-retire 0 0 0.0 0.1 0 0 0 0 0 0 snmp-trapgen 0 0 0.0 0.0 0 0 0 0 32b 0 sysevent-transport 0 0 0.0 337.9 0 0 0 0 0 0 syslog-msgs 0 0 0.0 0.0 0 0 0 0 0 0 zfs-diagnosis 0 0 0.0 0.3 0 0 0 0 0 0 zfs-retire 0 0 0.0 0.1 0 0 0 0 0 0

It was immediately apparent that 'eft', the Eversholt Fault Tree diagnosis engine, was very busy. To immediately relieve the pressure FMD was putting on the system, we unloaded 'eft':

# fmadm unload eft fmadm: module 'eft' unloaded from fault manager

System time for FMD went down to 0.1%. Great. We've addressed the system time crunch. But why was 'eft' busy? fmstat showed us that the module was receiving a large number of events. Using fmdump -e, we saw a large number of ereports coming in:

[output fudged for illustration] # fmdump -e Nov 13 16:22:08.0265 ereport.cpu.intel.XXX Nov 13 16:23:08.0265 ereport.cpu.intel.XXX Nov 13 16:24:08.0265 ereport.cpu.intel.XXX ...

Periodically an error event was being detected and sent into FMA. But either not at a rate high enough to trigger a fault, or an error type that is discarded. Stupidly I didn't write down the actual ereport class - as this could represent a bug in our current diagnosis engines. If you're reading this, happened to be in the session at LISA, and remember the ereport class, please post it in a comment to the blog.

And thanks to everyone that attended the session. I was pleasantly surprised by the turnout.

:wq

Comments:

Whilst I wasn't in your class I am currently experiencing the same issue on a T2.

fdump -e reports the following:

Nov 23 17:55:55.4158 ereport.cpu.ultraSPARC-T2.dsc
Nov 23 17:55:55.4240 ereport.cpu.ultraSPARC-T2.dsc
Nov 23 18:02:48.7793 ereport.cpu.ultraSPARC-T2.dsc
Nov 23 18:02:48.7818 ereport.cpu.ultraSPARC-T2.dsc
Nov 23 18:02:48.7901 ereport.cpu.ultraSPARC-T2.dsc
Nov 23 18:02:48.7993 ereport.cpu.ultraSPARC-T2.dsc

What can I do next to find out what is causing this problem ?

Posted by Jeroen on November 23, 2008 at 04:15 PM PST #

Hi Scott,
I am working on FMA, I really meet some problem.
We are trying to send trap from FM to snmptrapd via SMA, we have followed all the directive and documentation to configure the SMA's snmpd.conf, but we could not see trap, here is the technic detail:

Platform: SunOS 5.10 Generic_127127-11 sun4v sparc SUNW,Netra-T5220

- FMD side configuration:
add two lines into /usr/lib/fm//fmd/plugins/snmp-trapgen.conf
setprop trap_all true
subscribe ereport.cpu.usii.ue

- SMA side configuration:
add the trap destination of the snmptrapd port of the localhost
dlmod sunFM /usr/lib/fm/sparcv9/libfmd_snmp.so.1

- Simulation of the trap send
create one dummy file to send one event via fminject:
# cat /tmp/event.log
evdef ereport.cpu.usii.ue {
enum type { persistent, sticky, intermittent };
int8_t int8[4];
int16_t int16;
string str;
string strarr[];
};
event ereport.cpu.usii.ue ue_1 = {
persistent,
[ 1, 2, 3, 4],
0xfff,
"hello",
[ "hello", "there", "a", "b" ]
};
ue_1;

- Use command fminject to send event to fmd
fminject /tmp/event.log

- Check the event to see if it received by fmd
# fmstat
snmp-trapgen 2 0 0.0 0.2 0 0 0 0 32b 0
it indicate snmp-trapgen has received the events.

But, the SMA side could not get the trap and send it to snmptrapd.

Could you please show me how to send trap from FM to any trapdaemon?

Posted by Jumin on November 25, 2008 at 11:32 AM PST #

My email: jumin@_@nortel.com

replace @_@ with @ please.

Posted by Jumin on November 25, 2008 at 11:33 AM PST #

Jeroen,

The T2 has a hardware scrubber that periodically sweeps through memory and scrubs errors. A DSC is a correctable error on a scrub operation. These are typically benign, but if persisting could result in page retires, which aren't messaged (save extreme cases where there have been excessive page retires in a given DIMM). There's more detail on the ereports with 'fmdump -eV', although any bit-twiddling we'd do is already being done by the diagnosis engine.

Beyond the ereports being logged, are you noticing other impacts on the system?

Posted by Scott Davenport on December 02, 2008 at 03:26 AM PST #

Jumin,

I've personally not experimented with SNMP, but these folks have:

http://blogs.sun.com/pmonday/entry/fun_with_the_fma_and
http://blogs.sun.com/wesolows/entry/a_louder_voice_for_the

Posted by Scott Davenport on December 02, 2008 at 03:28 AM PST #

Post a Comment:
Comments are closed for this entry.
About

user9148476

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today