Saturday Nov 15, 2008

Practical debugging with fmstat

fmstat(1M) presents a variety of statistical information about FMD and any active modules registered with the daemon. I've used fmstat a bunch of times, primarily using the -m option to look at the detailed statistics of an individual module when debugging code. The statistics are meaningful for folks familiar with the source, but I'd never given much thought to how fmstat may be useful for a system administrator. Until last Thursday.

I did a "guru" talk at LISA '08 on FMA, which was (a lot of fun and) an informal session to talk about FMA, educate those less familiar with it, or dive into more detailed areas for those working with FMA on their systems. During the session, we ended up doing a little live debugging of FMA on a system. And fmstat played a central role. As I was talking and not taking notes (nor did I get to capture the actual screen output), the material here is summarized and displayed output contrived, but I think you'll get the picture.

The scenario was this: a system had a component reported faulty by FMA, the component was replaced, and ever since then FMD had been taking ~50% of the system time. fmadm faulty reported no faults.

The first thing we did was run fmstat. Without options, fmstat reports some basic statistics on each active module, including the number of events received (ev_recv) and the time spent servicing those events (svc_t):

[output fudged for illustration]
# fmstat
module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire            0       0  0.0    0.0   0   0     0     0      0      0
disk-transport           0       0  0.0  317.7   0   0     0     0    32b      0
eft                    173       0  0.0 9384.4   0   0     0     0   4.2M   843b
fabric-xlate             0       0  0.0    0.3   0   0     0     0      0      0
fmd-self-diagnosis      18       0  0.0    0.1   0   0     0     0      0      0
io-retire                0       0  0.0    0.1   0   0     0     0      0      0
snmp-trapgen             0       0  0.0    0.0   0   0     0     0    32b      0
sysevent-transport       0       0  0.0  337.9   0   0     0     0      0      0
syslog-msgs              0       0  0.0    0.0   0   0     0     0      0      0
zfs-diagnosis            0       0  0.0    0.3   0   0     0     0      0      0
zfs-retire               0       0  0.0    0.1   0   0     0     0      0      0
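
With more modules loaded, the busy one may not jump out of the table quite so readily. As a rough sketch (not part of fmstat or FMA), a saved copy of the output could be ranked by svc_t with a few lines of Python. The function names parse_fmstat and busiest are made up for illustration, and the sample text is the same fudged output shown above.

```python
# Sketch: rank fmstat modules by service time (svc_t).
# parse_fmstat() and busiest() are hypothetical helpers, not FMA tools.
# SAMPLE is the fudged output from this post, not real system data.

SAMPLE = """\
module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz
cpumem-retire 0 0 0.0 0.0 0 0 0 0 0 0
disk-transport 0 0 0.0 317.7 0 0 0 0 32b 0
eft 173 0 0.0 9384.4 0 0 0 0 4.2M 843b
fabric-xlate 0 0 0.0 0.3 0 0 0 0 0 0
fmd-self-diagnosis 18 0 0.0 0.1 0 0 0 0 0 0
io-retire 0 0 0.0 0.1 0 0 0 0 0 0
snmp-trapgen 0 0 0.0 0.0 0 0 0 0 32b 0
sysevent-transport 0 0 0.0 337.9 0 0 0 0 0 0
syslog-msgs 0 0 0.0 0.0 0 0 0 0 0 0
zfs-diagnosis 0 0 0.0 0.3 0 0 0 0 0 0
zfs-retire 0 0 0.0 0.1 0 0 0 0 0 0
"""


def parse_fmstat(text):
    """Turn whitespace-separated fmstat output into a list of dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) != len(header):
            continue  # skip anything that does not match the header width
        row = dict(zip(header, fields))
        row["ev_recv"] = int(row["ev_recv"])
        row["svc_t"] = float(row["svc_t"])
        rows.append(row)
    return rows


def busiest(rows, n=3):
    """Return the n modules with the largest svc_t, highest first."""
    return sorted(rows, key=lambda r: r["svc_t"], reverse=True)[:n]


if __name__ == "__main__":
    for row in busiest(parse_fmstat(SAMPLE)):
        print(row["module"], row["ev_recv"], row["svc_t"])
```

Sorting on svc_t alone is a simplification; ev_recv is kept in each row so the two can be read side by side, as we did by eye in the session.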

It was immediately apparent that 'eft', the Eversholt Fault Tree diagnosis engine, was very busy. To immediately relieve the pressure FMD was putting on the system, we unloaded 'eft':

# fmadm unload eft
fmadm: module 'eft' unloaded from fault manager

System time for FMD went down to 0.1%. Great. We've addressed the system time crunch. But why was 'eft' busy? fmstat showed us that the module was receiving a large number of events. Using fmdump -e, we saw a large number of ereports coming in:

[output fudged for illustration]
# fmdump -e
Nov 13 16:22:08.0265
Nov 13 16:23:08.0265
Nov 13 16:24:08.0265
...
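
To put a number on how regularly events like these arrive, the gaps between timestamps can be computed. The following is only a sketch under stated assumptions: parse_times and gaps_seconds are hypothetical helper names, the sample lines are the fudged ones above, and because those lines carry no year, one is supplied by hand (ASSUMED_YEAR). Month-name parsing with %b also assumes an English/C locale.

```python
# Sketch: measure the interval between ereport timestamps.
# parse_times() and gaps_seconds() are hypothetical helpers, not FMA tools.
from datetime import datetime

ASSUMED_YEAR = 2008  # assumption: the sample lines themselves carry no year

SAMPLE = """\
Nov 13 16:22:08.0265
Nov 13 16:23:08.0265
Nov 13 16:24:08.0265
...
"""


def parse_times(text, year=ASSUMED_YEAR):
    """Parse the leading 'Mon DD HH:MM:SS.ffff' fields of each line."""
    times = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # blank line, '...', or anything too short
        stamp = f"{year} {fields[0]} {fields[1]} {fields[2]}"
        try:
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S.%f"))
        except ValueError:
            continue  # header or other non-event line
    return times


def gaps_seconds(times):
    """Seconds elapsed between each consecutive pair of timestamps."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    print(gaps_seconds(parse_times(SAMPLE)))
```

Any extra fields after the timestamp, such as an ereport class, are ignored by this sketch, since only the first three fields of each line are read.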

Periodically an error event was being detected and sent into FMA, but either not at a rate high enough to trigger a fault, or with an error type that is discarded. Stupidly, I didn't write down the actual ereport class, which matters because this could represent a bug in our current diagnosis engines. If you're reading this, happened to be in the session at LISA, and remember the ereport class, please post it in a comment to the blog.

And thanks to everyone that attended the session. I was pleasantly surprised by the turnout.


Friday Jul 18, 2008

FMA "Guru" Session for LISA '08

It was confirmed yesterday that the proposed FMA "Guru Is In" is a go for the LISA '08 event in San Diego. As it turns out, the "Hit The Ground Running" sessions are not being run - the LISA coordinators wanted to focus on making a good "guru" track.

I'll be at the conference on November 13th, and on hand for your FMA questions and gripes around 5pm PT or so. While I hope for more of the former and less of the latter, I am interested to hear ways we can improve FMA. And you don't have to wait until November to speak up.

Mention this blog and you'll get a hearty handshake. :)


Wednesday Jun 04, 2008

Proposed FMA Topics for LISA '08

Last week I submitted two FMA topics for LISA '08, being held in San Diego this year. One is for a "guru" session - I'm a little sketchy on the details, but envision this being a booth where FMA can be demonstrated and folks can come with their questions (or gripes :) about FMA. The other is a "hit the ground running" session - a short and to-the-point presentation designed as a primer for FMA.

Hopefully one or both will be accepted. The submissions are below.


The "Guru Is In" Session

With OpenSolaris and Solaris 10, the reporting and managing of faulty components in the system significantly changed. Half of the Predictive Self-Healing (PSH) feature of Solaris is Fault Management - a framework for diagnosis engines and response agents to handle and report system faults, as well as a set of command line utilities to examine and manage system faults. Our "guru" proposal is to have a short demo of Solaris' Fault Management using the freely available FMA Demo Kit and answer questions about how administrators and users work with Fault Management. Ref:

The "Hit The Ground Running" Session

Hello,

I am proposing a Hit the Ground Running (HTGR) session to introduce and cover the essentials of Solaris' Fault Management subsystem. The abstract and outline for the material are below. I'd expect the material to take ~15 minutes to present.

Thank you for your consideration,

Scott Davenport
Sun Microsystems
Fault Management Development Team

Self-healing functionality for users and administrators of a modern operating system provides fine-grained fault isolation and restart where possible of any component (hardware or software) that experiences a problem. To do so, the system must include intelligent, automated, proactive diagnoses of errors that are observed on the system. The diagnosis system is used to trigger targeted automated responses or guided human intervention that mitigate a specific problem or at least prevent it from getting worse. Finally, these new system capabilities are connected to a new model for system administrators oriented around simpler, higher-level abstractions.

Sun's first Predictive Self-Healing features are part of Solaris 10 and OpenSolaris and include the Fault Manager and the Service Manager. This HTGR session focuses on the Fault Manager portion of Solaris' Predictive Self-Healing. The Solaris Fault Management effort (originally code-named FMA inside of Sun) provides a new architecture for building resilient error handlers, error telemetry, automated diagnosis software, response agents, and a consistent model of system failures for a management stack.

Outline of material:

- Brief History Lesson
  . legacy UNIX fault model vs. FMA
- What is FMA?
  . FM Daemon, Diagnosis Engines, Response Agents (block diagram)
- Responding to Faults
  . Expect console messages, knowledge article
- Displaying Faults
  . fmadm faulty
- Repairing Faults
  . fmadm repair
- Managing FMD Logs & Modules
  . fmadm config; fmadm rotate; fmadm load/unload
- For More Information
  .



