Fault Management Top 10: #2

Structured Error Events

When subsystems are left to invent their own error handling infrastructure the natural baseline they tend towards is either that of zero infrastructure (just die horribly on any error) or printf logging/debugging (some odd message that has little or no structure, is not registered in any repository, probably means little to anyone other than the developer, and often contains no useful diagnosis information).  I'll term this unstructured printf messaging for this discussion.
There was a time several years ago when Solaris hardware error handling amounted to little greater than either of those baselines.  "Fortunately" the UltraSPARC-II Ecache ... errmmm ... niggle provided the much-needed catalyst towards the very substantial improvements we see today (and a grand future roadmap).  I'll use this as an example of what I'll term structured printf messaging below.
Finally I will contrast these with structured error events.

Unstructured Printf Messaging

The following is an example message that would be barfed (I find that word really suitable in these descriptions) to /var/adm/messages and console in older/unpatched versions of Solaris prior to the first round of hardware error handling improvements.  The message is the result of a corrected error from memory.
May 8 14:35:30 thishost SUNW,UltraSPARC-II: CPU1 CE Error: AFSR 
0x00000000.00100000 AFAR 0x00000000.8abb5a00 UDBH Syndrome 0x85 MemMod U0904
May 8 14:35:30 thishost SUNW,UltraSPARC-II: ECC Data Bit 63 was corrected
May 8 14:35:30 thishost unix: Softerror: Intermittent ECC Memory Error, U0904
At least there is some diagnostic information, such as telling as which bit was in error (we could grok for patterns).  While you could argue that there is some structure to that message you still have to write/maintain a custom grokker to extract it.  The task, of course, is complicated by other related messages sharing little or no common structure.  Here's an uncorrectable error from memory:
panic[CPU1]/thread=2a1000R7dd40: UE Error: AFSR 0x00000000.80200000 AFAR
0x00000000.089cd740 Id 0 Inst 0 MemMod U0501 U0401
Making things worse, in the corrected memory error case, these things were poured out to the console on every occurence.  The implicit expectation had been that you'd see very few such errors - not allowing for a bad pin, for instance.  So while the errors are correctable and survivable (albeit coming thick and fast) the handlers chose to spam the console making it near useless!

Structured Printf Messaging

Taking the time to plan and coordinate error messaging can pay substantial dividends.  This was one of the first reponses to the UltraSPARC-II Ecache problem - to analyse the requirements for error handling and diagnosis on that platform and, with  understanding of that umbrella-view in place, restructure associated handling and messaging.  Here are correctable and uncorrectable memory errors in this improved format:
[AFT0] Corrected Memory Error on CPU1, errID 0x00000036.629edc25
 AFSR 0x00000000.00100000<CE> AFAR 0x00000000.00347dc0
 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1002fe20
 UDBH Syndrome 0x85 Memory Module 1904
[AFT0] errID 0x00000036.629edc25 Corrected Memory Error on 1904 is Intermittent
[AFT0] errID 0x00000036.629edc25 ECC Data Bit 63 was in error and corrected

WARNING: [AFT1] Uncorrectable Memory Error on CPU1 Instruction access at TL=0, 
errID 0x0000004f.818d9280
 AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.0685c7a0
 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x7815c7a0
 UDBH 0x0203<UE> UDBH.ESYND 0x03 UDBL 0x0000 UDBL.ESYND 0x00
 UDBH Syndrome 0x3 Memory Module 190x
WARNING: [AFT1] errID 0x0000004f.818d9280 Syndrome 0x3 indicates that this may 
not be a memory module problem
Behind the scenes a new shared error/fault information structure was introduced, and all kernel hardware error handlers were changed to accrue relevant information in an instance of one of those structures.  A new and robust (lockless) error dispatching mechanism would carry these structures to logging code which, instead of multiple independent printf (cmn_err) calls passed the error structure to a common function to log the error and passed flags to that function indicating what was relevant for the particular error.
Such messaging (along with many other improvements) significantly enhanced diagnosis and trend analysis.  You still had to grok messages files, but at least the messages were in a structured format that was quite readily extracted.  Of course there was still much to be done.  Programmtic access to the logged data was not readily available, syslog is not the most reliable message transport, syslog logs get aged away, harmless errors (say from an isolated cosmic ray incident) litter messages files and cause unnecessary alarm, etc.

Structured Error Events

Roll on the structured events described in my last blog entry.  When we prepare an error report we tag it with a pre-registered name (event class) and fill it with all relevant (and also pre-registered) event payload data.  That event payload data is presented to us a name-value list (really a name-type-value list).  When it is received by the fault manager is it logged away (independently of any diagnosis modules) in a structured fault log (binary, records are kept in the extended accounting format and the record data is in the form of name-value lists).  Here is a corrected memory event dumped from the error log using fmdump(1M):
TIME                          CLASS
Jun 23 2004 02:59:03.207995640 ereport.cpu.ultraSPARC-IIIplus.ce
nvlist version: 0
        class =
        ena = 0xd67a7253e3002c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = cpu
                cpuid = 0xb
                cpumask = 0x23
                serial = 0xd2528092482
        (end detector)

        afsr = 0x200000187
        afar-status = 0x1
        afar = 0x23c8f72dc0
        pc = 0x10391e4
        tl = 0x0
        tt = 0x63
        privileged = 1
        multiple = 0
        syndrome-status = 0x1
        syndrome = 0x187
        error-type = Persistent
        l2-cache-ways = 0x2
        l2-cache-data = 0xec0106f1a6
0x772dc0 0x0 0xbd000000 0xe8 0xc05b9 0x840400003e9287b6 0xc05b9
0x850400003e92a7b6 0xfeb5 0xc05b9 0x860400003e92c7b6 0xc05b9
0x870400003e92e7b6 0x3dd24 0xec0106f1a6 0x372dc0 0x1 0xbd000000 0xe8
0x0 0x0 0x0 0x4000000001 0x2e 0x0 0x0 0x0 0x0 0x0
        dcache-ways = 0x0
        icache-ways = 0x0
        resource = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = mem
                unum = /N0/SB4/P3/B1/D2 J16501
        (end resource)

In software (such as a diagnosis module) given a name-value list we can access the associated data either through generic event walkers (use the libnvpair(3LIB) API to iterate over list pairs, get the type for a pair, lookup that name-value pair for the given type) or (for example in the case of a diagnosis module that has subscribed to corrected memory error events and knows, from the registry, what payload data and types to expect):

analyze_error(nvlist_t \*nvl, ...)
        uint64_t afar;
        uint16_t synd;
        char \*type;

        if (nvlist_lookup_pairs(nvl, 0,
            "afar", DATA_TYPE_UINT64, &afar,
            "syndrome", DATA_TYPE_UINT16, &synd,
            "error-type", DATA_TYPE_STRING, &type,
            NULL) != 0)
                return (BAD_PAYLOAD);

Bye-bye hacky grokking scripts - we now have self-describing error event information!

Why Do We Care?

I'm hoping this message is getting across.  In the past, with no structured event protocol, standards for error collation etc to do it properly is just too difficult.  Subsystems would just deliver what was adequate during development, leave random printf debugging in place, perform no higher-level correlation of errors, etc.  Now they/we have no excuse - all the infrastructure is present (and easy to use) and it is vastly simpler to deliver high level error and fault analysis for your product (whatever it is - I keep offering examples from hardware errors but all this is generic).  Watch the Sun space for many leaps forward in fault management as a result.


Posted by guest on February 19, 2007 at 06:51 PM EST #

Post a Comment:
  • HTML Syntax: NOT allowed

I work in the Fault Management core group; this blog describes some of the work performed in that group.


« July 2016
Site Pages
Sun Bloggers

No bookmarks in folder