Fault Management Top 10: #1

Solaris Fault Management Top 10: New Architecture Adopted

Time to get my promised fault management top 10 off the ground.
Sun has adopted an overarching Fault Management Architecture (FMA) to extend across its product line (of course it will take some time before it is realized in all products!).  Before we dive in let me say that I am not the architect of this architecture - I'm just a keen fan and beneficiary.
I'll describe some basic elements of the architecture below, but let's begin with a review of the architecture it replaces:

Outgoing Fault Management Architecture

(This space intentionally left blank)

OK, it wasn't quite as bad as that, but it is true to say that we had not previously enjoyed any formal or consistent fault management model or architecture, and that what functionality there was was rather ad-hoc and (in Solaris, and generally across the product line) focussed on error handling rather than on fault diagnosis:
  • There was no centralised view of errors - each error handler would do the necessary handling as prescribed (e.g., rewrite a memory address in response to a correctable error from memory), but attempts at higher-level monitoring of trends within these error handlers (e.g., are all errors from the same DIMM, all suffering the same bit in error, all reported by the same cpu and never any other cpu, etc.) were somewhat limited and clumsy, since there was no framework for collation and communication of error reports and trends.
  • Error handlers are part of the kernel, and that is not the best or easiest place to perform high-level trend analysis and diagnosis.  But the only communication up to userland was of unstructured messages to syslog, which hardly facilitates analysis in userland.  With diagnosis difficult in the kernel, error handlers concentrated on error handling (surprise!) and not on analysing error trends to recognise faults - or, where they did, they were necessarily somewhat limited.
  • Error messages were sent to syslog and barfed to the console.  If you suffered a zillion errors but nonetheless survived, your console might be spammed with a zillion messages (of a few lines each).
  • Don't think just of hardware errors.  Consider the example of a driver encountering some odd condition, again and again.  Typically it would use cmn_err(9F) to barf some warning message to the console, repeatedly.  Chances are that it did not count those events (e.g., wouldn't know if a given event were the first or the 100th), did not kick off any additional diagnostics to look any deeper, would take no corrective action, etc.  While some drivers did a better job, they were obliged to invent the infrastructure for themselves, and there'd be no re-use in other drivers (a minimal sketch of this old-style approach follows this list).
  • The list goes on.  There's some overlap with my previous post here: things were in this state because error handling and fault diagnosis were an afterthought in unix, subscribing to the "errors shouldn't happen - why design for them?" doctrine.
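To make that driver point concrete, here is a minimal sketch of the old-style approach: a hypothetical mydrv driver that simply barfs a warning via cmn_err(9F) every time it hits an unexpected condition - no counting, no deeper diagnostics, no corrective action.  The driver and helper names are invented for illustration.

#include <sys/ddi.h>
#include <sys/sunddi.h>
#include <sys/cmn_err.h>

/*
 * Hypothetical helper invoked whenever the (equally hypothetical) mydrv
 * driver sees an unexpected condition.  All it does is emit a warning to
 * the console/syslog via cmn_err(9F), every single time; there is no
 * counting, no trend analysis and no corrective action.
 */
static void
mydrv_odd_condition(dev_info_t *dip, int code)
{
	cmn_err(CE_WARN, "%s%d: unexpected condition 0x%x",
	    ddi_driver_name(dip), ddi_get_instance(dip), code);
}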
Those points are very much Solaris (and Solaris kernel) centric, but they're representative of the overall picture as it was.

New Fault Management Architecture

Let's start with a simple picture to establish some terminology:

  • An error is some unexpected/anomalous/interesting condition/result/signal/datum; it may be observed/detected or go undetected.  An example might be a correctable memory error from ECC-protected memory.
  • A fault is a defect that may produce an error.  Not all errors are necessarily the result of a defect.  For example a cosmic ray may induce an isolated few correctable memory errors, but there is no fault present;  on the other hand a DIMM that has suffered electrostatic damage in mishandling may produce a stream of correctable errors - there is a fault present.
  • Error Handlers respond to the detection or observation of an error.  Examples are:
    • a kernel cpu error trap handler - a hardware exception that is raised on the cpu in response to some hardware detection of an error (say an uncorrectable error from a memory DIMM),
    • a SCSI driver detects an unexpected bus reset
    • an application performs an illegal access and gets a SIGSEGV and has installed a signal handler to handle it
    • svc.startd(1M) decides that a service is restarting too frequently
    The error handler takes any required immediate action to properly handle the error, and then (if it wishes to produce events into the fault manager) collects what error data is available and prepares an error report for transmission to the Fault Manager.
  • The Error Report (data gathered in response to the observation of an error) is encoded according to a newly defined FMA event protocol to make an Error Event, which is then transported towards the Fault Manager.
  • The Fault Manager is a software component that is responsible for fault diagnosis through pluggable diagnosis engines.  It is also responsible for logging the error events it receives and for maintaining overall fault management state.  The fault manager receives error events from various error handlers through various transport mechanisms (the error handlers and the fault manager may reside in quite distinct locations, e.g. an error handler within the Solaris kernel and a fault manager running within the service processor or on another independent system). 
  • The FMA event protocol, among many other things, arranges events in a single hierarchical tree (intended to span all events across all Sun products) with each node named by its event class.  Events include error events as produced by the error handler and fault events as produced by the fault manager when a fault is diagnosed (events are generic and are certainly not limited to error and fault events).  The event protocol also specifies how event data is encoded.  Continuing our correctable memory error example, such an error may arrive as an error event of class ereport.cpu.ultraSPARC-IV.ce and have corresponding event payload data (which we'll take a look at another day).  A rough sketch of what such an event looks like appears after this list.
  • Pluggable Diagnosis Engine clients of the fault manager subscribe to incoming error telemetry for event classes that they have knowledge of (i.e., are able to assist in the diagnosis of associated faults).  They apply various algorithms to the incoming error telemetry and attempt to diagnose any faults that are present in the system; when a fault is diagnosed a corresponding fault event is published.
  • FMA Agents subscribe to fault events produced by the diagnosis engines and may take any appropriate remedial or mitigating action.
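As a rough illustration of what an error event with "payload data" amounts to, here is a sketch that builds such an event as a name-value list using libnvpair(3LIB).  The class and member names follow the correctable memory error example used later in this post; it is only a userland approximation for illustration - the kernel has its own routines for constructing and transporting ereports.

#include <libnvpair.h>

/*
 * Sketch: an error event is essentially a class name plus a payload of
 * named, typed data members, conveniently represented as a name-value
 * list.  Class and member names follow the mem-ce example below.
 */
static nvlist_t *
example_mem_ce_ereport(uint32_t cpuid, uint64_t afar, uint16_t synd,
    boolean_t fixable, const char *loc)
{
	nvlist_t *ereport = NULL, *detector = NULL;

	if (nvlist_alloc(&ereport, NV_UNIQUE_NAME, 0) != 0)
		return (NULL);
	if (nvlist_alloc(&detector, NV_UNIQUE_NAME, 0) != 0) {
		nvlist_free(ereport);
		return (NULL);
	}

	/* Event class, as registered in the formal event registry. */
	(void) nvlist_add_string(ereport, "class",
	    "ereport.cpu.ultraSPARC-IIIplus.mem-ce");

	/* Detector: an embedded name-value list naming the observing cpu. */
	(void) nvlist_add_string(detector, "scheme", "cpu");
	(void) nvlist_add_uint32(detector, "cpuid", cpuid);
	(void) nvlist_add_nvlist(ereport, "detector", detector);
	nvlist_free(detector);	/* nvlist_add_nvlist takes a copy */

	/* Payload members describing the error itself. */
	(void) nvlist_add_uint64(ereport, "afar", afar);
	(void) nvlist_add_uint16(ereport, "syndrome", synd);
	(void) nvlist_add_boolean_value(ereport, "status", fixable);
	(void) nvlist_add_string(ereport, "location", loc);

	return (ereport);
}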
That's enough terminology for now.

The Fault Manager(s)

The fault manager plays a central role in the architecture.  Furthermore, fault managers may be arranged hierarchically with events flowing between them.  For example, one fault manager instance might run within each Solaris instance (say on domains A and B within a Sun-Fire 6900), and another fault manager instance might run on the system controller for the platform.  Events that are detected within the running OS image can produce error reports into the domain fault managers; events that are detected within the componentry managed by the SC (say an error observed on an ASIC that Solaris does not know exists and for which the error is corrected before it gets anywhere near Solaris) can produce ereports that flow towards the SC fault manager.  And since these things are never entirely independent it is interesting for the SC to know what the domain has seen (and vice versa), so, for example, the domain fault manager might produce events consumed by the SC fault manager.

An Example - Solaris FMA Components

Let's look a bit closer at some of the FMA infrastructure as realized in Solaris 10 (keeping in mind that FMA stretches way beyond just the OS).  It's probably most illustrative to contrast old with new all along, so I'll approach it that way.
Let's consider the example of a correctable error from memory.  The following isn't real Solaris code (that's coming soon) and is a substantial simplification but it gives the flavour.

The scenario, common to both old and new: the CPU traps to notify of a recent correctable error event.  The trap handler gathers data that is considered relevant/interesting, such as the detector details (variable cpuid, an integer), the address of the error (variable afar, a uint64_t), the error syndrome (variable synd, an integer), which DIMM holds that address (loc, a string), whether the error could be eliminated by a rewrite (status, a boolean), etc.  We now want to forward this data for pattern and trend analysis.
Old: Vomit a message to syslog with code that amounts to little more than the following:

cmn_err(CE_WARN, "Corrected Memory Error detected by CPU 0x%x, "
    "address 0x%llx, syndrome 0x%x (%s bit %d corrected), "
    "status %s, memory location %s",
    cpuid, afar, synd, synd_to_bit_type(synd), synd_to_bit_number(synd),
    status ? "Fixable" : "Stuck", loc);
New: Create some name-value lists which we'll populate with the relevant information and submit to our event transport:

dtcr = nvlist_create();
payload = nvlist_create();


We're not going to barf a message into the ether and hope for the best!  We're going to produce an error event according to the event protocol (implemented in various support functions which we'll use here).  We start by recording version and event class information:

/*
 * Event class naming is performed according to entries in a formal
 * registry.  For this case assume the registered leaf node is
 * "mem-ce".  We'll prepend the class prefix to form a full class name
 * along the lines of "ereport.cpu.ultraSPARC-IIIplus.mem-ce".
 */
payload_class_set(payload, cpu_gen(cpuid, "mem-ce"));


Populate the detector nvlist with information regarding the cpu and add this info into the payload:

(void) cpu_set(dtcr, cpuid, cpuversion(cpuid));
payload_set(payload, "detector", TYPE_NVLIST, dtcr);

Populate the payload with the error information:

payload_set(payload, "afar", TYPE_UINT64, afar);
payload_set(payload, "syndrome", TYPE_UINT16, synd);
payload_set(payload, "status", TYPE_BOOLEAN, status);
payload_set(payload, "location", TYPE_STRING, loc);


Submit the error report into our event transport:

event_submit(payload);

Old: The /var/adm/messages file and possibly the console get spammed with a message that might look like this:

Corrected Memory Error detected by cpu 0x38,
address 0x53272400, syndrome 0x34 (data bit 59 corrected),
status Fixable, memory location /N0/SB4/P0/B1/S0 J13301

If syslogd is having a bad day at the office it may even lose this message - e.g., if it reads the message from the kernel and then dies before logging it.  Assuming it does arrive, it has also reached the end of the road - any further processing is going to rely on grokking through messages files.
New: The sysevent transport is used to carry the event safely from the kernel to the userland fmd.  The error event arrives at the fault manager for further processing.
Old: (nothing further happens)
New: The error event is logged by fmd.  If we were to look at the logged event with fmdump(1M) it would appear along these lines:

May 04 2005 16:26:48.718581240 ereport.cpu.ultraSPARC-IIIplus.mem-ce
nvlist version: 0
        class = ereport.cpu.ultraSPARC-IIIplus.mem-ce
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = cpu
                cpuid = 0x38
        afar = 0x53272400
        syndrome = 0x34
        status = 1
        location = /N0/SB4/P0/B1/S0 J13301
Of course a real ereport has a bit more complexity than that.
Old: (nothing further happens)
New: The fault manager supplies a copy of the event to each plugin module that has subscribed to this event class.  Each receives it as a name-value list and can access all the data fields through the API provided by libnvpair(3LIB) (see the sketch following this example).
Old: No userland-level diagnosis is performed, other than ad hoc processing of messages files by various custom utilities.  The kernel error handlers do perform limited diagnosis but, since the kernel context is necessarily somewhat limited, these are not sophisticated and each handler tends to roll its own approach.
New: Diagnosis Engine modules can collate statistics on the event stream.  The fmd API provides the infrastructure that a diagnosis engine might want for keeping track of various error and fault information.  High-level diagnosis languages are implemented in custom diagnosis engine modules, allowing diagnosis rules etc. to be specified in a high-level language rather than coding to a C API.

New: The DE may eventually decide that there is a fault (e.g., it notices that all events are coming from the same page of memory) and it will publish a fault event.  We can use fmdump to see the fault log.

New: An agent subscribing to the fault event class can retire the page of memory so that it is no longer used.

The example is incomplete (on both sides) but fair.  I hope it illustrates how unstructured the old regime was, and how structured and extensible the new one is.  Getting additional telemetry out of the kernel into a module that can consume that data to make intelligent diagnosis decisions is pretty trivial now.
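To give a flavour of the plugin side, here is a minimal sketch of what a diagnosis engine's receive path might do with such an event.  The de_recv() callback and the naive threshold are invented for illustration (the real fmd module API is considerably richer); the payload accesses use libnvpair(3LIB) as mentioned in the example above.

#include <libnvpair.h>
#include <string.h>
#include <stdio.h>

/* Invented for illustration: a running count of mem-ce events seen. */
static unsigned int mem_ce_count;

/*
 * Hypothetical receive callback, invoked by the fault manager for each
 * error event matching the classes this diagnosis engine subscribed to.
 */
static void
de_recv(nvlist_t *ereport)
{
	char *class, *loc = "unknown";
	uint64_t afar = 0;
	uint16_t synd = 0;

	if (nvlist_lookup_string(ereport, "class", &class) != 0 ||
	    strcmp(class, "ereport.cpu.ultraSPARC-IIIplus.mem-ce") != 0)
		return;	/* not an event class we know how to diagnose */

	(void) nvlist_lookup_uint64(ereport, "afar", &afar);
	(void) nvlist_lookup_uint16(ereport, "syndrome", &synd);
	(void) nvlist_lookup_string(ereport, "location", &loc);

	/*
	 * A real DE would apply a proper algorithm (rates, common DIMM or
	 * bit, etc.) and publish a fault event when a fault is diagnosed;
	 * here we just count events and report on a naive threshold.
	 */
	if (++mem_ce_count > 10)
		(void) printf("suspect faulty memory at %s "
		    "(afar 0x%llx, syndrome 0x%x)\n",
		    loc, (unsigned long long)afar, synd);
}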

Comments:

Gavin I can't imagine the time it took to write these articles. The technical content and way you have put this info out is awsome. I for one applaud you taking time to do this. thanks.

Posted by Paul Humphreys on May 11, 2005 at 10:31 AM EST #
