The FMA Triad: Topology, Telemetry & Diagnosis Rules - Part 2

In Part 1 of "The FMA Triad: Topology, Telemetry, and Diagnosis Rules", I focused on topology. It's time to unravel the second piece of the triad - telemetry - with a specific focus on how the telemetry relates to the topology.

As a reminder, the intention of this series is to illustrate how topology, telemetry, and diagnosis rules fit together, where they must agree, and - as a teaser for the last installment - what problems arise when they don't agree.


Part 2 - Telemetry

telemetry (tə-lĕm'ĭ-trē): the science and technology of automatic measurement and transmission of data by wire, radio, or other means from remote sources, as from space vehicles, to receiving stations for recording and analysis.

Ok, so FMA may not be receiving information from space (yet :). But error detectors in a system - whether they be hardware, sensors, software, or firmware - can provide FMA information about a problem detected in the system. The telemetry given to FMA takes the form of error reports - or ereports.

All FMA ereports are defined in the FMA events registry (see related blog). The ereport class name and content convey details of the error to a diagnosis engine. The ereport also represents an agreement between a provider of telemetry (error detector) and the consumer of that telemetry (diagnosis engine). There are lots and lots of ereports, and each has its own specific content, as different subsystems require different information. But with respect to topology, the focus is on one of the common elements present in any ereport - the detector.

The detector takes the form of a fault managed resource identifier (FMRI). It names the thing that detected (but did not necessarily cause) an error in the system. It's best to look at an example:

# fmdump -eV
TIME                           CLASS
Mar 31 2008 12:08:36.084161600 ereport.io.fire.dmc.eq_not_en
nvlist version: 0
        class = ereport.io.fire.dmc.eq_not_en
        ena = 0x317b96b9efe2c02
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = hc
                hc-root =
                hc-list-sz = 3
                hc-list = (array of embedded nvlists)
                (start hc-list[0])
                nvlist version: 0
                        hc-name = ioboard
                        hc-id = 0
                (end hc-list[0])
                (start hc-list[1])
                nvlist version: 0
                        hc-name = hostbridge
                        hc-id = 0
                (end hc-list[1])
                (start hc-list[2])
                nvlist version: 0
                        hc-name = pciexrc
                        hc-id = 0
                (end hc-list[2])

        (end detector)
...

Note: Detectors are not always reported in the 'hc' scheme. For example, device driver detections are typically in the 'dev' scheme.

So this detector is in the 'hc' scheme. See the 'hc-list', and how each member of the list has a name and an id? Let's write this a little differently:
    hc:///ioboard=0/hostbridge=0/pciexrc=0
If you've read the topology installment of this series, this should look familiar: it's an FMRI that one could expect in a system's topology. Using the Eversholt (EFT) diagnosis engine as an example, when this ereport is received, EFT will find the detector in the current topology snapshot. The diagnosis engine then knows all about the resource that detected the error, its FRU properties, etc. from the topology.
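To make the hc-list-to-FMRI rewrite concrete, here is a minimal Python sketch that pulls the hc-name/hc-id pairs out of `fmdump -eV` text and joins them into the string form shown above. The function name `hc_fmri_from_fmdump` is made up for this illustration, and the sketch assumes an empty hc-root and no authority information, as in the example ereport. It is not how FMA itself builds or parses FMRIs.

```python
import re


def hc_fmri_from_fmdump(text):
    """Build an hc-scheme FMRI string from the detector part of 'fmdump -eV' text.

    Assumption: the text holds exactly one detector, with an empty hc-root and
    no authority, as in the example ereport in this post.
    """
    # Each hc-list member contributes one name and one id, in order.
    names = re.findall(r"hc-name = (\S+)", text)
    ids = re.findall(r"hc-id = (\S+)", text)
    if not names or len(names) != len(ids):
        raise ValueError("could not pair up hc-name and hc-id entries")
    # Join the members as name=id path components, root first.
    path = "/".join(f"{name}={ident}" for name, ident in zip(names, ids))
    return "hc:///" + path


# Detector excerpt copied from the fmdump output above.
sample = """
        hc-list-sz = 3
        (start hc-list[0])
                hc-name = ioboard
                hc-id = 0
        (end hc-list[0])
        (start hc-list[1])
                hc-name = hostbridge
                hc-id = 0
        (end hc-list[1])
        (start hc-list[2])
                hc-name = pciexrc
                hc-id = 0
        (end hc-list[2])
"""

print(hc_fmri_from_fmdump(sample))
```

Run against the excerpt, this prints hc:///ioboard=0/hostbridge=0/pciexrc=0, the same FMRI written out by hand above.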

The need for agreement between the Solaris topology and the ereport detector may extend beyond Solaris. For example, in sun4v systems, much of the error telemetry is sourced outside of Solaris in the Service Processor (SP). The SP must know what hierarchy to encode in the ereport detector, or the diagnosis engines will not react to the incoming ereport. In fact, FMD will complain - loudly. But examples of that will come in the next installment.
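The agreement requirement can be pictured with a toy lookup. The Python sketch below is only an illustration: the snapshot is a hand-written set of FMRI strings, the `detector_in_topology` helper is hypothetical, and the "mismatched" detector is an invented example of a telemetry source encoding a different hierarchy. Real matching inside fmd and the diagnosis engines is more involved than a string comparison.

```python
def detector_in_topology(detector_fmri, topology_fmris):
    """Return True if the ereport detector names a resource in the snapshot."""
    return detector_fmri in topology_fmris


# Hypothetical topology snapshot, reduced to plain FMRI strings.
topology = {
    "hc:///ioboard=0",
    "hc:///ioboard=0/hostbridge=0",
    "hc:///ioboard=0/hostbridge=0/pciexrc=0",
}

# Detector that agrees with the topology (taken from the ereport above).
agreed = "hc:///ioboard=0/hostbridge=0/pciexrc=0"
# Invented detector from a source that left out the hostbridge level.
mismatched = "hc:///ioboard=0/pciexrc=0"

for fmri in (agreed, mismatched):
    status = "found" if detector_in_topology(fmri, topology) else "NOT in topology"
    print(f"{fmri}: {status}")
```

The second detector names the same physical device but with a hierarchy the topology does not contain, so the lookup fails. That is the kind of disagreement a telemetry source outside Solaris has to avoid.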

Next...
In the next installment of this series, we'll examine some Eversholt diagnosis rules and how the rules relate to the topology.

Part 3 -->

:wq
