The FMA Triad: Topology, Telemetry & Diagnosis Rules - Part 4

Over the course of the last several months, I've described how topology, telemetry, and diagnosis rules work together within the Fault Management Architecture (FMA), something I've dubbed the FMA Triad. It's high time I finished off this little mini-series with an example of what happens when the members of the Triad don't play nicely together.

Part 4 - When Things Go Wrong

This is a real-world example. I briefly mentioned at the tail end of Part 3 that the diagnosis rules often use relative FMRIs, but the telemetry and the topology must use fully qualified FMRIs. When the telemetry and the topology disagree, the diagnosis engine cannot map the incoming telemetry to a system resource and is unable to make sense of it.

Customers upgraded their T2000 systems to Solaris 10 Update 4 and when starting the OS were greeted with this:

SUNW-MSG-ID: SUNOS-8000-1L, TYPE: Defect, VER: 1, SEVERITY: Minor
EVENT-TIME: Feb 08 15:30:39 CST 2008
PLATFORM: SUNW,Netra-T2000, CSN: -, HOSTNAME: FOO
SOURCE: eft, REV: 1.16
EVENT-ID: 1be16b2d-158e-e73b-f097-b744c2eb8cd3
DESC: The EFT Diagnosis Engine encountered telemetry for which it is unable to produce
a diagnosis.  Refer to http://sun.com/msg/SUNOS-8000-1L for more information.
AUTO-RESPONSE: Error reports from the component will be logged for examination by Sun.
IMPACT: Automated diagnosis and response for these events will not occur.
REC-ACTION: Run pkgchk -n SUNWfmd to ensure that fault management software is installed
properly. Contact Sun for support.

Ouch. Not good. Not only is there an apparent hardware problem, but FMA can't make heads or tails of it. The first thing to do is find out what events led to this "diagnosis". We can use fmdump for that:

# fmdump -e -u 1be16b2d-158e-e73b-f097-b744c2eb8cd3
TIME                 CLASS
Feb 08 15:27:30.4960 ereport.io.fire.pec.lup

OK. According to the event registry, "lup" is a link-up error report. Normally, the diagnosis engines ignore these errors, and on boot we flat out expect them (the links have to come up). In fact, I tipped my hand in Part 3 by showing you the rules for a "lup" error. Looking deeper at the telemetry:

# fmdump -eV -u 1be16b2d-158e-e73b-f097-b744c2eb8cd3
TIME                           CLASS
Feb 08 2008 15:27:30.496065920 ereport.io.fire.pec.lup
nvlist version: 0
        class = ereport.io.fire.pec.lup
        ena = 0xa3e9ab80002
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = hc
                hc-root =
                hc-list-sz = 3
                hc-list = (array of embedded nvlists)
                (start hc-list[0])
                nvlist version: 0
                        hc-name = ioboard
                        hc-id = 0
                (end hc-list[0])
                (start hc-list[1])
                nvlist version: 0
                        hc-name = hostbridge
                        hc-id = 0
                (end hc-list[1])
                (start hc-list[2])
                nvlist version: 0
                        hc-name = pciexrc
                        hc-id = 0
                (end hc-list[2])

        (end detector)

        primary = 1
        tlu-oeele = 0xffffff
        tlu-oeie = 0xffffff00ffffff
        tlu-oeis = 0x100
        tlu-oeess = 0x0
        __ttl = 0x1
        __tod = 0x47ace562 0x1d915d80
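
To make the detector payload easier to read, here is a minimal Python sketch (not part of FMA) of how the hc-list entries line up with the FMRI string used in the rest of this post. The function name hc_fmri and the list-of-tuples input are my own illustration; the real work is done inside fmd and its libraries.

```python
def hc_fmri(hc_list):
    """Join (hc-name, hc-id) pairs into an hc-scheme FMRI string.

    Assumes an empty hc-root, as in the ereport shown above.
    """
    path = "/".join(f"{name}={inst}" for name, inst in hc_list)
    return "hc:///" + path


# The three hc-list entries from the detector nvlist, in order.
detector = [("ioboard", 0), ("hostbridge", 0), ("pciexrc", 0)]
print(hc_fmri(detector))
```
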
The telemetry describes an FMRI of hc:///ioboard=0/hostbridge=0/pciexrc=0, yet the diagnosis engine can't make sense of it. The rules, however, only look for errors against the relative path hostbridge/pciexrc, which this FMRI satisfies, so the rules themselves are fine. The problem must be that this FMRI can't be located in the topology. Turning to fmtopo:

# /usr/lib/fm/fmd/fmtopo
...
hc:///motherboard=0/hostbridge=0/pciexrc=0
...

Eureka! There is in fact a disconnect between the telemetry and the topology: the telemetry names ioboard=0 where the topology has motherboard=0. This explains the undiagnosable errors from FMA.
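
The mismatch can be illustrated with a toy model in Python. This is only a sketch of the idea, not how the eft engine is implemented: I'm assuming a rule's relative path can be treated as a suffix match on component names, and that the topology is a plain set of FMRI strings. The helper names are hypothetical.

```python
def components(fmri):
    """Split an hc FMRI into component names, dropping instance numbers."""
    path = fmri.split("hc:///", 1)[1]
    return [part.split("=")[0] for part in path.split("/")]


def matches_relative(fmri, relative):
    """True if the FMRI ends with the relative path (toy suffix match)."""
    want = relative.split("/")
    return components(fmri)[-len(want):] == want


# What fmtopo reported vs. what the ereport's detector named.
topology = {"hc:///motherboard=0/hostbridge=0/pciexrc=0"}
telemetry = "hc:///ioboard=0/hostbridge=0/pciexrc=0"

rule_matches = matches_relative(telemetry, "hostbridge/pciexrc")
in_topology = telemetry in topology
print(rule_matches, in_topology)
```

The relative path in the rules is satisfied, but the fully qualified FMRI is nowhere in the topology, which is the disconnect described above.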

Now, what changed to cause this problem? On T2000 class systems, telemetry for root complex errors (like the "lup" error) is generated in the Service Processor. Topology is constructed by enumerators in Solaris. Since the only change made to the system was an upgrade of Solaris, something must have gone wrong on the topology side in Solaris 10 Update 4.

The mechanisms for generating a topology changed significantly between S10U3 and S10U4. In S10U3, there were .topo files, and the one for Netra,T2000 described an ioboard/hostbridge/pciexrc arrangement. When the newer XML map mechanism came with S10U4 (the one I described in Part 1 of this series), a platform-specific topology map for Netra,T2000 was overlooked. As Part 1 details, when there is no platform-specific topology map, FMD falls back to the architecture-specific topology map, which is sun4v in this case. The sun4v XML map in S10U4 describes a motherboard/hostbridge/pciexrc arrangement.
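
The fallback can be sketched in a few lines of Python. This is an illustration only: the dictionary stands in for whatever map files are installed, the function name pick_topo_map is hypothetical, and the real lookup in FMD works on XML map files rather than dictionary keys.

```python
def pick_topo_map(platform, arch, available):
    """Prefer a platform-specific map, else fall back to the architecture map."""
    for name in (platform, arch):
        if name in available:
            return name
    return None


# S10U4 as shipped: no entry for the Netra platform, only the sun4v map.
s10u4_maps = {"sun4v": "motherboard/hostbridge/pciexrc"}
chosen = pick_topo_map("SUNW,Netra-T2000", "sun4v", s10u4_maps)
print(chosen, "->", s10u4_maps[chosen])

# With a platform-specific map present, the platform entry is chosen instead.
fixed_maps = dict(s10u4_maps)
fixed_maps["SUNW,Netra-T2000"] = "ioboard/hostbridge/pciexrc"
print(pick_topo_map("SUNW,Netra-T2000", "sun4v", fixed_maps))
```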

The solution was to provide a platform-specific topology map for the Netra,T2000 system. If memory serves, a similar change for the Netra,CP3060 was also needed.

I hope you found this to be a good example that ties this series up. If you've enjoyed this half as much as I have, then I've enjoyed it twice as much as you. :)

:wq
