Wednesday Jul 07, 2010

What Topology Map(s) is FMA Using?

When debugging FMA issues, there have been several times I've wanted to know what XML maps are being used to generate (or fail to generate :) the FM topology. I guess I reached a sufficient point of annoyance, so I wrote a pair of dtrace scripts to automate this. Here's an example from an Ultra 24 workstation running snv_140.

# ./topomaps.d -c /usr/lib/fm/fmd/fmtopo 2>/dev/null | grep Map
Map: //usr/platform/i86pc/lib/fm/topo/maps/i86pc-hc-topology.xml
Map: //usr/platform/i86pc/lib/fm/topo/maps/i86pc-legacy-hc-topology.xml
Map: //usr/platform/i86pc/lib/fm/topo/maps/chip-hc-topology.xml
Map: //usr/platform/i86pc/lib/fm/topo/maps/chassis-hc-topology.xml
^C

Now, I did say a pair of dtrace scripts. topomaps.d is the first script, which calls a second script topomapmon.d that ultimately traces the calls to topo_xml_read(). Two scripts are required to deal with the dynamic loading done by fmtopo. Anyway, drop both in the same directory and run with sufficient privileges and you should be good to go.


Friday Aug 01, 2008

The FMA Triad: Topology, Telemetry & Diagnosis Rules - Part 4

Over the course of the last several months, I've described how topology, telemetry, and diagnosis rules work together within the Fault Management Architecture (FMA), something I've dubbed the FMA Triad. It's high time I finished off this little mini-series with an example of what happens when the members of the Triad don't play nicely together.

Part 4 - When Things Go Wrong

This is a real world example. I briefly mentioned at the tail end of Part 3 that the diagnosis rules often use relative FMRIs, but the telemetry and the topology must use fully qualified FMRIs. When there's a disconnect, the diagnosis engine cannot determine the system resource and is unable to understand the incoming telemetry.

Customers upgraded their T2000 systems to Solaris 10 Update 4 and when starting the OS were greeted with this:

SUNW-MSG-ID: SUNOS-8000-1L, TYPE: Defect, VER: 1, SEVERITY: Minor
EVENT-TIME: Feb 08 15:30:39 CST 2008
SOURCE: eft, REV: 1.16
EVENT-ID: 1be16b2d-158e-e73b-f097-b744c2eb8cd3
DESC: The EFT Diagnosis Engine encountered telemetry for which it is unable to produce
a diagnosis.  Refer to for more information.
AUTO-RESPONSE: Error reports from the component will be logged for examination by Sun.
IMPACT: Automated diagnosis and response for these events will not occur.
REC-ACTION: Run pkgchk -n SUNWfmd to ensure that fault management software is installed
properly. Contact Sun for support.
Ouch. Not good. Not only is there an apparent hardware problem, FMA can't make heads or tails of it. First thing to do is find out what events led to this "diagnosis". We can use fmdump for that:

# fmdump -e -u 1be16b2d-158e-e73b-f097-b744c2eb8cd3
TIME                 CLASS
Feb 08 15:27:30.4960
OK. According to the event registry, "lup" is a link up error report. Normally, the diagnosis engines ignore these errors, and on boot we flat out expect them (the links have to come up). In fact, I tipped my hand in Part 3 by showing you the rules for a "lup" error. Looking deeper at the telemetry:

# fmdump -eV -u 1be16b2d-158e-e73b-f097-b744c2eb8cd3
TIME                           CLASS
Feb 08 2008 15:27:30.496065920
nvlist version: 0
        class =
        ena = 0xa3e9ab80002
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = hc
                hc-root =
                hc-list-sz = 3
                hc-list = (array of embedded nvlists)
                (start hc-list[0])
                nvlist version: 0
                        hc-name = ioboard
                        hc-id = 0
                (end hc-list[0])
                (start hc-list[1])
                nvlist version: 0
                        hc-name = hostbridge
                        hc-id = 0
                (end hc-list[1])
                (start hc-list[2])
                nvlist version: 0
                        hc-name = pciexrc
                        hc-id = 0
                (end hc-list[2])

        (end detector)

        primary = 1
        tlu-oeele = 0xffffff
        tlu-oeie = 0xffffff00ffffff
        tlu-oeis = 0x100
        tlu-oeess = 0x0
        __ttl = 0x1
        __tod = 0x47ace562 0x1d915d80
The telemetry describes an FMRI of hc:///ioboard=0/hostbridge=0/pciexrc=0, and the diagnosis engine can't make sense of it. The rules, however, only look for errors against the relative path hostbridge/pciexrc, so the rules themselves are fine. The problem must be that this FMRI can't be located in the topology. Turning to fmtopo:

# /usr/lib/fm/fmd/fmtopo
Eureka! There is in fact a disconnect between the telemetry and the topology. This explains the undiagnosable errors from FMA.
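
To make the mismatch concrete, here is a minimal Python sketch of the comparison. The hc-list pairs come straight from the detector nvlist above. The topology set is an assumed, illustrative snapshot (it is not real fmtopo output), and hc_fmri is a hypothetical helper, not part of libtopo.

```python
def hc_fmri(hc_list):
    """Join (hc-name, hc-id) pairs into an hc-scheme FMRI string."""
    return "hc:///" + "/".join(f"{name}={inst}" for name, inst in hc_list)

# hc-list pairs copied from the detector nvlist in the ereport above.
detector = [("ioboard", 0), ("hostbridge", 0), ("pciexrc", 0)]

# ASSUMPTION: an illustrative topology snapshot, not actual fmtopo output.
topology = {
    "hc:///motherboard=0",
    "hc:///motherboard=0/hostbridge=0",
    "hc:///motherboard=0/hostbridge=0/pciexrc=0",
}

fmri = hc_fmri(detector)
found = fmri in topology
print(fmri, "is in the topology" if found else "is NOT in the topology")
```

The relative path hostbridge/pciexrc matches in both trees, but the fully qualified FMRI from the telemetry has no counterpart in a topology rooted at a different node.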

Now, to the question of what changed to cause this problem? On T2000 class systems, telemetry for root complex errors (like the "lup" error) is generated in the Service Processor. Topology is constructed by enumerators in Solaris. Since the change made to the system was an upgrade of Solaris, something has gone wrong in Solaris 10 Update 4.

The mechanisms for generating a topology changed significantly between S10U3 and S10U4. In S10U3, there were .topo files, and the one for Netra,T2000 described an ioboard/hostbridge/pciexrc arrangement. When the newer XML map mechanism came with S10U4 (the one I described in Part 1 of this series), a platform specific topology map for Netra,T2000 was overlooked. As Part 1 details, when there is no platform specific topology map, FMD reverts to the architecture specific topology map - sun4v in this case. The sun4v XML map in S10U4 describes a motherboard/hostbridge/pciexrc arrangement.
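
The fallback described above can be sketched roughly as follows. This is not the libtopo implementation: pick_topo_map is a hypothetical function, and the "<name>-hc-topology.xml" naming for the platform specific map is an assumption modeled on the architecture map names seen elsewhere in these posts.

```python
def pick_topo_map(platform, arch, available):
    """Return the first map name present, preferring platform over architecture.

    ASSUMPTION: maps are named '<name>-hc-topology.xml'; the real lookup
    in libtopo may use different names and search paths.
    """
    candidates = [f"{platform}-hc-topology.xml", f"{arch}-hc-topology.xml"]
    for name in candidates:
        if name in available:
            return name
    return None

# No platform specific map installed: fall back to the sun4v map.
before_fix = pick_topo_map("Netra,T2000", "sun4v", {"sun4v-hc-topology.xml"})

# Platform specific map installed: it takes precedence.
after_fix = pick_topo_map(
    "Netra,T2000",
    "sun4v",
    {"sun4v-hc-topology.xml", "Netra,T2000-hc-topology.xml"},
)
print(before_fix, after_fix)
```

With the platform map missing, the lookup quietly succeeds with the architecture map, which is why nothing complained until telemetry arrived that the resulting topology could not account for.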

The solution was to provide a platform specific topo map for the Netra,T2000 system. If memory serves, a similar change for the Netra,CP3060 was also needed.

I hope you found this to be a good example that ties this series up. If you've enjoyed this half as much as I have, then I've enjoyed it twice as much as you. :)


Tuesday Jul 29, 2008

Putback: Platform Independent FMA for sun4v

As predicted in my last entry about sun4v platform independent FMA, I am going to yelp about the putback:

Event: putback-to
Comment:
FWARC 2008/300 Sun4v FMA Platform Independent FMA Topology Enumeration
PSARC 2008/392 FMA new canonical hc names for sun4v_pi enumerator
PSARC 2008/440 sun4v Platform Independent Topology Enumerator - libmdesc extensions
6628827 Need platform independent topo enumeration for sun4v platforms
Files:
update: usr/src/lib/fm/libmdesc/
update: usr/src/lib/fm/libmdesc/common/mapfile-vers
update: usr/src/lib/fm/topo/libtopo/common/hc.c
update: usr/src/lib/fm/topo/libtopo/common/topo_hc.h
update: usr/src/lib/fm/topo/maps/sun4v/sun4v-hc-topology.xml
update: usr/src/lib/fm/topo/modules/sun4v/Makefile
update: usr/src/lib/fm/topo/modules/sun4v/pcibus/Makefile
update: usr/src/lib/fm/topo/modules/sun4v/pcibus/pci_sun4v.c
update: usr/src/pkgdefs/SUNWfmd/prototype_sparc
update: usr/src/uts/common/sys/mdesc.h
create: usr/src/common/mdesc/mdesc_getproparcs.c
create: usr/src/common/mdesc/mdesc_walkdag.c
create: usr/src/lib/fm/topo/modules/sun4v/sun4vpi/Makefile
create: usr/src/lib/fm/topo/modules/sun4v/sun4vpi/pi_cpu.c
create: usr/src/lib/fm/topo/modules/sun4v/sun4vpi/pi_defer.c
create: usr/src/lib/fm/topo/modules/sun4v/sun4vpi/pi_generic.c
create: usr/src/lib/fm/topo/modules/sun4v/sun4vpi/pi_impl.h
create: usr/src/lib/fm/topo/modules/sun4v/sun4vpi/pi_ldom.c
create: usr/src/lib/fm/topo/modules/sun4v/sun4vpi/pi_niu.c
create: usr/src/lib/fm/topo/modules/sun4v/sun4vpi/pi_pciexrc.c
create: usr/src/lib/fm/topo/modules/sun4v/sun4vpi/pi_subr.c
create: usr/src/lib/fm/topo/modules/sun4v/sun4vpi/pi_top.c
create: usr/src/lib/fm/topo/modules/sun4v/sun4vpi/pi_walker.c
create: usr/src/lib/fm/topo/modules/sun4v/sun4vpi/sun4vpi.c
Examined files: 24
Contents Summary:
14 create
10 update

The code went into snv_96 yesterday so it should hit OpenSolaris in a day or so.




