FMA and DIMM serial numbers


I've pretty much had my head down working on various FMA bug fixes and enhancements for the last few months.  Now that I've finally gotten them putback, I have some time to take a (short) breather and so I thought I'd blog about a few of the things I've been working on here.  Here's the first installment:


The Solaris Fault Manager maintains a snapshot of the hardware topology in a tree-like structure that includes a node for every hardware resource and FRU that is managed or monitored by FMA.  The interfaces for generating a topology snapshot, walking the resulting tree, and manipulating the individual nodes in the tree are provided by libtopo and documented in Chapter 9 of the Fault Manager Programmer's Reference Guide.  Scott Davenport also has some nice overview material here.  Each node in the tree is identified by a unique identifier called an FMRI (fault managed resource identifier).  The format of the FMRI for hardware resources is the following:

hc://[<authority>],[hardware-id]/[<hc-root.>][<hardware-component-path>]

where hardware-id would be:
 [:serial=<serial-number>][:part=<part-number>][:revision=<revision-number>]
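
To make the format a bit more concrete, here's a minimal Python sketch that splits an hc-scheme FMRI string into its authority fields, hardware-id fields and component path.  This is only an illustration: the function name parse_hc_fmri is made up for this example, it is not part of libtopo, and it assumes that field values contain no ':' or '/' characters and that component instances are integers.

```python
def parse_hc_fmri(fmri):
    """Split an hc-scheme FMRI string into (authority, hardware_id, components).

    Hypothetical helper for illustration only; libtopo has its own parser.
    """
    prefix = "hc://"
    if not fmri.startswith(prefix):
        raise ValueError("not an hc-scheme FMRI: %r" % fmri)
    rest = fmri[len(prefix):]

    # Everything before the first '/' holds the ':name=value' pairs;
    # everything after it is the hardware component path.
    head, _, path = rest.partition("/")

    fields = {}
    for pair in head.split(":"):
        if pair:
            name, _, value = pair.partition("=")
            fields[name] = value

    # The three optional hardware-id fields named in the format above.
    hw_id_names = ("serial", "part", "revision")
    hardware_id = {k: v for k, v in fields.items() if k in hw_id_names}
    authority = {k: v for k, v in fields.items() if k not in hw_id_names}

    # Each path component looks like 'name=instance'.
    components = []
    for comp in path.split("/"):
        if comp:
            name, _, inst = comp.partition("=")
            components.append((name, int(inst)))
    return authority, hardware_id, components
```

Fed the FRU FMRI from the "fmadm faulty" output later in this post, the sketch puts the serial in the hardware-id dictionary, puts product-id, chassis-id and server-id in the authority dictionary, and gives motherboard, chip, memory-controller and dimm as the path components.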


Among other things, the optional "hardware-id" fields (in particular the serial) can be used by the fault manager to detect when a FRU has been replaced by service personnel.  In the absence of hardware identity information, administrators must manually inform the fault manager after they've replaced a faulty component, using the "repair" subcommand of fmadm(1M).  Otherwise, the fault manager will continue to report the component as faulty and attempt to isolate it.  On our UltraSPARC systems much of this information is provided by the OpenBoot Platform firmware.  On x86 we don't have the benefit of sitting on top of a common firmware layer that we control.  As a result, we historically haven't filled in the hardware-id fields because we haven't found a generalized, reliable mechanism for fetching this data.  However, on our newer AMD-based server platforms[1], some FRU information is maintained in non-volatile storage by the service processor and is accessible using a common protocol: IPMI.

In Solaris Nevada build 87, we've added the capability to use IPMI to find serial numbers and attach them to the dimm nodes in our topology on our AMD-based server platforms.  We've also extended the fault manager to check for this serial property and use it, if found, to detect when a faulted DIMM has been replaced.  For people who like ugly details :), here's a brief rundown of the code changes:

It all starts with a new topo node property method that is registered to the dimm nodes in our topology on our AMD-based server platforms.  The XML for this looks like the example below and the complete XML changes are in usr/src/lib/fm/topo/maps/i86pc/chip-hc-topology.xml

<propmethod name='get_dimm_serial' version='0' propname='serial' proptype='string' >
    <argval name='format' type='string' value='p%d.d%d.fru' />
    <argval name='offset' type='uint32' value='0' />
</propmethod>

This property method uses interfaces from libipmi to communicate with the service processor and look up the FRU locator record for the associated DIMM.  The FRU locator record provides the offset into the FRU inventory on the service processor where we can fetch information such as the manufacturer name and the serial number.  Using the manufacturer name and serial number, we synthesize a Sun serial ID[2] and attach it as a property to the dimm node.  This all happens in usr/src/lib/fm/topo/modules/i86pc/chip/chip_serial.c
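
As a rough illustration of how the 'format' and 'offset' arguments might combine, here's a small Python sketch.  It is a guess at the idea, not a transcription of chip_serial.c: it assumes the first %d is the chip instance and the second is the DIMM instance plus the offset, and it stands in a plain dictionary for the service processor's FRU inventory.  The names fru_locator_name and lookup_dimm_fru and the record fields are hypothetical; none of them are libipmi interfaces.

```python
def fru_locator_name(fmt, chip, dimm, offset=0):
    """Expand a printf-style locator format for one chip/DIMM pair.

    Assumption: the first %d is the chip instance and the second is the
    DIMM instance plus the configured offset.
    """
    return fmt % (chip, dimm + offset)


def lookup_dimm_fru(inventory, fmt, chip, dimm, offset=0):
    """Return the FRU record for a DIMM, or None if none is listed.

    'inventory' is a hypothetical dict standing in for the service
    processor's FRU inventory, keyed by locator name.
    """
    return inventory.get(fru_locator_name(fmt, chip, dimm, offset))


# Usage with a made-up in-memory inventory.
inventory = {"p0.d0.fru": {"manufacturer": "example", "serial": "DA062AF3"}}
record = lookup_dimm_fru(inventory, "p%d.d%d.fru", chip=0, dimm=0)
```

The real property method has to talk to the service processor and parse the FRU data it gets back; the sketch only shows how a per-DIMM locator name could be derived from the XML arguments.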

Next, we've modified the Fault Manager to look for the existence of the serial property method and, if found, invoke it and attach the serial to the FMRIs that are included in the payload of a fault event.  See fmd_nvl_create_fault() in usr/src/cmd/fm/fmd/common/fmd_api.c
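
The "look for the method, and if found, invoke it" step can be pictured with the Python sketch below.  It models an FMRI and a topology node as plain dictionaries, which is a simplification of the name-value lists the fault manager actually uses, and the function name attach_serial is invented for this example.  Leaving the FMRI unchanged when the lookup fails is an assumption of the sketch, not something confirmed by the code in fmd_api.c.

```python
def attach_serial(fmri, node):
    """Return a copy of fmri with a 'serial' member if the node can supply one.

    'node' is a hypothetical dict whose optional 'methods' entry maps
    property-method names to callables.
    """
    result = dict(fmri)
    get_serial = node.get("methods", {}).get("get_dimm_serial")
    if get_serial is None:
        return result          # no property method registered on this node
    try:
        serial = get_serial()
    except Exception:
        return result          # assumption: a failed lookup leaves the FMRI as-is
    if serial:
        result["serial"] = serial
    return result
```

A node with no registered method, or one whose method fails, produces the same FMRI as before the change, which matches the behavior on platforms that can't supply a serial.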

The fault manager maintains a persistent cache of resources that have been the subject of a diagnosis (see Chapter 6 of the Fault Manager Programmer's Reference Guide).  The fault manager uses this cache to keep track of what's faulty, which enables it to re-report and re-isolate a faulted component after a system restart.  However, before doing this, the fault manager first attempts to determine whether the faulted component is still present in the system.  (No need to report or isolate something that's been removed.)  The code for determining whether a faulted resource is still present is a bit hard to follow and in some cases varies based on the type of component and whether we're on SPARC or x86, but the basic idea is to determine which scheme the FMRI of the faulted resource belongs to and then call the appropriate is_present method, which should return TRUE if the resource is still present and FALSE otherwise.  For the DIMM case on our AMD-based platforms, the code flow looks like this:

usr/src/cmd/fm/fmd/common/fmd_asru.c::fmd_asru_hash_recreate()
|
|-> usr/src/cmd/fm/fmd/common/fmd_fmri.c::fmd_fmri_present()
    |
    |-> usr/src/cmd/fm/schemes/mem/mem.c::fmd_fmri_present()
        |
        |-> usr/src/lib/fm/topo/libtopo/common/topo_fmri.c::topo_fmri_present()
            |
            |-> usr/src/lib/fm/topo/libtopo/common/hc.c::hc_is_present()
                |
                |-> usr/src/lib/fm/topo/modules/i86pc/chip/chip_subr.c::rank_is_present()


The rank_is_present method in the chip enumerator module compares the serial numbers and returns FALSE if the serial number of the faulted resource doesn't match the current serial number in the topology snapshot.  If any errors occur along the path above, preventing us from determining whether the resource is still present, we err on the side of caution and return TRUE.
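
The decision logic described above boils down to something like the following Python sketch.  It borrows the name rank_is_present for readability, but it is not the C code from chip_subr.c: the arguments are invented for this example, and treating a missing serial on either side as "can't tell, so still present" is an assumption that extends the err-on-the-side-of-caution rule.

```python
def rank_is_present(faulted_serial, read_current_serial):
    """Return False only when we can show the faulted DIMM was replaced.

    faulted_serial      -- serial recorded in the faulted resource's FMRI
    read_current_serial -- hypothetical callable returning the serial now
                           found in the topology snapshot
    """
    if not faulted_serial:
        return True            # nothing to compare against: assume present
    try:
        current = read_current_serial()
    except Exception:
        return True            # lookup failed: err on the side of caution
    if not current:
        return True            # no current serial available: assume present
    return current == faulted_serial
```

The asymmetry is deliberate: wrongly reporting a replaced DIMM as still present only costs the administrator a manual "fmadm repair", while wrongly reporting a faulty DIMM as gone would put bad memory back into service without anyone noticing.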

Ok - so that's some of the gory code details, but what will it look like from the user's perspective?

If a DIMM is diagnosed as faulty on an x64 system, the user will see something like this on the console (no change here):

SUNW-MSG-ID: AMD-8000-48, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Wed Mar 19 14:04:01 PDT 2008
PLATFORM: Sun Fire X4500, CSN: 00:14:4F:20:E4:B0     , HOSTNAME: lollipop
SOURCE: eft, REV: 1.16
EVENT-ID: 44384620-5c7d-4073-edbc-ff0664004de4
DESC: The number of errors associated with this memory module has exceeded acceptable levels.  Refer to http://sun.com/msg/AMD-8000-48 for more information.
AUTO-RESPONSE: Pages of memory associated with this memory module are being removed from service as errors are reported.
IMPACT: Total system memory capacity will be reduced as pages are retired.
REC-ACTION: Schedule a repair procedure to replace the affected memory module.  Use fmdump -v -u <EVENT_ID> to identify the module.


If the user runs "fmadm faulty", they'll see this (note that the DIMM serial number is now included in the FRU FMRI):

lollipop# fmadm faulty -a
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Mar 19 14:04:01 44384620-5c7d-4073-edbc-ff0664004de4  AMD-8000-48    Major   

Fault class : fault.memory.dimm_ue
Affects     : mem:///motherboard=0/chip=0/memory-controller=0/dimm=0/rank=0
                  degraded but still in service
FRU         : "CPU 0 DIMM 0" (hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E4-B0:server-id=lollipop:serial=002C000000DA062AF3/motherboard=0/chip=0/memory-controller=0/dimm=0)
                  faulty

Description : The number of errors associated with this memory module has
              exceeded acceptable levels.  Refer to
              http://sun.com/msg/AMD-8000-48 for more information.

Response    : Pages of memory associated with this memory module are being
              removed from service as errors are reported.

Impact      : Total system memory capacity will be reduced as pages are
              retired.

Action      : Schedule a repair procedure to replace the affected memory
              module.  Use fmdump -v -u <EVENT_ID> to identify the module.

Now if the user or service technician replaces "CPU 0 DIMM 0" and then reruns "fmadm faulty" after bringing the system back up, they'll see this (note that the states of the ASRU and FRU have changed to "faulted and taken out of service" and "not present", respectively):

lollipop# fmadm faulty -a
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Mar 19 14:04:01 44384620-5c7d-4073-edbc-ff0664004de4  AMD-8000-48    Major   

Fault class : fault.memory.dimm_ue
Affects     : mem:///motherboard=0/chip=0/memory-controller=0/dimm=0/rank=0
                  faulted and taken out of service
FRU         : "CPU 0 DIMM 0" (hc://:product-id=Sun-Fire-X4500:chassis-id=00-14-4F-20-E4-B0:server-id=lollipop:serial=002C000000DA062AF3/motherboard=0/chip=0/memory-controller=0/dimm=0)
                  not present

Description : The number of errors associated with this memory module has
              exceeded acceptable levels.  Refer to
              http://sun.com/msg/AMD-8000-48 for more information.

Response    : Pages of memory associated with this memory module are being
              removed from service as errors are reported.

Impact      : Total system memory capacity will be reduced as pages are
              retired.

Action      : Schedule a repair procedure to replace the affected memory
              module.  Use fmdump -v -u <EVENT_ID> to identify the module.

I also putback a handful of other bug fixes into build 87 - here's the complete putback notification:

Event:            putback-to
Parent workspace: /ws/onnv-gate
(elpaso:/ws/onnv-gate)
Child workspace: /net/hyper/tank/ws/robj/fma-dimm-serial2
(hyper:/tank/ws/robj/fma-dimm-serial2)
User: robj

Comment:
6593380 topology for Sun x64 platforms should include serial numbers for dimms
6671247 missing DIMM FRU labels on 4600/4600M2 platforms with family 15 modules
6672188 chip FRU labels computed incorrectly on 2-socket AF4+ blades
6675806 libipmi: ipmi_fru_read() can leak memory on failure

Files:
update: usr/src/cmd/fm/eversholt/files/i386/i86pc/amd64.esc
update: usr/src/cmd/fm/eversholt/files/i386/i86pc/intel.esc
update: usr/src/cmd/fm/fmd/common/fmd_api.c
update: usr/src/cmd/fm/schemes/mem/mem.c
update: usr/src/lib/fm/topo/libtopo/common/libtopo.h
update: usr/src/lib/fm/topo/libtopo/common/mapfile-vers
update: usr/src/lib/fm/topo/libtopo/common/topo_fmri.c
update: usr/src/lib/fm/topo/maps/i86pc/chip-hc-topology.xml
update: usr/src/lib/fm/topo/modules/i86pc/chip/Makefile
update: usr/src/lib/fm/topo/modules/i86pc/chip/chip.c
update: usr/src/lib/fm/topo/modules/i86pc/chip/chip.h
update: usr/src/lib/fm/topo/modules/i86pc/chip/chip_amd.c
update: usr/src/lib/fm/topo/modules/i86pc/chip/chip_label.c
update: usr/src/lib/fm/topo/modules/i86pc/chip/chip_subr.c
update: usr/src/lib/libipmi/common/ipmi_fru.c
create: usr/src/lib/fm/topo/modules/i86pc/chip/chip_serial.c

Examined files: 16

Contents Summary:
1 create
15 update



[1] I'm qualifying the statement by saying "on our newer AMD-based server platforms" for a few reasons:

  1. Since we're sourcing the serial number from the service processor, we obviously won't be able to support this on our AMD-based desktop platforms which don't have baseboard management controllers.
  2. The third-party service processor firmware on some of our older AMD-based server platforms does not export sufficient FRU information to allow us to get the serial numbers.  This mainly affects the lower-end X2100/2200 line.
  3. Our Intel-based platforms use a completely different mechanism to get the DIMM serial numbers.

[2] You might be wondering why we need to synthesize a Sun serial ID as opposed to simply using the manufacturer serial number.  There are a couple of problems with using the manufacturer serial number as is.  First, different DIMM manufacturers could use the same serial number.  Second, because the serial space is limited (8 characters) and DIMM manufacturers pump out DIMMs at a staggering rate, the same manufacturer could cycle through and then reuse serial numbers as frequently as every week.  Because FMA needs to use the serial number to determine whether a given DIMM has been replaced, we need to know that the serial is as unique as possible.  Newer versions of our service processor firmware (ILOM) will concatenate the following three additional pieces of information to the manufacturer serial to form a globally unique 18 character Sun serial ID:

  1. The JEDEC ID of the manufacturer
  2. The manufacturing location
  3. The manufacturing date

For the cases where we encounter older ILOM software that doesn't synthesize a globally unique Sun serial ID, Solaris will synthesize an 18 character serial ID based on the manufacturer JEDEC ID and the manufacturer serial (filling in zeroes for the location and date).  While this isn't guaranteed to be unique, it is more likely to be unique than just using the manufacturer serial alone.
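
Here's a hedged Python sketch of that fallback.  The field order and widths (4 hex digits of JEDEC ID, 2 characters of location, 4 characters of date, 8 characters of manufacturer serial) are guesses, chosen so that the total comes to 18 characters and the result matches the serial shown in the "fmadm faulty" output above.  They are not taken from ILOM or from chip_serial.c, and the function name is made up for this example.

```python
def synthesize_serial_id(jedec_id, mfg_serial, location="00", date="0000"):
    """Build an 18-character serial ID from a JEDEC ID and manufacturer serial.

    Assumed layout: JEDEC ID (4 hex digits) + location (2) + date (4) +
    manufacturer serial (8).  Location and date default to zeroes, as
    described for older ILOM firmware.
    """
    serial = "%04X%s%s%s" % (
        jedec_id,
        location,
        date,
        mfg_serial.upper().rjust(8, "0"),   # left-pad short serials with zeroes
    )
    if len(serial) != 18:
        raise ValueError("unexpected serial ID length: %r" % serial)
    return serial


# With a JEDEC ID of 0x2C and the manufacturer serial DA062AF3, this
# reproduces the serial from the example output.
example = synthesize_serial_id(0x2C, "DA062AF3")
```

Under these assumptions, two DIMMs from different manufacturers that happen to share an 8-character serial still end up with different IDs, which is the whole point of adding the JEDEC ID.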





