FMA Support For AMD Opteron rev F

In my last blog entry I discussed the initial support in Solaris for fault management on AMD Opteron, Athlon 64 and Turion 64 processor systems (more accurately known as AMD Family 0xf, or the K8 family). That support applied to revisions E and earlier of those chips, which were current at the time, and has since been backported to Solaris 10 Update 2. In this entry I will discuss recent changes to support "revision F", and I'll followup in the next few weeks with a side-by-side comparison of our error handling and fault management with that of one or two other OS offerings.

In August this year AMD introduced it's "Next Generation AMD Opteron processor" which has a bunch of cool featues (such as AMD Virtualization technology and a socket-compatible upgrade path to Quad-Core chips when they release) but the main change as far as error handling and fault management go is the switch to DDR2 memory. This next generation is also known as revision F (of AMD family 0xf) and sometimes also as the "socket F" (really F(1207)) and "socket AM2" processors; the initial Solaris support for AMD fault management does not apply to revision F (the cpu module driver would choose not to initialize for revisions beyond E).

Quick Summary

If you want to avoid the detail below here is the quick summary. Solaris now supports revision F in Nevada build 51, and targeting Solaris 10 Update 4 next year. The support works pretty much as for earlier revisions (most code shared) - identically for cpu diagnosis, and with minor changes in memory diagnosis.

Revision F Support in Nevada Build 51

In early October I putback support for rev F, together with a number of improvements and bugfixes to the original project. Below is the putback notification including both the bugids and the files changed - most files that implement the AMD fault management support were touched to some degree, so this provides a handy list for anyone wanting to explore more deeply:

Event:            putback-to
Parent workspace: /ws/onnv-gate
Child workspace:  /net/
User:             gavinm

PSARC 2006/564 FMA for Athlon 64 and Opteron Rev F/G Processors
PSARC 2006/566 eversholt language enhancements: confprop_defined
6362846 eversholt doesn't allow dashes in pathname components
6391591 AMD NB config should not set NbMcaToMstCpuEn
6391605 AMD DRAM scrubber should be disabled when errata #99 applies
6398506 memory controller driver should not bother to attach at all on rev F
6424822 FMA needs to support AMD family 0xf revs F and G
6443847 FMA x64 multibit ChipKill rules need to follow MQSC guidelines
6443849 Accrue inf_sys and s_ecc ECC errors against memory
6443858 mc-amd can free unitsrtr before usage in subsequent error path
6443891 mc-amd does not recognise mismatched dimm support
6455363 x86 error injector should allow addr option for most errors
6455370 Opteron erratum 101 only applies on revs D and earlier
6455373 Identify chip-select lines used on a dimm
6455377 improve x64 quadrank dimm support
6455382 add generic interfaces for amd chip revision and package/socket type
6468723 mem scheme fmri containment test for hc scheme is busted
6473807 eversholt could use some mdb support
6473811 eversholt needs a confprop_defined function
6473819 eversholt should show version of rules active in DE
6475302 ::nvlist broken by some runtime link ordering changes

update: usr/closed/cmd/mtst/x86/common/opteron/opt.h
update: usr/closed/cmd/mtst/x86/common/opteron/opt_common.c
update: usr/closed/cmd/mtst/x86/common/opteron/opt_main.c
update: usr/closed/cmd/mtst/x86/common/opteron/opt_nb.c
update: usr/src/cmd/fm/dicts/AMD.dict
update: usr/src/cmd/fm/dicts/AMD.po
update: usr/src/cmd/fm/eversholt/common/check.c
update: usr/src/cmd/fm/eversholt/common/eftread.c
update: usr/src/cmd/fm/eversholt/common/eftread.h
update: usr/src/cmd/fm/eversholt/common/esclex.c
update: usr/src/cmd/fm/eversholt/common/escparse.y
update: usr/src/cmd/fm/eversholt/common/literals.h
update: usr/src/cmd/fm/eversholt/common/tree.c
update: usr/src/cmd/fm/eversholt/common/tree.h
update: usr/src/cmd/fm/eversholt/files/i386/i86pc/amd64.esc
update: usr/src/cmd/fm/modules/Makefile.plugin
update: usr/src/cmd/fm/modules/common/cpumem-retire/cma_main.c
update: usr/src/cmd/fm/modules/common/cpumem-retire/cma_page.c
update: usr/src/cmd/fm/modules/common/eversholt/Makefile
update: usr/src/cmd/fm/modules/common/eversholt/eval.c
update: usr/src/cmd/fm/modules/common/eversholt/fme.c
update: usr/src/cmd/fm/modules/common/eversholt/platform.c
update: usr/src/cmd/fm/schemes/mem/mem_unum.c
update: usr/src/cmd/mdb/common/modules/genunix/genunix.c
update: usr/src/cmd/mdb/common/modules/genunix/nvpair.c
update: usr/src/cmd/mdb/common/modules/genunix/nvpair.h
update: usr/src/cmd/mdb/common/modules/libnvpair/libnvpair.c
update: usr/src/cmd/mdb/i86pc/modules/amd_opteron/ao.c
update: usr/src/common/mc/mc-amd/mcamd_api.h
update: usr/src/common/mc/mc-amd/mcamd_misc.c
update: usr/src/common/mc/mc-amd/mcamd_patounum.c
update: usr/src/common/mc/mc-amd/mcamd_rowcol.c
update: usr/src/common/mc/mc-amd/mcamd_rowcol_impl.h
update: usr/src/common/mc/mc-amd/mcamd_rowcol_tbl.c
update: usr/src/common/mc/mc-amd/mcamd_unumtopa.c
update: usr/src/lib/fm/topo/libtopo/common/hc_canon.h
update: usr/src/lib/fm/topo/libtopo/common/mem.c
update: usr/src/lib/fm/topo/libtopo/common/topo_protocol.c
update: usr/src/lib/fm/topo/modules/i86pc/chip/chip.c
update: usr/src/lib/fm/topo/modules/i86pc/chip/chip.h
update: usr/src/pkgdefs/SUNWfmd/prototype_com
update: usr/src/uts/i86pc/Makefile.files
update: usr/src/uts/i86pc/Makefile.workarounds
update: usr/src/uts/i86pc/cpu/amd_opteron/ao.h
update: usr/src/uts/i86pc/cpu/amd_opteron/ao_cpu.c
update: usr/src/uts/i86pc/cpu/amd_opteron/ao_main.c
update: usr/src/uts/i86pc/cpu/amd_opteron/ao_mca.c
update: usr/src/uts/i86pc/cpu/amd_opteron/
update: usr/src/uts/i86pc/cpu/amd_opteron/ao_poll.c
update: usr/src/uts/i86pc/io/mc/mcamd.h
update: usr/src/uts/i86pc/io/mc/mcamd_drv.c
update: usr/src/uts/i86pc/io/mc/
update: usr/src/uts/i86pc/io/mc/mcamd_subr.c
update: usr/src/uts/i86pc/mc-amd/Makefile
update: usr/src/uts/i86pc/os/cmi.c
update: usr/src/uts/i86pc/os/cpuid.c
update: usr/src/uts/i86pc/sys/cpu_module.h
update: usr/src/uts/i86pc/sys/cpu_module_impl.h
update: usr/src/uts/intel/sys/fm/cpu/AMD.h
update: usr/src/uts/intel/sys/mc.h
update: usr/src/uts/intel/sys/mc_amd.h
update: usr/src/uts/intel/sys/mca_amd.h
update: usr/src/uts/intel/sys/x86_archext.h
create: usr/src/cmd/fm/modules/common/eversholt/eft_mdb.c
create: usr/src/uts/i86pc/io/mc/mcamd_dimmcfg.c
create: usr/src/uts/i86pc/io/mc/mcamd_dimmcfg.h
create: usr/src/uts/i86pc/io/mc/mcamd_dimmcfg_impl.h
create: usr/src/uts/i86pc/io/mc/mcamd_pcicfg.c
create: usr/src/uts/i86pc/io/mc/mcamd_pcicfg.h
These changes will appear in Nevada build 51, and are slated for backport to Solaris 10 Update 4 (which is a bit distant, but Update 3 is almost out the door and long-since frozen for project backports).

New RAS Features In Revision F

In terms of error handling the processor cores and attendent caches in revision F chips are substantially unchanged from earlier revisions. There are no new error types detected in these banks (icache, dcache, load-store unit, bus unit aka l2cache) so no additional error reports for us to raise or any change required to the diagnosis rules that consume those error reports.

While the switch to DDR2 is a substantial feature change for the AMD chip (and makes it even faster!) this really only affects how we discover memory configuration. The types of error and fault that DDR2 memory DIMMs can experience are much the same as for DDR1, and so the set of errors reported from memory are nearly identical across all revisions - revision F adds parity checking on the dram command and address lines, and otherwise detects and reports the same set of errors.

An Online Spare Chip-Select is introduced in revision F. The BIOS can choose (if the option is available and selected) to set aside a chip-select on each memory node to act as a standby should any other chip-select be determined to be bad. When a bad chip-select is diagnosed software (OS or BIOS) can write the details to the Online Spare Control Register to initiate a copy of the bad chip-select contents to the online spare, and to redirect all access to the bad chip-select to the spare. The copy proceeds at a huge rate (over 1GB/s) and does not interrupt system operation at all.

Revision F introduces some hardware support for counting ECC error occurences. The NorthBridge counts all ECC errors it sees (regardless of source on that memory node) in a 12-bit counter of the DRAM Errors Threshold Register. Software can set a starting value for this counter which will increment with every ECC error observed by the NorthBridge; when the counter overflows (tries to go beyond 4095) software will receive a preselected interrupt (either for the OS or for the BIOS, depending on setup). The interrupt handler can diagnose a dimm as faulty, perhaps, although it does not know from this information alone which DIMM(s) to suspect.

The Online Spare Control Register also exposes 16 4-bit counters - one for each of 8 possible chip-selects on each of dram channels A and B. Again software may initialize the counters and receive an interrupt when the counter overflows. With information from all chip-selects of each channel it is possible to narrow the source to a single rank (side) of a single DIMM.

An Overview of Code Changes

With processor cores substantially unchanged from our point-of-view, introducing error handling and fault management for cpu errors for these chips was pretty trivial - really just a question of allow the cpu module to initialize in ao_init.

The changes for the transition from DDR1 to DDR2 memory are all within the memory controller and dram controllers in the on-chip NorthBridge, seen by Solaris as additions and changes to the memory controller registers that it can quiz and set via PCI config space accesses. Replacing our previous #define collection for bitmasks and bitfields of the memory controller registers with data strucutres that describe the bitfields for all revisions and some MC_REV_\* and MCREG_\* macros for easy revision-test and structure access flattens out these differences when the mcamd_drv.c code uses these macros. Our mc-amd driver is reponsible for discovering the memory configuration details of a system, and also for performing error address to dimm resource translation. So with it now able to read the changed memory controller registers the next step was to teach it to interpret the values in the bitfields of those register as changed for revision F. Most of the work here is in teaching it how to resolve a dram address into row, column and internal bank addresses for the new revision but the error address to dimm resource translation algorithm also required changes to allow for the possible presence of an online spare chip-select.

The cpu module cpu.AuthenticAMD.15 for AMD family 15 (0xf) mostly required changes to allow for revision-dependent error detector bank configuration. It also was changed to initialize the online spare control register and NorthBridge ECC Error Threshold Register - more about that below - and to provide easier control over its hardware configuration activities for the three hardware scrubbers, the hardware watchdog, and the NorthBridge MCA Config register.

The mc-amd driver also grew to understand all possible DIMM configurations on the AMD platform. It achieves this through working out which socket type we're running on (754, 939, 940, AM2, F(1207), S1g1) and what class of DIMMs we are working with - normal, quadrank registered DIMMs, or quadrank SODIMM - and performing a table lookup. The result determines, for each chip-select base and mask pair, how the resulting logical DIMM is numbered and which physical chip-select lines are used in operating that chip-select (from channel, socket number and rank number). The particular table selected for lookup is also determined by whether mismatched DIMM support is present - unmatched DIMMs present in the channelA/channelB DIMM socket pairs. This information determines how we will number the DIMMs and ranks thereon in the topology that is built in the mc-amd driver and thereafter in full glory in the topology library. This is very important, since the memory controller registers only work in terms of nodes and chip-selects while we want to diagnose down to individual DIMM and rank thereon (requesting replacement of a particular DIMM is so much better than requesting replacement of a chip-select).

With detailed information on the ranks of each DIMM now available I changed the cpu/memory diagnosis topology to list the ranks of a DIMM, and the diagnosis rules to diagnose down to the individual DIMM rank. This was so as to facilitate a swap to any online spare chip-select when we diagnose a dimm fault (chip-selects are built from ranks, not DIMMs). The new cpu/memory topology is as follows:

This is a single-socket (chip=0) dual-core (cpu=0 and cpu=1 on chip 0) socket AM2 system with two DIMMs installed (dimm=0 and dimm=1) both dual-rank (rank=0 and rank=1 on each DIMM). The DIMMS are numbered 0 and 1 based on the DIMM configuration information discussed above - chip-select base/mask pair 0 is active for the 128-bit wide chip-select formed from the two rank=0 ranks, and chip-select base/mask pair 1 is active for the 128 bit wide chip-select formed from the two rank=1 ranks. All of this information, and more, is present as properties on the topology nodes. The following shows the properties of the memory-controller node, one active chip-select on it and the two DIMMs contributing to that chip-select (one from each channel) and the specific ranks on those two DIMMs that are used in the chip-select.
  group: memory-controller-properties   version: 1   stability: Private/Private
    num               uint64    0x0
    revision          uint64    0x20f0020
    revname           string    F
    socket            string    Socket AM2
    ecc-type          string    Normal 64/8
    base-addr         uint64    0x0
    lim-addr          uint64    0x7fffffff
    node-ilen         uint64    0x0
    node-ilsel        uint64    0x0
    cs-intlv-factor   uint64    0x2
    dram-hole-size    uint64    0x0
    access-width      uint64    0x80
    bank-mapping      uint64    0x2
    bankswizzle       uint64    0x0
    mismatched-dimm-support uint64    0x0

  group: chip-select-properties         version: 1   stability: Private/Private
    num               uint64    0x0
    base-addr         uint64    0x0
    mask              uint64    0x7ffeffff
    size              uint64    0x40000000
    dimm1-num         uint64    0x0
    dimm1-csname      string    MA0_CS_L[0]
    dimm2-num         uint64    0x1
    dimm2-csname      string    MB0_CS_L[0]

    num               uint64    0x0
    size              uint64    0x40000000

  group: rank-properties                version: 1   stability: Private/Private
    size              uint64    0x20000000
    csname            string    MA0_CS_L[0]
    csnum             uint64    0x0

    num               uint64    0x1
    size              uint64    0x40000000

  group: rank-properties                version: 1   stability: Private/Private
    size              uint64    0x20000000
    csname            string    MB0_CS_L[0]
    csnum             uint64    0x0

ECC Error Thresholding, And Lack Thereof

At boot, in functions nb_mcamisc_init and ao_sparectl_cfg , Solaris clears any BIOS-registered interrupt for ECC counter overflow and does not install its own. Thus we choose not to make use of the hardware ECC counting and thresholding mechanism, since we can already perform more advanced counting and diagnostic analysis in our diagnosis engine rules. The payload of ECC error ereports is modified to include the NorthBridge Threshold Register in case it should prove interesting in any required human analysis of the ereport logs (not normally required); the counts from the Online Spare Control Register are not in the payload (I may add them soon).

Further Work In Progress

The following are all in-progress in the FMA group at the moment, with the intention and hope of catching Solaris 10 Update 4 (some may miss if they prove to be more difficult than expected).

PCI/PCIE Diagnosis and Driver Hardening API

This is already in Nevada, since build 43. It provides both full PCI/PCIE diagnosis and an API with which drivers may harden themselves to various types of device error.

Topology Library Updates

This is approaching putback to Nevada. The topology library is one of the foundations of FMA as we move forward - it will be our repository for information on all things FMA.

More Advanced ECC Telemetry Analysis

In build 51 memory ECC telemetry is fed into simple SERD engines which "count" occurences and decay them over time (so events from months ago are not necessarily considered to match and aggravate occurences from seconds ago). We are currently developing more advanced diagnosis rules which will distinguish cases of isolated memory cell faults, a frequent error from something like a stuck pin or failed sdram in ChipKill mode, and will check which bits are in error to see if an uncorrectable error appears to be imminent.

CPU/Memory FRU Labelling for AMD Platforms

I touched on how we number cpus and memory above. How these relate to the actual FRU labels on a random AMD platform, if at all, is difficult to impossible to determine in software (notwithstanding, or perhaps because of, the efforts of SMBIOS etc!). So our "dimm0" may be silkscreened as "A0", "B0D0", "BOD3", "DIMM5" etc. With SMBIOS proving to be pretty useless in providing these labels accurately, nevermind associating them with the memory configuration one can discover through the memory controller we will resort to some form of hardcoded table for at least the AMD platforms that Sun has shipped.

CPU/Memory FRU Serial Numbers for AMD Platforms

Again, these are tricky to come by in a generic fashion but for Sun platforms we can teach our software the various ways of retrieving serial numbers. This will assist in replacement of the correct FRU, and in allowing the FMA software to detect when a faulted FRU has been replaced.

Technorati Tags: ,


Post a Comment:
Comments are closed for this entry.

I work in the Fault Management core group; this blog describes some of the work performed in that group.


« April 2014
Site Pages
Sun Bloggers

No bookmarks in folder