FMA Support For AMD Opteron rev F
By user12652883 on Oct 24, 2006
In my last blog entry I discussed the initial support in Solaris for fault management on AMD Opteron, Athlon 64 and Turion 64 processor systems (more accurately known as AMD Family 0xf, or the K8 family). That support applied to revisions E and earlier of those chips, which were current at the time, and has since been backported to Solaris 10 Update 2. In this entry I will discuss recent changes to support "revision F", and I'll follow up in the next few weeks with a side-by-side comparison of our error handling and fault management with that of one or two other OS offerings.
In August this year AMD introduced its "Next Generation AMD Opteron processor", which has a bunch of cool features (such as AMD Virtualization technology and a socket-compatible upgrade path to Quad-Core chips when they release), but the main change as far as error handling and fault management go is the switch to DDR2 memory. This next generation is also known as revision F (of AMD family 0xf) and sometimes also as the "socket F" (really F(1207)) and "socket AM2" processors; the initial Solaris support for AMD fault management does not apply to revision F (the cpu module driver would choose not to initialize for revisions beyond E).
If you want to avoid the detail below, here is the quick summary. Solaris now supports revision F in Nevada build 51, and we are targeting Solaris 10 Update 4 next year. The support works pretty much as for earlier revisions (most code shared) - identically for cpu diagnosis, and with minor changes in memory diagnosis.
Revision F Support in Nevada Build 51
In early October I putback support for rev F, together with a number of improvements and bugfixes to the original project. Below is the putback notification including both the bugids and the files changed - most files that implement the AMD fault management support were touched to some degree, so this provides a handy list for anyone wanting to explore more deeply:
    Event:            putback-to
    Parent workspace: /ws/onnv-gate
                      (elpaso:/ws/onnv-gate)
    Child workspace:  /net/tb2.uk.sun.com/u/gavinm/revF-clone4pb/onnv-dev
                      (tb2.uk.sun.com:/u/gavinm/revF-clone4pb/onnv-dev)
    User:             gavinm

    Comment:
    PSARC 2006/564 FMA for Athlon 64 and Opteron Rev F/G Processors
    PSARC 2006/566 eversholt language enhancements: confprop_defined
    6362846 eversholt doesn't allow dashes in pathname components
    6391591 AMD NB config should not set NbMcaToMstCpuEn
    6391605 AMD DRAM scrubber should be disabled when errata #99 applies
    6398506 memory controller driver should not bother to attach at all on rev F
    6424822 FMA needs to support AMD family 0xf revs F and G
    6443847 FMA x64 multibit ChipKill rules need to follow MQSC guidelines
    6443849 Accrue inf_sys and s_ecc ECC errors against memory
    6443858 mc-amd can free unitsrtr before usage in subsequent error path
    6443891 mc-amd does not recognise mismatched dimm support
    6455363 x86 error injector should allow addr option for most errors
    6455370 Opteron erratum 101 only applies on revs D and earlier
    6455373 Identify chip-select lines used on a dimm
    6455377 improve x64 quadrank dimm support
    6455382 add generic interfaces for amd chip revision and package/socket type
    6468723 mem scheme fmri containment test for hc scheme is busted
    6473807 eversholt could use some mdb support
    6473811 eversholt needs a confprop_defined function
    6473819 eversholt should show version of rules active in DE
    6475302 ::nvlist broken by some runtime link ordering changes

    Files:
    update: usr/closed/cmd/mtst/x86/common/opteron/opt.h
    update: usr/closed/cmd/mtst/x86/common/opteron/opt_common.c
    update: usr/closed/cmd/mtst/x86/common/opteron/opt_main.c
    update: usr/closed/cmd/mtst/x86/common/opteron/opt_nb.c
    update: usr/src/cmd/fm/dicts/AMD.dict
    update: usr/src/cmd/fm/dicts/AMD.po
    update: usr/src/cmd/fm/eversholt/common/check.c
    update: usr/src/cmd/fm/eversholt/common/eftread.c
    update: usr/src/cmd/fm/eversholt/common/eftread.h
    update: usr/src/cmd/fm/eversholt/common/esclex.c
    update: usr/src/cmd/fm/eversholt/common/escparse.y
    update: usr/src/cmd/fm/eversholt/common/literals.h
    update: usr/src/cmd/fm/eversholt/common/tree.c
    update: usr/src/cmd/fm/eversholt/common/tree.h
    update: usr/src/cmd/fm/eversholt/files/i386/i86pc/amd64.esc
    update: usr/src/cmd/fm/modules/Makefile.plugin
    update: usr/src/cmd/fm/modules/common/cpumem-retire/cma_main.c
    update: usr/src/cmd/fm/modules/common/cpumem-retire/cma_page.c
    update: usr/src/cmd/fm/modules/common/eversholt/Makefile
    update: usr/src/cmd/fm/modules/common/eversholt/eval.c
    update: usr/src/cmd/fm/modules/common/eversholt/fme.c
    update: usr/src/cmd/fm/modules/common/eversholt/platform.c
    update: usr/src/cmd/fm/schemes/mem/mem_unum.c
    update: usr/src/cmd/mdb/common/modules/genunix/genunix.c
    update: usr/src/cmd/mdb/common/modules/genunix/nvpair.c
    update: usr/src/cmd/mdb/common/modules/genunix/nvpair.h
    update: usr/src/cmd/mdb/common/modules/libnvpair/libnvpair.c
    update: usr/src/cmd/mdb/i86pc/modules/amd_opteron/ao.c
    update: usr/src/common/mc/mc-amd/mcamd_api.h
    update: usr/src/common/mc/mc-amd/mcamd_misc.c
    update: usr/src/common/mc/mc-amd/mcamd_patounum.c
    update: usr/src/common/mc/mc-amd/mcamd_rowcol.c
    update: usr/src/common/mc/mc-amd/mcamd_rowcol_impl.h
    update: usr/src/common/mc/mc-amd/mcamd_rowcol_tbl.c
    update: usr/src/common/mc/mc-amd/mcamd_unumtopa.c
    update: usr/src/lib/fm/topo/libtopo/common/hc_canon.h
    update: usr/src/lib/fm/topo/libtopo/common/mem.c
    update: usr/src/lib/fm/topo/libtopo/common/topo_protocol.c
    update: usr/src/lib/fm/topo/modules/i86pc/chip/chip.c
    update: usr/src/lib/fm/topo/modules/i86pc/chip/chip.h
    update: usr/src/pkgdefs/SUNWfmd/prototype_com
    update: usr/src/uts/i86pc/Makefile.files
    update: usr/src/uts/i86pc/Makefile.workarounds
    update: usr/src/uts/i86pc/cpu/amd_opteron/ao.h
    update: usr/src/uts/i86pc/cpu/amd_opteron/ao_cpu.c
    update: usr/src/uts/i86pc/cpu/amd_opteron/ao_main.c
    update: usr/src/uts/i86pc/cpu/amd_opteron/ao_mca.c
    update: usr/src/uts/i86pc/cpu/amd_opteron/ao_mca_disp.in
    update: usr/src/uts/i86pc/cpu/amd_opteron/ao_poll.c
    update: usr/src/uts/i86pc/io/mc/mcamd.h
    update: usr/src/uts/i86pc/io/mc/mcamd_drv.c
    update: usr/src/uts/i86pc/io/mc/mcamd_off.in
    update: usr/src/uts/i86pc/io/mc/mcamd_subr.c
    update: usr/src/uts/i86pc/mc-amd/Makefile
    update: usr/src/uts/i86pc/os/cmi.c
    update: usr/src/uts/i86pc/os/cpuid.c
    update: usr/src/uts/i86pc/sys/cpu_module.h
    update: usr/src/uts/i86pc/sys/cpu_module_impl.h
    update: usr/src/uts/intel/sys/fm/cpu/AMD.h
    update: usr/src/uts/intel/sys/mc.h
    update: usr/src/uts/intel/sys/mc_amd.h
    update: usr/src/uts/intel/sys/mca_amd.h
    update: usr/src/uts/intel/sys/x86_archext.h
    create: usr/src/cmd/fm/modules/common/eversholt/eft_mdb.c
    create: usr/src/uts/i86pc/io/mc/mcamd_dimmcfg.c
    create: usr/src/uts/i86pc/io/mc/mcamd_dimmcfg.h
    create: usr/src/uts/i86pc/io/mc/mcamd_dimmcfg_impl.h
    create: usr/src/uts/i86pc/io/mc/mcamd_pcicfg.c
    create: usr/src/uts/i86pc/io/mc/mcamd_pcicfg.h

These changes will appear in Nevada build 51, and are slated for backport to Solaris 10 Update 4 (which is a bit distant, but Update 3 is almost out the door and long-since frozen for project backports).
New RAS Features In Revision F
In terms of error handling the processor cores and attendant caches in revision F chips are substantially unchanged from earlier revisions. There are no new error types detected in these banks (icache, dcache, load-store unit, bus unit aka l2cache), so there are no additional error reports for us to raise, nor any change required to the diagnosis rules that consume those error reports.
While the switch to DDR2 is a substantial feature change for the AMD chip (and makes it even faster!) this really only affects how we discover memory configuration. The types of error and fault that DDR2 memory DIMMs can experience are much the same as for DDR1, and so the set of errors reported from memory are nearly identical across all revisions - revision F adds parity checking on the dram command and address lines, and otherwise detects and reports the same set of errors.
An Online Spare Chip-Select is introduced in revision F. The BIOS can choose (if the option is available and selected) to set aside a chip-select on each memory node to act as a standby should any other chip-select be determined to be bad. When a bad chip-select is diagnosed, software (OS or BIOS) can write the details to the Online Spare Control Register to initiate a copy of the bad chip-select contents to the online spare, and to redirect all access to the bad chip-select to the spare. The copy proceeds at a huge rate (over 1GB/s) and does not interrupt system operation at all.
Revision F introduces some hardware support for counting ECC error occurrences. The NorthBridge counts all ECC errors it sees (regardless of source on that memory node) in a 12-bit counter of the DRAM Errors Threshold Register. Software can set a starting value for this counter, which will increment with every ECC error observed by the NorthBridge; when the counter overflows (tries to go beyond 4095) software will receive a preselected interrupt (either for the OS or for the BIOS, depending on setup). The interrupt handler can diagnose a DIMM as faulty, perhaps, although it does not know from this information alone which DIMM(s) to suspect.
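To make the counter arithmetic concrete, here is a small sketch in C. It is only a model of the behaviour described above, not Solaris code: the function names are invented for illustration, and what the hardware does to the counter value after an overflow is not modelled.

```c
#include <assert.h>
#include <stdint.h>

#define ECC_CNT_BITS 12u
#define ECC_CNT_MAX  ((1u << ECC_CNT_BITS) - 1u)   /* 4095 */

/*
 * Hypothetical helper: starting value to program so that the
 * n_errors-th ECC error is the one that overflows the 12-bit counter.
 */
uint32_t
ecc_threshold_preset(uint32_t n_errors)
{
	assert(n_errors >= 1 && n_errors <= ECC_CNT_MAX + 1u);
	return ((ECC_CNT_MAX + 1u) - n_errors);
}

/*
 * Model of one observed ECC error.  Returns 1 if this error tried to
 * push the counter beyond 4095 (the point at which the interrupt
 * would be raised), otherwise increments the counter and returns 0.
 */
int
ecc_count_error(uint32_t *counter)
{
	if (*counter >= ECC_CNT_MAX)
		return (1);
	(*counter)++;
	return (0);
}
```

For example, a preset of 4086 leaves room for nine silent increments, so the tenth error is the one that overflows.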
The Online Spare Control Register also exposes 16 4-bit counters - one for each of 8 possible chip-selects on each of dram channels A and B. Again software may initialize the counters and receive an interrupt when the counter overflows. With information from all chip-selects of each channel it is possible to narrow the source to a single rank (side) of a single DIMM.
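As a rough illustration of how per-chip-select counts narrow things down, the sketch below picks the busiest (channel, chip-select) pair from sixteen 4-bit counts. The array layout and names are assumptions made for the example; they do not reflect the actual register encoding, and mapping the result to a DIMM rank still depends on the DIMM configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NCHAN 2   /* dram channels A and B */
#define NCS   8   /* possible chip-selects per channel */

struct cs_hit {
	int chan;        /* 0 = channel A, 1 = channel B, -1 = none */
	int cs;          /* chip-select number, -1 = none */
	unsigned count;  /* 4-bit count observed */
};

/*
 * Hypothetical helper: given counts already read out into a
 * [channel][chip-select] array, return the pair with the highest
 * count.  All-zero input yields chan == -1 and cs == -1.
 */
struct cs_hit
hottest_cs(const uint8_t counts[NCHAN][NCS])
{
	struct cs_hit best = { -1, -1, 0 };

	assert(counts != NULL);
	for (int ch = 0; ch < NCHAN; ch++) {
		for (int cs = 0; cs < NCS; cs++) {
			unsigned v = counts[ch][cs] & 0xfu;  /* 4-bit value */
			if (v > best.count) {
				best.chan = ch;
				best.cs = cs;
				best.count = v;
			}
		}
	}
	return (best);
}
```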
An Overview of Code Changes
With processor cores substantially unchanged from our point of view, introducing error handling and fault management for cpu errors for these chips was pretty trivial - really just a question of allowing the cpu module to initialize in ao_init.
The changes for the transition from DDR1 to DDR2 memory are all within the memory controller and DRAM controllers in the on-chip NorthBridge, seen by Solaris as additions and changes to the memory controller registers that it can quiz and set via PCI config space accesses. Replacing the #define collection for bitmasks and bitfields of the memory controller registers with data structures that describe the bitfields for all revisions, together with MCREG_* macros for easy revision-test and structure access, flattens out these differences when the mcamd_drv.c code uses these macros.
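The general idea of describing a register field per revision and hiding the revision test behind an accessor can be sketched as follows. This is not the MCREG_* implementation: the field, its bit positions and all names here are made up purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical revision classes; the real code distinguishes more. */
enum rev_class { REV_PRE_F = 0, REV_F_PLUS = 1, REV_NCLASS };

/* Position and width of one bitfield within a 32-bit register. */
struct bitfield {
	uint8_t shift;
	uint8_t width;
};

/*
 * An invented register field whose layout differs between revision
 * classes.  The shift/width values are illustrative only.
 */
static const struct bitfield demo_field[REV_NCLASS] = {
	[REV_PRE_F]  = { 4, 3 },
	[REV_F_PLUS] = { 8, 4 },
};

/* Extract the demo field from a raw register value for a revision. */
uint32_t
demo_field_get(enum rev_class rev, uint32_t regval)
{
	assert((unsigned)rev < REV_NCLASS);
	const struct bitfield *f = &demo_field[rev];
	assert(f->width > 0 && f->width < 32);
	return ((regval >> f->shift) & ((UINT32_C(1) << f->width) - 1));
}
```

Callers then ask for the field by name and revision and never spell out a mask or shift themselves, which is what keeps revision differences out of the driver logic.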
The mc-amd driver is responsible for discovering the memory configuration details of a system, and also for performing error address to DIMM resource translation. So with it now able to read the changed memory controller registers, the next step was to teach it to interpret the values in the bitfields of those registers as changed for revision F. Most of the work here is in teaching it how to resolve a DRAM address into row, column and internal bank for the new revision, but the error address to DIMM resource translation algorithm also required changes to allow for the possible presence of an online spare chip-select.
The cpu module cpu.AuthenticAMD.15 for AMD family 15 (0xf) was changed to allow for revision-dependent error detector bank configuration. It was also changed to initialize the Online Spare Control Register and NorthBridge ECC Error Threshold Register - more about that below - and to provide easier control over its hardware configuration activities for the three hardware scrubbers, the hardware watchdog, and the NorthBridge MCA Config register.
The mc-amd driver also grew to understand all possible DIMM configurations on the AMD platform. It achieves this through working out which socket type we're running on (754, 939, 940, AM2, F(1207), S1g1) and what class of DIMMs we are working with - normal, quadrank registered DIMMs, or quadrank SODIMM - and performing a table lookup. The result determines, for each chip-select base and mask pair, how the resulting logical DIMM is numbered and which physical chip-select lines are used in operating that chip-select (from channel, socket number and rank number). The particular table selected for lookup is also determined by whether mismatched DIMM support is present - unmatched DIMMs present in the channelA/channelB DIMM socket pairs. This information determines how we will number the DIMMs and ranks thereon in the topology that is built by the mc-amd driver and thereafter in full glory in the topology library. This is very important, since the memory controller registers only work in terms of nodes and chip-selects while we want to diagnose down to individual DIMM and rank thereon (requesting replacement of a particular DIMM is so much better than requesting replacement of a chip-select).
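A keyed table lookup of this kind might be sketched as below. The structure and function names are hypothetical and this is not the contents of the real driver tables; the single row is taken from the socket AM2 property output shown later in this entry, and any other combination simply misses.

```c
#include <assert.h>
#include <stddef.h>

enum socket_type { SKT_754, SKT_939, SKT_940, SKT_AM2, SKT_F1207, SKT_S1G1 };
enum dimm_class { DIMM_NORMAL, DIMM_QUADRANK_REG, DIMM_QUADRANK_SODIMM };

/* One rank contributing to a chip-select. */
struct cs_rank {
	int dimm;             /* logical DIMM number */
	int rank;             /* rank on that DIMM */
	const char *csname;   /* physical chip-select line */
};

/* Hypothetical table row: lookup key followed by the result. */
struct cs_map {
	enum socket_type socket;
	enum dimm_class cls;
	int mismatched;       /* mismatched DIMM support enabled? */
	int csnum;            /* chip-select base/mask pair number */
	struct cs_rank chan_a, chan_b;
};

static const struct cs_map cs_table[] = {
	/* From the AM2 example output: chip-select 0, 128-bit wide. */
	{ SKT_AM2, DIMM_NORMAL, 0, 0,
	    { 0, 0, "MA0_CS_L" }, { 1, 0, "MB0_CS_L" } },
};

/* Return the matching row, or NULL if the table has no entry. */
const struct cs_map *
cs_lookup(enum socket_type socket, enum dimm_class cls, int mismatched,
    int csnum)
{
	assert(csnum >= 0);
	for (size_t i = 0; i < sizeof (cs_table) / sizeof (cs_table[0]); i++) {
		const struct cs_map *m = &cs_table[i];
		if (m->socket == socket && m->cls == cls &&
		    m->mismatched == mismatched && m->csnum == csnum)
			return (m);
	}
	return (NULL);
}
```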
With detailed information on the ranks of each DIMM now available I changed the cpu/memory diagnosis topology to list the ranks of a DIMM, and the diagnosis rules to diagnose down to the individual DIMM rank. This was so as to facilitate a swap to any online spare chip-select when we diagnose a dimm fault (chip-selects are built from ranks, not DIMMs). The new cpu/memory topology is as follows:
    hc:///motherboard=0/chip=0/cpu=0
    hc:///motherboard=0/chip=0/cpu=1
    hc:///motherboard=0/chip=0/memory-controller=0
    hc:///motherboard=0/chip=0/memory-controller=0/dram-channel=0
    hc:///motherboard=0/chip=0/memory-controller=0/dram-channel=1
    hc:///motherboard=0/chip=0/memory-controller=0/chip-select=0
    hc:///motherboard=0/chip=0/memory-controller=0/chip-select=1
    hc:///motherboard=0/chip=0/memory-controller=0/dimm=0
    hc:///motherboard=0/chip=0/memory-controller=0/dimm=0/rank=0
    hc:///motherboard=0/chip=0/memory-controller=0/dimm=0/rank=1
    hc:///motherboard=0/chip=0/memory-controller=0/dimm=1
    hc:///motherboard=0/chip=0/memory-controller=0/dimm=1/rank=0
    hc:///motherboard=0/chip=0/memory-controller=0/dimm=1/rank=1

This is a single-socket (chip=0) dual-core (cpu=0 and cpu=1 on chip 0) socket AM2 system with two DIMMs installed (dimm=0 and dimm=1) both dual-rank (rank=0 and rank=1 on each DIMM). The DIMMS are numbered 0 and 1 based on the DIMM configuration information discussed above - chip-select base/mask pair 0 is active for the 128-bit wide chip-select formed from the two rank=0 ranks, and chip-select base/mask pair 1 is active for the 128 bit wide chip-select formed from the two rank=1 ranks. All of this information, and more, is present as properties on the topology nodes. The following shows the properties of the memory-controller node, one active chip-select on it and the two DIMMs contributing to that chip-select (one from each channel) and the specific ranks on those two DIMMs that are used in the chip-select.

    hc:///motherboard=0/chip=0/memory-controller=0
      group: memory-controller-properties version: 1 stability: Private/Private
        num                      uint64    0x0
        revision                 uint64    0x20f0020
        revname                  string    F
        socket                   string    Socket AM2
        ecc-type                 string    Normal 64/8
        base-addr                uint64    0x0
        lim-addr                 uint64    0x7fffffff
        node-ilen                uint64    0x0
        node-ilsel               uint64    0x0
        cs-intlv-factor          uint64    0x2
        dram-hole-size           uint64    0x0
        access-width             uint64    0x80
        bank-mapping             uint64    0x2
        bankswizzle              uint64    0x0
        mismatched-dimm-support  uint64    0x0

    hc:///motherboard=0/chip=0/memory-controller=0/chip-select=0
      group: chip-select-properties version: 1 stability: Private/Private
        num                      uint64    0x0
        base-addr                uint64    0x0
        mask                     uint64    0x7ffeffff
        size                     uint64    0x40000000
        dimm1-num                uint64    0x0
        dimm1-csname             string    MA0_CS_L
        dimm2-num                uint64    0x1
        dimm2-csname             string    MB0_CS_L

    hc:///motherboard=0/chip=0/memory-controller=0/dimm=0
        num                      uint64    0x0
        size                     uint64    0x40000000

    hc:///motherboard=0/chip=0/memory-controller=0/dimm=0/rank=0
      group: rank-properties version: 1 stability: Private/Private
        size                     uint64    0x20000000
        csname                   string    MA0_CS_L
        csnum                    uint64    0x0

    hc:///motherboard=0/chip=0/memory-controller=0/dimm=1
        num                      uint64    0x1
        size                     uint64    0x40000000

    hc:///motherboard=0/chip=0/memory-controller=0/dimm=1/rank=0
      group: rank-properties version: 1 stability: Private/Private
        size                     uint64    0x20000000
        csname                   string    MB0_CS_L
        csnum                    uint64    0x0
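One way to read the chip-select base-addr and mask properties is sketched below. It assumes that bits set in the mask are "don't care" address bits and that the remaining bits must equal the base; that reading is an assumption of this sketch rather than something stated above, though it is consistent with the numbers shown: the mask 0x7ffeffff has 30 bits set, and 2^30 is the reported size of 0x40000000. The function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Does addr fall within the chip-select described by base/mask?
 * Assumption: mask bits are ignored, all other bits must match base.
 */
int
cs_match(uint64_t addr, uint64_t base, uint64_t mask)
{
	return ((addr & ~mask) == (base & ~mask));
}

/* Count the bits set in v. */
unsigned
popcount64(uint64_t v)
{
	unsigned n = 0;

	while (v != 0) {
		v &= v - 1;   /* clear the lowest set bit */
		n++;
	}
	return (n);
}

/* Bytes covered by a chip-select: one address per mask combination. */
uint64_t
cs_size(uint64_t mask)
{
	unsigned bits = popcount64(mask);

	assert(bits < 64);
	return (UINT64_C(1) << bits);
}
```

Under this reading, chip-select 0 in the example claims every address below 2GB that has bit 16 clear, which fits a chip-select interleave factor of 2 with the other chip-select taking the addresses that have bit 16 set.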
ECC Error Thresholding, And Lack Thereof
At boot, in functions nb_mcamisc_init and ao_sparectl_cfg, Solaris clears any BIOS-registered interrupt for ECC counter overflow and does not install its own. Thus we choose not to make use of the hardware ECC counting and thresholding mechanism, since we can already perform more advanced counting and diagnostic analysis in our diagnosis engine rules. The payload of ECC error ereports is modified to include the NorthBridge Threshold Register in case it should prove interesting in any required human analysis of the ereport logs (not normally required); the counts from the Online Spare Control Register are not in the payload (I may add them soon).
Further Work In Progress
The following are all in progress in the FMA group at the moment, with the intention and hope of catching Solaris 10 Update 4 (some may miss if they prove to be more difficult than expected).
PCI/PCIE Diagnosis and Driver Hardening API
This is already in Nevada, since build 43. It provides both full PCI/PCIE diagnosis and an API with which drivers may harden themselves to various types of device error.
Topology Library Updates
This is approaching putback to Nevada. The topology library is one of the foundations of FMA as we move forward - it will be our repository for information on all things FMA.
More Advanced ECC Telemetry Analysis
In build 51 memory ECC telemetry is fed into simple SERD engines which "count" occurrences and decay them over time (so events from months ago are not necessarily considered to match and aggravate occurrences from seconds ago). We are currently developing more advanced diagnosis rules which will distinguish cases of isolated memory cell faults from a frequent error caused by something like a stuck pin or failed SDRAM in ChipKill mode, and which will check which bits are in error to see if an uncorrectable error appears to be imminent.
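The counting-with-decay idea can be illustrated with a small sliding-window model: the engine "fires" once N events have been seen within a time window T, and older events fall out of consideration. This is a simplified sketch with invented names, not the fault manager's implementation, and it assumes timestamps never go backwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SERD_MAX_N 16

/* Sliding-window event counter: fire on n events within time t. */
struct serd {
	unsigned n;                    /* events needed to fire */
	uint64_t t;                    /* window length, in caller's time units */
	uint64_t stamps[SERD_MAX_N];   /* ring of the last n event times */
	unsigned count;                /* stamps recorded so far, capped at n */
	unsigned head;                 /* next ring slot to overwrite */
};

void
serd_init(struct serd *s, unsigned n, uint64_t t)
{
	assert(n >= 1 && n <= SERD_MAX_N);
	s->n = n;
	s->t = t;
	s->count = 0;
	s->head = 0;
}

/*
 * Record an event at time 'now'.  Returns true if the most recent n
 * events, including this one, all fall within a window of length t.
 */
bool
serd_record(struct serd *s, uint64_t now)
{
	s->stamps[s->head] = now;
	s->head = (s->head + 1) % s->n;
	if (s->count < s->n)
		s->count++;
	if (s->count < s->n)
		return (false);
	/* Ring is full: the slot at head holds the oldest of the last n. */
	return (now - s->stamps[s->head] <= s->t);
}
```

With n = 3 and t = 10, events at times 0, 100 and 105 do not fire because the first is too old, but a further event at 108 does, since 100, 105 and 108 all fall within ten time units.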
CPU/Memory FRU Labelling for AMD Platforms
I touched on how we number cpus and memory above. How these relate to the actual FRU labels on a random AMD platform, if at all, is difficult to impossible to determine in software (notwithstanding, or perhaps because of, the efforts of SMBIOS etc!). So our "dimm0" may be silkscreened as "A0", "B0D0", "BOD3", "DIMM5" etc. With SMBIOS proving to be pretty useless in providing these labels accurately, never mind associating them with the memory configuration one can discover through the memory controller, we will resort to some form of hardcoded table for at least the AMD platforms that Sun has shipped.
CPU/Memory FRU Serial Numbers for AMD Platforms
Again, these are tricky to come by in a generic fashion but for Sun platforms we can teach our software the various ways of retrieving serial numbers. This will assist in replacement of the correct FRU, and in allowing the FMA software to detect when a faulted FRU has been replaced.