AMD Opteron/Athlon64/Turion64 Fault Management

Fault Management for Athlon64 and Opteron

In February we (the Solaris Kernel/RAS group) integrated the "fma x64" project into Solaris Nevada, delivering Fault Management for the AMD K8 family of chips (Athlon(TM) 64, Opteron(TM), Turion(TM) 64). This brings cpu and memory fault management on the Solaris AMD64 platform up to par with that already present on Sun's current SPARC platforms, and addresses one of the features most requested by customers (and potential customers) of Sun's impressive and growing AMD Opteron family of offerings (the project, of course, benefits all AMD 64-bit systems, not just those from Sun). We had planned to blog from the rooftops about the project back at integration time, but instead we all concentrated on addressing the sleep deficit that the last few hectic weeks of a big project had brought; since the putback to the Solaris 11 gate there has also been much effort in preparing the backport to Solaris 10 Update 2 (aka Solaris 10 06/06).

Well, it has already hit the streets now that Solaris Express Community Edition build 34 is available for download and the corresponding source is available at (around 315 files, search for "2006/020" in file history). There are a few bug fixes that will appear in build 36, but build 34 has all of the primary fault management functionality.

In this blog I'll attempt an overview of the project functionality, with some examples. In future entries I'll delve into some of the more subtle nuances. Before I begin I'll highlight something the project does not deliver: any significant improvement in machine error handling and fault diagnosis for Intel chips (i.e., anything more than a terse console message). This stuff is very platform/chip dependent, and since Sun has a number of AMD Opteron workstation and server offerings, with much more in the pipeline, they were the natural first target for enhanced support. The project also does not deliver hardened IO drivers and corresponding diagnosis - that is the subject of a follow-on project due in a couple of months.

AMD64 Platform Error Detection Architecture Overview

The following image shows the basics of a 2 chip dual-core (4 cores total) AMD system (with apologies to StarOffice power users):

Each chip may have more than one core (current AMD offerings have up to two cores). Each core has an associated on-chip level 1 instruction cache, level 1 data cache, and level 2 cache. There is also an on-chip memory controller (one per chip) which can control up to 8 DDR memory modules. All cores on all chips can access all memory, but the access is not uniform - accesses to "remote" memory involve a HyperTransport request to the owning node, which responds with the data. For historical reasons the functional unit within the chip that includes the memory controller, DRAM controller, HyperTransport logic, crossbar and no doubt a few other tricks is known as the "Northbridge".

This project is concerned with the initial handling of an error, marshalling of error telemetry from the cpu and memory components (a followup project, due in the next few months, will do the same for telemetry from the IO subsystem), and then consuming that telemetry to produce any appropriate diagnosis of any fault that is determined to be present. These chips have a useful array of error detectors, as described in the following table:

Functional Unit              Array                            Protection
Instruction Cache (IC)       Icache main tag array            Parity
                             Icache snoop tag array           Parity
                             Instruction L1 TLB               Parity
                             Instruction L2 TLB               Parity
                             Icache data array                Parity
Data Cache (DC)              Dcache main tag array            Parity
                             Dcache snoop tag array           Parity
                             Dcache L1 TLB                    Parity
                             Dcache L2 TLB                    Parity
                             Dcache data array                ECC
L2 Cache ("bus unit") (BU)   L2 cache main tag array          ECC and Parity
                             L2 cache data array              ECC
Northbridge (NB)             Memory controlled by this node   ECC (depends on DIMMs)

Table 1: SRAM and DRAM memory arrays

There are a number of other detectors present, notably in the Northbridge, but the above are the main sram and dram arrays.

If an error is recoverable then it does not raise a Machine Check Exception (MCE, or mc#) when detected. The recoverable errors, broadly speaking, are single-bit ECC errors from ECC-protected arrays and parity errors on clean parity-protected arrays such as the Icache and the TLBs (translation lookaside buffers - virtual to physical address translations). Instead of raising a mc#, a recoverable error simply logs error data into the machine check architecture registers of the detecting bank (IC/DC/BU/NB, plus one we don't mention in the table above, the Load-Store unit, LS) and the operating system (or an advanced BIOS implementation) can poll those registers to harvest the information. No special handling of the error is required (e.g., no need to flush caches, panic, etc).

If an error is irrecoverable then detection of that error will raise a machine check exception (if the bit that controls mc# for that error type is set; if not you'll either never know or you pick it up by polling). The mc# handler can extract information about the error from the machine check architecture registers as before, but has the additional responsibility of deciding what further actions (which may include panic and reboot) are required. A machine check exception is a form of interrupt which allows immediate notification of an error condition - you can't afford to wait to poll for the error since that could result in the use of bad data and associated data corruption.
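The status word each bank logs follows the x86 machine check architecture layout documented in AMD's BIOS and Kernel Developer's Guide. Here's a sketch of picking apart the high-order flag bits (in Python for illustration only - this is not the Solaris implementation):

```python
# Illustrative decode of an MCA bank status register (MCi_STATUS).
# Bit positions follow the x86 machine check architecture as documented in
# the AMD BIOS and Kernel Developer's Guide; only the high-order flag bits
# and the low-order error code are shown.

def decode_bank_status(status):
    """Return the interesting MCi_STATUS fields as a dict."""
    return {
        "val":   bool(status >> 63 & 1),  # logged error is valid
        "over":  bool(status >> 62 & 1),  # an earlier error was overwritten
        "uc":    bool(status >> 61 & 1),  # uncorrected (irrecoverable) error
        "en":    bool(status >> 60 & 1),  # reporting was enabled for this error
        "miscv": bool(status >> 59 & 1),  # MCi_MISC register contents valid
        "addrv": bool(status >> 58 & 1),  # MCi_ADDR holds a valid address
        "pcc":   bool(status >> 57 & 1),  # processor context corrupt
        "errcode": status & 0xffff,       # low 16 bits: MCA error code
    }

# The bank-status value from the example ereport later in this entry:
flags = decode_bank_status(0x9432400000000136)
```

Decoding that example value shows a valid, correctable (uc clear) error with a valid address logged - exactly the kind of event the poller harvests without any mc# being raised.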

Traditional Error Handling - The "Head in the Sand" Approach

The traditional operating system (all OS, not just Solaris) approach to errors in the x86 cpu architecture is as follows:

  • leave the BIOS to choose which error-detecting banks to enable, and which irrecoverable errors will raise a mc# when detected
  • if a mc# is raised the OS fields it and terminates with a useful diagnostic message such as "Machine Check Occurred", not even hinting at the affected resource
  • ignore recoverable errors, i.e., don't poll for their occurrence; a more advanced BIOS will perhaps poll for these errors but is not in a position to do anything about them while the OS is running
That is not unreasonable for the x86 cpus of years gone by, since they typically had very little in the way of data protection, either on-chip or for memory. But more recent chips have improved in this area: protection of the on-chip arrays is now common, as is ECC-protected main memory. With the sizes of on-chip memory arrays such as the L2 cache growing, especially for chips offered for typical server use, there is also all the more chance that defects will be introduced during manufacturing, subsequent handling, or while installed.

Recognising the increased need for error handling and fault management on the x86 platform, some operating systems have begun to offer limited support in this area. Solaris has been doing this for some time on sparc (let's just say the US-II E-cache disaster did have some good side-effects!) and so in Solaris we will offer the same well-rounded, end-to-end fault management on amd64 platforms that we already have on sparc.

A Better Approach - Sun's Fault Management Architecture "FMA"

In a previous blog entry I described the Sun Fault Management Architecture. Error events flow into a Fault Manager and associated Diagnosis Engines which may produce fault diagnoses which can be acted upon, not just for determining repair actions but also to isolate the fault before it affects system availability (e.g., to offline a cpu that is experiencing errors at a sustained rate). This architecture has been in use for some time now in the sparc world, and this project expands it to include AMD chips.

FMA for AMD64

To deliver FMA for AMD64 systems the project has:

  • made Solaris take responsibility (in addition to the BIOS) for deciding which error-detecting banks to enable and which error types will raise machine-check exceptions
  • taught Solaris how to recognize all the error types documented in the AMD Bios and Kernel Developer's Guide
  • delivered an intelligent machine-check exception handler and periodic poller (for recoverable errors) which collect all error data available, determine what error type has occurred and propagate it for logging, and take appropriate action (if any)
  • introduced cpu driver modules to the Solaris x86 kernel (as have existed on sparc for many years) so that features of a particular processor family (such as the AMD processors) may be specifically supported
  • introduced a memory-controller kernel driver module whose job it is to understand everything about the memory configuration of a node (e.g., to provide translation from a fault address to which dimm is affected)
  • developed rules for consuming the error telemetry with the "eft" diagnosis engine; these are written using the "eversholt" diagnosis language, and their task is to diagnose any faults that the incoming telemetry may indicate
  • delivered an enhanced "platform topology" library to describe the inter-relationship of the hardware components of a platform and to provide a repository for hardware component properties
An earlier putback, also as part of this project, introduced SMBIOS support to Solaris so that we can have some access to platform details (type, slot labelling, configuration, ...). That list seems pretty small for what was a considerable effort - there's lots of detail to each item.
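To give a feel for the memory-controller driver's job, here is a highly simplified sketch of translating a fault address to a DIMM via chip-select base/mask pairs. The register model here (a flat list of base/mask pairs, one per chip-select) is an illustrative assumption - the real K8 DRAM controller registers add interleaving, memory-hole hoisting and other complications:

```python
# Highly simplified sketch of fault-address-to-DIMM translation.
# A (base, mask) pair per chip-select is an illustrative assumption; the
# real AMD K8 DRAM CS Base/Mask registers involve interleave and hoisting
# complications not modelled here.

def addr_to_chip_select(addr, cs_regs):
    """Return the index of the chip-select covering addr, or None.

    An address matches chip-select i when its bits outside the mask
    equal the corresponding base bits.
    """
    for i, (base, mask) in enumerate(cs_regs):
        if (addr & ~mask) == (base & ~mask):
            return i
    return None

# Two hypothetical 1GB chip-selects, back to back.
cs_regs = [
    (0x00000000, 0x3fffffff),   # CS0: 0x00000000 - 0x3fffffff
    (0x40000000, 0x3fffffff),   # CS1: 0x40000000 - 0x7fffffff
]
```

Once the chip-select is known, platform data (e.g., from SMBIOS) maps it to a DIMM slot label for the diagnosis.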

With all this in place we are now able to diagnose the following fault classes:

Fault Class                  Description
fault.cpu.amd.dcachedata     DC data array fault
fault.cpu.amd.dcachetag      DC main tag array fault
fault.cpu.amd.dcachestag     DC snoop tag array fault
fault.cpu.amd.l1dtlb         DC L1 TLB fault
fault.cpu.amd.l2dtlb         DC L2 TLB fault
fault.cpu.amd.icachedata     IC data array fault
fault.cpu.amd.icachetag      IC main tag array fault
fault.cpu.amd.icachestag     IC snoop tag array fault
fault.cpu.amd.l1itlb         IC L1 TLB fault
fault.cpu.amd.l2itlb         IC L2 TLB fault
fault.cpu.amd.l2cachedata    L2 data array fault
fault.cpu.amd.l2cachetag     L2 tag array fault
fault.memory.page            Individual page fault
fault.memory.dimm_sb         A DIMM experiencing sustained excessive single-bit errors
fault.memory.dimm_ck         A DIMM with ChipKill-correctable multiple-bit faults
fault.memory.dimm_ue         A DIMM with an uncorrectable (not even with ChipKill, if present and enabled) multiple-bit fault
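Several of these diagnoses (fault.memory.dimm_sb in particular, and the cpu cache faults) hinge on the rate of correctable errors rather than on any single event. That decision is made by SERD (Soft Error Rate Discrimination) engines in the diagnosis rules: a fault is declared only when more than N events arrive within a time window T. A toy sketch of the idea (the threshold values below are invented for illustration; the real ones live in the eversholt rules):

```python
from collections import deque

# Toy SERD (Soft Error Rate Discrimination) engine: declare a fault only
# when more than N events occur within a sliding window of T seconds.
# The N and T values used below are invented for illustration; the real
# thresholds are encoded in the eversholt diagnosis rules.

class Serd:
    def __init__(self, n, t):
        self.n, self.t = n, t
        self.times = deque()

    def record(self, when):
        """Record an event at time 'when'; return True if the engine fires."""
        self.times.append(when)
        # Discard events that have aged out of the window.
        while self.times and when - self.times[0] > self.t:
            self.times.popleft()
        return len(self.times) > self.n

# Tolerate up to 3 single-bit errors per 24 hours; a 4th within the
# window fires the engine and a fault is diagnosed.
serd = Serd(n=3, t=24 * 3600)
```

Occasional correctable errors are expected and tolerated; it is the sustained rate that indicates a fault.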

An Example - A CPU With Single-bit Errors

The system is a v40z with hostname 'parity' (we also have other cheesy hostnames such as 'chipkill', 'crc', 'hamming' etc!). It has 4 single-core Opteron cpus. If we clear all fault management history and let it run for a while (or give it a little load to speed things up) we very soon see the following message on the console:

SUNW-MSG-ID: AMD-8000-5M, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Wed Mar 15 08:06:08 PST 2006
PLATFORM: i86pc, CSN: -, HOSTNAME: parity
SOURCE: eft, REV: 1.16
EVENT-ID: 1e8e6d7a-fd3a-e4a4-c211-dece00e68c65
DESC: The number of errors associated with this CPU has exceeded acceptable levels.
Refer to for more information.
AUTO-RESPONSE: An attempt will be made to remove this CPU from service.
IMPACT: Performance of this system may be affected.
REC-ACTION: Schedule a repair procedure to replace the affected CPU.
Use fmdump -v -u <EVENT-ID> to identify the module.
Running the indicated command we see that cpu 3 has a fault:
# fmdump -v -u 1e8e6d7a-fd3a-e4a4-c211-dece00e68c65
TIME                 UUID                                 SUNW-MSG-ID
Mar 15 08:06:08.5797 1e8e6d7a-fd3a-e4a4-c211-dece00e68c65 AMD-8000-5M
  100%  fault.cpu.amd.l2cachedata

        Problem in: hc:///motherboard=0/chip=3/cpu=0
           Affects: cpu:///cpuid=3
               FRU: hc:///motherboard=0/chip=3
That tells us the resource affected (chip 3, cpu core 0), its logical identifier (cpuid 3, as used in psrinfo etc), and the field replaceable unit that should be replaced (the chip - you can't replace an individual core). In future we intend to extract FRU labelling information from SMBIOS, but at the moment there are difficulties with SMBIOS data, and the accuracy thereof, that make that harder than it should be.

If you didn't see or notice the console message then running fmadm faulty highlights resources that have been diagnosed as faulty:

# fmadm faulty
-------- ----------------------------------------------------------------------
 faulted cpu:///cpuid=3
-------- ----------------------------------------------------------------------
Already in Solaris 11, and coming to Solaris 10 Update 2, is SNMP trap support for FMA fault events, which provides another avenue by which you can become aware of a newly-diagnosed fault.

We can see the automated response that was performed upon making the diagnosis of a cpu fault:

# psrinfo
0       on-line   since 03/11/2006 00:27:08
1       on-line   since 03/11/2006 00:27:08
2       on-line   since 03/10/2006 23:28:51
3       faulted   since 03/15/2006 08:06:08
The faulted resource has been isolated by offlining the cpu. If you reboot then the cache of faults will cause the cpu to be offlined again.

Note that the event id appears in the fmadm faulty output, so you can formulate the fmdump command line shown in the console message if you wish, and visit the knowledge article site to enter the indicated SUNW-MSG-ID (quick aside: we have some people working on beefing up the amd64 knowledge articles there; the current ones are pretty uninformative). We can also use the event id to see what error reports led to this diagnosis:

# fmdump -e -u 1e8e6d7a-fd3a-e4a4-c211-dece00e68c65
TIME                 CLASS
Mar 15 08:05:18.1624 ereport.cpu.amd.bu.l2d_ecc1     
Mar 15 08:04:48.1624 ereport.cpu.amd.bu.l2d_ecc1     
Mar 15 08:04:48.1624 ereport.cpu.amd.dc.inf_l2_ecc1  
Mar 15 08:06:08.1624 ereport.cpu.amd.dc.inf_l2_ecc1  
The -e option selects dumping of the error log instead of the fault log, so we can see the error telemetry that led to the diagnosis. We see that in the space of a couple of minutes this cpu experienced 4 single-bit errors from the L2 cache - we are happy to tolerate occasional single-bit errors but not at this rate, so we diagnose a fault. If we add the -V option we can see the full error report contents, for example for the last ereport above:
Mar 15 2006 08:06:08.162418201 ereport.cpu.amd.dc.inf_l2_ecc1
nvlist version: 0
        class = ereport.cpu.amd.dc.inf_l2_ecc1
        ena = 0x62a5aaa964f00c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = hc
                hc-list = (array of embedded nvlists)
                (start hc-list[0])
                nvlist version: 0
                        hc-name = motherboard
                        hc-id = 0
                (end hc-list[0])
                (start hc-list[1])
                nvlist version: 0
                        hc-name = chip
                        hc-id = 3
                (end hc-list[1])
                (start hc-list[2])
                nvlist version: 0
                        hc-name = cpu
                        hc-id = 0
                (end hc-list[2])

        (end detector)

        bank-status = 0x9432400000000136
        bank-number = 0x0
        addr = 0x5f76a9ac0
        addr-valid = 1
        syndrome = 0x64
        syndrome-type = E
        ip = 0x0
        privileged = 1
        __ttl = 0x1
        __tod = 0x44183b70 0x9ae4e19
One day we'll teach fmdump (or some new command) to mark all that stuff up into human-readable output. For now it shows the raw(ish) telemetry read from the machine check architecture registers when we polled for this event. This telemetry is consumed by the diagnosis rules to produce any appropriate fault diagnosis.
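As a taste of what that mark-up might look like, here's a sketch that condenses the interesting fields of an ereport like the one above into a one-line summary (the field names are taken from the fmdump -eV output; the summary format itself is invented for illustration):

```python
# Sketch of condensing an ereport's interesting fields into one line.
# The field names match the fmdump -eV output shown above; the summary
# format itself is an invention for illustration.

def summarize_ereport(er):
    det = er["detector"]
    # Build the hc path, e.g. motherboard=0/chip=3/cpu=0
    path = "/".join(f"{e['hc-name']}={e['hc-id']}" for e in det["hc-list"])
    addr = f" addr=0x{er['addr']:x}" if er.get("addr-valid") else ""
    return f"{er['class']} at {path}{addr} syndrome=0x{er['syndrome']:x}"

# The example ereport from above, as a plain dictionary.
ereport = {
    "class": "ereport.cpu.amd.dc.inf_l2_ecc1",
    "detector": {"hc-list": [
        {"hc-name": "motherboard", "hc-id": 0},
        {"hc-name": "chip", "hc-id": 3},
        {"hc-name": "cpu", "hc-id": 0},
    ]},
    "addr": 0x5f76a9ac0, "addr-valid": 1,
    "syndrome": 0x64,
}
line = summarize_ereport(ereport)
```

In the real system the telemetry travels as an nvlist rather than a dictionary, of course, but the flattening idea is the same.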

We're Not Done Yet

There are a number of additional features that we'd like to bring to amd64 fault management. For example:

  • use more SMBIOS info (on platforms that have SMBIOS support, and which give accurate data!) to discover FRU labelling etc
  • introduce serial number support so that we can detect when a cpu or dimm has been replaced (currently you have to perform a manual fmadm repair)
  • introduce some form of communication (if only one-way) between the service processor (on systems that have such a thing, such as the current Sun AMD offerings) and the diagnosis software
  • extend the diagnosis rules to perform more complex predictive diagnosis for DIMM errors based on what we have learned on sparc
Many of these are complicated by the historical lack of architectural standards for the "PC" platform. We have a solid start, but there's still plenty of functional and usability features we'd like to add.



I work in the Fault Management core group; this blog describes some of the work performed in that group.

