Generic Machine Check Architecture (MCA) In Solaris

The work described below was integrated into Solaris Nevada way back in August 2007 - build 76; it has since been backported to Solaris 10. It's never too late to blog about things! Actually, I just want to separate this description from the entry that will follow - Solaris x86 xVM Fault Management.

Why Generic MCA?

In past blogs I have described x86 cpu and memory fault management feature-additions for specific processor types: AMD Opteron family 0xf revisions B-E, and AMD Opteron family 0xf revisions F and G. At the time of the first AMD work Sun was not shipping any Intel x64 systems; since then, of course, Sun has famously begun a partnership with Intel and so we needed to look at offering fault management support for our new Intel-based platforms.

Both AMD and Intel base their processor error handling and reporting on the generic Machine Check Architecture (MCA). AMD public documentation actually details and classifies error detectors and error types well beyond what generic MCA allows, while Intel public documentation generally maintains the abstract nature of MCA by not publishing model-specific details. Suppose we observe an error on MCA bank number 2 of a processor with status register MC2_STATUS reading 0xd40040000000017a. In generic MCA that will be classified as "GCACHEL2_EVICT_ERR in bank 2" - a generic transaction type experienced a level-2 cache error during a cache eviction event; other fields of the status register indicate that this event has model-specific error code 0, whether the error was hardware-corrected and so on. In the case of Intel processors, public documentation does not generally describe additional classification of the error - for example it does not document which processor unit "bank 2" might correspond to, nor what model-specific error code 0 for the above error might tell us. In the AMD case all that detail is available, and we could further classify the error: it was detected by the "bus unit" (bank 2), and an extended error code (part of the MCA model-specific error code) of 0 on an evict transaction indicates a "Data Copyback" single-bit ECC error from the L2 cache.
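To make the generic classification concrete, here is a minimal Python sketch that decodes the MC2_STATUS value quoted above. It assumes the architectural MCi_STATUS layout (flag bits 63 down to 57, model-specific error code in bits 31:16, MCA error code in bits 15:0) and the cache-hierarchy compound code pattern 0000 0001 RRRR TTLL. The function and field names are made up for illustration and are not taken from the Solaris source.

```python
# Illustrative decode of an MCi_STATUS value; names here are hypothetical.
TT = {0b00: "I", 0b01: "D", 0b10: "G"}                 # transaction type
LL = {0b00: "L0", 0b01: "L1", 0b10: "L2", 0b11: "LG"}  # cache level
RRRR = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
        5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}  # request type


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into its generic MCA fields."""
    def bit(n):
        return bool((status >> n) & 1)

    code = status & 0xFFFF
    result = {
        "valid": bit(63),
        "overflow": bit(62),
        "uncorrected": bit(61),
        "enabled": bit(60),
        "miscv": bit(59),
        "addrv": bit(58),
        "pcc": bit(57),
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": code,
        "mnemonic": None,  # only filled in for cache-hierarchy errors
    }
    # Cache-hierarchy compound errors: 0000 0001 RRRR TTLL
    if (code >> 8) == 0x01:
        tt = TT.get((code >> 2) & 0x3)
        ll = LL[code & 0x3]
        rrrr = RRRR.get((code >> 4) & 0xF)
        if tt is not None and rrrr is not None:
            result["mnemonic"] = f"{tt}CACHE{ll}_{rrrr}_ERR"
    return result


info = decode_mci_status(0xd40040000000017a)
print(info["mnemonic"], "uncorrected:", info["uncorrected"])
```

The sketch only recognises the cache-hierarchy family of error codes; other families (TLB, bus/interconnect, memory controller) would need their own patterns.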

Our first AMD support in Solaris included support for all this additional classification detail, and involved a substantial amount of work for each new processor revision. In the case of AMD family 0xf revision F processors (the first to support DDR2 memory) we were not able to get the required support back into Solaris 10 in time for the first Sun systems using those processors! We began to realize that this detailed model-specific approach was never going to scale - that was probably obvious at the beginning, but our SPARC roots had not prepared us for how quickly the x86 world rolls out new processor revisions! When the additional requirement to support Intel processors was made, we soon decided it was time to adopt a generic MCA approach.

We also realized that, with a few notable exceptions, we were not actually making any real use of the additional detail available in the existing AMD implementations. For example, our diagnosis algorithms for the error above would simply count the error as a corrected error from the L2 Cache data array - and all that information was available via generic MCA. So the design was to implement most support in a generic MCA module and to layer model-specific support on top of it, adding value from model-specific details where we can put them to good use - for diagnosis, for surviving an error that generic information alone would not let us survive, and so on.

What Does Solaris Generic MCA Support Look Like?

It may not have been a blog, but I have documented this before - in the Generic MCA FMA Portfolio documentation (we prepare such a portfolio for all FMA work). In particular, the Generic MCA Philosophy document describes how errors are classified, what error report class is used for each type of generic MCA error, what ereport payload content is included for each, etc.

Generic MCA Preparation For Virtualization - A Handle On Physical CPUs

In the generic MCA project we also decided to prepare the way for x86 cpu and memory fault handling in virtualized environments, such as Solaris x86 xVM Server. Our existing cpu module API at the time was all modelled around the assumption that there was a fixed, one-to-one relationship between Solaris logical cpus (as reported in psrinfo for example) and real, physical processor execution resources. If running as a dom0, however, we may be limited to 2 virtual cpus (vcpus) while the underlying physical hardware may have 16 physical cpus (say 8 dual-core chips). While Solaris dom0 may bind a software thread to a vcpu, it is actually the hypervisor that schedules vcpus onto real physical cpus (pcpus), so binding to a vcpu does not imply binding to a pcpu. If you're binding in order to read a few hardware registers from what should be the same cpu for all reads, you're going to be disappointed if the hypervisor switches you from one pcpu to another midway! And, anyway, you'd be lucky if the hypervisor let you succeed in reading such registers at all - and you wouldn't know whether any values you read had been modified by the hypervisor!

So the generic MCA project, already reworking the cpu module interface very substantially, chose to add a "handle" argument to most cpu module interfaces. A handle is associated with each physical (chip-id, core-id, strand-id) processor instance, and a caller must look up and pass the handle for the processor it wishes to operate on. This applies throughout the cpu module interface and the model-specific overlay interface, as well as in kernel modules delivering cpu module implementations or model-specific implementations.
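The handle idea can be sketched in a few lines of Python. This is only an illustration of the pattern - operations name a physical (chip-id, core-id, strand-id) instance explicitly instead of relying on "whatever cpu I happen to be running on". The class and function names are hypothetical and do not mirror the real kernel interfaces, which are written in C.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuHandle:
    """Hypothetical handle naming one physical processor instance."""
    chip_id: int
    core_id: int
    strand_id: int


class HandleRegistry:
    """Hypothetical table of handles keyed by (chip, core, strand)."""

    def __init__(self):
        self._handles = {}

    def register(self, chip_id, core_id, strand_id):
        key = (chip_id, core_id, strand_id)
        return self._handles.setdefault(key, CpuHandle(*key))

    def lookup(self, chip_id, core_id, strand_id):
        # Returns None when no such physical cpu is known.
        return self._handles.get((chip_id, core_id, strand_id))

    def __len__(self):
        return len(self._handles)


def read_bank_status(handle, bank, backend):
    """Read a bank status register for the cpu named by the handle.

    'backend' stands in for whatever really performs the read (native
    access, or a request to a hypervisor); here it is just a dict.
    """
    if handle is None:
        raise ValueError("no handle for the requested physical cpu")
    return backend[(handle, bank)]


# 8 dual-core chips with one strand per core: 16 physical cpus.
registry = HandleRegistry()
for chip in range(8):
    for core in range(2):
        registry.register(chip, core, 0)

hdl = registry.lookup(2, 1, 0)
fake_backend = {(hdl, 2): 0xd40040000000017a}
print(hex(read_bank_status(hdl, 2, fake_backend)))
```

Because every call quotes a handle, the same interface can be served natively or through a hypervisor without the caller caring which vcpu it is currently scheduled on.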

Supported Platforms With Generic MCA in Solaris

The generic MCA project raises the baseline level of MCA support for all x86 processors. Before this project Solaris had detailed MCA support for AMD Opteron family 0xf, and very poor support for anything else - little more than dumping some raw telemetry to the console on fatal events, with no telemetry raised and therefore no logging or diagnosis of telemetry. With generic MCA we raise telemetry and diagnose all correctable and uncorrectable errors (the latter provided the hardware does not simply reset, not allowing the OS any option of gathering telemetry) on all x86 platforms, regardless of chip manufacturer or system vendor.

The MCA does not, however, cover all aspects of cpu/memory fault management:

  • Not all cpu and memory error reporting necessarily falls under the MCA umbrella. Most notably, Intel memory-controller hub (MCH) systems have an off-chip memory-controller which can raise notifications through MCA, but most of its telemetry is available through device registers rather than through the registers that are part of the MCA.
  • The MCA does a pretty complete job of classifying within-chip errors down to a unit that you can track errors on (e.g., down to the instruction cache data and tag arrays). It does not classify off-chip errors other than generically, and additional model-specific support is required to interpret such errors. For example, AMD Opteron has an on-chip memory-controller which reports under the MCA, but generic MCA can only classify such an error as far as "memory error, external to the chip": model-specific support can refine the classification to recognise a single-bit ECC error affecting bit 56 at a particular address, and a partnering memory-controller driver could resolve that address down to a memory dimm and rank thereon.
  • For full fault management we require access to additional information about resources that we are performing diagnosis on. For example, we might diagnose that the chip with HyperTransport ID 2 has a bad L2 Cache - but to complete the picture it is nice to provide a FRU ("field-replaceable unit") label such as "CPU_2" that matches the service labelling on the system itself, and ideally also a serial number associated with the FRU.

So while all platforms have equal generic MCA capabilities, some platforms are more equal than others once additional value-add functionality is taken into account:

Memory-Controller Drivers

Some platforms have corresponding memory-controller drivers. Today we have such drivers for AMD family 0xf (mc-amd), the Intel 5000/5400/7300 chipset series (intel_nb5000), and Intel Nehalem systems (intel_nhm). These drivers provide at minimum an address-to-memory-resource translation service that lets us translate an error address (say 0x10344540) to a memory resource (e.g., chip 3, dimm 2, rank 1). They also supply memory topology information. Such drivers operate on all systems built with the chipset they implement support for, whoever the vendor.

Model-specific MCA Support

Some platforms have additional model-specific MCA support layered on top of the generic MCA code. "Legacy" AMD family 0xf support is maintained unchanged by a model-specific layer which is permitted to rewrite the generic ereport classes into the more-specific classes used in the original MCA work, and to add ereport payload members that are non-architectural, such as ECC syndrome. We also added a "generic AMD" support module which applies in the absence of any more-specific support; right now this applies to all AMD families after 0xf, i.e. 0x10 and 0x11. This module permits the recognition and diagnosis of all memory errors even in the absence of a memory-controller driver for the platform. The Intel Nehalem support taught the Intel module to recognise additional Nehalem-specific error types and handle them appropriately. Such model-specific support applies regardless of vendor.
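The layering described above might be pictured roughly as follows. This Python sketch is an assumption-laden illustration: the generic code always builds an ereport, and an optional model-specific overlay may rewrite its class and add payload members. The class strings, the overlay class name, and the syndrome value are all placeholders invented for the example, not real Solaris ereport classes or data.

```python
def generic_ereport(bank, mnemonic, corrected):
    """Build an ereport from generic MCA information only."""
    return {
        "class": "example.generic." + mnemonic.lower(),  # placeholder class
        "payload": {"bank": bank, "corrected": corrected},
    }


class ExampleAmdOverlay:
    """Hypothetical model-specific layer for one processor family."""

    def rewrite(self, ereport, raw):
        refined = dict(ereport)
        refined["payload"] = dict(ereport["payload"])
        # Use model-specific knowledge: bank 2 is the bus unit, and
        # extended code 0 on this error is an L2 data copyback ECC error.
        if ereport["payload"]["bank"] == 2 and raw.get("extended_code") == 0:
            refined["class"] = "example.amd.bu.l2-data-copyback-ecc"
        # Add a non-architectural payload member when it is available.
        if "syndrome" in raw:
            refined["payload"]["syndrome"] = raw["syndrome"]
        return refined


def build_ereport(bank, mnemonic, corrected, raw, overlay=None):
    """Generic classification first, then the optional overlay."""
    ereport = generic_ereport(bank, mnemonic, corrected)
    if overlay is not None:
        ereport = overlay.rewrite(ereport, raw)
    return ereport


raw = {"extended_code": 0, "syndrome": 0x1f}  # made-up raw telemetry
plain = build_ereport(2, "GCACHEL2_EVICT_ERR", True, raw)
refined = build_ereport(2, "GCACHEL2_EVICT_ERR", True, raw,
                        overlay=ExampleAmdOverlay())
print(plain["class"])
print(refined["class"], refined["payload"])
```

With no overlay present the generic ereport is still complete enough to diagnose, which is the point of raising the baseline for every x86 processor.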

FRU Labelling

Some platforms have cpu and memory FRU label support, i.e., the ability to diagnose "CPU2 DIMM3" as bad instead of something like "chip=2/memory-controller=0/dimm=3", which actually requires platform schematics to map to a particular dimm with absolute certainty. For many reasons, this support is delivered via hardcoded rules in XML map files. Solaris today only delivers such information for Sun x64 systems; support for additional systems, including non-Sun platforms, is trivial to add in the XML - the tricky part is in obtaining the schematics etc. to perform the mapping of chip/dimm instances to FRU labels.
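The effect of such label rules can be illustrated with a small Python lookup. The real rules live in XML map files whose format is not shown here; the dictionary below is a stand-in with a single hypothetical entry, and the function names are invented for the example.

```python
def parse_resource(path):
    """Turn 'chip=2/memory-controller=0/dimm=3' into a dict of ints."""
    return {key: int(value)
            for key, value in (part.split("=") for part in path.split("/"))}


# Hypothetical per-platform rules: (chip, dimm) -> label on the chassis.
EXAMPLE_LABEL_RULES = {
    (2, 3): "CPU2 DIMM3",
}


def fru_label(path, rules=EXAMPLE_LABEL_RULES):
    """Return the service label for a resource, or the raw path if the
    platform has no label rule for it."""
    resource = parse_resource(path)
    key = (resource.get("chip"), resource.get("dimm"))
    return rules.get(key, path)


print(fru_label("chip=2/memory-controller=0/dimm=3"))
print(fru_label("chip=1/memory-controller=0/dimm=0"))
```

Falling back to the raw resource path mirrors the situation on platforms without label support: the diagnosis is still made, but mapping it to a physical part takes the schematics.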

FRU Serial Numbers

Some platforms have memory DIMM serial number support. Again we only deliver such support for Sun x64 systems, and how this is done varies a little (on Intel the memory-controller driver reads the DIMM SPD proms itself; on AMD we let the service processor do that and read the records from the service processor).

What?! - No Examples?

I'll leave a full worked example to my next blog entry, where I'll demonstrate the full story under Solaris xVM x86 hypervisor.

Posted by guest on September 26, 2008 at 10:10 AM EST #

I work in the Fault Management core group; this blog describes some of the work performed in that group.

