Tuesday Sep 16, 2008

New! Solaris x86 xVM CPU & Memory Fault Handling

We recently completed the first phase of re-enabling traditional Solaris Fault Management features on x86 systems running the Solaris xVM Hypervisor; this support was integrated into Solaris Nevada build 99, and will from there find its way into a future version of the xVM Server product (post 1.0).

In such systems prior to build 99 there is essentially no hardware error handling or fault diagnosis. Our existing Solaris features in this area were designed for native ("bare metal") environments in which the Solaris OS instance has full access to and control over the underlying hardware. When a type 1 hypervisor is inserted it becomes the privileged owner of the hardware, and native error handling and fault management capabilities of a dom0 operating system such as Solaris are nullified.

The two most important sources of hardware errors for our purposes are cpu/memory and IO (PCI/PCI-E). This first phase restores cpu/memory error handling and fault diagnosis to near-native capabilities. A future phase will restore IO fault diagnosis.

Both AMD and Intel systems are equally supported in this implementation. The examples below are from an AMD lab system - details on an Intel system will differ trivially.

A Worked Example - A Full Fault Lifecycle Illustration

We'll follow a cpu fault through its full lifecycle. I'm going to include sample output from administrative commands only, and I'm not going to look "under the hood" - the intention is to show how simple this is and hide the under-the-hood complexities!

I will use my favourite lab system - a Sun Fire v40z server with 4 x AMD Opteron, one of which has a reliable fault that will produce no end of correctable instruction cache and l2 cache single-bit errors. I've used this host "parity" in a past blog some time ago - if you contrast the procedure and output there with that below you'll (hopefully!) see we've come a long way.

The administrative interface for Solaris Fault Management is unchanged with this new feature - you now just run fmadm on a Solaris dom0 instead of natively. The output below is from host "parity" dom0 (in an i86xpv Solaris xVM boot).

Notification Of Fault Diagnosis - (dom0) Console, SNMP

Prior to diagnosis of a fault you'll see nothing on the system console. Errors are being logged in dom0 and analysed, but we'll only trouble the console once a fault is diagnosed (console spam is bad). So after a period of uptime on this host we see the following on the console:

SUNW-MSG-ID: AMD-8000-7U, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Mon Sep 15 20:54:06 PDT 2008
PLATFORM: Sun Fire V40z, CSN: XG051535088, HOSTNAME: parity
SOURCE: eft, REV: 1.16
EVENT-ID: 8db62dbe-a091-e222-a6b4-dd95f015b4cc
DESC: The number of errors associated with this CPU has exceeded acceptable
	levels.  Refer to http://sun.com/msg/AMD-8000-7U for more information.
AUTO-RESPONSE: An attempt will be made to remove this CPU from service.
IMPACT: Performance of this system may be affected.
REC-ACTION: Schedule a repair procedure to replace the affected CPU.
	Use 'fmadm faulty' to identify the CPU.
If you've configured your system to raise SNMP traps on fault diagnosis you'll be notified by whatever software has been plumbed in to that mechanism, too.

Why, you ask, does the console message not tell us which cpu is faulty? Well that's because the convoluted means by which we generate these messages (convoluted to facilitate localization of messages in localized versions of Solaris) has restricted us to a limited amount of dynamic content in the messages - an improvement in this area is in the works.

Show Current Fault State - fmadm faulty

For now we follow the advice and run fmadm faulty on dom0.

dom0# fmadm faulty   
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Sep 15 20:54:06 8db62dbe-a091-e222-a6b4-dd95f015b4cc  AMD-8000-7U    Major    

Fault class : fault.cpu.amd.icachedata
Affects     : hc://:product-id=Sun-Fire-V40z:chassis-id=XG051535088:server-id=parity/motherboard=0/chip=3/core=0/strand=0
                  faulted and taken out of service
FRU         : "CPU 3" (hc://:product-id=Sun-Fire-V40z:chassis-id=XG051535088:server-id=parity/motherboard=0/chip=3)

Description : The number of errors associated with this CPU has exceeded
              acceptable levels.  Refer to http://sun.com/msg/AMD-8000-7U for
              more information.

Response    : An attempt will be made to remove this CPU from service.

Impact      : Performance of this system may be affected.

Action      : Schedule a repair procedure to replace the affected CPU.  Use
              'fmadm faulty' to identify the CPU.
Note the faulted and taken out of service - we have diagnosed this resource as faulty and have offlined the processor at the hypervisor level. This system supports cpu and memory FRU labels so all that hc://.... mumbo-jumbo can be ignored - the thing to replace is the processor in the socket labelled "CPU 3" (either on the board or on the service sticker affixed to the chassis lid, or both - depends on the system).
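If you ever do want to pick the hc://.... string apart (in a script that post-processes fmadm output, say), it is just an authority section followed by a path of name=instance pairs. The following is a minimal sketch only: parse_hc_fmri is a hypothetical helper written for this illustration, not part of any Solaris library, and it assumes the simple layout seen in the output above.

```python
def parse_hc_fmri(fmri):
    """Split an hc-scheme FMRI into (authority dict, hardware path list).

    Hypothetical helper for illustration; assumes the layout shown in
    the fmadm output above and nothing more exotic.
    """
    prefix = "hc://"
    if not fmri.startswith(prefix):
        raise ValueError("not an hc-scheme FMRI: %r" % fmri)

    # Everything up to the first "/" is the authority, the rest is the path.
    authority_part, _, path_part = fmri[len(prefix):].partition("/")

    # Authority looks like ":product-id=...:chassis-id=...:server-id=..."
    authority = dict(
        item.split("=", 1) for item in authority_part.split(":") if item
    )

    # Path components run from the outermost container inwards.
    path = []
    for segment in path_part.split("/"):
        if segment:
            name, instance = segment.split("=", 1)
            path.append((name, int(instance)))
    return authority, path


fmri = ("hc://:product-id=Sun-Fire-V40z:chassis-id=XG051535088"
        ":server-id=parity/motherboard=0/chip=3/core=0/strand=0")
authority, path = parse_hc_fmri(fmri)
print(authority["server-id"])
print(path)
```

The path makes the chip/core/strand hierarchy explicit: the faulted resource is strand 0 of core 0 on chip 3, and the FRU is the enclosing chip.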

Replace The Bad CPU

Schedule downtime to replace the chip in location "CPU 3". Software can't do this for you!

Declare Your Replacement Action

dom0# fmadm replaced "CPU 3"
fmadm: recorded replacement of CPU 3
This is an unfortunate requirement: x86 systems do not support cpu serial numbers (the privacy world rioted when Intel included such functionality and enabled it by default), and so we cannot auto-detect the replacement of a cpu FRU. On sparc systems, and on Sun x86 systems for memory dimms, we do have FRU serial number support and this step is not required.

When the replacement is declared as above (or for FRU types that have serial numbers, at reboot following FRU replacement) the console will confirm that all aspects of this fault case have been addressed. At this time the cpu has been brought back online at the hypervisor level.

Did You See Any xVM In That?

No, you didn't! - native/xVM support is seamless. The steps and output above did not involve any xVM-specific steps or verbiage. Solaris is running as a dom0, really!

dom0# uname -a
SunOS parity 5.11 on-fx-dev i86pc i386 i86xpv
Actually there are a few differences elsewhere. Most notable is that psrinfo output will not change when the physical cpu is offlined at the hypervisor level:
dom0# psrinfo   
0       on-line   since 09/09/2008 03:25:15
1       on-line   since 09/09/2008 03:25:19
2       on-line   since 09/09/2008 03:25:19
3       on-line   since 09/09/2008 03:25:19
That's because psrinfo is reporting the state of virtual cpus and not physical cpus. We plan to integrate functionality to query and control the state of physical cpus, but that is yet to come.

Another Example - Hypervisor Panic

In this case the processor suffers an uncorrected data cache error, losing hypervisor state. The hypervisor handles the exception, dispatches telemetry towards dom0, and initiates a panic. The system was booted under Solaris xVM at the time of the error:

Xen panic[dom=0xffff8300e319c080/vcpu=0xffff8300e3de8080]: Hypervisor state lost due to machine check exception.

	rdi: ffff828c801e7348 rsi: ffff8300e2e09d48 rdx: ffff8300e2e09d78
	rcx: ffff8300e3dfe1a0  r8: ffff828c801dd5fa  r9:        88026e07f
	rax: ffff8300e2e09e38 rbx: ffff828c8026c800 rbp: ffff8300e2e09e28
	r10: ffff8300e319c080 r11: ffff8300e2e09cb8 r12:                0
	r13:        a00000400 r14: ffff828c801f12e8 r15: ffff8300e2e09dd8
	fsb:                0 gsb: ffffff09231ae580  ds:               4b
	 es:               4b  fs:                0  gs:              1c3
	 cs:             e008 rfl:              282 rsp: ffff8300e2e09d30
	rip: ffff828c80158eef:  ss:             e010
	cr0: 8005003b  cr2: fffffe0307699458  cr3: 6dfeb5000  cr4:      6f0
Xen panic[dom=0xffff8300e319c080/vcpu=0xffff8300e3de8080]: Hypervisor state lost due to machine check exception.

ffff8300e2e09e28 xpv:mcheck_cmn_handler+302
ffff8300e2e09ee8 xpv:k8_machine_check+22
ffff8300e2e09f08 xpv:machine_check_vector+21
ffff8300e2e09f28 xpv:do_machine_check+1c
ffff8300e2e09f48 xpv:early_page_fault+77
ffff8300e2e0fc48 xpv:smp_call_function_interrupt+86
ffff8300e2e0fc78 xpv:call_function_interrupt+2b
ffff8300e2e0fd98 xpv:do_mca+9f0
ffff8300e2e0ff18 xpv:syscall_enter+67
ffffff003bda1b90 unix:xpv_int+46 ()
ffffff003bda1bc0 unix:cmi_hdl_int+42 ()
ffffff003bda1d20 specfs:spec_ioctl+86 ()
ffffff003bda1da0 genunix:fop_ioctl+7b ()
ffffff003bda1eb0 genunix:ioctl+174 ()
ffffff003bda1f00 unix:brand_sys_syscall32+328 ()

syncing file systems... done
ereport.cpu.amd.dc.data_eccm ena=f0b88edf0810001 detector=[ version=0 scheme=
 "hc" hc-list=[ hc-name="motherboard" hc-id="0" hc-name="chip" hc-id="1"
 hc-name="core" hc-id="0" hc-name="strand" hc-id="0" ] ] compound_errorname=
 "DCACHEL1_DRD_ERR" disp="unconstrained,forcefatal" IA32_MCG_STATUS=7
 machine_check_in_progress=1 ip=0 privileged=0 bank_number=0 bank_msr_offset=
 400 IA32_MCi_STATUS=b600a00000000135 overflow=0 error_uncorrected=1
 error_enabled=1 processor_context_corrupt=1 error_code=135
 model_specific_error_code=0 IA32_MCi_ADDR=2e7872000 syndrome=1 syndrome-type=
 "E" __injected=1

dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
100% done
Note that as part of the panic flow dom0 has retrieved the fatal error telemetry from the hypervisor address space, and it dumps the error report to the console. Hopefully we won't have to interpret that output - it's only there in case the panic fails (e.g., no dump device). The ereport is also written to the system dump device at a well-known location, and the fault management service will retrieve this telemetry when it restarts on reboot.
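Should you ever need to interpret that output by hand, most of the payload members are simply the architectural fields of the IA32_MCi_STATUS value broken out individually. Here is a rough sketch of that breakdown, using the architectural MCA bit positions; the function name and the "valid", "misc_valid" and "addr_valid" keys are my own labels for this illustration, while the remaining keys mirror the ereport member names above.

```python
# Architectural flag bits in the top of an IA32_MCi_STATUS value.
STATUS_FLAG_BITS = {
    "valid": 63,                      # label chosen for this sketch
    "overflow": 62,
    "error_uncorrected": 61,
    "error_enabled": 60,
    "misc_valid": 59,                 # label chosen for this sketch
    "addr_valid": 58,                 # label chosen for this sketch
    "processor_context_corrupt": 57,
}


def decode_status_flags(status):
    """Break a 64-bit MCi_STATUS value into its architectural fields."""
    fields = {name: (status >> bit) & 1
              for name, bit in STATUS_FLAG_BITS.items()}
    fields["error_code"] = status & 0xFFFF                  # bits 15:0
    fields["model_specific_error_code"] = (status >> 16) & 0xFFFF  # bits 31:16
    return fields


fields = decode_status_flags(0xb600a00000000135)
for name in ("overflow", "error_uncorrected", "error_enabled",
             "processor_context_corrupt"):
    print(name, fields[name])
print("error_code", format(fields["error_code"], "x"))
```

For the status value in the ereport above this reproduces overflow=0, error_uncorrected=1, error_enabled=1, processor_context_corrupt=1 and error_code=135: an uncorrected error with processor context corrupt, which is why the hypervisor had no choice but to panic.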

On the subsequent reboot we see:

v3.1.4-xvm chgset 'Sun Sep 07 13:00:23 2008 -0700 15898:354d7899bf35'
SunOS Release 5.11 Version on-fx-dev 64-bit
Copyright 1983-2008 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
DEBUG enabled
(xVM) Xen trace buffers: initialized
Hostname: parity
NIS domain name is scalab.sfbay.sun.com
Reading ZFS config: done.
Mounting ZFS filesystems: (9/9)

parity console login:
(xVM) Prepare to bring CPU1 down...

SUNW-MSG-ID: AMD-8000-AV, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Mon Sep 15 23:05:21 PDT 2008
PLATFORM: Sun Fire V40z, CSN: XG051535088, HOSTNAME: parity
SOURCE: eft, REV: 1.16
EVENT-ID: f107e323-9355-4f70-dd98-b73ba7881cc5
DESC: The number of errors associated with this CPU has exceeded acceptable levels.  Refer to http://sun.com/msg/AMD-8000-AV for more information.
AUTO-RESPONSE: An attempt will be made to remove this CPU from service.
IMPACT: Performance of this system may be affected.
REC-ACTION: Schedule a repair procedure to replace the affected CPU.  Use 'fmadm faulty' to identify the module.
(xVM) CPU 1 is now offline
Pretty cool - huh? We diagnosed a terminal MCA event that caused a hypervisor panic and system reset. You'll only see the xVM-prefixed messages if you have directed hypervisor serial console output to dom0 console. At this stage we can see the fault in fmadm:
dom0# fmadm faulty -f
"CPU 1" (hc://:product-id=Sun-Fire-V40z:chassis-id=XG051535088:server-id=parity/motherboard=0/chip=1) faulty
f107e323-9355-4f70-dd98-b73ba7881cc5 1 suspects in this FRU total certainty 100%

Description : The number of errors associated with this CPU has exceeded
              acceptable levels.  Refer to http://sun.com/msg/AMD-8000-AV for
              more information.

Response    : An attempt will be made to remove this CPU from service.

Impact      : Performance of this system may be affected.

Action      : Schedule a repair procedure to replace the affected CPU.  Use
              'fmadm faulty' to identify the module.

Phase 1 Architecture

"Architecture" may be overstating it a bit - I like to call this an "initial enabling" since the architecture will have to evolve considerably as the x86 platform itself matures to offer features that will make many of the sun4v/ldoms hardware fault management features possible on x86.

The aim was to re-use as much existing Solaris error handling and fault diagnosis code as possible. That's not just for efficiency - it also allows interoperability and compatibility between native and virtualized boots for fault management. The first requirement was to expose physical cpu information to dom0 from the hypervisor: dom0 is normally presented with virtual cpu information - no good for diagnosing real physical faults. That information allows Solaris to initialize a cpu module handle for each physical (chipid, coreid, strandid) combination in the system, and to associate telemetry arising from the hypervisor with a particular cpu resource. It also allows us to enumerate the cpu resources in our topology library - we use this in our diagnosis software.

Next we had to modify our chip topology, as reported in /usr/lib/fm/fmd/fmtopo and as used in diagnosis software, from chip/cpu (misguidedly introduced with the original Solaris AMD fault management work - see previous blogs) to a more physical representation of chip/core/strand - you can see the alignment with cpu module handles mentioned above.

Then communication channels were introduced from the hypervisor to dom0 in order to deliver error telemetry to dom0 for logging and diagnosis:

  • Error exceptions are handled in the hypervisor; it decides whether we can continue or need to panic. In both cases the available telemetry (read from MCA registers) is packaged up and forwarded to dom0.
  • If the hypervisor needs to panic as the result of an error then the crashdump procedure, handled as today by dom0, extracts the in-flight terminal error telemetry and stores it at a special location on the dump device. At subsequent reboot (whether native or virtualized) we extract this telemetry and submit it for logging and diagnosis.
  • The hypervisor polls MCA state for corrected errors (it's done this for ages) and forwards any telemetry it finds on to dom0 for logging and diagnosis (instead of simple printk to the hypervisor console).
  • The dom0 OS has full PCI access; when error notifications arrive in dom0 we check the memory-controller device for any additional telemetry. We also perform a periodic poll of the memory-controller device from dom0.
Once telemetry is received in dom0 a thin layer of "glue" kernel code injects the data into the common (existing, now shared between native and i86xpv boots) infrastructure which then operates without modification to raise error reports to dom0 userland diagnosis software. The userland diagnosis software applies the same algorithms as on native; if a fault is diagnosed then the retire agent may request to offline the physical processor to isolate it - this is effected through a new hypercall request to the hypervisor.

The obligatory picture:


The hypervisor to dom0 communication channel for MCA telemetry and the MCA hypercall itself that we've used in our project is derived from work contributed to the Xen community by Christoph Egger of AMD. We worked with Christoph on some design requirements for that work.

Frank van der Linden did the non-MCA hypervisor work - physical cpu information for dom0, and hypervisor-level cpu offline/online.

Sean Ye designed and implemented our new /dev/fm driver, the new topology enumerator that works identically whether booted as dom0 or native, and topology-method based resource retire for FMA agents. Sean also performed most of the glue work in putting various pieces together as they became available.

John Cui extended our fault test harness to test all the possible error and fault scenarios under native and xVM, and performed all our regression and a great deal of unit testing on a host of systems.

I did the MCA-related hypervisor and kernel work, and the userland error injector changes.

We started this project in January, with a developer in each of the United Kingdom, Netherlands, and China. We finished in September with the same three developers, but with some continent reshuffles: the Netherlands moving to the US in March/April, and the UK moving to Australia in May/June. Thanks to our management for their indulgence and patience while we went off-the-map for extended periods mid-project!


Sunday Sep 14, 2008

Generic Machine Check Architecture (MCA) In Solaris

The work described below was integrated into Solaris Nevada way back in August 2007 - build 76; it has since been backported to Solaris 10. It's never too late to blog about things! Actually, I just want to separate this description from the entry that will follow - Solaris x86 xVM Fault Management.

Why Generic MCA?

In past blogs I have described x86 cpu and memory fault management feature-additions for specific processor types: AMD Opteron family 0xf revisions B-E, and AMD Opteron family 0xf revisions F and G. At the time of the first AMD work Sun was not shipping any Intel x64 systems; since then, of course, Sun has famously begun a partnership with Intel and so we needed to look at offering fault management support for our new Intel-based platforms.

Both AMD and Intel base their processor error handling and reporting on the generic Machine Check Architecture (MCA). AMD public documentation actually details and classifies error detectors and error types well beyond what generic MCA allows, while Intel public documentation generally maintains the abstract nature of MCA by not publishing model-specific details. Suppose we observe an error on MCA bank number 2 of a processor with status register MC2_STATUS reading 0xd40040000000017a. In generic MCA that will be classified as "GCACHEL2_EVICT_ERR in bank 2" - a generic transaction type experienced a level-2 cache error during a cache eviction event; other fields of the status register indicate that this event has model-specific error code 0, whether the error was hardware-corrected and so on. In the case of Intel processors, public documentation does not generally describe additional classification of the error - for example they do not document which processor unit "bank 2" might correspond to, nor what model-specific error code 0 for the above error might tell us. In the AMD case all that detail is available, and we could further classify the error as having been detected by the "bus unit" (bank 2); the extended error code (part of the MCA model-specific error code) of 0 on an evict transaction indicates a "Data Copyback" single-bit ECC error from L2 cache.
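To make the generic classification concrete, here is a small sketch of how the low 16 bits of a status value (the MCA error code) decode into a name like the one quoted above. It covers only the architectural "cache hierarchy" error code pattern 0000 0001 RRRR TTLL; the function name is hypothetical and this is an illustration of the encoding, not the Solaris implementation.

```python
# Sub-fields of an architectural cache-hierarchy MCA error code:
#   0000 0001 RRRR TTLL
TRANSACTION_TYPE = {0b00: "I", 0b01: "D", 0b10: "G"}   # instruction, data, generic
CACHE_LEVEL = {0b00: "L0", 0b01: "L1", 0b10: "L2", 0b11: "LG"}
REQUEST = {
    0b0000: "ERR", 0b0001: "RD", 0b0010: "WR", 0b0011: "DRD",
    0b0100: "DWR", 0b0101: "IRD", 0b0110: "PREFETCH",
    0b0111: "EVICT", 0b1000: "SNOOP",
}


def cache_error_name(status):
    """Return the compound name for a cache-hierarchy error, else None."""
    code = status & 0xFFFF
    if code & 0xFF00 != 0x0100:
        return None                     # some other class of MCA error code
    rrrr = (code >> 4) & 0xF
    tt = (code >> 2) & 0x3
    ll = code & 0x3
    if rrrr not in REQUEST or tt not in TRANSACTION_TYPE:
        return None                     # reserved encoding
    return "%sCACHE%s_%s_ERR" % (
        TRANSACTION_TYPE[tt], CACHE_LEVEL[ll], REQUEST[rrrr])


print(cache_error_name(0xd40040000000017a))
```

For the MC2_STATUS value above the error code is 0x017a, i.e. RRRR=0111 (evict), TT=10 (generic), LL=10 (level 2), giving GCACHEL2_EVICT_ERR. Everything in that name comes from architectural bits, with no model-specific knowledge required.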

Our first AMD support in Solaris included support for all this additional classification detail, and involved a substantial amount of work for each new processor revision. In the case of AMD family 0xf revision F processors (the first to support DDR2 memory) we were not able to get the required support back into Solaris 10 in time for the first Sun systems using those processors! We began to realize that this detailed model-specific approach was never going to scale - that was probably obvious at the beginning, but our SPARC roots had not prepared us for how quickly the x86 world rolls out new processor revisions! When the additional requirement to support Intel processors was made, we soon decided it was time to adopt a generic MCA approach.

We also realized that, with a few notable exceptions, we were not actually making any real use of the additional detail available in the existing AMD implementations. For example, our diagnosis algorithms for the error above would simply count the error as a corrected error from the L2 Cache data array - and all that information was available via generic MCA. So the design would be to implement most support in a generic MCA module, and layer model-specific support on top of that to facilitate value-add from model-specific details when we can put them to good use for diagnosis, surviving the error where we might not from generic information, and so on.
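The "simply count the error" style of diagnosis can be pictured as a rate counter: tolerate the odd corrected error, but call a fault once enough of them arrive within some time window. The sketch below is an assumption-laden illustration of that idea only; the class name, the threshold and the window are all hypothetical and are not the values or code used by the Solaris diagnosis software.

```python
from collections import deque


class ErrorRateCounter:
    """Fire once at least `n` errors are seen within `window` seconds.

    Hypothetical illustration; n and window are made-up parameters.
    """

    def __init__(self, n, window):
        self.n = n
        self.window = window
        self.times = deque()

    def record(self, timestamp):
        """Record one corrected error; return True if the rate is exceeded."""
        self.times.append(timestamp)
        # Forget errors that have aged out of the window.
        while timestamp - self.times[0] > self.window:
            self.times.popleft()
        return len(self.times) >= self.n


# One counter per resource, e.g. the L2 cache data array of one chip.
l2_data = ErrorRateCounter(n=3, window=10)
for t in (0, 5, 8):
    print(t, l2_data.record(t))
```

Note that the counter only needs to know which resource the error belongs to and that it was corrected, which is exactly the information generic MCA already supplies.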

What Does Solaris Generic MCA Support Look Like?

It may not have been a blog, but I have documented this before - in the Generic MCA FMA Portfolio documentation (we prepare such a portfolio for all FMA work). In particular, the Generic MCA Philosophy document describes how errors are classified, what error report class is used for each type of generic MCA error, what ereport payload content is included for each, etc.

Generic MCA Preparation For Virtualization - A Handle On Physical CPUs

In the generic MCA project we also decided to prepare the way for x86 cpu and memory fault handling in virtualized environments, such as Solaris x86 xVM Server. Our existing cpu module API at the time was all modelled around the assumption that there was a fixed, one-to-one relationship between Solaris logical cpus (as reported in psrinfo for example) and real, physical processor execution resources. If running as a dom0, however, we may be limited to 2 virtual cpus (vcpus) while the underlying physical hardware may have 16 physical cpus (say 8 dual-core chips); and while Solaris dom0 may bind a software thread to a vcpu it is actually the hypervisor that is scheduling vcpus onto real physical cpus (pcpus) and binding to a vcpu does not imply binding to a pcpu - if you're binding to read a few hardware registers from what should be the same cpu for all reads you're going to be disappointed if the hypervisor switches you from one pcpu to another midway! And, anyway, you'd be lucky if the hypervisor let you succeed in reading such registers at all - and you don't know whether any values you read were modified by the hypervisor!

So the generic MCA project, already reworking the cpu module interface very substantially, chose to insert a "handle" argument to most cpu module interfaces. A handle is associated with each physical (chip-id, core-id, strand-id) processor instance, and one needs to look up and quote the handle one wishes to operate on throughout the cpu module interface and model-specific overlay interface, as well as in kernel modules delivering cpu module implementations, or model-specific implementations.
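In spirit the handle scheme is a table keyed by physical identity rather than by logical cpu number. The following toy model shows the shape of it; every name here (CpuHandle, HandleTable, create, lookup) is hypothetical and chosen for this sketch, and the real kernel interfaces are C and considerably more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuHandle:
    """Identifies one physical execution resource (hypothetical model)."""
    chipid: int
    coreid: int
    strandid: int


class HandleTable:
    """Maps (chipid, coreid, strandid) to a handle, independent of vcpus."""

    def __init__(self):
        self._handles = {}

    def create(self, chipid, coreid, strandid):
        key = (chipid, coreid, strandid)
        # Create the handle once; later calls return the same object.
        return self._handles.setdefault(key, CpuHandle(*key))

    def lookup(self, chipid, coreid, strandid):
        # Returns None when no such physical resource was enumerated.
        return self._handles.get((chipid, coreid, strandid))

    def __len__(self):
        return len(self._handles)


# 8 dual-core chips with one strand per core: 16 physical cpus, however
# many vcpus dom0 happens to have been given.
table = HandleTable()
for chip in range(8):
    for core in range(2):
        table.create(chip, core, 0)

print(len(table))
print(table.lookup(3, 1, 0))
```

Telemetry arriving from the hypervisor names a physical (chip, core, strand), so it can be matched to a handle without ever asking which vcpu, if any, was running there.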

Supported Platforms With Generic MCA in Solaris

The generic MCA project raises the baseline level of MCA support for all x86 processors. Before this project Solaris had detailed MCA support for AMD Opteron family 0xf, and very poor support for anything else - little more than dumping some raw telemetry to the console on fatal events, with no telemetry raised and therefore no logging or diagnosis of telemetry. With generic MCA we raise telemetry and diagnose all correctable and uncorrectable errors (the latter provided the hardware does not simply reset, not allowing the OS any option of gathering telemetry) on all x86 platforms, regardless of chip manufacturer or system vendor.

The MCA does not, however, cover all aspects of cpu/memory fault management:

  • Not all cpu and memory error reporting necessarily falls under the MCA umbrella. Most notably, Intel memory-controller hub (MCH) systems have an off-chip memory-controller which can raise notifications through MCA but most telemetry is available through device registers not from the registers that are part of the MCA.
  • The MCA does a pretty complete job of classifying within-chip errors down to a unit that you can track errors on (e.g., down to the instruction cache data and tag arrays). It does not classify off-chip errors other than generically, and additional model-specific support is required to interpret such errors. For example, AMD Opteron has an on-chip memory-controller which reports under the MCA but generic MCA can only classify as far as "memory error, external to the chip": model-specific support can refine the classification to recognise a single-bit ECC error affecting bit 56 at a particular address, and a partnering memory-controller driver could resolve that address down to a memory dimm and rank thereon.
  • For full fault management we require access to additional information about resources that we are performing diagnosis on. For example, we might diagnose that the chip with HyperTransport ID 2 has a bad L2 Cache - but to complete the picture it is nice to provide a FRU ("field-replaceable unit") label such as "CPU_2" that matches the service labelling on the system itself, and ideally also a serial number associated with the FRU.

So while all platforms have equal generic MCA capabilities, some platforms are more equal than others once additional value-add functionality is taken into account:

Memory-Controller Drivers

Some platforms have corresponding memory-controller drivers. Today we have such drivers for AMD family 0xf (mc-amd), the Intel 5000/5400/7300 chipset series (intel_nb5000), and Intel Nehalem systems (intel_nhm). These drivers provide at minimum an address-to-memory-resource translation service that lets us translate an error address (say 0x10344540) to a memory resource (e.g., chip 3, dimm 2, rank 1). They also supply memory topology information. Such drivers operate on all systems built with the chipset they implement support for, whoever the vendor.

Model-specific MCA Support

Some platforms have additional model-specific MCA support layered on top of the generic MCA code. "Legacy" AMD family 0xf support is maintained unchanged by a model-specific layer which is permitted to rewrite the generic ereport classes into those more-specific classes used in the original MCA work, and to add ereport payload members that are non-architectural such as ECC syndrome. We also added a "generic AMD" support module which would apply in the absence of any more-specific support; right now this applies to all AMD families after 0xf, i.e. 0x10 and 0x11. This module permits the recognition and diagnosis of all memory errors even in the absence of a memory controller driver for the platform. The Intel Nehalem support taught the Intel module to recognise additional Nehalem-specific error types and handle them appropriately. Such model-specific support applies regardless of vendor.

FRU Labelling

Some platforms have cpu and memory FRU label support, i.e., the ability to diagnose "CPU2 DIMM3" as bad instead of something like "chip=2/memory-controller=0/dimm=3" which actually requires platform schematics to map to a particular dimm with absolute certainty. For many reasons, this support is delivered via hardcoded rules in XML map files. Solaris today only delivers such information for Sun x64 systems; support for additional systems, including non-Sun platforms, is trivial to add in the XML - the tricky part is in obtaining the schematics etc to perform the mapping of chip/dimm instances to FRU labels.

FRU Serial Numbers

Some platforms have memory DIMM serial number support. Again we only deliver such support for Sun x64 systems, and how this is done varies a little (on Intel the memory-controller driver reads the DIMM SPD proms itself; on AMD we let the service processor do that and read the records from the service processor).

What?! - No Examples?

I'll leave a full worked example to my next blog entry, where I'll demonstrate the full story under Solaris xVM x86 hypervisor.


Monday Nov 20, 2006

London OpenSolaris FMA Presentation

On Wednesday last week I presented some aspects of Solaris FMA at the November meeting of the OpenSolaris User Group. It seemed to go well (no obvious snoring, and some good feedback) although it went over time by 10 or 15 minutes which is a dangerous thing to do when post-meeting beer and food beckons (worked around from the beginning by bringing the beer into the presentation). At Chris Gerhard's prompting I'm posting the slides here on my blog.



I work in the Fault Management core group; this blog describes some of the work performed in that group.

