Thursday Nov 10, 2011

Oracle Solaris 11 - New Fault Management Features

Oracle Solaris 11 has been launched today! Here is a quick writeup of some of the new fault management features in Solaris 11.

Friday Apr 01, 2011

Notifications for SMF Instance State Transitions

Something you may or may not have noticed in the Oracle Solaris 11 Express release is the ability to monitor SMF service instance transitions, such as to maintenance state or from online to disabled. This leverages the Solaris FMA notification daemons, which were introduced at the same time as this SMF support (previously SNMP notification was delivered as an fmd plugin, and there was no email notification).

The first thing to know is that the notification daemons are packaged apart from the fault manager itself. There are separate packages for SMTP notification and for SNMP notification, and each delivers a corresponding SMF service:

# pkg install smtp-notify
# pkg install snmp-notify
# svcadm enable smtp-notify
# svcadm enable snmp-notify
# svcs smtp-notify
# svcs snmp-notify
You need only install the mechanism(s) you want. If the packages are absent you are still able to configure notification preferences with svccfg as below - but notifications will not work until the packages are added and the services are online.

Next you need to set up your notification preferences. Like many things nowadays these are stored in SMF, and we use an updated svccfg command line. See the svccfg(1m) manpage for subcommands listnotify, setnotify and delnotify, as well as smtp-notify(1m) and snmp-notify(1m). For the die-hards there is more detail in smf(5).

While you can notify of any transition and configure per-service in addition to globally, most commonly it will be transitions to maintenance state that are configured for global notification (I'll use $EMAIL here both to protect the innocent and to stop email address obfuscation from wrecking the example!):

# svccfg listnotify -g
# svccfg setnotify -g to-maintenance mailto:$EMAIL
# svccfg listnotify -g
    Event: to-maintenance (source: svc:/system/svc/global:default)
        Notification Type: smtp
            Active: true
            to: john.q.citizen-AT-elsewhere-DOT-com
That configures email notification when any service instance transitions to maintenance. Similarly you can configure such global actions for transitions such as to-offline.

If you only want to monitor particular services, or appoint different actions for different services (such as to email different aliases), then drop the -g (global) and list the instance you are interested in:

# svccfg -s svc:/network/http:apache22 setnotify to-offline snmp:active

Finally, you can notify of FMA problem lifecycle events using the same mechanism:

# svccfg setnotify problem-diagnosed mailto:$EMAIL
The manpages detail the other lifecycle events.

Tuesday Sep 16, 2008

New! Solaris x86 xVM CPU & Memory Fault Handling

We recently completed the first phase of re-enabling traditional Solaris Fault Management features on x86 systems running the Solaris xVM Hypervisor; this support was integrated into Solaris Nevada build 99, and will from there find its way into a future version of the xVM Server product (post 1.0).

In such systems prior to build 99 there is essentially no hardware error handling or fault diagnosis. Our existing Solaris features in this area were designed for native ("bare metal") environments in which the Solaris OS instance has full access to and control over the underlying hardware. When a type 1 hypervisor is inserted it becomes the privileged owner of the hardware, and native error handling and fault management capabilities of a dom0 operating system such as Solaris are nullified.

The two most important sources of hardware errors for our purposes are cpu/memory and IO (PCI/PCI-E). This first phase restores cpu/memory error handling and fault diagnosis to near-native capabilities. A future phase will restore IO fault diagnosis.

Both AMD and Intel systems are equally supported in this implementation. The examples below are from an AMD lab system - details on an Intel system will differ trivially.

A Worked Example - A Full Fault Lifecycle Illustration

We'll follow a cpu fault through its full lifecycle. I'm going to include sample output from administrative commands only, and I'm not going to look "under the hood" - the intention is to show how simple this is and hide the under-the-hood complexities!

I will use my favourite lab system - a Sun Fire v40z server with 4 x AMD Opteron, one of which has a reliable fault that will produce no end of correctable instruction cache and l2 cache single-bit errors. I've used this host "parity" in a past blog some time ago - if you contrast the procedure and output there with that below you'll (hopefully!) see we've come a long way.

The administrative interface for Solaris Fault Management is unchanged with this new feature - you now just run fmadm on a Solaris dom0 instead of natively. The output below is from host "parity" dom0 (in an i86xpv Solaris xVM boot).

Notification Of Fault Diagnosis - (dom0) Console, SNMP

Prior to diagnosis of a fault you'll see nothing on the system console. Errors are being logged in dom0 and analysed, but we'll only trouble the console once a fault is diagnosed (console spam is bad). So after a period of uptime on this host we see the following on the console:

SUNW-MSG-ID: AMD-8000-7U, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Mon Sep 15 20:54:06 PDT 2008
PLATFORM: Sun Fire V40z, CSN: XG051535088, HOSTNAME: parity
SOURCE: eft, REV: 1.16
EVENT-ID: 8db62dbe-a091-e222-a6b4-dd95f015b4cc
DESC: The number of errors associated with this CPU has exceeded acceptable
	levels.  Refer to for more information.
AUTO-RESPONSE: An attempt will be made to remove this CPU from service.
IMPACT: Performance of this system may be affected.
REC-ACTION: Schedule a repair procedure to replace the affected CPU.
	Use 'fmadm faulty' to identify the CPU.
If you've configured your system to raise SNMP traps on fault diagnosis you'll be notified by whatever software has been plumbed in to that mechanism, too.

Why, you ask, does the console message not tell us which cpu is faulty? Well that's because the convoluted means by which we generate these messages (convoluted to facilitate localization of messages in localized versions of Solaris) has restricted us to a limited amount of dynamic content in the messages - an improvement in this area is in the works.

Show Current Fault State - fmadm faulty

For now we follow the advice and run fmadm faulty on dom0.

dom0# fmadm faulty   
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Sep 15 20:54:06 8db62dbe-a091-e222-a6b4-dd95f015b4cc  AMD-8000-7U    Major    

Fault class : fault.cpu.amd.icachedata
Affects     : hc://:product-id=Sun-Fire-V40z:chassis-id=XG051535088:server-id=parity/motherboard=0/chip=3/core=0/strand=0
                  faulted and taken out of service
FRU         : "CPU 3" (hc://:product-id=Sun-Fire-V40z:chassis-id=XG051535088:server-id=parity/motherboard=0/chip=3)

Description : The number of errors associated with this CPU has exceeded
              acceptable levels.  Refer to for
              more information.

Response    : An attempt will be made to remove this CPU from service.

Impact      : Performance of this system may be affected.

Action      : Schedule a repair procedure to replace the affected CPU.  Use
              'fmadm faulty' to identify the CPU.
Note the faulted and taken out of service - we have diagnosed this resource as faulty and have offlined the processor at the hypervisor level. This system supports cpu and memory FRU labels so all that hc://.... mumbo-jumbo can be ignored - the thing to replace is the processor in the socket labelled "CPU 3" (either on the board or on the service sticker affixed to the chassis lid, or both - depends on the system).
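
If you do want to pick that hc://.... mumbo-jumbo apart (say, in a script that post-processes fmadm output), the string is regular enough to split by hand. What follows is only a minimal sketch in Python, assuming the layout seen in the output above: an authority section of colon-separated key=value pairs, then a path of name=instance components. The function name parse_hc_fmri is my own invention for illustration and is not part of any Solaris library.

```python
def parse_hc_fmri(fmri):
    """Split an hc-scheme FMRI string into (authority dict, path components).

    Assumes the layout seen in the fmadm output above:
    hc://:key=value:key=value/name=instance/name=instance/...
    """
    prefix = "hc://"
    if not fmri.startswith(prefix):
        raise ValueError("not an hc-scheme FMRI: %r" % fmri)
    rest = fmri[len(prefix):]

    # Everything before the first "/" is the authority; the rest is the path.
    authority_str, _, path = rest.partition("/")

    authority = {}
    for item in authority_str.split(":"):
        if item:  # skip the empty string before the leading ":"
            key, _, value = item.partition("=")
            authority[key] = value

    components = []
    for part in path.split("/"):
        if part:
            name, _, instance = part.partition("=")
            components.append((name, int(instance)))

    return authority, components


# The "Affects" resource from the example above.
resource = ("hc://:product-id=Sun-Fire-V40z:chassis-id=XG051535088"
            ":server-id=parity/motherboard=0/chip=3/core=0/strand=0")
authority, components = parse_hc_fmri(resource)
```

Here the FRU path (motherboard=0/chip=3) is simply a prefix of the resource path (motherboard=0/chip=3/core=0/strand=0), which is why one faulty strand leads to a recommendation to replace the whole chip.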

Replace The Bad CPU

Schedule downtime to replace the chip in location "CPU 3". Software can't do this for you!

Declare Your Replacement Action

dom0# fmadm replaced "CPU 3"
fmadm: recorded replacement of CPU 3
This is an unfortunate requirement: x86 systems do not support cpu serial numbers (the privacy world rioted when Intel included such functionality and enabled it by default), and so we cannot auto-detect the replacement of a cpu FRU. On sparc systems, and on Sun x86 systems for memory dimms, we do have FRU serial number support and this step is not required.

When the replacement is declared as above (or for FRU types that have serial numbers, at reboot following FRU replacement) the console will confirm that all aspects of this fault case have been addressed. At this time the cpu has been brought back online at the hypervisor level.

Did You See Any xVM In That?

No, you didn't! - native/xVM support is seamless. The steps and output above did not involve any xVM-specific steps or verbiage. Solaris is running as a dom0, really!

dom0# uname -a
SunOS parity 5.11 on-fx-dev i86pc i386 i86xpv
Actually there are a few differences elsewhere. Most notable is that psrinfo output will not change when the physical cpu is offlined at the hypervisor level:
dom0# psrinfo   
0       on-line   since 09/09/2008 03:25:15
1       on-line   since 09/09/2008 03:25:19
2       on-line   since 09/09/2008 03:25:19
3       on-line   since 09/09/2008 03:25:19
That's because psrinfo is reporting the state of virtual cpus and not physical cpus. We plan to integrate functionality to query and control the state of physical cpus, but that is yet to come.

Another Example - Hypervisor Panic

In this case the processor suffers an uncorrected data cache error, losing hypervisor state. The hypervisor handles the exception, dispatches telemetry towards dom0, and initiates a panic. The system was booted under Solaris xVM at the time of the error:

Xen panic[dom=0xffff8300e319c080/vcpu=0xffff8300e3de8080]: Hypervisor state lost due to machine check exception.

	rdi: ffff828c801e7348 rsi: ffff8300e2e09d48 rdx: ffff8300e2e09d78
	rcx: ffff8300e3dfe1a0  r8: ffff828c801dd5fa  r9:        88026e07f
	rax: ffff8300e2e09e38 rbx: ffff828c8026c800 rbp: ffff8300e2e09e28
	r10: ffff8300e319c080 r11: ffff8300e2e09cb8 r12:                0
	r13:        a00000400 r14: ffff828c801f12e8 r15: ffff8300e2e09dd8
	fsb:                0 gsb: ffffff09231ae580  ds:               4b
	 es:               4b  fs:                0  gs:              1c3
	 cs:             e008 rfl:              282 rsp: ffff8300e2e09d30
	rip: ffff828c80158eef:  ss:             e010
	cr0: 8005003b  cr2: fffffe0307699458  cr3: 6dfeb5000  cr4:      6f0
Xen panic[dom=0xffff8300e319c080/vcpu=0xffff8300e3de8080]: Hypervisor state lost due to machine check exception.

ffff8300e2e09e28 xpv:mcheck_cmn_handler+302
ffff8300e2e09ee8 xpv:k8_machine_check+22
ffff8300e2e09f08 xpv:machine_check_vector+21
ffff8300e2e09f28 xpv:do_machine_check+1c
ffff8300e2e09f48 xpv:early_page_fault+77
ffff8300e2e0fc48 xpv:smp_call_function_interrupt+86
ffff8300e2e0fc78 xpv:call_function_interrupt+2b
ffff8300e2e0fd98 xpv:do_mca+9f0
ffff8300e2e0ff18 xpv:syscall_enter+67
ffffff003bda1b90 unix:xpv_int+46 ()
ffffff003bda1bc0 unix:cmi_hdl_int+42 ()
ffffff003bda1d20 specfs:spec_ioctl+86 ()
ffffff003bda1da0 genunix:fop_ioctl+7b ()
ffffff003bda1eb0 genunix:ioctl+174 ()
ffffff003bda1f00 unix:brand_sys_syscall32+328 ()

syncing file systems... done
ereport.cpu.amd.dc.data_eccm ena=f0b88edf0810001 detector=[ version=0 scheme=
 "hc" hc-list=[ hc-name="motherboard" hc-id="0" hc-name="chip" hc-id="1"
 hc-name="core" hc-id="0" hc-name="strand" hc-id="0" ] ] compound_errorname=
 "DCACHEL1_DRD_ERR" disp="unconstrained,forcefatal" IA32_MCG_STATUS=7
 machine_check_in_progress=1 ip=0 privileged=0 bank_number=0 bank_msr_offset=
 400 IA32_MCi_STATUS=b600a00000000135 overflow=0 error_uncorrected=1
 error_enabled=1 processor_context_corrupt=1 error_code=135
 model_specific_error_code=0 IA32_MCi_ADDR=2e7872000 syndrome=1 syndrome-type=
 "E" __injected=1

dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
100% done
Note that as part of the panic flow dom0 has retrieved the fatal error telemetry from the hypervisor address space, and it dumps the error report to the console. Hopefully we won't have to interpret that output - it's only there in case the panic fails (e.g., no dump device). The ereport is also written to the system dump device at a well-known location, and the fault management service will retrieve this telemetry when it restarts on reboot.

On the subsequent reboot we see:

v3.1.4-xvm chgset 'Sun Sep 07 13:00:23 2008 -0700 15898:354d7899bf35'
SunOS Release 5.11 Version on-fx-dev 64-bit
Copyright 1983-2008 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
DEBUG enabled
(xVM) Xen trace buffers: initialized
Hostname: parity
NIS domain name is
Reading ZFS config: done.
Mounting ZFS filesystems: (9/9)

parity console login:
(xVM) Prepare to bring CPU1 down...

SUNW-MSG-ID: AMD-8000-AV, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Mon Sep 15 23:05:21 PDT 2008
PLATFORM: Sun Fire V40z, CSN: XG051535088, HOSTNAME: parity
SOURCE: eft, REV: 1.16
EVENT-ID: f107e323-9355-4f70-dd98-b73ba7881cc5
DESC: The number of errors associated with this CPU has exceeded acceptable levels.  Refer to for more information.
AUTO-RESPONSE: An attempt will be made to remove this CPU from service.
IMPACT: Performance of this system may be affected.
REC-ACTION: Schedule a repair procedure to replace the affected CPU.  Use 'fmadm faulty' to identify the module.
(xVM) CPU 1 is now offline
Pretty cool - huh? We diagnosed a terminal MCA event that caused a hypervisor panic and system reset. You'll only see the xVM-prefixed messages if you have directed hypervisor serial console output to dom0 console. At this stage we can see the fault in fmadm:
dom0# fmadm faulty -f
"CPU 1" (hc://:product-id=Sun-Fire-V40z:chassis-id=XG051535088:server-id=parity/motherboard=0/chip=1) faulty
f107e323-9355-4f70-dd98-b73ba7881cc5 1 suspects in this FRU total certainty 100%

Description : The number of errors associated with this CPU has exceeded
              acceptable levels.  Refer to for
              more information.

Response    : An attempt will be made to remove this CPU from service.

Impact      : Performance of this system may be affected.

Action      : Schedule a repair procedure to replace the affected CPU.  Use
              'fmadm faulty' to identify the module.

Phase 1 Architecture

"Architecture" may be overstating it a bit - I like to call this an "initial enabling" since the architecture will have to evolve considerably as the x86 platform itself matures to offer features that will make many of the sun4v/ldoms hardware fault management features possible on x86.

The aim was to re-use as much existing Solaris error handling and fault diagnosis code as possible. That's not just for efficiency - it also allows interoperability and compatibility between native and virtualized boots for fault management. The first requirement was to expose physical cpu information to dom0 from the hypervisor: dom0 is normally presented with virtual cpu information - no good for diagnosing real physical faults. That information allows Solaris to initialize a cpu module handle for each physical (chipid, coreid, strandid) combination in the system, and to associate telemetry arising from the hypervisor with a particular cpu resource. It also allows us to enumerate the cpu resources in our topology library - we use this in our diagnosis software.

Next we had to modify our chip topology, as reported in /usr/lib/fm/fmd/fmtopo and as used in diagnosis software, from chip/cpu (misguidedly introduced with the original Solaris AMD fault management work - see previous blogs) to a more physical representation of chip/core/strand - you can see the alignment with cpu module handles mentioned above.

Then communication channels were introduced from the hypervisor to dom0 in order to deliver error telemetry to dom0 for logging and diagnosis:

  • Error exceptions are handled in the hypervisor; it decides whether we can continue or need to panic. In both cases the available telemetry (read from MCA registers) is packaged up and forwarded to dom0.
  • If the hypervisor needs to panic as the result of an error then the crashdump procedure, handled as today by dom0, extracts the in-flight terminal error telemetry and stores it at a special location on the dump device. At subsequent reboot (whether native or virtualized) we extract this telemetry and submit it for logging and diagnosis.
  • The hypervisor polls MCA state for corrected errors (it's done this for ages) and forwards any telemetry it finds on to dom0 for logging and diagnosis (instead of simple printk to the hypervisor console).
  • The dom0 OS has full PCI access; when error notifications arrive in dom0 we check the memory-controller device for any additional telemetry. We also perform a periodic poll of the memory-controller device from dom0.
Once telemetry is received in dom0 a thin layer of "glue" kernel code injects the data into the common (existing, now shared between native and i86xpv boots) infrastructure which then operates without modification to raise error reports to dom0 userland diagnosis software. The userland diagnosis software applies the same algorithms as on native; if a fault is diagnosed then the retire agent may request to offline the physical processor to isolate it - this is effected through a new hypercall request to the hypervisor.

The obligatory picture:


The hypervisor to dom0 communication channel for MCA telemetry and the MCA hypercall itself that we've used in our project is derived from work contributed to the Xen community by Christoph Egger of AMD. We worked with Christoph on some design requirements for that work.

Frank van der Linden did the non-MCA hypervisor work - physical cpu information for dom0, and hypervisor-level cpu offline/online.

Sean Ye designed and implemented our new /dev/fm driver; the new topology enumerator that works identically regardless of dom0 or native; topology-method based resource retire for FMA agents. Sean also performed most of the glue work in putting various pieces together as they became available.

John Cui extended our fault test harness to test all the possible error and fault scenarios under native and xVM, and performed all our regression and a great deal of unit testing on a host of systems.

I did the MCA-related hypervisor and kernel work, and the userland error injector changes.

We started this project in January, with a developer in each of the United Kingdom, Netherlands, and China. We finished in September with the same three developers, but with some continent reshuffles: the Netherlands moving to the US in March/April, and the UK moving to Australia in May/June. Thanks to our management for their indulgence and patience while we went off-the-map for extended periods mid-project!

Sunday Sep 14, 2008

Generic Machine Check Architecture (MCA) In Solaris

The work described below was integrated into Solaris Nevada way back in August 2007 - build 76; it has since been backported to Solaris 10. It's never too late to blog about things! Actually, I just want to separate this description from the entry that will follow - Solaris x86 xVM Fault Management.

Why Generic MCA?

In past blogs I have described x86 cpu and memory fault management feature-additions for specific processor types: AMD Opteron family 0xf revisions B-E, and AMD Opteron family 0xf revisions F and G. At the time of the first AMD work Sun was not shipping any Intel x64 systems; since then, of course, Sun has famously begun a partnership with Intel and so we needed to look at offering fault management support for our new Intel-based platforms.

Both AMD and Intel base their processor error handling and reporting on the generic Machine Check Architecture (MCA). AMD public documentation actually details and classifies error detectors and error types well beyond what generic MCA allows, while Intel public documentation generally maintains the abstract nature of MCA by not publishing model-specific details. Suppose we observe an error on MCA bank number 2 of a processor with status register MC2_STATUS reading 0xd40040000000017a. In generic MCA that will be classified as "GCACHEL2_EVICT_ERR in bank 2" - a generic transaction type experienced a level-2 cache error during a cache eviction event; other fields of the status register indicate that this event has model-specific error code 0, whether the error was hardware-corrected and so on. In the case of Intel processors, public documentation does not generally describe additional classification of the error - for example they do not document which processor unit "bank 2" might correspond to, nor what model-specific error code 0 for the above error might tell us. In the AMD case all that detail is available, and we could further classify the error as having been detected by the "bus unit" (bank 2); the extended error code (part of the MCA model-specific error code) of 0 on an evict transaction indicates a "Data Copyback" single-bit ECC error from L2 cache.
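
To make the generic classification concrete, here is a small sketch in Python (not the Solaris kernel code, which is C) that decodes the architectural fields of an MCi_STATUS value. The bit positions and the cache-hierarchy error code layout (0000 0001 RRRR TTLL) are the architectural ones; the function name and the dictionary keys are my own choices, picked to echo the ereport payload member names, and only the cache-hierarchy compound error type is handled.

```python
# Architectural flag bits of an MCi_STATUS register.
STATUS_BITS = {
    "valid": 63,
    "overflow": 62,
    "error_uncorrected": 61,
    "error_enabled": 60,
    "misc_valid": 59,
    "addr_valid": 58,
    "processor_context_corrupt": 57,
}

# Sub-fields of a cache hierarchy error code: 0000 0001 RRRR TTLL.
TT = {0: "I", 1: "D", 2: "G"}                    # transaction type
LL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}        # cache level
RRRR = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
        5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}  # request type


def decode_mci_status(status):
    """Return a dict of generic MCA fields decoded from a 64-bit status value."""
    fields = {name: (status >> bit) & 1 for name, bit in STATUS_BITS.items()}
    code = status & 0xFFFF
    fields["error_code"] = code
    fields["model_specific_error_code"] = (status >> 16) & 0xFFFF

    # Only the cache hierarchy compound error form is decoded in this sketch.
    if (code & 0xFF00) == 0x0100:
        tt = TT.get((code >> 2) & 0x3)
        ll = LL[code & 0x3]
        rrrr = RRRR.get((code >> 4) & 0xF)
        if tt is not None and rrrr is not None:
            fields["compound_errorname"] = "%sCACHE%s_%s_ERR" % (tt, ll, rrrr)
    return fields


decoded = decode_mci_status(0xd40040000000017a)
```

For the value above this yields compound_errorname "GCACHEL2_EVICT_ERR", model_specific_error_code 0 and error_uncorrected 0 (that is, a corrected error), which is everything the generic layer gets to work with; anything finer needs model-specific knowledge.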

Our first AMD support in Solaris included support for all this additional classification detail, and involved a substantial amount of work for each new processor revision. In the case of AMD family 0xf revision F processors (the first to support DDR2 memory) we were not able to get the required support back into Solaris 10 in time for the first Sun systems using those processors! We began to realize that this detailed model-specific approach was never going to scale - that was probably obvious at the beginning, but our SPARC roots had not prepared us for how quickly the x86 world rolls out new processor revisions! When the additional requirement to support Intel processors was made, we soon decided it was time to adopt a generic MCA approach.

We also realized that, with a few notable exceptions, we were not actually making any real use of the additional detail available in the existing AMD implementations. For example, our diagnosis algorithms for the error above would simply count the error as a corrected error from the L2 Cache data array - and all that information was available via generic MCA. So the design would be to implement most support in a generic MCA module, and layer model-specific support on top of that to facilitate value-add from model-specific details when we can put them to good use for diagnosis, surviving the error where we might not from generic information, and so on.
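
As an illustration of what "simply count the error" can mean, here is a rough Python sketch of a rate threshold: declare trouble when more than N corrected errors are seen against one resource within a window of T seconds. This is not the actual diagnosis code; the class name and the N and T values used with it are made up for the example.

```python
from collections import deque


class ErrorRateCounter:
    """Count corrected errors for one resource; trip on more than n in t seconds."""

    def __init__(self, n, t):
        self.n = n            # hypothetical threshold count
        self.t = t            # hypothetical window length in seconds
        self.times = deque()  # timestamps of errors still inside the window

    def record(self, timestamp):
        """Record one error; return True if the threshold is now exceeded."""
        self.times.append(timestamp)
        # Discard errors that have aged out of the window.
        while timestamp - self.times[0] > self.t:
            self.times.popleft()
        return len(self.times) > self.n


# One counter per resource, e.g. the L2 cache data array of a given chip.
l2_data = ErrorRateCounter(n=3, t=60)
results = [l2_data.record(ts) for ts in (0, 10, 20, 30)]
```

With these made-up numbers the fourth error inside a minute trips the counter, while the same four errors spread over hours would not. All the counter needs is "corrected error, L2 data array, this chip", and generic MCA already supplies that.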

What Does Solaris Generic MCA Support Look Like?

It may not have been a blog, but I have documented this before - in the Generic MCA FMA Portfolio documentation (we prepare such a portfolio for all FMA work). In particular, the Generic MCA Philosophy document describes how errors are classified, what error report class is used for each type of generic MCA error, what ereport payload content is included for each, etc.

Generic MCA Preparation For Virtualization - A Handle On Physical CPUs

In the generic MCA project we also decided to prepare the way for x86 cpu and memory fault handling in virtualized environments, such as Solaris x86 xVM Server. Our existing cpu module API at the time was all modelled around the assumption that there was a fixed, one-to-one relationship between Solaris logical cpus (as reported in psrinfo for example) and real, physical processor execution resources. If running as a dom0, however, we may be limited to 2 virtual cpus (vcpus) while the underlying physical hardware may have 16 physical cpus (say 8 dual-core chips); and while Solaris dom0 may bind a software thread to a vcpu it is actually the hypervisor that is scheduling vcpus onto real physical cpus (pcpus) and binding to a vcpu does not imply binding to a pcpu - if you're binding to read a few hardware registers from what should be the same cpu for all reads you're going to be disappointed if the hypervisor switches you from one pcpu to another midway! And, anyway, you'd be lucky if the hypervisor let you succeed in reading such registers at all - and you don't know whether any values you read were modified by the hypervisor!

So the generic MCA project, already reworking the cpu module interface very substantially, chose to insert a "handle" argument to most cpu module interfaces. A handle is associated with each physical (chip-id, core-id, strand-id) processor instance, and one needs to look up and quote the handle one wishes to operate on throughout the cpu module interface and model-specific overlay interface, as well as in kernel modules delivering cpu module implementations, or model-specific implementations.
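
The shape of that idea can be sketched in a few lines of Python. The real interfaces are kernel C and look nothing like this; every name below (CpuHandle, HandleTable, record_bank_status) is hypothetical and exists only to show operations taking an explicit handle for a physical cpu rather than assuming "the cpu I am currently running on".

```python
class CpuHandle:
    """Opaque handle for one physical (chip, core, strand) processor instance."""

    def __init__(self, chip_id, core_id, strand_id):
        self.key = (chip_id, core_id, strand_id)
        self.bank_status = {}  # hypothetical per-handle record of MCA bank state


class HandleTable:
    """Registry of handles, keyed by physical (chip-id, core-id, strand-id)."""

    def __init__(self):
        self._handles = {}

    def create(self, chip_id, core_id, strand_id):
        key = (chip_id, core_id, strand_id)
        # Return the existing handle if this instance is already known.
        return self._handles.setdefault(key, CpuHandle(*key))

    def lookup(self, chip_id, core_id, strand_id):
        return self._handles.get((chip_id, core_id, strand_id))

    def __len__(self):
        return len(self._handles)


def record_bank_status(handle, bank, status):
    """Hypothetical interface: note the explicit handle argument."""
    handle.bank_status[bank] = status


# 8 dual-core chips give 16 physical cpus, however few vcpus dom0 is given.
table = HandleTable()
for chip in range(8):
    for core in range(2):
        table.create(chip, core, 0)

record_bank_status(table.lookup(3, 0, 0), bank=2, status=0xd40040000000017a)
```

Because telemetry is always attributed through a handle, it does not matter which vcpu (or which pcpu) happened to be running the code that delivered it.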

Supported Platforms With Generic MCA in Solaris

The generic MCA project raises the baseline level of MCA support for all x86 processors. Before this project Solaris had detailed MCA support for AMD Opteron family 0xf, and very poor support for anything else - little more than dumping some raw telemetry to the console on fatal events, with no telemetry raised and therefore no logging or diagnosis of telemetry. With generic MCA we raise telemetry and diagnose all correctable and uncorrectable errors (the latter provided the hardware does not simply reset, not allowing the OS any option of gathering telemetry) on all x86 platforms, regardless of chip manufacturer or system vendor.

The MCA does not, however, cover all aspects of cpu/memory fault management:

  • Not all cpu and memory error reporting necessarily falls under the MCA umbrella. Most notably, Intel memory-controller hub (MCH) systems have an off-chip memory-controller which can raise notifications through MCA, but most telemetry is available through device registers rather than from the registers that are part of the MCA.
  • The MCA does a pretty complete job of classifying within-chip errors down to a unit that you can track errors on (e.g., down to the instruction cache data and tag arrays). It does not classify off-chip errors other than generically, and additional model-specific support is required to interpret such errors. For example, AMD Opteron has an on-chip memory-controller which reports under the MCA but generic MCA can only classify as far as "memory error, external to the chip": model-specific support can refine the classification to recognise a single-bit ECC error affecting bit 56 at a particular address, and a partnering memory-controller driver could resolve that address down to a memory dimm and rank thereon.
  • For full fault management we require access to additional information about resources that we are performing diagnosis on. For example, we might diagnose that the chip with HyperTransport ID 2 has a bad L2 Cache - but to complete the picture it is nice to provide a FRU ("field-replaceable unit") label such as "CPU_2" that matches the service labelling on the system itself, and ideally also a serial number associated with the FRU.

So while all platforms have equal generic MCA capabilities, some platforms are more equal than others once additional value-add functionality is taken into account:

Memory-Controller Drivers

Some platforms have corresponding memory-controller drivers. Today we have such drivers for AMD family 0xf (mc-amd), the Intel 5000/5400/7300 chipset series (intel_nb5000), and Intel Nehalem systems (intel_nhm). These drivers provide at minimum an address-to-memory-resource translation service that lets us translate an error address (say 0x10344540) to a memory resource (e.g., chip 3, dimm 2, rank 1). They also supply memory topology information. Such drivers operate on all systems built with the chipset they implement support for, whoever the vendor.
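
For a feel of what an address-to-memory-resource translation service does, here is a deliberately naive Python sketch. It assumes each rank owns one contiguous range of physical addresses, which real memory controllers generally do not guarantee (interleaving, memory hoisting and the like complicate matters). The address layout in the table is invented for the example and does not describe any real system.

```python
import bisect


class AddressMap:
    """Map a physical address to a memory resource via non-overlapping ranges."""

    def __init__(self, ranges):
        # ranges: iterable of (base, limit, resource) with inclusive limits.
        self._ranges = sorted(ranges, key=lambda r: r[0])
        self._bases = [r[0] for r in self._ranges]

    def translate(self, addr):
        """Return the resource owning addr, or None if nothing claims it."""
        i = bisect.bisect_right(self._bases, addr) - 1
        if i < 0:
            return None
        base, limit, resource = self._ranges[i]
        return resource if addr <= limit else None


# Invented layout: two ranks of 128MB each behind chip 3, dimm 2.
memmap = AddressMap([
    (0x08000000, 0x0fffffff, {"chip": 3, "dimm": 2, "rank": 0}),
    (0x10000000, 0x17ffffff, {"chip": 3, "dimm": 2, "rank": 1}),
])
where = memmap.translate(0x10344540)
```

With that made-up table, error address 0x10344540 resolves to chip 3, dimm 2, rank 1, and diagnosis can then accrue errors against that dimm rather than against a bare address.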

Model-specific MCA Support

Some platforms have additional model-specific MCA support layered on top of the generic MCA code. "Legacy" AMD family 0xf support is maintained unchanged by a model-specific layer which is permitted to rewrite the generic ereport classes into those more-specific classes used in the original MCA work, and to add ereport payload members that are non-architectural such as ECC syndrome. We also added a "generic AMD" support module which would apply in the absence of any more-specific support; right now this applies to all AMD families after 0xf, i.e. 0x10 and 0x11. This module permits the recognition and diagnosis of all memory errors even in the absence of a memory controller driver for the platform. The Intel Nehalem support taught the Intel module to recognise additional Nehalem-specific error types and handle them appropriately. Such model-specific support applies regardless of vendor.

FRU Labelling

Some platforms have cpu and memory FRU label support, i.e., the ability to diagnose "CPU2 DIMM3" as bad instead of something like "chip=2/memory-controller=0/dimm=3" which actually requires platform schematics to map to a particular dimm with absolute certainty. For many reasons, this support is delivered via hardcoded rules in XML map files. Solaris today only delivers such information for Sun x64 systems; support for additional systems, including non-Sun platforms, is trivial to add in the XML - the tricky part is in obtaining the schematics etc to perform the mapping of chip/dimm instances to FRU labels.

FRU Serial Numbers

Some platforms have memory DIMM serial number support. Again we only deliver such support for Sun x64 systems, and how this is done varies a little (on Intel the memory-controller driver reads the DIMM SPD proms itself; on AMD we let the service processor do that and read the records from the service processor).

What?! - No Examples?

I'll leave a full worked example to my next blog entry, where I'll demonstrate the full story under Solaris xVM x86 hypervisor.

Monday Nov 20, 2006

London OpenSolaris FMA Presentation

On Wednesday last week I presented some aspects of Solaris FMA at the November meeting of the OpenSolaris User Group. It seemed to go well (no obvious snoring, and some good feedback) although it went over time by 10 or 15 minutes which is a dangerous thing to do when post-meeting beer and food beckons (worked around from the beginning by bringing the beer into the presentation). At Chris Gerhard's prompting I'm posting the slides here on my blog.

Tuesday Oct 24, 2006

FMA Support For AMD Opteron rev F

In my last blog entry I discussed the initial support in Solaris for fault management on AMD Opteron, Athlon 64 and Turion 64 processor systems (more accurately known as AMD Family 0xf, or the K8 family). That support applied to revisions E and earlier of those chips, which were current at the time, and has since been backported to Solaris 10 Update 2. In this entry I will discuss recent changes to support "revision F", and I'll follow up in the next few weeks with a side-by-side comparison of our error handling and fault management with that of one or two other OS offerings.

In August this year AMD introduced its "Next Generation AMD Opteron processor" which has a bunch of cool features (such as AMD Virtualization technology and a socket-compatible upgrade path to Quad-Core chips when they release) but the main change as far as error handling and fault management go is the switch to DDR2 memory. This next generation is also known as revision F (of AMD family 0xf) and sometimes also as the "socket F" (really F(1207)) and "socket AM2" processors; the initial Solaris support for AMD fault management does not apply to revision F (the cpu module driver would choose not to initialize for revisions beyond E).

Quick Summary

If you want to avoid the detail below here is the quick summary. Solaris now supports revision F in Nevada build 51, and we are targeting Solaris 10 Update 4 next year. The support works pretty much as for earlier revisions (most code shared) - identically for cpu diagnosis, and with minor changes in memory diagnosis.

Revision F Support in Nevada Build 51

In early October I putback support for rev F, together with a number of improvements and bugfixes to the original project. Below is the putback notification including both the bugids and the files changed - most files that implement the AMD fault management support were touched to some degree, so this provides a handy list for anyone wanting to explore more deeply:

Event:            putback-to
Parent workspace: /ws/onnv-gate
Child workspace:  /net/
User:             gavinm

PSARC 2006/564 FMA for Athlon 64 and Opteron Rev F/G Processors
PSARC 2006/566 eversholt language enhancements: confprop_defined
6362846 eversholt doesn't allow dashes in pathname components
6391591 AMD NB config should not set NbMcaToMstCpuEn
6391605 AMD DRAM scrubber should be disabled when errata #99 applies
6398506 memory controller driver should not bother to attach at all on rev F
6424822 FMA needs to support AMD family 0xf revs F and G
6443847 FMA x64 multibit ChipKill rules need to follow MQSC guidelines
6443849 Accrue inf_sys and s_ecc ECC errors against memory
6443858 mc-amd can free unitsrtr before usage in subsequent error path
6443891 mc-amd does not recognise mismatched dimm support
6455363 x86 error injector should allow addr option for most errors
6455370 Opteron erratum 101 only applies on revs D and earlier
6455373 Identify chip-select lines used on a dimm
6455377 improve x64 quadrank dimm support
6455382 add generic interfaces for amd chip revision and package/socket type
6468723 mem scheme fmri containment test for hc scheme is busted
6473807 eversholt could use some mdb support
6473811 eversholt needs a confprop_defined function
6473819 eversholt should show version of rules active in DE
6475302 ::nvlist broken by some runtime link ordering changes

update: usr/closed/cmd/mtst/x86/common/opteron/opt.h
update: usr/closed/cmd/mtst/x86/common/opteron/opt_common.c
update: usr/closed/cmd/mtst/x86/common/opteron/opt_main.c
update: usr/closed/cmd/mtst/x86/common/opteron/opt_nb.c
update: usr/src/cmd/fm/dicts/AMD.dict
update: usr/src/cmd/fm/dicts/AMD.po
update: usr/src/cmd/fm/eversholt/common/check.c
update: usr/src/cmd/fm/eversholt/common/eftread.c
update: usr/src/cmd/fm/eversholt/common/eftread.h
update: usr/src/cmd/fm/eversholt/common/esclex.c
update: usr/src/cmd/fm/eversholt/common/escparse.y
update: usr/src/cmd/fm/eversholt/common/literals.h
update: usr/src/cmd/fm/eversholt/common/tree.c
update: usr/src/cmd/fm/eversholt/common/tree.h
update: usr/src/cmd/fm/eversholt/files/i386/i86pc/amd64.esc
update: usr/src/cmd/fm/modules/Makefile.plugin
update: usr/src/cmd/fm/modules/common/cpumem-retire/cma_main.c
update: usr/src/cmd/fm/modules/common/cpumem-retire/cma_page.c
update: usr/src/cmd/fm/modules/common/eversholt/Makefile
update: usr/src/cmd/fm/modules/common/eversholt/eval.c
update: usr/src/cmd/fm/modules/common/eversholt/fme.c
update: usr/src/cmd/fm/modules/common/eversholt/platform.c
update: usr/src/cmd/fm/schemes/mem/mem_unum.c
update: usr/src/cmd/mdb/common/modules/genunix/genunix.c
update: usr/src/cmd/mdb/common/modules/genunix/nvpair.c
update: usr/src/cmd/mdb/common/modules/genunix/nvpair.h
update: usr/src/cmd/mdb/common/modules/libnvpair/libnvpair.c
update: usr/src/cmd/mdb/i86pc/modules/amd_opteron/ao.c
update: usr/src/common/mc/mc-amd/mcamd_api.h
update: usr/src/common/mc/mc-amd/mcamd_misc.c
update: usr/src/common/mc/mc-amd/mcamd_patounum.c
update: usr/src/common/mc/mc-amd/mcamd_rowcol.c
update: usr/src/common/mc/mc-amd/mcamd_rowcol_impl.h
update: usr/src/common/mc/mc-amd/mcamd_rowcol_tbl.c
update: usr/src/common/mc/mc-amd/mcamd_unumtopa.c
update: usr/src/lib/fm/topo/libtopo/common/hc_canon.h
update: usr/src/lib/fm/topo/libtopo/common/mem.c
update: usr/src/lib/fm/topo/libtopo/common/topo_protocol.c
update: usr/src/lib/fm/topo/modules/i86pc/chip/chip.c
update: usr/src/lib/fm/topo/modules/i86pc/chip/chip.h
update: usr/src/pkgdefs/SUNWfmd/prototype_com
update: usr/src/uts/i86pc/Makefile.files
update: usr/src/uts/i86pc/Makefile.workarounds
update: usr/src/uts/i86pc/cpu/amd_opteron/ao.h
update: usr/src/uts/i86pc/cpu/amd_opteron/ao_cpu.c
update: usr/src/uts/i86pc/cpu/amd_opteron/ao_main.c
update: usr/src/uts/i86pc/cpu/amd_opteron/ao_mca.c
update: usr/src/uts/i86pc/cpu/amd_opteron/
update: usr/src/uts/i86pc/cpu/amd_opteron/ao_poll.c
update: usr/src/uts/i86pc/io/mc/mcamd.h
update: usr/src/uts/i86pc/io/mc/mcamd_drv.c
update: usr/src/uts/i86pc/io/mc/
update: usr/src/uts/i86pc/io/mc/mcamd_subr.c
update: usr/src/uts/i86pc/mc-amd/Makefile
update: usr/src/uts/i86pc/os/cmi.c
update: usr/src/uts/i86pc/os/cpuid.c
update: usr/src/uts/i86pc/sys/cpu_module.h
update: usr/src/uts/i86pc/sys/cpu_module_impl.h
update: usr/src/uts/intel/sys/fm/cpu/AMD.h
update: usr/src/uts/intel/sys/mc.h
update: usr/src/uts/intel/sys/mc_amd.h
update: usr/src/uts/intel/sys/mca_amd.h
update: usr/src/uts/intel/sys/x86_archext.h
create: usr/src/cmd/fm/modules/common/eversholt/eft_mdb.c
create: usr/src/uts/i86pc/io/mc/mcamd_dimmcfg.c
create: usr/src/uts/i86pc/io/mc/mcamd_dimmcfg.h
create: usr/src/uts/i86pc/io/mc/mcamd_dimmcfg_impl.h
create: usr/src/uts/i86pc/io/mc/mcamd_pcicfg.c
create: usr/src/uts/i86pc/io/mc/mcamd_pcicfg.h
These changes will appear in Nevada build 51, and are slated for backport to Solaris 10 Update 4 (which is a bit distant, but Update 3 is almost out the door and long-since frozen for project backports).

New RAS Features In Revision F

In terms of error handling the processor cores and attendant caches in revision F chips are substantially unchanged from earlier revisions. There are no new error types detected in these banks (icache, dcache, load-store unit, bus unit aka l2cache) so there are no additional error reports for us to raise, nor any change required to the diagnosis rules that consume those error reports.

While the switch to DDR2 is a substantial feature change for the AMD chip (and makes it even faster!) this really only affects how we discover memory configuration. The types of error and fault that DDR2 memory DIMMs can experience are much the same as for DDR1, and so the set of errors reported from memory are nearly identical across all revisions - revision F adds parity checking on the dram command and address lines, and otherwise detects and reports the same set of errors.

An Online Spare Chip-Select is introduced in revision F. The BIOS can choose (if the option is available and selected) to set aside a chip-select on each memory node to act as a standby should any other chip-select be determined to be bad. When a bad chip-select is diagnosed, software (OS or BIOS) can write the details to the Online Spare Control Register to initiate a copy of the bad chip-select contents to the online spare, and to redirect all access to the bad chip-select to the spare. The copy proceeds at a huge rate (over 1GB/s) and does not interrupt system operation at all.

Revision F introduces some hardware support for counting ECC error occurrences. The NorthBridge counts all ECC errors it sees (regardless of source on that memory node) in a 12-bit counter of the DRAM Errors Threshold Register. Software can set a starting value for this counter which will increment with every ECC error observed by the NorthBridge; when the counter overflows (tries to go beyond 4095) software will receive a preselected interrupt (either for the OS or for the BIOS, depending on setup). The interrupt handler can diagnose a DIMM as faulty, perhaps, although it does not know from this information alone which DIMM(s) to suspect.

The Online Spare Control Register also exposes 16 4-bit counters - one for each of 8 possible chip-selects on each of dram channels A and B. Again software may initialize the counters and receive an interrupt when the counter overflows. With information from all chip-selects of each channel it is possible to narrow the source to a single rank (side) of a single DIMM.
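
The counting scheme described above can be modelled in a few lines. This is only an illustrative sketch in Python: the class and function names are hypothetical and are not Solaris or AMD register interfaces, and what the hardware counter holds after an overflow is a modelling choice here rather than something taken from the AMD documentation. It shows how a starting value is chosen so that the Nth error is the one that overflows, and how per-chip-select counters point at one channel and chip-select.

```python
# Illustrative model only: names are hypothetical, not Solaris or AMD interfaces.

class SaturatingEventCounter:
    """An n-bit up-counter that signals when an increment would overflow."""

    def __init__(self, bits, start=0):
        self.max_value = (1 << bits) - 1   # 4095 for 12 bits, 15 for 4 bits
        if not 0 <= start <= self.max_value:
            raise ValueError("start value does not fit in the counter")
        self.value = start

    def record_error(self):
        """Count one ECC error; return True if the counter overflowed."""
        if self.value == self.max_value:
            return True                     # tried to go beyond the maximum
        self.value += 1
        return False


def start_value_for(bits, errors_before_interrupt):
    """Starting value that makes the Nth error the one that overflows."""
    return (1 << bits) - errors_before_interrupt


def first_overflow(events, bits=4, threshold=8):
    """Feed (channel, chip-select) error events to 16 per-chip-select
    counters and return the first (channel, chip-select) that overflows."""
    counters = {
        (channel, cs): SaturatingEventCounter(bits, start_value_for(bits, threshold))
        for channel in "AB"
        for cs in range(8)
    }
    for channel, cs in events:
        if counters[(channel, cs)].record_error():
            return (channel, cs)
    return None
```

With a 12-bit counter and a wish to be interrupted on the 100th error, start_value_for(12, 100) gives 3996; the 4-bit counters work the same way with a ceiling of 15.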

An Overview of Code Changes

With processor cores substantially unchanged from our point-of-view, introducing error handling and fault management for cpu errors for these chips was pretty trivial - really just a question of allowing the cpu module to initialize in ao_init.

The changes for the transition from DDR1 to DDR2 memory are all within the memory controller and dram controllers in the on-chip NorthBridge, seen by Solaris as additions and changes to the memory controller registers that it can quiz and set via PCI config space accesses. We replaced our previous #define collection for bitmasks and bitfields of the memory controller registers with data structures that describe the bitfields for all revisions, plus some MC_REV_* and MCREG_* macros for easy revision-test and structure access; the mcamd_drv.c code uses these macros, which flattens out the differences between revisions.

Our mc-amd driver is responsible for discovering the memory configuration details of a system, and also for performing error address to dimm resource translation. So with it now able to read the changed memory controller registers the next step was to teach it to interpret the values in the bitfields of those registers as changed for revision F. Most of the work here is in teaching it how to resolve a dram address into row, column and internal bank addresses for the new revision, but the error address to dimm resource translation algorithm also required changes to allow for the possible presence of an online spare chip-select.

The cpu module cpu.AuthenticAMD.15 for AMD family 15 (0xf) mostly required changes to allow for revision-dependent error detector bank configuration. It also was changed to initialize the online spare control register and NorthBridge ECC Error Threshold Register - more about that below - and to provide easier control over its hardware configuration activities for the three hardware scrubbers, the hardware watchdog, and the NorthBridge MCA Config register.

The mc-amd driver also grew to understand all possible DIMM configurations on the AMD platform. It achieves this through working out which socket type we're running on (754, 939, 940, AM2, F(1207), S1g1) and what class of DIMMs we are working with - normal, quadrank registered DIMMs, or quadrank SODIMM - and performing a table lookup. The result determines, for each chip-select base and mask pair, how the resulting logical DIMM is numbered and which physical chip-select lines are used in operating that chip-select (from channel, socket number and rank number). The particular table selected for lookup is also determined by whether mismatched DIMM support is present - unmatched DIMMs present in the channelA/channelB DIMM socket pairs. This information determines how we will number the DIMMs and ranks thereon in the topology that is built in the mc-amd driver and thereafter in full glory in the topology library. This is very important, since the memory controller registers only work in terms of nodes and chip-selects while we want to diagnose down to individual DIMM and rank thereon (requesting replacement of a particular DIMM is so much better than requesting replacement of a chip-select).
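
To make the table lookup concrete, here is a minimal sketch in Python (the driver itself is C, and its real tables are far larger). Everything here is hypothetical: the table layout, the key fields and the function name are invented for illustration. Only the chip-select 0 row reflects values that appear in the topology property listing later in this entry; the chip-select 1 row is an assumed continuation of the same pattern.

```python
# Hypothetical, heavily abridged lookup table; not the mc-amd driver's tables.
# Key:   (socket type, DIMM class, mismatched DIMM support, chip-select number)
# Value: one (dimm number, rank number, chip-select line) tuple per channel.
DIMM_CONFIG = {
    # Row matching the property listing for a socket AM2 system.
    ("AM2", "normal", False, 0): ((0, 0, "MA0_CS_L[0]"), (1, 0, "MB0_CS_L[0]")),
    # Assumed row: the second chip-select built from the rank=1 ranks.
    ("AM2", "normal", False, 1): ((0, 1, "MA0_CS_L[1]"), (1, 1, "MB0_CS_L[1]")),
}


def dimms_for_chip_select(socket, dimm_class, mismatched, csnum):
    """Return the channel A and channel B (dimm, rank, cs line) tuples
    that make up the given chip-select base/mask pair."""
    key = (socket, dimm_class, mismatched, csnum)
    try:
        return DIMM_CONFIG[key]
    except KeyError:
        raise LookupError("no DIMM configuration entry for %r" % (key,)) from None
```

The point of the design is that the registers only know about chip-selects, so all knowledge of how a socket type wires chip-select lines to DIMM sockets has to live in a table like this one.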

With detailed information on the ranks of each DIMM now available I changed the cpu/memory diagnosis topology to list the ranks of a DIMM, and the diagnosis rules to diagnose down to the individual DIMM rank. This was so as to facilitate a swap to any online spare chip-select when we diagnose a dimm fault (chip-selects are built from ranks, not DIMMs). The new cpu/memory topology is as follows:

This is a single-socket (chip=0) dual-core (cpu=0 and cpu=1 on chip 0) socket AM2 system with two DIMMs installed (dimm=0 and dimm=1), both dual-rank (rank=0 and rank=1 on each DIMM). The DIMMs are numbered 0 and 1 based on the DIMM configuration information discussed above - chip-select base/mask pair 0 is active for the 128-bit wide chip-select formed from the two rank=0 ranks, and chip-select base/mask pair 1 is active for the 128-bit wide chip-select formed from the two rank=1 ranks. All of this information, and more, is present as properties on the topology nodes. The following shows the properties of the memory-controller node, one active chip-select on it and the two DIMMs contributing to that chip-select (one from each channel) and the specific ranks on those two DIMMs that are used in the chip-select.
  group: memory-controller-properties   version: 1   stability: Private/Private
    num               uint64    0x0
    revision          uint64    0x20f0020
    revname           string    F
    socket            string    Socket AM2
    ecc-type          string    Normal 64/8
    base-addr         uint64    0x0
    lim-addr          uint64    0x7fffffff
    node-ilen         uint64    0x0
    node-ilsel        uint64    0x0
    cs-intlv-factor   uint64    0x2
    dram-hole-size    uint64    0x0
    access-width      uint64    0x80
    bank-mapping      uint64    0x2
    bankswizzle       uint64    0x0
    mismatched-dimm-support uint64    0x0

  group: chip-select-properties         version: 1   stability: Private/Private
    num               uint64    0x0
    base-addr         uint64    0x0
    mask              uint64    0x7ffeffff
    size              uint64    0x40000000
    dimm1-num         uint64    0x0
    dimm1-csname      string    MA0_CS_L[0]
    dimm2-num         uint64    0x1
    dimm2-csname      string    MB0_CS_L[0]

    num               uint64    0x0
    size              uint64    0x40000000

  group: rank-properties                version: 1   stability: Private/Private
    size              uint64    0x20000000
    csname            string    MA0_CS_L[0]
    csnum             uint64    0x0

    num               uint64    0x1
    size              uint64    0x40000000

  group: rank-properties                version: 1   stability: Private/Private
    size              uint64    0x20000000
    csname            string    MB0_CS_L[0]
    csnum             uint64    0x0
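
As a quick sanity check on how these properties fit together, the arithmetic can be written out. This is a small Python sketch using only values copied from the listing above; the function names are mine, and the assumption that the 128-bit access width is split evenly across the two dram channels is stated in the comments.

```python
# Values copied from the property listing above.
RANK_SIZE = 0x20000000         # rank-properties: size
CHIP_SELECT_SIZE = 0x40000000  # chip-select-properties: size
ACCESS_WIDTH = 0x80            # memory-controller-properties: access-width (bits)
BASE_ADDR = 0x0                # memory-controller-properties: base-addr
LIM_ADDR = 0x7fffffff          # memory-controller-properties: lim-addr
CHANNELS = 2                   # dram channels A and B


def chip_select_size(rank_size, channels):
    """A chip-select is built from one rank on each channel."""
    return rank_size * channels


def node_size(base_addr, lim_addr):
    """Bytes of memory covered by this memory controller."""
    return lim_addr - base_addr + 1


def chip_selects_on_node(base_addr, lim_addr, cs_size):
    """How many equally sized chip-selects fit in the node's range."""
    return node_size(base_addr, lim_addr) // cs_size


# Assumption: the access width is shared evenly between the channels.
bits_per_channel = ACCESS_WIDTH // CHANNELS
```

Two 512MB ranks give the 1GB chip-select shown, and the 2GB node range holds two such chip-selects, which agrees with the rank=0 and rank=1 chip-selects described in the text.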

ECC Error Thresholding, And Lack Thereof

At boot, in functions nb_mcamisc_init and ao_sparectl_cfg, Solaris clears any BIOS-registered interrupt for ECC counter overflow and does not install its own. Thus we choose not to make use of the hardware ECC counting and thresholding mechanism, since we can already perform more advanced counting and diagnostic analysis in our diagnosis engine rules. The payload of ECC error ereports is modified to include the NorthBridge Threshold Register in case it should prove interesting in any required human analysis of the ereport logs (not normally required); the counts from the Online Spare Control Register are not in the payload (I may add them soon).

Further Work In Progress

The following are all in-progress in the FMA group at the moment, with the intention and hope of catching Solaris 10 Update 4 (some may miss if they prove to be more difficult than expected).

PCI/PCIE Diagnosis and Driver Hardening API

This is already in Nevada, since build 43. It provides both full PCI/PCIE diagnosis and an API with which drivers may harden themselves to various types of device error.

Topology Library Updates

This is approaching putback to Nevada. The topology library is one of the foundations of FMA as we move forward - it will be our repository for information on all things FMA.

More Advanced ECC Telemetry Analysis

In build 51 memory ECC telemetry is fed into simple SERD engines which "count" occurrences and decay them over time (so events from months ago are not necessarily considered to match and aggravate occurrences from seconds ago). We are currently developing more advanced diagnosis rules which will distinguish cases of isolated memory cell faults, a frequent error from something like a stuck pin or failed sdram in ChipKill mode, and will check which bits are in error to see if an uncorrectable error appears to be imminent.
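
The count-and-decay idea can be sketched as a sliding-window counter. This is a generic illustration in Python, not the fmd implementation: the class name, the N and T parameters and the exact firing condition (N events inside a window of length T) are assumptions made for the example.

```python
from collections import deque


class SerdEngine:
    """Sliding-window event counter: fires when n events fall within t seconds.

    A generic sketch of the idea only; the real fault manager's engines and
    their exact firing rules are not reproduced here.
    """

    def __init__(self, n, t):
        self.n = n              # events needed to fire
        self.t = t              # window length in seconds
        self.events = deque()   # timestamps still inside the window

    def record(self, timestamp):
        """Record one event (timestamps must be non-decreasing).

        Returns True if the engine fires on this event.
        """
        self.events.append(timestamp)
        # Decay: forget events that have aged out of the window.
        while timestamp - self.events[0] > self.t:
            self.events.popleft()
        return len(self.events) >= self.n
```

For example, with n=3 and t=60 two errors followed by a long quiet period do not count towards a later burst; only three errors inside the same 60 seconds fire the engine.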

CPU/Memory FRU Labelling for AMD Platforms

I touched on how we number cpus and memory above. How these relate to the actual FRU labels on a random AMD platform, if at all, is difficult to impossible to determine in software (notwithstanding, or perhaps because of, the efforts of SMBIOS etc!). So our "dimm0" may be silkscreened as "A0", "B0D0", "BOD3", "DIMM5" etc. With SMBIOS proving to be pretty useless in providing these labels accurately, never mind associating them with the memory configuration one can discover through the memory controller, we will resort to some form of hardcoded table for at least the AMD platforms that Sun has shipped.

CPU/Memory FRU Serial Numbers for AMD Platforms

Again, these are tricky to come by in a generic fashion but for Sun platforms we can teach our software the various ways of retrieving serial numbers. This will assist in replacement of the correct FRU, and in allowing the FMA software to detect when a faulted FRU has been replaced.

Thursday Mar 16, 2006

AMD Opteron/Athlon64/Turion64 Fault Management

Fault Management for Athlon64 and Opteron

In February we (the Solaris Kernel/RAS group) integrated the "fma x64" project into Solaris Nevada, delivering Fault Management for the AMD K8 family of chips (Athlon(TM) 64, Opteron(TM), Turion(TM) 64). This brings fault management on the Solaris AMD64 platform for cpu and memory up to par with that already present on Sun's current SPARC platforms, and addresses one of the most-requested missing functionalities required by customers (or potential customers) of Sun's impressive (and growing) AMD Opteron family offerings (the project, of course, benefits all AMD 64-bit systems not just those from Sun). We had planned some blogging from the rooftops about the project back at integration, but instead we all concentrated on addressing the sleep deficit that the last few hectic weeks of a big project brought and since putback to the Solaris 11 gate there has also been much effort in preparing the backport to Solaris 10 Update 2 (aka Solaris 10 06/06).

Well, it has already hit the streets now that Solaris Express Community Edition build 34 is available for download and the corresponding source is available at (around 315 files, search for "2006/020" in file history). There are a few bug fixes that will appear in build 36, but build 34 has all of the primary fault management functionality.

In this blog I'll attempt an overview of the project functionality, with some examples. In future entries I'll delve into some of the more subtle nuances. Before I begin I'll highlight something the project does not deliver: any significant improvement in machine error handling and fault diagnosis for Intel chips (i.e., anything more than a terse console message). This stuff is very platform/chip dependent, and since Sun has a number of AMD Opteron workstation and server offerings with much more in the pipeline it was the natural first target for enhanced support. The project also does not deliver hardened IO drivers and corresponding diagnosis - that is the subject of a follow-on project due in a couple of months.

AMD64 Platform Error Detection Architecture Overview

The following image shows the basics of a 2 chip dual-core (4 cores total) AMD system (with apologies to StarOffice power users):

Each chip may have more than one core (current AMD offerings have up to two cores). Each core has an associated on-chip level 1 instruction cache, level 1 data cache, and level 2 cache. There is also an on-chip memory controller (one per chip) which can control up to 8 DDR memory modules. All cores on all chips can access all memory, but the access is not uniform - accesses to a "remote" node involve a HyperTransport request to the owning node which will respond with the data. For historical reasons the functional unit within the chip that includes the memory controller, dram controller, hypertransport logic, crossbar and no doubt a few other tricks is known as the "Northbridge".

This project is concerned with the initial handling of an error, marshalling of error telemetry from the cpu and memory components (a followup project, due in the next few months, will do the same for telemetry from the IO subsystem), and then consuming that telemetry to produce any appropriate diagnosis of any fault that is determined to be present. These chips have a useful array of error detectors, as described in the following table:

Functional Unit              Array                            Protection
---------------              -----                            ----------
Instruction Cache (IC)       Icache main tag array            Parity
                             Icache snoop tag array           Parity
                             Instruction L1 TLB               Parity
                             Instruction L2 TLB               Parity
                             Icache data array                Parity
Data Cache (DC)              Dcache main tag array            Parity
                             Dcache snoop tag array           Parity
                             Dcache L1 TLB                    Parity
                             Dcache L2 TLB                    Parity
                             Dcache data array                ECC
L2 Cache ("bus unit") (BU)   L2 cache main tag array          ECC and Parity
                             L2 cache data array              ECC
Northbridge (NB)             Memory controlled by this node   ECC (depends on dimms)

Table 1: SRAM and DRAM memory arrays

There are a number of other detectors present, notably in the Northbridge, but the above are the main sram and dram arrays.

If an error is recoverable then it does not raise a Machine Check Exception (MCE or mc#) when detected. The recoverable errors, broadly speaking, are single-bit ECC errors from ECC-protected arrays and parity errors on clean parity-protected arrays such as the Icache and the TLBs (translation lookaside buffers - virtual to physical address translations). Instead of raising a mc#, the recoverable errors simply log error data into machine check architecture registers of the detecting bank (IC/DC/BU/NB, and one not mentioned in the table above, the Load-Store unit LS) and the operating system (or advanced BIOS implementations) can poll those registers to harvest the information. No special handling of the error is required (e.g., no need to flush caches, panic etc).

If an error is irrecoverable then detection of that error will raise a machine check exception (if the bit that controls mc# for that error type is set; if not you'll either never know or you pick it up by polling). The mc# handler can extract information about the error from the machine check architecture registers as before, but has the additional responsibility of deciding what further actions (which may include panic and reboot) are required. A machine check exception is a form of interrupt which allows immediate notification of an error condition - you can't afford to wait to poll for the error since that could result in the use of bad data and associated data corruption.

Traditional Error Handling - The "Head in the Sand" Approach

The traditional operating system (all OS, not just Solaris) approach to errors in the x86 cpu architecture is as follows:

  • leave the BIOS to choose which error detecting banks to enable, and which irrecoverable errors that are detected will raise a mc#
  • if a mc# is raised the OS fields it and terminates with a useful diagnostic message such as "Machine Check Occured", not even hinting at the affected resource
  • ignore recoverable errors, i.e. don't poll for their occurrence; a more advanced BIOS will perhaps poll for these errors but is not in a position to do anything about them while the OS is running
That is not unreasonable for the x86 cpus of years gone by, since they typically had very little in the way of data protection either on-chip or for memory. But more recently they have improved in this area, and protection of the on-chip arrays is now common as is ECC-protected main memory. With the sizes of on-chip memory arrays such as the L2 cache growing, especially for chips offered for typical server use, there is also all the more chance that they will have defects introduced during manufacturing, subsequent handling, while installed etc.

Recognising the increased need for error handling and fault management on the x86 platform, some operating systems have begun to offer limited support in this area. Solaris has been doing this for some time on sparc (let's just say that the US-II E-cache disaster did have some good side-effects!) and so in Solaris we will offer the well-rounded end-to-end fault management on amd64 platforms that we already have on sparc.

A Better Approach - Sun's Fault Management Architecture "FMA"

In a previous blog entry I described the Sun Fault Management Architecture. Error events flow into a Fault Manager and associated Diagnosis Engines which may produce fault diagnoses which can be acted upon not just for determining repair actions but also to isolate the fault before it affects system availability (e.g., to offline a cpu that is experiencing errors at a sustained rate). This architecture has been in use for some time now in the sparc world, and this project expands it to include AMD chips.

FMA for AMD64

To deliver FMA for AMD64 systems the project has:

  • made Solaris take responsibility (in addition to the BIOS) for deciding which error-detecting banks to enable and which error types will raise machine-check exceptions
  • taught Solaris how to recognize all the error types documented in the AMD Bios and Kernel Developer's Guide
  • delivered an intelligent machine-check exception handler and periodic poller (for recoverable errors) which collect all error data available, determine what error type has occurred and propagate it for logging, and take appropriate action (if any)
  • introduced cpu driver modules to the Solaris x86 kernel (as have existed on sparc for many years) so that features of a particular processor family (such as the AMD processors) may be specifically supported
  • introduced a memory-controller kernel driver module whose job it is to understand everything about the memory configuration of a node (e.g., to provide translation from a fault address to which dimm is affected)
  • developed rules for consuming the error telemetry with the "eft" diagnosis engine; these are written using the "eversholt" diagnosis language, and their task is to diagnose any faults that the incoming telemetry may indicate
  • delivered an enhanced "platform topology" library to describe the inter-relationship of the hardware components of a platform and to provide a repository for hardware component properties
An earlier putback, also as part of this project, introduced SMBIOS support to Solaris so that we can have some access to platform details (type, slot labelling, configuration, ...). That list seems pretty small for what was a considerable effort - there's lots of detail to each item.

With all this in place we are now able to diagnose the following fault classes:

Fault Class                  Description
-----------                  -----------
fault.cpu.amd.dcachedata     DC data array fault
fault.cpu.amd.dcachetag      DC main tag array fault
fault.cpu.amd.dcachestag     DC snoop tag array fault
fault.cpu.amd.l1dtlb         DC L1TLB fault
fault.cpu.amd.l2dtlb         DC L2TLB fault
fault.cpu.amd.icachedata     IC data array fault
fault.cpu.amd.icachetag      IC main tag array fault
fault.cpu.amd.icachestag     IC snoop tag array fault
fault.cpu.amd.l1itlb         IC L1TLB fault
fault.cpu.amd.l2itlb         IC L2TLB fault
fault.cpu.amd.l2cachedata    L2 data array fault
fault.cpu.amd.l2cachetag     L2 tag fault
                             Individual page fault
fault.memory.dimm_sb         A DIMM experiencing sustained excessive single-bit errors
fault.memory.dimm_ck         A DIMM with ChipKill-correctable multiple-bit faults
fault.memory.dimm_ue         A DIMM with an uncorrectable (not even with ChipKill, if present and enabled) multiple-bit fault

An Example - A CPU With Single-bit Errors

The system is a v40z with hostname 'parity' (we also have other cheesy hostnames such as 'chipkill', 'crc', 'hamming' etc!). It has 4 single-core Opteron cpus. If we clear all fault management history and let it run for a while (or give it a little load to speed things up) we very soon see the following message on the console:

SUNW-MSG-ID: AMD-8000-5M, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Wed Mar 15 08:06:08 PST 2006
PLATFORM: i86pc, CSN: -, HOSTNAME: parity
SOURCE: eft, REV: 1.16
EVENT-ID: 1e8e6d7a-fd3a-e4a4-c211-dece00e68c65
DESC: The number of errors associated with this CPU has exceeded acceptable levels.
Refer to for more information.
AUTO-RESPONSE: An attempt will be made to remove this CPU from service.
IMPACT: Performance of this system may be affected.
REC-ACTION: Schedule a repair procedure to replace the affected CPU.
Use fmdump -v -u <EVENT-ID> to identify the module.
Running the indicated command we see that cpu 3 has a fault:
# fmdump -v -u 1e8e6d7a-fd3a-e4a4-c211-dece00e68c65
TIME                 UUID                                 SUNW-MSG-ID
Mar 15 08:06:08.5797 1e8e6d7a-fd3a-e4a4-c211-dece00e68c65 AMD-8000-5M
  100%  fault.cpu.amd.l2cachedata

        Problem in: hc:///motherboard=0/chip=3/cpu=0
           Affects: cpu:///cpuid=3
               FRU: hc:///motherboard=0/chip=3
That tells us the resource affected (chip 3, cpu core 0), its logical identifier (cpuid 3, as used in psrinfo etc), and the field replaceable unit that should be replaced (the chip, you can't replace a core). In future we intend to extract FRU labelling information from SMBIOS but at the moment there are difficulties with smbios data and the accuracy thereof that make that harder than it should be.

If you didn't see or notice the console message then running fmadm faulty highlights resources that have been diagnosed as faulty:

# fmadm faulty
-------- ----------------------------------------------------------------------
 faulted cpu:///cpuid=3
-------- ----------------------------------------------------------------------
In Solaris 11 already and coming to Solaris 10 Update 2 is SNMP trap support for FMA fault events, which provides another avenue by which you can become aware of a newly-diagnosed fault.

We can see the automated response that was performed upon making the diagnosis of a cpu fault:

# psrinfo
0       on-line   since 03/11/2006 00:27:08
1       on-line   since 03/11/2006 00:27:08
2       on-line   since 03/10/2006 23:28:51
3       faulted   since 03/15/2006 08:06:08
The faulted resource has been isolated by offlining the cpu. If you reboot then the cache of faults will cause the cpu to be offlined again.

Note that the event id appears in the fmadm faulty output, so you can formulate the fmdump command line shown in the console message if you wish and visit and enter the indicated SUNW-MSG-ID (quick aside: we have some people working on beefing up the amd64 knowledge articles there, the current ones are pretty uninformative). We can also use the event id to see what error reports led to this diagnosis:

# fmdump -e -u 1e8e6d7a-fd3a-e4a4-c211-dece00e68c65
TIME                 CLASS
Mar 15 08:05:18.1624 ereport.cpu.amd.bu.l2d_ecc1     
Mar 15 08:04:48.1624 ereport.cpu.amd.bu.l2d_ecc1     
Mar 15 08:04:48.1624 ereport.cpu.amd.dc.inf_l2_ecc1  
Mar 15 08:06:08.1624 ereport.cpu.amd.dc.inf_l2_ecc1  
The -e option selects dumping of the error log instead of the fault log, so we can see the error telemetry that led to the diagnosis. So we see that in the space of under a minute and a half (08:04:48 to 08:06:08) this cpu experienced 4 single-bit errors from the L2 cache - we are happy to tolerate occasional single-bit errors but not at this rate, so we diagnose a fault. If we use option -V we can see the full error report contents, for example for the last ereport above:
Mar 15 2006 08:06:08.162418201 ereport.cpu.amd.dc.inf_l2_ecc1
nvlist version: 0
        class = ereport.cpu.amd.dc.inf_l2_ecc1
        ena = 0x62a5aaa964f00c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = hc
                hc-list = (array of embedded nvlists)
                (start hc-list[0])
                nvlist version: 0
                        hc-name = motherboard
                        hc-id = 0
                (end hc-list[0])
                (start hc-list[1])
                nvlist version: 0
                        hc-name = chip
                        hc-id = 3
                (end hc-list[1])
                (start hc-list[2])
                nvlist version: 0
                        hc-name = cpu
                        hc-id = 0
                (end hc-list[2])

        (end detector)

        bank-status = 0x9432400000000136
        bank-number = 0x0
        addr = 0x5f76a9ac0
        addr-valid = 1
        syndrome = 0x64
        syndrome-type = E
        ip = 0x0
        privileged = 1
        __ttl = 0x1
        __tod = 0x44183b70 0x9ae4e19
One day we'll teach fmdump (or some new command) to mark all that stuff up into human-readable output. For now it shows the raw(ish) telemetry read from the machine check architecture registers when we polled for this event. This telemetry is consumed by the diagnosis rules to produce any appropriate fault diagnosis.

We're Not Done Yet

There are a number of additional features that we'd like to bring to amd64 fault management. For example:

  • use more SMBIOS info (on platforms that have SMBIOS support, and which give accurate data!) to discover FRU labelling etc
  • introduce serial number support so that we can detect when a cpu or dimm has been replaced (currently you have to perform a manual fmadm repair)
  • introduce some form of communication (if only one-way) between the service processor (on systems that have such a thing, such as the current Sun AMD offerings) and the diagnosis software
  • extend the diagnosis rules to perform more complex predictive diagnosis for DIMM errors based on what we have learned on sparc
Many of these are complicated by the historical lack of architectural standards for the "PC" platform. We have a solid start, but there are still plenty of functional and usability features we'd like to add.


Wednesday Jun 15, 2005

SPARC System Call Anatomy

Russ Blaine has described aspects of the x86 and x64 system call implementation in OpenSolaris. In this entry I'll describe the codeflow from userland to kernel and back for a SPARC system call.

Making A System Call

An application making a system call actually calls a libc wrapper function which performs any required posturing and then enters the kernel with a software trap instruction. This means that user code and compilers do not need to know the runes to enter the kernel, and allows binaries to work on later versions of the OS where perhaps the runes have been modified, system call numbers newly overloaded etc.

OpenSolaris for SPARC supports 3 software traps for entering the kernel:

S/W Trap #   Instruction   Description
0x0          ta 0x0        System calls for binaries running in SunOS 4.x binary compatibility mode
0x8          ta 0x8        32-bit (ILP32) binary running on 64-bit (LP64) kernel
0x40         ta 0x40       64-bit (LP64) binary running on 64-bit (LP64) kernel

Since OpenSolaris for SPARC (like Solaris itself since Solaris 10) no longer includes a 32-bit kernel, the case of an ILP32 syscall on an ILP32 kernel is no longer implemented.

In the wrapper function the syscall arguments are rearranged if necessary (the kernel function implementing the syscall may expect them in a different order to the syscall API, for example multiple related system calls may share a single system call number and select behaviour based on an additional argument passed into the kernel). It then places the system call number in register %g1 and executes one of the above trap-always instructions (e.g., the 32-bit libc will use ta 0x8 while the 64-bit libc will use ta 0x40). There's a lot more activity and posturing in the wrapper functions than described here, but for our purposes we simply note that it all boils down to a ta instruction to enter the kernel.

Handling A System Call Trap

A ta n instruction, as executed in userland by the wrapper function, results in a trap type 0x100 + n being taken and we move from traplevel 0 (where all userland and most kernel code executes) to traplevel 1 in nucleus context. Code that executes in nucleus context has to be handcrafted in assembler since nucleus context does not comply with the ABI and similar conventions and is generally much more restricted in what it can do. The task of the trap handler executing at traplevel 1 is to provide the necessary glue in order to get us back to TL0 and running privileged (kernel) C code that implements the actual system call.
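As a quick check of the arithmetic, the mapping from the software trap number in a ta instruction to the resulting trap type can be sketched in C. This is only an illustration: the enum and function names below are made up for the example and are not OpenSolaris symbols.

```c
#include <stdint.h>

/* Software trap numbers used for system calls, as listed earlier. */
enum {
    ST_SUNOS4 = 0x00,   /* SunOS 4.x binary compatibility */
    ST_ILP32  = 0x08,   /* 32-bit binary on 64-bit kernel */
    ST_LP64   = 0x40    /* 64-bit binary on 64-bit kernel */
};

/* A "ta n" instruction results in trap type 0x100 + n. */
static inline uint32_t
trap_type(uint32_t sw_trap)
{
    return (0x100 + sw_trap);
}

/* Hypothetical helper: the trap a libc of the given data model would use. */
static inline uint32_t
syscall_sw_trap(int is_lp64)
{
    return (is_lp64 ? ST_LP64 : ST_ILP32);
}
```

So a 32-bit libc lands on trap type 0x108 and a 64-bit libc on 0x140, which are the slots the syscall entries occupy in the trap table.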

The trap table entries for sun4u and sun4v for these traps are identical. I'm going to follow the two regular syscall traps and ignore the SunOS 4.x trap. Note that a trap table handler has just 8 instructions dedicated to it in the trap table - it must use these to do a little work and then to branch elsewhere:

/*
 * SYSCALL is used for system calls on both ILP32 and LP64 kernels
 * depending on the "which" parameter (should be either syscall_trap
 * or syscall_trap32).
 */
#define SYSCALL(which)                  \
        TT_TRACE(trace_gen)             ;\
        set     (which), %g1            ;\
        ba,pt   %xcc, sys_trap          ;\
        sub     %g0, 1, %g4             ;\
        .align  32


        /* hardware traps */
        /* user traps */
        GOTO(syscall_trap_4x);          /* 100  old system call */
        SYSCALL(syscall_trap32);        /* 108  ILP32 system call on LP64 */
        SYSCALL(syscall_trap)           /* 140  LP64 system call */

So in both cases we branch to sys_trap, requesting a TL0 handler of syscall_trap32 for an ILP32 syscall and syscall_trap for an LP64 syscall. In both cases we request that PIL remain as it currently is (always 0 since we came from userland). sys_trap is generic glue code that is used to take us from nucleus (TL>0) context back to TL0 running a specified handler (address in %g1, usually written in C) at a chosen PIL. The specified handler is called with arguments as given by registers %g2 and %g3 at the time we branch to sys_trap: the SYSCALL macro above does not move anything into these registers (no arguments to be passed to the handler). sys_trap handlers are always called with a first argument pointing to a struct regs that provides access to all the register values at the time of branching to sys_trap; for syscalls these will include the system call number in %g1 and arguments in output registers. Note that %g1 as prepared in the wrapper and %g1 as used in the SYSCALL macro for the trap table entry are not the same register: on trap we move from the regular globals (in which userland executes) to the alternate globals. The sys_trap glue collects all the correct (user) registers together and makes them available in the struct regs it passes to the handler.

sys_trap is also responsible for setting up our return linkage. When the TL0 handling is complete the handler will return, restoring the stack pointer and program counter as constructed in sys_trap. Since we trapped from userland it will be user_rtt that is interposed as the glue that TL0 handling code will return into, and which will get us back out of the kernel and into userland again.

Aside: Fancy Improving Something In OpenSolaris?

Adam Leventhal logged bug 4816328 "system call traps should go straight to user_trap" some time ago. As described above, the SYSCALL macro branches to sys_trap:

        ! force tl=1, update %cwp, branch to correct handler
        wrpr    %g0, 1, %tl
        rdpr    %tstate, %g5
        btst    TSTATE_PRIV, %g5
        and     %g5, TSTATE_CWP, %g6
        bnz,pn  %xcc, priv_trap
        wrpr    %g0, %g6, %cwp

Well we know that we're at TL1 and that we were unprivileged before the trap, so (aside from the current window pointer manipulation which, as Adam explains in the bug report, is not required coming from a syscall trap) we could save a few instructions by going straight to user_trap from the trap table. Adam's benchmarking suggests that this can save around 45ns per system call - more than 1% of a quick system call!

syscall_trap32(struct regs *rp);

We'll follow the ILP32 syscall route; the route for LP64 is analogous with trivial differences in terms of not having to clear the upper 32 bits of arguments etc. You can view the source here. This runs at TL0 as a sys_trap handler so could be written in C, however for performance and hands-on-DIY assembler-level reasons it is in assembler. Our task is to look up and call the nominated system call handler, performing the required housekeeping along the way.

        ldx     [THREAD_REG + T_CPU], %g1       ! get cpu pointer               
        mov     %o7, %l0                        ! save return addr              

First note that we do not obtain a new register window here - we will squat within the window that sys_trap crafted for itself. Normally this would mean that you'd have to live within the output registers, but by agreement handlers called via sys_trap are permitted to use registers %l0 thru %l3.

We begin by loading a pointer to the cpu this thread is executing on into %g1, and saving the return address (as constructed by sys_trap and passed in %o7) in %l0.

        ! If the trapping thread has the address mask bit clear, then it's      
        !   a 64-bit process, and has no business calling 32-bit syscalls.      
        ldx     [%o0 + TSTATE_OFF], %l1         ! saved %tstate.am is that
        andcc   %l1, TSTATE_AM, %l1             !   of the trapping proc        
        be,pn   %xcc, _syscall_ill32            !                               
          mov   %o0, %l1                        ! save reg pointer              

The comment says it all. The AM bit in the PSTATE at the time we trapped (when we executed the ta instruction) is available in the %tstate register after the trap, and sys_trap preserved that for us in the regs structure before it could be modified by further traps. Assuming we're not a 64-bit app making a 32-bit syscall:

        srl     %i0, 0, %o0                     ! copy 1st arg, clear high bits 
        srl     %i1, 0, %o1                     ! copy 2nd arg, clear high bits 
        ldx     [%g1 + CPU_STATS_SYS_SYSCALL], %g2                              
        inc     %g2                             ! cpu_stats.sys.syscall++       
        stx     %g2, [%g1 + CPU_STATS_SYS_SYSCALL]                              

The libc wrapper placed up to the first 6 arguments in %o0 thru %o5 (with the rest, if any, on stack). During sys_trap a SAVE instruction was performed to obtain a new register window, so those arguments are now available in the corresponding input registers (despite us not performing a save in syscall_trap32 itself). We're going to call the real handler so we prepare the arguments in our outputs (which we're sharing with sys_trap but outputs are understood to be volatile across calls). The shift-right-logical by 0 bits is a 32-bit operation (i.e., not srlx) so it performs no shifting but it does clear the uppermost 32 bits of the arguments. We also increment the statistic counting the number of system calls made by this cpu; this statistic is in the cpu_t and the offset, like most, is generated for us by genasym.
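For readers more at home in C than SPARC assembler, the effect of the "srl by 0" idiom can be modelled as a truncate-and-zero-extend. The function name srl0 is invented for this sketch.

```c
#include <stdint.h>

/*
 * Model of "srl %iN, 0, %oN": a 32-bit logical shift right by zero bits.
 * Nothing is actually shifted, but the result is zero-extended from
 * 32 bits, so the upper 32 bits of the 64-bit register end up clear.
 */
static inline uint64_t
srl0(uint64_t reg)
{
    return ((uint64_t)(uint32_t)reg);
}
```

Any junk a 32-bit application left in the upper half of an argument register is therefore discarded before the handler sees the value.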

        ! Set new state for LWP                                                 
        ldx     [THREAD_REG + T_LWP], %l2                                       
        mov     LWP_SYS, %g3                                                    
        srl     %i2, 0, %o2                     ! copy 3rd arg, clear high bits 
        stb     %g3, [%l2 + LWP_STATE]                                          
        srl     %i3, 0, %o3                     ! copy 4th arg, clear high bits 
        ldx     [%l2 + LWP_RU_SYSC], %g2        ! pesky statistics              
        srl     %i4, 0, %o4                     ! copy 5th arg, clear high bits 
        addx    %g2, 1, %g2                                                     
        stx     %g2, [%l2 + LWP_RU_SYSC]                                        
        srl     %i5, 0, %o5                     ! copy 6th arg, clear high bits 
        ! args for direct syscalls now set up                                   

We continue preparing arguments as above. Interleaved with these instructions we change the lwp_state member of the associated lwp structure (there must be one - a user thread made a syscall, this is not a kernel thread) to indicate it is running in-kernel (LWP_SYS, would have been LWP_USER prior to this update) and increment the count of the number of syscalls made by this particular lwp (there is a 1:1 correspondence between user threads and lwps these days).

Next we write a TRAPTRACE entry - only on DEBUG kernels. That's a topic for another day - I'll skip the code here, too.

While we're on the subject of tracing, note that the next code snippet includes mentions of SYSCALLTRACE. This is not defined in normal production kernels. But, of course, one of the great beauties of DTrace is that it doesn't require custom kernels to perform its tracing since it can insert/enable probes on-the-fly - so SYSCALLTRACE is near worthless now!

        ! Test for pre-system-call handling                                     
        ldub    [THREAD_REG + T_PRE_SYS], %g3   ! pre-syscall proc?             
#ifdef SYSCALLTRACE
        sethi   %hi(syscalltrace), %g4
        ld      [%g4 + %lo(syscalltrace)], %g4
        orcc    %g3, %g4, %g0                   ! pre_syscall OR syscalltrace?
#else
        tst     %g3                             ! is pre_syscall flag set?
#endif /* SYSCALLTRACE */
        bnz,pn  %icc, _syscall_pre32            ! yes - pre_syscall needed      
        ! Fast path invocation of new_mstate                                    
        mov     LMS_USER, %o0                                                   
        call    syscall_mstate                                                  
        mov     LMS_SYSTEM, %o1                                                 
        lduw    [%l1 + O0_OFF + 4], %o0         ! reload 32-bit args            
        lduw    [%l1 + O1_OFF + 4], %o1                                         
        lduw    [%l1 + O2_OFF + 4], %o2                                         
        lduw    [%l1 + O3_OFF + 4], %o3                                         
        lduw    [%l1 + O4_OFF + 4], %o4                                         
        lduw    [%l1 + O5_OFF + 4], %o5
3:
        ! lwp_arg now set up

If the curthread->t_pre_sys flag is set then we branch to _syscall_pre32 to call pre_syscall. If that does not abort the call it will reload the outputs with the args (they were lost on the call to pre_syscall) using lduw instructions from the regs area, loading just the lower 32-bit word of each arg (we can no longer use srl by 0 since no registers hold the arguments anymore), and branch back to label 3 above (as if we'd done the same after a call to syscall_mstate).

If we don't have pre-syscall work to perform then we call syscall_mstate(LMS_USER, LMS_SYSTEM) to record the transition from user to system state for microstate accounting purposes. Microstate accounting is always performed now - it used not to be the default and was enabled when desired.

After the unconditional call to syscall_mstate we reload the arguments from the regs struct into the output registers (as after the pre-syscall work). Evidently our earlier srl work on the args is a complete waste of time (although not expensive) since we always end up loading them from the passed regs structure. This appears to be a hangover from days when microstate accounting was not always enabled.

Aside: Another Performance Opportunity?

So we see that our original argument shuffling is always undone as we have to reload after a call for microstate accounting, at least. But those reloads are made from the regs structure (cache/memory accesses) while it is clear that the input registers remain untouched and we could simply perform register-to-register manipulations (srl for the 32-bit version, mov for the 64-bit version). Reading through and documenting code like this really is worthwhile - I'll log a bug now!

        ! Call the handler.  The %o's have been set up.                         
        lduw    [%l1 + G1_OFF + 4], %g1         ! get 32-bit code               
        set     sysent32, %g3                   ! load address of vector table  
        cmp     %g1, NSYSCALL                   ! check range                   
        sth     %g1, [THREAD_REG + T_SYSNUM]    ! save syscall code             
        bgeu,pn %ncc, _syscall_ill32                                            
          sll   %g1, SYSENT_SHIFT, %g4          ! delay - get index             
        add     %g3, %g4, %g5                   ! g5 = addr of sysentry         
        ldx     [%g5 + SY_CALLC], %g3           ! load system call handler      
        brnz,a,pt %g1, 4f                       ! check for indir()             
        mov     %g5, %l4                        ! save addr of sysentry         
        ! Yuck.  If %g1 is zero, that means we're doing a syscall() via the     
        ! indirect system call.  That means we have to check the                
        ! flags of the targetted system call, not the indirect system call      
        ! itself.  See return value handling code below.                        
        set     sysent32, %l4                   ! load address of vector table  
        cmp     %o0, NSYSCALL                   ! check range                   
        bgeu,pn %ncc, 4f                        ! out of range, let C handle it 
          sll   %o0, SYSENT_SHIFT, %g4          ! delay - get index             
        add     %g4, %l4, %l4                   ! compute & save addr of sysent 
4:
        call    %g3                             ! call system call handler

We load the nominated syscall number into %g1, sanity-check it for range, look up the entry at that index in the table of 32-bit system calls sysent32, and extract the registered handler (the real implementation). Ignoring the indirect syscall cruft, we then call the handler and the real work of the syscall is executed. Eric Schrock has described the sysent/sysent32 table in his blog entry on adding system calls to Solaris.
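The range check and table lookup can be sketched in C roughly as follows. This is a simplified model and not the kernel's code: the table contents, its size and every name prefixed with model_ are invented for illustration, and the indirect syscall case is left out.

```c
#include <stdint.h>

typedef int64_t (*model_fn_t)(uint64_t, uint64_t, uint64_t,
    uint64_t, uint64_t, uint64_t);

/* Simplified stand-in for a sysent entry: flags plus handler. */
struct model_sysent {
    uint16_t   sy_flags;
    model_fn_t sy_callc;
};

#define MODEL_NSYSCALL 4        /* illustrative table size only */

static int64_t
model_nosys(uint64_t a0, uint64_t a1, uint64_t a2,
    uint64_t a3, uint64_t a4, uint64_t a5)
{
    (void)a0; (void)a1; (void)a2; (void)a3; (void)a4; (void)a5;
    return (-1);
}

static int64_t
model_add(uint64_t a0, uint64_t a1, uint64_t a2,
    uint64_t a3, uint64_t a4, uint64_t a5)
{
    (void)a2; (void)a3; (void)a4; (void)a5;
    return ((int64_t)(a0 + a1));
}

static const struct model_sysent model_table[MODEL_NSYSCALL] = {
    { 0, model_nosys },
    { 0, model_add },
    { 0, model_nosys },
    { 0, model_nosys }
};

/*
 * Range-check the syscall number, index the table and call the handler.
 * Returns 0 and stores the handler's result on success, or -1 when the
 * number is out of range (where the assembler branches to
 * _syscall_ill32 instead).
 */
static int
model_dispatch(uint32_t code, const uint64_t args[6], int64_t *rval)
{
    const struct model_sysent *callp;

    if (code >= MODEL_NSYSCALL)         /* unsigned compare, like bgeu */
        return (-1);
    callp = &model_table[code];
    *rval = callp->sy_callc(args[0], args[1], args[2],
        args[3], args[4], args[5]);
    return (0);
}
```

The unsigned comparison matters: it rejects both numbers past the end of the table and anything that would look negative if treated as signed.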

        ! If handler returns long long then we need to split the 64 bit         
        ! return value in %o0 into %o0 and %o1 for ILP32 clients.               
        lduh    [%l4 + SY_FLAGS], %g4           ! load sy_flags                 
        andcc   %g4, SE_64RVAL | SE_32RVAL2, %g0 ! check for 64-bit return      
        bz,a,pt %xcc, 5f                                                        
          srl   %o0, 0, %o0                     ! 32-bit only                   
        srl     %o0, 0, %o1                     ! lower 32 bits into %o1        
        srlx    %o0, 32, %o0                    ! upper 32 bits into %o0
5:

For ILP32 clients we need to massage 64-bit return types into 2 adjacent and paired registers.
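In C terms the split looks like this. The struct and function names are made up for the example; the two assignments mirror the srl and srlx instructions in the snippet above.

```c
#include <stdint.h>

/* The register pair an ILP32 caller reads a 64-bit result from. */
struct rval_pair {
    uint32_t o0;    /* upper 32 bits */
    uint32_t o1;    /* lower 32 bits */
};

static inline struct rval_pair
split_rval64(uint64_t rv)
{
    struct rval_pair p;

    p.o1 = (uint32_t)rv;            /* srl  %o0, 0, %o1  */
    p.o0 = (uint32_t)(rv >> 32);    /* srlx %o0, 32, %o0 */
    return (p);
}
```

Note the order: the low half has to be copied out to %o1 first, because the second instruction overwrites %o0.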

        ! Check for post-syscall processing.                                    
        ! This tests all members of the union containing t_astflag, t_post_sys, 
        ! and t_sig_check with one test.                                        
        ld      [THREAD_REG + T_POST_SYS_AST], %g1                              
        tst     %g1                             ! need post-processing?         
        bnz,pn  %icc, _syscall_post32           ! yes - post_syscall or AST set 
        mov     LWP_USER, %g1                                                   
        stb     %g1, [%l2 + LWP_STATE]          ! set lwp_state                 
        stx     %o0, [%l1 + O0_OFF]             ! set rp->r_o0                  
        stx     %o1, [%l1 + O1_OFF]             ! set rp->r_o1                  
        clrh    [THREAD_REG + T_SYSNUM]         ! clear syscall code            
        ldx     [%l1 + TSTATE_OFF], %g1         ! get saved tstate              
        ldx     [%l1 + nPC_OFF], %g2            ! get saved npc (new pc)        
        mov     CCR_IC, %g3                                                     
        sllx    %g3, TSTATE_CCR_SHIFT, %g3                                      
        add     %g2, 4, %g4                     ! calc new npc                  
        andn    %g1, %g3, %g1                   ! clear carry bit for no error  
        stx     %g2, [%l1 + PC_OFF]                                             
        stx     %g4, [%l1 + nPC_OFF]                                            
        stx     %g1, [%l1 + TSTATE_OFF]                                         

If post-syscall processing is required then branch to _syscall_post32 which will call post_syscall and then "return" by jumping to the return address passed by sys_trap (which is always user_rtt for syscalls). If not then change the lwp_state back to LWP_USER and stash the return value (possibly in 2 registers as above) in the regs structure, clear the curthread->t_sysnum since we're no longer executing a syscall, and step the PC and nPC values on so that the RETRY instruction at the end of user_rtt which we're about to "return" into will not simply re-execute the ta instruction.
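The fast-path bookkeeping on the saved registers can be modelled in C as below. The struct is a cut-down stand-in for struct regs, and the two constants are assumptions made for the sketch (carry as the low bit of the icc field, with the CCR field starting at bit 32 of %tstate); check the kernel headers for the real definitions.

```c
#include <stdint.h>

/* Assumed values, for illustration only. */
#define MODEL_CCR_IC            0x1ULL
#define MODEL_TSTATE_CCR_SHIFT  32

/* Cut-down stand-in for the saved-register area. */
struct model_regs {
    uint64_t r_tstate;
    uint64_t r_pc;
    uint64_t r_npc;
};

static void
model_syscall_return(struct model_regs *rp)
{
    uint64_t npc = rp->r_npc;

    /* Clear the carry bit: the syscall is reporting "no error". */
    rp->r_tstate &= ~(MODEL_CCR_IC << MODEL_TSTATE_CCR_SHIFT);

    /* Step past the ta instruction so that it is not re-executed. */
    rp->r_pc = npc;
    rp->r_npc = npc + 4;
}
```

If the ta sat at address 0x1000, the saved PC/nPC pair of 0x1000/0x1004 becomes 0x1004/0x1008, so userland resumes at the instruction after the trap.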

        ! fast path outbound microstate accounting call                         
        mov     LMS_SYSTEM, %o0                                                 
        call    syscall_mstate                                                  
        mov     LMS_USER, %o1                                                   
        jmp     %l0 + 8                                                         

Transition our state from system to user again (for microstate accounting purposes) and "return" through user_rtt as arranged by sys_trap. It is the task of user_rtt to get us back out of the kernel to resume at the instruction indicated in %tstate (for which we stepped the PC and nPC) and continue execution in userland.


Tuesday Jun 14, 2005


thread_nomigrate(): Environmentally friendly prevention of kernel thread migration

The launch of OpenSolaris today means that as a Solaris developer I can take the voice I have already been given and talk not just in general about aspects of Solaris in which I work but in detail and with source freely quoted and referenced as I wish!  We've come a long way - who'd have thought several years ago that employees (techies, even!) would have the freedom to discuss in public what we do for a living in the corporate world (as has been possible for some time now) and now, with OpenSolaris, not just talk in general about subject matter but also discuss the design and implementation.  Fabulous!

I thought I'd start by describing a kernel-private interface I added in Solaris 10 which can be used to request short-term prevention of a kernel thread from migrating between processors.  Thread migration refers to a thread changing processors - running on one processor until preemption or blocking and then resuming on a different processor.  A description of thread_nomigrate (the new interface) soon turns into a mini tour of some aspects of the dispatcher (I don't work in dispatcher land much, I just have an interest in the area, and I had a project that required this functionality).

A Quick Overview of Processor Selection

I'm not going to attempt a nitty-gritty detailed story here - just enough for the discussion below.

The kthread_t member t_state tracks the current run state of a kernel thread.  State TS_ONPROC indicates that a thread is currently running on a processor.  This state is always preceded by state TS_RUN - runnable but not yet on a processor.  Threads in state TS_RUN are enqueued on various dispatch queues; each processor has a bunch of dispatch queues (one for every priority level) and there are other global dispatch queues such as the partition-wide preemption queue.  All enqueuing to dispatch queues is performed by the dispatcher workhorses setfrontdq and setbackdq.  It is these functions which honour processor and processor-set binding requests or call cpu_choose to select the best processor to enqueue on.  When a thread is enqueued on a dispatch queue of some processor it is nominally aimed at being run on that processor, and in most cases will be;  however idle processors may choose to run suitable threads initially dispatched to other processors. Eric Saxe has described a lot more of the operation of the dispatcher and scheduler in his opening day blog.

Requirements for "Weak Binding"

There were already a number of ways of avoiding migration (for a thread not already permanently bound, such as an interrupt thread):

  • Raise IPL to above LOCK_LEVEL.

    Not something you want to do for more than a moment, but it is one way to avoid being preempted and hence also to avoid migration (for as long as the state persists).  Not suitable for general use.

  • processor_bind System Call.

    processor_bind implements the corresponding system call which may be used directly from applications or could be the result of a use of pbind(1M).  It acquires cpu_lock and uses cpu_bind_{thread,process,task,project,zone,contract} depending on the arguments. Function cpu_bind_thread locks the current thread and records the new user-selected binding by processor id in t_bind_cpu of the kthread structure and again but by cpu structure address in t_bound_cpu, and then requeues the thread if it was waiting on a dispatch queue somewhere (thread state TS_RUN) or pokes it off cpu if it is currently on cpu (possibly not the one to which we've just bound it) to force it through the dispatcher, at which point the new binding will take effect (it will be noticed in setfrontdq/setbackdq).  The others - cpu_bind_process etc - are built on top of cpu_bind_thread and on each other.

  • thread_affinity_set(kthread_id_t t, int cpu_id) and thread_affinity_clear(kthread_id_t).

    The artist previously known as affinity_set (and still available as that for compatibility), used to request a system-specified (as opposed to userland-specified) binding.  Again this requires that cpu_lock be held (or it acquires it for you if cpu_id is specified as CPU_CURRENT).  It locks the indicated thread (note that it might not be curthread) and sets a hard affinity for the requested (or current) processor by incrementing t_affinitycnt and setting t_bound_cpu in the kthread structure.  The hard affinity count will prevent any processor_bind initiated requests from succeeding.  Finally it forces the target thread through the dispatcher if necessary (so that the requested binding may be honoured).

  • kpreempt_disable() and kpreempt_enable().

    This actually prevents processor migration as a bonus side-effect of disabling preemption.  It is extremely lightweight and usable from any context (well, any where you could ever care about migration); in particular it does not require cpu_lock at all and can be called regardless of IPL and from interrupt context.

    To prevent preemption kpreempt_disable simply increments curthread->t_preempt.  To re-enable preemption this count is decremented.  Uses may be nested so preemption is only possible again when the count returns to zero.  When the count is decremented to zero we must also check for any preemption requests we ignored while preemption was disabled - i.e., whether cpu_kprunrun is set for the current processor - and call kpreempt synchronously now if so.  To understand how that prevents preemption you need to understand a little more of how preemption works in Solaris.  To preempt a thread running on cpu we set cpu_kprunrun for the processor it is on and "poke" that with a null interrupt whereafter return-from-interrupt processing will notice the flag set and call kpreempt.  It is in kpreempt that we consult t_preempt to see if preemption has been temporarily disabled;  if it is then the request is ignored for now and actioned only when preemption is re-enabled.

    Since a thread already running on one processor can only migrate to a new processor if we can get it off the current processor, disabling preemption has a bonus side-effect of preventing migration.  If, however, a thread with preemption disabled performs some operation that causes the thread to sleep (which would be legal but silly - why accept sleeping if you're asking not to be bumped from processor) then it may be migrated on resume since no part of the set{front,back}dq or cpu_choose code consults t_preempt.

    There is one big disadvantage to using kpreempt_disable.  It, errr, disables preemption which may interfere with the dispatch latency for other threads - preemption should only ever be disabled for a tiny window so that the thread can be pushed out of the way for higher priority threads (especially for realtime threads for which dispatch latency must be bounded).
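The nesting count and deferred preemption described for kpreempt_disable above can be modelled in a few lines of userland C. This is a toy model, not kernel code: the structs are cut down to the two fields mentioned in the text, the preemptions counter exists only so the model can be observed, and all function names carry a model_ prefix to make clear they are invented here.

```c
#include <stdbool.h>

struct model_cpu {
    bool cpu_kprunrun;  /* a kernel preemption has been requested */
};

struct model_thread {
    int t_preempt;      /* nesting count; > 0 means preemption disabled */
    int preemptions;    /* model only: times we were actually preempted */
};

/* Stand-in for kpreempt: honour the request unless it is disabled. */
static void
model_kpreempt(struct model_thread *t, struct model_cpu *cp)
{
    if (t->t_preempt > 0)
        return;         /* ignored for now; cpu_kprunrun stays set */
    cp->cpu_kprunrun = false;
    t->preemptions++;
}

static void
model_kpreempt_disable(struct model_thread *t)
{
    t->t_preempt++;
}

static void
model_kpreempt_enable(struct model_thread *t, struct model_cpu *cp)
{
    /* Only the outermost enable acts on a request we ignored earlier. */
    if (--t->t_preempt == 0 && cp->cpu_kprunrun)
        model_kpreempt(t, cp);
}

/* Set the flag and "poke" the cpu; return-from-interrupt calls kpreempt. */
static void
model_request_preempt(struct model_thread *t, struct model_cpu *cp)
{
    cp->cpu_kprunrun = true;
    model_kpreempt(t, cp);
}
```

With two nested disables, a preemption request made in between is held pending through the first enable and only takes effect at the second, when the count returns to zero.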

Thus we already had userland-requested long-term binding to a specific processor (or set) via processor_bind, system-requested long-term binding to a specific processor via thread_affinity_set, and system-requested short-term "binding" (as in "don't kick me off processor") via kpreempt_disable.

I was modifying kernel bcopy, copyin, copyout and hwblkpagecopy code (see cheetah_copy.s) to add a new hardware test feature which would require that hardware-accelerated copies (bigger copies use the floating point unit and the prefetch cache to speed copy) run on the same processor throughout the copy (even if preempted for a while in mid-copy by a higher priority thread).  I could not use processor_bind (non-starter, it's for user specified binding), nor thread_affinity_set which requires cpu_lock (bcopy(9F) can be called from interrupt context, including high level interrupt).  That left kpreempt_disable which, although beautifully light-weight, could not be used for more than a moment without introducing realtime dispatch glitches - and copies (although accelerated) can be very large.  I needed a thread_nomigrate which would stop a kernel thread from migrating from the current processor (whichever you happened to be on when called) but would still allow the thread to be preempted, which was reasonably light-weight (copy code is performance critical), and which had few restrictions on caller context (no more than copy code).  Sounded simple enough!

Some Terminology

I'll refer to threads that are bound to a processor with t_bound_cpu set as being strongly bound.  The processor_bind and thread_affinity_set interfaces produce strong bindings in this sense.  This isn't traditional terminology - none was necessary - but we'll see that the new interface introduces weak binding so I had to call the existing mechanism something.

Processor Offlining

Another requirement of the proposed interface was that it must not interfere with processor offlining.  A quick look at cpu_offline source shows that it fails if there are threads that are strongly bound to the target processor - it waits a short interval to allow any such bindings to drop but if there are any remaining thereafter (no new binding can occur while it waits as cpu_lock is held) the offline attempt fails.  The new interface was required to work more like kpreempt_disable does - not interfere with offlining at all.  kpreempt_disable achieves this through resisting the attempt to preempt the thread with the high-priority per-cpu pause thread - cpu_offline waits until all cpus are running their pause thread so a kpreempt_disable just makes it wait a tiny bit longer.  For the new mechanism, however, we could not acquire cpu_lock as a barrier to preventing new weak bindings (as used in cpu_offline for strong bindings) and the whole point of the new mechanism is not to interfere with preemption so I could not use that method, either.

No Blocking Allowed

As mentioned above, kpreempt_disable does not assure no-migrate semantics if the thread voluntarily gives up cpu.  Since a sleep may take a while we don't want weak-bound threads sleeping as that would interfere with processor offlining.  So we'll outlaw sleeping.  This is no loss - if you can sleep then you can afford to use the full thread_affinity_set route.

Weak-binding Must Be Short-Term

Again to avoid interfering with processor offlining.  A weakbound thread which is preempted will necessarily be queued on the dispatch queues of the processor to which it is weakbound.  During attempted offline of a processor we will need to allow threads weakbound to that processor to drain - we must be sure that allowing threads in TS_RUN state to run a short while longer will be enough for them to complete their action and drop their weak binding.


This turned out to be trickier than initially hoped, which explains some of the liberal commenting you'll find in the source!

void thread_nomigrate(void);

You can view the source to this new function here.  I'll discuss it in chunks below, leaving out the comments that you'll find in the source as I'll elaborate more here.

        cpu_t *cp;
        kthread_id_t t = curthread;

again:
        kpreempt_disable();
        cp = CPU;

It is the "current" cpu to which we will bind.  To nail down exactly which that is (since we may migrate at any moment!) we must first disable migration and we do this in the simplest way possible.  We must re-enable preemption before returning (and only keep it disabled for a moment).

Note that since we have preemption disabled, any strong binding requests which happen in parallel on other cpus for this thread will not be able to poke us across to the strongbound cpu (which may be different to the one we're currently on).

        if (CPU_ON_INTR(cp) || t->t_flag & T_INTR_THREAD ||
            getpil() >= DISP_LEVEL) {

During a highlevel interrupt context the caller does not own the current thread structure and so should not make changes to it.  If we are a lowlevel interrupt thread then we can't migrate anyway.  If we're at high IPL then we also cannot migrate.  So we need take no action; in thread_allowmigrate we must perform a corresponding test.

        if (t->t_nomigrate && t->t_weakbound_cpu && t->t_weakbound_cpu != cp) {
                if (!panicstr)
                        panic("thread_nomigrate: binding to %p but already "
                            "bound to %p", (void *)cp,
                            (void *)t->t_weakbound_cpu);

Some sanity checking that we've not already weakbound to a different cpu.  Weakbinding is recorded by writing the cpu address to the t_weakbound_cpu member and incrementing the t_nomigrate nesting count, as we'll see below.
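
The bookkeeping described here can be modelled in a few lines of userland C.  This is only a sketch under stated assumptions: model_cpu_t, model_thread_t, model_nomigrate and model_allowmigrate are hypothetical stand-ins (not the kernel's types or functions), locking and preemption are ignored entirely, and an error return takes the place of the panic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userland stand-ins; not the kernel's cpu_t or kthread_t. */
typedef struct model_cpu {
        int id;
} model_cpu_t;

typedef struct model_thread {
        int t_nomigrate;                /* nesting count of weak bindings */
        model_cpu_t *t_weakbound_cpu;   /* cpu recorded while weakbound */
} model_thread_t;

/*
 * Record a weak binding to cp.  Returns 0 on success, or -1 where the
 * kernel code would panic (already weakbound to a different cpu).
 */
int
model_nomigrate(model_thread_t *t, model_cpu_t *cp)
{
        if (t->t_nomigrate && t->t_weakbound_cpu && t->t_weakbound_cpu != cp)
                return (-1);
        t->t_nomigrate++;
        t->t_weakbound_cpu = cp;
        return (0);
}

/* Drop one level of nesting; clear the binding when the count reaches 0. */
void
model_allowmigrate(model_thread_t *t)
{
        assert(t->t_nomigrate > 0);
        if (--t->t_nomigrate == 0)
                t->t_weakbound_cpu = NULL;
}
```

Nested requests on the same cpu simply deepen the count, and the recorded cpu is only cleared when the outermost request is dropped.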


Prior to this point we might be racing with a competing strong binding request running on another cpu (e.g., a pbind(1M) command line request on a process in copy code and requesting a weak binding).  But strong binding acquires the thread lock for the target thread, so we can synchronize (without blocking) by grabbing our thread lock.  Note that this restricts the context of callers to those for which grabbing the thread lock is appropriate.

        if (t->t_nomigrate < 0 || weakbindingbarrier && t->t_nomigrate == 0) {
                return;         /* with kpreempt_disable still active */

This was the result of an unfortunate interaction between the initial implementation and pool rebinding (see poolbind(1M)).  Pool bindings must succeed or fail atomically - either all threads are rebound in the request or none are (as described in Andrei's blog).  The rebinding code would acquire cpu_lock (preventing further strong bindings) and check that all rebindings could succeed;  but since cpu_lock does not affect weak binding it could later find that some thread refused the rebinding.  The fix involved introducing a mechanism by which weakbinding could, fleetingly, be upgraded to preemption disabling.  The weakbindingbarrier is raised and lowered by calls to weakbinding_{stop,start}.  If it is raised, or this is a nested call and we've already gone the no-preempt route for this thread, then we return with preemption disabled and signify/count this through negative counting in t_nomigrate.  The t_weakbound_cpu member will be left NULL.  Note that wimping out and selecting the stronger condition of disabling preemption to achieve no-migration semantics does not significantly undermine the goal of never interfering with dispatch latency: if you are performing pool rebinding operations you expect a glitch as threads are moved.
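
The negative counting is easier to follow with a toy model.  The sketch below is a hedged userland illustration only: model_thread_t and the model_* functions are hypothetical, the t_preempt field merely counts what kpreempt_disable/kpreempt_enable would do, and no locking is modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two counters involved. */
typedef struct model_thread {
        int t_nomigrate;        /* > 0: weakbound, < 0: no-preempt route */
        int t_preempt;          /* stands in for kpreempt_disable nesting */
} model_thread_t;

/* Raised while weak bindings are being suppressed (pool rebinding). */
static bool weakbindingbarrier;

/* Returns true if the request was met by leaving preemption disabled. */
bool
model_nomigrate(model_thread_t *t)
{
        t->t_preempt++;                 /* models kpreempt_disable() */
        if (t->t_nomigrate < 0 ||
            (weakbindingbarrier && t->t_nomigrate == 0)) {
                --t->t_nomigrate;       /* count this level negatively */
                return (true);          /* preemption stays disabled */
        }
        t->t_nomigrate++;               /* ordinary weak binding */
        t->t_preempt--;                 /* models kpreempt_enable() */
        return (false);
}

void
model_allowmigrate(model_thread_t *t)
{
        assert(t->t_nomigrate != 0);
        if (t->t_nomigrate < 0) {
                ++t->t_nomigrate;       /* undo one negative level ... */
                t->t_preempt--;         /* ... and its preempt disable */
        } else {
                --t->t_nomigrate;
        }
}
```

Once a thread has taken the no-preempt route, nested requests keep taking it even after the barrier is lowered, so the two counters always unwind together.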

It's possible that we are running on a different cpu to which we are strongbound - a strong binding request was made between the time we disabled preemption and when we acquired the thread lock.  We can still grant the weakbinding in this case, which will result in our weak binding being different to our strong binding!  This is not unhealthy as long as we allow the thread to gravitate towards its strongbound cpu as soon as the weakbinding drops (which will be soon since it is a short-term condition).  To favour weakbinding over any strong we will also require some changes in setfrontdq and setbackdq.

Weakbinding requests always succeed - there is no return value to indicate failure.  However we may sometimes want to delay granting a weakbinding request until we are running on a more suitable cpu.  Recall that a weakbinding simply prevents migration during the critical section, but does not nominate a particular cpu.  If our current cpu is the subject of an offline request then we will migrate the thread to another cpu and retry the weakbinding request there.  We do this to avoid the (admittedly unlikely) case that repeated weakbinding requests being made by a thread prevent it from offlining (remember that the strategy is that any weakbound threads waiting to run on an offline target will drop their binding if allowed to run for a moment longer - if new bindings are continually being made then that assumption is violated).

        if (cp != cpu_inmotion || t->t_nomigrate > 0 || t->t_preempt > 1 ||
            t->t_bound_cpu == cp) {
                t->t_weakbound_cpu = cp;

We set cpu_inmotion during cpu_offline to record the target cpu.  If we're not currently on an offline target (the common case) or if we've already weakbound to this cpu (this is a nested call) or if we can't migrate away from this cpu because preemption is disabled or we're strongbound to it then go ahead and grant the weakbinding to this cpu by incrementing the nesting count and recording our weakbinding in t_weakbound_cpu (for the dispatcher).  Make these changes visible to the world before dropping the thread lock so that competing strong binding requests see the full view of the world.  Finally re-enable preemption, and we're done.

        } else {
                /*
                 * Move to another cpu before granting the request by
                 * forcing this thread through preemption code.  When we
                 * get to set{front,back}dq called from CL_PREEMPT()
                 * cpu_choose() will be used to select a cpu to queue
                 * us on - that will see cpu_inmotion and take
                 * steps to avoid returning us to this cpu.
                 */
                cp->cpu_kprunrun = 1;
                kpreempt_enable();      /* will call preempt() */
                goto again;

If we are the target of an offline request and are not obliged to grant the weakbinding to this cpu, then force ourselves onto another cpu.  The dispatcher will lean away from the cpu_inmotion and we'll resume elsewhere and likely grant the binding there.  Who says goto can never be used?
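
The grant-or-move decision boils down to a single predicate, which can be exercised in isolation.  This is a sketch only: the model_* types and grant_on_this_cpu are hypothetical names, and the offline target is passed in as a parameter rather than read from the kernel's cpu_inmotion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct model_cpu {
        int id;
} model_cpu_t;

typedef struct model_thread {
        int t_nomigrate;                /* weak binding nesting count */
        int t_preempt;                  /* preemption-disable nesting */
        model_cpu_t *t_bound_cpu;       /* strong binding, or NULL */
} model_thread_t;

/*
 * Mirrors the condition in the excerpt: grant the weak binding on cp
 * unless cp is being offlined and nothing obliges us to stay on it.
 * t_preempt > 1 is used because the caller has already disabled
 * preemption once on its own behalf.
 */
bool
grant_on_this_cpu(const model_thread_t *t, const model_cpu_t *cp,
    const model_cpu_t *inmotion)
{
        assert(t != NULL && cp != NULL);
        return (cp != inmotion || t->t_nomigrate > 0 ||
            t->t_preempt > 1 || t->t_bound_cpu == cp);
}
```

A false return corresponds to the else branch above: set cpu_kprunrun, pass through preemption, and try again on whichever cpu we land on.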

void thread_allowmigrate(void);

This drops the weakbinding if the nesting count reduces to zero, but must also look out for the special cases made in thread_nomigrate.  Source may be viewed here.

        kthread_id_t t = curthread;

        ASSERT(t->t_weakbound_cpu == CPU ||
            (t->t_nomigrate < 0 && t->t_preempt > 0) ||
            CPU_ON_INTR(CPU) || t->t_flag & T_INTR_THREAD ||
            getpil() >= DISP_LEVEL);

On DEBUG kernels check that all is operating as it should be. There's a story to tell here regarding cpr (checkpoint-resume power management) which I'll recount a little later.

        if (CPU_ON_INTR(CPU) || (t->t_flag & T_INTR_THREAD) ||
            getpil() >= DISP_LEVEL)

This corresponds to the beginning of thread_nomigrate for the case where we did not have to do anything to prevent migration.

        if (t->t_nomigrate < 0) {

Negative nested counting in t_nomigrate indicates that we're resolving weakbinding requests by upgrading them to no-preemption semantics during pool rebinding.

        } else {
                if (t->t_bound_cpu &&
                    t->t_weakbound_cpu != t->t_bound_cpu)
                        CPU->cpu_kprunrun = 1;
                t->t_weakbound_cpu = NULL;

If we decrement the nesting count to 0 then clear our weak binding recorded in t_weakbound_cpu.  If we are weakbound to a different cpu to which we are strongbound (as explained above) force a trip through preempt so that we can now drop all resistance and migrate.

Changes to setfrontdq and setbackdq

As outlined above it is these two functions which select dispatch queues on which to place threads that are in a runnable state (including threads preempted from cpu).  These functions already checked for strong binding of the thread being enqueued, so they required an additional check for weak binding.  As explained above it is sometimes possible that a thread be both strong and weak bound, normally to the same cpu but sometimes for a short time to different cpus - the changes should therefore favour weak binding over strong.
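
The precedence rule can be sketched as follows.  This is not the dispatcher code: model_choose_cpu and the model_* types are hypothetical, and the "preferred" argument stands in for whatever cpu the normal selection logic would have picked for an unbound thread.

```c
#include <assert.h>
#include <stddef.h>

typedef struct model_cpu {
        int id;
} model_cpu_t;

typedef struct model_thread {
        model_cpu_t *t_bound_cpu;       /* strong binding, or NULL */
        model_cpu_t *t_weakbound_cpu;   /* weak binding, or NULL */
} model_thread_t;

/* Pick the cpu whose dispatch queue should receive the thread. */
model_cpu_t *
model_choose_cpu(const model_thread_t *t, model_cpu_t *preferred)
{
        assert(preferred != NULL);
        if (t->t_weakbound_cpu != NULL)         /* weak binding wins */
                return (t->t_weakbound_cpu);
        if (t->t_bound_cpu != NULL)             /* then strong binding */
                return (t->t_bound_cpu);
        return (preferred);                     /* otherwise unconstrained */
}
```

Because the weak binding is short-lived, the thread drifts back to its strongbound cpu the next time it is enqueued after the weak binding drops.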

Changes to cpu_offline

The cpu_lock is held on calling cpu_offline, and that stops further strong bindings to the target (or any) cpu while we're in cpu_offline.  Except in special circumstances (of a failing cpu) a cpu with bound threads cannot be offlined;  if there are any strongbound threads then cpu_offline performs a brief delay loop to give them a chance to unbind and then fails if any remain.  The existence of strongbound threads is checked with disp_bound_threads and disp_bound_anythreads.

To meet the requirement that weakbinding not interfere with offlining we needed a similar mechanism to prevent any further weak bindings to the target cpu and a means of allowing existing weak bindings to drain; we must do this, however, without using a mutex or similar.

The solution was to introduce cpu_inmotion, which would normally be NULL but would be set to the target cpu address when that cpu is being offlined.  Since this variable is not protected by any mutex some consideration of memory ordering in multiprocessor systems is required.  We force the store to cpu_inmotion to global visibility in cpu_offline so we know that no new loads (on other cpus) will see the old value after that point (we've "raised a barrier" to weak binding).  However, loads already performed on other cpus may already have the old value (we're not synchronised in any way), so we have to be prepared for a thread running on the target cpu to still manage to weakbind just one last time.  In that case we repeat the loop to allow weakbound threads to drain, and thereafter we know no further weakbindings could have occurred since the barrier is long since visible.  The weakbinding barrier cpu_inmotion is checked in thread_nomigrate, and a thread trying to weakbind to the cpu that is the target of an offline request will go through preemption code to first migrate to another cpu.
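
For readers more at home in userland, the lock-free barrier can be approximated with C11 atomics.  This is a rough sketch and not the kernel implementation: every name other than cpu_inmotion is hypothetical, a per-cpu counter stands in for "threads weakbound here", and the default sequentially consistent atomics stand in for the kernel's explicit memory barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct model_cpu {
        int id;
        atomic_int weakbound;   /* threads currently weakbound here */
} model_cpu_t;

/* NULL unless an offline is in progress; no mutex protects it. */
static _Atomic(model_cpu_t *) cpu_inmotion;

void
raise_offline_barrier(model_cpu_t *cp)
{
        atomic_store(&cpu_inmotion, cp);        /* globally visible store */
}

void
lower_offline_barrier(void)
{
        atomic_store(&cpu_inmotion, (model_cpu_t *)NULL);
}

/* Returns false if the caller should move to another cpu and retry. */
bool
try_weakbind(model_cpu_t *cp)
{
        if (atomic_load(&cpu_inmotion) == cp)
                return (false);
        /* A racing offline may raise the barrier right here. */
        atomic_fetch_add(&cp->weakbound, 1);
        return (true);
}

void
drop_weakbind(model_cpu_t *cp)
{
        int prev = atomic_fetch_sub(&cp->weakbound, 1);

        assert(prev > 0);
        (void) prev;
}

bool
weakbindings_drained(model_cpu_t *cp)
{
        return (atomic_load(&cp->weakbound) == 0);
}
```

The window marked in try_weakbind is the "one last time" case: the offlining side has to recheck for drained bindings after the barrier has become visible rather than trusting a single pass.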

A Twist In The Tail

I integrated thread_nomigrate along with the project that first required it into build 63 of Solaris 10.  A number of builds later a bug turned up in the case described above where a thread may be temporarily weakbound to a different cpu to that to which it is strongbound.  In fixing that I modified the assertion test in thread_allowmigrate.  The test suite we had developed for the project was modified to cover the new case, and I put the changes back after successful testing.

Or so I thought.  ON archives routinely go for pre-integration regression testing (before being rolled up to the whole wad-of-stuff that makes a full Solaris build) and they soon turned up a test failure that had systems running DEBUG archives failing the new assertion check in thread_allowmigrate during cpr (checkpoint-resume - power management) validation tests.

Now cpr testing is on the putback checklist but I'd skipped it in the bug fix on the grounds that I couldn't possibly have affected it.  Well that was true - the newly uncovered bug was actually introduced back in the initial putback of build 63 (about 14 weeks earlier) but was now exposed by my extended assertion.

Remember that the initial consumer of the thread_nomigrate interface was to be some modified kernel hardware-accelerated copy code - bcopy in particular.  Well it turns out that cpr uses bcopy when it writes the pages of the system to disk for later restore, taking special care of some pages which may change during the checkpoint operation itself.  However it did not take any special care with regard to the kthread_t structure of the thread performing the cpr operation, and when bcopy called thread_nomigrate the thread structure for the running thread would record the current cpu address in t_weakbound_cpu and the nesting count in t_nomigrate; if the page being checkpointed/copied happened to be that containing this particular kthread_t then those values were preserved and restored on the resume operation - undoing the stores of thread_allowmigrate for this thread - effectively warping us back in time!
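
The time warp is easy to reproduce in miniature.  The sketch below is a deliberately simplified userland model (hypothetical model_thread_t, checkpoint_self and resume_from; memcpy stands in for the kernel bcopy, and an integer cpu id of -1 means "not weakbound"): the image is captured while the copying thread's own binding state is set, so restoring the image resurrects it.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the interesting kthread_t members. */
typedef struct model_thread {
        int t_nomigrate;        /* weak binding nesting count */
        int t_weakbound_cpu;    /* cpu id, or -1 when not weakbound */
} model_thread_t;

/*
 * Copy the "page" holding *live into *image the way the copy code
 * would: weakbind on entry, copy, drop the binding on exit.
 */
void
checkpoint_self(model_thread_t *live, model_thread_t *image, int cpuid)
{
        live->t_nomigrate++;                    /* thread_nomigrate() */
        live->t_weakbound_cpu = cpuid;

        memcpy(image, live, sizeof (*live));    /* snapshot taken mid-flight */

        assert(live->t_nomigrate > 0);
        if (--live->t_nomigrate == 0)           /* thread_allowmigrate() */
                live->t_weakbound_cpu = -1;
}

/* Resume: the saved image overwrites the live structure. */
void
resume_from(model_thread_t *live, const model_thread_t *image)
{
        memcpy(live, image, sizeof (*live));
}
```

After resume_from the thread believes it is still one level deep in a weak binding that it has in fact already dropped, which is exactly the state the extended assertion objected to.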

There's certainly a moral there: never assume you understand all the interactions of the various elements of the OS, and do perform all required regression testing no matter how unrelated it seems at the time!  I just required the humiliation of the "fix" being backed out to remind me of this.


Tuesday May 10, 2005

Fault Management Top 10: #2

Structured Error Events

When subsystems are left to invent their own error handling infrastructure the natural baseline they tend towards is either that of zero infrastructure (just die horribly on any error) or printf logging/debugging (some odd message that has little or no structure, is not registered in any repository, probably means little to anyone other than the developer, and often contains no useful diagnosis information).  I'll term this unstructured printf messaging for this discussion.
There was a time several years ago when Solaris hardware error handling amounted to little greater than either of those baselines.  "Fortunately" the UltraSPARC-II Ecache ... errmmm ... niggle provided the much-needed catalyst towards the very substantial improvements we see today (and a grand future roadmap).  I'll use this as an example of what I'll term structured printf messaging below.
Finally I will contrast these with structured error events.

Unstructured Printf Messaging

The following is an example message that would be barfed (I find that word really suitable in these descriptions) to /var/adm/messages and console in older/unpatched versions of Solaris prior to the first round of hardware error handling improvements.  The message is the result of a corrected error from memory.
May 8 14:35:30 thishost SUNW,UltraSPARC-II: CPU1 CE Error: AFSR 
0x00000000.00100000 AFAR 0x00000000.8abb5a00 UDBH Syndrome 0x85 MemMod U0904
May 8 14:35:30 thishost SUNW,UltraSPARC-II: ECC Data Bit 63 was corrected
May 8 14:35:30 thishost unix: Softerror: Intermittent ECC Memory Error, U0904
At least there is some diagnostic information, such as telling us which bit was in error (we could grok for patterns).  While you could argue that there is some structure to that message you still have to write/maintain a custom grokker to extract it.  The task, of course, is complicated by other related messages sharing little or no common structure.  Here's an uncorrectable error from memory:
panic[CPU1]/thread=2a1000R7dd40: UE Error: AFSR 0x00000000.80200000 AFAR
0x00000000.089cd740 Id 0 Inst 0 MemMod U0501 U0401
Making things worse, in the corrected memory error case, these things were poured out to the console on every occurrence.  The implicit expectation had been that you'd see very few such errors - not allowing for a bad pin, for instance.  So while the errors are correctable and survivable (albeit coming thick and fast) the handlers chose to spam the console making it near useless!
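
To give a flavour of the custom grokker this style of messaging forces on you, here is a minimal sketch that pulls the AFSR and AFAR values out of either message above.  The function name parse_afsr_afar is made up for illustration, and the sscanf format is tied to the exact spelling and ordering of these two messages, which is precisely the fragility being complained about.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract "AFSR 0xHHHHHHHH.LLLLLLLL AFAR 0xHHHHHHHH.LLLLLLLL" from a
 * message.  Returns 0 on success, -1 if the pattern is not found.
 */
int
parse_afsr_afar(const char *msg, uint64_t *afsr, uint64_t *afar)
{
        unsigned int sh, sl, ah, al;
        const char *p;

        assert(msg != NULL);
        p = strstr(msg, "AFSR");
        if (p == NULL)
                return (-1);

        /* White space in the format also matches wrapped lines. */
        if (sscanf(p, "AFSR 0x%8x.%8x AFAR 0x%8x.%8x",
            &sh, &sl, &ah, &al) != 4)
                return (-1);

        *afsr = ((uint64_t)sh << 32) | sl;
        *afar = ((uint64_t)ah << 32) | al;
        return (0);
}
```

Any change to the wording, field order or number format in a later patch silently breaks a parser like this.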

Structured Printf Messaging

Taking the time to plan and coordinate error messaging can pay substantial dividends.  This was one of the first responses to the UltraSPARC-II Ecache problem - to analyse the requirements for error handling and diagnosis on that platform and, with an understanding of that umbrella-view in place, restructure associated handling and messaging.  Here are correctable and uncorrectable memory errors in this improved format:
[AFT0] Corrected Memory Error on CPU1, errID 0x00000036.629edc25
 AFSR 0x00000000.00100000<CE> AFAR 0x00000000.00347dc0
 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1002fe20
 UDBH Syndrome 0x85 Memory Module 1904
[AFT0] errID 0x00000036.629edc25 Corrected Memory Error on 1904 is Intermittent
[AFT0] errID 0x00000036.629edc25 ECC Data Bit 63 was in error and corrected

WARNING: [AFT1] Uncorrectable Memory Error on CPU1 Instruction access at TL=0, 
errID 0x0000004f.818d9280
 AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.0685c7a0
 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x7815c7a0
 UDBH 0x0203<UE> UDBH.ESYND 0x03 UDBL 0x0000 UDBL.ESYND 0x00
 UDBH Syndrome 0x3 Memory Module 190x
WARNING: [AFT1] errID 0x0000004f.818d9280 Syndrome 0x3 indicates that this may 
not be a memory module problem
Behind the scenes a new shared error/fault information structure was introduced, and all kernel hardware error handlers were changed to accrue relevant information in an instance of one of those structures.  A new and robust (lockless) error dispatching mechanism would carry these structures to logging code which, instead of multiple independent printf (cmn_err) calls passed the error structure to a common function to log the error and passed flags to that function indicating what was relevant for the particular error.
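
The shape of that design (one shared structure, one logging function, flags selecting the relevant fields) might be sketched like this.  Everything here is illustrative: model_errinfo_t, the LOGF_* flags and model_log_error are invented names, the output goes to a caller-supplied buffer rather than cmn_err, and the layout only loosely imitates the [AFT0] messages above.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flags saying which members are relevant to an error. */
#define LOGF_AFSR       0x1
#define LOGF_AFAR       0x2
#define LOGF_SYND       0x4
#define LOGF_UNUM       0x8

/* Hypothetical shared error information structure. */
typedef struct model_errinfo {
        int cpuid;
        uint64_t errid;
        uint64_t afsr;
        uint64_t afar;
        unsigned int synd;
        const char *unum;
} model_errinfo_t;

/* Append formatted text to a NUL-terminated buffer, truncating safely. */
static void
append(char *buf, size_t len, const char *fmt, ...)
{
        size_t used = strlen(buf);
        va_list ap;

        va_start(ap, fmt);
        (void) vsnprintf(buf + used, len - used, fmt, ap);
        va_end(ap);
}

/* The single common logging routine shared by all handlers. */
void
model_log_error(const model_errinfo_t *e, const char *what,
    unsigned int flags, char *buf, size_t len)
{
        assert(len > 0);
        buf[0] = '\0';
        append(buf, len, "[AFT0] %s on CPU%d, errID 0x%08x.%08x", what,
            e->cpuid, (unsigned int)(e->errid >> 32),
            (unsigned int)(e->errid & 0xffffffffU));
        if (flags & LOGF_AFSR)
                append(buf, len, " AFSR 0x%08x.%08x",
                    (unsigned int)(e->afsr >> 32),
                    (unsigned int)(e->afsr & 0xffffffffU));
        if (flags & LOGF_AFAR)
                append(buf, len, " AFAR 0x%08x.%08x",
                    (unsigned int)(e->afar >> 32),
                    (unsigned int)(e->afar & 0xffffffffU));
        if (flags & LOGF_SYND)
                append(buf, len, " Syndrome 0x%x", e->synd);
        if (flags & LOGF_UNUM)
                append(buf, len, " Memory Module %s", e->unum);
}
```

Because every handler funnels through the same routine, the field names, ordering and number formats stay consistent across error types, which is what makes the resulting messages extractable.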
Such messaging (along with many other improvements) significantly enhanced diagnosis and trend analysis.  You still had to grok messages files, but at least the messages were in a structured format that was quite readily extracted.  Of course there was still much to be done.  Programmatic access to the logged data was not readily available, syslog is not the most reliable message transport, syslog logs get aged away, harmless errors (say from an isolated cosmic ray incident) litter messages files and cause unnecessary alarm, etc.

Structured Error Events

Roll on the structured events described in my last blog entry.  When we prepare an error report we tag it with a pre-registered name (event class) and fill it with all relevant (and also pre-registered) event payload data.  That event payload data is presented to us as a name-value list (really a name-type-value list).  When it is received by the fault manager it is logged away (independently of any diagnosis modules) in a structured error log (binary; records are kept in the extended accounting format and the record data is in the form of name-value lists).  Here is a corrected memory event dumped from the error log using fmdump(1M):
TIME                          CLASS
Jun 23 2004 02:59:03.207995640 ereport.cpu.ultraSPARC-IIIplus.ce
nvlist version: 0
        class = ereport.cpu.ultraSPARC-IIIplus.ce
        ena = 0xd67a7253e3002c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = cpu
                cpuid = 0xb
                cpumask = 0x23
                serial = 0xd2528092482
        (end detector)

        afsr = 0x200000187
        afar-status = 0x1
        afar = 0x23c8f72dc0
        pc = 0x10391e4
        tl = 0x0
        tt = 0x63
        privileged = 1
        multiple = 0
        syndrome-status = 0x1
        syndrome = 0x187
        error-type = Persistent
        l2-cache-ways = 0x2
        l2-cache-data = 0xec0106f1a6
0x772dc0 0x0 0xbd000000 0xe8 0xc05b9 0x840400003e9287b6 0xc05b9
0x850400003e92a7b6 0xfeb5 0xc05b9 0x860400003e92c7b6 0xc05b9
0x870400003e92e7b6 0x3dd24 0xec0106f1a6 0x372dc0 0x1 0xbd000000 0xe8
0x0 0x0 0x0 0x4000000001 0x2e 0x0 0x0 0x0 0x0 0x0
        dcache-ways = 0x0
        icache-ways = 0x0
        resource = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = mem
                unum = /N0/SB4/P3/B1/D2 J16501
        (end resource)

In software (such as a diagnosis module) given a name-value list we can access the associated data either through generic event walkers (use the libnvpair(3LIB) API to iterate over list pairs, get the type for a pair, lookup that name-value pair for the given type) or by looking up the expected members directly (for example in the case of a diagnosis module that has subscribed to corrected memory error events and knows, from the registry, what payload data and types to expect):

analyze_error(nvlist_t *nvl, ...)
        uint64_t afar;
        uint16_t synd;
        char *type;

        if (nvlist_lookup_pairs(nvl, 0,
            "afar", DATA_TYPE_UINT64, &afar,
            "syndrome", DATA_TYPE_UINT16, &synd,
            "error-type", DATA_TYPE_STRING, &type,
            NULL) != 0)
                return (BAD_PAYLOAD);

Bye-bye hacky grokking scripts - we now have self-describing error event information!
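
If you don't have libnvpair to hand, the idea of a name-type-value list can still be tried out with a toy stand-in.  The sketch below is explicitly not the libnvpair API: model_pair_t, model_lookup, model_type_name and model_print are hypothetical, and only three data types are modelled.  It shows both styles of access mentioned above, a typed lookup that fails on a type mismatch and a generic walker that uses the recorded type to decide how to print each member.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef enum { MT_UINT64, MT_UINT16, MT_STRING } model_type_t;

/* One self-describing member: name, type tag and value. */
typedef struct model_pair {
        const char *name;
        model_type_t type;
        union {
                uint64_t u64;
                uint16_t u16;
                const char *str;
        } value;
} model_pair_t;

/* Typed lookup: NULL if the name is absent or has a different type. */
const model_pair_t *
model_lookup(const model_pair_t *list, size_t n, const char *name,
    model_type_t type)
{
        assert(list != NULL && name != NULL);
        for (size_t i = 0; i < n; i++) {
                if (strcmp(list[i].name, name) == 0)
                        return (list[i].type == type ? &list[i] : NULL);
        }
        return (NULL);
}

const char *
model_type_name(model_type_t type)
{
        switch (type) {
        case MT_UINT64: return ("uint64");
        case MT_UINT16: return ("uint16");
        default:        return ("string");
        }
}

/* Generic walker: needs no knowledge of which members to expect. */
void
model_print(const model_pair_t *list, size_t n, FILE *out)
{
        for (size_t i = 0; i < n; i++) {
                const model_pair_t *p = &list[i];

                if (p->type == MT_UINT64)
                        (void) fprintf(out, "%s = 0x%" PRIx64 "\n",
                            p->name, p->value.u64);
                else if (p->type == MT_UINT16)
                        (void) fprintf(out, "%s = 0x%x\n", p->name,
                            (unsigned int)p->value.u16);
                else
                        (void) fprintf(out, "%s = %s\n", p->name,
                            p->value.str);
        }
}
```

The consumer never parses text: it either asks for a named member of a known type or walks the list and lets the type tags describe the data.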

Why Do We Care?

I'm hoping this message is getting across.  In the past, with no structured event protocol, no standards for error collation etc, doing it properly was just too difficult.  Subsystems would just deliver what was adequate during development, leave random printf debugging in place, perform no higher-level correlation of errors, etc.  Now they/we have no excuse - all the infrastructure is present (and easy to use) and it is vastly simpler to deliver high level error and fault analysis for your product (whatever it is - I keep offering examples from hardware errors but all this is generic).  Watch the Sun space for many leaps forward in fault management as a result.

OpenSolaris Release Date Set

For the nay-sayers who claim that "they'll never do it" - Casper has given a strong indication of the planned release date for OpenSolaris.  Should be amusing to see what the die-hard trolls in comp.unix.solaris and comp.os.linux.advocacy make of it!

Thursday Apr 28, 2005

Fault Management Top 10: #1

Solaris Fault Management Top 10: New Architecture Adopted

Time to get my promised fault management top 10 off the ground.
Sun has adopted an overarching Fault Management Architecture (FMA) to extend across its product line (of course it will take some time before it is realized in all products!).  Before we dive in let me say that I am not the architect of this architecture - I'm just a keen fan and beneficiary.
I'll describe some basic elements of the architecture below, but let's begin with a review of the architecture it replaces:

Outgoing Fault Management Architecture

(This space intentionally left blank)

OK it wasn't quite as bad as that, but it is true to say that we have not previously enjoyed any formal or consistent fault management model or architecture and that what functionality there was was rather ad-hoc and (in Solaris, and generally across the product line) focussed on error handling and not on fault diagnosis:
  • There was no centralised view of errors - each error handler would do the necessary handling as prescribed (e.g., rewrite a memory address in response to a correctable error from memory) but attempts at higher level monitoring of trends within these error handlers (e.g., are all errors from the same DIMM, all suffering the same bit in error, all reported by the same cpu and never any other cpu, etc)  were  somewhat limited and clumsy since there was no framework for collation and communication of error reports and trends.
  • Error handlers are part of the kernel, and that is not the best or easiest place to perform high-level trend analysis and diagnosis.  But the only communication up to userland level was of unstructured messages to syslog which hardly facilitates analysis in userland.  With diagnosis difficult in the kernel error handlers concentrated on error handling (surprise!) and not on analysing error trends to recognise faults - or where they did they were necessarily somewhat limited.
  • Error messages were sent to syslog and barfed to the console.  If you suffered a zillion errors but nonetheless survived, your console might be spammed with a zillion messages (of a few lines each).
  • Don't think just of hardware errors.  Consider the example of a driver encountering some odd condition, again and again.  Typically it would use cmn_err(9F) to barf some warning message to the console, repeatedly.  Chances are that it did not count those events (e.g., wouldn't know if a given event were the first or the 100th), did not kick off any additional diagnostics to look any deeper, would take no corrective action etc.  While some drivers do a better job they were obliged to invent the infrastructure for themselves, and there'd be no re-use in other drivers.
  • The list goes on.  There's some overlap with my previous post here: things were in this state because error handling and fault diagnosis were an afterthought in unix, subscribing to the "errors shouldn't happen - why design for them?" doctrine.
Those points are very much Solaris (and Solaris kernel) centric, but they're representative of the overall picture as it was.

New Fault Management Architecture

Let's start with a simple picture to establish some terminology:

  • An error is some unexpected/anomalous/interesting condition/result/signal/datum; it may be observed/detected or go undetected.  An example might be a correctable memory error from ECC-protected memory.
  • A fault is a defect that may produce an error.  Not all errors are necessarily the result of a defect.  For example a cosmic ray may induce an isolated few correctable memory errors, but there is no fault present;  on the other hand a DIMM that has suffered electrostatic damage in mishandling may produce a stream of correctable errors - there is a fault present.
  • Error Handlers respond to the detection or observation of an error.  Examples are:
    • a kernel cpu error trap handler -  a hardware exception that is raised on the cpu in response to  some hardware detection of an error (say an uncorrectable error from a memory DIMM),
    • a SCSI driver detects an unexpected bus reset
    • an application performs an illegal access and gets a SIGSEGV and has installed a signal handler to handle it
    • svc.startd(1M) decides that a service is restarting too frequently
    The error handler takes any required immediate action to properly handle the error, and then (if it wishes to produce events into the fault manager) collects what error data is available and prepares an error report for transmission to the Fault Manager.
  • The Error Report (data gathered in response to the observation of an error) is encoded according to a newly defined FMA event protocol to make an Error Event, which is then transported towards the Fault Manager.
  • The Fault Manager is a software component that is responsible for fault diagnosis through pluggable diagnosis engines.  It is also responsible for logging the error events it receives and for maintaining overall fault management state.  The fault manager receives error events from various error handlers through various transport mechanisms (the error handlers and the fault manager may reside in quite distinct locations, e.g. an error handler within the Solaris kernel and a fault manager running within the service processor or on another independent system). 
  • The FMA event protocol, among many other things, arranges events in a single hierarchical tree (intended to span all events across all Sun products) with each node named by its event class.  Events include error events as produced by the error handler and fault events as produced by the fault manager when a fault is diagnosed (events are generic and are certainly not limited to error and fault events).  The event protocol also specifies how event data is encoded.  Continuing our correctable memory error example, such an error may arrive as an error event of class ereport.cpu.ultraSPARC-IV.ce and have corresponding event payload data (which we'll take a look at another day).
  • Pluggable Diagnosis Engine clients of the fault manager subscribe to incoming error telemetry for event classes that they have knowledge of (i.e., are able to assist in the diagnosis of associated faults).  They apply various algorithms to the incoming error telemetry and attempt to diagnose any faults that are present in the system; when a fault is diagnosed a corresponding fault event is published.
  • FMA Agents subscribe to fault events produced by the diagnosis engines and may take any appropriate remedial or mitigating action.
That's enough terminology for now.
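
Since event classes form a dotted hierarchy, subscribing to "everything under this node" comes down to a prefix test on the class name.  The following is a minimal sketch under that assumption only: class_matches is a hypothetical helper, it understands nothing beyond an exact name or a trailing ".*", and it makes no claim about how the fault manager actually matches subscriptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Does event class cls fall under subscription pattern?  A pattern
 * ending in ".*" matches any class below that node; anything else
 * must match exactly.
 */
bool
class_matches(const char *pattern, const char *cls)
{
        size_t plen;

        assert(pattern != NULL && cls != NULL);
        plen = strlen(pattern);
        if (plen >= 2 && strcmp(pattern + plen - 2, ".*") == 0) {
                /* Compare up to and including the final '.'. */
                return (strncmp(pattern, cls, plen - 1) == 0);
        }
        return (strcmp(pattern, cls) == 0);
}
```

With this, a diagnosis engine interested in all cpu error reports would subscribe with a pattern such as "ereport.cpu.*" and see the correctable memory error class used as the example above.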

The Fault Manager(s)

The fault manager plays a central role in the architecture.  Furthermore, fault managers may be arranged hierarchically with events flowing between them.  For example one fault manager instance might run within each Solaris instance (say on domains A and B within a Sun-Fire 6900), and another fault manager instance might run on the system controller for the platform.  Events that are detected within the running OS image can produce error reports into the domain fault managers; events that are detected within the componentry managed by the SC (say an error observed on an ASIC that Solaris does not know exists and for which the error is corrected before it gets anywhere near Solaris) can produce ereports that flow towards the SC fault manager.  And since these things are never entirely independent it is interesting for the SC to know what the domain has seen (and vice versa) so, for example, the domain fault manager might produce events consumed by the SC fault manager.

An Example - Solaris FMA Components

Let's look a bit closer at some of the FMA infrastructure as realized in Solaris 10 (keeping in mind that FMA stretches way beyond just the OS).  It's probably most illustrative to contrast old with new all along, so I'll approach it that way.
Let's consider the example of a correctable error from memory.  The following isn't real Solaris code (that's coming soon) and is a substantial simplification but it gives the flavour.

Old New
CPU traps to notify of a recent correctable error event.  The trap handler gathers data that is considered relevant/interesting such as the detector details (variable cpuid, an integer), address of the error (variable afar, a uint64_t), the error syndrome (variable synd, an integer), which DIMM holds that address (loc, a string), whether the error could be eliminated by a rewrite (status, a boolean), etc.  We now want to forward this data for pattern and trend analysis.
Vomit a message to the syslog with code that amount to little more than the following:

cmn_err(CE_WARN, "Corrected Memory Error detected by CPU 0x%x, "
"address 0x%llu, syndrome 0x%x (%s bit %d corrected), "
"status %s, memory location %s ",
cpuid, afar, synd, synd_to_bit_type(synd), synd_to_bit_number(synd), status ? "Fixable" : "Stuck",  loc);
Create some name-value lists which we'll populate with the relevant information and submit to our event transport:

dtcr = nvlist_create();
payload = nvlist_create();

We're not going to barf a message into the ether and hope for the best!  We're going to produce an error event according to the event protocol (implemented in various support functions which we'll use here).  We start by recording version and event class information:

 \* Event class naming is performed according to entries in a formal
 \* registry.  For this case assume the registered leaf node is
 \* a "mem-ce".  We'll prepend the full event class name to form a
 \* full class name along the lines of "ereport.cpu.ultraSPARC-IIIplus"
payload_class_set(payload, cpu_gen(cpuid, "mem-ce"));

Populate the detector nvlist with information regarding the cpu and add this info into the payload:

(void) cpu_set(dtcr, cpuid, cpuversion(cpuid));
payload_set(payload, "detector", TYPE_NVLIST, dtcr);

Populate the payload with the error information:

payload_set(payload, "afar", TYPE_UINT64, afar);
payload_set(payload, "syndrome", TYPE_UINT16, synd);
payload_set(payload, "status", TYPE_BOOLEAN, status);
payload_set(payload, "location", TYPE_STRING, loc);
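To make the name-value list idea concrete, here is a minimal, self-contained sketch in plain C. It is a toy stand-in and not the libnvpair API: every name in it (toy_nvlist_t, toy_add_uint64, toy_lookup_string and so on) is hypothetical and exists only for illustration. The point it tries to show is that each value travels with its name and its type, so a consumer asks for "afar as a uint64" rather than scraping text.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy types: NOT the libnvpair API, just the same idea. */
typedef enum { TOY_UINT64, TOY_STRING } toy_type_t;

typedef struct {
    const char *name;           /* caller keeps the name alive */
    toy_type_t  type;
    union {
        uint64_t    u64;
        const char *str;
    } value;
} toy_pair_t;

#define TOY_MAX_PAIRS 16

typedef struct {
    toy_pair_t pairs[TOY_MAX_PAIRS];
    int        npairs;
} toy_nvlist_t;

void toy_nvlist_init(toy_nvlist_t *nvl)
{
    assert(nvl != NULL);
    nvl->npairs = 0;
}

/* Append an empty pair; returns NULL when the fixed-size list is full. */
static toy_pair_t *toy_new_pair(toy_nvlist_t *nvl, const char *name,
    toy_type_t type)
{
    assert(nvl != NULL && name != NULL);
    if (nvl->npairs >= TOY_MAX_PAIRS)
        return NULL;
    toy_pair_t *p = &nvl->pairs[nvl->npairs++];
    p->name = name;
    p->type = type;
    return p;
}

int toy_add_uint64(toy_nvlist_t *nvl, const char *name, uint64_t v)
{
    toy_pair_t *p = toy_new_pair(nvl, name, TOY_UINT64);
    if (p == NULL)
        return -1;
    p->value.u64 = v;
    return 0;
}

int toy_add_string(toy_nvlist_t *nvl, const char *name, const char *s)
{
    toy_pair_t *p = toy_new_pair(nvl, name, TOY_STRING);
    if (p == NULL)
        return -1;
    p->value.str = s;
    return 0;
}

/* Find a pair by name AND type; NULL if absent or of a different type. */
static const toy_pair_t *toy_find(const toy_nvlist_t *nvl, const char *name,
    toy_type_t type)
{
    assert(nvl != NULL && name != NULL);
    for (int i = 0; i < nvl->npairs; i++) {
        if (nvl->pairs[i].type == type &&
            strcmp(nvl->pairs[i].name, name) == 0)
            return &nvl->pairs[i];
    }
    return NULL;
}

int toy_lookup_uint64(const toy_nvlist_t *nvl, const char *name, uint64_t *out)
{
    const toy_pair_t *p = toy_find(nvl, name, TOY_UINT64);
    if (p == NULL)
        return -1;
    *out = p->value.u64;
    return 0;
}

int toy_lookup_string(const toy_nvlist_t *nvl, const char *name,
    const char **out)
{
    const toy_pair_t *p = toy_find(nvl, name, TOY_STRING);
    if (p == NULL)
        return -1;
    *out = p->value.str;
    return 0;
}
```

A consumer would call toy_lookup_uint64(&nvl, "afar", &afar) and get either the typed value or a clean failure. Asking for "location" as a uint64 fails instead of returning garbage.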

Submit the error report into our event transport:


Old: The /var/adm/messages file and possibly the console get spammed with a message that might look like this:

Corrected Memory Error detected by CPU 0x38,
address 0x53272400, syndrome 0x34 (data bit 59 corrected),
status Fixable, memory location /N0/SB4/P0/B1/S0 J13301

If syslogd is having a bad day at the office it may even lose this message - e.g., if it reads the message from the kernel and then dies before logging it.  Assuming it does arrive it has also reached the end of the road - any further processing is going to rely on grokking through messages files.
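To give a feel for what "grokking through messages files" means in practice, here is a hedged sketch of the kind of scraper a custom utility ends up containing. The names ce_msg_t and parse_ce_message are hypothetical, and the format string simply assumes the exact wording of the cmn_err call in the example. Any change to that wording, even a change of case, silently breaks the parser.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields we try to recover from the free-text message (hypothetical type). */
typedef struct {
    unsigned int       cpuid;
    unsigned long long afar;
    unsigned int       synd;
    char               status[16];
    char               loc[64];
} ce_msg_t;

/*
 * Scrape one corrected-memory-error message.
 * Returns 0 on success, -1 if the text does not match the assumed wording.
 */
int parse_ce_message(const char *line, ce_msg_t *out)
{
    assert(line != NULL && out != NULL);

    /* %*[^)] skips the "(data bit N corrected)" text without storing it. */
    int n = sscanf(line,
        "Corrected Memory Error detected by CPU 0x%x, "
        "address 0x%llx, syndrome 0x%x (%*[^)]), "
        "status %15[^,], memory location %63[^\n]",
        &out->cpuid, &out->afar, &out->synd, out->status, out->loc);

    return (n == 5) ? 0 : -1;
}
```

Every consumer has to carry its own copy of that format knowledge, which is precisely the coupling the structured event approach removes.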
New: The event transport used to safely transport the event from the kernel to the userland fmd is the sysevent transport.  The error event arrives at the fault manager for further processing.
The error event is logged by fmd.  If we were to look at the logged event with fmdump(1M) it would appear along these lines:

May 04 2005 16:26:48.718581240 ereport.cpu.ultraSPARC-IIIplus.mem-ce
nvlist version: 0
class = ereport.cpu.ultraSPARC-IIIplus.mem-ce
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = cpu
cpuid = 0x38
afar = 0x53272400
syndrome = 0x34
status = 1
location = /N0/SB4/P0/B1/S0 J13301
Of course a real ereport has a bit more complexity than that.
The fault manager supplies a copy of the event to each plugin module that has subscribed to this event class.  Each receives it as a name-value list and can access all the data fields through the API provided by libnvpair(3LIB).
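Subscription by event class can be pictured with a small sketch. This is a toy dispatcher in plain C, not the fmd module API: toy_class_match, toy_module_t and toy_dispatch are hypothetical names, and the rule that a pattern ending in ".*" matches every class below that prefix is an assumption made for the example.

```c
#include <assert.h>
#include <string.h>

/*
 * Return 1 if event class `cls` matches subscription `pattern`.
 * Assumed rule: a pattern is either an exact class name, or a prefix
 * ending in ".*" that matches any class underneath that prefix.
 */
int toy_class_match(const char *pattern, const char *cls)
{
    assert(pattern != NULL && cls != NULL);
    size_t plen = strlen(pattern);

    if (plen >= 2 && strcmp(pattern + plen - 2, ".*") == 0) {
        /* Compare up to and including the final dot, then require more. */
        return strncmp(pattern, cls, plen - 1) == 0 &&
            cls[plen - 1] != '\0';
    }
    return strcmp(pattern, cls) == 0;
}

/* A hypothetical subscriber: a name, one pattern, and a delivery count. */
typedef struct {
    const char *name;
    const char *pattern;
    int         delivered;
} toy_module_t;

/* Hand the event to every matching module; returns how many received it. */
int toy_dispatch(toy_module_t *mods, int nmods, const char *cls)
{
    assert(mods != NULL && cls != NULL);
    int count = 0;
    for (int i = 0; i < nmods; i++) {
        if (toy_class_match(mods[i].pattern, cls)) {
            mods[i].delivered++;
            count++;
        }
    }
    return count;
}
```

With subscribers for "ereport.cpu.*" and "ereport.io.*", a class of ereport.cpu.ultraSPARC-IIIplus.mem-ce reaches only the first. The publisher never needs to know who is listening.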
Old: No userland level diagnosis is performed, other than ad hoc processing of messages files by various custom utilities.  The kernel error handlers do perform limited diagnosis but, since the kernel context is necessarily somewhat limited, these are not sophisticated and each handler tends to roll its own approach.
New: Diagnosis Engine modules can collate statistics on the event stream.  The fmd API provides the infrastructure that a diagnosis engine might want in keeping track of various error and fault information.
High-level diagnosis languages are implemented in custom diagnosis engine modules, allowing diagnosis rules etc to be specified in a high level language rather than coding to a C API.

The DE may eventually decide that there is a fault (e.g., it notices that all events are coming from the same page of memory) and it will publish a fault event.  We can use fmdump to see the fault log.

An agent subscribing to the fault event class can retire the page of memory so that it is no longer used.

The example is incomplete (on both sides) but fair.  I hope it illustrates how unstructured the old regime is, and how structured and how extendable the new is.  Getting additional telemetry out of the kernel into a module that can consume that data to make intelligent diagnosis decisions is pretty trivial now.

Thursday Mar 10, 2005

Error Handling Philosophy: It Happens

In my first entry I referred to error handling and fault management as appearing "unglamorous".  What do I mean by that?  Well, of the more than three-quarters of a million downloads of Solaris 10 to date in how many cases do you think the uppermost question to be answered in subsequently trying it out was "Let's see how well this baby handles hardware faults?".  Answer: not enough.

There is, of course, an informed minority who recognise the value and necessity of proper error and fault handling.  But it seems that too many people subscribe to the quaint fallacy that "hardware errors should not happen, and where they do they indicate a vendor design or manufacturing flaw".  Hmmm, even given a perfect chip design it's more than a little counter-intuitive to believe that we can squeeze a few gazillion transistors onto a tiny chip and expect perfect operation from what is a physical process.  The reality is that electronics, like software, is imperfect both in design and in implementation.  Moreover it is realized in a physical medium and is subject to the laws of physics more than the laws of your data centre or desktop!  Now take all these components and imagine assembling them even into a simple system - opportunities abound!

So rather than stick your head in the sand, accept that hardware errors do happen, that they are expected, and that (whatever your hardware, from whoever) they near certainly have happened to you.  Whether or not you or anything actually noticed and, if so, did anything about it is quite another question.  We're into the domain of detection, correction, and remedial action.

Good hardware components and systems are designed to, at the very least, detect the presence of errors they have suffered (they detect the consequence of the event, say a bit flip, rather than the event itself which may have occurred some time earlier).  I'm always amused by the existence of "non-parity memory" for many PC systems (especially those sometimes quoted in the "build your own server for a few hundred bucks and run your business on it - why pay a vendor" articles).  "Non-parity" makes it sound like a feature, not an omission; like "Non-fattening" foods are a good thing so "non-parity" must be good.  Lacking even parity protection means that your data in memory is completely unprotected - if it is corrupted by a bit flip the only way you'll ever know is if you notice "cheap" spelled as "nasty" in your precious document (and many would blame the application), or if higher-level application software is performing checksumming on your data for you (not common).  The system has silently corrupted your data and allowed use of that corrupted data as if it had been good, and you're none the wiser!

Of course any self-respecting system nowadays has ECC on memory (typically single-bit-error correction and dual-bit-error detection without correction).  OK, so my cheapo home PC doesn't but that's only used for private email so I'll get over it - but I wouldn't run anything important on there.  A self-respecting system also has data protection, and increasingly correction, on datapaths, ASICs, CPU caches etc.

At its simplest, system error-handling software need only do what is required by the architecture to correctly survive the error.  This may range from nothing at all (hardware detected and corrected a single bit error and, optionally, let you know about its good deed) to cache flushes, memory rewrites etc.  But we want to do a great deal more than that.  We want to log the event, look for patterns and trends, predict impending uncorrectable failures and "head them off at the pass" (apologies) through preemptive actions, classify errors (e.g., transient or not going away) and so on.

Furthermore, if we accept that "it happens" - errors and faults will occur - we should also accept that they're not always going to be neat and tidy.  For example a particle strike that upsets a single memory cell is easily handled - it's an isolated event and the overhead of handling it (correction, logging, collating in case this is a part of a pattern) is trivial - but a damaged or compromised memory chip (manufacturing defect, electrostatic handling failures, nuts and bolts loose in the system, poor memory slot insertion, coffee in the memory slot etc) may produce a "storm" of errors - do you want the system to spend all its time logging and collating those or would you prefer it also do some useful work on your application? 

While a single memory chip problem will only affect accesses that involve that memory, think what happens when such problems beset the CPU - say a "stuck" bit (always reads 0 even after you write a 1 to it, or vice-versa) is in a primary (on-chip) cache such as the data cache.  Such a failure is going to generate many, many error events - how do you make sure nobody notices (i.e., no data is corrupted, everybody runs at full speed, etc.)?

Of course some failures are simply not correctable.  They may not involve extremes of storms and stuck bits - just 2 flipped bits in a memory word cannot be corrected with most ECC codes currently in use.  Data may be lost (e.g., if it can't be recovered from another copy elsewhere) and process or even operating system state may be compromised.  How do you contain the damage done with least interruption to system service?

I hope I've made the beginnings of a convincing case of why error and fault handling as a first-class citizen is essential in modern systems and operating systems.  In followup posts (probably after I get some such software putback-ready) I'll continue to make the case, describe where Sun is at and where we're going in the arena etc.  It's perhaps less glamorous working in the cesspit of things that we'd prefer did not happen, but since they do happen and will continue to with current technology it's certainly sexy when you can handle them with barely a glitch to the system or predict the occurrence and have already taken steps to contain the damage.

Tuesday Feb 01, 2005

Top Solaris 10 features for fault management

A number of Sun bloggers have posted top N lists of cool and new features in Solaris 10 (e.g., in Adam's blog). I thought I'd have a go at a Solaris 10 top 10 from the error handling and fault management and "Predictive Self-Healing" point of view, and then go into each item in a bit more detail in future entries.
  1. Sun has adopted a fault management architecture and Solaris 10 delivers the first (of many planned) offerings implementing this architecture in the fault management daemon (part of the svc:/system/fmd:default service).
  2. Error event handlers now propagate structured error reports to the fault manager where they are logged in perpetuity.
  3. Diagnosis Engines now exist to automate much diagnosis.
  4. Agent software can implement various policies given a diagnosis, e.g. to offline and blacklist a cpu or to retire some memory.
  5. Only diagnoses will appear on consoles etc, and they reference web-based knowledge articles.
  6. The contract filesystem ctfs provides a mechanism by which we can communicate hardware errors to groups of affected processes.
  7. The Service Management Facility is available to manage services affected by errors.
  8. Error trap handlers, now that there is a clear separation of responsibilities, are more robust.
  9. Getting error telemetry out of the kernel is dead easy now.
  10.  Fault management is no longer an afterthought!  And it is set to grow and grow.

Friday Jan 21, 2005

Lift off!

I registered this blog at the beginning of August 2004, so some 5+ months down the line it seems high time that I made an entry! So far all I've done is play with Roller themes for the odd few minutes here and there. In that time the Sun world seems to have blogged just about everything interesting that I might have had to say, anyway (I suppose I should drop the aspiration of limiting myself to interesting material alone now). But now that Solaris 10 is practically out the door I'm hoping to find a few minutes here and there to waffle away.

Time for the obligatory introduction. My name is Gavin Maltby and I joined Sun UK back in 1996. I'm based at the Guillemont Park campus down in the South East of England, although I work from home (a whole 10 minutes' drive from the campus - maybe when summer returns I'll even cycle in once in a while) quite a bit. I'm a software developer in the Scaleable Systems Group - SPARC Platform Software. Most of my current work is related to error handling and fault management, so aligns with Sun's RAS (reliability, availability, serviceability) efforts. My wife also works at Sun (internal business application development; no, we met many years before either of us started at Sun) and we have a 20-month-old daughter and "number 2" due in mid-April (so that should explain any blogging black holes around that time).

What will I be blogging? Well to begin with I'll see if I can bring some of the excitement of error handling and fault management to life here. It's an unglamorous area in some respects - errors and faults shouldn't happen, should they? Answers on the back of a USD 100 bill to me, please. Nonetheless, given that they do, there's lots of fun to be had in correctly handling the event, performing the required actions, passing the details on for diagnosis, avoiding downtime due to the event or any fault that might cause such an event to recur, and so on.

Well - it's a start.


I work in the Fault Management core group; this blog describes some of the work performed in that group.

