Thursday Nov 15, 2007

A Makeover for 'fmadm faulty'

I recently upgraded some of my systems to Nevada build 77 and, among a lot of other cool things outside of FMA, got to see the makeover given to 'fmadm faulty'. The changes were introduced in build 76 via 6484879...but hey, it's been a busy couple of weeks so I'm behind the times.

So what's the big deal? Why do I care? Short answer is fewer commands to see what's going on. Before this change,

# fmadm faulty
   STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
degraded mem:///unum=MB/CMP0/BR0:CH0/D0/J1001
         cd72a0c9-2c5d-e458-d866-f0d8d80ad0bb
-------- ----------------------------------------------------------------------

Kinda cryptic. No mention of the FRU, the message code...just the affected FMRI. To get that info, I'd need to run a separate 'fmdump -v -u <uuid>' command, such as:

# fmdump -v -u cd72a0c9-2c5d-e458-d866-f0d8d80ad0bb
TIME                 UUID                                 SUNW-MSG-ID
Sep 26 14:07:33.7174 cd72a0c9-2c5d-e458-d866-f0d8d80ad0bb SUN4V-8000-E2
  95%  Problem in: mem:///unum=MB/CMP0/BR0:CH0/D0/J1001
          Affects: mem:///unum=MB/CMP0/BR0:CH0/D0/J1001
              FRU: hc://:serial=22ab471:part//motherboard=0/chip=0/branch=0/dram-channel=0/dimm=0
         Location: MB/CMP0/BR0:CH0/D0/J1001

Better...although I still don't know the immediate impact to my system. That information is printed to the console and /var/adm/messages. So it takes a bit of poking around to gather all of the information.

With the new output:

# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 14 20:47:15 91ae3cb5-0d3b-ce3e-ee1a-bf764a0c8e99 SUN4V-8000-E2  Critical

Fault class : 95%

Affects     : mem:///unum=MB/CMP0/BR0:CH0/D0/J1001
                  degraded but still in service

FRU         : hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0721BBB013:server-id=wgs48-163:serial=d2155d2f//motherboard=0/chip=0/branch=0/dram-channel=0/dimm=0 95%

Description : The number of errors associated with this memory module has
              exceeded acceptable levels. Refer to for more information.

Response    : Pages of memory associated with this memory module are being
              removed from service as errors are reported.

Impact      : Total system memory capacity will be reduced as pages are
              retired.

Action      : Schedule a repair procedure to replace the affected memory
              module. Use fmdump -v -u to identify the module.

Much nicer. The severity and message code are immediately available. And the details on the impact to the system are right here....no need to go to the console or message logs.

One thing I've noticed that's been omitted from the new output is the Location field. It's still available with fmdump, and I always found that most useful. You tell me, but if you're in the field and want to identify the exact FRU, the NAC name is a lot more readable than the fully qualified hc scheme....particularly for IO.


Thursday Nov 01, 2007

The FMA Demo Kit

For anyone looking to demo FMA on a T1000, T2000, T5120, or T5220 system, now you can...at least for CPU errors. About a month ago, the FMA Demo Kit was announced on the alias. If you haven't tried the Demo Kit, check it out. It's short and simple, easy to use, and does a fine job of showing you an example of what the automated diagnosis, and some of the CLIs, will spit out on your system. What I find nice is that the demo can be run either live or in a simulation (fmsim) environment. In the simulation environment, other than a console message, your live FM logs aren't "tainted".

Anyway, I added support for the UltraSPARC-T1 and UltraSPARC-T2 processors to the kit, and the updates were posted today (11/01/2007). The Demo Kit uses fminject to push ereports into the fault management subsystem. In a prior posting, I'd included an example fminject input file for UltraSPARC-T1, with the caveat that it would not work on your system. Now, with the Demo Kit, the fminject input file will be automatically tailored to your system at runtime. For the ultra-curious, dig into the code.

For those not on the alias, here's a copy of the announcement:

Hello,

I'm pleased to announce that the Solaris FMA Demo Kit has been updated to include CPU demo support for the UltraSPARC-T1 and UltraSPARC-T2 processors. This update is for CPU support only...I'm still working on memory support.

For those that missed Rob Johnston's earlier announcement:

The Solaris FMA Demo Kit consists of a set of PERL and Korn shell scripts which implement an automated harness for executing FMA demos. The Demo Kit also provides example demos which demonstrate Solaris' ability to handle and diagnose CPU, Memory and PCI I/O errors. The Solaris FMA Demo Kit is designed to run on stock Solaris systems (both SPARC and x86), out-of-the-box - no custom error injection hardware or drivers are required.

For more information on the demo kit, including download, installation and usage instructions, please see:

Thanks,
-scott

And yes, the T1/T2 support is only for CPU errors at the moment. I'm working on memory errors and hope to have that integrated in a few weeks.


Monday Oct 22, 2007

Diagnosing FB-DIMM channel errors

The T5120/T5220 systems Sun recently announced are some of the first systems to come out of Sun using Fully Buffered DIMMs (FB-DIMMs). This new technology presented some new diagnosis challenges for us, but very recently, CR 6536482 - "diagnose FBR and FBU errors to branch" was putback into Solaris 10 (it's been in OpenSolaris for a few weeks now). A patch will be coming soon, and while I don't know the patch number, it'll almost definitely be an update to 119578.

Some very quick background on FB-DIMMs in layman's terms. An FB-DIMM contains an Advanced Memory Buffer (AMB) that sits between the memory controller and the memory module itself. Multiple FB-DIMMs are connected in a serial fashion, AMB-to-AMB.

In other words, a daisy chain. In T5120/T5220, the memory controller is in the UltraSPARC-T2 chip and we have up to four FB-DIMMs per branch.

So back to 6536482. On the T5120/T5220, detection and reporting of FB-DIMM channel errors has been in place since day one. But no diagnosis was issued for them, particularly for the recoverable errors, since this was new technology to Sun. We had to be careful about when to issue a diagnosis against the channel. It's cool that FB-DIMMs can tell us there's a problem with the channel, but the downside is that there is insufficient information available to determine where in the channel the problem exists. All we know is that something in the channel is broken. This is par for the course with uncorrectable errors (an FBU ereport), but something new for us with correctable events (an FBR ereport). Care was taken to select the appropriate SERD values for these correctable FBR events, as the resulting diagnosis implicates everything in the FB-DIMM channel. On a fully loaded T5120/T5220, that equates to 4 DIMMs plus the motherboard (the T2 chip).
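For readers unfamiliar with SERD (Soft Error Rate Discrimination), the idea is simple: a diagnosis fires only when more than N events land within a rolling time window T. Here's a minimal sketch of that mechanism; the class name and values are my own illustration, not code from the cpumem-diagnosis module:

```python
from collections import deque

class SerdEngine:
    """Minimal sketch of a SERD engine: fires when more than N events
    occur within a rolling time window of T seconds."""

    def __init__(self, n, t_seconds):
        self.n = n             # event-count threshold (>N fires)
        self.t = t_seconds     # rolling window length, in seconds
        self.events = deque()  # timestamps of events still in the window

    def record(self, timestamp):
        """Record one ereport; return True if the engine fires."""
        self.events.append(timestamp)
        # Discard events that have aged out of the rolling window.
        while self.events and timestamp - self.events[0] > self.t:
            self.events.popleft()
        return len(self.events) > self.n

# Illustrative run with a ">14 events in 30 minutes" threshold:
# 15 events arriving one second apart, so the 15th event fires.
serd = SerdEngine(n=14, t_seconds=30 * 60)
fired = [serd.record(t) for t in range(15)]
```

The fmstat output later in this post shows the same shape of threshold: an engine named for the branch, with >N and T columns, firing once the event count crosses the line.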

Anyway, lets see an example of a diagnosis of an FBR. As I usually do in my blogs, we'll look at this top-down, from the point of view of what you'd see on your system if such a diagnosis occurred. On the console, the following is reported:

SUNW-MSG-ID: SUN4V-8000-FJ, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Tue Oct 16 15:10:53 EDT 2007
PLATFORM: SUNW,SPARC-Enterprise-T5220, CSN: -, HOSTNAME: wgs48-100
SOURCE: cpumem-diagnosis, REV: 1.6
EVENT-ID: d505971c-f3dd-e27e-9e05-faed155c39bd
DESC: A problem was detected in the interconnect between a memory DIMM module and its memory controller. Refer to for more information.
AUTO-RESPONSE: No automated response.
IMPACT: System performance may be impacted.

Looking at the fmdump for the event, we see:

# fmdump -v -u d505971c-f3dd-e27e-9e05-faed155c39bd
TIME                 UUID                                 SUNW-MSG-ID
Oct 16 15:10:53.2312 d505971c-f3dd-e27e-9e05-faed155c39bd SUN4V-8000-FJ
  70%  Problem in: mem:///unum=MB/CMP0/BR0/CH0/D0
          Affects: mem:///unum=MB/CMP0/BR0/CH0/D0
              FRU: hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0704BB5053:server-id=wgs48-100:serial=22ab471//motherboard=0/chip=0/branch=0/dram-channel=0/dimm=0
         Location: MB/CMP0/BR0/CH0/D0
  30%  Problem in: hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0704BB5053:server-id=wgs48-100:serial=101083:part=541215101/motherboard=0
          Affects: hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0704BB5053:server-id=wgs48-100:serial=101083:part=541215101/motherboard=0
              FRU: hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0704BB5053:server-id=wgs48-100:serial=101083:part=541215101/motherboard=0
         Location: MB

In my particular system, I only have a single DIMM in the branch, so only a single DIMM is listed as a suspect. But in general, all DIMMs in the affected branch are listed. Also, notice the certainty of the fault is weighted toward the DIMMs. The 70%/30% ratio always holds, irrespective of the number of DIMMs in the branch. When multiple DIMMs are indicted, the 70% certainty is evenly distributed across the DIMMs.
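That weighting arithmetic can be made concrete with a tiny sketch (my own illustration of the split described above, not code from the diagnosis engine): 70% certainty is shared evenly by the DIMMs in the branch, and the motherboard always keeps 30%.

```python
def fbr_certainties(num_dimms):
    """Illustrative split of fault certainty for an FB-DIMM channel
    diagnosis: 70% shared evenly across the DIMMs in the branch,
    30% assigned to the motherboard, regardless of DIMM count."""
    return {"per_dimm": 70.0 / num_dimms, "motherboard": 30.0}

# One DIMM in the branch, as in the fmdump output above:
print(fbr_certainties(1))   # {'per_dimm': 70.0, 'motherboard': 30.0}
# Fully populated branch of 4 DIMMs: each DIMM is a 17.5% suspect.
print(fbr_certainties(4))   # {'per_dimm': 17.5, 'motherboard': 30.0}
```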

Finally, looking at the fmstat output and another fmdump command, we can see that multiple FBR events led to the crossing of the SERD threshold and the diagnosis.

# fmstat -s -m cpumem-diagnosis
NAME                     >N  T   CNT          DELTA STAT
branch_MB/CMP0/BR0_serd >14 30m   15  36131188301ns fire

# fmdump -e -u d505971c-f3dd-e27e-9e05-faed155c39bd
TIME                 CLASS
Oct 16 15:09:49.5552 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:46.8111 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:46.3707 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:45.9953 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:45.6104 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:45.2361 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:44.7160 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:31.7703 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:31.4599 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:31.0751 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:30.6986 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:30.0915 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:29.5263 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:28.7693 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:16.8732 ereport.cpu.ultraSPARC-T2plus.fbr

A final note is that on T5120/T5220, the reporting of FBR events is filtered. If one examines the expected bit error rate (BER) for FB-DIMMs, the bandwidth of the memory channels, and so forth, the math tells us that we can expect an FBR roughly once per second (yes, once per second) in worst case conditions. We filter to avoid inundating the error logs with FBR events. The SERD thresholds we see in the fmstat output are a result of worst-case calculations coupled with empirical testing results, and then accounting for the filtering done with the reporting.
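To make the back-of-the-envelope shape of that math concrete, here's a sketch. Both figures below are assumed placeholders for illustration, not the actual BER or bandwidth numbers used in the T5120/T5220 calculation:

```python
# Back-of-the-envelope FBR rate estimate. Both constants are
# assumptions chosen for illustration; the real calculation used
# Sun's worst-case figures for the FB-DIMM links.
BIT_ERROR_RATE = 1e-12    # assumed worst-case errors per bit transferred
BITS_PER_SECOND = 1e12    # assumed aggregate channel bandwidth

# Expected recoverable errors (FBRs) per second: errors/bit * bits/second.
expected_fbr_per_second = BIT_ERROR_RATE * BITS_PER_SECOND
# With these placeholder numbers, roughly one FBR per second in the
# worst case -- which is why reporting is filtered.
```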




