On FMA - it really does work..... even on my Ultra 20

A few weeks ago I took delivery of a brand, spanking new Sun Ultra 20 workstation, purchased for me by my department.

I installed build 38 of Solaris Express and promptly BFU'd to whatever the relevant nightly build was at the time. I do like to live on the edge :-)

Anyway, I noticed after a while (we're talking hours here, not weeks) that there was only ever one process showing up as on cpu when I ran prstat. Curious, I ran psrinfo -v, which told me that core 0 of my dual-core Opteron was faulted and offline.

Damn! This is a new machine, less than a day old. What on earth could have gone wrong with the cpu?

I thought I'd have a look at the FMA telemetry that was being generated. This was a bit of a shock:


# fmdump
TIME UUID SUNW-MSG-ID
Apr 27 18:50:23.6205 4cd32003-36a3-c3f8-ea93-b7edc762dd9f AMD-8000-JF
Apr 27 18:50:53.5720 5baaf5a5-2bbf-43c7-e3fe-ab24b007c3f7 AMD-8000-JF
Apr 28 07:13:56.8810 5baaf5a5-2bbf-43c7-e3fe-ab24b007c3f7 AMD-8000-JF
Apr 28 12:46:32.0074 5baaf5a5-2bbf-43c7-e3fe-ab24b007c3f7 AMD-8000-JF
Apr 28 13:37:59.2926 5baaf5a5-2bbf-43c7-e3fe-ab24b007c3f7 AMD-8000-JF
Apr 28 13:50:35.5724 412579b7-ed8d-607a-905c-e3fb998f290e ZFS-8000-D3
Apr 28 13:54:46.0114 5baaf5a5-2bbf-43c7-e3fe-ab24b007c3f7 AMD-8000-JF
Apr 28 13:54:46.3803 378726c1-1d68-c0c3-d0fd-9fb2b1431834 ZFS-8000-CS
Apr 28 14:23:09.6371 5baaf5a5-2bbf-43c7-e3fe-ab24b007c3f7 AMD-8000-JF
Apr 29 05:40:24.2258 a4d4edf8-520d-e625-8223-84c7ce652524 AMD-8000-2F
May 01 15:47:52.8092 abea0bc6-80b1-e022-edd1-d4a385117e0d AMD-8000-2F

Now, except for the ZFS* messages (which occurred when I was playing around with my SCSI multipack), we've got two SUNW-MSG-ID strings, AMD-8000-JF and AMD-8000-2F, which you can look up at http://www.sun.com/msg.
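If the listing gets long, it can help to tally it rather than eyeball it. The following is a minimal sketch in Python, not anything that ships with FMA: it assumes the non-verbose fmdump layout shown above (month, day, time, UUID, message ID, separated by whitespace), and the names `FMDUMP_OUTPUT` and `fmdump_rows` are made up for illustration.

```python
from collections import Counter

# Rows copied from the fmdump listing above, header line included.
FMDUMP_OUTPUT = """\
TIME UUID SUNW-MSG-ID
Apr 27 18:50:23.6205 4cd32003-36a3-c3f8-ea93-b7edc762dd9f AMD-8000-JF
Apr 27 18:50:53.5720 5baaf5a5-2bbf-43c7-e3fe-ab24b007c3f7 AMD-8000-JF
Apr 28 07:13:56.8810 5baaf5a5-2bbf-43c7-e3fe-ab24b007c3f7 AMD-8000-JF
Apr 28 12:46:32.0074 5baaf5a5-2bbf-43c7-e3fe-ab24b007c3f7 AMD-8000-JF
Apr 28 13:37:59.2926 5baaf5a5-2bbf-43c7-e3fe-ab24b007c3f7 AMD-8000-JF
Apr 28 13:50:35.5724 412579b7-ed8d-607a-905c-e3fb998f290e ZFS-8000-D3
Apr 28 13:54:46.0114 5baaf5a5-2bbf-43c7-e3fe-ab24b007c3f7 AMD-8000-JF
Apr 28 13:54:46.3803 378726c1-1d68-c0c3-d0fd-9fb2b1431834 ZFS-8000-CS
Apr 28 14:23:09.6371 5baaf5a5-2bbf-43c7-e3fe-ab24b007c3f7 AMD-8000-JF
Apr 29 05:40:24.2258 a4d4edf8-520d-e625-8223-84c7ce652524 AMD-8000-2F
May 01 15:47:52.8092 abea0bc6-80b1-e022-edd1-d4a385117e0d AMD-8000-2F
"""


def fmdump_rows(text):
    """Return (timestamp, uuid, msg_id) tuples for each data row.

    Assumption: a data row has exactly five whitespace-separated fields,
    which also skips the three-field header line.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 5:
            rows.append((" ".join(fields[:3]), fields[3], fields[4]))
    return rows


rows = fmdump_rows(FMDUMP_OUTPUT)
by_msg_id = Counter(msg_id for _, _, msg_id in rows)
by_uuid = Counter(uuid for _, uuid, _ in rows)

for msg_id, count in by_msg_id.most_common():
    print(f"{msg_id}: {count}")
```

Counting by UUID as well as by message ID shows which rows belong to the same UUID and which ones are separate diagnoses that happen to share a message ID.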

If you want to see what FMA has logged as faulted you can run


# fmdump -v -u 4cd32003-36a3-c3f8-ea93-b7edc762dd9f
TIME UUID SUNW-MSG-ID
Apr 27 18:50:23.6205 4cd32003-36a3-c3f8-ea93-b7edc762dd9f AMD-8000-JF
100% fault.cpu.amd.datapath

Problem in: hc:///motherboard=0/chip=0/cpu=0
Affects: cpu:///cpuid=0
FRU: hc:///motherboard=0/chip=0

Ok, so there's a serious-looking problem with core 0 in my cpu. Good thing I've got two cores. A quick psradm -n 0 got the core to say that it was back online, but I wasn't really sure I'd done anything to fix the underlying fault.
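The verbose record is easier to reason about once it is pulled apart into its fields. Here is a rough Python sketch that does so for the output above. It assumes the layout is exactly as shown here (a "certainty% class" line followed by "Problem in", "Affects" and "FRU" lines); `parse_fmri` and `parse_fault` are hypothetical helper names, not part of any FMA tool.

```python
# Verbose record copied from the fmdump -v -u output above.
FMDUMP_VERBOSE = """\
TIME UUID SUNW-MSG-ID
Apr 27 18:50:23.6205 4cd32003-36a3-c3f8-ea93-b7edc762dd9f AMD-8000-JF
100% fault.cpu.amd.datapath

Problem in: hc:///motherboard=0/chip=0/cpu=0
Affects: cpu:///cpuid=0
FRU: hc:///motherboard=0/chip=0
"""


def parse_fmri(fmri):
    """Split 'hc:///motherboard=0/chip=0' into ('hc', [('motherboard', '0'), ...])."""
    scheme, _, path = fmri.partition(":///")
    parts = []
    for component in path.split("/"):
        name, _, instance = component.partition("=")
        parts.append((name, instance))
    return scheme, parts


def parse_fault(text):
    """Collect the fault class, certainty and the three FMRI lines into a dict."""
    fault = {}
    for raw in text.splitlines():
        line = raw.strip()
        fields = line.split()
        if len(fields) == 2 and fields[0].endswith("%"):
            # e.g. "100% fault.cpu.amd.datapath"
            fault["certainty"] = int(fields[0].rstrip("%"))
            fault["class"] = fields[1]
        elif ": " in line:
            key, _, value = line.partition(": ")
            if key in ("Problem in", "Affects", "FRU"):
                fault[key] = parse_fmri(value)
    return fault


fault = parse_fault(FMDUMP_VERBOSE)
print(fault["class"], "->", fault["FRU"])
```

Reading the three FMRIs side by side makes the point of the record clearer: the problem is in one core (cpu=0), but the FRU path stops at chip=0, so the part to replace is the whole chip.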

What about the other AMD* messages, what do they mean?


# fmdump -v -u abea0bc6-80b1-e022-edd1-d4a385117e0d
TIME UUID SUNW-MSG-ID
May 01 15:47:52.8092 abea0bc6-80b1-e022-edd1-d4a385117e0d AMD-8000-2F
100% fault.memory.dimm_sb

Problem in: hc:///motherboard=0/chip=0/memory-controller=0/dimm=1
Affects: mem:///motherboard=0/chip=0/memory-controller=0/dimm=1
FRU: hc:///motherboard=0/chip=0/memory-controller=0/dimm=1

Ok, that's looking a tad worse. Especially when I tried fmadm repair abea0bc6-80b1-e022-edd1-d4a385117e0d. Look at what I got in /var/adm/messages:


May 9 18:04:00 pieces fmd: [ID 441519 daemon.error] SUNW-MSG-ID: FMD-8000-0W, TYPE: Defect, VER: 1, SEVERITY: Minor
May 9 18:04:00 pieces EVENT-TIME: Tue May 9 18:03:59 EST 2006
May 9 18:04:00 pieces PLATFORM: Sun Ultra 20 Workstation, CSN: 0614FK40E2, HOSTNAME: pieces
May 9 18:04:00 pieces SOURCE: fmd-self-diagnosis, REV: 1.0
May 9 18:04:00 pieces EVENT-ID: ee29d053-e3aa-cbe8-a458-ec8528f5bf99
May 9 18:04:00 pieces DESC: The Solaris Fault Manager received an event from a component to which no automated diagnosis software is currently subscribed. Refer to http://sun.com/msg/FMD-8000-0W for more information.
May 9 18:04:00 pieces AUTO-RESPONSE: Error reports from the component will be logged for examination by Sun.
May 9 18:04:00 pieces IMPACT: Automated diagnosis and response for these events will not occur.
May 9 18:04:00 pieces REC-ACTION: Run pkgchk -n SUNWfmd to ensure that fault management software is installed properly. Contact Sun for support.
May 9 18:04:00 pieces fmd: [ID 441519 daemon.error] SUNW-MSG-ID: FMD-8000-0W, TYPE: Defect, VER: 1, SEVERITY: Minor
May 9 18:04:00 pieces EVENT-TIME: Tue May 9 18:04:00 EST 2006
May 9 18:04:00 pieces PLATFORM: Sun Ultra 20 Workstation, CSN: 0614FK40E2, HOSTNAME: pieces
May 9 18:04:00 pieces SOURCE: fmd-self-diagnosis, REV: 1.0
May 9 18:04:00 pieces EVENT-ID: 27b29562-976e-ea5f-f554-d9010393029f
May 9 18:04:00 pieces DESC: The Solaris Fault Manager received an event from a component to which no automated diagnosis software is currently subscribed. Refer to http://sun.com/msg/FMD-8000-0W for more information.
May 9 18:04:00 pieces AUTO-RESPONSE: Error reports from the component will be logged for examination by Sun.
May 9 18:04:00 pieces IMPACT: Automated diagnosis and response for these events will not occur.
May 9 18:04:00 pieces REC-ACTION: Run pkgchk -n SUNWfmd to ensure that fault management software is installed properly. Contact Sun for support.

WTF!?!?!?
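Once the shock wears off, it is worth checking whether that wall of text is one event or several. The sketch below is a small Python illustration that groups the syslog lines into separate fmd messages. It assumes every line carries the four-token "month day time hostname" prefix seen above and that a new message starts at a line containing "fmd: "; the sample is a trimmed copy of the log (the DESC, AUTO-RESPONSE, IMPACT and REC-ACTION lines are left out), and `parse_fmd_messages` is a made-up name.

```python
# Trimmed copy of the /var/adm/messages excerpt above.
SYSLOG = """\
May 9 18:04:00 pieces fmd: [ID 441519 daemon.error] SUNW-MSG-ID: FMD-8000-0W, TYPE: Defect, VER: 1, SEVERITY: Minor
May 9 18:04:00 pieces EVENT-TIME: Tue May 9 18:03:59 EST 2006
May 9 18:04:00 pieces SOURCE: fmd-self-diagnosis, REV: 1.0
May 9 18:04:00 pieces EVENT-ID: ee29d053-e3aa-cbe8-a458-ec8528f5bf99
May 9 18:04:00 pieces fmd: [ID 441519 daemon.error] SUNW-MSG-ID: FMD-8000-0W, TYPE: Defect, VER: 1, SEVERITY: Minor
May 9 18:04:00 pieces EVENT-TIME: Tue May 9 18:04:00 EST 2006
May 9 18:04:00 pieces SOURCE: fmd-self-diagnosis, REV: 1.0
May 9 18:04:00 pieces EVENT-ID: 27b29562-976e-ea5f-f554-d9010393029f
"""


def parse_fmd_messages(text):
    """Group syslog lines into one dict per fmd message."""
    events = []
    for line in text.splitlines():
        # Drop the "May 9 18:04:00 pieces" prefix (four tokens).
        parts = line.split(None, 4)
        if len(parts) < 5:
            continue
        body = parts[4]
        if body.startswith("fmd: "):
            # First line of a message: skip the "[ID ...]" tag, then
            # split the comma-separated "KEY: value" pairs.
            body = body.split("] ", 1)[1]
            event = {}
            events.append(event)
            for pair in body.split(", "):
                key, _, value = pair.partition(": ")
                event[key] = value
        elif events:
            # Continuation line: one "KEY: value" per line.
            key, _, value = body.partition(": ")
            events[-1][key] = value
    return events


events = parse_fmd_messages(SYSLOG)
for event in events:
    print(event["SUNW-MSG-ID"], event["SEVERITY"], event["EVENT-ID"])
```

Grouped this way, the excerpt turns out to be two separate FMD-8000-0W events with different EVENT-IDs, logged a second apart, rather than one message printed twice.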

So I logged a call with Sun Support requesting a new dual-core cpu and two new dimms. (Yes, I know that the telemetry only mentioned one dimm, but they're shipped in pairs.) The parts duly came, and I installed them.

A day or so before the parts came I noticed that root got email saying that the fault log was too busy to rotate. This got me worried as well..... fmdump was showing 6 error events against dimm #1 every minute, which I thought was quite excessive.

So it was time for a quick search of the archives of FMA-discuss but nothing seemed to match. Time for an email to the team to find out whether they'd seen anything like this. Unfortunately not.

So I replaced the cpu (that worked just fine afterwards) and both dimms.... and the fma telemetry for the dimms continued.

Now I was getting worried. Really, really worried.

I'd taken the necessary precautions when replacing the cpu and the dimms, I'd tried running fmadm repair against the dimm uuids, and I'd even tried unloading just about all of the fma modules. (All that produced was messages to the effect of "hey! I've got an error and I dunno what to do with it.")

So I got in contact with the FMA core team and one of their number ssh'd into my workstation and dug around for a bit. I also got an email from Gavin Maltby letting me know that I actually had a single bit error on that dimm. From that he surmised that there was a single pin gone bad in the slot.... and could I spare the downtime to have a look please?

So at lunchtime that day I shut down the box, took the necessary static precautions and removed the dimms.

Lo and behold, I saw that there was indeed a bent pin in the second slot away from the cpu.


So I moved the pair of dimms to the other two slots, powered the system on.... ran fmadm repair on the dimm uuid, and life was good again.

I'd love to provide a "moral" to this anecdote, but there isn't one. All I can say is that this FMA stuff really does work and if you're not running Solaris 10 (or Express) by now then you are missing out.


Manpages that you will find useful for FMA include fmadm(1M), fmd(1M), fmdump(1M), and fmstat(1M).

And don't forget the OpenSolaris Fault Management community pages too.

About

I work at Oracle in the Solaris group. The opinions expressed here are entirely my own, and neither Oracle nor any other party necessarily agrees with them.
