Thursday Dec 03, 2009

sun4v FMA Firmware Faults

At long last! A recent putback finally provides a more meaningful diagnosis on sun4v systems where a bug in firmware is the likely culprit.

6502089 ferg.invalid errors should be diagnosed as a fault
6502086 DBU errors should be diagnosed as HV defect/fault

Prior to these fixes, "ferg.invalid" errors would result in FMA reporting the infamous "nosub" fault (FMD-8000-0W). And a DBU error wasn't reported at all, the user experience being a mysterious system crash/reset.

Now, such errors are diagnosed to a defect.fw.generic-sparc.erpt-gen (SUN4V-8002-SP) or defect.fw.generic-sparc.addr-oob (SUN4V-8002-RA) fault, respectively.
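To make the before/after picture concrete, here is a small lookup table written as a Python sketch. It only restates what this post says; the keys "ferg.invalid" and "DBU" are the shorthand names used above (not full ereport class strings), and the `DIAGNOSIS` table and `describe` helper are hypothetical names made up for illustration.

```python
# Illustrative only: restates the diagnosis change described in this post.
# Keys are the shorthand error names from the post, not real ereport classes.
DIAGNOSIS = {
    "ferg.invalid": {
        "before": "nosub fault (FMD-8000-0W)",
        "after_class": "defect.fw.generic-sparc.erpt-gen",
        "after_msgid": "SUN4V-8002-SP",
    },
    "DBU": {
        "before": "not reported (system crash/reset)",
        "after_class": "defect.fw.generic-sparc.addr-oob",
        "after_msgid": "SUN4V-8002-RA",
    },
}


def describe(error):
    """Return a one-line summary of how an error was and is now diagnosed."""
    d = DIAGNOSIS[error]
    return f"{error}: was {d['before']}; now {d['after_class']} ({d['after_msgid']})"


if __name__ == "__main__":
    for name in DIAGNOSIS:
        print(describe(name))
```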

As the cases for these errors become better understood, expect the article text to have more details.


Monday Oct 13, 2008

T5440 Fault Management

10/13/2008: Today Sun announced the T5440 platforms centered around the UltraSPARC T2 Plus processor. The T5440 is the big brother of the T5140/T5240 systems packing 256 strands of execution into a single system. And I'm happy to report that Solaris' Predictive Self Healing feature (also known as the Fault Management Architecture (FMA)) has been extended to include coverage for the T5440.

With respect to fault management, T5440 inherits both the T5140/T5240 FMA features and the T5120/T5220 FMA features. Just some of the features included are:

  • diagnosis of CPU errors at the strand, core, and chip level
  • offlining of problematic strands and cores
  • diagnosis of the memory subsystem
  • automatic FB-DIMM lane failover
  • extended IO controller diagnosis
  • identification of local vs remote
  • cryptographic unit offlining

The T5440 introduces a coherency interconnect tying the 4 UltraSPARC T2 Plus processors together. The interconnect ASICs have their own set of error detectors. The fault management system covers interconnect-detected errors as well.

Additionally, the FMA Demo Kit has been updated for the T5440. For those not familiar with the kit, it provides a harness for executing fault management demos on CPU, Memory and PCI errors. It runs on out-of-the-box Solaris - no custom hardware, error injectors, or configuration necessary. It's a great - and if using the simulation mode, safe - way to get familiar with how Solaris will respond to certain types of errors.

Example: T5440 Coherency Plane ASIC Error
If one of the coherency ASICs detects errors with coherency between the processors, the system may or may not continue to operate, depending on the nature of the error. A prime tenet of the hardware is to disallow propagation of bogus transactions - we don't want data corruption. An example of a fatal error on the coherency plane:

SUNW-MSG-ID: SUN4V-8001-R3, TYPE: Fault, VER: 1, SEVERITY: Critical
EVENT-TIME: Wed Oct 24 15:28:05 EDT 2007
PLATFORM: SUNW,T5440, CSN: -, HOSTNAME: wgs48-88
SOURCE: eft, REV: 1.16
EVENT-ID: bc7a8eb5-86be-c138-eece-e65e57840b95
DESC: The ultraSPARC-T2plus Coherency Interconnect has detected a serious communication problem between it and the CPU. Refer to for more information.
AUTO-RESPONSE: No automated reponse.
IMPACT: The system's integrity is seriously compromised.
REC-ACTION: Schedule a repair procedure to replace the affected component, the identity of which can be determined by using fmdump -v -u <EVENT_ID>.
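If you collect these console messages from logs, the labelled fields are easy to pull apart. The Python sketch below is a minimal example, assuming the message uses the field labels seen in the sample above; the `KEYS` list and the `parse_fault_message` helper are hypothetical names, and the sample text is an abridged copy of the message shown here, not output from a live system.

```python
import re

# Abridged copy of the console message from this post (assumed layout).
SAMPLE = (
    "SUNW-MSG-ID: SUN4V-8001-R3, TYPE: Fault, VER: 1, SEVERITY: Critical "
    "EVENT-TIME: Wed Oct 24 15:28:05 EDT 2007 "
    "PLATFORM: SUNW,T5440, CSN: -, HOSTNAME: wgs48-88 "
    "SOURCE: eft, REV: 1.16 "
    "EVENT-ID: bc7a8eb5-86be-c138-eece-e65e57840b95 "
    "IMPACT: The system's integrity is seriously compromised."
)

# Field labels taken from the sample message; other messages may differ.
KEYS = [
    "SUNW-MSG-ID", "TYPE", "VER", "SEVERITY", "EVENT-TIME", "PLATFORM",
    "CSN", "HOSTNAME", "SOURCE", "REV", "EVENT-ID", "DESC",
    "AUTO-RESPONSE", "IMPACT", "REC-ACTION",
]
KEY_RE = re.compile(r"\b(" + "|".join(re.escape(k) for k in KEYS) + r"): ")


def parse_fault_message(text):
    """Split a fault message into {label: value} using the known labels."""
    matches = list(KEY_RE.finditer(text))
    fields = {}
    for i, m in enumerate(matches):
        # A value runs until the next known label (or the end of the text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        fields[m.group(1)] = text[m.end():end].strip().rstrip(",")
    return fields


if __name__ == "__main__":
    for label, value in parse_fault_message(SAMPLE).items():
        print(f"{label:12} {value}")
```

The EVENT-ID value is the one to feed to fmdump -v -u, as the REC-ACTION text suggests.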

Thursday Nov 15, 2007

A Makeover for 'fmadm faulty'

I recently upgraded some of my systems to Nevada build 77 and, among a lot of other cool things outside of FMA, got to see the makeover given to 'fmadm faulty'. The changes were introduced in build 76 via 6484879...but hey, it's been a busy couple of weeks so I'm behind the times.

So what's the big deal? Why do I care? Short answer is fewer commands to see what's going on. Before this change,

# fmadm faulty
   STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
degraded mem:///unum=MB/CMP0/BR0:CH0/D0/J1001
         cd72a0c9-2c5d-e458-d866-f0d8d80ad0bb
-------- ----------------------------------------------------------------------

Kinda cryptic. No mention of the FRU, the message code...just the affected FMRI. To get this info, I'd need to run another 'fmdump -v -u <uuid>' command. Such as:

# fmdump -v -u cd72a0c9-2c5d-e458-d866-f0d8d80ad0bb
TIME                 UUID                                 SUNW-MSG-ID
Sep 26 14:07:33.7174 cd72a0c9-2c5d-e458-d866-f0d8d80ad0bb SUN4V-8000-E2
  95%
       Problem in: mem:///unum=MB/CMP0/BR0:CH0/D0/J1001
          Affects: mem:///unum=MB/CMP0/BR0:CH0/D0/J1001
              FRU: hc://:serial=22ab471:part//motherboard=0/chip=0/branch=0/dram-channel=0/dimm=0
         Location: MB/CMP0/BR0:CH0/D0/J1001

Better...although I still don't know the immediate impact to my system. That information is printed to the console and /var/adm/messages. So a bit of poking around to get all of the information.
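The old two-step routine (run 'fmadm faulty', then 'fmdump -v -u <uuid>' for each entry) is easy to script. Here is a rough Python sketch, assuming the old-style layout in which each UUID sits on its own line under the resource; `OLD_OUTPUT` is a hand-copied sample from this post and `fmdump_commands` is a hypothetical helper. It only builds the command lines and does not run them, so it works on any machine.

```python
import re

# Matches the lowercase hex UUIDs that fmadm prints.
UUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"
)

# Hand-copied sample of the pre-build-76 'fmadm faulty' output (assumed layout).
OLD_OUTPUT = """\
   STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
degraded mem:///unum=MB/CMP0/BR0:CH0/D0/J1001
         cd72a0c9-2c5d-e458-d866-f0d8d80ad0bb
-------- ----------------------------------------------------------------------
"""


def fmdump_commands(fmadm_output):
    """Return one 'fmdump -v -u <uuid>' argv list per distinct UUID found."""
    seen = []
    for uuid in UUID_RE.findall(fmadm_output):
        if uuid not in seen:  # keep first-seen order, drop repeats
            seen.append(uuid)
    return [["fmdump", "-v", "-u", uuid] for uuid in seen]


if __name__ == "__main__":
    for argv in fmdump_commands(OLD_OUTPUT):
        print(" ".join(argv))
```

On a real system, each argv list could be handed to subprocess.run; that step is left out here on purpose.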

With the new output:

# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 14 20:47:15 91ae3cb5-0d3b-ce3e-ee1a-bf764a0c8e99 SUN4V-8000-E2  Critical

Fault class : 95%
Affects     : mem:///unum=MB/CMP0/BR0:CH0/D0/J1001
              degraded but still in service
FRU         : hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0721BBB013:server-id=wgs48-163:serial=d2155d2f//motherboard=0/chip=0/branch=0/dram-channel=0/dimm=0
              95%

Description : The number of errors associated with this memory module has exceeded acceptable levels. Refer to for more information.

Response    : Pages of memory associated with this memory module are being removed from service as errors are reported.

Impact      : Total system memory capacity will be reduced as pages are retired.

Action      : Schedule a repair procedure to replace the affected memory module. Use fmdump -v -u to identify the module.

Much nicer. The severity and message code are immediately available. And the details on impact to the system are right here...no need to go to the console or message logs.

One thing I've noticed that's been omitted from the new output is the Location field, which I always found most useful. It's still available with fmdump. You tell me, but if you're in the field and want to identify the exact FRU, the NAC name is a lot more readable than the fully qualified hc scheme...particularly for IO.
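To see why the NAC name (MB/CMP0/BR0:CH0/D0/J1001) is easier on the eyes, it helps to break the hc-scheme FRU string into its pieces. The Python sketch below is based only on the sample FRU string in this post, not on a formal FMRI grammar; `parse_hc_fmri` is a hypothetical helper and the split rules are assumptions drawn from that one example.

```python
# FRU string copied from the new 'fmadm faulty' output in this post.
SAMPLE_FRU = (
    "hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0721BBB013"
    ":server-id=wgs48-163:serial=d2155d2f"
    "//motherboard=0/chip=0/branch=0/dram-channel=0/dimm=0"
)


def parse_hc_fmri(fmri):
    """Split an hc-scheme FMRI into (authority dict, list of path pairs).

    Assumption: everything between 'hc://' and the first '/' is a
    ':'-separated list of key=value items, and the rest is a '/'-separated
    list of name=instance components, as in the sample above.
    """
    prefix = "hc://"
    if not fmri.startswith(prefix):
        raise ValueError("not an hc-scheme FMRI")
    authority, _, path = fmri[len(prefix):].partition("/")
    auth = {}
    for item in authority.split(":"):
        if item:
            key, _, value = item.partition("=")  # tolerates a bare key
            auth[key] = value
    components = []
    for segment in path.split("/"):
        if segment:
            name, _, instance = segment.partition("=")
            components.append((name, instance))
    return auth, components


if __name__ == "__main__":
    auth, components = parse_hc_fmri(SAMPLE_FRU)
    for key, value in auth.items():
        print(f"{key:12} {value}")
    print(" > ".join(f"{name} {inst}" for name, inst in components))
```

Even after parsing, you get five generic components to read through, where the NAC name gives the board labels directly.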




