Thursday Dec 03, 2009

sun4v FMA Firmware Faults

At long last! A recent putback finally provides a more meaningful diagnosis on sun4v systems where a bug in firmware is the likely culprit.

6502089 ferg.invalid errors should be diagnosed as a fault
6502086 DBU errors should be diagnosed as HV defect/fault

Prior to these fixes, "ferg.invalid" errors would result in FMA reporting the infamous "nosub" fault (FMD-8000-0W). And a DBU error wasn't reported at all, the user experience being a mysterious system crash/reset.

Now, such errors are diagnosed, respectively, as a defect.fw.generic-sparc.erpt-gen (SUN4V-8002-SP) or defect.fw.generic-sparc.addr-oob (SUN4V-8002-RA) fault.
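The before/after behavior can be summarized as a small lookup table. This is only an illustrative sketch in Python, not how the diagnosis engine is implemented: the dictionary keys ("ferg.invalid", "dbu") and the function name are made up for the example, while the fault classes and message IDs are the ones quoted above.

```python
# Illustrative only: the keys and the diagnose() helper are hypothetical names.
# Fault classes and message IDs are taken from the text of this post.
NEW_DIAGNOSIS = {
    "ferg.invalid": ("defect.fw.generic-sparc.erpt-gen", "SUN4V-8002-SP"),
    "dbu": ("defect.fw.generic-sparc.addr-oob", "SUN4V-8002-RA"),
}

OLD_DIAGNOSIS = {
    "ferg.invalid": ("nosub", "FMD-8000-0W"),  # the infamous "nosub" fault
    "dbu": None,  # not reported at all: a mysterious crash/reset
}


def diagnose(error, fixed=True):
    """Return (fault class, message ID) for an error, or None if unreported."""
    table = NEW_DIAGNOSIS if fixed else OLD_DIAGNOSIS
    return table.get(error)


if __name__ == "__main__":
    for err in ("ferg.invalid", "dbu"):
        print(err, "before:", diagnose(err, fixed=False), "after:", diagnose(err))
```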

As the cases for these errors are better understood, expect the article text on http://sun.com/msg to have more details.

:wq

Monday Oct 13, 2008

T5440 Fault Management

10/13/2008: Today Sun announced the T5440 platforms centered around the UltraSPARC T2 Plus processor. The T5440 is the big brother of the T5140/T5240 systems packing 256 strands of execution into a single system. And I'm happy to report that Solaris' Predictive Self Healing feature (also known as the Fault Management Architecture (FMA)) has been extended to include coverage for the T5440.

With respect to fault management, T5440 inherits both the T5140/T5240 FMA features and the T5120/T5220 FMA features. Just some of the features included are:

  • diagnosis of CPU errors at the strand, core, and chip level
  • offlining of problematic strands and cores
  • diagnosis of the memory subsystem
  • automatic FB-DIMM lane failover
  • extended IO controller diagnosis
  • identification of local vs remote
  • cryptographic unit offlining

The T5440 introduces a coherency interconnect tying the 4 UltraSPARC T2 Plus processors together. The interconnect ASICs have their own set of error detectors. The fault management system covers interconnect-detected errors as well.

Additionally, the FMA Demo Kit has been updated for the T5440. For those not familiar with the kit, it provides a harness for executing fault management demos on CPU, Memory and PCI errors. It runs on out-of-the-box Solaris - no custom hardware, error injectors, or configuration necessary. It's a great - and if using the simulation mode, safe - way to get familiar with how Solaris will respond to certain types of errors.


Example: T5440 Coherency Plane ASIC Error
If one of the coherency ASICs detects errors with coherency between the processors, the system may or may not continue to operate, depending on the nature of the error. A prime tenet of the hardware is to disallow propagation of bogus transactions - we don't want data corruption. An example of a fatal error on the coherency plane:

SUNW-MSG-ID: SUN4V-8001-R3, TYPE: Fault, VER: 1, SEVERITY: Critical
EVENT-TIME: Wed Oct 24 15:28:05 EDT 2007
PLATFORM: SUNW,T5440, CSN: -, HOSTNAME: wgs48-88
SOURCE: eft, REV: 1.16
EVENT-ID: bc7a8eb5-86be-c138-eece-e65e57840b95
DESC: The ultraSPARC-T2plus Coherency Interconnect has detected a serious communication problem between it and the CPU.
  Refer to http://sun.com/msg/SUN4V-8001-R3 for more information.
AUTO-RESPONSE: No automated reponse.
IMPACT: The system's integrity is seriously compromised.
REC-ACTION: Schedule a repair procedure to replace the affected component, the identity of which can be determined by using fmdump -v -u <EVENT_ID>.
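If you want to pull the EVENT-ID out of a message like the one above and feed it to the fmdump command that REC-ACTION mentions, something like the following Python sketch would do. It is not part of FMA or the demo kit; the function names and the list of field labels are my own, taken from the labels visible in the message. It expects the bare message text (no syslog prefixes) and only builds the command line, it does not run it.

```python
import re

# Field labels as they appear in the message above (an assumption: other
# messages may carry labels that are not listed here).
LABELS = [
    "SUNW-MSG-ID", "TYPE", "VER", "SEVERITY", "EVENT-TIME", "PLATFORM",
    "CSN", "HOSTNAME", "SOURCE", "REV", "EVENT-ID", "DESC",
    "AUTO-RESPONSE", "IMPACT", "REC-ACTION",
]

# Match a label only when it is not the tail of a longer word, so that
# "VER:" is never found inside another token.
_LABEL_RE = re.compile(
    r"(?<![\w-])("
    + "|".join(re.escape(label) for label in sorted(LABELS, key=len, reverse=True))
    + r"):\s*"
)


def parse_fmd_message(text):
    """Split a message into {label: value}; works on one line or many."""
    parts = _LABEL_RE.split(text)
    # parts is [preamble, label1, value1, label2, value2, ...]
    fields = {}
    for label, value in zip(parts[1::2], parts[2::2]):
        fields[label] = " ".join(value.split()).rstrip(",")
    return fields


def fmdump_command(fields):
    """Build (but do not run) the fmdump invocation named in REC-ACTION."""
    return ["fmdump", "-v", "-u", fields["EVENT-ID"]]


SAMPLE = (
    "SUNW-MSG-ID: SUN4V-8001-R3, TYPE: Fault, VER: 1, SEVERITY: Critical "
    "EVENT-TIME: Wed Oct 24 15:28:05 EDT 2007 "
    "PLATFORM: SUNW,T5440, CSN: -, HOSTNAME: wgs48-88 "
    "SOURCE: eft, REV: 1.16 "
    "EVENT-ID: bc7a8eb5-86be-c138-eece-e65e57840b95"
)

if __name__ == "__main__":
    print(" ".join(fmdump_command(parse_fmd_message(SAMPLE))))
```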

Thursday Feb 21, 2008

FMA Demo Kit - UltraSPARC-T1/T2 Memory Support

Several months ago, shortly after the FMA Demo Kit was released, I added CPU support for UltraSPARC-T1 and -T2. At long last, memory support for systems based on these processors has arrived. Yes...I know this took a long time. The changes have been done for quite a while, but I was hung up in getting PSARC/2006/704 amended so I could use a private interface I needed to obtain DIMM serial numbers.

Anyway, here's what I posted to fm-discuss@opensolaris.org:

Hello,

I'm pleased to announce that the Solaris FMA Demo Kit has been updated to include Memory demo support for the UltraSPARC-T1 and UltraSPARC-T2 processors.

Note that Solaris Nevada build_58 or later is required for the T1/T2 demos. Solaris 10 users require patch 125369-05 or better for CPU demos. Memory demos require Solaris 10 Update 4. These requirements are also in the README.

For more information on the demo kit, including download, installation and usage instructions, please see:

http://www.opensolaris.org/os/community/fm/demokit/

Thanks,
-scott
-http://blogs.sun.com/sdaven

Also, here's an example of the output for the various monitoring windows from a run on a T5220 system.

Console:

Nov 14 20:47:15 wgs48-163 fmd: [ID 441519 daemon.error] SUNW-MSG-ID: SUN4V-8000-E2, TYPE: Fault, VER: 1, SEVERITY: Critical
Nov 14 20:47:15 wgs48-163 EVENT-TIME: Wed Nov 14 20:47:15 EST 2007
Nov 14 20:47:15 wgs48-163 PLATFORM: SUNW,SPARC-Enterprise-T5220, CSN: -, HOSTNAME: wgs48-163
Nov 14 20:47:15 wgs48-163 SOURCE: cpumem-diagnosis, REV: 1.6
Nov 14 20:47:15 wgs48-163 EVENT-ID: 91ae3cb5-0d3b-ce3e-ee1a-bf764a0c8e99
Nov 14 20:47:15 wgs48-163 DESC: The number of errors associated with this memory module has exceeded acceptable levels. Refer to http://sun.com/msg/SUN4V-8000-E2 for more information.
Nov 14 20:47:15 wgs48-163 AUTO-RESPONSE: Pages of memory associated with this memory module are being removed from service as errors are reported.
Nov 14 20:47:15 wgs48-163 IMPACT: Total system memory capacity will be reduced as pages are retired.
Nov 14 20:47:15 wgs48-163 REC-ACTION: Schedule a repair procedure to replace the affected memory module. Use fmdump -v -u to identify the module.

Error Log Monitor:

=============================== EREPORT LOG ===============================
TIME                 CLASS
Nov 14 20:47:15.1972 ereport.cpu.ultraSPARC-T2.dau

Fault Log Monitor:

=============================== FAULT EVENT LOG ===============================
TIME                 UUID                                 SUNW-MSG-ID
Nov 14 20:47:15.2662 ebff3832-0e99-6643-c80a-f35b6a46171a SUN4V-8000-C4
  100%  fault.memory.page
        Problem in: mem:///unum=MB/CMP0/BR0:CH0/D0/J1001/offset=6a35d400
           Affects: mem:///unum=MB/CMP0/BR0:CH0/D0/J1001/offset=6a35d400
               FRU: hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0721BBB013:server-id=wgs48-163:serial=d2155d2f//motherboard=0/chip=0/branch=0/dram-channel=0/dimm=0
          Location: MB/CMP0/BR0:CH0/D0/J1001

Nov 14 20:47:15.3378 91ae3cb5-0d3b-ce3e-ee1a-bf764a0c8e99 SUN4V-8000-E2
   95%  fault.memory.bank
        Problem in: mem:///unum=MB/CMP0/BR0:CH0/D0/J1001
           Affects: mem:///unum=MB/CMP0/BR0:CH0/D0/J1001
               FRU: hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0721BBB013:server-id=wgs48-163:serial=d2155d2f//motherboard=0/chip=0/branch=0/dram-channel=0/dimm=0
          Location: MB/CMP0/BR0:CH0/D0/J1001

ASRU Monitor:

--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 14 20:47:15 91ae3cb5-0d3b-ce3e-ee1a-bf764a0c8e99 SUN4V-8000-E2  Critical

Fault class : fault.memory.bank 95%
Affects     : mem:///unum=MB/CMP0/BR0:CH0/D0/J1001
              degraded but still in service
FRU         : hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0721BBB013:server-id=wgs48-163:serial=d2155d2f//motherboard=0/chip=0/branch=0/dram-channel=0/dimm=0
              95%

Description : The number of errors associated with this memory module has exceeded acceptable levels. Refer to http://sun.com/msg/SUN4V-8000-E2 for more information.
Response    : Pages of memory associated with this memory module are being removed from service as errors are reported.
Impact      : Total system memory capacity will be reduced as pages are retired.
Action      : Schedule a repair procedure to replace the affected memory module. Use fmdump -v -u to identify the module.
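The resource names in the monitor output above are worth a closer look: the page-level fault carries an offset in its mem:/// name, while the bank-level fault names only the DIMM, and the FRU's hc:// name ends in a component path. Here is a rough Python sketch that picks those apart. It handles only the shapes seen in this log, not the general FMRI syntax; the function names are mine, and treating the offset as hexadecimal is an assumption based on how the value looks.

```python
def parse_mem_fmri(fmri):
    """Return (unum, offset) from a mem:/// name; offset is None if absent.

    Assumes the 'mem:///unum=<label>[/offset=<hex>]' shape seen in the log.
    """
    prefix = "mem:///unum="
    if not fmri.startswith(prefix):
        raise ValueError("not a mem:///unum= name: " + fmri)
    unum, sep, offset = fmri[len(prefix):].partition("/offset=")
    # Assumption: the offset is written in hex without a 0x prefix.
    return unum, (int(offset, 16) if sep else None)


def parse_hc_path(fmri):
    """Return [(component, instance), ...] from the part after the last '//'.

    Assumes the name has an authority section followed by '//' and a path,
    as in the FRU lines of the log.
    """
    path = fmri.rsplit("//", 1)[1]
    return [tuple(part.split("=", 1)) for part in path.split("/") if part]


PAGE = "mem:///unum=MB/CMP0/BR0:CH0/D0/J1001/offset=6a35d400"
BANK = "mem:///unum=MB/CMP0/BR0:CH0/D0/J1001"
FRU = (
    "hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0721BBB013:"
    "server-id=wgs48-163:serial=d2155d2f//"
    "motherboard=0/chip=0/branch=0/dram-channel=0/dimm=0"
)

if __name__ == "__main__":
    print(parse_mem_fmri(PAGE))
    print(parse_mem_fmri(BANK))
    print(parse_hc_path(FRU))
```

Both memory names resolve to the same DIMM label, which is why the page fault and the bank fault point at the same FRU.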

:wq
