Thursday Dec 03, 2009

sun4v FMA Firmware Faults

At long last! A recent putback finally provides a more meaningful diagnosis on sun4v systems where a bug in firmware is the likely culprit.

6502089 ferg.invalid errors should be diagnosed as a fault
6502086 DBU errors should be diagnosed as HV defect/fault

Prior to these fixes, "ferg.invalid" errors would result in FMA reporting the infamous "nosub" fault (FMD-8000-0W). And a DBU error wasn't reported at all, the user experience being a mysterious system crash/reset.

Such errors are now diagnosed as a defect.fw.generic-sparc.erpt-gen (SUN4V-8002-SP) or defect.fw.generic-sparc.addr-oob (SUN4V-8002-RA) fault, respectively.

As the cases for these errors are better understood, expect the article text on http://sun.com/msg to have more details.
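For quick reference, the mapping described above can be written down as a small lookup table. This is only a sketch: the keys "ferg.invalid" and "DBU" are the informal names used in this post rather than full ereport class names, the helper names are made up for illustration, and the article URL pattern is assumed from the way message IDs are appended to http://sun.com/msg elsewhere on this blog.

```python
# Informal error name -> (fault class, message ID), as listed in this post.
# The keys are shorthand from the post, not full ereport class names.
FIRMWARE_DIAGNOSES = {
    "ferg.invalid": ("defect.fw.generic-sparc.erpt-gen", "SUN4V-8002-SP"),
    "DBU": ("defect.fw.generic-sparc.addr-oob", "SUN4V-8002-RA"),
}


def article_url(msg_id):
    """Build the knowledge article URL for a message ID (assumed pattern)."""
    return "http://sun.com/msg/" + msg_id


def describe(error):
    """Return a one-line summary of how the given error is now diagnosed."""
    fault_class, msg_id = FIRMWARE_DIAGNOSES[error]
    return f"{error} -> {fault_class} ({msg_id}), see {article_url(msg_id)}"


if __name__ == "__main__":
    for name in FIRMWARE_DIAGNOSES:
        print(describe(name))
```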

:wq

Monday Oct 13, 2008

T5440 Fault Management

10/13/2008: Today Sun announced the T5440 platforms centered around the UltraSPARC T2 Plus processor. The T5440 is the big brother of the T5140/T5240 systems packing 256 strands of execution into a single system. And I'm happy to report that Solaris' Predictive Self Healing feature (also known as the Fault Management Architecture (FMA)) has been extended to include coverage for the T5440.

With respect to fault management, T5440 inherits both the T5140/T5240 FMA features and the T5120/T5220 FMA features. Just some of the features included are:

  • diagnosis of CPU errors at the strand, core, and chip level
  • offlining of problematic strands and cores
  • diagnosis of the memory subsystem
  • automatic FB-DIMM lane failover
  • extended IO controller diagnosis
  • identification of local vs. remote errors
  • cryptographic unit offlining

The T5440 introduces a coherency interconnect tying the 4 UltraSPARC T2 Plus processors together. The interconnect ASICs have their own set of error detectors. The fault management system covers interconnect-detected errors as well.

Additionally, the FMA Demo Kit has been updated for the T5440. For those not familiar with the kit, it provides a harness for executing fault management demos on CPU, Memory and PCI errors. It runs on out-of-the-box Solaris - no custom hardware, error injectors, or configuration necessary. It's a great - and if using the simulation mode, safe - way to get familiar with how Solaris will respond to certain types of errors.


Example: T5440 Coherency Plane ASIC Error
If one of the coherency ASICs detects errors with coherency between the processors, the system may or may not continue to operate, depending on the nature of the error. A prime tenet of the hardware is to disallow propagation of bogus transactions - we don't want data corruption. An example of a fatal error on the coherency plane:

SUNW-MSG-ID: SUN4V-8001-R3, TYPE: Fault, VER: 1, SEVERITY: Critical
EVENT-TIME: Wed Oct 24 15:28:05 EDT 2007
PLATFORM: SUNW,T5440, CSN: -, HOSTNAME: wgs48-88
SOURCE: eft, REV: 1.16
EVENT-ID: bc7a8eb5-86be-c138-eece-e65e57840b95
DESC: The ultraSPARC-T2plus Coherency Interconnect has detected a serious communication problem between it and the CPU. Refer to http://sun.com/msg/SUN4V-8001-R3 for more information.
AUTO-RESPONSE: No automated reponse.
IMPACT: The system's integrity is seriously compromised.
REC-ACTION: Schedule a repair procedure to replace the affected component, the identity of which can be determined by using fmdump -v -u <EVENT_ID>.
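If you want to pick a console message like the one above apart in a script (say, to pull out the SEVERITY or the EVENT-ID to feed to fmdump), a small parser is enough. The following is a rough sketch, not an official tool: the function name and the key list are my own, taken from the fields visible in the message above, and it assumes none of those key names followed by a colon show up inside the free-text fields.

```python
import re

# Field names as they appear in the console message above.
KEYS = [
    "SUNW-MSG-ID", "TYPE", "VER", "SEVERITY", "EVENT-TIME", "PLATFORM",
    "CSN", "HOSTNAME", "SOURCE", "REV", "EVENT-ID", "DESC",
    "AUTO-RESPONSE", "IMPACT", "REC-ACTION",
]
_KEY_RE = re.compile(r"\b(" + "|".join(re.escape(k) for k in KEYS) + r"):\s*")


def parse_fault_message(text):
    """Split a fault message into a {field: value} dict.

    Works whether the message is on one line or spread over several,
    because each value simply runs until the next known key.
    """
    fields = {}
    matches = list(_KEY_RE.finditer(text))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # Collapse whitespace and drop the comma that separates header fields.
        value = " ".join(text[m.end():end].split()).rstrip(",")
        fields[m.group(1)] = value
    return fields


SAMPLE = (
    "SUNW-MSG-ID: SUN4V-8001-R3, TYPE: Fault, VER: 1, SEVERITY: Critical "
    "EVENT-TIME: Wed Oct 24 15:28:05 EDT 2007 "
    "PLATFORM: SUNW,T5440, CSN: -, HOSTNAME: wgs48-88 "
    "SOURCE: eft, REV: 1.16 "
    "EVENT-ID: bc7a8eb5-86be-c138-eece-e65e57840b95 "
    "DESC: The ultraSPARC-T2plus Coherency Interconnect has detected a "
    "serious communication problem between it and the CPU. "
    "IMPACT: The system's integrity is seriously compromised."
)

if __name__ == "__main__":
    msg = parse_fault_message(SAMPLE)
    print(msg["SEVERITY"], msg["EVENT-ID"])
```

The SAMPLE string is an abridged copy of the message above (the Refer-to sentence and the AUTO-RESPONSE and REC-ACTION fields are left out to keep it short).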

Wednesday Apr 09, 2008

Predictive Self Healing (FMA) for T5140/T5240

April 9, 2008: Sun announced the T5140/T5240 platforms centered around the UltraSPARC T2 Plus processor. The T2 Plus extends the capabilities of the UltraSPARC T2 processor, the most obvious being the capability for multiple processors in a system. And I'm happy to report that Solaris' Predictive Self Healing feature (also known as the Fault Management Architecture (FMA)) has been extended to include coverage for T2 Plus.

With respect to fault management, T2 Plus is very similar to the T2. The fault management features of T5140/T5240 are listed below, along with example output for a couple of the new T2 Plus diagnosis features.

  • Base UltraSPARC T2 features: All of the FMA features present on the T2 processor are also available with the T2 Plus-based systems
  • Coherency plane diagnosis: The T2 Plus processors in the T5140/T5240 systems communicate with one another across a coherency plane, similar in nature to a Fully Buffered DIMM (FB-DIMM) channel. Error handling and diagnosis have been enhanced to detect and diagnose errors (single-lane/multi-lane/protocol errors) on the coherency plane.
  • Local vs. remote errors: With multiple processors in the system, it is possible that one T2 Plus can trigger an error in another T2 Plus (e.g. a remote read of memory/cache). The error handlers have been extended to recognize local vs. remote errors and produce the proper telemetry so diagnosis engines indict the correct T2 Plus.
  • Automatic FB-DIMM lane failover: The UltraSPARC T2 Plus memory controller seamlessly handles a single lane failover on an FB-DIMM link without a system crash. The fault management subsystem has been updated to differentiate between FB-DIMM errors resulting in lane failovers vs. those that do not. Additional information on FB-DIMM diagnosis is in one of my earlier blogs.
  • Extended IO controller diagnosis: The embedded IO controller in the T2 Plus added a few new error detectors, and the FMA software has been extended to include diagnoses for these.

Example output of some of the new features is below.

Additionally, the FMA Demo Kit has been updated for the T5140/T5240 as well. For those not familiar with the kit, it provides a harness for executing fault management demos on CPU, Memory and PCI errors. It runs on out-of-the-box Solaris - no custom hardware, error injectors, or configuration necessary. It's a great - and if using the simulation mode, safe - way to get familiar with how Solaris will respond to certain types of errors.


Example: Automatic Recovery from a Coherency Plane Single Lane Failure
If the T2 Plus hardware detects errors on the coherency planes between the processors, the lanes can be retrained. In the face of a persisting error, a lane may be failed by the hardware. For a single lane failure, the system continues to operate and the following is printed to the console:

SUNW-MSG-ID: SUN4V-8001-MR, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Wed Aug 29 18:24:37 EDT 2007
PLATFORM: SUNW,T5140, CSN: -, HOSTNAME: wgs48-134
SOURCE: cpumem-diagnosis, REV: 1.6
EVENT-ID: 64fe5c4b-894c-e49e-b14d-ba7cbe809a12
DESC: A CPU chip's Link Framing Unit has stopped using a bad lane. Refer to http://sun.com/msg/SUN4V-8001-MR for more information.
AUTO-RESPONSE: No other automated response.
IMPACT: The system's capacity to correct transmission errors between CPU chips has been reduced.
REC-ACTION: Schedule a repair procedure to replace the affected resource, the identity of which can be determined using fmdump -v -u <EVENT_ID>.

Similar messaging is produced for multi-lane failures or protocol failures, although such failures are fatal and cause a system reset.


Example: FB-DIMM Single Lane Failover
In my blog a few months ago, I covered the addition of FB-DIMM channel diagnosis. New with T5140/T5240 is the hardware's capability to ensure a single lane failover without system interruption. When that happens, the following is printed to the console:

SUNW-MSG-ID: SUN4V-8001-7R, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Fri Feb 22 16:20:37 EST 2008
PLATFORM: SUNW,T5240, CSN: -, HOSTNAME: wgs48-113
SOURCE: cpumem-diagnosis, REV: 1.6
EVENT-ID: 24c57072-1f59-6054-9af4-a5515325ab0c
DESC: A problem was detected in the interconnect between a memory DIMM module and its memory controller. A lane failover has taken place. Refer to http://sun.com/msg/SUN4V-8001-7R for more information.
AUTO-RESPONSE: No automated response.
IMPACT: System performance may be impacted.
REC-ACTION: At convenient time, try reseating the memory module(s). If problem persists, contact Sun to schedule part replacement.

Note the bit about a lane failover taking place in the DESC portion of the console message. Also, when examining the telemetry leading to the fault, we see a single FB-DIMM recoverable ereport:

# fmdump -e -u 24c57072-1f59-6054-9af4-a5515325ab0c
TIME                 CLASS
Feb 22 16:20:37.7236 ereport.cpu.ultraSPARC-T2plus.fbr

Since the hardware has already experienced a lane failover, messaging of the fault is immediate. This type of correctable error is not put through a Soft Error Rate Discriminator (SERD) engine.
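For readers who haven't run into SERD before, the general idea is that a fault is only declared once N error events have been seen within a time window T. Here is a minimal sketch of that idea. It is not the fmd implementation; the class name, the thresholds and the timestamps are all made up for illustration. The lane failover case above skips this counting step entirely and is reported on the first ereport.

```python
from collections import deque


class SerdEngine:
    """Toy Soft Error Rate Discriminator: fires on N events within T seconds."""

    def __init__(self, n, t):
        self.n = n              # number of events needed to fire (hypothetical)
        self.t = t              # window length in seconds (hypothetical)
        self.events = deque()   # timestamps of events still inside the window

    def record(self, timestamp):
        """Record one event; return True if the engine fires."""
        self.events.append(timestamp)
        # Forget events that have aged out of the window.
        while timestamp - self.events[0] > self.t:
            self.events.popleft()
        return len(self.events) >= self.n


if __name__ == "__main__":
    engine = SerdEngine(n=3, t=10.0)
    for ts in (0.0, 4.0, 20.0, 22.0, 25.0):
        print(ts, engine.record(ts))
```

With these made-up numbers, the first two events age out before a third arrives, so the engine only fires on the last event, when three events (20, 22 and 25) fall inside one 10-second window.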


:wq
