Thursday Feb 21, 2008

FMA Demo Kit - UltraSPARC-T1/T2 Memory Support

Several months ago, shortly after the FMA Demo Kit was released, I added CPU support for UltraSPARC-T1 and -T2. At long last, memory support for systems based on these processors has arrived. Yes...I know this took a long time. The changes have been done for quite a while, but I was hung up in getting PSARC/2006/704 amended so I could use a private interface I needed to obtain DIMM serial numbers.

Anyway, here's what I posted to fm-discuss@opensolaris.org:

Hello,

I'm pleased to announce that the Solaris FMA Demo Kit has been updated to include memory demo support for the UltraSPARC-T1 and UltraSPARC-T2 processors.

Note that Solaris Nevada build_58 or later is required for the T1/T2 demos. Solaris 10 users require patch 125369-05 or better for the CPU demos; memory demos require Solaris 10 Update 4. These requirements are also in the README.

For more information on the demo kit, including download, installation and usage instructions, please see: http://www.opensolaris.org/os/community/fm/demokit/

Thanks,
-scott
http://blogs.sun.com/sdaven

Also, here's an example of the output for the various monitoring windows from a run on a T5220 system.

Console:

Nov 14 20:47:15 wgs48-163 fmd: [ID 441519 daemon.error] SUNW-MSG-ID: SUN4V-8000-E2, TYPE: Fault, VER: 1, SEVERITY: Critical
Nov 14 20:47:15 wgs48-163 EVENT-TIME: Wed Nov 14 20:47:15 EST 2007
Nov 14 20:47:15 wgs48-163 PLATFORM: SUNW,SPARC-Enterprise-T5220, CSN: -, HOSTNAME: wgs48-163
Nov 14 20:47:15 wgs48-163 SOURCE: cpumem-diagnosis, REV: 1.6
Nov 14 20:47:15 wgs48-163 EVENT-ID: 91ae3cb5-0d3b-ce3e-ee1a-bf764a0c8e99
Nov 14 20:47:15 wgs48-163 DESC: The number of errors associated with this memory module has exceeded acceptable levels. Refer to http://sun.com/msg/SUN4V-8000-E2 for more information.
Nov 14 20:47:15 wgs48-163 AUTO-RESPONSE: Pages of memory associated with this memory module are being removed from service as errors are reported.
Nov 14 20:47:15 wgs48-163 IMPACT: Total system memory capacity will be reduced as pages are retired.
Nov 14 20:47:15 wgs48-163 REC-ACTION: Schedule a repair procedure to replace the affected memory module. Use fmdump -v -u <EVENT-ID> to identify the module.
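As an aside, the REC-ACTION above points at fmdump. A quick sketch of that lookup (the UUID is the EVENT-ID from the console message; the output should look a lot like the Fault Log Monitor entry below):

fmdump -v -u 91ae3cb5-0d3b-ce3e-ee1a-bf764a0c8e99    # list this fault event and its suspect list verbosely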

Error Log Monitor:

=============================== EREPORT LOG ===============================

TIME                     CLASS
Nov 14 20:47:15.1972     ereport.cpu.ultraSPARC-T2.dau

Fault Log Monitor:

=============================== FAULT EVENT LOG ===============================

TIME                     UUID                                  SUNW-MSG-ID
Nov 14 20:47:15.2662     ebff3832-0e99-6643-c80a-f35b6a46171a  SUN4V-8000-C4
  100%  fault.memory.page
        Problem in: mem:///unum=MB/CMP0/BR0:CH0/D0/J1001/offset=6a35d400
           Affects: mem:///unum=MB/CMP0/BR0:CH0/D0/J1001/offset=6a35d400
               FRU: hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0721BBB013:server-id=wgs48-163:serial=d2155d2f//motherboard=0/chip=0/branch=0/dram-channel=0/dimm=0
          Location: MB/CMP0/BR0:CH0/D0/J1001

Nov 14 20:47:15.3378     91ae3cb5-0d3b-ce3e-ee1a-bf764a0c8e99  SUN4V-8000-E2
   95%  fault.memory.bank
        Problem in: mem:///unum=MB/CMP0/BR0:CH0/D0/J1001
           Affects: mem:///unum=MB/CMP0/BR0:CH0/D0/J1001
               FRU: hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0721BBB013:server-id=wgs48-163:serial=d2155d2f//motherboard=0/chip=0/branch=0/dram-channel=0/dimm=0
          Location: MB/CMP0/BR0:CH0/D0/J1001

ASRU Monitor:

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Nov 14 20:47:15 91ae3cb5-0d3b-ce3e-ee1a-bf764a0c8e99  SUN4V-8000-E2  Critical

Fault class : fault.memory.bank 95%
Affects     : mem:///unum=MB/CMP0/BR0:CH0/D0/J1001
              degraded but still in service
FRU         : hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0721BBB013:server-id=wgs48-163:serial=d2155d2f//motherboard=0/chip=0/branch=0/dram-channel=0/dimm=0 95%
Description : The number of errors associated with this memory module has exceeded acceptable levels. Refer to http://sun.com/msg/SUN4V-8000-E2 for more information.
Response    : Pages of memory associated with this memory module are being removed from service as errors are reported.
Impact      : Total system memory capacity will be reduced as pages are retired.
Action      : Schedule a repair procedure to replace the affected memory module. Use fmdump -v -u <EVENT-ID> to identify the module.

:wq

Friday Aug 24, 2007

Example: Core offlining on UltraSPARC-T1

In my last posting, I talked about one of the upcoming fault management features for UltraSPARC-T2 - core offlining. Today, I'll present that example on the UltraSPARC-T1 (Niagara) processor. This functionality is available today if you're running OpenSolaris, and is coming (very soon) in the next update release of Solaris 10.

Core offlining comes into play when a fault is determined for a processor core resource that is shared across all strands. For this discussion, I'll use the example of the data cache (D$). But the same logic applies to the instruction cache (I$), as well as the D$ TLB and I$ TLB.

In the UltraSPARC-T1, each core has 4 strands of execution. All 4 strands share a single data cache (D$). So, if there are errors in the D$, all 4 strands are affected. In the past, FMA would offline a single strand (the detector), which is imprecise for a couple of reasons:

  • The D$ is a shared resource in the core. Strands other than the one offlined are still subject to the problem in the D$. Thus, the system as a whole is still subject to errors.
  • D$ errors are recoverable errors and are put through a Soft Error Rate Discriminator (SERD) engine, which counts events over time. If more than N events occur within time T, the SERD threshold is exceeded and a diagnosis is produced. There is one SERD engine per D$, so an error detected by any strand within a core increments the count for that core's D$. When the threshold is exceeded, the fact that strand A happens to be the one that bumps the count beyond N doesn't mean that strands B, C, and D haven't experienced errors too.
The correct way to handle this is to offline all strands that are affected by the problem.
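To make that concrete: on UltraSPARC-T1 the strand (virtual CPU) IDs for core N are N*4 through N*4+3, so a D$ fault on core 0 implicates cpuids 0-3. The FMA retire agent handles the offlining for you; the snippet below is only a hypothetical manual equivalent using psradm, sketched here to illustrate the core-to-strand mapping.

# Hypothetical sketch: take all four strands of a faulted UltraSPARC-T1 core out of service.
# FMA's retire agent does this automatically; this only illustrates the core-to-strand mapping.
CORE=0                                   # core implicated by the D$ diagnosis
BASE=`expr $CORE \* 4`                   # first strand (virtual CPU) ID in that core
for OFFSET in 0 1 2 3; do
        STRAND=`expr $BASE + $OFFSET`
        psradm -f $STRAND                # mark this strand off-line
done
psrinfo                                  # confirm the strand states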

So let's look at an example. I'll use fminject to manually input D$ error events into FMD. For the terminally curious, here's the input file [1]. This input file simulates a D$ error detected by strand 0 on core 0. Let's inject the error, and see what the diagnosis engine is doing:

# psrinfo | head -8
0       on-line   since 08/23/2007 15:37:00
1       on-line   since 08/23/2007 15:37:00
2       on-line   since 08/23/2007 15:37:00
3       on-line   since 08/23/2007 15:37:00
4       on-line   since 08/23/2007 15:37:00
5       on-line   since 08/23/2007 15:37:00
6       on-line   since 08/23/2007 15:37:00
7       on-line   since 08/23/2007 15:37:00
# /usr/lib/fm/fmd/fminject hdddc
sending event ddc_a ... done
# fmstat -s -m cpumem-diagnosis
NAME                  >N  T    CNT  DELTA            STAT
cpu_0_1_dcache_serd   >8  7d   1    444539658324ns   pend

With the fmstat -s command, we can see the SERD engine for the D$ has been started. The N and T values tell us that if more than 8 (N) errors happen in 7 days (T), the engine will fire. Currently, there's a single event (CNT). This is the error we injected above. Repeating the injection above another 8 times will trip the SERD engine. Once that happens, we see the following on the console:
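If you're playing along at home, a small loop is the easy way to drive the remaining injections (just a convenience sketch; hdddc is the same injection file used above):

# Sketch: repeat the injection to push the SERD engine past its >8-in-7-days threshold
for PASS in 1 2 3 4 5 6 7 8; do
        /usr/lib/fm/fmd/fminject hdddc   # each pass adds one event to the D$ SERD engine
done
fmstat -s -m cpumem-diagnosis            # CNT should now read 9 and STAT should change to fire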

SUNW-MSG-ID: SUN4V-8000-63, TYPE: Fault, VER: 1, SEVERITY: Minor
EVENT-TIME: Thu Aug 23 16:41:00 EDT 2007
PLATFORM: SUNW,Sun-Fire-T200, CSN: -, HOSTNAME: foo
SOURCE: cpumem-diagnosis, REV: 1.6
EVENT-ID: 93b7158e-20f0-4bac-bdec-c2afa1924563
DESC: The number of errors associated with this CPU has exceeded acceptable levels. Refer to http://sun.com/msg/SUN4V-8000-63 for more information.
AUTO-RESPONSE: The fault manager will attempt to remove the affected CPU from service.
IMPACT: System performance may be affected.
REC-ACTION: Schedule a repair procedure to replace the affected CPU, the identity of which can be determined using fmdump -v -u <EVENT_ID>.

Checking fmstat, we can see the SERD engine has fired. And via the fmadm faulty and psrinfo output, we can see that all 4 strands in core 0 have been faulted:

# fmstat -s -m cpumem-diagnosis
NAME                  >N  T    CNT  DELTA            STAT
cpu_0_1_dcache_serd   >8  7d   9    479807399344ns   fire
# fmadm faulty
   STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
 faulted cpu:///cpuid=0/serial=FFFFF240AD4D5180
         93b7158e-20f0-4bac-bdec-c2afa1924563
-------- ----------------------------------------------------------------------
 faulted cpu:///cpuid=1/serial=FFFFF240AD4D5180
         93b7158e-20f0-4bac-bdec-c2afa1924563
-------- ----------------------------------------------------------------------
 faulted cpu:///cpuid=2/serial=FFFFF240AD4D5180
         93b7158e-20f0-4bac-bdec-c2afa1924563
-------- ----------------------------------------------------------------------
 faulted cpu:///cpuid=3/serial=FFFFF240AD4D5180
         93b7158e-20f0-4bac-bdec-c2afa1924563
-------- ----------------------------------------------------------------------
# psrinfo | grep faulted
0       faulted   since 08/23/2007 16:41:00
1       faulted   since 08/23/2007 16:41:00
2       faulted   since 08/23/2007 16:41:00
3       faulted   since 08/23/2007 16:41:00

And there you have it: the core is offlined, the standard FMA messaging is supplied, and service actions remain the same. The key point is that the faulty D$ is effectively isolated, since all users of it have been taken offline.
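One housekeeping note if you reproduce this with fminject: the fault is injected, not real, so you'll want to clear it when you're done. A sketch, using the EVENT-ID from the fmadm faulty output above (depending on the Solaris build, you may also need to bring the strands back on-line by hand):

# Sketch: undo the injected diagnosis after the demo
fmadm repair 93b7158e-20f0-4bac-bdec-c2afa1924563    # tell fmd the "faulty" resource has been repaired
psradm -n 0 1 2 3                                    # bring core 0's strands back on-line if needed
psrinfo                                              # confirm the strands report on-line again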

[1] This input file will be ignored on your system. One of the payload members is the serial number of the chip itself, which is unique to every system. The DE checks to make sure the telemetry coming in is in fact for a resource in the system being diagnosed. If folks think it's useful, perhaps in a future posting I'll describe how to modify this injection file for your own testing.

:wq
