Thursday Feb 21, 2008

FMA Demo Kit - UltraSPARC-T1/T2 Memory Support

Several months ago, shortly after the FMA Demo Kit was released, I added CPU support for UltraSPARC-T1 and -T2. At long last, memory support for systems based on these processors has arrived. Yes...I know this took a long time. The changes have been done for quite a while, but I was hung up in getting PSARC/2006/704 amended so I could use a private interface I needed to obtain DIMM serial numbers.

Anyway, here's what I posted to fm-discuss@opensolaris.org:

Hello,

I'm pleased to announce that the Solaris FMA Demo Kit has been updated to include Memory demo support for the UltraSPARC-T1 and UltraSPARC-T2 processors.

Note that Solaris Nevada build_58 or later is required for the T1/T2 demos. Solaris 10 users require patch 125369-05 or better for CPU demos. Memory demos require Solaris 10 Update 4. These requirements are also in the README.

For more information on the demo kit, including download, installation and usage instructions, please see:

http://www.opensolaris.org/os/community/fm/demokit/

Thanks,
-scott
-http://blogs.sun.com/sdaven

Also, here's an example of the output for the various monitoring windows from a run on a T5220 system.

Console:

    Nov 14 20:47:15 wgs48-163 fmd: [ID 441519 daemon.error] SUNW-MSG-ID: SUN4V-8000-E2, TYPE: Fault, VER: 1, SEVERITY: Critical
    Nov 14 20:47:15 wgs48-163 EVENT-TIME: Wed Nov 14 20:47:15 EST 2007
    Nov 14 20:47:15 wgs48-163 PLATFORM: SUNW,SPARC-Enterprise-T5220, CSN: -, HOSTNAME: wgs48-163
    Nov 14 20:47:15 wgs48-163 SOURCE: cpumem-diagnosis, REV: 1.6
    Nov 14 20:47:15 wgs48-163 EVENT-ID: 91ae3cb5-0d3b-ce3e-ee1a-bf764a0c8e99
    Nov 14 20:47:15 wgs48-163 DESC: The number of errors associated with this memory module has exceeded acceptable levels. Refer to http://sun.com/msg/SUN4V-8000-E2 for more information.
    Nov 14 20:47:15 wgs48-163 AUTO-RESPONSE: Pages of memory associated with this memory module are being removed from service as errors are reported.
    Nov 14 20:47:15 wgs48-163 IMPACT: Total system memory capacity will be reduced as pages are retired.
    Nov 14 20:47:15 wgs48-163 REC-ACTION: Schedule a repair procedure to replace the affected memory module. Use fmdump -v -u to identify the module.

Error Log Monitor:

    =============================== EREPORT LOG ===============================
    TIME                 CLASS
    Nov 14 20:47:15.1972 ereport.cpu.ultraSPARC-T2.dau

Fault Log Monitor:

    =============================== FAULT EVENT LOG ===============================
    TIME                 UUID                                 SUNW-MSG-ID
    Nov 14 20:47:15.2662 ebff3832-0e99-6643-c80a-f35b6a46171a SUN4V-8000-C4
      100%  fault.memory.page
            Problem in: mem:///unum=MB/CMP0/BR0:CH0/D0/J1001/offset=6a35d400
               Affects: mem:///unum=MB/CMP0/BR0:CH0/D0/J1001/offset=6a35d400
                   FRU: hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0721BBB013:server-id=wgs48-163:serial=d2155d2f//motherboard=0/chip=0/branch=0/dram-channel=0/dimm=0
              Location: MB/CMP0/BR0:CH0/D0/J1001

    Nov 14 20:47:15.3378 91ae3cb5-0d3b-ce3e-ee1a-bf764a0c8e99 SUN4V-8000-E2
      95%  fault.memory.bank
            Problem in: mem:///unum=MB/CMP0/BR0:CH0/D0/J1001
               Affects: mem:///unum=MB/CMP0/BR0:CH0/D0/J1001
                   FRU: hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0721BBB013:server-id=wgs48-163:serial=d2155d2f//motherboard=0/chip=0/branch=0/dram-channel=0/dimm=0
              Location: MB/CMP0/BR0:CH0/D0/J1001

ASRU Monitor:

    --------------- ------------------------------------ -------------- ---------
    TIME            EVENT-ID                             MSG-ID         SEVERITY
    --------------- ------------------------------------ -------------- ---------
    Nov 14 20:47:15 91ae3cb5-0d3b-ce3e-ee1a-bf764a0c8e99 SUN4V-8000-E2  Critical

    Fault class : fault.memory.bank 95%
    Affects     : mem:///unum=MB/CMP0/BR0:CH0/D0/J1001
                  degraded but still in service
    FRU         : hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0721BBB013:server-id=wgs48-163:serial=d2155d2f//motherboard=0/chip=0/branch=0/dram-channel=0/dimm=0 95%
    Description : The number of errors associated with this memory module has exceeded acceptable levels. Refer to http://sun.com/msg/SUN4V-8000-E2 for more information.
    Response    : Pages of memory associated with this memory module are being removed from service as errors are reported.
    Impact      : Total system memory capacity will be reduced as pages are retired.
    Action      : Schedule a repair procedure to replace the affected memory module. Use fmdump -v -u to identify the module.

:wq

Thursday Jan 24, 2008

2007 Product of the Year Gold Award for T5120/T5220

Some cool news...the UltraSPARC-T2 systems received a Gold Award in the small servers category in the Product of the Year awards issued by SearchDataCenter.com. Too bad the cool fault management features weren't mentioned in the article.

:wq

Thursday Jan 03, 2008

FMA Demo Kit - Bug in UltraSPARC-T1/T2 CPU Support

For the past couple of months, I've been working on adding memory support to the FMA Demo Kit for UltraSPARC-T1 and UltraSPARC-T2 processors. That coding and testing is actually done; I'm just working on amending a contract in one of the PSARC cases (sadly, not an open one, otherwise I'd hyperlink here).

But, in testing the memory support, I found a bug in the CPU functionality. It's exposed when the physical to logical mapping of CPUs isn't 1-to-1. In cpu/predemo.sun4v, the following logic is used to find an online CPU and get its brand and serial number:

    &common::get_online_cpuids;

    my $brand = &common::get_cpu_brand($cpuids[0]);
    if ($brand eq "UNKNOWN") {
        &common::log_message(1, 1, "The CPU fault demo is not supported ".
            "on this platform.\n");
        exit 1;
    }

    my $serial = &common::get_cpu_serial($cpuids[0]);
    if ($serial eq "UNKNOWN") {
        &common::log_message(1, 1, "Unable to obtain serial number for ".
            "cpuid $cpuids[0].\n");
        exit 1;
    }

The problem is with how get_online_cpuids() in lib/common.pm generates the list of online CPUs. Here's the routine:

    sub get_online_cpuids {
        my $cpuid = $_[0];
        my @output = `$PSRINFO | $GREP on-line`;

        foreach (@output) {
            if ($_ =~ /^(\d+)\s+\S+/) {
                push(@main::cpuids, $1);
            }
        }
    }

It's using psrinfo(1M) to list online CPUs. This in and of itself is fine, but psrinfo displays logical cpuids, whereas FMA works in a physical world. So, cpuid 0 is not necessarily physical CPU strand 0. For example, suppose that on an UltraSPARC-T2 only core 1 is enabled. The system has 8 strands. Logically, they are strands 0-7. Physically, they are 8-15. Compare the output of psrinfo with fmtopo -s cpu:

    # psrinfo
    0       on-line   since 11/14/2007 10:57:39
    1       on-line   since 11/14/2007 11:03:17
    2       on-line   since 11/14/2007 11:03:17
    3       on-line   since 11/14/2007 11:03:17
    4       on-line   since 11/14/2007 11:03:17
    5       on-line   since 11/14/2007 11:03:17
    6       on-line   since 11/14/2007 11:03:17
    7       on-line   since 11/14/2007 11:03:17

    # /usr/lib/fm/fmd/fmtopo -s cpu
    TIME                 UUID
    Nov 14 20:10:12 bd97c9f8-7d8d-c3b6-f336-92de8d821111

    cpu:///cpuid=8/serial=5d67334847
    cpu:///cpuid=9/serial=5d67334847
    cpu:///cpuid=10/serial=5d67334847
    cpu:///cpuid=11/serial=5d67334847
    cpu:///cpuid=12/serial=5d67334847
    cpu:///cpuid=13/serial=5d67334847
    cpu:///cpuid=14/serial=5d67334847
    cpu:///cpuid=15/serial=5d67334847

So, when the Demo Kit finds the first online processor, what it gets is a logical cpuid. In this example, the Demo Kit will use a cpuid of 0 when constructing the fminject input file. When the CPU/Mem Diagnosis Engine receives the ereport, it'll drop it since cpuid 0 is not physically present in the system. End result - a very uninteresting demo.
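To make the mismatch concrete, here is a small sketch that parses the psrinfo and fmtopo output from the example above and checks whether the cpuid the Demo Kit would pick actually appears in the topology. It is written in Python purely for illustration (the kit itself is Perl), and the function names are hypothetical, not part of the Demo Kit.

```python
import re

# Sample text copied from the psrinfo / fmtopo example in this post.
PSRINFO_OUTPUT = "\n".join(
    ["0       on-line   since 11/14/2007 10:57:39"]
    + ["%d       on-line   since 11/14/2007 11:03:17" % n for n in range(1, 8)]
)
FMTOPO_OUTPUT = "\n".join(
    "cpu:///cpuid=%d/serial=5d67334847" % n for n in range(8, 16)
)


def logical_online_cpuids(psrinfo_text):
    """Return the logical cpuids that psrinfo reports as on-line."""
    ids = []
    for line in psrinfo_text.splitlines():
        match = re.match(r"(\d+)\s+on-line\b", line)
        if match:
            ids.append(int(match.group(1)))
    return ids


def physical_cpuids(fmtopo_text):
    """Return the cpuids found in cpu-scheme FMRIs in fmtopo output."""
    return [int(n) for n in re.findall(r"cpu:///cpuid=(\d+)", fmtopo_text)]


logical = logical_online_cpuids(PSRINFO_OUTPUT)
physical = physical_cpuids(FMTOPO_OUTPUT)

# The Demo Kit takes the first logical cpuid; check it against the topology.
chosen = logical[0]
if chosen not in physical:
    print("cpuid %d is not in the cpu topology; the ereport would be dropped"
          % chosen)
```

With the sample data, the chosen cpuid is 0 while the topology only lists 8 through 15, which is exactly the situation described above.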

The fix is reasonably straightforward, just need a little jiggering after finding the first online CPU via psrinfo to translate that to a physical CPU before constructing the fminject input file. It'd be wonderful just to use fmtopo -S -s cpu and key on the "Unusable" property, but the -S isn't available in S10 currently. That and not all platforms enumerate in the cpu scheme.
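One possible shape for that translation step is sketched below, in Python for illustration only. It rests on a loud assumption that is mine and not confirmed anywhere in the Demo Kit: that the n-th lowest logical cpuid corresponds to the n-th lowest physical cpuid. That holds for the single-core example in this post, but it may not hold in general, and it only helps on platforms that enumerate in the cpu scheme at all. The function name is hypothetical.

```python
def logical_to_physical(logical_ids, physical_ids):
    """Map each logical cpuid to a physical cpuid by position.

    ASSUMPTION (not taken from the Demo Kit): both lists describe the same
    strands, and sorting each list puts matching strands at the same index.
    Returns None when the counts differ, since the pairing is then ambiguous.
    """
    logical_sorted = sorted(logical_ids)
    physical_sorted = sorted(physical_ids)
    if len(logical_sorted) != len(physical_sorted):
        return None
    return dict(zip(logical_sorted, physical_sorted))


# Values from the example in this post: logical 0-7, physical 8-15.
mapping = logical_to_physical(range(8), range(8, 16))

# The cpuid to place in the fminject input file instead of logical 0.
first_physical = mapping[0]
print(first_physical)
```

Returning None on a count mismatch is a deliberate choice in this sketch: guessing a pairing would just reintroduce the original problem of injecting an ereport for a strand that isn't there.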

I've filed CR 6647037 to track this. Anyone out in the community that wants to contribute a fix is welcome to it. It could be a few weeks before I can tackle this (that pesky day job again).

:wq
