FMA Demo Kit - Bug in UltraSPARC-T1/T2 CPU Support

For the past couple of months, I've been working on adding memory support to the FMA Demo Kit for UltraSPARC-T1 and UltraSPARC-T2 processors. The coding and testing are actually done; I'm just working on amending a contract in one of the PSARC cases (sadly, not an open one, otherwise I'd hyperlink it here).

But, in testing the memory support, I found a bug in the CPU functionality. It's exposed when the physical-to-logical mapping of CPUs isn't 1-to-1. In cpu/predemo.sun4v, the following logic is used to find an online CPU and get its brand and serial number:

    &common::get_online_cpuids;
    my $brand = &common::get_cpu_brand($cpuids[0]);
    if ($brand eq "UNKNOWN") {
        &common::log_message(1, 1, "The CPU fault demo is not supported ".
            "on this platform.\n");
        exit 1;
    }
    my $serial = &common::get_cpu_serial($cpuids[0]);
    if ($serial eq "UNKNOWN") {
        &common::log_message(1, 1, "Unable to obtain serial number for ".
            "cpuid $cpuids[0].\n");
        exit 1;
    }

The problem is with how get_online_cpuids() in lib/common.pm generates the list of online CPUs. Here's the routine:

    sub get_online_cpuids {
        my $cpuid = $_[0];
        my @output = `$PSRINFO | $GREP on-line`;
        foreach (@output) {
            if ($_ =~ /^(\d+)\s+\S+/) {
                push(@main::cpuids, $1);
            }
        }
    }

It's using psrinfo(1M) to list online CPUs. That in itself is fine, but psrinfo displays logical cpuids, while FMA works in a physical world. So, cpuid 0 is not necessarily physical CPU strand 0. For example, suppose that on an UltraSPARC-T2 only CORE 1 is enabled. The system has 8 strands. Logically, they are strands 0-7. Physically, they are 8-15. Compare the output of psrinfo with fmtopo -s cpu:

    # psrinfo
    0       on-line   since 11/14/2007 10:57:39
    1       on-line   since 11/14/2007 11:03:17
    2       on-line   since 11/14/2007 11:03:17
    3       on-line   since 11/14/2007 11:03:17
    4       on-line   since 11/14/2007 11:03:17
    5       on-line   since 11/14/2007 11:03:17
    6       on-line   since 11/14/2007 11:03:17
    7       on-line   since 11/14/2007 11:03:17

    # /usr/lib/fm/fmd/fmtopo -s cpu
    TIME                 UUID
    Nov 14 20:10:12 bd97c9f8-7d8d-c3b6-f336-92de8d821111

    cpu:///cpuid=8/serial=5d67334847
    cpu:///cpuid=9/serial=5d67334847
    cpu:///cpuid=10/serial=5d67334847
    cpu:///cpuid=11/serial=5d67334847
    cpu:///cpuid=12/serial=5d67334847
    cpu:///cpuid=13/serial=5d67334847
    cpu:///cpuid=14/serial=5d67334847
    cpu:///cpuid=15/serial=5d67334847
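To make the mismatch concrete, here's a minimal sketch that parses the two kinds of output and reports which logical cpuids don't appear in the topology. The helper names parse_psrinfo and parse_fmtopo are hypothetical (they are not in the Demo Kit's lib/common.pm), and the input lines are canned samples modeled on the listings above rather than live command output.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; not part of the FMA Demo Kit.

# Extract logical cpuids from psrinfo lines that are marked on-line.
sub parse_psrinfo {
    my @lines = @_;
    my @ids;
    foreach my $line (@lines) {
        push(@ids, $1) if $line =~ /^(\d+)\s+on-line\b/;
    }
    return @ids;
}

# Extract physical cpuids from the FMRI lines of `fmtopo -s cpu`.
sub parse_fmtopo {
    my @lines = @_;
    my @ids;
    foreach my $line (@lines) {
        push(@ids, $1) if $line =~ m{^cpu:///cpuid=(\d+)/};
    }
    return @ids;
}

# Canned sample lines shaped like the output shown in this post.
my @psrinfo = map { "$_\ton-line   since 11/14/2007 11:03:17\n" } 0 .. 7;
my @fmtopo  = map { "cpu:///cpuid=$_/serial=5d67334847\n" } 8 .. 15;

my @logical  = parse_psrinfo(@psrinfo);
my %physical = map { $_ => 1 } parse_fmtopo(@fmtopo);

# Logical ids that have no matching cpuid in the topology.
my @missing = grep { !$physical{$_} } @logical;
print "logical ids unknown to fmtopo: @missing\n";
```

With these samples, every logical id lands in the "unknown" list, which is exactly the situation that trips up the demo.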

So, when the Demo Kit finds the first online processor, what it has is a logical cpuid. In this example, the Demo Kit will use a cpuid of 0 when constructing the fminject input file. When the CPU/Mem Diagnosis Engine receives the ereport, it'll drop it since cpuid 0 is not physically present in the system. End result: a very uninteresting demo.

The fix is reasonably straightforward: it just needs a little jiggering after finding the first online CPU via psrinfo to translate that to a physical CPU before constructing the fminject input file. It'd be wonderful just to use fmtopo -S -s cpu and key on the "Unusable" property, but the -S option isn't available in S10 currently. That, and not all platforms enumerate in the cpu scheme.
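One way the translation step might look, purely as a sketch: given a list of logical cpuids and a list of physical cpuids, pair them up by rank. This rests on an assumption that the Nth lowest logical id corresponds to the Nth lowest physical id, which matches the 0-7 vs. 8-15 example here but is not something I've verified in general. The function name logical_to_physical is hypothetical, and where the physical list comes from is left open, since the cpu scheme isn't enumerated on every platform.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; not the actual fix for the Demo Kit.
# ASSUMPTION: the Nth lowest logical cpuid maps to the Nth lowest physical
# cpuid. Returns undef when no mapping can be made.
sub logical_to_physical {
    my ($logical, $logical_ids, $physical_ids) = @_;

    my @l = sort { $a <=> $b } @$logical_ids;
    my @p = sort { $a <=> $b } @$physical_ids;

    # If the two lists differ in length, refuse to guess.
    return if @l != @p;

    for my $i (0 .. $#l) {
        return $p[$i] if $l[$i] == $logical;
    }
    return;    # the logical id was not in the list
}

# Sample ids taken from the example in this post.
my $phys = logical_to_physical(0, [0 .. 7], [8 .. 15]);
print defined $phys ? "logical 0 -> physical $phys\n" : "no mapping\n";
```

Returning undef instead of falling back to the logical id means the caller can bail out with a clear message rather than inject an ereport that will be silently dropped.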

I've filed CR 6647037 to track this. Anyone out in the community that wants to contribute a fix is welcome to it. It could be a few weeks before I can tackle this (that pesky day job again).

:wq
