sun4v Chip Offline Using the FMA Demo Kit

I met with a customer today, and it had been far too long since I'd done that. They were putting their T5140/T5240 systems through their paces, including a variety of fault scenarios. One scenario they wanted to test was the case where all strands on a single UltraSPARC-T2plus chip are offlined. First, I explained that the FMA fault policies for T5140/T5240 (and for all sun4v platforms at present) do not offline all strands on an entire chip. However, the situation could arise from a series of failures at the core level. It's a contrived situation, perhaps, since corrective action would likely take place after the first or second core failure (we're all very good about addressing faulted components in our systems, right?).

Enter the FMA Demo Kit. The demo for T2plus processors in the kit is for a core offline. The demo kit finds the lowest numbered online processor strand and uses that strand as the simulated error detector. By running a series of FMA Demos, we can have FMA offline all strands on a chip.

WARNING: This uses the Demo Kit in "live" mode. Running in live mode on your production systems is not recommended. Any faults generated will be transported to your SP and to any other monitoring software (e.g. SNMP).

Assuming a healthy, full-up T5140/T5240 with 128 strands online:

# ./run_fmdemo -d cpu -L   ### offlines VF0 core 0
# ./run_fmdemo -d cpu -L   ### offlines VF0 core 1
# ./run_fmdemo -d cpu -L   ### offlines VF0 core 2
# ./run_fmdemo -d cpu -L   ### offlines VF0 core 3
# ./run_fmdemo -d cpu -L   ### offlines VF0 core 4
# ./run_fmdemo -d cpu -L   ### offlines VF0 core 5
# ./run_fmdemo -d cpu -L   ### offlines VF0 core 6
# ./run_fmdemo -d cpu -L   ### offlines VF0 core 7
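Typing the same command eight times gets old, so the invocations above could be wrapped in a small loop. This is only a sketch: run_times is a hypothetical helper name, and it assumes that run_fmdemo returns a non-zero exit status on failure and that each run has finished offlining its core before the next one starts. I haven't verified either assumption against the kit.

```shell
# run_times N CMD [ARGS...]
# Hypothetical helper: run CMD exactly N times, stopping at the first
# failure (assumes CMD signals failure with a non-zero exit status).
run_times() {
    count=$1
    shift
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" || return 1
        i=$((i + 1))
    done
}

# Intended use (live mode, so NOT on a production system):
#   run_times 8 ./run_fmdemo -d cpu -L
```

Stopping at the first failure is deliberate: if one demo run doesn't take, it's better to look at the system state than to keep injecting faults.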

The end result:

# psrinfo |grep faulted | wc -l
64
# psrinfo |grep faulted
0 faulted since 09/15/2008 12:42:03
1 faulted since 09/15/2008 12:42:03
2 faulted since 09/15/2008 12:42:03
...
63 faulted since 09/15/2008 12:42:03
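If you'd rather see every state at once instead of grepping for just one, a tiny awk filter can tally them. A minimal sketch, assuming the processor state is the second whitespace-separated field of each psrinfo line (as in the output above); count_states is a made-up name, not part of Solaris or the demo kit.

```shell
# count_states: tally psrinfo-style lines by processor state.
# Reads lines such as "0 faulted since 09/15/2008 12:42:03" on stdin
# and prints one "state count" pair per line, sorted by state name.
count_states() {
    awk 'NF >= 2 { n[$2]++ } END { for (s in n) print s, n[s] }' | sort
}

# Intended use:
#   psrinfo | count_states
```

On the system described here, that should report the faulted strands on one line and the remaining on-line strands on another.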

Of course, after running the demo kit in live mode, you've got cleanup to do with 'fmadm repair'.

And yes... I'm aware that a plain psradm command can also offline a slew of strands. A small semantic difference is that the psrinfo status would then read off-line instead of faulted.
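For comparison, the psradm route might look something like the sketch below. The strand_ids helper is hypothetical and only builds the list of processor IDs. The psradm call itself is left in a comment, and it assumes the chip's strands are numbered 0 through 63 as in the psrinfo output above.

```shell
# strand_ids FIRST LAST
# Hypothetical helper: print the processor IDs FIRST..LAST on one line,
# separated by spaces, ready to hand to psradm.
strand_ids() {
    id=$1
    last=$2
    ids=""
    while [ "$id" -le "$last" ]; do
        ids="$ids${ids:+ }$id"
        id=$((id + 1))
    done
    printf '%s\n' "$ids"
}

# Intended use (assumes strands 0-63 belong to the first chip):
#   psradm -f $(strand_ids 0 63)    # take the strands off-line
#   psradm -n $(strand_ids 0 63)    # bring them back on-line
```

No FMA involvement here at all, which is exactly why psrinfo says off-line rather than faulted.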

:wq
