A Makeover for 'fmadm faulty'

I recently upgraded some of my systems to Nevada build 77 and, among a lot of other cool things outside of FMA, got to see the makeover given to 'fmadm faulty'. The changes were introduced in build 76 via 6484879...but hey, it's been a busy couple of weeks so I'm behind the times.

So what's the big deal? Why do I care? The short answer is fewer commands to see what's going on. Before this change, the output looked like this:

# fmadm faulty
   STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
degraded mem:///unum=MB/CMP0/BR0:CH0/D0/J1001
         cd72a0c9-2c5d-e458-d866-f0d8d80ad0bb
-------- ----------------------------------------------------------------------

Kinda cryptic. No mention of the FRU or the http://sun.com/msg/ message code...just the affected FMRI. To get this info, I'd need to run another command, 'fmdump -v -u <uuid>', such as:

# fmdump -v -u cd72a0c9-2c5d-e458-d866-f0d8d80ad0bb
TIME                 UUID                                 SUNW-MSG-ID
Sep 26 14:07:33.7174 cd72a0c9-2c5d-e458-d866-f0d8d80ad0bb SUN4V-8000-E2
  95%  fault.memory.bank

        Problem in: mem:///unum=MB/CMP0/BR0:CH0/D0/J1001
           Affects: mem:///unum=MB/CMP0/BR0:CH0/D0/J1001
               FRU: hc://:serial=22ab471:part//motherboard=0/chip=0/branch=0/dram-channel=0/dimm=0
          Location: MB/CMP0/BR0:CH0/D0/J1001

Better...although I still don't know the immediate impact to my system. That information is printed to the console and /var/adm/messages. So it takes a bit of poking around to get all of the information.
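The two-step lookup above is easy enough to script. Here's a minimal sketch, assuming a POSIX awk (nawk or /usr/xpg4/bin/awk on Solaris). The function name fault_uuids is a hypothetical helper of my own, not part of FMA. It just pulls anything shaped like a UUID out of whatever it reads on stdin, so it doesn't depend on the exact layout of the 'fmadm faulty' output.

```shell
# fault_uuids: hypothetical helper, not an FMA command.
# Reads text on stdin and prints each distinct UUID-shaped token once,
# in the order first seen.
fault_uuids() {
  awk '{
    for (i = 1; i <= NF; i++) {
      f = $i
      g = f
      n = gsub(/-/, "", g)          # n = number of hyphens in the token
      # A UUID is 36 characters: 32 hex digits plus hyphens at 9, 14, 19, 24.
      if (length(f) == 36 && n == 4 && g ~ /^[0-9a-f]+$/ &&
          substr(f, 9, 1) == "-" && substr(f, 14, 1) == "-" &&
          substr(f, 19, 1) == "-" && substr(f, 24, 1) == "-" &&
          !seen[f]++)
        print f
    }
  }'
}

# Intended use on a system with FMA (left as a comment so the sketch
# can be sourced anywhere):
#   fmadm faulty | fault_uuids | while read -r uuid; do
#     fmdump -v -u "$uuid"
#   done
```

The hyphen count is there so that a 36-character run of dashes in a table border isn't mistaken for a UUID.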

With the new output:

# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 14 20:47:15 91ae3cb5-0d3b-ce3e-ee1a-bf764a0c8e99 SUN4V-8000-E2  Critical

Fault class : fault.memory.bank 95%
Affects     : mem:///unum=MB/CMP0/BR0:CH0/D0/J1001
                  degraded but still in service
FRU         : hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0721BBB013:server-id=wgs48-163:serial=d2155d2f//motherboard=0/chip=0/branch=0/dram-channel=0/dimm=0 95%

Description : The number of errors associated with this memory module has
              exceeded acceptable levels. Refer to
              http://sun.com/msg/SUN4V-8000-E2 for more information.

Response    : Pages of memory associated with this memory module are being
              removed from service as errors are reported.

Impact      : Total system memory capacity will be reduced as pages are
              retired.

Action      : Schedule a repair procedure to replace the affected memory
              module. Use fmdump -v -u to identify the module.

Much nicer. The severity and message code are immediately available. And the details on impact to the system are right here...no need to go to the console or message logs.

One thing I've noticed that's been omitted from the new output is the Location field. It's still available with fmdump, and I always found that field the most useful. You tell me, but if you're in the field and want to identify the exact FRU, the NAC name is a lot more readable than the fully qualified hc scheme...particularly for IO.
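If you miss the Location field as much as I do, it can be fished back out of the fmdump output. A minimal sketch, assuming the 'Location:' label appears as shown in the 'fmdump -v' output earlier and that the NAC name contains no spaces. The function name fault_locations is hypothetical, not an FMA command.

```shell
# fault_locations: hypothetical helper, not an FMA command.
# Reads 'fmdump -v' output on stdin and prints the value of each
# "Location:" line. Assumes the location name has no embedded spaces.
fault_locations() {
  awk '$1 == "Location:" { print $2 }'
}

# Intended use on a system with FMA (left as a comment so the sketch
# can be sourced anywhere); <uuid> is the EVENT-ID from 'fmadm faulty':
#   fmdump -v -u <uuid> | fault_locations
```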

:wq
