Monday Jul 14, 2008

Laptop and retired I/O devices

FMA and my laptop ....

Last Thursday was not much fun. The “reduction in force”, more appropriately called a lay-off, was a challenge. Good people lost their jobs, and those of us remaining are trying to figure out how to still get our jobs done without some key support folks.

Well, coincident with the RIF, my laptop started misbehaving. Upon reboot on Friday (the day after the RIF) I was greeted with this error message:

NOTICE: One or more of your I/O devices have been retired

Great, now did my laptop get RIF'd?

Actually not. Thankfully!

What happened is that FMA (the Fault Management Architecture) detected too many errors with my on-board ethernet driver and disabled the faulty component. How cool is that?!? The operating system on my laptop detected a faulty component and disabled it so the processor didn't have to keep dealing with the interrupts.

This is way cool, but it was a bit difficult to figure out what was happening.

To save some of you the pain of debugging this condition, here is a quick post on how to determine what's happening and how to fix the condition.

First, let's check where the error message was coming from. A quick search at src.opensolaris.org turned up a link to retire_store.c:241. A quick look around the source there told me the message came from the fault management portion of OpenSolaris.

To see the fault, use fmadm with the faulty sub-command:

$ pfexec fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jul 11 06:30:19 6dc01480-53e3-6046-92c5-b88ea74e17af  PCIEX-8000-0A  Critical 

Fault class : fault.io.pciex.device-interr
Affects     : dev:////pci@0,0/pci8086,27d0@1c/pci1179,1@0
                  faulted and taken out of service
FRU         : "MB" (hc://:product-id=TECRA-M5:chassis-id=Y6071991H:server-id=yoyo/motherboard=0)
                  faulty

Description : A problem was detected for a PCIEX device.
              Refer to http://sun.com/msg/PCIEX-8000-0A for more information.

Response    : One or more device instances may be disabled

Impact      : Loss of services provided by the device instances associated with
              this fault

Action      : Schedule a repair procedure to replace the affected device.  Use
              fmdump -v -u  to identify the device or contact Sun for
              support.

With the device path in hand, I checked whether it corresponded to any known device:

$ ls -l /dev/* | grep 'pci1179,1@0'
lrwxrwxrwx 1 root root 54 2008-04-24 16:32 /dev/e1000g0 -> ../devices/pci@0,0/pci8086,27d0@1c/pci1179,1@0:e1000g0

Check that out ... e1000g0 is the driver for my ethernet!
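
If you would rather not copy the device path out of the fmadm output by hand, something like the following could pull it out for you. This is only a sketch: the function name affected_device_paths is made up for this example, and it assumes the "Affects" line is laid out exactly as in the fmadm faulty output shown earlier.

```shell
# Hypothetical helper (not part of FMA): read `fmadm faulty` output on stdin
# and print the device path from each "Affects" line, with the dev:////
# scheme prefix replaced by a single leading slash.
affected_device_paths() {
    awk '$1 == "Affects" && $2 == ":" {
        path = $3
        sub(/^dev:\/\/\/\//, "/", path)
        print path
    }'
}

# Possible usage on the laptop itself:
#   pfexec fmadm faulty | affected_device_paths
```

The printed path can then be handed to grep in place of the hand-typed pci1179,1@0 fragment.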

But that device node is not available:

$ ls -l /devices/pci@0,0/pci8086,27d0@1c/pci1179,1@0:e1000g0
/usr/gnu/bin/ls: cannot access /devices/pci@0,0/pci8086,27d0@1c/pci1179,1@0:e1000g0: No such device or address

Knowing that this was the device that was not working, I had to correct the fault. Here I used the fmadm command again, this time with the repair sub-command and the path to the device.

$ pfexec fmadm repair dev:////pci@0,0/pci8086,27d0@1c/pci1179,1@0
fmadm: recorded repair to dev:////pci@0,0/pci8086,27d0@1c/pci1179,1@0
$ ls -l /devices/pci@0,0/pci8086,27d0@1c/pci1179,1@0:e1000g0
crw-rw-rw- 1 root root 225, 1 2008-07-14 13:40 /devices/pci@0,0/pci8086,27d0@1c/pci1179,1@0:e1000g0

Now to re-enable the device and replumb my network:

$ pfexec ifconfig -a
lo0: flags=2001000849 mtu 8232 index 1
	inet 127.0.0.1 netmask ff000000 
wpi0: flags=201004843 mtu 1500 index 2
	inet 10.0.231.211 netmask ffff0000 broadcast 10.0.255.255
	ether 0:18:de:6a:9e:5a 
ip.tun0: flags=10008d1 mtu 1402 index 7
	inet tunnel src 10.0.231.211 tunnel dst 192.9.5.100
	tunnel security settings  -->  use 'ipsecconf -ln -i ip.tun0'
	tunnel hop limit 60 
	inet 10.7.251.235 --> 129.146.17.123 netmask ffffffff 
lo0: flags=2002000849 mtu 8252 index 1
	inet6 ::1/128 
ip.tun0: flags=2204851 mtu 1480 index 7
	inet tunnel src 10.0.231.211 tunnel dst 192.9.5.100
	tunnel security settings  -->  use 'ipsecconf -ln -i ip.tun0'
	tunnel hop limit 60 
	inet6 fe80::a07:fbeb/128 --> fe80::1 
ip.tun0:1: flags=2200851 mtu 1480 index 7
	inet6 2002:8192:117b:1::a07:fbeb/128 --> 2002:8192:117b:1::1 
$ pfexec ifconfig e1000g0 plumb
$ svcadm restart nwam
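
The recovery steps above can be collected into one small function. Again, just a sketch: recover_nic and the DRY_RUN variable are names invented for this example, and the commands are the same three used in this post. By default it only prints what it would run, so you can check the arguments first.

```shell
# Hypothetical wrapper around the steps from this post.
#   $1 = faulted device as reported by fmadm, e.g.
#        dev:////pci@0,0/pci8086,27d0@1c/pci1179,1@0
#   $2 = interface name, e.g. e1000g0
# DRY_RUN defaults to 1 (print the commands); set DRY_RUN=0 to execute them.
recover_nic() {
    fmri=$1
    ifname=$2
    if [ "${DRY_RUN:-1}" = 1 ]; then
        run=echo
    else
        run=
    fi
    # Stop at the first step that fails.
    $run pfexec fmadm repair "$fmri" &&
    $run pfexec ifconfig "$ifname" plumb &&
    $run svcadm restart nwam
}

# Possible usage:
#   recover_nic 'dev:////pci@0,0/pci8086,27d0@1c/pci1179,1@0' e1000g0
```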

Everything is up and running again. It's pretty cool to have enterprise features like FMA on my laptop, but being a simple guy it's good to know how to just get up and running again.

Now I'll keep monitoring my ethernet driver and see if it keeps acting up ...

Once again, the power of an enterprise class operating system on my laptop. This is so cool. Thanks to the FMA team for your great work!

Friday Jan 12, 2007

a physical mashup .... OpenSolaris laptop skin

What happens when you combine a great graphic, a great offering, and a little adhesive?

[Photos: OpenSolaris laptop closed; OpenSolaris laptop open]

The skin is a product from skinit.com and the graphics are from the wonderful OpenSolaris team. And yes, the laptop (a Toshiba Tecra M5) is running OpenSolaris (currently build 54), and it supports dual screens (via the native Solaris NVIDIA drivers)!

