Solaris FMA Poised For Newer Nehalem Chips

6852259 Add FMA support for new members of Nehalem family

This CR went into OpenSolaris today. Solaris FMA will be ready to diagnose faults on Intel's Westmere and Jasper Forest processors when they hit the streets.

:wq

Comments:

Scott,

Are there any Nehalem processors *not* supported by Solaris/FMA?

I'm seeing FMA issues with Solaris 10u7 running on a Dell R610 (X5570 CPUs).

The fault manager service hangs hard during boot (right after opening /dev/mc/mc1) and never comes online. The service manager cannot stop or restart the service because the fmd process itself has become unresponsive and unkillable.

I'm seeing this same behavior on all my R610s - I currently have 3 on the floor and 22 more on the way.

I have found that moving the 'i86pc-hc-topology.xml' file out of the way allows the service to run. I'm not sure of the impact of that move, though.

Troy McIntosh
Texas Instruments

Posted by Troy McIntosh on September 28, 2009 at 03:00 AM PDT #

Troy,

In S10, all Nehalem EPs are supported (including the Core i7), so the X5570 is expected to work. Support for the future processors noted in this blog hasn't been backported to S10 yet.

I'd heard of an "FMA problem" on an R610, and I'd worked with some of the Sun folks onsite at Dell to chase it, but didn't find an issue with FMD starting, nor any other issue for that matter. As for your workaround, I'm surprised the FMD service even starts. The system will end up with an empty topology, essentially rendering the diagnosis engines useless.

If you haven't already, I'd suggest opening a service call with Sun or Dell on the issue. Beyond the description above, also include 'pstack' output of the FMD processes (usually the higher-numbered pid is more interesting). If you want to tinker further and can get the FMD service disabled, then after a fresh boot run '/usr/lib/fm/fmd/fmd -o fg=true -o debug=all -o client.debug=true -o client.error=abort' as root from / and include any output in the service call.
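One way the pstack collection might be scripted is sketched below. It is only an illustration: the function name collect_fmd_stacks and the output directory are hypothetical, and it assumes pgrep and pstack are on the PATH, as they are on a stock Solaris install.

```shell
#!/bin/sh
# Sketch: save 'pstack' output for every running fmd process, one file per pid.
# collect_fmd_stacks and the output directory are hypothetical names.

collect_fmd_stacks() {
    out_dir=$1
    mkdir -p "$out_dir" || return 1

    # pgrep -x matches processes whose name is exactly "fmd".
    # If none are running, the loop body simply never executes.
    for pid in $(pgrep -x fmd 2>/dev/null); do
        pstack "$pid" > "$out_dir/pstack.$pid" 2>&1
    done
}
```

Run as root, for example 'collect_fmd_stacks /var/tmp/fmd-stacks', then attach the resulting files to the service call.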

Hope this is in some way remotely helpful.

Posted by Scott Davenport on September 28, 2009 at 04:55 AM PDT #

Thanks for that information, Scott.

I was wondering what the FMD service was running with when it didn't have a map file.

I do have cases open with both Sun & Dell - not much happening on either side I'm afraid. We're working the escalation ladders.

I followed your suggestion of disabling the service and running it manually after a reboot. It hangs loading modules:

# /usr/lib/fm/fmd/fmd -o fg=true -o debug=all -o client.debug=true -o client.error=abort
fmd: [ loading modules ...

At that point it becomes unkillable (no ctrl-c, no kill -9). I tried truss'ing that same command and I see the same behaviour. It hangs right after opening /dev/mc/mc1 - no output after the (apparently successful) open. Here's the last line from the truss.

1252: 0.1461 open("/dev/mc/mc1", O_RDONLY) = 5

Troy McIntosh
Texas Instruments

Posted by Troy McIntosh on September 28, 2009 at 06:53 AM PDT #

Sorry...I only just noticed my last comment never got posted. The hang is likely in the area where the chip enumerator fetches an nvlist from the intel_nhm driver for memory topology. One more thing you could try (more info for the service call) is to run 'TOPOCHIPDBG=1 /usr/lib/fm/fmd/fmtopo -d' as root. You can do this whether the FMD service is enabled or disabled.

Posted by Scott Davenport on September 30, 2009 at 02:32 AM PDT #

Scott,

Yep, that's exactly what I'm seeing from the stack trace. Looks like it's spinning in inhm_rank() or inhm_vrank() while trying to create the nvlist.

We have a conference call set up today between Sun, Dell and TI. Hopefully this will help point us towards a solution.

> ffffffff88aa7548::walk thread |::findstack
stack pointer for thread ffffffff922b2ba0: fffffe8000d6d8b0
[ fffffe8000d6d8b0 _resume_from_idle+0xf8() ]
fffffe8000d6d9e0 strcmp+0x24()
fffffe8000d6da50 nvlist_add_common+0x28c()
fffffe8000d6da70 nvlist_add_uint64+0x1f()
fffffe8000d6db40 inhm_vrank+0x66()
fffffe8000d6dc00 inhm_rank+0x12f()
fffffe8000d6dca0 inhm_dimm+0x140()
fffffe8000d6dd10 inhm_dimmlist+0x16d()
fffffe8000d6dd30 inhm_create_nvl+0x7f()
fffffe8000d6dd80 inhm_mc_ioctl+0xa5()
fffffe8000d6dd90 cdev_ioctl+0x1d()
fffffe8000d6ddb0 spec_ioctl+0x50()
fffffe8000d6dde0 fop_ioctl+0x25()
fffffe8000d6dec0 ioctl+0xac()
fffffe8000d6df10 _sys_sysenter_post_swapgs+0x14b()
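To make a trace like the one above easier to compare across machines, the frame names can be pulled out with a small awk filter. This is a rough sketch, not part of mdb: stack_functions is a hypothetical helper, and it assumes the ::findstack output has been saved as plain text in the layout shown above.

```shell
#!/bin/sh
# Sketch: list the function names in saved '::findstack' output,
# innermost frame first, with the +0x... offsets stripped.
# stack_functions is a hypothetical helper name.

stack_functions() {
    awk '
        /^\[/ { next }                       # skip the bracketed pseudo-frame
        $2 ~ /\+0x[0-9a-f]+\(\)$/ {          # lines like "<addr> func+0x24()"
            sub(/\+0x[0-9a-f]+\(\)$/, "", $2)
            print $2
        }
    ' "$@"
}
```

With the trace saved to a file (say findstack.txt, a name picked for illustration), 'stack_functions findstack.txt' lists strcmp first and _sys_sysenter_post_swapgs last, which shows the path from the ioctl down through inhm_mc_ioctl, inhm_rank and inhm_vrank at a glance.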

The fmtopo command you suggested hangs too. I'll have to look deeper at it & see if it points out anything new.

# TOPOCHIPDBG=1 /usr/lib/fm/fmd/fmtopo -d
libtopo DEBUG: chip: initializing chip enumerator
libtopo DEBUG: chip: cpu_core_create: node bind failed
libtopo DEBUG: chip: cpu_core_create: node bind failed
libtopo DEBUG: chip: cpu_core_create: node bind failed
libtopo DEBUG: chip: cpu_core_create: node bind failed

Troy McIntosh
Texas Instruments

Posted by Troy McIntosh on September 30, 2009 at 02:47 AM PDT #
