Contract Kills on Memory UEs

On SPARC systems, when a memory uncorrectable error (UE) occurs, Solaris determines whether the affected page is in user space or kernel space. If it is in user space, only the affected user process is killed. We term this a contract kill, and it's way better than panicking the entire OS instance. Also, if the affected process is registered with SMF, it will be restarted automatically. And naturally, FMA will diagnose and message the UE, as well as retire the offending page.

However, until now there has been no notification of which process was killed. That changed today with the putback of 6676374 to snv_116. The most pertinent part of the code change from the user perspective is:

uprintf("Killed process %d (%s) in contract id %d "
    "due to hardware error\n", p->p_pid, p->p_user.u_comm, ct_id);

Now, on a contract kill you'll get the process id, name, and contract id logged in /var/adm/messages.
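If you want to pick these messages out of a log programmatically, here is a minimal sketch in C. It assumes the message text matches the format string in the putback exactly. The struct ct_kill type and the parse_ct_kill() function are hypothetical names made up for this illustration and are not part of Solaris. Whatever syslog prefix precedes the message is skipped rather than parsed, since its exact layout is not covered here.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the three fields in a contract-kill message. */
struct ct_kill {
	int	pid;		/* process id of the killed process */
	char	comm[64];	/* process name as printed in the message */
	int	ct_id;		/* contract id */
};

/*
 * Parse one log line. Returns 1 and fills *out if the line contains a
 * complete contract-kill message, 0 otherwise. Assumes the process name
 * contains no ')' character, since that is used as the delimiter.
 */
int
parse_ct_kill(const char *line, struct ct_kill *out)
{
	/* Skip any timestamp/host prefix by locating the message itself. */
	const char *msg = strstr(line, "Killed process ");
	int end = 0;

	if (msg == NULL)
		return (0);

	/* %n records how far we got; it stays 0 if the tail does not match. */
	if (sscanf(msg,
	    "Killed process %d (%63[^)]) in contract id %d "
	    "due to hardware error%n",
	    &out->pid, out->comm, &out->ct_id, &end) != 3)
		return (0);

	return (end > 0);
}
```

Feed it each line read from the log. A return of 1 means pid, comm, and ct_id are filled in, and you can then correlate the contract id with the affected service.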

For a single UE, this could be viewed as (important) supplemental information, since FMA will loudly message a DIMM fault. However, if subsequent UEs occur in the same DIMM or set of DIMMs previously diagnosed, and the offending DIMM(s) have not been replaced, the subsequent UEs are not loudly messaged by FMA. The reason is that the FMA subsystem recognizes that the subsequent errors are against already-faulted FMRIs, and FMD does not re-message such faults. Without this fix, those additional contract kills are silent.

Oh, and if you're wondering about x86 platforms: Solaris has the same capability to contract kill user processes instead of panicking. But in the x86 world, operating systems are not typically given the chance to react to memory UEs. The industry norm is that the FW/BIOS pulls reset on a memory UE.

:wq

UPDATE: This same change was putback to the Solaris 10 gates today (07/08/2009). It will be part of Solaris 10 Update 8 (S10U8).
