Contract Kills on Memory UEs
By user9148476 on May 12, 2009
However, until now there has been no notification of which process was killed. That changed today with the putback of 6676374 to snv_116. The most pertinent part of the code change from the user perspective is:
    uprintf("Killed process %d (%s) in contract id %d "
        "due to hardware error\n", p->p_pid, p->p_user.u_comm, ct_id);
Now, on a contract kill you'll get the process id, name, and contract id logged in /var/adm/messages.
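If you want to pick these messages out of a log programmatically, here is a minimal sketch in C that matches the message text from the format string above. The struct ctkill type and the parse_ctkill() function are hypothetical names of mine, not part of Solaris, and the code assumes the message appears somewhere on a single log line after whatever prefix the logger adds. It also assumes the command name contains no ')' character.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one contract-kill message (not a Solaris type). */
struct ctkill {
    int  pid;       /* process id reported in the message */
    char comm[64];  /* command name reported in the message */
    int  ct_id;     /* contract id reported in the message */
};

/*
 * Parse one log line. Returns 1 and fills *out if the line contains a
 * complete contract-kill message, 0 otherwise. On a 0 return the
 * contents of *out are unspecified.
 */
int
parse_ctkill(const char *line, struct ctkill *out)
{
    const char *p;
    int end = 0;

    assert(line != NULL && out != NULL);

    /* Skip any logger prefix (timestamp, host, and so on). */
    p = strstr(line, "Killed process ");
    if (p == NULL)
        return (0);

    /*
     * %63[^)] reads the command name up to the closing parenthesis.
     * %n is only stored if the trailing literal text matched too, so
     * end stays 0 when the tail of the message is different.
     */
    if (sscanf(p, "Killed process %d (%63[^)]) in contract id %d "
        "due to hardware error%n",
        &out->pid, out->comm, &out->ct_id, &end) != 3)
        return (0);

    return (end != 0);
}
```

A caller would read the log one line at a time (for example with fgets()) and call parse_ctkill() on each line, reporting the pid, command name, and contract id whenever it returns 1.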
For a single UE, this could be viewed as (important) supplemental information, since FMA will loudly message a DIMM fault. However, if subsequent UEs occur in the same DIMM or set of DIMMs previously diagnosed, and the offending DIMM(s) have not been replaced, FMA does not loudly message them. The reason is that the FMA subsystem recognizes that the subsequent errors are against already-faulted FMRIs, and FMD does not re-message such faults. Without this fix, those additional contract kills would be silent.
Oh...and if you're wondering about x86 platforms, Solaris has the same capability there to contract kill user processes instead of panicking. But in the x86 world, operating systems are not typically given the chance to react to memory UEs. The industry norm is that the FW/BIOS pulls reset on a memory UE.
UPDATE: This same change was putback to the Solaris 10 gates today (07/08/2009). It will be part of Solaris 10 Update 8 (S10U8).