Linux Internals: Memory UE Handling and HWpoison Injection

This blog covers userspace memory Uncorrectable Error (UE) handling by the Linux kernel and by a userspace handler, as well as the test strategy on Intel/AMD platforms. Kernel memory UE handling, memory scrubbing methods, and device memory UE are out of scope.

Userspace memory UE handling in Linux Kernel

Suppose that during the lifetime of a mission-critical application process, a DIMM cell suddenly goes bad, and the page where the bad cell resides is mapped into a userspace process. When the process reads the memory location, the UE in the DIMM is consumed, and a series of events follows in the kernel.

Machine Check Exception (MCE)

When a CPU consumes an uncorrectable error, a Machine Check exception is triggered and the CPU traps into the kernel. The Linux kernel MCE handler is then invoked to process the MC event.

The MCE handler checks the MCG_STATUS and MCi_STATUS MSRs to determine the next action: Is the UE address valid? Is the UE address in userspace or kernel space? Is the userspace program restartable? And so on.

Let’s examine the console message from the EDAC (Error Detection And Correction) driver on an X9-2 system:

[10011.077909] EDAC skx MC1: CPU 0: Machine Check Event: 0x0 Bank 18: 0xac00000200a00091
[10011.077911] EDAC skx MC1: TSC 0x142eb51a2263
[10011.077912] EDAC skx MC1: ADDR 0x1cf600800
[10011.077913] EDAC skx MC1: MISC 0x800090053c01086

EDAC skx MC1: CPU 0: Machine Check Event: 0x0 Bank 18: 0xac00000200a00091 means that CPU 0 logged a Machine Check event, reported through EDAC memory controller #1, in machine check bank 18, with MCG_STATUS = 0 and MCi_STATUS = 0xac00000200a00091.

In generic names like MCi_STATUS, ‘i’ refers to the machine check bank number, so from “Bank 18” we know that MCi_STATUS, MCi_MISC and MCi_ADDR refer to MC18_STATUS, MC18_MISC and MC18_ADDR respectively. The “MC1” prefix is the EDAC memory controller index. To be consistent with the Intel documentation, the rest of the blog still uses the generic MSR names.

The global MCG_STATUS = 0, so MCG_STATUS.RIPV (bit 0) is cleared, indicating that the instruction pointed to by the IP register cannot be restarted reliably.

MCi_STATUS = 0xac00000200a00091 means:
```
MCi_STATUS.PCC (bit 57) cleared, indicating processor context is not corrupted.
MCi_STATUS.ADDRV (bit 58) set, indicating the UE address captured in MCi_ADDR is valid.
MCi_STATUS.MISCV (bit 59) set, indicating the content of MCi_MISC is trustworthy.
MCi_STATUS.UC (bit 61) set, indicating an uncorrectable error occurred at the address.
MCi_STATUS.VAL (bit 63) set, indicating the contents of this MCi_STATUS register are valid.
```
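The bit list above can be reproduced with a few lines of Python. This is a minimal sketch: the bit positions are the ones listed above, while the helper names (`decode_mci_status`, `ripv_set`) are made up for illustration.

```python
# Bit positions as listed above; the dictionary and function names are illustrative.
MCI_STATUS_BITS = {"PCC": 57, "ADDRV": 58, "MISCV": 59, "UC": 61, "VAL": 63}
MCG_STATUS_RIPV = 0  # bit 0 of MCG_STATUS


def decode_mci_status(status):
    """Return {flag name: True/False} for the MCi_STATUS bits discussed above."""
    return {name: bool((status >> bit) & 1) for name, bit in MCI_STATUS_BITS.items()}


def ripv_set(mcg_status):
    """True if MCG_STATUS.RIPV is set."""
    return bool((mcg_status >> MCG_STATUS_RIPV) & 1)


# Values taken from the EDAC console message above.
flags = decode_mci_status(0xAC00000200A00091)
for name, value in flags.items():
    print(f"{name:6s} {'set' if value else 'cleared'}")
print(f"{'RIPV':6s} {'set' if ripv_set(0x0) else 'cleared'}")
```

Only PCC and RIPV come out cleared for this event, which matches the reading given in the text.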
EDAC skx MC1: TSC 0x142eb51a2263 is the CPU time stamp counter value.
EDAC skx MC1: ADDR 0x1cf600800 indicates that MCi_ADDR = 0x1cf600800.

EDAC skx MC1: MISC 0x800090053c01086 indicates that MCi_MISC = 0x800090053c01086, which means:

MCi_MISC[0..5] = 6 indicates that the Recoverable Address LSB is 6, hence the UE error radius is (1 << 6) = 64 bytes.
MCi_MISC[6..8] = 2 indicates that the Address mode is 2, that is, the address in MCi_ADDR is a physical address.
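The same arithmetic, written out as a small Python sketch. The field layout is the one described above; the function names are hypothetical, and treating the 64 bytes as the naturally aligned block containing MCi_ADDR is an assumption about how the "radius" is applied.

```python
def decode_mci_misc(misc):
    """Split out the two MCi_MISC fields discussed above."""
    lsb = misc & 0x3F          # bits 0..5: Recoverable Address LSB
    mode = (misc >> 6) & 0x7   # bits 6..8: Address mode
    return lsb, mode


def error_range(addr, lsb):
    """First and last address of the aligned block that contains addr."""
    size = 1 << lsb
    start = addr & ~(size - 1)
    return start, start + size - 1


# MISC and ADDR values from the EDAC console message above.
lsb, mode = decode_mci_misc(0x800090053C01086)
start, end = error_range(0x1CF600800, lsb)
print(f"lsb={lsb} mode={mode} range={start:#x}..{end:#x}")
# prints: lsb=6 mode=2 range=0x1cf600800..0x1cf60083f
```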

The MCE handler concludes that the UE is software recoverable in userspace and the user process should be killed. So the next step is for the MCE handler to invoke memory_failure().

memory_failure() in Linux kernel

memory_failure() is an architecture-independent meeting point where a memory UE detected anywhere eventually gets processed, including device memory UE such as DAX memory UE. It attempts to isolate the corrupted page and, if necessary, kill the process. The rest of the blog uses “M-F handler” as an abbreviation for memory_failure().

First, the M-F handler marks the page as poisoned by setting the PG_hwpoison flag on the page.

Next, it attempts to unmap the page from the kernel side, that is, by removing the page table entry (PTE) that maps it. If the page is a transparent huge page (THP), the THP will be split into base pages prior to unmapping. If the page is a hugetlb page, the entire hugetlb page will be unmapped.

If unmap is successful, any subsequent attempt to access the page again will lead to a page fault. The page fault handler will discover that a poisoned page is behind the virtual address and will deliver a SIGBUS to the user process.

If the unmapped page is clean and free, the M-F handler will remove it from the buddy system if the page is anonymous, or punch the page out of the mapping if the page has a valid file mapping. Consequently, the userspace process will not get a SIGBUS upon re-access; instead, it will get a new page. This is the best recovery scenario.

Loosely speaking, there are four types of outcomes from the M-F handler in terms of the state of the poisoned page.

MF_RECOVERED: this is the best scenario. Unmap was successful and the poisoned page has been isolated, whether the page is clean or dirty. If the page is clean, the user process won’t get a SIGBUS. The console will have a line like Memory failure: <pfn>: recovery action for <page state>: Recovered
MF_DELAYED: this is also a good scenario. The poisoned page has been isolated from the LRU but kept in the swap cache, so that re-accessing it will trigger a page fault and the PF handler will kill the process. The console will have a line like Memory failure: <pfn>: recovery action for <page state>: Delayed
MF_FAILED: in this scenario, the M-F handler was unable to isolate the page from the LRU, or unmap failed. A SIGBUS will be delivered to the user process, with si_addr set to the virtual address backed by the poisoned page. A smart userspace handler might be able to recover by rolling back to the last known good state. The console will have a line like Memory failure: <pfn>: recovery action for <page state>: Failed
MF_IGNORED: this is the scenario where the kernel throws up its hands because it couldn’t do anything, for example when the M-F handler was unable to hold a refcount of the page, or the page state was unclear due to a race condition. This is arguably the worst outcome. The console will have a line like Memory failure: <pfn>: recovery action for <page state>: Ignored
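When running many injection tests, it can help to pull the outcome out of the console log automatically. The sketch below assumes the line format seen in the console samples in this article ("recovery action for <page state>: <result>"); other kernel versions may word the line differently, and `parse_mf_result` is a made-up helper name.

```python
import re

# Format assumed from the console samples in this article.
MF_LINE = re.compile(
    r"Memory failure: (?P<pfn>0x[0-9a-f]+): recovery action for (?P<state>.+): "
    r"(?P<result>Recovered|Delayed|Failed|Ignored)\s*$"
)


def parse_mf_result(line):
    """Return (pfn, page_state, result) for a 'recovery action' line, else None."""
    m = MF_LINE.search(line)
    if m is None:
        return None
    return int(m.group("pfn"), 16), m.group("state"), m.group("result")


sample = "[  192.386018] Memory failure: 0x20db800: recovery action for dirty LRU page: Recovered"
pfn, state, result = parse_mf_result(sample)
print(f"pfn={pfn:#x} state={state!r} result={result}")
```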

The M-F handler holds a mutex to serialize back-to-back invocations of the function. It also holds a refcount of the page the entire time. It does nothing more to block orthogonal kernel operations on the page; rather, it checks and rechecks the page state in order to play along.

The M-F handler functionality is available only if CONFIG_MEMORY_FAILURE is selected. The config switch is selected in Oracle Linux by default.

Memory UE Injection

Although memory UE on production systems is extremely rare, once it occurs, a speedy recovery is critical to the success of Linux in terms of self-repair. It is therefore important to have an effective test suite that covers a wide range of real-world scenarios.

Linux offers three methods for simulating memory UE:

madvise(MADV_HWPOISON)
hwpoison-inject sysctl
APEI (ACPI Platform Error Interface) Error INJection

madvise(MADV_HWPOISON)

This method is a Linux kernel software simulation; see the madvise(2) manpage.

The user process identifies a pagesize aligned virtual address range, and for each page in the given range, the kernel directly invokes memory_failure() as if memory UE in the page has been consumed by the calling process. Here, “pagesize” can be the base page size for base page, or large pagesize for a THP or hugetlb page.
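A minimal sketch of such a call from Python (3.8 or later), using the standard `mmap` module. The function names are hypothetical. Per madvise(2), the call needs CAP_SYS_ADMIN and a kernel built with CONFIG_MEMORY_FAILURE; otherwise it fails and Python raises OSError.

```python
import mmap

# Python 3.8+ exposes mmap.MADV_HWPOISON on Linux; 100 is the value in the Linux headers.
MADV_HWPOISON = getattr(mmap, "MADV_HWPOISON", 100)


def make_test_mapping(pages=1):
    """Anonymous private mapping of `pages` base pages, used as a throwaway target."""
    return mmap.mmap(-1, pages * mmap.PAGESIZE, flags=mmap.MAP_PRIVATE)


def poison_first_page(mapping):
    """Ask the kernel to treat the first base page of `mapping` as hit by a memory UE.

    Only run this on a test system: the page stays poisoned after the call.
    """
    mapping[0:1] = b"x"  # touch the page so that it is actually backed by memory
    mapping.madvise(MADV_HWPOISON, 0, mmap.PAGESIZE)
```

A test would call `poison_first_page(make_test_mapping())` as root. Reading the poisoned page afterwards is expected to end in a SIGBUS, so that access is best done in a child process.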

In a real hardware UE triggered Machine-Check scenario, the M-F handler is called with the MF_ACTION_REQUIRED flag set. The flag tells the M-F handler that if the poisoned page is dirty, or if unmapping the page failed, the current task should be killed by sending it a SIGBUS with si_code = BUS_MCEERR_AR, where _AR stands for Action Required.

Linux also supports a notification kind of SIGBUS, with si_code = BUS_MCEERR_AO, for a process that has elected PR_MCE_KILL_EARLY (see the prctl(2) manpage), where _AO stands for Action Optional. The idea is that a process that shares the poisoned page in its mappings but has not accessed the page can also receive a SIGBUS when another process is caught actively consuming the UE in the page.
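Electing the early-kill policy is a single prctl(2) call. The sketch below does it from Python through `ctypes`; the numeric constants are copied from `<linux/prctl.h>`, and the wrapper names are made up for illustration.

```python
import ctypes
import os

# Values from <linux/prctl.h>.
PR_MCE_KILL = 33
PR_MCE_KILL_GET = 34
PR_MCE_KILL_SET = 1
PR_MCE_KILL_EARLY = 1

_libc = ctypes.CDLL(None, use_errno=True)


def _prctl(option, arg2=0, arg3=0):
    """Call prctl(2); the unused trailing arguments are passed as zero."""
    args = [ctypes.c_ulong(v) for v in (arg2, arg3, 0, 0)]
    ret = _libc.prctl(ctypes.c_int(option), *args)
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return ret


def elect_mce_kill_early():
    """Opt the calling thread in to the early (BUS_MCEERR_AO) SIGBUS notification."""
    _prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY)


def current_mce_kill_policy():
    """Return the policy reported by PR_MCE_KILL_GET."""
    return _prctl(PR_MCE_KILL_GET)
```

After `elect_mce_kill_early()`, `current_mce_kill_policy()` should report `PR_MCE_KILL_EARLY`.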

Unlike in a real Machine-Check scenario, when madvise(MADV_HWPOISON) calls the M-F handler, the MF_ACTION_REQUIRED flag is not set. The M-F handler tries to isolate the poisoned page, but it will not generate any SIGBUS unless the user process has elected PR_MCE_KILL_EARLY (see the prctl(2) manpage). This behavior may change in the future.

After the M-F handler has unmapped the page and returned, if the user process continues to read the poisoned page, a page fault (PF) will occur, and the PF handler will deliver a SIGBUS with BUS_MCEERR_AR. Here is an example of the console message; notice the word fault in the message:

[  192.375986] Injecting memory failure for pfn 0x20db800 at process virtual address 0x7f21c9400000
[  192.386018] Memory failure: 0x20db800: recovery action for dirty LRU page: Recovered
[  192.394685] MCE: Killing shm_poison_test:12864 due to hardware memory corruption fault at 7f21c9400800

hwpoison-inject sysctl

Just like madvise(MADV_HWPOISON), this method is also a Linux kernel simulation, but through a different user interface: the debugfs directory /sys/kernel/debug/hwpoison/.

The M-F handler is invoked in a similar way as with madvise(MADV_HWPOISON), but from the injector, not from the user process under test.

This method requires the kernel module “hwpoison-inject”, which is not loaded by default, hence

# modprobe hwpoison-inject

The test process is required to identify the page frame number (pfn) of the page to be poisoned. This can be achieved by reading /proc/<pid>/pagemap and looking up the valid pfn for a given virtual page address.
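A sketch of that lookup, assuming the documented pagemap layout: one 64-bit entry per virtual page, bit 63 set when the page is present in RAM, and the pfn in bits 0..54. The function names are hypothetical. Note that recent kernels report the pfn as 0 to callers without CAP_SYS_ADMIN.

```python
import mmap
import struct

PAGEMAP_ENTRY_BYTES = 8      # one 64-bit entry per virtual page
PFN_MASK = (1 << 55) - 1     # bits 0..54 hold the pfn when the page is present
PAGE_PRESENT = 1 << 63       # bit 63: page is present in RAM


def decode_pagemap_entry(entry):
    """Return the pfn of a present page, or None if the page is not in RAM."""
    if not entry & PAGE_PRESENT:
        return None
    return entry & PFN_MASK


def vaddr_to_pfn(vaddr, pid="self", page_size=mmap.PAGESIZE):
    """Look up the pfn backing `vaddr` in /proc/<pid>/pagemap."""
    offset = (vaddr // page_size) * PAGEMAP_ENTRY_BYTES
    with open(f"/proc/{pid}/pagemap", "rb", buffering=0) as f:
        f.seek(offset)
        data = f.read(PAGEMAP_ENTRY_BYTES)
    if len(data) != PAGEMAP_ENTRY_BYTES:
        raise ValueError(f"no pagemap entry for {vaddr:#x}")
    (entry,) = struct.unpack("=Q", data)  # native byte order
    return decode_pagemap_entry(entry)
```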

And then, from another terminal, trigger the poison event.

# echo <pfn> > /sys/kernel/debug/hwpoison/corrupt-pfn

After the test is done, as long as the poisoned page has been freed, that is, no one is still holding a reference to the page (which could happen if unmap failed and an orphaned child process still has a reference to the page), one may unpoison the page by:

# echo <pfn> > /sys/kernel/debug/hwpoison/unpoison-pfn

APEI Error INJection

This method is a firmware-level simulation, facilitated by the kernel module ‘einj’. It is only available on the x86 platform. This method is closer to a hardware DIMM UE scenario in that a Machine-Check-like event is triggered and the MCE handler is exercised. The “EDAC” console messages above were produced by this type of injection.

To enable APEI Error INJection, the first step is to enable WHEA Error Injection (the exact wording may vary) in the BIOS. The second step is to make sure that the Linux OS under test is built with the following three config switches selected:

# cat /boot/config-`uname -r` | egrep 'CONFIG_DEBUG_FS=y|CONFIG_ACPI_APEI=y|CONFIG_ACPI_APEI_EINJ=m'
CONFIG_ACPI_APEI=y
CONFIG_ACPI_APEI_EINJ=m
CONFIG_DEBUG_FS=y

which is true in Oracle Linux by default.

Next, load the einj module

# modprobe einj

Check and make sure Uncorrectable non-fatal type is supported:

# cd /sys/kernel/debug/apei/einj
# cat available_error_type
0x00000008      Memory Correctable
0x00000010      Memory Uncorrectable non-fatal
0x00000020      Memory Uncorrectable fatal
0x80000000      Vendor Defined Error Types
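A test harness can make the same check programmatically before attempting an injection. The sketch below parses text in the format shown above; `parse_error_types` and `supports_ue_nonfatal` are illustrative names, and the sample string is the output above.

```python
def parse_error_types(text):
    """Map each error-type value to its description, one 'value name' pair per line."""
    types = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].lower().startswith("0x"):
            types[int(parts[0], 16)] = parts[1].strip()
    return types


def supports_ue_nonfatal(text):
    """True if 'Memory Uncorrectable non-fatal' (0x10) is offered by the platform."""
    return 0x10 in parse_error_types(text)


sample = """\
0x00000008      Memory Correctable
0x00000010      Memory Uncorrectable non-fatal
0x00000020      Memory Uncorrectable fatal
0x80000000      Vendor Defined Error Types
"""
print(supports_ue_nonfatal(sample))  # prints: True
```

On a live system the text would come from reading /sys/kernel/debug/apei/einj/available_error_type.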

Have the test program generate the physical address to be injected, again by searching /proc/<pid>/pagemap for a given vaddr.
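Once the pfn is known, the physical address is the pfn shifted by the page shift plus the offset of the virtual address within its page. A small sketch, assuming 4 KiB base pages; the pfn below is the one from the EINJ console sample in this article, while the virtual address is a hypothetical one chosen to have in-page offset 0x800.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB base pages


def phys_addr(pfn, vaddr):
    """Combine the pfn with the in-page offset of `vaddr`."""
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    return (pfn << PAGE_SHIFT) | offset


paddr = phys_addr(0x1D9E200, 0x7F2963C00800)  # hypothetical vaddr
print(f"{paddr:#x}")  # prints: 0x1d9e200800
```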

Then from a second terminal, type:

# echo 0x10 > /sys/kernel/debug/apei/einj/error_type
# echo <paddr>  > /sys/kernel/debug/apei/einj/param1
# echo 0xffffffffffffffff > /sys/kernel/debug/apei/einj/param2
# echo 2 > /sys/kernel/debug/apei/einj/flags
# echo 1 > /sys/kernel/debug/apei/einj/notrigger
# echo 1 > /sys/kernel/debug/apei/einj/error_inject

This injection method causes memory_failure() to be called without the MF_ACTION_REQUIRED flag set. Hence, just like with madvise(MADV_HWPOISON), no SIGBUS is delivered as a direct result of the injection. A SIGBUS with BUS_MCEERR_AR is delivered by the page fault handler only when the user process subsequently accesses the already unmapped page. Here is a sample console message:

[ 2484.775350] EINJ: Error INJection is initialized.
[ 2530.896276] mce: [Hardware Error]: Machine check events logged
[ 2530.903926] Memory failure: 0x1d9e200: recovery action for dirty LRU page: Recovered
[ 2530.922673] EDAC skx MC0: HANDLING MCE MEMORY ERROR
[ 2530.922674] EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 13: 0xac000002001000c0
[ 2530.922674] EDAC skx MC0: TSC 0x54159cefc5b
[ 2530.922675] EDAC skx MC0: ADDR 0x1d9e200800
[ 2530.922675] EDAC skx MC0: MISC 0x8000c06c7801086
[ 2530.922676] EDAC skx MC0: PROCESSOR 0:0x606a6 TIME 1708129329 SOCKET 0 APIC 0x0
[ 2530.922682] EDAC MC0: 1 UE memory scrubbing error on CPU_SrcID#0_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1d9e200 offset:0x800 grain:32 -  err_code:0x0010:0x00c0  SystemAddress:0x1d9e200800 ProcessorSocketId:0x0 MemoryControllerId:0x0 ChannelAddress:0x3a3c40100 ChannelId:0x0 RankAddress:0x1d1e20100 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0xd8f0 Column:0x20 Bank:0x3 BankGroup:0x0 ChipSelect:0x0 ChipId:0x0)
[ 2537.075554] MCE: Killing shm_poison_test:12739 due to hardware memory corruption fault at 7f2963c00000

Here is another example of the console message, this time with PR_MCE_KILL_EARLY elected prior to the injection: both the parent and the child process receive a SIGBUS with BUS_MCEERR_AO. Notice that the word fault is not in the message:

[ 5249.953194] Injecting memory failure for pfn 0x1466d0 at process virtual address 0x7f35c1884000
[ 5249.963927] Memory failure: 0x1466d0: Sending SIGBUS to shm_poison_test:13632 due to hardware memory corruption
[ 5249.975197] Memory failure: 0x1466d0: Sending SIGBUS to shm_poison_test:13633 due to hardware memory corruption
[ 5249.986464] Memory failure: 0x1466d0: recovery action for dirty LRU page: Recovered

Frequently Asked Questions

Q: The parent process has received a SIGBUS with BUS_MCEERR_AR, but the child process that has elected for PR_MCE_KILL_EARLY and has the same page mapped did not receive a SIGBUS with BUS_MCEERR_AO, why?

A: The most likely reason is that the parent process received a SIGBUS delivered by the page fault handler, not the M-F handler. And the page fault handler by design does not search for processes sharing the same page but not involved in the page fault.

Another, rarer possibility is that the parent process received a SIGBUS due to accessing an already hwpoisoned page. In this case, the M-F handler takes a shortcut path to kill the parent process immediately. This shortcut path does not deal with the BUS_MCEERR_AO scenario. If this is the case, the console messages will look like:

[10238.735295] Memory failure: 0x207400: already hardware poisoned
[10238.750652] Memory failure: 0x207400: Sending SIGBUS to mr_reg_poison:14269 due to hardware memory corruption

Q: Why do I sometimes see a SIGBUS and other times not?

A: First of all, the MF_IGNORED scenarios produce no SIGBUS, though they are rare. Then, in some of the MF_FAILED scenarios, where the M-F handler fails to hold a page refcount because of a race condition, it also produces no SIGBUS. In the race condition case, the M-F handler actually retries once. If it still cannot hold a refcount of the page after the retry, it bails out and hopes for better luck in the future.

It’s worth noting that a memory UE never leads to corrupted data being returned to the user process. Should the page get re-accessed, the UE will be caught again.

Q: Why doesn’t unpoison work?

A: If the kernel has never caught a memory UE since boot, and the page is free, then unpoison should work. But if a hardware memory UE has occurred at any time during the lifetime of the running kernel, unpoison is disabled until the system is rebooted.

Q: Are there any system-wide metrics about memory failure that can be observed?

A: There are. The M-F handler updates per-node statistics whenever any of the MF_IGNORED, MF_FAILED, MF_DELAYED and MF_RECOVERED outcomes is produced:

# tree /sys/devices/system/node/node*/memory_failure/
/sys/devices/system/node/node0/memory_failure/
├── delayed
├── failed
├── ignored
├── recovered
└── total
/sys/devices/system/node/node1/memory_failure/
├── delayed
├── failed
├── ignored
├── recovered
└── total
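These counters are plain text files, so collecting them takes only a few lines. The sketch below assumes the directory layout shown above; the function names are hypothetical, and the base directory is a parameter so the code can be pointed at a copy of the tree.

```python
import glob
import os

FIELDS = ("delayed", "failed", "ignored", "recovered", "total")


def read_mf_stats(sys_node_dir="/sys/devices/system/node"):
    """Return {node name: {counter: value}} for every node exposing memory_failure/."""
    stats = {}
    pattern = os.path.join(sys_node_dir, "node*", "memory_failure")
    for mf_dir in sorted(glob.glob(pattern)):
        node = os.path.basename(os.path.dirname(mf_dir))
        counters = {}
        for field in FIELDS:
            with open(os.path.join(mf_dir, field)) as f:
                counters[field] = int(f.read().strip())
        stats[node] = counters
    return stats


def system_total(stats):
    """Sum the per-node 'total' counters."""
    return sum(c["total"] for c in stats.values())
```

Comparing `read_mf_stats()` before and after an injection shows which outcome the M-F handler recorded and on which node.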

Also, the HardwareCorrupted field in /proc/meminfo reports the estimated total amount of corrupted memory (shown in kB), calculated as:

total = ignored + failed + delayed + recovered   (per node)
HardwareCorrupted = (sum of total over all nodes) * PAGE_SIZE

# cat /proc/meminfo | grep HardwareCorrupted
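The same value can be read from a test program. A minimal sketch, assuming the usual /proc/meminfo line format with a kB unit; the helper names are made up, and the 4096-byte page size is an assumption.

```python
import re


def hardware_corrupted_kb(meminfo_text):
    """Return the HardwareCorrupted value in kB, or None if the field is absent."""
    m = re.search(r"^HardwareCorrupted:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    return int(m.group(1)) if m else None


def read_hardware_corrupted(path="/proc/meminfo"):
    """Read the file and extract the HardwareCorrupted value."""
    with open(path) as f:
        return hardware_corrupted_kb(f.read())


def poisoned_pages(kb, page_size=4096):
    """Convert the kB figure to a number of base pages (4 KiB assumed by default)."""
    return kb * 1024 // page_size
```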

Feel free to take HardwareCorrupted with a grain of salt especially on a test system that has undertaken repeated hwpoison injections to the same set of pages.

References

Intel 64 and IA-32 Architectures Software Developers Manual, Volume 3B: System Programming Guide, Part 2, Chapter 15
hwpoison
APEI Error INJection
madvise(2), prctl(2)

Jane Chu
