There is a new sysctl in Unbreakable Enterprise Kernel (UEK) 8 for handling memory correctable errors (CEs) that belongs in every Linux system administrator’s toolbox.
The enable_soft_offline sysctl was introduced in Linux 6.11. This blog covers a brief history of memory CE handling and how the enable_soft_offline sysctl might be wisely deployed.
To disable the Linux kernel’s soft-offline capability:
# echo 0 > /proc/sys/vm/enable_soft_offline
To enable the Linux kernel’s soft-offline capability:
# echo 1 > /proc/sys/vm/enable_soft_offline
By default, enable_soft_offline is 1.
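Before changing the setting, it can help to check whether the running kernel exposes the knob at all. The following is a minimal sketch: the function name soft_offline_state is made up for illustration, and the optional path argument exists only so the function can be exercised against an ordinary file.

```shell
# Report the current soft-offline setting.
# The optional argument overrides the knob path (handy for testing).
soft_offline_state() {
    knob="${1:-/proc/sys/vm/enable_soft_offline}"
    if [ ! -r "$knob" ]; then
        # The kernel does not expose the knob (for example, older than 6.11).
        echo "unsupported"
        return 0
    fi
    case "$(cat "$knob")" in
        0) echo "disabled" ;;
        1) echo "enabled" ;;
        *) echo "unknown" ;;
    esac
}

soft_offline_state
```

The same knob is reachable as vm.enable_soft_offline through the sysctl command, so a line such as "vm.enable_soft_offline = 0" in a file under /etc/sysctl.d/ should make the choice persistent across reboots.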
What is Memory CE?
A memory CE can be viewed as a single “stuck bit” in the DIMM. The bit stays stuck in a specific state and will result in future read errors. With ECC DIMMs this error can be corrected, hence the name “correctable error”. It is not immediately a fatal problem. But if another nearby bit gets corrupted for some reason, this could develop into an uncorrectable 2-bit error. In addition, every access to the stuck bit generates a corrected error report interrupt, consuming system resources. So the reasonable next step is to take down a page that has repeatedly manifested CEs and ensure that it will never be used again during the kernel’s life cycle.
Memory CE logging and handling
APEI (ACPI Platform Error Interface) Firmware
CPER (Common Platform Error Record) is the hardware error record format that the various APEI tables use to describe platform hardware errors. A CPER record includes information such as the source, type, and severity of an error. When a memory CE triggers an interrupt in a GHES (Generic Hardware Error Source) device, the device driver scans the CPER data. According to Appendix N of UEFI Specification version 2.11 (released in 2024), when bit 3 of the ‘flags’ field in the N.5 table record is set (Linux names this bit CPER_SEC_ERROR_THRESHOLD_EXCEEDED), “OS may choose to discontinue use of this resource”.
So, when the memory CE threshold is crossed, the Linux platform device driver ghes asks the kernel to soft-offline the page.
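The flag test described above is a plain bit mask. As an illustration only (the real check lives in the kernel’s ghes driver, not in a shell script), the sketch below tests bit 3 of a few sample flags values. The function name threshold_exceeded and the sample values are hypothetical.

```shell
# Bit 3 (0x8) of the CPER section 'flags' field: error threshold exceeded.
CPER_SEC_ERROR_THRESHOLD_EXCEEDED=$((0x0008))

# Succeeds (exit status 0) when the threshold bit is set in the given value.
threshold_exceeded() {
    [ $(( $1 & CPER_SEC_ERROR_THRESHOLD_EXCEEDED )) -ne 0 ]
}

# Hypothetical flags values, for demonstration.
for flags in 0x1 0x8 0x9; do
    if threshold_exceeded "$flags"; then
        echo "$flags: threshold exceeded, page is a soft-offline candidate"
    else
        echo "$flags: below threshold"
    fi
done
```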
soft-offline page operation
Soft-offline page is a Linux kernel operation. It migrates the content from the source page to a newly allocated clean page and, upon success, places the source page on a bad page list to ensure it will never be used again.
Soft-offline can be invoked via madvise(MADV_SOFT_OFFLINE), or through sysfs:
echo <phys_addr> > /sys/devices/system/memory/soft_offline_page
Soft-offline assumes the source page is accessible. It does not kill the process involved, hence “soft”, and it can fail, leaving the presumably CE-plagued page in circulation. Soft-offline is available only if the kernel is built with CONFIG_MEMORY_FAILURE enabled.
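To see which page a given physical address falls into before offlining anything, a dry run can be useful. The sketch below only prints; it never writes to the sysfs file. The function name soft_offline_dry_run is hypothetical, and the 4K default page size is an assumption that holds for x86 base pages.

```shell
# Show the page that contains a physical address and the command that
# would soft-offline it. Dry run only: nothing is written to sysfs.
soft_offline_dry_run() {
    addr=$(( $1 ))
    page_size=${2:-4096}                 # assumption: 4K base pages (x86)
    page_base=$(( addr - addr % page_size ))
    printf 'page base 0x%x, pfn 0x%x\n' "$page_base" "$(( addr / page_size ))"
    printf 'echo 0x%x > /sys/devices/system/memory/soft_offline_page\n' "$addr"
}

soft_offline_dry_run 0x4063e31780
```

Running the printed echo command as root performs the actual soft-offline, so it is worth double-checking the address first.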
mcelog daemon
mcelog is a userspace RAS (Reliability, Availability, and Serviceability) tool that gives system administrators some knobs to control memory error logging and the memory error threshold at which a soft- or hard-offline operation is triggered via sysfs. The knobs reside in /etc/mcelog/mcelog.conf; please refer to the mcelog documentation for more information.
It’s worth noting that the error threshold defined in mcelog has no effect on the CPER records, as one is userspace defined and the other is firmware predefined.
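To review how mcelog is currently configured for corrected errors, the relevant settings can be pulled out of the config file. This sketch assumes the settings are named memory-ce-* and sit in a [page] section, as in the sample mcelog.conf shipped with mcelog; please verify against your installed version. The function name mcelog_ce_settings is hypothetical.

```shell
# Print the memory-ce-* settings from the [page] section of mcelog.conf.
# The optional argument overrides the config path (handy for testing).
mcelog_ce_settings() {
    conf="${1:-/etc/mcelog/mcelog.conf}"
    [ -r "$conf" ] || { echo "cannot read $conf" >&2; return 1; }
    awk '
        # Track whether the current section is [page].
        /^\[/ { in_page = ($0 ~ /^\[page\]/) }
        # Print active memory-ce-* lines, dropping trailing comments.
        in_page && /^[ \t]*memory-ce-/ { sub(/[ \t]*#.*/, ""); print }
    ' "$conf"
}

mcelog_ce_settings || true
```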
Why add enable_soft_offline sysctl?
As explained above, memory CEs trigger interrupts, and overly frequent interrupts are disruptive to your system. Besides, there is an increasing risk of a CE becoming an uncorrectable error when a neighboring bit becomes stuck too.
Even so, soft-offlining a page incurs its own overhead from page migration.
If the page is a transparent hugepage (THP), the THP must first be split into a group of small pages (4K in size on x86); only the affected small page is then migrated, leaving the owner process with one less THP.
If the page is a hugetlb page, the entire hugetlb page is replaced with a newly and temporarily allocated hugetlb page. Hence not only is more CPU time spent copying the data (compared to copying just one small page in the THP case), but the hugetlb page pool also shrinks by 1, because once the temporarily allocated hugetlb page is freed, it is dissolved into small pages rather than returned to the hugetlb pool.
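One way to observe these side effects is to compare the hugepage counters in /proc/meminfo before and after a soft-offline event. The sketch below prints the THP usage (AnonHugePages, in kB) and the hugetlb pool counters (HugePages_Total and HugePages_Free, in pages). The function name hugepage_counters is hypothetical.

```shell
# Print the THP and hugetlb pool counters from a meminfo-style file.
# The optional argument overrides the path (handy for testing).
hugepage_counters() {
    meminfo="${1:-/proc/meminfo}"
    [ -r "$meminfo" ] || { echo "cannot read $meminfo" >&2; return 1; }
    awk '/^(AnonHugePages|HugePages_Total|HugePages_Free):/ { print $1, $2 }' "$meminfo"
}

hugepage_counters || true
```

A HugePages_Total value that drops by one after a hugetlb page is soft-offlined would match the pool shrinkage described above.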
In addition, the author of the patch series expressed doubts about the theory of CEs turning into uncorrectable errors (UEs), though that’s up for debate.
It was chiefly out of concern for the latency incurred by soft-offline, together with a lighter view of the harm CEs could bring, that the enable_soft_offline sysctl switch was added. By default it is set to 1, which leaves soft-offline enabled, so that the system’s default behavior remains unchanged.
A word on deploying enable_soft_offline sysctl
On mission-critical server systems, it is probably not wise to disable soft-offline without diligently monitoring the memory CEs. Where CEs are many and frequent, the accumulated cost of the system resources consumed and the latency incurred in serving the interrupts might outweigh the overhead of soft-offlining the CE-infected page.
Runtime CE events can be monitored with mcelog. Overall CE events, such as those spanning multiple reboots, can be viewed from the ILOM.
For example, on a production-level X9-2, the memory CEs happen to be concentrated on the same socket/DIMM/rank/bank:
# fmdump -e -c ereport.cpu.intel.mem_ce
TIMESTAMP EREPORT
2024-11-09/20:44:40 ereport.cpu.intel.mem_ce@/SYS/MB/P1/D2/R1
[.. all of them on /SYS/MB/P1/D2/R1 ..]
2025-02-24/18:08:18 ereport.cpu.intel.mem_ce@/SYS/MB/P1/D2/R1
# fmdump -e -c ereport.cpu.intel.mem_ce | grep "/SYS/MB/P1/D2/R1" | wc -l
86
# fmdump -e -c ereport.cpu.intel.mem_ce -V | egrep 'bank number | IA32_MCi_ADDR'
bank number = 0x1e
IA32_MCi_ADDR = 0x4063e31780
bank number = 0x1e
IA32_MCi_ADDR = 0x4063e1d480
bank number = 0x1e
IA32_MCi_ADDR = 0x4063e35e00
bank number = 0x1e
IA32_MCi_ADDR = 0x4063e35e00
bank number = 0x1e
IA32_MCi_ADDR = 0x4063e31780
[.. all of them on the same bank_number ..]
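Repeated addresses in a listing like the one above can be tallied to see whether the errors keep hitting the same locations. This is a minimal sketch that reads "IA32_MCi_ADDR = <address>" lines on standard input, in the format shown above; the function name tally_ce_addresses is hypothetical, and the sample input is the five records from the listing.

```shell
# Count how often each IA32_MCi_ADDR value appears on standard input,
# most frequent first.
tally_ce_addresses() {
    awk '$1 == "IA32_MCi_ADDR" { count[$3]++ }
         END { for (a in count) print count[a], a }' | sort -k1,1nr -k2,2
}

# Sample input taken from the listing above.
tally_ce_addresses <<'EOF'
bank number = 0x1e
IA32_MCi_ADDR = 0x4063e31780
bank number = 0x1e
IA32_MCi_ADDR = 0x4063e1d480
bank number = 0x1e
IA32_MCi_ADDR = 0x4063e35e00
bank number = 0x1e
IA32_MCi_ADDR = 0x4063e35e00
bank number = 0x1e
IA32_MCi_ADDR = 0x4063e31780
EOF
```

On a live system the input would come from the fmdump -V command shown earlier, piped into the function.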
That said, it might be worthwhile to turn off the Linux kernel’s soft-offline capability until a time-sensitive task completes.
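One possible way to do that is to wrap the task so the previous setting is restored when it finishes. The sketch below is an assumption about how such a wrapper might look, not an established tool: the function name with_soft_offline_disabled and the SOFT_OFFLINE_KNOB override variable are hypothetical, and it must run as root to write the knob.

```shell
# Run a command with soft-offline disabled, then restore the old setting.
# SOFT_OFFLINE_KNOB (hypothetical) overrides the knob path for testing.
with_soft_offline_disabled() {
    knob="${SOFT_OFFLINE_KNOB:-/proc/sys/vm/enable_soft_offline}"
    if [ ! -w "$knob" ]; then
        echo "cannot write $knob; running the task unchanged" >&2
        "$@"
        return $?
    fi
    saved=$(cat "$knob")        # remember the current setting
    echo 0 > "$knob"            # disable soft-offline
    "$@"                        # run the time-sensitive task
    status=$?
    echo "$saved" > "$knob"     # restore the previous setting
    return "$status"
}

# Example (as root; ./time_sensitive_job is a placeholder):
#   with_soft_offline_disabled ./time_sensitive_job
```

The wrapper does not restore the setting if the shell itself is killed mid-task, so a trap or a follow-up check of the knob would be a sensible addition.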