One of the areas of focus within Oracle has been CPU and memory scaling of Linux virtual machines. This effort involves adding and removing CPUs and memory as quickly as possible in order to scale up or down the virtual machine to meet customer demands.
While testing kdump during the scaling, two problems were encountered: kdump activity lagged minutes behind the scaling itself, and a panic during scaling sometimes produced no dump at all. Here are some specifics of the test environment and the problems observed.
The testing was of QEMU-based guest virtual machines, using QEMU QMP scripting to scale the VM from 32GiB base memory to 512GiB with virtual DIMMs. On RHEL7 (as this investigation happened circa 2020), top was used within the guest to monitor activity. There was no other notable work running on the system.
Scaling the guest routinely needed just over two minutes (144s), but top within the guest revealed that kexec was active for almost 6.5 minutes (389s)! Said differently, ramping up to 512GiB required about two minutes, but then several more minutes were needed for the kdump processing to catch up and complete!
Furthermore, intentionally forcing a panic during the scaling revealed that kdump often did not occur, instead resulting in an undesirable hang:
[ end Kernel panic - not syncing: sysrq triggered crash ]
And of course at this point the system needed a reset.
This behavior was obviously undesirable, and needed to be addressed.
What was happening with kdump during CPU and memory scaling? An understanding of how kdump generally works is a good place to start.
The kdump service consists of three stages: loading the kdump image, waiting in the armed state, and capturing the vmcore upon panic.
Before examining each stage, let’s explain the terms ‘current kernel’ and ‘kdump kernel’.
The current kernel is the kernel that is running just prior to the panic, eg. your everyday use kernel.
The kdump (ie. capture) kernel is the kernel that is loaded (typically once at boot-time), and then later runs when triggered by a panic in order to create the vmcore dump.
In short, the current kernel is used to load the kdump capture kernel. And the current kernel is the one to experience the panic, necessitating the jump to the kdump kernel.
The kdump kernel is one component of a kdump image, though ‘kdump image’ and ‘kdump kernel’ are used somewhat interchangeably.
To load a kdump image, memory must be reserved and a kdump capture kernel placed in it.
Memory for kdump must be reserved at boot time with the crashkernel= kernel command line parameter. This memory is then populated with the kdump image, a process usually initiated via systemd, typically also at boot. The kdump service invokes the kdumpctl start command to load the kdump capture kernel. This command in turn invokes the kexec-tools package's kexec userspace program to do the actual load.
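On a live system, the reservation can be sanity-checked with a couple of commands (a minimal sketch assuming a Linux host; output varies by distribution and configuration):

```shell
# check the boot-time reservation requested on the kernel command line
reservation=$(grep -o 'crashkernel=[^ ]*' /proc/cmdline || echo "crashkernel=(none)")
echo "cmdline reservation: $reservation"
# once allocated, the reserved region is visible in the physical memory map
region=$(grep -i 'crash kernel' /proc/iomem || echo "(no region reserved)")
echo "reserved region: $region"
```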
There are several parts to a kdump image, and the kexec program collects and/or creates them:
The kernel is the kdump capture kernel. Technically this is a specially prepared kernel, but it is now usually the same file as the current kernel, since the kernel is usually compiled in a manner suitable for use in both scenarios (position independent code).
The initrd is a minimal ramdisk image specifically for the kdump capture kernel. The initrd contains the essential components needed for creating the vmcore, ie. drivers and makedumpfile.
The cmdline component contains the capture kernel command line arguments, which are usually similar to the current kernel command line, but with parameters to simplify the capture kernel run-time, for example single core, no NUMA, no hotplug, etc. However, the special parameter elfcorehdr= is added, which provides a pointer to the elfcorehdr component.
The vmcoreinfo provides information about the current kernel, ie. build info, struct page info, zone info and several other such items. The kexec program obtains a pointer to this information from the current kernel via /sys/kernel/vmcoreinfo and passes it along to the kdump kernel (via an elfcorehdr PT_NOTE).
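The note's location can be inspected directly (a small sketch; the node is present on kernels built with vmcoreinfo support):

```shell
# /sys/kernel/vmcoreinfo reports the physical address and size of the
# vmcoreinfo note that kexec references from the elfcorehdr PT_NOTE
vmci=$(cat /sys/kernel/vmcoreinfo 2>/dev/null || echo "not exposed by this kernel")
echo "vmcoreinfo note: $vmci"
```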
The purgatory component is a bit of transition code to which control is transferred when a panic does occur. The purpose of the purgatory is to perform a checksum/digest over the loaded kdump components to make sure they are unchanged, and if unchanged (ie. not corrupted), jump to the kdump capture kernel.
The boot_params is for x86 kernels, and contains, for example, a pointer to the capture kernel command line. Other architectures might have similar arch-specific components.
The last component, the elfcorehdr, is very important. It is an ELF structure that contains a list of PT_NOTEs describing the CPUs and memory regions in the system. These are the CPUs and memory that will be dumped to vmcore when the kdump capture kernel runs!
The kexec userspace utility is used to load the various components of the kdump image into crashkernel memory. It collects the necessary files and information and invokes a syscall (to the current kernel) to load the image into crashkernel memory.
There are two choices for loading kdump images.
For the kexec_load() syscall (ie. kexec -c), all these components are first loaded into userspace buffers, the syscall is invoked, and the kernel copies the components of the kdump image into the crashkernel memory.
For the kexec_file_load() syscall (ie. kexec -s), the kernel is provided file handles (instead of user space buffers) to the kdump kernel and initrd. The kernel loads the images into kernel buffers, performs authentication, and creates and loads the other components into crashkernel memory itself; ie. the kexec utility has a significantly reduced role.
Going forward, the kexec_file_load() syscall is the preferred method as it permits the kernel to authenticate the kdump image and handle all the setup itself.
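As a sketch of what the loading stage boils down to, a kdumpctl-style invocation might look like the following (the command is echoed rather than executed, since loading requires root; the initrd path and appended parameters are illustrative):

```shell
kver=$(uname -r)
# -p loads a panic (capture) kernel; -s selects the kexec_file_load() syscall
load_cmd="kexec -s -p /boot/vmlinuz-$kver --initrd=/boot/initramfs-${kver}kdump.img --append='nr_cpus=1 irqpoll reset_devices'"
echo "$load_cmd"
```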
Regardless of the load method, the kdump image can be rather sizable; on a test system, the components totaled approximately 40MiB. So loading a kdump image copies approximately 40MiB of data from disk, to buffers, and then into crashkernel memory.
Note, however, that significantly more memory is reserved via the crashkernel= parameter, as the kdump capture kernel needs run-time memory in order to run successfully. For example, it is common these days for at least 256MiB, if not more, to be reserved for kdump crash kernel use.
Once loaded, the kdump image patiently waits until it is needed. This is the “armed” stage.
BOOM!!! Something bad happens, and a panic is triggered, and the kdump image springs into action.
The kdump capture kernel boots, using its initrd to load the needed drivers and utilities, and runs makedumpfile against /proc/vmcore in order to create the vmcore dump. /proc/vmcore is backed by the elfcorehdr structure provided to the kdump kernel via the elfcorehdr= parameter. This is where makedumpfile obtains the list of CPUs and memory to capture in the vmcore!
Once the kdump is complete, the system reboots. And now that a vmcore is available, post-mortem debugging using the crash tool, for example, can help reveal the nature of the problem.
With the general kdump service understood, let’s return to finding explanations as to why kdump behaved as it did during scaling.
As explained above, the kdump image elfcorehdr component holds the list of CPUs and memory to dump. This data structure is necessary for the creation of an accurate vmcore. So it makes sense that when CPUs or memory change, that the elfcorehdr also must be updated.
The way that is accomplished is via userspace event processing, specifically the udev rule /usr/lib/udev/rules.d/98-kexec.rules. In short, this udev rule operates such that when CPU or memory is added or removed, the action is to invoke kdumpctl restart, which in turn unloads and then reloads the entire kdump image.
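For flavor, the rule amounts to something like the following (an illustrative sketch only, not the verbatim distribution file; the real rule also matches other actions):

```
SUBSYSTEM=="memory", ACTION=="add|remove", RUN+="/usr/bin/kdumpctl restart"
SUBSYSTEM=="cpu", ACTION=="online|offline", RUN+="/usr/bin/kdumpctl restart"
```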
The strategy works, but why does it take minutes after scaling has completed in order for the kdump unload-then-reload activities to settle down? The answer is in the math.
Linux manages memory as memblocks, which typically are 128MiB in size. So the 480GiB of hotplug memory turns into 3840 128MiB memblocks, and as each memblock is onlined, it creates a udev event. The act of scaling up to 512GiB thus created 3840 udev events, and 98-kexec.rules in turn responded by unloading-then-reloading the kdump image for each of the 3840 memblocks! That causes about 40MiB * 3840 ≈ 150GiB of copying! That is an awful lot of (mostly unnecessary) work to update just the elfcorehdr data structure!
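The arithmetic can be checked directly; note that 40MiB per reload, over 3840 memblocks, works out to roughly 150GiB:

```shell
# back-of-envelope figures from the scenario above
hotplug_gib=480       # 512GiB target minus 32GiB base memory
memblock_mib=128      # typical memblock size
image_mib=40          # approximate kdump image size
memblocks=$(( hotplug_gib * 1024 / memblock_mib ))
copied_gib=$(( memblocks * image_mib / 1024 ))
echo "$memblocks memblocks, ~${copied_gib}GiB copied"
```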
So the first behavior is understood; what about the second behavior, where kdump sometimes does not occur during scaling?
A little instrumentation revealed that kdump did not occur because there was no kdump image loaded at the time of panic. Specifically, the struct kimage *kexec_crash_image pointer was NULL! But how could that be?
As just discussed, every time udev's 98-kexec.rules responds to a memblock event, it unloads and then reloads the kdump image. So the act of scaling to 512GiB, in this scenario, opened 3840 race windows in which, if a panic occurred while the kdump image was unloaded (and before it was completely reloaded), no kdump would be possible! The race window is not small either, as it is the amount of time needed to load a kdump image, which has been shown to involve copying on the order of 40MiB of data.
With these behaviors understood, how can the kdump service be improved to eliminate them?
The fundamental problem behind both bad behaviors appears to be delegating the update of the elfcorehdr to userspace, and specifically the kdumpctl restart technique. The unloading-then-reloading of a 40MiB kdump image, just to update the (small) elfcorehdr, takes too long and exposes a large number of race windows in which kdump can fail.
As it turns out, this problem was recognized, and RHEL8/OL8 introduced kdump-udev-throttler in 98-kexec.rules to aggregate multiple events together. While this helped, the fundamental problem still exists and can still lead to undesirable behavior. For example, kdump-udev-throttler reduces the number of events by about one third, but that still means there are over 1000 race windows, in this scenario, in which a panic could occur and kdump fail to run.
So the conclusion becomes that the kernel, not userspace, must handle updates to the elfcorehdr for CPU and memory hot un/plug.
Oracle developed a complete solution to overcome these undesirable behaviors. The solution required changes to three areas: the kernel itself, the udev rules, and the kexec-tools userspace.
With Linux 6.6, the kernel now updates the elfcorehdr on CPU and memory hot un/plug (and on/offline) changes. To utilize, select the CONFIG_CRASH_HOTPLUG kernel configuration option. The kernel crash subsystem/infrastructure receives notifications of CPU and memory hot un/plug (and on/offline) events, and updates the elfcorehdr itself.
These CPU and memory hot un/plug activities always generate udev events. The 98-kexec.rules file therefore must be changed to ignore such events when the kernel supports handling them internally. This support is indicated by the /sys/devices/system/cpu/crash_hotplug and /sys/devices/system/memory/crash_hotplug nodes. If the contents of a node is 1, then the kernel is internally handling the events and 98-kexec.rules can ignore them. If a node is missing or 0, then 98-kexec.rules must still handle the events (and the problem behaviors described previously still exist).
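A quick way to see which case applies on a running system (a sketch using the sysfs paths above; on kernels without this support the nodes are simply absent):

```shell
# probe the sysfs nodes that advertise kernel-side crash hotplug handling
checked=0
for node in /sys/devices/system/cpu/crash_hotplug \
            /sys/devices/system/memory/crash_hotplug; do
    if [ "$(cat "$node" 2>/dev/null)" = "1" ]; then
        echo "$node: 1 (kernel updates the elfcorehdr itself)"
    else
        echo "$node: missing or 0 (userspace must handle events)"
    fi
    checked=$((checked + 1))
done
```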
The kexec-tools package has been updated so that kdump images loaded via the legacy kexec_load() syscall can also take advantage of the kernel directly updating the elfcorehdr. To utilize this, the /sbin/kdumpctl script needs to change standard_kexec_args to include -c --hotplug in order to properly set up the kdump image for kernel crash hotplug handling. However, Oracle Linux defaults to the kexec_file_load() syscall, which inherently does the right thing when CONFIG_CRASH_HOTPLUG is in effect; this change exists only to cover the legacy kexec_load() syscall.
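Concretely, the tweak amounts to something like this inside /sbin/kdumpctl (a sketch; the variable name comes from the article, the surrounding script is not shown):

```shell
# was: standard_kexec_args="-p"
# adding -c forces the legacy kexec_load() syscall; --hotplug marks the
# loaded image so the kernel may rewrite the elfcorehdr on hotplug events
standard_kexec_args="-p -c --hotplug"
echo "$standard_kexec_args"
```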
The combination of all three changes yields a system in which the kernel directly handles updates to the elfcorehdr with no assistance or activity from userspace. This results in nearly instantaneous updates to the elfcorehdr. Since the update happens as fast as the CPU and memory hot un/plug events themselves, the race window is nearly eliminated, shrinking to just the time needed to rewrite the elfcorehdr.
These pieces are in:
6f991cc363a3 "crash: move a few code bits to setup support of crash hotplug"
247262756121 "crash: add generic infrastructure for crash hotplug support"
f7cc804a9fd4 "kexec: exclude elfcorehdr from the segment digest"
88a6f8994421 "crash: memory and CPU hotplug sysfs attributes"
ea53ad9cf73b "x86/crash: add x86 crash hotplug support"
a72bbec70da2 "crash: hotplug support for kexec_load()"
a396d0f81b1c "crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()"
543cd4c5e78b "x86/crash: optimize CPU changes"
e2a8f20dd8e9 "Crash: add lock to serialize crash hotplug handling"

c36d3e8b2e99 "kexec: define KEXEC_UPDATE_ELFCOREHDR"
d6cfd2984844 "crashdump: introduce the hotplug command line options"
75ac71fd94ff "crashdump: setup general hotplug support"
a56376080a93 "crashdump: exclude elfcorehdr segment from digest for hotplug"
d59d17f37239 "crashdump/x86: identify elfcorehdr segment for hotplug"
118b567ce74a "crashdump/x86: set the elfcorehdr segment size for hotplug"