This blog entry explores a little-known feature called pvpanic, a paravirtualized device emulated by QEMU that the guest OS uses to report a panic/crash to the VMM. It allows a guest OS kernel to signal the hypervisor when it panics, before any crash dump is collected.
We will describe the behavior and settings required for a Linux guest. Windows guests show similar behavior in the majority of scenarios, but a separate blog entry will be dedicated to discussing the work needed to accomplish this.
If you are a cloud operator (or run a particularly cool home lab), you will occasionally experience issues connecting to the Virtual Machines in your fleet. This can be caused by problems in your infrastructure like a hypervisor crash or network configuration issue, or it could be due to a guest OS level issue like a hang or a kernel panic. We can typically figure out exactly what happened by inspecting logs and checking the status of various services, but it might take some time before either the customer or the operator notices that something has gone wrong.
A cloud operator using QEMU will usually have a monitoring service on the hypervisor listening for QMP events. If a pvpanic event is received, this provides a clear and immediate signal that there is a problem originating at the guest OS level: the guest kernel has experienced a panic. Depending on the event received, it also tells us whether or not a crash dump will be attempted.
From a fleet management perspective, pvpanic provides us with the ability to categorize and root cause unresponsive instances. It helps determine when the instance is down due to a guest level issue, as opposed to an infrastructure (e.g. hypervisor crash) problem, and that information can be used to update metrics and/or alert a customer that their instance has suffered a guest level issue and requires corrective action.
It is important to keep in mind that pvpanic functionality in Linux is limited to notifying the VMM (QEMU in this case) that a panic has occurred in the guest; no additional data is provided to the VMM. This is why pvpanic is normally used in conjunction with the -no-shutdown option in QEMU, or the equivalent -action shutdown=pause, which ensures that the QEMU process does not terminate and makes it possible to collect relevant debugging data like the register state or a memory dump for further analysis.
There are two versions of the pvpanic device: the original, which is implemented as an emulated ISA device (-device pvpanic), and a more recent one that is implemented as a PCI device (-device pvpanic-pci). The PCI version was required in order to use pvpanic on Arm instances, which do not have support for ISA devices.
The examples shown here were captured in an aarch64 guest launched on an OCI Ampere A1-2c Bare Metal instance, and therefore use the -device pvpanic-pci form. The QMP events generated are the same, regardless of the interface that is emulated by the device.
If you are an advanced user who just needs to deploy pvpanic in a hurry, here is a summary that shows how the various parameters determine the pvpanic QMP event that is emitted when the guest kernel panics:
Most Linux distributions enable crash dump functionality by default, and support for the CRASHLOADED event has been available since Linux 5.6.
Use QEMU parameters:
-device pvpanic-pci
-qmp unix:/tmp/qmp.sock,server,wait=no
-action shutdown=pause,panic=none
Linux guest settings:
CONFIG_PVPANIC=y CONFIG_PVPANIC_PCI=m
crash_kexec_post_notifiers kernel command line parameter.
kdump service is enabled and running.
For this example we use a Linux guest running OL8.5, which enables the kdump service by default, and the UEK6-U3 kernel (5.4.17-2136.304.4.5.el8uek.aarch64) built with the required KCONFIG options. With the above settings, when a kernel panic is triggered in the guest, we will receive the GUEST_CRASHLOADED QMP event:
{"timestamp": {"seconds": 1648072351, "microseconds": 626053}, "event": "GUEST_CRASHLOADED", "data": {"action": "run"}}
After this, the guest will kexec into a secondary kernel and collect a crash dump. Once the crash dump is collected, the guest reboots back to its default kernel, issuing a RESET event:
{"timestamp": {"seconds": 1648072358, "microseconds": 10547}, "event": "RESET", "data": {"guest": true, "reason": "guest-reset"}}
If the kdump service is not running, or we have booted an older guest kernel that does not support the GUEST_CRASHLOADED event, we instead receive a GUEST_PANICKED event:
{"timestamp": {"seconds": 1648067630, "microseconds": 165464}, "event": "GUEST_PANICKED", "data": {"action": "run"}}
In this case the guest does not automatically reboot, but the QEMU process does not terminate, giving us a chance to collect debugging data.
The remainder of this blog entry elaborates on the recommended settings above and provides more detailed examples.
The QEMU command line parameters that are required in order to use and detect the pvpanic events are:
-device pvpanic-pci
-qmp unix:/tmp/qmp.sock,server,wait=no
-qmp-pretty to display server responses in pretty-printed JSON formatting.
Recommended:
-action shutdown=pause,panic=none
-action shutdown=pause, equivalent to -no-shutdown, requests that QEMU does not immediately terminate when the guest shuts down.
-action panic=none
CRASHLOADED event, and Windows guests using the hv-crash enlightenment. In these cases, the default QEMU action is to pause the guest, whereas this option allows the guest to proceed to capture a crash dump and automatically reboot without intervention by a management layer.
Given the requirements above, a basic QEMU command looks like the following:
qemu-system-aarch64 \
    -nodefaults \
    -device pvpanic-pci \
    -action shutdown=pause,panic=none \
    -machine virt,gic-version=3 \
    -cpu host \
    -accel kvm \
    -smp 2 \
    -m 4G \
    -drive file=/usr/share/AAVMF/AAVMF_CODE.pure-efi.fd,if=pflash,unit=0,format=raw,readonly=on \
    -drive file=/usr/share/AAVMF/AAVMF_VARS.pure-efi.fd,if=pflash,unit=1,format=raw \
    -hda /home/pvpanic-example/ol8.qcow2 \
    -qmp unix:/tmp/qmp.sock,server,wait=no \
    -serial mon:stdio
There are various methods for connecting to a Unix socket. Here we show an example using the nc (netcat) utility:
# nc -U /tmp/qmp.sock
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 2, "major": 6}, "package": "v6.2.0"}, "capabilities": ["oob"]}}
{"execute": "qmp_capabilities"}
{"return": {}}
Note that we need to issue the qmp_capabilities command to exit the capability negotiation mode and enter command mode. Otherwise no QMP events will be received.
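The same handshake can be scripted instead of typed into nc. The following is a minimal Python sketch, not a production monitor: the function names (connect, qmp_handshake, qmp_events) are made up for this example, and it assumes the plain -qmp option, where QEMU writes each JSON message on a single line (it would not work with -qmp-pretty).

```python
import json
import socket


def connect(path="/tmp/qmp.sock"):
    """Open the QMP Unix socket and return it as a text stream."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(path)
    return sock.makefile("rw", encoding="utf-8", newline="\n")


def qmp_handshake(stream):
    """Read the greeting, send qmp_capabilities, and wait for the reply.

    `stream` is any text object providing readline(), write() and flush().
    Returns the greeting so the caller can inspect the QEMU version.
    """
    greeting = json.loads(stream.readline())
    if "QMP" not in greeting:
        raise RuntimeError("unexpected QMP greeting: %r" % (greeting,))
    stream.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
    stream.flush()
    while True:
        line = stream.readline()
        if not line:
            raise RuntimeError("QMP connection closed during negotiation")
        reply = json.loads(line)
        if "return" in reply:
            return greeting
        if "error" in reply:
            raise RuntimeError("qmp_capabilities failed: %r" % (reply,))
        # Anything else is ignored until the reply arrives.


def qmp_events(stream):
    """Yield every asynchronous event (a dict with an "event" key)."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        if "event" in message:
            yield message
```

A monitoring script would call connect(), pass the result to qmp_handshake(), and then loop over qmp_events() to print or forward each event name as it arrives.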
A kernel panic can be triggered in multiple ways. Since our UEK6-U3 guest kernel enables the Magic SysRq key option (CONFIG_MAGIC_SYSRQ), we use this method.
See the SysRq documentation for details.
Triggering a panic looks like this:
$ echo 1 > /proc/sys/kernel/sysrq
$ echo c > /proc/sysrq-trigger
[ 625.468565] sysrq: Trigger a crash
[ 625.469186] Kernel panic - not syncing: sysrq triggered crash
[ 625.470103] CPU: 1 PID: 1985 Comm: bash Not tainted 5.4.17-2136.304.4.5.el8uek.aarch64 #2
[ 625.471452] Hardware name: QEMU KVM Virtual Machine, BIOS 1.4.1 12/03/2020
[ 625.472636] Call trace:
[ 625.473094] dump_backtrace+0x0/0x18c
[ 625.473768] show_stack+0x24/0x30
[ 625.474392] dump_stack+0xbc/0xe0
[ 625.475003] panic+0x15c/0x368
[ 625.475620] sysrq_handle_crash+0x1c/0x1c
[ 625.476358] __handle_sysrq+0x88/0x17c
[ 625.477043] write_sysrq_trigger+0x10c/0x174
[ 625.477822] proc_reg_write+0x88/0xe8
[ 625.478511] __vfs_write+0x48/0x8c
[ 625.479136] vfs_write+0xb8/0x1d8
[ 625.479743] ksys_write+0x74/0xf8
[ 625.480349] __arm64_sys_write+0x24/0x30
[ 625.481065] el0_svc_common+0xbc/0x19c
[ 625.481749] el0_svc_handler+0x38/0x88
[ 625.482442] el0_svc+0x8/0xc
[ 625.482974] SMP: stopping secondary CPUs
For modern versions of QEMU and guest kernels, the type of QMP event emitted when a guest panic occurs is generally determined by two factors:
Whether or not a crash dump mechanism (kdump) is active in the guest.
If kdump is active, the use of the crash_kexec_post_notifiers kernel parameter in the guest.
Depending on the combination of the factors above, QEMU might emit a GUEST_PANICKED, GUEST_CRASHLOADED, or no event at all. See the following sections for an explanation of the possible scenarios.
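The combinations described in this post can be condensed into a small lookup. This is only a sketch of the behavior explained here, written in Python; the function name and its third argument (whether the guest kernel supports the CRASHLOADED event) are introduced for illustration.

```python
def expected_pvpanic_event(kdump_active, post_notifiers,
                           crashloaded_supported=True):
    """Return the pvpanic QMP event expected on a guest kernel panic.

    kdump_active:          a crash kernel is loaded in the guest
    post_notifiers:        crash_kexec_post_notifiers is set in the guest
    crashloaded_supported: the guest kernel can report CRASHLOADED
    Returns None when no pvpanic event is expected.
    """
    if not kdump_active:
        # No crash dump will be attempted; the panic is simply reported.
        return "GUEST_PANICKED"
    if not post_notifiers:
        # The guest kexecs into the capture kernel before the panic
        # notifiers run, so only the later RESET event is seen.
        return None
    if crashloaded_supported:
        return "GUEST_CRASHLOADED"
    # Older kernels only know how to report a plain panic.
    return "GUEST_PANICKED"
```

For instance, expected_pvpanic_event(True, False) returns None, which corresponds to the case where only a RESET event is observed.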
The GUEST_PANICKED QMP event is emitted by QEMU to indicate that the guest OS kernel panicked and a kernel crash dump mechanism (kdump) was not active. Linux kernels older than the 5.6 release only supported this event, so this would be the event sent even if kdump is active, as long as crash_kexec_post_notifiers is also set in the guest kernel command line parameters.
The other special case is Windows guests using the hv-crash enlightenment, but we'll discuss that in a separate blog entry.
Let's see an example:
First we stop the kdump service on our guest:
[root@localhost ~]# systemctl stop kdump
[root@localhost ~]# systemctl status kdump
● kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Tue 2022-03-22 17:45:06 EDT; 4s ago
  Process: 1607 ExecStop=/usr/bin/kdumpctl stop (code=exited, status=0/SUCCESS)
  Process: 1089 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCESS)
 Main PID: 1089 (code=exited, status=0/SUCCESS)
[...]
And trigger a panic using SysRq as described earlier.
[ 30.613584] sysrq: Trigger a crash
[ 30.614245] Kernel panic - not syncing: sysrq triggered crash
[ 30.615193] CPU: 1 PID: 1457 Comm: bash Not tainted 5.4.17-2136.304.4.5.el8uek.aarch64 #2
[ 30.616516] Hardware name: QEMU KVM Virtual Machine, BIOS 1.4.1 12/03/2020
[ 30.617689] Call trace:
[ 30.618148] dump_backtrace+0x0/0x18c
[ 30.618822] show_stack+0x24/0x30
[ 30.619431] dump_stack+0xbc/0xe0
[ 30.620041] panic+0x15c/0x368
[ 30.620604] sysrq_handle_crash+0x1c/0x1c
[ 30.621339] __handle_sysrq+0x88/0x17c
[ 30.622041] write_sysrq_trigger+0x10c/0x174
[ 30.622828] proc_reg_write+0x88/0xe8
[ 30.623500] __vfs_write+0x48/0x8c
[ 30.624123] vfs_write+0xb8/0x1d8
[ 30.624730] ksys_write+0x74/0xf8
[ 30.625338] __arm64_sys_write+0x24/0x30
[ 30.626026] el0_svc_common+0xbc/0x19c
[ 30.626673] el0_svc_handler+0x38/0x88
[ 30.627360] el0_svc+0x8/0xc
[ 30.627889] SMP: stopping secondary CPUs
[ 30.628864] Kernel Offset: disabled
[ 30.629522] CPU features: 0x00002,20802008
[ 30.630257] Memory Limit: none
[ 30.630822] ---[ end Kernel panic - not syncing: sysrq triggered crash ]---
Entering the QEMU monitor and checking the status:
QEMU 6.2.0 monitor - type 'help' for more information
(qemu) info status
VM status: running
we can see that the VM is still running. This is because we are using -action shutdown=pause,panic=none in the QEMU parameters. The QMP event received is:
{"timestamp": {"seconds": 1648067630, "microseconds": 165464}, "event": "GUEST_PANICKED", "data": {"action": "run"}}
After this event is received, a system_reset command can be issued on the monitor or via QMP to reboot the guest.
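As a rough illustration of the QMP side, the sketch below (Python, with a made-up function name) maps one received QMP line to the command a management script could send back. It assumes one JSON message per line, and it is deliberately simplistic: resetting immediately gives up the chance to collect a register dump or memory dump first, so a real policy would likely gather that data before sending the command.

```python
import json


def response_for_event(line):
    """Return the QMP command line to send in reaction to `line`, or None.

    Only GUEST_PANICKED triggers a reaction: the guest is not going to
    reboot by itself, so we ask QEMU to reset it.
    """
    message = json.loads(line)
    if message.get("event") == "GUEST_PANICKED":
        return json.dumps({"execute": "system_reset"}) + "\n"
    return None
```

The returned string can be written to the same QMP socket on which the event was received.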
The GUEST_CRASHLOADED event indicates that the guest kernel has hit a panic but will handle it by itself, i.e. the kdump service is running on the guest.
This event is emitted when the guest kernel has panicked, kdump is active, and the crash_kexec_post_notifiers kernel parameter is used by the guest kernel. Let's explain how these two settings interact.
The majority of Linux distributions enable the kdump service, provided by the kexec-tools package in OL8. kdump is the Linux kernel crash-dump mechanism. In the event of a system crash, kdump uses kexec to boot into a second kernel loaded in memory reserved for this purpose, and captures the contents of the crashed kernel's memory (vmcore) to aid in determining the cause of the crash. After the crash dump has been collected, the guest will reboot using the default kernel.
For more information on kexec, see this blog entry.
The crash_kexec_post_notifiers kernel parameter is used to request that the callbacks registered with the panic notifier chain are called before kdump kexecs into the capture kernel.
A notifier chain is a mechanism provided by the Linux kernel to allow a subsystem to be notified when an event occurs in another subsystem, and provide a callback function to invoke.
During initialization, the pvpanic module registers a callback with the panic_notifier_list notifier chain. When the kernel panics, the value of crash_kexec_post_notifiers determines whether or not the callbacks registered in the panic_notifier_list are invoked before kexec'ing into the capture kernel which will collect the crash dump. The panic() code in kernel/panic.c is extensively commented and worth a look if you are interested in the details.
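To make the ordering concrete, here is a toy Python model of the decision described above. It is not kernel code: the function names and step strings are invented for illustration, and it assumes, in line with the events shown in this post, that the pvpanic callback reports a crash-loaded event when a capture kernel is loaded and a plain panic otherwise.

```python
def pvpanic_notifier(crash_kernel_loaded):
    """Stand-in for the callback the pvpanic module registers."""
    event = "CRASH_LOADED" if crash_kernel_loaded else "PANICKED"
    return "pvpanic: " + event


def simulate_panic(crash_kernel_loaded, crash_kexec_post_notifiers,
                   notifiers=(pvpanic_notifier,)):
    """Return the ordered list of steps taken by this toy panic path."""
    steps = []

    def crash_kexec():
        # Jumping to the capture kernel never returns in a real kernel;
        # here we report whether the jump happened.
        if crash_kernel_loaded:
            steps.append("kexec into capture kernel")
        return crash_kernel_loaded

    # Default order: try the crash kernel first, skipping the notifiers.
    if not crash_kexec_post_notifiers and crash_kexec():
        return steps

    # Run the panic notifier chain (this is where pvpanic signals QEMU).
    for notifier in notifiers:
        steps.append(notifier(crash_kernel_loaded))

    # With crash_kexec_post_notifiers, the crash kernel runs afterwards.
    if crash_kexec_post_notifiers and crash_kexec():
        return steps

    steps.append("remain in panic()")
    return steps
```

Calling simulate_panic(True, False) yields only the kexec step, which is why no pvpanic event reaches QEMU in that configuration.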
Therefore, if we want the pvpanic signal to be sent to QEMU when kdump is enabled, it is necessary to use the crash_kexec_post_notifiers kernel parameter.
NOTE: The crash_kexec_post_notifiers parameter can also be set at runtime using:
echo Y > /sys/module/kernel/parameters/crash_kexec_post_notifiers
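Before triggering a test panic, it can be handy to confirm both guest-side settings from inside the guest. The Python sketch below reads the parameter file shown above together with /sys/kernel/kexec_crash_loaded, a sysfs file that is not mentioned elsewhere in this post and is assumed here to report whether a capture kernel is loaded. The function names are hypothetical.

```python
from pathlib import Path


def read_flag(path):
    """Return True/False for a sysfs boolean file, or None if unreadable."""
    try:
        text = Path(path).read_text().strip()
    except OSError:
        return None
    return text in ("Y", "y", "1")


def guest_pvpanic_settings(
    post_notifiers_path="/sys/module/kernel/parameters/crash_kexec_post_notifiers",
    crash_loaded_path="/sys/kernel/kexec_crash_loaded",
):
    """Collect the two guest settings that influence the QMP event."""
    return {
        "crash_kexec_post_notifiers": read_flag(post_notifiers_path),
        "crash_kernel_loaded": read_flag(crash_loaded_path),
    }
```

Running guest_pvpanic_settings() with no arguments inside the guest returns a dictionary with both values; the paths are parameters only so the helper can be exercised against ordinary files.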
If the guest is booted with crash_kexec_post_notifiers and kdump is active, triggering a panic in the guest produces the following QMP events:
{"timestamp": {"seconds": 1648069594, "microseconds": 946575}, "event": "GUEST_CRASHLOADED", "data": {"action": "run"}}
{"timestamp": {"seconds": 1648069601, "microseconds": 309809}, "event": "RESET", "data": {"guest": true, "reason": "guest-reset"}}
Note that the RESET event is emitted after kdump finishes collecting the crash dump and reboots back to the default kernel.
If kdump is active, but crash_kexec_post_notifiers is not set, the panic notifier list will not be called. The crash dump will be collected and the guest will automatically reboot, with the only event emitted being the standard RESET event:
{"timestamp": {"seconds": 1647987955, "microseconds": 455659}, "event": "RESET", "data": {"guest": true, "reason": "guest-reset"}}
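A monitoring service can turn these event sequences into a per-instance summary. The following Python sketch is one possible approach, with invented names and labels; it assumes one JSON event per line and only considers guest-initiated RESET events. Note that a RESET on its own is ambiguous: it is also what an ordinary guest reboot looks like.

```python
import json


def summarize_events(lines):
    """Group QMP event lines into a list of incident dictionaries."""
    incidents = []
    dumping = None  # incident still waiting for its post-dump RESET
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        message = json.loads(raw)
        event = message.get("event")
        seconds = message.get("timestamp", {}).get("seconds")
        if event == "GUEST_CRASHLOADED":
            dumping = {"kind": "panic_with_dump", "at": seconds,
                       "rebooted_after": None}
            incidents.append(dumping)
        elif event == "GUEST_PANICKED":
            dumping = None
            incidents.append({"kind": "panic_without_dump", "at": seconds})
        elif event == "RESET":
            if not message.get("data", {}).get("guest"):
                continue  # reset requested from the host side
            if dumping is not None:
                # Seconds between the panic and the reboot after the dump.
                dumping["rebooted_after"] = seconds - dumping["at"]
                dumping = None
            else:
                incidents.append({"kind": "reset_only", "at": seconds})
    return incidents
```

Feeding it the GUEST_CRASHLOADED and RESET pair from the previous example produces a single panic_with_dump incident with a rebooted_after value of 7 seconds.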
In this blog entry we explained the basics of what pvpanic is and how it can be useful for quickly determining whether an outage is due to a guest OS kernel panic. At its core, it is a very simple feature, but unfortunately the various interactions with other components like kdump, guest kernel command line parameters, QEMU options, and legacy kernels with incomplete feature sets make for a somewhat convoluted explanation of how to use it.
This blog does not discuss the implementation details of the different backends in QEMU, nor how the Linux driver has evolved over time. Windows guests using the Oracle VirtIO Drivers for Microsoft Windows also have support for pvpanic functionality and emit the same events. These topics will be discussed in future blog posts.