In this blog post, Oracle Linux kernel engineers Steve Sistare and Mark Kanda present QEMU Live Update.
The ability to update software with critical bug fixes and security mitigations while minimizing downtime is extremely important to customers and cloud service providers. In this blog post, we present QEMU Live Update, a new method for updating a running QEMU instance to a new version while minimizing the impact to the VM guest. The guest pauses briefly, for less than 100 milliseconds in our prototype, without loss of internal state or external connections.
Live Update uses resources more efficiently than Live Migration. Live Migration ties up both the source and target hosts, consumes more memory and network bandwidth, and does so for an indeterminate period of time that depends on when the copy phase converges. It is also prohibitively expensive if large local storage must be copied across the network to the target.
Live Update preserves the guest state across an exec of a new QEMU binary. It does so by leveraging QEMU's live migration vmstate framework. We enhance QEMU's existing functionality for saving and restoring VM state to allow a guest to be quickly suspended and resumed.
The guest RAM is preserved across the exec and mapped at the same virtual address via a proposed madvise option called MADV_DOEXEC. This option preserves the physical pages and virtual mappings of a memory range, and works for MAP_ANON memory. Briefly, madvise sets a flag in each vma struct covering the range, and exec copies the flagged VMAs from the old mm struct to the new mm, much as fork does. See the patch for details.
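To make the calling pattern concrete, here is a minimal sketch of how a process might tag an anonymous mapping before exec. MADV_DOEXEC is only a proposed option, so it is not defined in mainline kernel headers; the sketch compiles the call in only when the constant exists and otherwise reports ENOSYS. The function names alloc_guest_ram and preserve_across_exec are hypothetical and do not come from the QEMU patches.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS and madvise under strict -std= modes */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: map anonymous memory to stand in for guest RAM.
 * Returns NULL on failure. */
void *alloc_guest_ram(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: ask the kernel to keep [addr, addr+len) mapped
 * across exec. Returns 0 on success, -1 on failure. On kernels and
 * headers without the proposed patch, MADV_DOEXEC is undefined, so this
 * fails with ENOSYS instead of guessing a value for the constant. */
int preserve_across_exec(void *addr, size_t len)
{
#ifdef MADV_DOEXEC
    return madvise(addr, len, MADV_DOEXEC);
#else
    (void)addr;
    (void)len;
    errno = ENOSYS;
    return -1;
#endif
}
```

On a stock kernel the second call always fails, which is the expected outcome until the madvise patch is applied.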
The live update sequence consists of updating the QEMU binary, pausing the guest, saving the VM state, exec'ing the new QEMU binary, restoring the VM state, and resuming the guest.
This implementation requires changes to QEMU and the Linux memory management framework, but no changes are required in system libraries or the KVM kernel module.
Two new QEMU QMP/HMP commands are utilized: cprsave and cprload.
cprsave pauses the guest to prevent further modifications to guest RAM and block devices and saves the VM state to a file. Unlike the existing savevm command, cprsave supports any type of guest image and block device. cprsave has two modes of operation: restart, for updating QEMU only, and reboot, for updating and rebooting the host kernel. Reboot is discussed later. With cprsave restart, the address and length of the RAM blocks are saved as environment variables and the RAM is tagged with the MADV_DOEXEC option to preserve it across the exec. Finally, the new QEMU binary is exec'd with the original command line arguments. After exec, QEMU reads the environment variables to find the RAM blocks, rather than allocating memory as it normally would.
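The environment variable handoff described above can be sketched as follows. This is an illustration only: the variable naming scheme (CPR_RAM_ followed by the block name), the "address:length" hexadecimal encoding, and both function names are assumptions made for this example, not the format used by the QEMU patches.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical: record a RAM block's address and length in the
 * environment so that the exec'd binary can find it.
 * Returns 0 on success, -1 on failure. */
int save_ram_block_env(const char *name, const void *addr, size_t len)
{
    char key[128], val[64];

    if (snprintf(key, sizeof key, "CPR_RAM_%s", name) >= (int)sizeof key)
        return -1;                      /* block name too long */
    snprintf(val, sizeof val, "%" PRIxPTR ":%zx", (uintptr_t)addr, len);
    return setenv(key, val, 1);
}

/* Hypothetical: after exec, look up a RAM block instead of allocating
 * new memory. Returns 0 and fills *addr and *len if the variable exists
 * and parses cleanly, -1 otherwise. */
int load_ram_block_env(const char *name, void **addr, size_t *len)
{
    char key[128];
    uintptr_t a;
    size_t l;
    const char *val;

    if (snprintf(key, sizeof key, "CPR_RAM_%s", name) >= (int)sizeof key)
        return -1;
    val = getenv(key);
    if (val == NULL)
        return -1;
    if (sscanf(val, "%" SCNxPTR ":%zx", &a, &l) != 2)
        return -1;
    *addr = (void *)a;
    *len = l;
    return 0;
}
```

The recorded address is only meaningful after exec if the mapping itself survives at the same virtual address, which is what MADV_DOEXEC is proposed to guarantee.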
cprload recreates the VM using the file produced by cprsave. Guest block devices are used as-is, so the contents of such devices must not be modified between the cprsave and cprload operations. If the VM was running when cprsave was executed, the VM execution will be resumed.
External connections, such as the guest console, QMP connections, and vhost devices, are preserved across the update. Upon cprsave, the associated file descriptors' close-on-exec flags are cleared and the descriptors are saved as environment variables. Upon restart, QEMU finds the file descriptor environment variables, reuses them by associating them with the corresponding devices, and skips the related configuration steps.
VFIO PCI devices are preserved in a similar manner. At creation time, the QEMU VFIO file descriptors (container, group, device, eventfd) are saved as environment variables. Upon cprsave, the vmstate MSI message area is saved, and all preserved file descriptors' close-on-exec flags are cleared. Upon restart, QEMU finds the file descriptor environment variables, reuses them, and skips the related configuration steps for the preserved areas (such as device and IOMMU state). Finally, upon cprload, the MSI data is loaded from the file, the preserved irq eventfds are attached to the new KVM instance, and the guest is resumed.
The hardware device itself is not quiesced during the restart, and pending DMA requests will continue to execute, reading from and writing to guest memory. This is safe because MADV_DOEXEC preserves the guest memory in place.
The following is an example of updating QEMU from v4.2.0 to v4.2.1 on Oracle Linux 7 using the HMP version of cprsave restart. A QEMU software update is performed while the guest is running to minimize downtime.
Many critical fixes can be applied by updating only QEMU, or by ksplice'ing the host kernel and its kvm module. However, if you need to completely update the host kernel, we provide a method for doing so, using cprsave with the reboot mode argument. In this mode, cprsave saves state to a file and exits. You then kexec boot a new kernel and issue cprload. The guest RAM must be backed by a persistent shared memory file, such as device DAX or a /dev/shm file that is preserved across kexec via Anthony Yznaga's proposed PKRAM kernel patches.
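The requirement for file-backed shared memory can be illustrated with a small sketch. It maps a file with MAP_SHARED so that the contents live in the file rather than in process-private pages, and a later process that maps the same file sees the same data. The function name map_shared_ram is hypothetical. Note that this only shows the mapping pattern: whether the file's contents survive a kexec depends on the backing store (device DAX, or /dev/shm with the proposed PKRAM patches), which this code does not provide.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical: map len bytes of the file at path as shared memory,
 * creating and sizing the file if needed. Returns NULL on failure. */
void *map_shared_ram(const char *path, size_t len)
{
    void *p;
    int fd = open(path, O_RDWR | O_CREAT, 0600);

    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return NULL;
    }
    /* MAP_SHARED: stores go to the file's pages, not to a private copy */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                          /* the mapping keeps the file referenced */
    return p == MAP_FAILED ? NULL : p;
}
```

A process that writes through this mapping and exits leaves the data in the file, so a new process that calls map_shared_ram on the same path finds the guest RAM contents intact, provided the file itself has been preserved.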
VFIO devices can be preserved if the guest provides an agent that implements suspend-to-RAM, such as qemu-ga. To update, you first issue guest-suspend-ram to the agent. The guest drivers' suspend methods then flush outstanding requests and re-initialize the devices to a reset state, the same state reached after the host reboots. Thus, when the guest resumes, the guest and host agree on the device state.
Connections from the guest kernel to the outside world survive the reboot. The guest pause time is longer than in restart mode, and depends heavily on the boot time of the kernel and the prerequisite userland services.
The following is an example of updating the host kernel on Oracle Linux 7 using the HMP version of cprsave reboot.
We are busily working to bring this functionality to the Linux community. We submitted the first version of the QEMU patches to the qemu-devel email list, and we are working on version 2. Anthony submitted the Linux patches for the madvise option and PKRAM to the Linux kernel email list. Stay tuned for updates.