Live update is a technique wherein a QEMU plus guest instance can be updated to run the latest version of QEMU, with minimal guest pause time, and no loss of external connections. It is faster and more efficient than live migration, with the caveat that the updated QEMU instance must run on the same host as the old.
When I last wrote in this space about our QEMU Live Update project, it was but a prototype, used internally here at Oracle, and not available elsewhere. That has since changed with the release of cpr-reboot mode in QEMU 8.2, and the upcoming release of cpr-transfer mode in QEMU 10.0. There have been changes in interface and implementation along the way, which are the focus of today’s article. You can find some of the following details and more in the QEMU documentation for CPR that I wrote. I use the terms CPR (CheckPoint and Restart) and live update interchangeably.
Live update was originally designed as a separate module with its own set of commands (cprsave, cprexec, and cprload), and their implementation leveraged existing QEMU live migration functions for saving and restoring device state. Given the close relationship between live update and live migration, we decided to re-define live update as a mode of live migration, and use the live migration commands to drive an update.
There is now a QAPI MigrationParameter called mode with values cpr-reboot, cpr-transfer, and normal (the default). If the user sets mode to one of the CPR flavors, via the migrate-set-parameters QAPI command (migrate_set_parameter in HMP), then the next migrate command performs a live update rather than a live migration. After that, the user polls the QEMU run state, waiting for it to become postmigrate, which indicates that the update has finished. The two live update modes have slightly different behavior.
cpr-reboot mode
The so-called reboot mode saves VM state in such a way that it is safe to reboot the host system before restarting the VM, giving the user the opportunity to update the host operating system. QEMU does not perform the reboot; whether and how to do so is up to the user. If you do not need to reboot, then use the faster cpr-transfer mode.
In cpr-reboot mode, QEMU stops the VM and writes VM state to the migration URI, which is typically a file. After quitting QEMU, the user resumes by running QEMU with the -incoming option.
Here is an example:
# qemu-kvm -monitor stdio \
    -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/dax0.0,align=2M,share=on \
    -m 4G ...
(qemu) info status
VM status: running
(qemu) migrate_set_parameter mode cpr-reboot
(qemu) migrate_set_capability x-ignore-shared on
(qemu) migrate -d file:vm.state
(qemu) info status
VM status: paused (postmigrate)
(qemu) quit

### optionally update QEMU ...
### optionally update kernel and reboot ...

# qemu-kvm ... -incoming defer
(qemu) info status
VM status: paused (inmigrate)
(qemu) migrate_set_parameter mode cpr-reboot
(qemu) migrate_set_capability x-ignore-shared on
(qemu) migrate_incoming file:vm.state
(qemu) info status
VM status: running
Note how the guest’s main memory is saved in a DAX file, which persists not only across termination of QEMU, but also across a host reboot if one is performed. Small auxiliary RAM segments, such as VRAM and device ROM, are saved in the vm.state file. Their total size is typically small, measured in tens of megabytes.
As part of this work, QEMU now correctly handles vmstate save and restore for guests in the S3 state, aka suspended to RAM, so you can live update guests with vfio devices, provided you first suspend the guest, such as by using the QEMU guest agent:
# qemu-kvm \
    -chardev socket,id=qga0,path=qga.sock,server=on,wait=off \
    -device virtserialport,chardev=qga0,name=org.qemu.guest_agent.0 \
    ...
# echo '{"execute":"guest-suspend-ram"}' | ncat --send-only -U qga.sock
cpr-transfer mode
This mode allows the user to transfer a guest to a new QEMU instance on the same host, by preserving guest RAM in place.
The user starts new QEMU on the same host as old QEMU, with command-line arguments to create the same machine, plus the -incoming option for the main migration channel, like normal live migration. In addition, the user adds a second -incoming option with channel type cpr.
To initiate live update, the user issues a migrate command to old QEMU, adding a second migration channel of type cpr in the channels argument. Old QEMU stops the VM, saves state to the migration channels, and enters the postmigrate state. Execution resumes in new QEMU.
This is rather different from the prototype, in which old QEMU exec’s new QEMU. Developers of machine managers such as libvirt find the transfer method more familiar and comfortable than the exec method, because it is more like live migration.
The migrate command copies state from old QEMU to new, much like a conventional live migration, but with the following differences:
- No memory is copied, because guest memory is preserved in place. Memory backed by a file is simply reopened in new QEMU. Anonymous memory is created in old QEMU with the memfd_create system call; the memfd is passed to new QEMU via an SCM_RIGHTS message sent over a UNIX domain socket, and new QEMU mmap’s it (see the sketch after this list). Old QEMU must be started with the -machine aux-ram-share=on option to cause anonymous memory to be created in this fashion.
- Additional state is copied over the cpr channel, which is the UNIX domain socket mentioned above.
- The guest is paused immediately but only briefly. There is no extended “live” phase with reduced performance, and there are no dirty writes to be tracked.
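For the curious, here is a rough C sketch of the memfd mechanism from the first item above. It is illustrative only, not QEMU’s actual code: the names are mine, and error handling is minimal.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Create an anonymous guest RAM block as a memfd, so the memory can
 * outlive this QEMU process. */
static int create_guest_ram(size_t size)
{
    int fd = memfd_create("guest-ram0", 0);     /* the name is illustrative */
    if (fd < 0 || ftruncate(fd, size) < 0) {
        perror("memfd");
        exit(1);
    }
    /* Old QEMU maps and uses the memory as usual. */
    if (mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0) == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }
    return fd;
}

/* Pass the memfd to new QEMU over a connected UNIX domain socket as an
 * SCM_RIGHTS ancillary message; new QEMU then mmap's the received fd. */
static void send_fd(int sock, int fd)
{
    char dummy = 'x';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } ctl;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    if (sendmsg(sock, &msg, 0) < 0) {
        perror("sendmsg");
        exit(1);
    }
}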
Here is a complete cpr-transfer example. Specifying multiple channels requires the use of QMP rather than HMP on the outgoing side, so this is more verbose than the cpr-reboot example above. Some QMP responses are omitted for brevity.
Old QEMU:

# qemu-kvm -qmp stdio \
    -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on \
    -m 4G \
    -machine memory-backend=ram0 \
    -machine aux-ram-share=on \
    ...

New QEMU:

# qemu-kvm -monitor stdio \
    -incoming tcp:0:44444 \
    -incoming '{"channel-type": "cpr",
                "addr": { "transport": "socket",
                          "type": "unix", "path": "cpr.sock"}}' \
    ...

Old QEMU:

{"execute":"qmp_capabilities"}
{"execute": "query-status"}
{"return": {"status": "running", "running": true}}
{"execute":"migrate-set-parameters", "arguments":{"mode":"cpr-transfer"}}
{"execute": "migrate", "arguments": { "channels": [
    {"channel-type": "main",
     "addr": { "transport": "socket", "type": "inet", "host": "0", "port": "44444" }},
    {"channel-type": "cpr",
     "addr": { "transport": "socket", "type": "unix", "path": "cpr.sock" }}]}}
{"execute": "query-status"}
{"return": {"status": "postmigrate", "running": false}}

New QEMU:

QEMU 10.0.50 monitor
(qemu) info status
VM status: running
Futures
vfio and iommufd
The implementation of cpr-transfer mode in QEMU 10.0 only supports virtual devices, but I have submitted the following patches to support vfio and iommufd devices as well: Live update: vfio and iommufd
For both vfio and iommufd, QEMU keeps the device alive by transferring various file descriptors (container, group, device, eventfd) from old QEMU to new QEMU via SCM_RIGHTS over the cpr channel. This approach is very general and can be used to support a wide variety of devices that do not have hardware support for live migration, although some devices need new kernel software interfaces to allow a descriptor to be used in a process that did not originally open it.
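To make that concrete, here is a rough sketch of the receiving side of such a transfer. Again, this is illustrative only, not QEMU’s actual code.

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Recover a descriptor sent over the cpr channel with SCM_RIGHTS.
 * The received fd refers to the same open file description (vfio device,
 * eventfd, memfd, ...) that old QEMU had open, so the kernel state behind
 * it is preserved across the update. */
static int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } ctl;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;
    int fd = -1;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);
    if (recvmsg(sock, &msg, 0) < 0) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg && cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS) {
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    }
    return fd;
}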
QEMU registers memory with the vfio and iommufd drivers so that memory can be pinned for DMA. Live update manipulates the registrations differently for vfio versus iommufd, as described below.
vfio
QEMU registers memory with the vfio driver by calling the VFIO_IOMMU_MAP_DMA ioctl, and the driver remembers its virtual address. This is a problem for live update, because the virtual address changes in new QEMU when the memory file descriptor is mmap’ed.
To solve this problem, I defined new flags VFIO_DMA_UNMAP_FLAG_VADDR and VFIO_DMA_MAP_FLAG_VADDR. During live update, old QEMU calls the VFIO_IOMMU_UNMAP_DMA ioctl with VFIO_DMA_UNMAP_FLAG_VADDR to abandon the old virtual addresses for DMA-mapped regions, and new QEMU calls VFIO_IOMMU_MAP_DMA with VFIO_DMA_MAP_FLAG_VADDR to register new virtual addresses. The latter call also updates locked memory accounting. The physical pages remain pinned, because the descriptor of the device that locked them remains open, so DMA to those pages continues without interruption.
The kernel has supported the VADDR flags since 5.11; support can be determined programmatically by checking for the VFIO_UPDATE_VADDR extension.
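Here is a simplified sketch of those calls against the vfio type1 UAPI. It is not QEMU’s code; the container fd plumbing, error handling, and region bookkeeping are omitted.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Returns nonzero if the kernel supports updating vaddrs (5.11 and later). */
static int vaddr_update_supported(int container_fd)
{
    return ioctl(container_fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) > 0;
}

/* Old QEMU: abandon the old vaddr for a DMA-mapped range.  The IOVA
 * mapping and the pinned pages are left intact. */
static int abandon_vaddr(int container_fd, uint64_t iova, uint64_t size)
{
    struct vfio_iommu_type1_dma_unmap unmap = {
        .argsz = sizeof(unmap),
        .flags = VFIO_DMA_UNMAP_FLAG_VADDR,
        .iova  = iova,
        .size  = size,
    };
    return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
}

/* New QEMU: register the new vaddr for the same IOVA range after
 * mmap'ing the transferred memory fd.  Only the vaddr is replaced;
 * the existing mapping is otherwise unchanged. */
static int update_vaddr(int container_fd, void *vaddr, uint64_t iova, uint64_t size)
{
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_VADDR,
        .vaddr = (uint64_t)(uintptr_t)vaddr,
        .iova  = iova,
        .size  = size,
    };
    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}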
iommufd
To avoid manipulating virtual addresses for iommufd, I added the IOMMU_IOAS_MAP_FILE ioctl, which registers the memory backed by a memory file descriptor. Thus the driver has no knowledge of virtual addresses. This integrates perfectly with our general approach of passing open memory file descriptors to new QEMU. New QEMU mmap’s the memfd, and the iommufd device is none the wiser. As with vfio, the pages remain pinned in memory.
However, QEMU does need to tell the driver that a new process is using the memory, so it can update its mm ownership and locked memory accounting. This is done via the IOMMU_IOAS_CHANGE_PROCESS ioctl, which was added for this project.
The kernel has supported IOMMU_IOAS_MAP_FILE and IOMMU_IOAS_CHANGE_PROCESS since 6.12.
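Here is a simplified sketch of those calls. The field names follow my reading of the 6.12 UAPI in linux/iommufd.h and should be checked against your kernel headers; the helpers are mine, not QEMU’s.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>   /* field names per the 6.12 UAPI; verify locally */

/* Register a range of guest RAM, backed by a memfd, with an IOAS.  The
 * driver tracks the (fd, offset) pair rather than a virtual address. */
static int map_ram_file(int iommufd, uint32_t ioas_id, int memfd,
                        uint64_t offset, uint64_t length, uint64_t iova)
{
    struct iommu_ioas_map_file map = {
        .size    = sizeof(map),
        .flags   = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE |
                   IOMMU_IOAS_MAP_FIXED_IOVA,
        .ioas_id = ioas_id,
        .fd      = memfd,
        .start   = offset,
        .length  = length,
        .iova    = iova,
    };
    return ioctl(iommufd, IOMMU_IOAS_MAP_FILE, &map);
}

/* New QEMU, after receiving the iommufd descriptor: tell the driver that
 * a new process now owns the mappings, so mm ownership and locked-memory
 * accounting can be updated. */
static int change_process(int iommufd)
{
    struct iommu_ioas_change_process args = { .size = sizeof(args) };
    return ioctl(iommufd, IOMMU_IOAS_CHANGE_PROCESS, &args);
}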
cpr-exec mode
I still like the exec method of live update that we implemented in the prototype. In a containerized QEMU environment, exec reuses an existing QEMU container and its assigned resources, so the virtual machine manager need not create a new container and allocate resources for it. In addition, the container may include agents with their own connections to the outside world, and such connections remain intact when the container is reused.
I plan to submit patches for cpr-exec mode, implemented in the new framework. This is a relatively small delta versus the existing cpr-transfer code.
character devices
Given the ability to transfer open descriptors to new QEMU, it is fairly easy to keep QEMU backend character devices open during live update. Clients connected to these will not see any interruption in service, and the virtual machine manager does nothing special to manage the transition.
These include QEMU chardevs of type socket, pty, stdio, file, pipe, serial, and parallel.
Patches are in the works.
vhost and tap
Once again, it’s all about preserving descriptors. These devices can be preserved in an open state, rather than closing and recreating them as in a traditional live migration. The virtual machine manager does nothing special to manage the transition.
Patches are in the works.
Conclusion
The Live Update project has been a marathon, not a sprint! Parts of it are finally available in upstream QEMU and the kernel, and we continue moving forward to provide a comprehensive solution for updating the QEMU hypervisor with minimal impact to the guest.