Introduction
This blog post introduces the concept of CPU hotplugging (and hot-unplugging) in a virtualized environment, and includes high-level coverage of some of the key mechanisms involved. The focus is on QEMU and KVM, primarily with a Linux guest, but the general concepts are applicable to most other virtualization environments.
Use Cases
A common use case for CPU hotplugging is scaling virtual machines up and down. As demand or workload increases on a VM, more vCPUs can be added without a reboot. Similarly, when the workload decreases, vCPUs can easily be removed. This scaling typically takes on the order of seconds.
Public cloud providers typically charge their customers an amount proportional to the CPU count. Moreover, some software licensing models charge based on the number of CPUs the software is running on. Thus, a popular usage model is for users to scale up their vCPU count only when needed and then decrease it when demand is low, in order to reduce costs.
Another use case for CPU hotplug is capacity management. Consider a simple example where a user has 10 hosts, with 40 total VMs running across those hosts. There is a finite total CPU count across the fleet: 200 CPUs. If the user now needs to run 10 additional VMs (50 total), but the full count of CPUs is already in use, how can the user launch these additional VMs?
One option is to simply buy more hosts. This increases the available CPU count, but there is additional cost associated with this option.
Another option is to reduce the number of vCPUs allocated to each VM, keeping the host count unchanged. In this example, each VM has been allocated 5 vCPUs. If we reduce each VM’s CPU count to 4, we can now run 50 total VMs. The cost remains the same, we’ve increased the total VMs by 25%, and only reduced each VM’s CPU count by 20%.
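To make the arithmetic explicit (the 20 CPUs per host below simply follows from 200 CPUs spread across 10 hosts):

10 hosts x 20 CPUs per host = 200 CPUs available
40 VMs x 5 vCPUs per VM = 200 vCPUs allocated (fully committed)
200 CPUs / 4 vCPUs per VM = 50 VMs after the reduction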
What is “hotplugging”?
Plugging in a device means the user intends a new device to become visible and usable to the system. “Hot” plugging indicates that the system is already running while the device is being added.
Many different device types can be hotplugged, such as memory, vCPUs, or PCIe devices. PCIe, VirtIO, and ACPI provide hotplugging interfaces with different capabilities regarding what sort of devices can be hotplugged, and under what circumstances.
Hot-unplugging is also possible, where a device is removed while the system continues to run. In the case of vCPUs, this requires all processes running on a given vCPU to migrate to other vCPUs so it can be brought offline.
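For example, in a Linux guest a CPU can be taken offline through sysfs; the kernel migrates any tasks running on it to the remaining online CPUs (CPU 3 here is just an illustration):

echo 0 > /sys/devices/system/cpu/cpu3/online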
Hotplug Interfaces
Various device interfaces exist, but are often not monoliths. That is, an interface at one level of abstraction may use another lower-level interface in its implementation.
PCIe
The original PCI specification lacked the concept of adding and removing devices at runtime, though support for hotplug was added later. PCIe included hotplug support from its inception. In recent years, the most common use cases for PCIe hotplug have shifted from the canonical adding and removing of peripheral cards in a server to hot-swapping NVMe drives and connecting Thunderbolt devices.
PCIe hotplugging is often correlated with a device being physically connected or disconnected. In a virtualization context, using PCIe hotplug directly may be less common, as exposing a virtual machine to events in the physical world runs contrary to the goal of keeping the guest abstracted from the physical hardware.
VirtIO
VirtIO is a popular interface for exposing simplified devices to a virtual machine. Implementations commonly use PCIe as the underlying transport, so hotplug support is not purely native. Many VirtIO devices can be hotplugged, such as virtio-net and virtio-scsi. There is even support for memory hotplugging in virtio-mem. However, there is no VirtIO CPU device, so VirtIO isn’t helpful for CPU hotplugging.
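As an illustration, a virtio-net device can be hotplugged from the QEMU monitor; the netdev backend and the IDs below are arbitrary examples:

(qemu) netdev_add user,id=hostnet1
(qemu) device_add virtio-net-pci,netdev=hostnet1,id=nic1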
ACPI
ACPI can be considered a lower-level interface, and provides mechanisms for the OS and firmware to work together to manage the lifecycle of a device. ACPI allows the firmware to describe the hardware and its capabilities in a way the OS can understand, and provides multiple event and device model alternatives. Handlers for different events can be defined in firmware, and the OS hands off control to the appropriate event handler when an interrupt is received.
ACPI allows devices, including CPUs, to be hotplugged and has multiple alternatives for the choice of event model. It is the primary focus of this blog entry.
CPUs
The hypervisor presents a virtual CPU to the guest that is an abstraction of the physical CPUs present on the host; the backing resource can be a distinct hardware thread of a CPU core, or a physical core itself. From the host's perspective, we will refer to the physical CPU(s) as ‘CPU’ and the virtualized CPU being presented to the guest as ‘vCPU’.
Within the guest, we just refer to a vCPU as a CPU. The guest may be aware that it’s working with a virtual CPU and not a ‘real’ CPU, but for the purposes of hotplug there is no need to make a distinction.
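For example, a Linux guest can typically detect that it is running under a hypervisor, although this knowledge plays no role in the hotplug flow (output shown for a KVM guest):

[root@localhost ~]# systemd-detect-virt
kvm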
Guest CPU States
From the guest kernel’s perspective, a CPU can be in one of many states. These states are maintained as separate bitmasks.
- Possible – The entire set of CPUs that could be given to the guest. The size of this set is the max CPU count specified to QEMU.
- Present – The set of CPUs that are currently plugged in and visible to the guest. Subset of possible CPUs.
- Online/Offline – There is only an online bitmask, so a present CPU can be online or offline, but not both. An online CPU is available to be used by the kernel scheduler. Online is a subset of present.
- Active – Active CPUs are a subset of online CPUs. The scheduler makes an online CPU active, which allows kernel tasks to be migrated to the given CPU.
Some of these CPU bitmasks are exposed read-only through sysfs. An example with maxcpus=68 in QEMU:
[root@localhost ~]# cat /sys/devices/system/cpu/possible
0-67
[root@localhost ~]# cat /sys/devices/system/cpu/present
0-23
[root@localhost ~]# cat /sys/devices/system/cpu/offline
24-67
[root@localhost ~]# cat /sys/devices/system/cpu/online
0-23
Hotplug support within the guest requires a kernel with the CONFIG_HOTPLUG_CPU option enabled. If hotplug support is not present, Possible and Present CPUs will be the set of CPUs reported by ACPI at boot. Active and Online are equivalent in this case as well.
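One way to check whether the running guest kernel was built with this option (the config file path varies by distribution; some kernels expose /proc/config.gz instead):

[root@localhost ~]# grep CONFIG_HOTPLUG_CPU= /boot/config-$(uname -r)
CONFIG_HOTPLUG_CPU=y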
Hotplugging in QEMU
In QEMU, the maximum number of vCPUs must be specified at VM launch time. A guest can start with a smaller number of vCPUs, and additional vCPUs can be hotplugged after launch, but only up to the maximum CPU count initially specified (QEMU determines a max CPU count if the user doesn’t explicitly specify it). ACPI information generated by QEMU will contain structures for the entire set of possible vCPUs, but unused vCPUs will be disabled.
In QEMU, CPU 0 is enabled by default and has unique duties. Normally it cannot be hot-unplugged.
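As a rough sketch of how this looks in practice, a guest might be launched with 2 vCPUs online and 8 possible, and another vCPU hotplugged later from the QEMU monitor. The exact CPU device type and topology properties (socket-id, core-id, thread-id, and die-id on newer machine types) depend on the machine type and CPU model; ‘info hotpluggable-cpus’ reports the valid values:

# Launch with 2 vCPUs online and room to grow to 8 (other options omitted)
qemu-system-x86_64 -accel kvm -smp 2,maxcpus=8,sockets=1,cores=8,threads=1 ...

# Later, from the QEMU monitor: list the valid slots, then plug a vCPU
(qemu) info hotpluggable-cpus
(qemu) device_add qemu64-x86_64-cpu,socket-id=0,core-id=2,thread-id=0,id=cpu2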
Hotplug Workflow
The hotplug workflow can be divided into two parts. Part 1 can be viewed as synchronous, in that the hypervisor waits for the guest to report back on the CPU state change. Part 2, however, is mostly opaque to the hypervisor and involves only the guest and the actual hardware. Part 2 can be thought of as asynchronous, since the hypervisor is no longer waiting for the CPU state change and can perform other work.
Part 1
Specific CPU hotplug steps may differ, depending on CPU count or whether SecureBoot is enabled, but they share a common overall workflow. ‘Hypervisor’ is used here as a blanket term that can include the host kernel, KVM, QEMU, OVMF, etc.
- The user informs the hypervisor to add a vCPU to the guest
- The hypervisor creates its own internal representation of the new vCPU
- The hypervisor updates the ACPI event status
- The hypervisor sends a System Control Interrupt (SCI) to inform the guest OS that there is a change in virtualized hardware
- The guest OS executes the ACPI method for scanning CPUs (called CSCN) that it was provided at boot
- The ACPI CSCN method (running in the context of the guest OS) may retrieve additional information from the hypervisor about the new CPU
- ACPI will notify the guest OS (guest effectively notifies itself) that there is a new CPU
- The ACPI “notify” event is handled by the ACPI drivers in the guest kernel
- ACPI driver code will enable the new CPU
- The guest notifies the hypervisor that the vCPU hotplug completed with ACPI OST success status
- The guest kernel generates a udev event for the new CPU device
Until this point, the hypervisor was blocking on the vCPU hotplug operation. From the guest’s perspective, the new vCPU is “present” but not “online”, meaning it can’t yet perform any useful work.
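Continuing the earlier sysfs example (24 vCPUs online, maxcpus=68): if one more vCPU were hotplugged, the guest would show it as present but not yet online until Part 2 completes:

[root@localhost ~]# cat /sys/devices/system/cpu/present
0-24
[root@localhost ~]# cat /sys/devices/system/cpu/online
0-23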
Part 2
As previously stated, the hypervisor is mostly hands-off during Part 2. The guest kernel does the majority of the work here, and will finally initialize the CPU itself.
- A process in the guest (e.g., systemd-udevd in Oracle Linux) will handle the udev event generated above by registering a handler for the “online” and “offline” actions (a sketch of such a rule follows this list)
- The udev handler brings the new CPU online by writing to sysfs:
  echo 1 > /sys/devices/system/cpu/cpu<number>/online
- The write to sysfs invokes kernel code that will bring the CPU online. Some notable kernel functions:
- wakeup_secondary_cpu_via_init() sends a sequence of INIT and SIPI inter-processor interrupts that initialize the CPU.
- cpuhp_kick_ap() wakes up a CPU, in case it was halted
- cpuset_wait_for_hotplug() makes the CPU a usable resource for cgroups
- sched_cpu_activate() is called during the last step of bringing a CPU online, and allows kernel tasks to migrate to the CPU
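A minimal sketch of such a udev rule is shown below; distributions ship their own variants, and the file name is only an example:

# /etc/udev/rules.d/99-cpu-online.rules (illustrative)
# When a CPU device appears and is still offline, bring it online
SUBSYSTEM=="cpu", ACTION=="add", ATTR{online}=="0", ATTR{online}="1"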
The hypervisor is not notified when the CPU online action completes. This state change is purely within the hardware and the guest.
ACPI Hotplug/Unplug Interfaces
The ACPI hotplug interface is the mechanism by which the hypervisor works with the guest OS to add or remove a vCPU. There are a few variants of the hotplug interface, but they all accomplish the same goal.
ACPI has supported CPU hotplug via the General Purpose Event (GPE) mechanism since version 2.0 of the specification (released in 2000). QEMU’s “legacy” hotplug interface supports fewer than 256 vCPUs total and doesn’t support hot-unplug. QEMU’s “modern” hotplug interface is required for more than 255 vCPUs and for hot-unplug support. Both the legacy and modern interfaces use GPE interrupts to signal the vCPU change event to the guest.
ACPI also supports a more generic hotplug interface, through the Generic Event Device (GED), introduced in ACPI version 6.1. QEMU uses GED for memory and NVDIMM hotplug. As of QEMU version 9.1, code for vCPU hotplug using GED is in place, but is not used by either the x86-64 or arm64 architectures.
SecureBoot guests use an additional mechanism on top of GPE, which moves critical vCPU hotplug functionality into System Management Mode (SMM) of x86-64 CPUs.
Legacy Hotplug
QEMU maps vCPU hotplug to GPE.2, which is exposed to the guest as part of the FADT. QEMU will set bits in a bitmask that represents the active CPUs, set the GPE.2 event, and then raise an SCI interrupt in the guest.
The SCI interrupt handler will run in the guest, see that GPE0 status is set, and invoke the GPE.2 handler (defined by QEMU, provided to the guest through the ACPI tables). The handler will perform MMIO reads to find the CPU bits set in the bitmask, and notify the guest OS with corresponding ACPI CPU hot-add events.
“Modern” Hotplug
QEMU’s legacy hotplug interface is inflexible because of the use of bitmasks for vCPUs (limiting vCPU count), and not extensible to functionality like hot-unplug. Among other improvements, the modern interface has a dedicated register for the vCPU’s APIC ID. The interface includes a device status field, to distinguish between insert and eject (hotplug and hot-unplug) requests.
QEMU writes the vCPU’s APIC ID to a register designated the “CPU selector”, rather than a bitmap, and sets the appropriate bit in the CPU device status to indicate hotplug/hot-unplug. The guest is informed about the requested command and the CPU device status by reading distinct MMIO offsets.
SecureBoot Guests
SecureBoot guests are protected such that they only execute trusted firmware during the boot process. x86_64 Secure Management Mode helps provide this isolation because it’s a completely distinct operating mode of the CPU. SMM code has access to a separate address space that is inaccessible to non-SMM code. The emulated chipset is configured to allow write access to certain persistent variables only in SMM.
The GPE.2 handler (provided by QEMU) will generate an SMI (System Management Interrupt) that puts all CPUs into SMM. OVMF (from the EDK II project) is an open-source implementation of UEFI firmware that supports SecureBoot; the SMI handler is defined in OVMF and runs when a CPU transitions into SMM.
Reference
Definitions
- ACPI – Advanced Configuration and Power Interface. A standard for managing power and configuring hardware on a system.
- GPE – General Purpose Event. A configurable notification in ACPI that lets hardware inform the OS about an event of interest.
- INIT – An Inter-Processor Interrupt that signals a processor to initialize. It is distinct from RESET in that it can preserve internal state such as caches.
- OVMF – Open Virtual Machine Firmware. An open-source UEFI implementation for virtual machines under the EDK II project.
- SCI – System Control Interrupt. A general-purpose interrupt that is often used by ACPI to signal to the OS that an event has occurred.
- SIPI – Startup Inter-Processor Interrupt. An interrupt that signals a processor to start executing instructions at a specified address.
- SMI – System Management Interrupt. A dedicated interrupt that is only used to put the processor into SMM.
- SMM – System Management Mode. This is among the highest-privilege execution modes on x86_64 CPUs, sometimes called “ring -2”. SMM allows extending hardware functionality through firmware, without the need for OS involvement.
- UEFI – Unified Extensible Firmware Interface. Intended to replace legacy BIOS, UEFI provides a more flexible and modern interface between the OS and hardware.