In the world of virtualization, fast virtual machine (VM) boot-up is crucial. Swift provisioning and deployment of VMs enable organizations to respond quickly to changing demands, scale resources efficiently, and minimize downtime. Fast boot time also contributes significantly to the user experience: whether it’s developers spinning up test environments or end users accessing virtual desktops, quick boot times enhance productivity, improve user satisfaction, and make for a seamless computing experience.
To optimize performance, VMs are commonly configured with preallocated memory. Preallocating memory ensures immediate access to resources, avoids fragmentation, and is more predictable than dynamic allocation. The downsides are that it requires upfront commitment of resources and time to initialize the memory. This means that VMs with large amounts of preallocated memory can experience slow boot times. This blog post describes how to use QEMU’s ThreadContext to reduce VM boot time by accelerating memory preallocation.
What Is ThreadContext
QEMU uses a configurable number of initialization threads to perform memory preallocation. Normally, these threads are placed without regard to the host’s NUMA topology. As a result, a thread may end up on a different node from the memory it operates on and incur additional latency due to cross-node memory access. The impact of this latency is also highly variable, because the scheduler may choose different placements from run to run.
This is where QEMU ThreadContext comes into play. Briefly, ThreadContext refers to a thread used for creating additional threads, where each new thread will inherit the original thread’s NUMA affinity settings. Creating a ThreadContext once is sufficient; all subsequent threads created through it will automatically adopt the same CPU affinity. ThreadContext support has been available since QEMU version 7.2.0.
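As a minimal sketch of what such an object looks like: besides `node-affinity`, the `thread-context` object also accepts an explicit `cpu-affinity` list, which is useful when you want to confine the creation threads to particular host CPUs rather than a whole node (the ranges below are purely illustrative):

```
# Pin a ThreadContext to an explicit CPU set instead of a whole NUMA node
# (illustrative ranges; repeat cpu-affinity to extend the list):
-object thread-context,id=tc0,cpu-affinity=0-25,cpu-affinity=52-77
```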
When ThreadContext is used for memory preallocation, it ensures the initialization threads are placed on the same NUMA node as the associated memory region. In addition, a recent enhancement allows ThreadContext regions to be initialized in parallel, which further reduces memory preallocation time.
How to Use ThreadContext
- Begin by examining the NUMA configuration of the host system. For example, on a two-socket host system, OCI’s BM.Standard2.52, `lscpu` shows the following configuration:

  ```
  # lscpu
  ...
  NUMA:
    NUMA node(s):      2
    NUMA node0 CPU(s): 0-25,52-77
    NUMA node1 CPU(s): 26-51,78-103
  ```

- On the QEMU command line, define each NUMA node with the `-numa node` option, specifying the associated CPU cores and memory objects (we’ll define those in a moment):

  ```
  -numa node,nodeid=0,cpus=0-25,cpus=52-77,memdev=ram-node0 \
  -numa node,nodeid=1,cpus=26-51,cpus=78-103,memdev=ram-node1
  ```

- Create a ThreadContext object for each NUMA node using the `-object thread-context` command line switch, specifying the node affinity:

  ```
  -object thread-context,id=tc0,node-affinity=0 \
  -object thread-context,id=tc1,node-affinity=1
  ```

  (The `node-affinity` option associates each ThreadContext object with the corresponding NUMA node.)

- Define preallocated memory objects using QEMU’s memory backend types with the `-object memory-backend-ram` command line switch. In this example, each NUMA node has 128GB of preallocated memory, with a policy of “bind” and eight preallocation threads per node:

  ```
  -object memory-backend-ram,id=ram-node0,size=128g,host-nodes=0,policy=bind,prealloc=on,prealloc-context=tc0,prealloc-threads=8 \
  -object memory-backend-ram,id=ram-node1,size=128g,host-nodes=1,policy=bind,prealloc=on,prealloc-context=tc1,prealloc-threads=8
  ```

  (The `policy=bind` option ensures that memory allocation is restricted to the corresponding host node. A script sketch that generates these per-node switches automatically follows this list.)
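On hosts with more NUMA nodes, writing out these switches by hand quickly becomes error-prone. The per-node arguments can instead be derived from sysfs. The following shell sketch is a hypothetical helper, assuming the same sizes, thread counts, and object ids (`ram-nodeN`, `tcN`) as the example above:

```
#!/bin/sh
# Sketch: emit the per-node QEMU switches from the host's NUMA topology.
# Assumes 128g per node, eight preallocation threads, and ids matching
# the example above (ram-nodeN, tcN).
for node in /sys/devices/system/node/node[0-9]*; do
    n=${node##*node}              # node index, e.g. 0
    cpus=$(cat "$node/cpulist")   # e.g. 0-25,52-77
    # QEMU expects each CPU range as its own cpus= entry:
    ranges=$(printf '%s' "$cpus" | sed 's/,/,cpus=/g')
    echo "-numa node,nodeid=$n,cpus=$ranges,memdev=ram-node$n \\"
    echo "-object thread-context,id=tc$n,node-affinity=$n \\"
    echo "-object memory-backend-ram,id=ram-node$n,size=128g,host-nodes=$n,policy=bind,prealloc=on,prealloc-context=tc$n,prealloc-threads=8 \\"
done
```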
Results
Preallocation Time Without ThreadContext
As a baseline, we use the `time` command with a simple set of QEMU command line arguments that terminate the VM after memory initialization. There is some run-to-run variability in this test due to scheduler placement, but on the test system the average is about 25 seconds:
```
# time ./qemu-system-x86_64 -name debug-threads=on \
    -nographic -monitor stdio -m 256G \
    -numa node,nodeid=0,cpus=0-25,cpus=52-77,memdev=ram-node0 \
    -numa node,nodeid=1,cpus=26-51,cpus=78-103,memdev=ram-node1 \
    -object memory-backend-ram,id=ram-node0,size=128g,host-nodes=0,policy=bind,prealloc=on,prealloc-threads=8 \
    -object memory-backend-ram,id=ram-node1,size=128g,host-nodes=1,policy=bind,prealloc=on,prealloc-threads=8
...

real    0m25.482s
user    0m0.072s
sys     1m21.673s
```
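For repeated measurements, it helps to make this run non-interactive. One way to do that (our assumption, not part of the original invocation) is to pipe `quit` into the stdio monitor, so QEMU exits as soon as startup, including memory preallocation, completes:

```
# Sketch: non-interactive variant of the baseline run above. Piping "quit"
# into the monitor makes QEMU exit right after startup finishes.
time sh -c 'echo quit | ./qemu-system-x86_64 -name debug-threads=on \
    -nographic -monitor stdio -m 256G \
    -numa node,nodeid=0,cpus=0-25,cpus=52-77,memdev=ram-node0 \
    -numa node,nodeid=1,cpus=26-51,cpus=78-103,memdev=ram-node1 \
    -object memory-backend-ram,id=ram-node0,size=128g,host-nodes=0,policy=bind,prealloc=on,prealloc-threads=8 \
    -object memory-backend-ram,id=ram-node1,size=128g,host-nodes=1,policy=bind,prealloc=on,prealloc-threads=8'
```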
Preallocation Time With ThreadContext
We enable ThreadContext by adding the previously described command line switches and run the same `time` test. In this example, it consistently takes less than 11 seconds:
```
# time ./qemu-system-x86_64 -name debug-threads=on \
    -nographic -monitor stdio -m 256G \
    -numa node,nodeid=0,cpus=0-25,cpus=52-77,memdev=ram-node0 \
    -numa node,nodeid=1,cpus=26-51,cpus=78-103,memdev=ram-node1 \
    -object thread-context,id=tc0,node-affinity=0 \
    -object thread-context,id=tc1,node-affinity=1 \
    -object memory-backend-ram,id=ram-node0,size=128g,host-nodes=0,policy=bind,prealloc=on,prealloc-context=tc0,prealloc-threads=8 \
    -object memory-backend-ram,id=ram-node1,size=128g,host-nodes=1,policy=bind,prealloc=on,prealloc-context=tc1,prealloc-threads=8
...

real    0m10.566s
user    0m0.074s
sys     0m55.131s
```
Comparing this to the previous result shows that using ThreadContext reduces memory preallocation time by roughly 55%.
The `ps` command can be used to confirm that ThreadContext is working as expected. The QEMU command line includes `-name debug-threads=on`, which gives each thread a name displayed by `ps`; the initialization threads are named `touch_pages`. Recall that in addition to making the initialization threads NUMA aware, ThreadContext allows memory preallocation to occur in parallel across NUMA nodes. Therefore there are 16 `touch_pages` initialization threads:
```
$ ps -p $(pgrep qemu-system-x86) -L -o pid,lwp,cpuid,ucmd
    PID     LWP CPUID CMD
 119728  119728    83 qemu-system-x86
 119728  119732    43 qemu-system-x86
 119728  119733     0 TC tc0
 119728  119734    28 TC tc1
 119728  119737     8 touch_pages
 119728  119738    12 touch_pages
 119728  119739    73 touch_pages
 119728  119740    53 touch_pages
 119728  119741     6 touch_pages
 119728  119742    59 touch_pages
 119728  119743     9 touch_pages
 119728  119744    69 touch_pages
 119728  119745   103 touch_pages
 119728  119746    37 touch_pages
 119728  119747    91 touch_pages
 119728  119748    41 touch_pages
 119728  119749    94 touch_pages
 119728  119750    45 touch_pages
 119728  119751    46 touch_pages
 119728  119752    47 touch_pages
```
Notice that the ThreadContext `tc0` thread is on node 0 (CPU ID 0) along with the first eight `touch_pages` threads (CPU IDs 8, 12, 73, 53, 6, 59, 9, 69). Likewise, `tc1` is on node 1 (CPU ID 28) with the remaining eight `touch_pages` threads (CPU IDs 103, 37, 91, 41, 94, 45, 46, 47).
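To make this check mechanical rather than visual, the CPU IDs reported by `ps` can be joined against `lscpu -p=CPU,NODE`, which prints `cpu,node` pairs. A small sketch, assuming the same process and thread names as above:

```
# Sketch: print the "cpu,node" pair for every touch_pages thread.
ps -p "$(pgrep qemu-system-x86)" -L -o lwp=,cpuid=,ucmd= |
awk '$3 == "touch_pages" { print $2 }' |
while read -r cpu; do
    lscpu -p=CPU,NODE | grep "^$cpu,"   # e.g. "8,0" -> CPU 8 on node 0
done
```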
Wrapping Up
Fast VM boot time is important for efficient resource management and a good user experience. In this blog post, we showed how QEMU ThreadContext can significantly speed up boot time by making the memory preallocation threads NUMA aware and enabling parallel initialization.
ThreadContext is available in mainline QEMU and in Oracle QEMU 7.2.0 with the KVM AppStream for Oracle Linux 8 and KVM Utilities for Oracle Linux 9.