In the world of virtualization, fast virtual machine (VM) boot-up is crucial. Swift provisioning and deployment of VMs enable organizations to respond quickly to changing demands, scale resources efficiently, and minimize downtime. Fast boot time also contributes significantly to the user experience: whether it’s developers spinning up test environments or end users accessing virtual desktops, quick boot times enhance productivity, improve user satisfaction, and make for a seamless computing experience.

To optimize performance, VMs are commonly configured with preallocated memory. Preallocating memory ensures immediate access to resources, avoids fragmentation, and is more predictable than dynamic allocation. The downsides are that it requires upfront commitment of resources and time to initialize the memory. This means that VMs with large amounts of preallocated memory can experience slow boot times. This blog post describes how to use QEMU’s ThreadContext to reduce VM boot time by accelerating memory preallocation.

What Is ThreadContext

QEMU uses a configurable number of initialization threads to perform memory preallocation. Normally, these threads are placed without regard for the NUMA architecture. As a result, a thread may end up on a different node from the memory it operates on and incur additional latency due to cross-node memory access. The impact of this latency can also be highly variable, since the scheduler may choose a different placement from run to run.

This is where QEMU ThreadContext comes into play. Briefly, ThreadContext refers to a thread used for creating additional threads, where each new thread will inherit the original thread’s NUMA affinity settings. Creating a ThreadContext once is sufficient; all subsequent threads created through it will automatically adopt the same CPU affinity. ThreadContext support has been available since QEMU version 7.2.0.

When ThreadContext is used for memory preallocation, it ensures the initialization threads are placed on the same NUMA node as the associated memory region. In addition, a recent enhancement allows ThreadContext regions to be initialized in parallel, which further reduces memory preallocation time.
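
As a quick preview (the next section walks through a full two-node configuration step by step), enabling this for a memory backend requires only two additions to the QEMU command line: a thread-context object pinned to a NUMA node, and a prealloc-context property on the memory backend that should use it. The IDs and size in this sketch are purely illustrative:

```
-object thread-context,id=tc0,node-affinity=0
-object memory-backend-ram,id=ram0,size=16g,host-nodes=0,policy=bind,prealloc=on,prealloc-threads=4,prealloc-context=tc0
```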

How to Use ThreadContext

  1. Begin by examining the NUMA configuration of the host system. For example, on a two-socket host system, OCI’s BM.Standard2.52, lscpu shows the following configuration:

    # lscpu
    ...
    NUMA:
      NUMA node(s):          2
      NUMA node0 CPU(s):     0-25,52-77
      NUMA node1 CPU(s):     26-51,78-103
  2. On the QEMU command line, define each NUMA node with the -numa node option, specifying the associated CPU cores and memory objects (we’ll define those in a moment):

    -numa node,nodeid=0,cpus=0-25,cpus=52-77,memdev=ram-node0
    -numa node,nodeid=1,cpus=26-51,cpus=78-103,memdev=ram-node1
  3. Create a ThreadContext object for each NUMA node using the -object thread-context command line switch, specifying the node affinity:

    -object thread-context,id=tc0,node-affinity=0
    -object thread-context,id=tc1,node-affinity=1

    (The node-affinity option associates each ThreadContext object with the corresponding NUMA node.)

  4. Define preallocated memory objects using QEMU’s memory backend types, in this case the -object memory-backend-ram command line switch. In this example, each NUMA node gets 128 GB of preallocated memory, bound to its host node with policy=bind and initialized by eight preallocation threads:

    -object memory-backend-ram,id=ram-node0,size=128g,host-nodes=0,policy=bind,
                               prealloc=on,prealloc-context=tc0,prealloc-threads=8
    -object memory-backend-ram,id=ram-node1,size=128g,host-nodes=1,policy=bind,
                               prealloc=on,prealloc-context=tc1,prealloc-threads=8

    (The policy=bind option ensures that memory allocation is restricted to the corresponding host node; a quick way to verify this binding on a running VM is shown below.)
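
Once the VM is up, one way to confirm that each memory backend actually landed on its bound node is numastat (from the numactl package), which breaks a process’s memory usage down per NUMA node. The invocation below is a sketch that assumes a single QEMU process on the host; with policy=bind, each 128 GB backend should show up almost entirely under its own node’s column:

```
# Per-node memory usage of the running QEMU process (sketch)
numastat -p $(pgrep qemu-system-x86)
```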

Results

Preallocation Time Without ThreadContext

As a baseline, we use the time command with a minimal set of QEMU command line arguments, terminating the VM after memory initialization. There is some run-to-run variability due to scheduler placement, but on the test system the average is about 25 seconds:

```
# time ./qemu-system-x86_64 -name debug-threads=on \
-nographic -monitor stdio -m 256G \
-numa node,nodeid=0,cpus=0-25,cpus=52-77,memdev=ram-node0 \
-numa node,nodeid=1,cpus=26-51,cpus=78-103,memdev=ram-node1 \
-object memory-backend-ram,id=ram-node0,size=128g,host-nodes=0,policy=bind,prealloc=on,prealloc-threads=8 \
-object memory-backend-ram,id=ram-node1,size=128g,host-nodes=1,policy=bind,prealloc=on,prealloc-threads=8
...
real    0m25.482s
user    0m0.072s
sys 1m21.673s

```

Preallocation Time With ThreadContext

We enable ThreadContext by adding the previously described command line switches, and run the same time test. In this example, it consistently takes less than 11 seconds:

```
# time ./qemu-system-x86_64 -name debug-threads=on \
-nographic -monitor stdio -m 256G \
-numa node,nodeid=0,cpus=0-25,cpus=52-77,memdev=ram-node0 \
-numa node,nodeid=1,cpus=26-51,cpus=78-103,memdev=ram-node1 \
-object thread-context,id=tc0,node-affinity=0 \
-object thread-context,id=tc1,node-affinity=1 \
-object memory-backend-ram,id=ram-node0,size=128g,host-nodes=0,policy=bind,prealloc=on,prealloc-context=tc0,prealloc-threads=8 \
-object memory-backend-ram,id=ram-node1,size=128g,host-nodes=1,policy=bind,prealloc=on,prealloc-context=tc1,prealloc-threads=8
...
real    0m10.566s
user    0m0.074s
sys 0m55.131s

```

Compared to the baseline, using ThreadContext reduces memory preallocation time by roughly 55%.

The ps command can be used to confirm that ThreadContext is working as expected. The QEMU command line includes -name debug-threads=on, which gives each thread a name visible in ps; the initialization threads are named touch_pages. Recall that, in addition to making the initialization threads NUMA aware, ThreadContext allows memory preallocation to proceed in parallel across NUMA nodes, so there are 16 touch_pages initialization threads (eight per node):

```
$ ps -p $(pgrep qemu-system-x86) -L -o pid,lwp,cpuid,ucmd
    PID     LWP CPUID CMD
 119728  119728    83 qemu-system-x86
 119728  119732    43 qemu-system-x86
 119728  119733     0 TC tc0
 119728  119734    28 TC tc1
 119728  119737     8 touch_pages
 119728  119738    12 touch_pages
 119728  119739    73 touch_pages
 119728  119740    53 touch_pages
 119728  119741     6 touch_pages
 119728  119742    59 touch_pages
 119728  119743     9 touch_pages
 119728  119744    69 touch_pages
 119728  119745   103 touch_pages
 119728  119746    37 touch_pages
 119728  119747    91 touch_pages
 119728  119748    41 touch_pages
 119728  119749    94 touch_pages
 119728  119750    45 touch_pages
 119728  119751    46 touch_pages
 119728  119752    47 touch_pages
```

Notice that the ThreadContext tc0 thread is on node 0 (CPU ID 0) along with the first eight touch_pages threads (CPU IDs 8, 12, 73, 53, 6, 59, 9, 69). Likewise, tc1 is on node 1 (CPU ID 28) with the remaining eight touch_pages threads (CPU IDs 103, 37, 91, 41, 94, 45, 46, 47).
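
Rather than cross-referencing the lscpu ranges by hand, the CPU-to-node mapping can also be printed directly. A small sketch, assuming the same CPU IDs observed in the ps output above:

```
# lscpu -p=CPU,NODE prints "cpu,node" pairs; filtering for the CPUs the
# touch_pages threads ran on shows which node each one belongs to
# (the first eight should map to node 0, the rest to node 1).
lscpu -p=CPU,NODE | grep -E '^(6|8|9|12|53|59|69|73|37|41|45|46|47|91|94|103),'
```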

Wrapping Up

Fast VM boot time is important for efficient resource management and a good user experience. In this blog post, we showed how QEMU ThreadContext can significantly reduce boot time by making the memory preallocation threads NUMA aware and allowing them to initialize memory in parallel.

ThreadContext is available in mainline QEMU and in Oracle QEMU 7.2.0 with the KVM AppStream for Oracle Linux 8 and KVM Utilities for Oracle Linux 9.