Introduction

The virtio-blk device has supported multi-queue for quite a while. Multi-queue improves performance under heavy I/O by processing the queues in parallel. However, before QEMU 9.0, all the virtqueues were processed by a single IOThread or by the main loop, and this single thread could become a CPU bottleneck.

Now, in QEMU 9.0, the ‘virtio-blk’ device offers true multi-queue functionality, allowing multiple IOThreads to process distinct virtqueues of a single disk and thus distribute the workload. It is now possible to specify the mapping between IOThreads and virtqueues for a virtio-blk device. This can help improve scalability, in particular when the guest issues enough I/O to saturate a single IOThread processing all the virtio-blk requests on the host.

In this article we will see how to configure multiple IOThreads using the new ‘iothread-vq-mapping’ property for a virtio-blk device.

Set Up the KVM Host

The distro used for the KVM host was Oracle Linux 9 with a UEK-next Linux kernel installed. In the next sections, we will set up the host by installing a UEK-next kernel, building QEMU 9.0, and creating a null-block device for testing.

Install a UEK-next Linux Kernel

First, import the Oracle development GPG key, then add the UEKnext yum repository:

$ rpm --import https://yum.oracle.com/RPM-GPG-KEY-oracle-development
$ dnf config-manager --add-repo "https://yum.oracle.com/repo/OracleLinux/OL9/developer/UEKnext/x86_64"
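
To confirm that the repository was added and is enabled, you can list the configured repositories; this is just a quick check, and the exact repository ID (derived from the URL) may look slightly different on your system:

$ dnf repolist | grep -i ueknext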


Now install the UEK-next kernel:

$ dnf install kernel-ueknext

Dependencies resolved.
=====================================================================================================================================
 Package                        Arch      Version              Repository                                                       Size
=====================================================================================================================================
Installing:
 kernel-ueknext                 x86_64    6.8.0-2.el9uek       yum.oracle.com_repo_OracleLinux_OL9_developer_UEKnext_x86_64    235 k
Installing dependencies:
 kernel-ueknext-core            x86_64    6.8.0-2.el9uek       yum.oracle.com_repo_OracleLinux_OL9_developer_UEKnext_x86_64     17 M
 kernel-ueknext-modules         x86_64    6.8.0-2.el9uek       yum.oracle.com_repo_OracleLinux_OL9_developer_UEKnext_x86_64     53 M
 kernel-ueknext-modules-core    x86_64    6.8.0-2.el9uek       yum.oracle.com_repo_OracleLinux_OL9_developer_UEKnext_x86_64     34 M

Transaction Summary
=====================================================================================================================================
Install  4 Packages

Total download size: 104 M
Installed size: 155 M
Is this ok [y/N]: y
Downloading Packages:
(1/4): kernel-ueknext-6.8.0-2.el9uek.x86_64.rpm                                                      1.4 MB/s | 235 kB     00:00
(2/4): kernel-ueknext-core-6.8.0-2.el9uek.x86_64.rpm                                                  25 MB/s |  17 MB     00:00
(3/4): kernel-ueknext-modules-core-6.8.0-2.el9uek.x86_64.rpm                                          32 MB/s |  34 MB     00:01
(4/4): kernel-ueknext-modules-6.8.0-2.el9uek.x86_64.rpm                                               29 MB/s |  53 MB     00:01
-------------------------------------------------------------------------------------------------------------------------------------
Total                                                                                                 56 MB/s | 104 MB     00:01
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                                                             1/1
  Installing       : kernel-ueknext-modules-core-6.8.0-2.el9uek.x86_64                                                           1/4
  Running scriptlet: kernel-ueknext-core-6.8.0-2.el9uek.x86_64                                                                   2/4
  Installing       : kernel-ueknext-core-6.8.0-2.el9uek.x86_64                                                                   2/4
  Running scriptlet: kernel-ueknext-core-6.8.0-2.el9uek.x86_64                                                                   2/4
  Installing       : kernel-ueknext-modules-6.8.0-2.el9uek.x86_64                                                                3/4
  Running scriptlet: kernel-ueknext-modules-6.8.0-2.el9uek.x86_64                                                                3/4
  Installing       : kernel-ueknext-6.8.0-2.el9uek.x86_64                                                                        4/4
  Running scriptlet: kernel-ueknext-modules-core-6.8.0-2.el9uek.x86_64                                                           4/4
  Running scriptlet: kernel-ueknext-core-6.8.0-2.el9uek.x86_64                                                                   4/4
  Running scriptlet: kernel-ueknext-modules-6.8.0-2.el9uek.x86_64                                                                4/4
  Running scriptlet: kernel-ueknext-6.8.0-2.el9uek.x86_64                                                                        4/4
  Verifying        : kernel-ueknext-6.8.0-2.el9uek.x86_64                                                                        1/4
  Verifying        : kernel-ueknext-core-6.8.0-2.el9uek.x86_64                                                                   2/4
  Verifying        : kernel-ueknext-modules-6.8.0-2.el9uek.x86_64                                                                3/4
  Verifying        : kernel-ueknext-modules-core-6.8.0-2.el9uek.x86_64                                                           4/4

Installed:
  kernel-ueknext-6.8.0-2.el9uek.x86_64                           kernel-ueknext-core-6.8.0-2.el9uek.x86_64
  kernel-ueknext-modules-6.8.0-2.el9uek.x86_64                   kernel-ueknext-modules-core-6.8.0-2.el9uek.x86_64

Complete!


Once the installation has completed, reboot the server and confirm that the system has booted into the new kernel:

$ uname -r
6.8.0-2.el9uek.x86_64
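
If uname still reports the previous kernel, the UEK-next entry may not be the default boot entry. As a sketch (the kernel image path below is an assumption based on the package version installed above), you can inspect and change the default entry with grubby:

$ sudo grubby --default-kernel
$ sudo grubby --set-default /boot/vmlinuz-6.8.0-2.el9uek.x86_64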

QEMU Build

The new “iothread-vq-mapping” property was introduced in QEMU 9.0 to map virtqueues to IOThreads. So we will now build QEMU 9.0 from source.

To build QEMU, run the commands below:

Notes:

  • We will build for the x86_64 target only.
  • NR_CPUS is the number of CPUs to use for the parallel build (-j). To check how many CPUs your system has, use the lscpu command, or simply replace $NR_CPUS below with the desired number of build jobs.

$ wget https://download.qemu.org/qemu-9.0.0.tar.xz
$ tar xvJf qemu-9.0.0.tar.xz
$ cd qemu-9.0.0
$ ./configure --target-list=x86_64-softmmu
$ make -j $NR_CPUS
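
To confirm that the build succeeded, you can check the version reported by the freshly built binary (assuming QEMU's default out-of-tree build directory, build/); it should report version 9.0.0:

$ ./build/qemu-system-x86_64 --version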


In the next sections, we will run the QEMU command directly from the build directory to create a virtual machine (VM).

Set Up a null-block Device

The null-block device (/dev/nullb*) emulates a block device of a configurable size. Its purpose is to test the different block-layer implementations: instead of performing real read/write operations, it simply completes the requests in the request queue.

Create a null-block device /dev/nullb0:

$ sudo modprobe null_blk hw_queue_depth=1024 gb=100 submit_queues=$NR_CPUS nr_devices=1 max_sectors=1024


This uses the number of CPUs on your system for ‘submit_queues’, and sets the device size to 100 GB, the hardware queue depth to 1024 commands, and the maximum I/O size to 512 KB (max_sectors=1024).

Confirm creation of the device:

$ ls -l /dev/nullb*
brw-rw----. 1 root disk 251, 0 Jul 13 13:38 /dev/nullb0
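
You can also verify that the module parameters took effect; this is an optional sanity check that assumes the standard null_blk and blk-mq sysfs layout:

$ cat /sys/module/null_blk/parameters/submit_queues
$ ls /sys/block/nullb0/mq | wc -l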


Note: You can also use any other backend device of your choice instead of nullb0.

Start a VM with a virtio-blk Device

This section describes how to start a VM with a virtio-blk device attached, using the QEMU command. First we will see how to create a VM with only one IOThread attached to the virtio-blk device, and then use the “iothread-vq-mapping” parameter to attach multiple IOThreads.

VM configuration will be:

  • os-release: OL9.4
  • kernel: 6.8.0-2.el9uek.x86_64
  • 16 vCPUs 
  • 16G RAM  

Use Only a Single IOThread

Let’s see how to attach a single IOThread to a virtio-blk disk:

$ qemu-system-x86_64 -smp 16 -m 16G -enable-kvm -cpu host  \
-hda /work/OL9U4_x86_64.qcow2 -name debug-threads=on \
-serial mon:stdio -vnc :7 \
-object iothread,id=iothread0 \
-device virtio-blk-pci,drive=drive0,id=virtblk0,iothread=iothread0,queue-size=1024,config-wce=false \
-drive file=/dev/nullb0,if=none,id=drive0,format=raw,cache=none,aio=native 


Note: The -vnc :7 option starts a VNC server on display 7; display N listens on port 5900 + N, so display 7 listens on port 5907. Connect to localhost:5907 with any VNC client: for example, RealVNC or TightVNC on Windows, or the vncviewer program that ships with your Linux distribution.

Check if multi-queue is enabled in your VM:

$ ls /sys/block/vda/mq
0  1  10  11  12  13  14  15  2  3  4  5  6  7  8  9
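
Optionally, you can also see which guest CPUs are mapped to each virtqueue; this assumes the standard blk-mq sysfs layout inside the guest:

$ grep . /sys/block/vda/mq/*/cpu_list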


Check for the I/O thread:

On the host, use pidstat -t -p <QEMU-pid> to list the threads of the QEMU process.
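
If you need to find the QEMU PID first, one simple way is to search for the process name (adjust the pattern if multiple VMs are running):

$ pgrep -f qemu-system-x86_64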

$ pidstat -t -p  1034347

11:08:11 AM   UID      TGID       TID    %usr %system  %guest   %wait    %CPU   CPU  Command

11:08:11 AM     0   1034347         -    0.00    0.00    0.00    0.00    0.00    29  qemu-system-x86
11:08:11 AM     0         -   1034347    0.00    0.00    0.00    0.00    0.00    29  |__qemu-system-x86
11:08:11 AM     0         -   1034348    0.00    0.00    0.00    0.00    0.00   160  |__qemu-system-x86
11:08:11 AM     0         -   1034349    0.00    0.00    0.00    0.00    0.00   175  |__IO iothread0
11:08:11 AM     0         -   1034353    0.00    0.00    0.00    0.00    0.00    20  |__CPU 0/KVM
11:08:11 AM     0         -   1034354    0.00    0.00    0.00    0.00    0.00    17  |__CPU 1/KVM
11:08:11 AM     0         -   1034355    0.00    0.00    0.00    0.00    0.00    25  |__CPU 2/KVM
11:08:11 AM     0         -   1034356    0.00    0.00    0.00    0.00    0.00    31  |__CPU 3/KVM
11:08:11 AM     0         -   1034357    0.00    0.00    0.00    0.00    0.00    21  |__CPU 4/KVM
11:08:11 AM     0         -   1034358    0.00    0.00    0.00    0.00    0.00    27  |__CPU 5/KVM
11:08:11 AM     0         -   1034359    0.00    0.00    0.00    0.00    0.00    27  |__CPU 6/KVM
11:08:11 AM     0         -   1034360    0.00    0.00    0.00    0.00    0.00    30  |__CPU 7/KVM
11:08:11 AM     0         -   1034361    0.00    0.00    0.00    0.00    0.00    29  |__CPU 8/KVM
11:08:11 AM     0         -   1034362    0.00    0.00    0.00    0.00    0.00    16  |__CPU 9/KVM
11:08:11 AM     0         -   1034363    0.00    0.00    0.00    0.00    0.00    18  |__CPU 10/KVM
11:08:11 AM     0         -   1034364    0.00    0.00    0.00    0.00    0.00    19  |__CPU 11/KVM
11:08:11 AM     0         -   1034365    0.00    0.00    0.00    0.00    0.00    23  |__CPU 12/KVM
11:08:11 AM     0         -   1034366    0.00    0.00    0.00    0.00    0.00    22  |__CPU 13/KVM
11:08:11 AM     0         -   1034367    0.00    0.00    0.00    0.00    0.00    28  |__CPU 14/KVM
11:08:11 AM     0         -   1034368    0.00    0.00    0.00    0.00    0.00    24  |__CPU 15/KVM


Add the “iothread-vq-mapping” Parameter

Now, let’s attach the disk using the ‘iothread-vq-mapping’ parameter to assign virtqueues to IOThreads.

Below is the command-line syntax of this new property:

--device '{"driver":"foo","iothread-vq-mapping":[{"iothread":"iothread0","vqs":[0,1,2]},...]}'

  • iothread: the ID of an IOThread object.
  • vqs: an optional array of virtqueue indices that will be handled by this IOThread.

The following is an alternate syntax that does not require specifying individual virtqueue indices:

--device '{"driver":"foo","iothread-vq-mapping":[{"iothread":"iothread0"},{"iothread":"iothread1"},...]}'


Remember, either all of the IOThread entries must specify vqs, or none of them may.

Allow QEMU to Assign Mapping

Let’s see how to use this feature without specifying the vqs parameter. In this case, the virtqueues are assigned round-robin across the given set of IOThreads.

$ qemu-system-x86_64 -smp 16 -m 16G -enable-kvm -cpu host \
-hda /work/OL9U4_x86_64.qcow2 -name debug-threads=on \
-serial mon:stdio -vnc :7 \
-object iothread,id=iothread0 -object iothread,id=iothread1 \
-object iothread,id=iothread2 -object iothread,id=iothread3 \
-object iothread,id=iothread4 -object iothread,id=iothread5 \
-object iothread,id=iothread6 -object iothread,id=iothread7 \
-object iothread,id=iothread8 -object iothread,id=iothread9 \
-object iothread,id=iothread10 -object iothread,id=iothread11 \
-object iothread,id=iothread12 -object iothread,id=iothread13 \
-object iothread,id=iothread14 -object iothread,id=iothread15 \
-drive file=/dev/nullb0,if=none,id=drive0,format=raw,cache=none,aio=native \
--device '{"driver":"virtio-blk-pci","iothread-vq-mapping":[{"iothread":"iothread0"},{"iothread":"iothread1"},{"iothread":"iothread2"},{"iothread":"iothread3"},{"iothread":"iothread4"},{"iothread":"iothread5"},{"iothread":"iothread6"},{"iothread":"iothread7"},{"iothread":"iothread8"},{"iothread":"iothread9"},{"iothread":"iothread10"},{"iothread":"iothread11"},{"iothread":"iothread12"},{"iothread":"iothread13"},{"iothread":"iothread14"},{"iothread":"iothread15"}],"drive":"drive0","queue-size":1024,"config-wce":false}' 


To check how the IOThreads are utilized when this feature is in use, see the section Run the Tests.

Manually Assign virtqueues to IOThreads

Single vq per IOThread

Now we will specify which vq is associated with which IOThread. IOThreads are specified by name, and virtqueues are specified by 0-based index.

In this example, we assign each of the 16 vqs to one of 16 IOThreads:

$ qemu-system-x86_64 -smp 16 -m 16G -enable-kvm -cpu host \
-serial mon:stdio -vnc :7 \
-object iothread,id=iothread0 -object iothread,id=iothread1 \
-object iothread,id=iothread2 -object iothread,id=iothread3 \
-object iothread,id=iothread4 -object iothread,id=iothread5 \
-object iothread,id=iothread6 -object iothread,id=iothread7 \
-object iothread,id=iothread8 -object iothread,id=iothread9 \
-object iothread,id=iothread10 -object iothread,id=iothread11 \
-object iothread,id=iothread12 -object iothread,id=iothread13 \
-object iothread,id=iothread14 -object iothread,id=iothread15 \
-drive file=/dev/nullb0,if=none,id=drive0,format=raw,cache=none,aio=native \
--device '{"driver":"virtio-blk-pci","iothread-vq-mapping":[{"iothread":"iothread0","vqs": [0]},{"iothread":"iothread1","vqs": [1]},{"iothread":"iothread2","vqs": [2]},{"iothread":"iothread3","vqs": [3]},{"iothread":"iothread4","vqs": [4]},{"iothread":"iothread5","vqs": [5]},{"iothread":"iothread6","vqs": [6]},{"iothread":"iothread7","vqs": [7]},{"iothread":"iothread8","vqs": [8]},{"iothread":"iothread9","vqs": [9]},{"iothread":"iothread10","vqs": [10]},{"iothread":"iothread11","vqs": [11]},{"iothread":"iothread12","vqs": [12]},{"iothread":"iothread13","vqs": [13]},{"iothread":"iothread14","vqs": [14]},{"iothread":"iothread15","vqs": [15]}],"drive":"drive0","queue-size":1024,"config-wce":false}' \


Multiple vqs per IOThread

Below is the --device definition showing how multiple virtqueues can be assigned to a single IOThread.
Here we assign two vqs each to 8 IOThreads:

--device '{"driver":"virtio-blk-pci","iothread-vq-mapping":[{"iothread":"iothread0","vqs": [0,1]},{"iothread":"iothread1","vqs": [2,3]},{"iothread":"iothread2","vqs": [4,5]},{"iothread":"iothread3","vqs": [6,7]},{"iothread":"iothread4","vqs": [8,9]},{"iothread":"iothread5","vqs": [10,11]},{"iothread":"iothread6","vqs": [12,13]},{"iothread":"iothread7","vqs": [14,15]}],"drive":"drive0","queue-size":1024,"config-wce":false}' 

Test with Fio

Pinning QEMU Threads

Check your system’s CPU configuration by running the lscpu -e command. On our system, CPUs 16 to 31 are on the same NUMA node and are free. Since our VM has 16 vCPUs, we’ll bind all the QEMU threads to 16 physical CPUs on the host.

Run the following command on the host:

$ taskset -cp -a 16-31 <QEMU-pid>
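
To verify that the affinity was applied to all of the QEMU threads, you can query it back with taskset; substitute your own QEMU PID:

$ taskset -acp <QEMU-pid>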


Run the Tests

With this new feature in use, pidstat -t 1 will show that VMs with -smp 2 or higher are able to make use of multiple IOThreads. Let’s run the test below to verify this.

Inside the guest, run this fio workload, which spreads I/O across all 16 queues:

$ cat randread.fio

[global]
bs=4K
iodepth=64
direct=1
ioengine=libaio
group_reporting
time_based
runtime=60
numjobs=16
name=standard-iops
rw=randread
cpus_allowed=0-15

[job1]
filename=/dev/vda

# fio randread.fio
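
While the fio job is running, you can optionally watch the device utilization from inside the guest with iostat (from the sysstat package); this is just a quick sanity check and is not required for the test:

$ iostat -x vda 1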


pidstat output on the host:

$ pidstat -t -p  3465871 1

12:49:55 PM   UID      TGID       TID    %usr %system  %guest   %wait    %CPU   CPU  Command

12:49:55 PM     0   3465871         -  412.80  486.80  578.20    0.00 1477.80    26  qemu-system-x86
12:49:55 PM     0         -   3465871    0.00    0.20    0.00    0.00    0.20    26  |__qemu-system-x86
12:49:55 PM     0         -   3465872    0.00    0.00    0.00    0.00    0.00   135  |__qemu-system-x86
12:49:55 PM     0         -   3465873   25.80   29.60    0.00   26.40   55.40    22  |__IO iothread0
12:49:55 PM     0         -   3465874   24.80   28.60    0.00   27.20   53.40    31  |__IO iothread1
12:49:55 PM     0         -   3465875   26.60   28.80    0.00   26.20   55.40    24  |__IO iothread2
12:49:55 PM     0         -   3465876   26.40   27.80    0.00   26.60   54.20    30  |__IO iothread3
12:49:55 PM     0         -   3465877   27.00   28.00    0.00   26.40   55.00    23  |__IO iothread4
12:49:55 PM     0         -   3465878   26.40   28.60    0.00   26.40   55.00    21  |__IO iothread5
12:49:55 PM     0         -   3465879   24.80   29.20    0.00   27.00   54.00    18  |__IO iothread6
12:49:55 PM     0         -   3465880   26.40   29.60    0.00   26.60   56.00    25  |__IO iothread7
12:49:55 PM     0         -   3465881   25.60   28.20    0.00   26.40   53.80    17  |__IO iothread8
12:49:55 PM     0         -   3465882   25.80   28.00    0.00   27.00   53.80    26  |__IO iothread9
12:49:55 PM     0         -   3465883   26.40   28.00    0.00   26.80   54.40    16  |__IO iothread10
12:49:55 PM     0         -   3465884   26.00   29.00    0.00   26.20   55.00    24  |__IO iothread11
12:49:55 PM     0         -   3465885   25.80   29.00    0.00   26.80   54.80    26  |__IO iothread12
12:49:55 PM     0         -   3465886   25.80   29.00    0.00   26.60   54.80    27  |__IO iothread13
12:49:55 PM     0         -   3465887   25.80   28.00    0.00   26.00   53.80    21  |__IO iothread14
12:49:55 PM     0         -   3465888   26.20   29.60    0.00   26.80   55.80    17  |__IO iothread15
12:49:55 PM     0         -   3465891    0.60    2.80   36.40   19.20   39.80    16  |__CPU 0/KVM
12:49:55 PM     0         -   3465893    0.00    2.20   35.20   17.80   37.40    28  |__CPU 1/KVM
12:49:55 PM     0         -   3465894    0.00    2.40   36.40   16.00   38.40    20  |__CPU 2/KVM
12:49:55 PM     0         -   3465895    0.00    2.20   35.80   16.20   37.20    19  |__CPU 3/KVM
12:49:55 PM     0         -   3465896    0.00    2.00   36.40   15.80   37.80    31  |__CPU 4/KVM
12:49:55 PM     0         -   3465897    0.00    2.40   35.60   15.80   38.00    23  |__CPU 5/KVM
12:49:55 PM     0         -   3465898    0.00    1.80   35.20   16.60   36.60    18  |__CPU 6/KVM
12:49:55 PM     0         -   3465899    0.00    2.40   36.00   15.00   38.20    25  |__CPU 7/KVM
12:49:55 PM     0         -   3465900    0.00    1.80   35.80   15.80   37.20    20  |__CPU 8/KVM
12:49:55 PM     0         -   3465901    0.00    2.20   35.80   15.40   36.80    29  |__CPU 9/KVM
12:49:55 PM     0         -   3465902    0.00    2.20   35.80   16.40   37.60    27  |__CPU 10/KVM
12:49:55 PM     0         -   3465903    0.00    1.80   36.80   16.20   38.00    21  |__CPU 11/KVM
12:49:55 PM     0         -   3465904    0.00    1.80   38.00   16.40   38.20    19  |__CPU 12/KVM
12:49:55 PM     0         -   3465905    0.00    2.00   36.80   15.80   37.40    23  |__CPU 13/KVM
12:49:55 PM     0         -   3465906    0.00    1.80   35.00   16.20   37.20    28  |__CPU 14/KVM
12:49:55 PM     0         -   3465907    0.00    2.20   36.60   15.20   38.00    30  |__CPU 15/KVM


After running the fio workload above, we observed that the number of IOPS scaled several times higher with the ‘Allow QEMU to Assign Mapping’ VM configuration than with a single IOThread.

You can test on your setup with different configurations to see how the performance looks.

Conclusion

In this article, we saw how to set up a VM using the newly introduced ‘iothread-vq-mapping’ feature and explored its different use cases. We also tested this new feature with the fio tool to see how it can help improve performance.

References