In the 3.13 Linux Kernel, a new block layer queuing mechanism, blk-mq, was added. This allowed users to take advantage of modern hardware and drivers that export multiple queues, so IO requests could be sent from multiple CPUs in parallel more efficiently. QEMU has long supported multi-queue storage, but performance has been limited by a bottleneck caused by its use of a single thread per device to service IO. In QEMU 9.0, this was fixed for virtio-blk and the Linux kernel vhost-scsi target.

As discussed in a previous blog, virtio-blk added the iothread-vq-mapping feature to allow users to create multiple iothreads and map them to different virtqueues. Similarly, vhost-scsi added a feature, worker_per_virtqueue, that allows users to create a vhost worker thread per virtqueue. In this article we will describe how to prepare a KVM host, set up a vhost-scsi target, create a VM that uses the new vhost-scsi feature, and then run some IO tests.

If you’ve read the virtio-blk blog about the iothread-vq-mapping feature, you can use the same KVM host setup. The kernel installation and QEMU build are the same, so skip ahead to the Install Target Tools section. If this is a new setup, proceed to the next section to set up the host.

Setup the KVM Host

The distro used for the KVM host will be Oracle Linux 9 with a UEK-next Linux Kernel installed. In the next sections, we will install a UEK-next kernel, build QEMU 9.0, create a null block device for testing, and then add it to a vhost-scsi target.

Install a UEK-next Linux Kernel

First, install the gpg key, then add the UEK-next yum repository.

$ rpm --import https://yum.oracle.com/RPM-GPG-KEY-oracle-development
$ dnf config-manager --add-repo "https://yum.oracle.com/repo/OracleLinux/OL9/developer/UEKnext/x86_64"

Now install the UEK-next kernel:

$ dnf install kernel-ueknext

Dependencies resolved.
========================================================================================================================================
 Package                        Arch      Version                 Repository                                                       Size
========================================================================================================================================
Installing:
 kernel-ueknext                 x86_64    6.10.0-2.el9ueknext     yum.oracle.com_repo_OracleLinux_OL9_developer_UEKnext_x86_64    286 k
Installing dependencies:
 kernel-ueknext-core            x86_64    6.10.0-2.el9ueknext     yum.oracle.com_repo_OracleLinux_OL9_developer_UEKnext_x86_64     17 M
 kernel-ueknext-modules         x86_64    6.10.0-2.el9ueknext     yum.oracle.com_repo_OracleLinux_OL9_developer_UEKnext_x86_64     53 M
 kernel-ueknext-modules-core    x86_64    6.10.0-2.el9ueknext     yum.oracle.com_repo_OracleLinux_OL9_developer_UEKnext_x86_64     35 M

Transaction Summary
========================================================================================================================================
Install  4 Packages

Total download size: 106 M
Installed size: 160 M
Is this ok [y/N]: y
Downloading Packages:
(1/4): kernel-ueknext-6.10.0-2.el9ueknext.x86_64.rpm                                                    858 kB/s | 286 kB     00:00
(2/4): kernel-ueknext-core-6.10.0-2.el9ueknext.x86_64.rpm                                                17 MB/s |  17 MB     00:00
(3/4): kernel-ueknext-modules-6.10.0-2.el9ueknext.x86_64.rpm                                             20 MB/s |  53 MB     00:02
(4/4): kernel-ueknext-modules-core-6.10.0-2.el9ueknext.x86_64.rpm                                        12 MB/s |  35 MB     00:02
----------------------------------------------------------------------------------------------------------------------------------------
Total                                                                                                    33 MB/s | 106 MB     00:03
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                                                                1/1
  Installing       : kernel-ueknext-modules-core-6.10.0-2.el9ueknext.x86_64                                                         1/4
  Running scriptlet: kernel-ueknext-core-6.10.0-2.el9ueknext.x86_64                                                                 2/4
  Installing       : kernel-ueknext-core-6.10.0-2.el9ueknext.x86_64                                                                 2/4
  Running scriptlet: kernel-ueknext-core-6.10.0-2.el9ueknext.x86_64                                                                 2/4
  Installing       : kernel-ueknext-modules-6.10.0-2.el9ueknext.x86_64                                                              3/4
  Running scriptlet: kernel-ueknext-modules-6.10.0-2.el9ueknext.x86_64                                                              3/4
  Installing       : kernel-ueknext-6.10.0-2.el9ueknext.x86_64                                                                      4/4
  Running scriptlet: kernel-ueknext-modules-core-6.10.0-2.el9ueknext.x86_64                                                         4/4
  Running scriptlet: kernel-ueknext-core-6.10.0-2.el9ueknext.x86_64                                                                 4/4
  Running scriptlet: kernel-ueknext-modules-6.10.0-2.el9ueknext.x86_64                                                              4/4
  Running scriptlet: kernel-ueknext-6.10.0-2.el9ueknext.x86_64                                                                      4/4
  Verifying        : kernel-ueknext-6.10.0-2.el9ueknext.x86_64                                                                      1/4
  Verifying        : kernel-ueknext-core-6.10.0-2.el9ueknext.x86_64                                                                 2/4
  Verifying        : kernel-ueknext-modules-6.10.0-2.el9ueknext.x86_64                                                              3/4
  Verifying        : kernel-ueknext-modules-core-6.10.0-2.el9ueknext.x86_64                                                         4/4

Installed:
  kernel-ueknext-6.10.0-2.el9ueknext.x86_64                        kernel-ueknext-core-6.10.0-2.el9ueknext.x86_64
  kernel-ueknext-modules-6.10.0-2.el9ueknext.x86_64                kernel-ueknext-modules-core-6.10.0-2.el9ueknext.x86_64

Complete!

Once installation has completed, reboot the server and confirm that we have booted into the new kernel:

$ uname -r
6.10.0-2.el9ueknext.x86_64

Build QEMU 9.0

To build QEMU, run the commands below:

Notes:

  • We will do a build for an x86_64 target.
  • NR_CPUS is the number of parallel build jobs to use; set it to the number of CPUs on your system, which you can find with the lscpu command (or see the snippet after the build commands below).
$ wget https://download.qemu.org/qemu-9.0.0.tar.xz
$ tar xvJf qemu-9.0.0.tar.xz
$ cd qemu-9.0.0
$ ./configure --target-list=x86_64-softmmu
$ make -j $NR_CPUS
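
If you would rather not look up the CPU count by hand, you can set NR_CPUS automatically. A minimal sketch, assuming GNU coreutils' nproc is available:

$ NR_CPUS=$(nproc)    # nproc prints the number of available CPUs
$ make -j $NR_CPUS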

In the next sections, we will run the QEMU command directly from the build directory to create a virtual machine (VM).

Setup a vhost-scsi Device on the Host

Create a NULL Device for Testing

The null block device (/dev/nullb*) emulates a block device whose size is set by a module parameter. Its purpose is to test the different block-layer implementations. Instead of performing physical read/write operations, it simply marks requests as complete as they are dequeued.

Create a null-block device /dev/nullb0:

$ sudo modprobe null_blk hw_queue_depth=1024 gb=100 submit_queues=$NR_CPUS nr_devices=1 max_sectors=1024

Set submit_queues to the number of CPUs on your system, the size (gb) to 100 GB, the queue depth (hw_queue_depth) to 1024 commands, and the maximum IO size (max_sectors) to 512 KB, i.e. 1024 sectors of 512 bytes.

Confirm creation of the device:

$ ls -l /dev/nullb*
brw-rw----. 1 root disk 251, 0 Jul 13 13:38 /dev/nullb0
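
You can also double-check that the module parameters took effect by reading the device's standard block-layer attributes. A quick sanity check (exact values depend on the parameters passed to modprobe above):

$ ls /sys/block/nullb0/mq | wc -l                  # number of hardware queues (submit_queues)
$ cat /sys/block/nullb0/queue/max_hw_sectors_kb    # maximum IO size in KB
$ lsblk -b -o NAME,SIZE /dev/nullb0                # device size in bytes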

Note: You can use any backing device of your choice instead of nullb0, but remember to replace nullb0 with your device when setting up the vhost-scsi target in later sections.

Install Target Tools

$ sudo dnf install targetcli

Setup a vhost-scsi Target

Open the targetcli shell:

$ targetcli

Create a vhost-scsi target and map /dev/nullb0 to a LUN, backed by a block backstore that we name disk0:

/> backstores/block create name=disk0 dev=/dev/nullb0
Created block storage object disk0 using /dev/nullb0.
/> backstores/block/disk0 set attribute emulate_pr=0
Parameter emulate_pr is now '0'.
/> vhost/ create
Created target naa.5001405b58bf97f1.
Created TPG 1.
/> vhost/naa.5001405b58bf97f1/tpg1/luns create /backstores/block/disk0
Created LUN 0.
/> exit

Above we performed one extra step and set emulate_pr=0. This disables persistent reservation (PR) emulation in the target layer; instead, the target passes PR commands directly to the backing device. This avoids a possible performance issue that limits IOPS and throughput.
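
To confirm the attribute was applied, the LIO settings are exposed through configfs. A quick check (the iblock HBA index may differ on your system, so a wildcard is used here), which should print 0:

$ sudo sh -c 'cat /sys/kernel/config/target/core/iblock_*/disk0/attrib/emulate_pr'
0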

Run QEMU with a vhost-scsi Device

In the example in this section, we will pass the vhost-scsi device to QEMU on the command line.

Increase Queuing Limits

We want to increase the SCSI host (virtqueue_size) and logical unit (cmd_per_lun) queuing limits, so the guest avoids waiting for free commands as much as possible. The default value for both limits is 128, and we will increase them to 1024, the maximum QEMU allows, by adding the following to the device definition:

cmd_per_lun=1024,virtqueue_size=1024

Create a Worker Per Queue

By default, the vhost layer uses a single worker thread per device to service IO. After we increase the queuing limits, this becomes a bottleneck. To process all the commands the guest can now send, we will enable the worker_per_virtqueue feature. Setting it to true tells QEMU and the kernel to create a vhost worker thread per virtqueue.

worker_per_virtqueue=true

QEMU -device Definition

Finally, to add the vhost-scsi device we created with targetcli (wwpn=naa.5001405b58bf97f1) with the settings above, use the following with your QEMU call:

-device vhost-scsi-pci,wwpn=naa.5001405b58bf97f1,cmd_per_lun=1024,virtqueue_size=1024,worker_per_virtqueue=true

Start QEMU

You can add the -device definition from above to an existing QEMU command and start it as normal, but you may need to increase the number of vCPUs. To take advantage of multi-queue support we need to spread IO over multiple queues, which means sending IO from different CPUs. The minimum number of vCPUs we will test with is 4 and the maximum is 16.

The following command creates a VM with 16 vCPUs and 16GB of memory. It uses an Oracle Linux 9 image onto which we have installed the UEK-next kernel.

$ qemu-system-x86_64 -smp 16 -m 16G -enable-kvm -cpu host  \
-hda /work/OL9U4_x86_64.qcow2 -name debug-threads=on \
-serial mon:stdio -vnc :7 \
-device vhost-scsi-pci,wwpn=naa.5001405b58bf97f1,cmd_per_lun=1024,virtqueue_size=1024,worker_per_virtqueue=true

Note: The -vnc :7 option above makes QEMU listen for VNC connections on display 7, which corresponds to TCP port 5907 (display X listens on port 5900 + X, so display 0 listens on 5900, display 1 on 5901, and so on). Connect to the VM's console with any VNC client: for example, RealVNC or TightVNC on Windows, or the vncviewer program that comes with your Linux distribution.
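
For example, from a Linux client you could forward the VNC port over SSH and connect with vncviewer (user@kvm-host is a placeholder for your KVM host's address):

$ ssh -L 5907:localhost:5907 user@kvm-host    # forward VNC display 7 (port 5907) to your workstation
$ vncviewer localhost:7                       # display 7 corresponds to TCP port 5907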

Bind QEMU to CPUs

Check your system’s CPU configuration by running the lscpu -e command. On our system, CPUs 16 to 31 are on the same NUMA node and are free. Since our VM has 16 vCPUs, we’ll bind all the QEMU threads to 16 physical CPUs on the host.

Run the following command on the host to bind all the QEMU threads to physical CPUs 16 to 31.

$ taskset -cp -a 16-31 <QEMU-pid>
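
To confirm the affinity was applied to every QEMU thread, including the vhost workers, you can read each thread's affinity mask back. A quick sketch, replacing <QEMU-pid> as above:

$ pgrep -f qemu-system-x86_64                 # one way to find <QEMU-pid>
$ for tid in /proc/<QEMU-pid>/task/*; do taskset -cp ${tid##*/}; done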

Verify Settings

Before we run IO, we will check that our settings are being used. To get this information we need the SCSI host number and the /dev entry on the guest. To see how the target and LUN created by targetcli got mapped to the guest OS’s names/IDs, run lsscsi in the VM:

$ lsscsi
[0:0:1:0]    disk    LIO-ORG  disk0            4.0   /dev/sda

When we created the block device with targetcli we gave it the name “disk0”, and we see it above attached to [0:0:1:0]. This is [SCSI host 0: bus 0: target 1: LUN 0] and the OS’s name for the device is sda.

Below we will use host0 for the SCSI Host and sda for the block device name in our commands to verify settings and for our fio tests.
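
If lsscsi is not installed in the guest, the same information is available from sysfs; for example:

$ cat /sys/block/sda/device/vendor /sys/block/sda/device/model
LIO-ORG
disk0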

Check Multi-queue Is Enabled

In the VM, the command:

$ ls /sys/block/sda/mq
0  1  10  11  12  13  14  15  2  3  4  5  6  7  8  9

should display a directory per CPU. In the example above, the VM has 16 CPUs so we see 0-15. Each directory represents a hardware queue, and should have a cpu_list file that prints out a single CPU that the queue is mapped to:

$ more /sys/block/sda/mq/*/cpu_list
::::::::::::::
/sys/block/sda/mq/0/cpu_list
::::::::::::::
0
::::::::::::::
/sys/block/sda/mq/1/cpu_list
::::::::::::::
1
::::::::::::::
/sys/block/sda/mq/2/cpu_list
::::::::::::::
2
::::::::::::::
/sys/block/sda/mq/3/cpu_list
::::::::::::::
3
::::::::::::::
/sys/block/sda/mq/4/cpu_list
::::::::::::::
4
...

Verify Queuing Limits

To check that the new cmd_per_lun value is being used, run the following in the VM:

$ cat /sys/block/sda/queue/nr_requests
1024
$ cat /sys/class/scsi_host/host0/cmd_per_lun
1024

And to check virtqueue_size, run:

$ cat /sys/class/scsi_host/host0/can_queue
1024

Verify Worker Per Virtqueue

Use pidstat -t -p <QEMU-pid> to check that we have created a vhost worker thread per virtqueue.

$ pidstat -t -p 3645
09:18:40 PM   UID      TGID       TID    %usr %system  %guest   %wait    %CPU   CPU  Command
...
09:18:40 PM     0         -      3651    0.14    0.06    0.08    0.00    0.28     6  |__CPU 0/KVM
09:18:40 PM     0         -      3652    0.00    0.00    0.02    0.00    0.03    35  |__CPU 1/KVM
09:18:40 PM     0         -      3653    0.00    0.00    0.01    0.00    0.01    17  |__CPU 2/KVM
09:18:40 PM     0         -      3654    0.00    0.00    0.01    0.00    0.02    38  |__CPU 3/KVM
09:18:40 PM     0         -      3655    0.00    0.00    0.01    0.00    0.01    39  |__CPU 4/KVM
09:18:40 PM     0         -      3656    0.00    0.01    0.02    0.00    0.03     1  |__CPU 5/KVM
09:18:40 PM     0         -      3657    0.00    0.01    0.01    0.00    0.02    12  |__CPU 6/KVM
09:18:40 PM     0         -      3658    0.00    0.00    0.01    0.00    0.01     5  |__CPU 7/KVM
09:18:40 PM     0         -      3659    0.00    0.00    0.02    0.00    0.03    16  |__CPU 8/KVM
09:18:40 PM     0         -      3660    0.00    0.00    0.01    0.00    0.01    27  |__CPU 9/KVM
09:18:40 PM     0         -      3661    0.00    0.01    0.01    0.00    0.02    19  |__CPU 10/KVM
09:18:40 PM     0         -      3662    0.00    0.00    0.01    0.00    0.01     8  |__CPU 11/KVM
09:18:40 PM     0         -      3663    0.00    0.00    0.01    0.00    0.02     9  |__CPU 12/KVM
09:18:40 PM     0         -      3664    0.00    0.00    0.01    0.00    0.02    29  |__CPU 13/KVM
09:18:40 PM     0         -      3665    0.00    0.00    0.01    0.00    0.01    14  |__CPU 14/KVM
09:18:40 PM     0         -      3666    0.00    0.00    0.01    0.00    0.01    32  |__CPU 15/KVM
09:18:40 PM     0         -      3668    0.00    0.00    0.00    0.00    0.00    10  |__vhost-3645
09:18:40 PM     0         -      3669    0.00    0.00    0.00    0.00    0.00    35  |__vhost-3645
09:18:40 PM     0         -      3670    0.00    0.00    0.00    0.00    0.00    37  |__vhost-3645
09:18:40 PM     0         -      3671    0.00    0.00    0.00    0.00    0.00    11  |__vhost-3645
09:18:40 PM     0         -      3672    0.00    0.00    0.00    0.00    0.00     3  |__vhost-3645
09:18:40 PM     0         -      3673    0.00    0.00    0.00    0.00    0.00     0  |__vhost-3645
09:18:40 PM     0         -      3674    0.00    0.00    0.00    0.00    0.00    10  |__vhost-3645
09:18:40 PM     0         -      3675    0.00    0.00    0.00    0.00    0.00    26  |__vhost-3645
09:18:40 PM     0         -      3676    0.00    0.00    0.00    0.00    0.00    28  |__vhost-3645
09:18:40 PM     0         -      3677    0.00    0.00    0.00    0.00    0.00    10  |__vhost-3645
09:18:40 PM     0         -      3678    0.00    0.00    0.00    0.00    0.00    10  |__vhost-3645
09:18:40 PM     0         -      3679    0.00    0.00    0.00    0.00    0.00    11  |__vhost-3645
09:18:40 PM     0         -      3680    0.00    0.00    0.00    0.00    0.00     1  |__vhost-3645
09:18:40 PM     0         -      3681    0.00    0.00    0.00    0.00    0.00    18  |__vhost-3645
09:18:40 PM     0         -      3682    0.00    0.00    0.00    0.00    0.00    10  |__vhost-3645
09:18:40 PM     0         -      3683    0.00    0.00    0.00    0.00    0.00    29  |__vhost-3645
...

QEMU 9.0 creates a virtqueue per vCPU by default, so with worker_per_virtqueue set to true we should see the same number of vhost threads as CPU threads. Above we see 16 CPU threads and 16 vhost threads, so we know the feature was set up correctly. Note that if you are not using the example QEMU command from this article, you may see additional vhost threads if vhost is also being used for networking or other devices.
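
A quick way to count the vhost worker threads without reading through the whole pidstat output (run on the host, using the same QEMU PID as above):

$ cat /proc/3645/task/*/comm | grep -c '^vhost'
16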

Test It Out

Run Fio

Run this fio workload that spreads I/O across 4 queues:

$ cat randread.4jobs.fio

[global]
bs=4K
iodepth=128
direct=1
ioengine=libaio
group_reporting
time_based
runtime=120
name=standard-iops
rw=randread
numjobs=4
cpus_allowed=0-3

[job1]
filename=/dev/sda

$ fio randread.4jobs.fio

The cpus_allowed=0-3 option tells fio to spread the 4 jobs we specified with numjobs=4 over CPUs 0 to 3. Because we have a queue per CPU, this spreads IO across 4 queues. We can increase numjobs and add more CPUs to increase performance by using more queues. Here we use 8:

$ cat randread.8jobs.fio

[global]
bs=4K
iodepth=128
direct=1
ioengine=libaio
group_reporting
time_based
runtime=120
name=standard-iops
rw=randread
numjobs=8
cpus_allowed=0-7

[job1]
filename=/dev/sda

$ fio randread.8jobs.fio

As you utilize more queues, IOPS should increase close to linearly. However, depending on factors like the performance of the physical CPUs, scaling will level off before fio is using all 16 queues we created the VM with. This is caused by the vCPU and vhost-scsi threads sharing the same physical CPUs and eventually reaching 100% usage.
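
One way to see where scaling stops is to watch the host-side CPU usage of the CPU x/KVM and vhost threads while fio runs; for example, with top in threads mode on the host:

$ top -H -p <QEMU-pid>    # -H shows per-thread CPU usage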

Conclusion

In this article, we saw how to set up a VM and use vhost-scsi’s worker_per_virtqueue feature, which was added in QEMU 9.0 and the 6.5 Linux Kernel. By enabling this feature, applications in the guest can now execute IO through multiple queues and avoid the bottleneck that was limiting performance.

References