In the 3.13 Linux Kernel a new block layer queuing mechanism, blk-mq, was added. It allowed users to take advantage of modern hardware and drivers that export multiple queues, so IO requests could be sent from multiple CPUs in parallel more efficiently. QEMU has long supported multi-queue storage, but performance has been limited by a bottleneck caused by its use of a single thread per device to service IO. In QEMU 9.0, this was fixed for virtio-blk and the Linux kernel vhost-scsi target.
As was discussed in a previous blog, virtio-blk added the iothread-vq-mapping feature, which allows users to create multiple iothreads and map them to different virtqueues. Similarly, vhost-scsi added a feature, worker_per_virtqueue, that allows users to create a vhost worker thread per virtqueue. In this article we will describe how to prepare a KVM host, set up a vhost-scsi target, create a VM with the new vhost-scsi feature, and then run some IO tests.
If you’ve read the virtio-blk blog about the iothread-vq-mapping feature, you can use the same KVM host setup. The kernel installation and QEMU build are the same, so skip ahead to the Install Target Tools section. If this is a new setup, proceed to the next section to set up the host.
Setup the KVM Host
The distro used for the KVM host is Oracle Linux 9 with a UEK-next Linux Kernel installed. In the next sections, we will see how to install a UEK-next kernel, build QEMU 9.0, create a NULL device for testing, and then add it to a vhost-scsi target.
Install a UEK-next Linux Kernel
First, install the gpg key, then add the UEK-next yum repository.
$ rpm --import https://yum.oracle.com/RPM-GPG-KEY-oracle-development
$ dnf config-manager --add-repo "https://yum.oracle.com/repo/OracleLinux/OL9/developer/UEKnext/x86_64"
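If you want to confirm that the repository was registered before installing, an optional check is to list the enabled repositories; the exact repository id dnf derives from the URL may differ on your system.

$ dnf repolist | grep -i ueknext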
Now install the UEK-next kernel:
$ dnf install kernel-ueknext
Dependencies resolved.
==================================================================================================================
 Package                       Arch    Version              Repository                                                      Size
==================================================================================================================
Installing:
 kernel-ueknext                x86_64  6.10.0-2.el9ueknext  yum.oracle.com_repo_OracleLinux_OL9_developer_UEKnext_x86_64  286 k
Installing dependencies:
 kernel-ueknext-core           x86_64  6.10.0-2.el9ueknext  yum.oracle.com_repo_OracleLinux_OL9_developer_UEKnext_x86_64   17 M
 kernel-ueknext-modules        x86_64  6.10.0-2.el9ueknext  yum.oracle.com_repo_OracleLinux_OL9_developer_UEKnext_x86_64   53 M
 kernel-ueknext-modules-core   x86_64  6.10.0-2.el9ueknext  yum.oracle.com_repo_OracleLinux_OL9_developer_UEKnext_x86_64   35 M

Transaction Summary
==================================================================================================================
Install  4 Packages

Total download size: 106 M
Installed size: 160 M
Is this ok [y/N]: y
Downloading Packages:
(1/4): kernel-ueknext-6.10.0-2.el9ueknext.x86_64.rpm                                    858 kB/s | 286 kB   00:00
(2/4): kernel-ueknext-core-6.10.0-2.el9ueknext.x86_64.rpm                                17 MB/s |  17 MB   00:00
(3/4): kernel-ueknext-modules-6.10.0-2.el9ueknext.x86_64.rpm                             20 MB/s |  53 MB   00:02
(4/4): kernel-ueknext-modules-core-6.10.0-2.el9ueknext.x86_64.rpm                        12 MB/s |  35 MB   00:02
------------------------------------------------------------------------------------------------------------------
Total                                                                                    33 MB/s | 106 MB   00:03
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                                          1/1
  Installing       : kernel-ueknext-modules-core-6.10.0-2.el9ueknext.x86_64                                   1/4
  Running scriptlet: kernel-ueknext-core-6.10.0-2.el9ueknext.x86_64                                           2/4
  Installing       : kernel-ueknext-core-6.10.0-2.el9ueknext.x86_64                                           2/4
  Running scriptlet: kernel-ueknext-core-6.10.0-2.el9ueknext.x86_64                                           2/4
  Installing       : kernel-ueknext-modules-6.10.0-2.el9ueknext.x86_64                                        3/4
  Running scriptlet: kernel-ueknext-modules-6.10.0-2.el9ueknext.x86_64                                        3/4
  Installing       : kernel-ueknext-6.10.0-2.el9ueknext.x86_64                                                4/4
  Running scriptlet: kernel-ueknext-modules-core-6.10.0-2.el9ueknext.x86_64                                   4/4
  Running scriptlet: kernel-ueknext-core-6.10.0-2.el9ueknext.x86_64                                           4/4
  Running scriptlet: kernel-ueknext-modules-6.10.0-2.el9ueknext.x86_64                                        4/4
  Running scriptlet: kernel-ueknext-6.10.0-2.el9ueknext.x86_64                                                4/4
  Verifying        : kernel-ueknext-6.10.0-2.el9ueknext.x86_64                                                1/4
  Verifying        : kernel-ueknext-core-6.10.0-2.el9ueknext.x86_64                                           2/4
  Verifying        : kernel-ueknext-modules-6.10.0-2.el9ueknext.x86_64                                        3/4
  Verifying        : kernel-ueknext-modules-core-6.10.0-2.el9ueknext.x86_64                                   4/4

Installed:
  kernel-ueknext-6.10.0-2.el9ueknext.x86_64                 kernel-ueknext-core-6.10.0-2.el9ueknext.x86_64
  kernel-ueknext-modules-6.10.0-2.el9ueknext.x86_64         kernel-ueknext-modules-core-6.10.0-2.el9ueknext.x86_64

Complete!
Once installation has completed, reboot the server and confirm that we have booted into the new kernel:
$ uname -r
6.10.0-2.el9ueknext.x86_64
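If the server did not boot the UEK-next kernel by default, one way to make it the default boot entry on an Oracle Linux 9 system is with grubby, as sketched below. The kernel path matches the version installed above; adjust it if your installed version differs.

# List installed kernels and their indexes (optional sanity check).
$ sudo grubby --info=ALL | grep -E "^(index|kernel)"
# Make the UEK-next kernel the default, then reboot.
$ sudo grubby --set-default /boot/vmlinuz-6.10.0-2.el9ueknext.x86_64
$ sudo reboot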
Build QEMU 9.0
To build QEMU, run the commands below:
Notes:
- We will do a build for an x86_64 target.
- NR_CPUS is the number of parallel jobs to run the build with. To get the system's number of CPUs, use the lscpu command.
$ wget https://download.qemu.org/qemu-9.0.0.tar.xz
$ tar xvJf qemu-9.0.0.tar.xz
$ cd qemu-9.0.0
$ ./configure --target-list=x86_64-softmmu
$ make -j $NR_CPUS
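If you prefer not to count CPUs by hand, one way to set NR_CPUS is shown below; nproc is part of coreutils and reports the number of usable CPUs, matching what lscpu shows.

# Set NR_CPUS to the number of online CPUs and build with that many jobs.
$ NR_CPUS=$(nproc)
$ make -j $NR_CPUS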
In the next sections, we will run the QEMU command directly from the build directory to create a virtual machine (VM).
Setup a vhost-scsi Device on The Host
Create a NULL Device for Testing
The null-block device (/dev/nullb*) emulates a block device of a configurable size. Its purpose is to test the different block-layer implementations. Instead of performing physical read/write operations, it simply marks requests as complete as they are dequeued.
Create a null-block device /dev/nullb0:
$ sudo modprobe null_blk hw_queue_depth=1024 gb=100 submit_queues=$NR_CPUS nr_devices=1 max_sectors=1024
Use the number of CPUs on your system for submit_queues, set the size to 100GB, the queue depth to 1024 commands, and the maximum IO size to 512K (max_sectors is given in 512-byte sectors, so 1024 sectors equals 512K).
Confirm creation of the device:
$ ls -l /dev/nullb*
brw-rw----. 1 root disk 251, 0 Jul 13 13:38 /dev/nullb0
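You can also confirm that the submit_queues setting took effect. For blk-mq devices, each hardware queue appears as a directory under /sys/block/<dev>/mq (the same interface we use later in the guest), so the directory count should match the number of CPUs you passed in:

# The number of hardware queue directories should equal submit_queues.
$ ls /sys/block/nullb0/mq | wc -l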
Note: You can use any back end device of your choice instead of nullb0, but remember to replace nullb0 with your device when setting up the vhost-scsi target in later sections.
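For example, if you prefer not to load the null_blk module, a file-backed loop device can serve as an alternative backing device for experimentation (the file path below is just an example, and a loop device will not show the same performance as null_blk):

# Create a sparse 100GB file and attach it to the first free loop device.
$ truncate -s 100G /var/tmp/vhost-backing.img
$ sudo losetup -f --show /var/tmp/vhost-backing.img
/dev/loop0
# Use /dev/loop0 in place of /dev/nullb0 in the targetcli steps below.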
Install Target Tools
$ sudo dnf install targetcli
Setup a vhost-scsi Target
Open the targetcli shell:
$ targetcli
Create a vhost-scsi target and map /dev/nullb0 to a LUN that we name disk0:
/> backstores/block create name=disk0 dev=/dev/nullb0
Created block storage object disk0 using /dev/nullb0.
/> backstores/block/disk0 set attribute emulate_pr=0
Parameter emulate_pr is now '0'.
/> vhost/ create
Created target naa.5001405b58bf97f1.
Created TPG 1.
/> vhost/naa.5001405b58bf97f1/tpg1/luns create /backstores/block/disk0
Created LUN 0.
/> exit
Above we performed one extra step and set emulate_pr=0. This disables persistent reservation (PR) emulation in the target layer; instead, the target passes PR commands directly to the backing device. This avoids a possible performance issue that limits IOPS and throughput.
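The same setup can also be scripted by passing each command to targetcli directly instead of using the interactive shell, which is handy if you rebuild the target often. The sketch below assumes the WWPN generated in the session above; yours will differ, so substitute the name printed by vhost/ create. Saving the configuration with saveconfig lets the target be restored across a restart of the target service.

$ sudo targetcli backstores/block create name=disk0 dev=/dev/nullb0
$ sudo targetcli backstores/block/disk0 set attribute emulate_pr=0
$ sudo targetcli vhost/ create
$ sudo targetcli vhost/naa.5001405b58bf97f1/tpg1/luns create /backstores/block/disk0
$ sudo targetcli saveconfig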
Run QEMU with a vhost-scsi Device
In the example in this section, we will pass the vhost-scsi device to QEMU on the command line.
Increase Queuing Limits
We want to increase the SCSI host (virtqueue_size) and logical unit (cmd_per_lun) queuing limits so the guest avoids waiting for free commands as much as possible. The default value for both limits is 128, and we will increase them to 1024, the maximum QEMU allows, by adding the following to the device definition:
cmd_per_lun=1024,virtqueue_size=1024
Create a Worker Per Queue
By default, the vhost layer uses a single worker thread per device to service IO. After we increase the queuing limits, this single thread becomes a bottleneck. To process all the commands the guest can now send, we will enable the worker_per_virtqueue feature. Setting this property to true tells QEMU and the kernel to create a vhost worker thread per virtqueue.
worker_per_virtqueue=true
QEMU -device Definition
Finally, to add the vhost-scsi device we created with targetcli (wwpn=naa.5001405b58bf97f1) with the settings above, add the following to your QEMU command line:
-device vhost-scsi-pci,wwpn=naa.5001405b58bf97f1,cmd_per_lun=1024,virtqueue_size=1024,worker_per_virtqueue=true
Start QEMU
You can add the -device definition from above to an existing QEMU command and start it as normal, but you may need to increase the number of vCPUs. To take advantage of multi-queue support we need to spread IO over multiple queues, which means sending IO from multiple CPUs. The minimum number of vCPUs we will test with is 4 and the maximum is 16.
The command below creates a VM with 16 vCPUs and 16GB of memory. It uses an Oracle Linux 9 image onto which we have installed the UEK-next kernel.
$ qemu-system-x86_64 -smp 16 -m 16G -enable-kvm -cpu host \
    -hda /work/OL9U4_x86_64.qcow2 -name debug-threads=on \
    -serial mon:stdio -vnc :7 \
    -device vhost-scsi-pci,wwpn=naa.5001405b58bf97f1,cmd_per_lun=1024,virtqueue_size=1024,worker_per_virtqueue=true
Note: The -vnc :7 option starts a VNC server on display 7. A VNC display number N listens on TCP port 5900+N, so display 0 listens on 5900, display 1 on 5901, and display 7 on 5907. To access the VM, connect any VNC client to localhost:5907: for example, RealVNC or TightVNC on Windows, or the vncviewer program that comes with your Linux distribution.
Bind QEMU to CPUs
Check your system’s CPU configuration by running the lscpu -e command. On our system, CPUs 16 to 31 are on the same NUMA node and are free. Since our VM has 16 vCPUs, we’ll bind all the QEMU threads to 16 physical CPUs on the host.
Run the following command on the host to bind all the QEMU threads to physical CPUs 16 to 31.
$ taskset -cp -a 16-31 <QEMU-pid>
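If you are not sure of the QEMU process id, one way to find it and confirm the new affinity is sketched below; pgrep and the CPU range 16-31 are just examples, so adjust them for your system.

# Find the QEMU process id (assumes a single qemu-system-x86_64 instance).
$ QEMU_PID=$(pgrep -x qemu-system-x86_64)
# Pin every QEMU thread to CPUs 16-31, then print the affinity of all threads.
$ taskset -cp -a 16-31 $QEMU_PID
$ taskset -cp -a $QEMU_PID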
Verify Settings
Before we run IO, we will check that our settings are being used. To get this info we need the SCSI host number and the /dev entry on the guest. To see how the target and LUN created by targetcli were mapped to the guest OS's names/IDs, run lsscsi in the VM:
$ lsscsi
[0:0:1:0]    disk    LIO-ORG  disk0    4.0   /dev/sda
When we created the block device with targetcli we gave it the name “disk0”, and we see it above attached to [0:0:1:0]. This is [SCSI host 0: bus 0: target 1: LUN 0] and the OS’s name for the device is sda.
Below we will use host0 for the SCSI Host and sda for the block device name in our commands to verify settings and for our fio tests.
Check Multi-queue Is Enabled
In the VM, the command:
$ ls /sys/block/sda/mq
0  1  10  11  12  13  14  15  2  3  4  5  6  7  8  9
should display a directory per CPU. In the example above, the VM has 16 CPUs so we see 0-15. Each directory represents a hardware queue, and should have a cpu_list file that prints out a single CPU that the queue is mapped to:
$ more /sys/block/sda/mq/*/cpu_list
::::::::::::::
/sys/block/sda/mq/0/cpu_list
::::::::::::::
0
::::::::::::::
/sys/block/sda/mq/1/cpu_list
::::::::::::::
1
::::::::::::::
/sys/block/sda/mq/2/cpu_list
::::::::::::::
2
::::::::::::::
/sys/block/sda/mq/3/cpu_list
::::::::::::::
3
::::::::::::::
/sys/block/sda/mq/4/cpu_list
::::::::::::::
4
...
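A quick way to double-check that the queue count matches the vCPU count in the guest is to compare the two counts below; on this 16 vCPU VM both should print 16.

# Number of blk-mq hardware queues for sda.
$ ls /sys/block/sda/mq | wc -l
16
# Number of online CPUs in the guest.
$ nproc
16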
Verify Queuing Limits
To check that the new cmd_per_lun value is being used on the VM run:
$ cat /sys/block/sda/queue/nr_requests
1024
$ cat /sys/class/scsi_host/host0/cmd_per_lun
1024
And, to check virtqueue_size run:
$ cat /sys/class/scsi_host/host0/can_queue
1024
Verify Worker Per Virtqueue
Use pidstat -t -p <QEMU-pid> to check that we have created a vhost worker thread per virtqueue.
$ pidstat -t -p 3645
09:18:40 PM   UID      TGID       TID    %usr %system  %guest   %wait    %CPU   CPU  Command
...
09:18:40 PM     0         -      3651    0.14    0.06    0.08    0.00    0.28     6  |__CPU 0/KVM
09:18:40 PM     0         -      3652    0.00    0.00    0.02    0.00    0.03    35  |__CPU 1/KVM
09:18:40 PM     0         -      3653    0.00    0.00    0.01    0.00    0.01    17  |__CPU 2/KVM
09:18:40 PM     0         -      3654    0.00    0.00    0.01    0.00    0.02    38  |__CPU 3/KVM
09:18:40 PM     0         -      3655    0.00    0.00    0.01    0.00    0.01    39  |__CPU 4/KVM
09:18:40 PM     0         -      3656    0.00    0.01    0.02    0.00    0.03     1  |__CPU 5/KVM
09:18:40 PM     0         -      3657    0.00    0.01    0.01    0.00    0.02    12  |__CPU 6/KVM
09:18:40 PM     0         -      3658    0.00    0.00    0.01    0.00    0.01     5  |__CPU 7/KVM
09:18:40 PM     0         -      3659    0.00    0.00    0.02    0.00    0.03    16  |__CPU 8/KVM
09:18:40 PM     0         -      3660    0.00    0.00    0.01    0.00    0.01    27  |__CPU 9/KVM
09:18:40 PM     0         -      3661    0.00    0.01    0.01    0.00    0.02    19  |__CPU 10/KVM
09:18:40 PM     0         -      3662    0.00    0.00    0.01    0.00    0.01     8  |__CPU 11/KVM
09:18:40 PM     0         -      3663    0.00    0.00    0.01    0.00    0.02     9  |__CPU 12/KVM
09:18:40 PM     0         -      3664    0.00    0.00    0.01    0.00    0.02    29  |__CPU 13/KVM
09:18:40 PM     0         -      3665    0.00    0.00    0.01    0.00    0.01    14  |__CPU 14/KVM
09:18:40 PM     0         -      3666    0.00    0.00    0.01    0.00    0.01    32  |__CPU 15/KVM
09:18:40 PM     0         -      3668    0.00    0.00    0.00    0.00    0.00    10  |__vhost-3645
09:18:40 PM     0         -      3669    0.00    0.00    0.00    0.00    0.00    35  |__vhost-3645
09:18:40 PM     0         -      3670    0.00    0.00    0.00    0.00    0.00    37  |__vhost-3645
09:18:40 PM     0         -      3671    0.00    0.00    0.00    0.00    0.00    11  |__vhost-3645
09:18:40 PM     0         -      3672    0.00    0.00    0.00    0.00    0.00     3  |__vhost-3645
09:18:40 PM     0         -      3673    0.00    0.00    0.00    0.00    0.00     0  |__vhost-3645
09:18:40 PM     0         -      3674    0.00    0.00    0.00    0.00    0.00    10  |__vhost-3645
09:18:40 PM     0         -      3675    0.00    0.00    0.00    0.00    0.00    26  |__vhost-3645
09:18:40 PM     0         -      3676    0.00    0.00    0.00    0.00    0.00    28  |__vhost-3645
09:18:40 PM     0         -      3677    0.00    0.00    0.00    0.00    0.00    10  |__vhost-3645
09:18:40 PM     0         -      3678    0.00    0.00    0.00    0.00    0.00    10  |__vhost-3645
09:18:40 PM     0         -      3679    0.00    0.00    0.00    0.00    0.00    11  |__vhost-3645
09:18:40 PM     0         -      3680    0.00    0.00    0.00    0.00    0.00     1  |__vhost-3645
09:18:40 PM     0         -      3681    0.00    0.00    0.00    0.00    0.00    18  |__vhost-3645
09:18:40 PM     0         -      3682    0.00    0.00    0.00    0.00    0.00    10  |__vhost-3645
09:18:40 PM     0         -      3683    0.00    0.00    0.00    0.00    0.00    29  |__vhost-3645
...
QEMU 9.0 will create a virtqueue per vCPU by default, so by setting worker_per_virtqueue to true we should see the same number of vhost and CPU threads. Above we see 16 CPU threads and 16 vhost threads, so we know the feature was setup correctly. Note that if you are not using the QEMU example command in this article, you may see more vhost threads if you are using vhost for networking or for other hardware.
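If you prefer not to count the threads by eye, something like the following should print 16 on this setup. The vhost worker threads are named vhost-<QEMU-pid>, as seen in the pidstat output above; replace 3645 with your QEMU pid.

# Count the vhost worker threads belonging to the QEMU process.
$ pidstat -t -p 3645 | grep -c vhost
16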
Test It Out
Run Fio
Run this fio workload that spreads I/O across 4 queues:
$ cat randread.4jobs.fio
[global]
bs=4K
iodepth=128
direct=1
ioengine=libaio
group_reporting
time_based
runtime=120
name=standard-iops
rw=randread
numjobs=4
cpus_allowed=0-3

[job1]
filename=/dev/sda

$ fio randread.4jobs.fio
The cpus_allowed=0-3 argument tells fio to spread the 4 jobs we specified with numjobs=4 over CPUs 0 to 3. Because we have a queue per CPU, this will spread IO across 4 queues. We can increase numjobs and add more CPUs to increase performance by using more queues. Here we use 8:
$ cat randread.8jobs.fio
[global]
bs=4K
iodepth=128
direct=1
ioengine=libaio
group_reporting
time_based
runtime=120
name=standard-iops
rw=randread
numjobs=8
cpus_allowed=0-7

[job1]
filename=/dev/sda

$ fio randread.8jobs.fio
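To measure how IOPS scales with the number of queues, it can be convenient to drive the same workload from the fio command line in a loop rather than editing the job file each time. The sketch below uses standard fio options equivalent to the job files above and steps through the 4, 8, and 16 queue cases.

# Run the random-read workload with 4, 8, and 16 jobs, one queue per job.
$ for j in 4 8 16; do
      fio --name=randread-${j}jobs --filename=/dev/sda --rw=randread \
          --bs=4K --iodepth=128 --direct=1 --ioengine=libaio \
          --numjobs=$j --cpus_allowed=0-$((j-1)) \
          --group_reporting --time_based --runtime=120
  done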
As you utilize more queues, IOPS should increase close to linearly. However, depending on factors like the performance of the physical CPUs, scaling may stop before fio is using all 16 queues we created the VM with. This happens when the VM and vhost-scsi threads, which share the same physical CPUs, reach 100% usage.
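One way to see when that point is reached is to watch the host CPUs the QEMU and vhost threads were pinned to while fio is running; if those CPUs sit near 100%, adding more queues will not help. mpstat is part of the sysstat package, and on our system the CPUs of interest are 16 to 31, which we pinned to earlier.

# On the host, report per-CPU utilization once a second during the fio run.
$ mpstat -P ALL 1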
Conclusion
In this article, we saw how to set up a VM and use vhost-scsi's worker_per_virtqueue feature, which was added in QEMU 9.0 and the 6.5 Linux Kernel. By enabling this feature, applications in the guest can now execute IO through multiple queues and avoid the bottleneck that was limiting performance.
References
- https://docs.kernel.org/block/blk-mq.html
- https://blogs.oracle.com/linux/post/virtioblk-using-iothread-vq-mapping
- https://wiki.libvirt.org/Vhost-scsi_target.html
- https://yum.oracle.com/oracle-linux-isos.html
- https://yum.oracle.com/oracle-linux-templates.html
- https://blogs.oracle.com/linux/post/uek-next
- https://docs.kernel.org/block/null_blk.html
- https://docs.oracle.com/en-us/iaas/Content/Block/References/samplefiocommandslinux.htm