In this blog post we will talk about the performance of HVX, our nested, high-performance hypervisor that runs in the public cloud. When we mention nested virtualization to our customers, one of the first questions we get is about performance. This is understandable: virtualization, especially in its early days, had a reputation for being slow. So in our case, where two hypervisors are involved, one running inside the other, the concern is only natural.
Before getting into the details, we want to point out that performance is paramount to us, and it has been from the very beginning. When it became clear that we needed to build a nested hypervisor, the question we asked ourselves was not how to build one that works, but how to build one that works with excellent performance. So performance has been engineered into our technology right from the start.
This blog is part of our series on the future of virtualization.
A standard and easy-to-understand way to measure virtualization performance is to establish the hypervisor overhead using a few standard benchmarks. We can measure the score of these benchmarks on the host, which in our case is a virtual machine in the public cloud, and on the guest, which is actually a nested guest. (For brevity, and to prevent confusion, we will use the terminology "host VM" and "nested guest" from now on.) When we compare the two scores, we see what overhead was introduced by our hypervisor. The benefit of measuring performance this way is that it does not depend on the actual performance of the host itself, and should therefore have relatively good repeatability. We will call the nested guest's score, expressed as a percentage of the host VM's score, the virtualization efficiency. If there is no overhead, we expect the efficiency to be close to 100%. (The efficiency can actually be significantly more than 100%, see our next blog post for that).
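The efficiency figures in the tables below are simply the ratio of the two scores. As a minimal sketch (the function name is ours; the sample scores are the Dhrystone numbers from this post):

```python
def virtualization_efficiency(nested_score: float, host_score: float) -> float:
    """Nested guest score as a percentage of the host VM score."""
    return 100.0 * nested_score / host_score

# Example using the Dhrystone scores from the results table below.
eff = virtualization_efficiency(31.6, 33.0)
print(f"{eff:.0f}%")  # rounds to 96%
```

Note that nothing prevents the ratio from exceeding 100%, which is exactly what we see in one of the write benchmarks below.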
For our tests, we selected a standard set of benchmarks, listed in the results table below. Together they measure a wide spectrum of metrics, from CPU to memory and IO.
We ran the benchmarks on an m3.xlarge instance in Amazon EC2. We used Ubuntu 12.04 as the OS, both in the host VM as well as in the nested guest. The FIO and pgbench benchmarks used an EBS volume of 300GB with EBS optimization. We created an ext4 file system on the volume. The volume was mounted directly on the host VM, and passed through as a virtio block device to the nested guest. Caching in the host was disabled by using direct IO when exporting the volume to the nested guest.
The FIO random IO tests use 4KB random reads in a 200GB file on the EBS volume. Direct I/O was used for all FIO tests to reduce the impact of caching. The test configuration files are available in the following Github repository.
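The exact job files are in the repository linked above; a representative random-read job, reconstructed from the parameters described in this section (the file path, IO engine, and runtime are our assumptions, not taken from the repository), would look roughly like this:

```ini
; 4KB random reads against a 200GB file on the EBS volume,
; using direct I/O to bypass the page cache.
[randread]
rw=randread
bs=4k
direct=1
size=200g
filename=/mnt/ebs/fio-testfile   ; assumed mount point
ioengine=libaio                  ; assumed IO engine
runtime=60                       ; assumed runtime, in seconds
time_based=1
```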
The results of the test are given in the table below:
|Benchmark|Efficiency|Nested VM Score|Host VM Score|
|---|---|---|---|
|Dhrystone|96%|31.6 Mlps|33.0 Mlps|
|Whetstone|95%|3,449 MWIPS|3,635 MWIPS|
|FIO random read|89%|166 IOP/s|187 IOP/s|
|FIO random write|103%|1,808 IOP/s|1,761 IOP/s|
|FIO sequential read|86%|58.7 MB/s|68.0 MB/s|
|FIO sequential write|96%|33.7 MB/s|35.0 MB/s|
|pgbench|91%|157 tps|173 tps|
As you can see, performance is very good. The CPU-based Dhrystone and Whetstone benchmarks are nearly at 100%. This is because HVX uses binary translation with direct execution, where application code runs directly on the CPU. The IO benchmarks are also very good, all above 85%. Pgbench at 91% is a very respectable score.
The random write score, which at 103% is higher than the host score, is likely a statistical fluctuation in our measurement.
Another interesting way to look at our performance is to compare it with other implementations of binary translation that we can run in the cloud. We have selected two: VMware and QEmu. There are some caveats here. VMware only supports 32-bit guests in binary translation mode. And while QEmu supports 64-bit guests, it does not do direct execution, so its results will likely be extremely poor.
We used the same tests as in the overhead measurements above, on the same m3.xlarge instance in Amazon EC2 with the same EBS volume. The VMware tests were done with VMware Player 6 for Linux 64-bit, installed inside the EC2 instance. We selected VMware Player because we cannot install ESXi in Amazon EC2: ESXi comes in the form of an installable CD-ROM only, and Amazon offers no way to install it. VMware Player is a free download from the VMware web site. The VM running under Player has to be configured with "vmx.allowNested = TRUE" in the VMX file to allow it to run nested. The efficient PVSCSI interface was used to pass the EBS device into the guest. No VMware Tools were installed; instead the built-in Ubuntu PVSCSI driver was used, which should give the same performance. Write caching was disabled in VMware (as we did for HVX), because with caching enabled it is impossible to get meaningful numbers for the write benchmarks.
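For reference, the relevant lines in the guest's .vmx file look roughly like this. The nested-virtualization flag is the setting quoted above; the PVSCSI controller name is the standard VMware device name, but treat the snippet as a sketch rather than our exact configuration:

```
vmx.allowNested = "TRUE"
scsi0.virtualDev = "pvscsi"
```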
QEmu was configured to use virtio and direct IO.
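A QEmu invocation matching that configuration might look like the sketch below; the `cache=none` drive option is what gives direct IO (O_DIRECT) on the host. The image name, device path, memory size, and CPU count are our assumptions, not the exact command used in the tests:

```shell
qemu-system-x86_64 \
  -m 4096 -smp 4 \
  -drive file=ubuntu-12.04.img,if=virtio \
  -drive file=/dev/xvdf,if=virtio,cache=none   # EBS volume, direct IO
```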
To give a proper comparison, we ran the HVX benchmarks with both 64-bit as well as 32-bit guests. This is because the CPU based benchmarks will be impacted by the smaller word size.
The results are given below:
|Benchmark / Efficiency|HVX 64-bit|HVX 32-bit|VMware 32-bit|QEmu 64-bit|
|---|---|---|---|---|
|FIO random read|89%|86%|80%|91%|
|FIO random write|103%|102%|58%|97%|
|FIO sequential read|86%|86%|64%|87%|
|FIO sequential write|96%|95%|48%|96%|
We can draw some interesting conclusions from this. First of all, our performance is very good. It is well known that VMware has significantly optimized its hypervisor over the last decade and is the undisputed leader in datacenter virtualization. The fact that as a small startup we can match and exceed their performance is something we are very proud of.
The second conclusion is that for IO, our 32-bit performance is similar to 64-bit, but there is a difference for the other benchmarks. This is expected. The IO benchmarks are the same because word size doesn't really factor into IO performance. For the other benchmarks, the biggest difference is in Dhrystone, which is an integer benchmark. The difference in performance is likely caused by the guest's use of the wider 64-bit integer instructions, rather than by any additional hypervisor overhead.
Finally, as you can see, QEmu CPU performance is really slow. Dhrystone is at 5%, while Whetstone is even slower at 4%. This is because QEmu lacks direct execution and therefore all code (including user code) is translated. The IO benchmarks are very similar to HVX, which is expected as well. During an IO benchmark, the processor is mostly idle waiting for IO, which means that QEmu's lack of direct execution doesn't really impact it. The pgbench score at 27% indicates that this is a combined CPU/IO benchmark.
We have looked at the performance of HVX from two angles: measuring the hypervisor overhead, and comparing it against the binary translation implementations from VMware and QEmu. We score very well on both counts. Our virtualization efficiency is larger than 85% for all the benchmarks that we did. And when comparing our performance against VMware, there's a statistical tie for 3 of the benchmarks and a clear win for HVX in the other 4. QEmu is two orders of magnitude slower, because it lacks direct execution.
In our next blog we'll look at consolidating multiple nested VMs on a single host VM. Please stay tuned.