In this blog post we will explain the technology approach used by HVX, our nested, high-performance hypervisor. We will also give an overview of today's virtualization technology landscape, and show you where HVX fits in.
This blog is part of our series on the future of virtualization.
HVX is a software-only, full virtualization solution based on dynamic binary translation combined with direct execution. That is quite a mouthful :) Let's go over these terms one by one:
Binary translation is one way to implement a full virtualization solution when there is no hardware support. We will now take a deeper look at why it is desirable to create a hypervisor that does not need hardware virtualization support. But before we do that, we'll look at what exactly this "hardware virtualization" support is.
The classical way to implement a hypervisor is the "trap and emulate" approach. This approach was used by the very first hypervisor developed by IBM in the late 1960s, and is used again today on 64-bit Intel and AMD systems.
Trap-and-emulate works as follows. The hypervisor allows executable code from the guest to run directly on the host CPU. However, the hypervisor has configured the CPU in such a way that all potentially unsafe instructions will cause a "trap". An unsafe instruction is one that, for example, tries to access or modify the memory of another guest. A trap is an exceptional condition that transfers control back to the hypervisor. Once the hypervisor has received a trap, it inspects the offending instruction, emulates it in a safe way, and continues execution after the instruction. The approach usually has good performance, because the majority of instructions do not cause a trap and execute straight on the CPU with no overhead.
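To make the control flow concrete, here is a minimal sketch of trap-and-emulate in Python. It is a toy model, not how HVX or any real hypervisor is written: the instruction names (`mov`, `add`, `out`, `load_page_table`), the `Trap` exception, and the `hypervisor_run` function are all made up for illustration, and the guest code is straight-line only.

```python
class Trap(Exception):
    """Raised by the toy CPU when the guest executes a privileged instruction."""
    def __init__(self, instr):
        super().__init__(f"privileged instruction: {instr!r}")
        self.instr = instr

# Hypothetical toy instruction set: each instruction is an (opcode, operand) tuple.
PRIVILEGED = {"out", "load_page_table"}

def cpu_execute(instr, regs):
    """Stand-in for direct execution: run one instruction, or trap if privileged."""
    op, arg = instr
    if op in PRIVILEGED:
        raise Trap(instr)
    if op == "mov":
        regs["acc"] = arg
    elif op == "add":
        regs["acc"] += arg
    else:
        raise ValueError(f"unknown opcode {op!r}")

def hypervisor_run(guest_code):
    """Run straight-line guest code, emulating every instruction that traps."""
    regs = {"acc": 0}
    # Per-guest virtual state; the real device and page tables are never touched.
    virtual_state = {"device_log": [], "page_table": None}
    traps = 0
    for instr in guest_code:
        try:
            cpu_execute(instr, regs)          # the common, fast path
        except Trap as trap:
            traps += 1                        # control is back in the hypervisor
            op, arg = trap.instr
            if op == "out":
                virtual_state["device_log"].append(regs["acc"])
            else:                             # "load_page_table"
                virtual_state["page_table"] = arg
            # Execution then continues with the next guest instruction.
    return regs, virtual_state, traps

guest = [("mov", 2), ("add", 3), ("out", None), ("load_page_table", 4096), ("add", 1)]
regs, state, traps = hypervisor_run(guest)
```

In this run only two of the five instructions trap; the other three take the fast path, which is where the good performance of the approach comes from.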
The "unsafe" instructions are also called "sensitive" instructions in virtualization parlance, and instructions that cause a trap are called "privileged". A famous 1974 paper by Popek and Goldberg shows that a CPU is virtualizable using trap-and-emulate if the set of sensitive instructions is a subset of the set of privileged instructions. Intuitively we can see why this matters: we need all sensitive instructions to trap, and therefore they need to be privileged as well. The Popek and Goldberg paper does a much more rigorous job of proving this. Even after 40 years the paper is very readable and makes for an interesting read.
When the Intel architecture first came out, it was not virtualizable according to Popek and Goldberg. This is because there were a whopping 17 instructions that were sensitive but not privileged. What the Intel and AMD hardware virtualization features do is offer a way to make these sensitive instructions privileged. Simple, isn't it? Intel calls these features "VT-x" while AMD calls them "AMD-V". For brevity, we'll simply call them "VT" from now on.
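The subset condition is easy to express in code. The sketch below is only an illustration: the two sets are small, hand-picked samples rather than the complete x86 lists, and the function name is our own. `POPF`, `SGDT`, `SIDT` and `SMSW` are examples of instructions that were sensitive but not privileged on the original architecture.

```python
def trap_and_emulate_possible(sensitive, privileged):
    """True when every sensitive instruction is also privileged."""
    return set(sensitive) <= set(privileged)

# Illustrative samples only, not the full instruction lists.
privileged = {"LGDT", "LIDT", "LMSW", "MOV_TO_CR3"}
sensitive = privileged | {"POPF", "SGDT", "SIDT", "SMSW"}

# Instructions that silently do the wrong thing instead of trapping.
problem_instructions = sorted(sensitive - privileged)

virtualizable = trap_and_emulate_possible(sensitive, privileged)
```

With these sample sets the check fails, and `problem_instructions` lists exactly the instructions that a hypervisor has to deal with in some other way.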
So why did we develop a hypervisor that doesn't need VT? These features were introduced in 2005, and for the past few years virtually all servers shipped have included them. So why not use them?
The answer is simple. None of the public clouds offers these features. While the public cloud providers themselves have access to and use the VT features on the hardware, they do not expose them to the virtual guest that you as a customer buy from them. This means that when you buy a VM from a public cloud, you buy just that: a single, static system that you cannot subdivide further, and on which you cannot use any of the advanced features that virtualization makes possible in the datacenter. Think for example about snapshots and restore, live migration, software-defined networking, etc. None of these features are available to you when you buy a machine from Amazon EC2 or any of the other providers.
At Ravello we have big plans to enable these features for the cloud, and that is why we have implemented HVX the way we did, allowing it to run nested inside a virtual machine, without requiring VT features.
So why did we choose binary translation? Simple. It is the most efficient way to do software-based full virtualization.
Binary translation is not a new technique, and it is actually a fairly broad technique that covers use cases outside virtualization as well.
Binary translation was first described in a paper from 1992 by Digital Equipment Corporation, but the technique is probably older. When DEC introduced its new 64-bit Alpha AXP processor, it wanted a way to run binaries compiled for the VAX and MIPS architectures unmodified on the Alpha processor. The solution chosen by DEC was binary translation. In essence, the translator reads chunks of VAX or MIPS instructions from memory before they are executed, translates them to native Alpha AXP instructions, saves each translated chunk to memory, and then executes it. If done well, this technique can have good performance. The DEC paper reports that translated VAX and MIPS binaries ran at the same or higher performance on the Alpha processor.
The same idea of binary translation can be used to create a hypervisor on a processor that does not meet the Popek-Goldberg requirement. The idea is that the hypervisor can examine a piece of guest code before it runs, find the sensitive but unprivileged instructions, translate them into something privileged, and then run the translated code. Compare this with a hardware-based virtualization approach, where the sensitive unprivileged instructions cause a trap. With binary translation, there are no sensitive unprivileged instructions left in the code that runs, because they have all been translated.
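As a rough sketch of that translation step, the toy translator below scans a piece of guest code and rewrites the problematic instructions. Everything here is an assumption made for illustration: the tuple-based instruction format, the lowercase opcode names, and in particular the `call_hypervisor` pseudo-instruction, which stands in for "something privileged" that hands control to the hypervisor.

```python
# Stand-ins for sensitive-but-unprivileged instructions in a toy ISA.
SENSITIVE_UNPRIVILEGED = {"popf", "sgdt"}

def translate(instructions):
    """Return a translated copy of guest code; the original is left untouched."""
    translated = []
    for op, arg in instructions:
        if op in SENSITIVE_UNPRIVILEGED:
            # Replace the instruction with an explicit hand-off to the hypervisor,
            # carrying the original instruction along so it can be emulated safely.
            translated.append(("call_hypervisor", (op, arg)))
        else:
            # Harmless instructions are copied through unchanged.
            translated.append((op, arg))
    return translated

guest_code = [("mov", 1), ("popf", None), ("add", 2)]
safe_code = translate(guest_code)
```

After translation, `safe_code` contains no sensitive unprivileged instructions, and `guest_code` itself has not been modified.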
Binary translation can be done statically on a full program, or dynamically, on demand, just before a chunk of code is executed. When done on demand, it is usually performed in small units called "basic blocks". A basic block is a sequence of instructions that ends with a branch instruction but does not have any branch instructions inside. Such a block will always be executed start to finish by a CPU, and is therefore an ideal unit for translation. The translations of the basic blocks are cached. This means that the overhead of translating is only incurred the first time a block is executed.
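A minimal sketch of on-demand translation with a cache might look as follows. The branch opcodes, the `basic_block_at` helper and the `TranslationCache` class are hypothetical names for this example, and the translator is passed in as a plain function so the sketch does not depend on any particular translation scheme.

```python
# Opcodes that end a basic block in this toy instruction set.
BRANCHES = {"jmp", "jz", "ret"}

def basic_block_at(code, start):
    """Return the instructions from `start` up to and including the next branch."""
    end = start
    while end < len(code) - 1 and code[end][0] not in BRANCHES:
        end += 1
    return code[start:end + 1]

class TranslationCache:
    """Translate each basic block once, keyed by its guest start address."""
    def __init__(self, translate):
        self._translate = translate
        self._blocks = {}
        self.misses = 0           # number of times we actually had to translate

    def lookup(self, code, start):
        if start not in self._blocks:
            self.misses += 1
            self._blocks[start] = self._translate(basic_block_at(code, start))
        return self._blocks[start]

code = [("mov", 1), ("add", 2), ("jz", 0), ("mov", 3), ("ret", None)]
cache = TranslationCache(translate=list)   # identity "translation" for the demo
first = cache.lookup(code, 0)              # translated now
again = cache.lookup(code, 0)              # served from the cache
second = cache.lookup(code, 3)             # a different block, translated now
```

The first block is translated once even though it is looked up twice, so `cache.misses` ends up at 2 for the three lookups.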
The translated blocks are loaded at a different memory offset than the untranslated blocks, and are also usually larger than the original blocks. This means that both absolute and relative memory references need to be relinked in the translated code. Branch instructions at the end of a basic block can also be relinked to jump directly to another basic block. This is called "block chaining" and is an important performance optimization.
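Block chaining can be sketched in the same toy style. In the example below, the `Block` class and the `run` function are made up for illustration, and a dictionary lookup plays the role of the comparatively slow trip through the dispatcher. The first time a block exits, its successor is looked up and remembered; after that, the block jumps straight to it.

```python
class Block:
    """A translated basic block, identified by the guest address it came from."""
    def __init__(self, guest_addr, next_guest_addr):
        self.guest_addr = guest_addr
        self.next_guest_addr = next_guest_addr   # None means "stop here"
        self.chained_next = None                 # direct link, filled in lazily

def run(blocks, entry, max_blocks=100):
    """Execute blocks starting at `entry`; return (trace, dispatcher lookups)."""
    lookups = 1                                  # finding the entry block
    block = blocks[entry]
    trace = []
    for _ in range(max_blocks):
        trace.append(block.guest_addr)           # "execute" the block
        if block.next_guest_addr is None:
            break
        if block.chained_next is None:
            # Slow path: go through the dispatcher, then patch in a direct link.
            lookups += 1
            block.chained_next = blocks[block.next_guest_addr]
        block = block.chained_next               # fast path on later runs
    return trace, lookups

blocks = {0: Block(0, 16), 16: Block(16, 32), 32: Block(32, None)}
first_trace, first_lookups = run(blocks, 0)
second_trace, second_lookups = run(blocks, 0)
```

Both runs visit the same blocks, but the second run needs only the initial lookup because every exit has already been chained.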
Implementing binary translation as described above would be slow and inefficient. This is because in "traditional" binary translation all code is translated. But this is actually not needed. Let's see why.
If you look at the sensitive instructions, they include things like loading memory mapping tables and accessing I/O devices. Normal applications had better not use any of these instructions. Imagine what would happen if a normal application like a word processor were suddenly able to write to arbitrary memory locations, or get raw access to your hard drive. Any application would be able to crash the system, and in effect we'd be back to the times of MS-DOS. Modern operating systems use what's called "protected mode" to separate operating system code from regular application code. Intel processors define 4 different "rings" of security. Ring 0 is the most privileged, and is reserved for the operating system kernel. All applications run in ring 3, where it is not possible to execute sensitive instructions.
This means that in order to implement a hypervisor based on binary translation, we only need to translate kernel code that is executing in ring 0. Depending on the workload, this is a small fraction of the total code. So with binary translation combined with direct execution, most code will actually run straight on the CPU without ever being translated.
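The decision itself is simple enough to sketch. The function names and the sample workload below are invented for illustration, and rings 1 and 2 are deliberately left out of this toy model.

```python
def execution_mode(ring):
    """Pick how guest code at a given privilege ring is executed."""
    if ring == 0:
        return "translate"    # kernel code goes through binary translation
    if ring == 3:
        return "direct"       # application code runs straight on the CPU
    raise ValueError(f"ring {ring} is not modelled in this sketch")

def translated_fraction(trace):
    """Fraction of executed instructions that needed translation.

    `trace` is a list of (ring, instruction_count) pairs.
    """
    total = sum(count for _, count in trace)
    translated = sum(count for ring, count in trace
                     if execution_mode(ring) == "translate")
    return translated / total

# Hypothetical workload: mostly application code with some time in the kernel.
workload = [(3, 600), (0, 100), (3, 300)]
fraction = translated_fraction(workload)
```

For this made-up workload, only a tenth of the instructions pass through the translator; the rest are executed directly.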
The above is the reason why very high performance can be achieved with binary translation. In fact, it is well known that VMware runs faster on first-generation VT hardware using binary translation than using VT. Second-generation VT narrowed that performance gap, but as we'll show in our next blog, we have achieved some extremely compelling benchmark results.
For an even more in-depth look at the HVX architecture, see our submission to USENIX HotCloud '13.
Most of today's hypervisors implement hardware-based, "trap and emulate" full virtualization. This includes VMware ESX(i), KVM, Microsoft's Hyper-V and Xen. Of these, ESX(i) and KVM exclusively do full virtualization, while Hyper-V and Xen include an additional paravirtualized mode.
In software-based virtualization, the most important implementation today is the binary translation mode in VMware, which is available for 32-bit guests and supports SMP. The performance of this implementation is very good. VirtualBox has a 32-bit mode based on "binary patching". This approach is similar to binary translation, but rather than translating code, it patches the code in place. This is a much less clean approach than binary translation, because it modifies the guest OS memory image and these modifications are visible to the guest. Also, in this mode VirtualBox only supports one CPU.
The famous QEMU also has a binary translation mode called TCG (Tiny Code Generator). TCG does not do direct execution, however, and is therefore very slow.
Instead of binary translation, one can also opt for full system emulation. Here, the full fetch-decode-execute pipeline of a CPU is emulated in software. Such solutions are extremely slow. An example of such software is Bochs.
To the best of our knowledge, HVX is the only working implementation today of 64-bit binary translation with direct execution.
The graphic below gives an overview of the different technologies. In square brackets, some example implementations of the technology are provided.
In this blog we introduced HVX as a software-only, full virtualization solution based on binary translation with direct execution. As far as we are aware, this is the only working implementation that exists today of 64-bit binary translation with direct execution.
In our next blog we will share some performance benchmarks for HVX. Please stay tuned.