By Jsavit-Oracle on Nov 11, 2006
First, some background on virtual machines Old School Style - "trap and emulate".
In the beginning, say about 1966, virtual machines emerged on one of the IBM mainframes of the time (the System/360 Model 67), with a hypervisor called CP/67, which became VM/370 and the successor virtual machine products for mainframes (the definitive history of this product line is Melinda Varian's VM and the VM Community: Past, Present, and Future). This family of hypervisors - also called virtual machine monitors (VMMs) - influenced other product lines, such as VMware on x64/x86.
These systems use their platforms' architectural support for privileged state (OS level) and unprivileged state (application program level) to let a hypervisor, also called a host, run multiple virtual machines (called guests). The hypervisor time-slices between guests just as a conventional multiprogramming OS time-slices between processes. The difference is that operating systems execute privileged instructions - things that change memory mapping for virtual memory, perform I/O, and so forth. Since the guests run in unprivileged state, executing such an instruction causes a program exception - a trap - which forces a context switch to the hypervisor. The hypervisor figures out what the guest was trying to do (issue I/O, enable or disable interrupt masks, flush cache, switch address spaces, whatever), and then emulates that instruction on behalf of the guest. Periodically an event happens on the real machine that the hypervisor "reflects" to the guest: a timer or I/O interrupt, for example. All this makes it possible to give the guest OS the illusion that it has its own machine. Neat, huh?
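The dispatch loop at the heart of this can be sketched in a few lines. This is a toy model only, assuming nothing about any real architecture: the instruction names, the PrivilegeTrap exception, and the Hypervisor class are all invented for illustration.

```python
# A minimal sketch of trap-and-emulate over a toy instruction set.
PRIVILEGED = {"START_IO", "ENABLE_INTERRUPTS", "LOAD_PAGE_TABLE"}

class PrivilegeTrap(Exception):
    """Raised when unprivileged code issues a privileged instruction."""
    def __init__(self, insn):
        self.insn = insn

def execute(insn, privileged):
    """The 'hardware': a privop traps unless the CPU is in privileged state."""
    if insn in PRIVILEGED and not privileged:
        raise PrivilegeTrap(insn)
    return "ran " + insn

class Hypervisor:
    def __init__(self):
        self.emulated = []  # privops performed on the guest's behalf

    def run_guest(self, program):
        results = []
        for insn in program:
            try:
                # Guests always run in unprivileged state.
                results.append(execute(insn, privileged=False))
            except PrivilegeTrap as trap:
                # The trap context-switches to the hypervisor, which
                # emulates the instruction in privileged state.
                self.emulated.append(trap.insn)
                results.append(execute(trap.insn, privileged=True))
        return results

hv = Hypervisor()
out = hv.run_guest(["ADD", "START_IO", "SUB", "ENABLE_INTERRUPTS"])
```

The guest never notices the difference: unprivileged instructions run directly, and the two privops complete as if the guest had executed them itself.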
Well, yes, this is neat - very elegant, in fact - but I've just hand-waved over tremendous complexity that is resolved only via Heroic Tricky Programming in each hypervisor. Before I discuss some of these complexities, I'm going to mention a few of the constraints that shaped how these traditional hypervisors work. First and foremost: CPU power was limited and highly granular sharing was mandatory. Most machines were uniprocessors (SMP systems came into the picture much later), so the only way to meaningfully run virtual machines was to time-slice between them, even if (as it turned out) the cost of a context switch was high. The second major constraint was that RAM was also scarce, and the only way to meaningfully fit virtual machines into RAM was to demand-page them, keeping only their working sets in RAM with other parts paged out (swapped out) to disk. This turned out to be problematic too, as operating systems have very poor locality of reference (why should they have good locality of reference? They're written under the assumption that all of the RAM visible to them should be used), and thus working sets tend toward the virtual machine's defined memory size. An honorable exception is the timesharing shell, VM/CMS: a single-user (per virtual machine) simple OS, it has a small memory footprint, and benefits from the fact that pages can be recycled if necessary during user think times between interactions.
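The locality argument can be made concrete with a toy simulation; the page counts, reference strings, and the 200-reference working-set window below are all invented for illustration, not taken from any real system.

```python
import random

def working_set(references, window):
    """Distinct pages touched in the last `window` references."""
    return len(set(references[-window:]))

# "OS-like" reference string: cycles over all 100 of its pages
# (poor locality - it assumes all visible RAM is there to be used).
os_refs = [i % 100 for i in range(1000)]

# "App-like" reference string: 90% of touches hit 5 hot pages.
random.seed(1)
app_refs = [random.randrange(5) if random.random() < 0.9
            else random.randrange(100)
            for _ in range(1000)]

ws_os = working_set(os_refs, window=200)    # the OS touches every page
ws_app = working_set(app_refs, window=200)  # the app touches a fraction
```

The OS-like pattern ends up with a working set equal to its entire memory, so demand-paging a guest OS buys little: nearly everything has to stay resident.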
So, what did that lead to? It defined families of virtual machine systems in which operating systems were time-sliced on a small number of CPUs ("1" not being too small a number), with actively used pages maintained in RAM via demand paging (I'm simplifying here a bit, but not in a way that distorts how this works). When the guest issues a privileged operation ("privop"), the hardware traps out to the hypervisor: trap and emulate.
Well, that's what we had for many years, and it worked pretty well. However, it could also incur tremendous overhead. In my next installment I'm going to discuss some of the difficult things I blithely dismissed a few sentences ago:
- context switch overhead between virtual machines, and between processes in a virtual machine.
- 3rd level storage addressing and shadow page tables
- the difficulty of emulating privileged operations in general, and I/O emulation in particular. Plus: the specific problem of trapping privileged operations on the x86 and x64 architectures.
- the problem of having nested CPU and memory resource managers that don't talk to one another.
These are very difficult issues to solve well, as I'll describe. In fact, some issues now encountered in the VMware world are well known to veterans of the VM/370 world, and never satisfactorily solved in either (or solved better then than now: one VM expert wryly commented to me that in reinventing the wheel some solutions aren't quite as round as they should be!). Now that the constraining assumptions on CPU and memory resources have been overtaken by Moore's Law, there may be a completely different way of handling the same question. I'll discuss that in an installment or two from now.