Trickiest (virtual) bits - more on why virtual machines are hard to do well
By jsavit on Dec 24, 2006
- Privileged operations in general
- I/O instructions
- Timer management
- Sharing CPU, and going to sleep the right way
- Virtual memory management
Privileged operations in general
We've already discussed the general issue of privileged operations, that privileged operations executed from a virtual machine's OS generally require a trap to the hypervisor and then emulation, for a cost of at least two context switches, which costs instruction counts and displaces cache and TLB contents.
What I didn't mention previously is that some architectures (x86 in particular) are not well suited for virtualization because there are instructions that should be intercepted by the hypervisor, but actually fail silently when run without being in the right mode (aka ring). For example, the POPF instruction can change whether interrupts are masked off or enabled, but that change is silently ignored when not in supervisor state. An OS can't run properly if interrupts are enabled when it needs them disabled, or vice-versa, so this is a problem that has to be solved. VMware handles this cleverly by scanning for offending binary code sequences in the virtual machine and replacing them so the guest traps out when this type of intervention is needed. Xen handles this by modifying the guest OS so such instruction sequences aren't needed in the first place, an elegant alternative when OS source code is available. z/VM and LPAR on the mainframe don't have this particular problem, and also try to address the cost of guest privileged operations by pushing some of the work into microcode - it still has a cost, but less than when processed in software. VM's CMS timesharing environment reduces privop handling costs by using paravirtualization for many of its operations - the word "paravirtualization" is new, I suppose, but the technique has been around since the 1970s.
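To make the silent-failure problem concrete, here is a toy Python model. It is not an x86 emulator, and the class and method names are made up for illustration; the only real-world detail it relies on is that the interrupt-enable flag is bit 9 of the flags register. It shows a guest kernel running outside ring 0 that believes it has disabled interrupts when nothing actually changed and no trap was raised.

```python
IF_BIT = 1 << 9  # interrupt-enable flag (bit 9 of the x86 flags register)


class ToyCPU:
    """Toy model: tracks only the privilege ring and the flags register."""

    def __init__(self, ring):
        self.ring = ring
        self.flags = IF_BIT  # interrupts start out enabled

    def popf(self, value):
        """Load the flags register from `value`.

        In ring 0 every bit is loaded. In a less privileged ring the
        interrupt flag keeps its old value and no trap is raised, so a
        hypervisor never gets the chance to emulate the change.
        """
        if self.ring == 0:
            self.flags = value
        else:
            self.flags = (value & ~IF_BIT) | (self.flags & IF_BIT)

    def interrupts_enabled(self):
        return bool(self.flags & IF_BIT)


# A kernel on real hardware (ring 0) disables interrupts successfully.
real = ToyCPU(ring=0)
real.popf(0)

# The same kernel code, deprivileged as a guest, silently fails to.
guest = ToyCPU(ring=1)
guest.popf(0)

print(real.interrupts_enabled(), guest.interrupts_enabled())
```

Binary rewriting and paravirtualization are two ways of making sure the second case never runs unmodified.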
I/O instructions
Well, that's nice for the general question of privileged operations, and the above especially addresses the context switch aspects, but some privileged instructions have such complex semantics that they are always problematic. Enabling or disabling interrupts is just flipping a bit in a descriptor - converting a virtual I/O operation into a real one is a lot more involved.
Consider the flow of doing I/O from a virtual machine:
- Guest application performs I/O operation. The system call might itself cause an intercept that traps to the hypervisor, which then hands it back to the guest OS.
- Depending on the OS, the application's buffers may or may not yet be pinned (fixed) into memory where they can be used for DMA transfers, so the guest OS may have to copy application buffers to kernel RAM or otherwise lock virtual memory pages into RAM for the duration of the I/O operation.
- The OS issues the physical I/O instruction for that computer's architecture.
- The hypervisor intercepts the I/O instruction (another context switch)
- The I/O instruction is checked for validity: make sure disk seeks don't go outside the bounds of a virtual disk, buffers map into the address space of the guest, and I/O is to a device present in the guest's virtual configuration. It takes some work to do this.
- Guest virtual memory is pinned into real memory, possibly even copied from disk if those locations were paged/swapped out.
- The I/O instruction is issued on the real hardware, with the buffer and I/O device addresses mapped to the real memory and the real device. If the I/O is bridged by software in the hypervisor (for example, a virtual disk or network provided by the hypervisor) then an even more complex flow of events starts, and the I/O is converted into a request to a virtual device service.
- Eventually the I/O completes, and interrupt status and buffers are returned to the guest OS.
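The validity-check step in the flow above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the virtual disk is a contiguous extent of a real disk, guest memory is one flat range, and every name here (VirtualDisk, IORequest, validate_and_translate) is hypothetical rather than taken from any real hypervisor.

```python
from dataclasses import dataclass


@dataclass
class VirtualDisk:
    first_block: int  # where this virtual disk starts on the real disk
    num_blocks: int   # size of the virtual disk in blocks


@dataclass
class IORequest:
    device: str       # device name as the guest sees it
    block: int        # starting block, relative to the virtual disk
    count: int        # number of blocks to transfer
    buffer_addr: int  # guest address of the data buffer
    buffer_len: int   # length of the data buffer in bytes


def validate_and_translate(req, devices, guest_mem_size, block_size=512):
    """Check a guest I/O request and return the real starting block."""
    disk = devices.get(req.device)
    if disk is None:
        raise PermissionError("device is not in the guest's configuration")
    if req.count <= 0 or req.block < 0 or req.block + req.count > disk.num_blocks:
        raise ValueError("transfer falls outside the virtual disk")
    if req.buffer_len < req.count * block_size:
        raise ValueError("buffer is too small for the transfer")
    if req.buffer_addr < 0 or req.buffer_addr + req.buffer_len > guest_mem_size:
        raise ValueError("buffer is outside the guest's address space")
    # Only now is it safe to relocate the request onto the real device.
    return disk.first_block + req.block


devices = {"vda": VirtualDisk(first_block=1000, num_blocks=100)}
request = IORequest("vda", block=10, count=4, buffer_addr=0x1000, buffer_len=2048)
print(validate_and_translate(request, devices, guest_mem_size=1 << 20))
```

Every one of these checks runs on each intercepted I/O, before pinning memory and before touching the real device, which is part of why virtual I/O costs so much more than flipping an interrupt bit.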
Timer management
Often overlooked, timers can be difficult to maintain accurately. Consider an OS written with the assumption that it owns the real machine and that real ("wall clock") time proceeds in a uniform manner, possibly notified by a hardware-provided metronome to keep time. An OS might expect a clock interrupt every 10ms and adjust its time of day clock accordingly. Well, when running as a guest, an OS has no way of knowing that in between two of its instructions, the hypervisor chose to run another guest for a few milliseconds. Wall clock time proceeds, while "on CPU" time doesn't, so the time of day clock in the guest gets further and further away from actual time. The lack of repeatable timing can be solved, but the solution is architecture dependent, and clock skew can be alarming. One solution is shown on mainframes, which have architected instructions for wall-clock and CPU time. There is still clock skew (the hypervisor might not deliver a simulated clock interrupt exactly on time, for example), but it's not as tough a problem as in some other places.
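The drift is easy to see in a small simulation. The sketch below assumes the simplest possible case: the guest keeps time purely by counting 10ms ticks, and ticks that fall while the guest is descheduled are lost rather than replayed by the hypervisor. The function name and the schedule format are invented for the example.

```python
def simulate_clock(schedule, tick_ms=10):
    """Return (wall_ms, guest_ms) after running through `schedule`.

    `schedule` is a list of (running, duration_ms) pairs that say whether
    the guest was on a CPU during each interval. The guest advances its
    clock by one tick for every full tick period it spends running.
    """
    wall_ms = 0
    guest_ms = 0
    for running, duration_ms in schedule:
        wall_ms += duration_ms
        if running:
            guest_ms += (duration_ms // tick_ms) * tick_ms
    return wall_ms, guest_ms


# The guest runs 30ms, is descheduled for 20ms, then runs another 50ms.
wall, guest = simulate_clock([(True, 30), (False, 20), (True, 50)])
print(wall, guest, wall - guest)
```

In this model the guest's clock falls behind by exactly the time it spent off the CPU, and the gap only grows the longer the guest runs.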
Actually, simply handling clock interrupts can be a cause of massive overhead. Consider a Linux guest on VMware or z/VM that receives a timer interrupt every 10ms, as was the case in 2.4 kernels (this is the so-called "jiffy" event). A dedicated PC can easily handle 100 clock interrupts per second, but this can snowball in virtual machines. Imagine if you have 1,000 guests each taking 100 clock interrupts per second (with a pair of context switches for each one). Experience several years ago on mainframe Linux, where people were trying to drive very high numbers of guests to compensate for the platform's price, showed that large numbers of Linux guests could saturate expensive systems just processing jiffies, even when they were otherwise idle! Yes, even idle Linux guests had enough overhead to swamp CPUs. Some people tested these systems with the HZ value changed to make the clock interrupts far less frequent - but then the guests became unresponsive to events. There eventually was a fix to this - the Linux kernel was changed to use a different time-keeping algorithm on this architecture - but this was only the most crippling timer event that had to be muzzled.
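The arithmetic behind that snowball is worth spelling out. The guest count and interrupt rate below come from the example above; the 20 microseconds of hypervisor work per simulated interrupt is purely an assumed figure for illustration, not a measurement from any system.

```python
def timer_load(guests, hz, cost_us_per_interrupt):
    """Return (interrupts/sec, context switches/sec, CPUs consumed)."""
    interrupts_per_sec = guests * hz
    # Each interrupt costs a switch into the guest and a switch back out.
    context_switches_per_sec = interrupts_per_sec * 2
    cpus_consumed = interrupts_per_sec * cost_us_per_interrupt / 1_000_000
    return interrupts_per_sec, context_switches_per_sec, cpus_consumed


# 1,000 idle guests at 100 ticks per second, assuming 20 microseconds each.
print(timer_load(guests=1000, hz=100, cost_us_per_interrupt=20))
```

Under that assumption, 100,000 interrupts per second would keep two whole CPUs busy doing nothing but delivering jiffies to guests that have no work to do.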
Sharing CPU, and going to sleep the right way
Any OS that is trained to believe that all the CPU cycles on a box belong to it might be an inhospitable guest, soaking up all the CPU it can with background activities. A generally unsolved problem (that is, not solved to particular satisfaction) is how to handle CPU priorities when a high priority guest (say, one running your database or transaction processor) needs to run some low priority work (such as maintenance tasks like RPM management on Linux, backups, and other tasks that can run at low priority).
One of two things typically happens: either the virtual machine's priority is so high that it continues to starve other guests from running their work (even when it is running non-critical work), or it absorbs so much capacity running the low priority work that it has used up its share, and has nothing left when it needs to run its serious work again. A good solution to this was designed in the late 1970s by Robert Cowles, then of Cornell University, and now at SLAC. He wrote an interface by which a guest signaled to the hypervisor the importance of the work it was about to do, and the hypervisor ran it at a lower priority, relative to the other guests, if it was low priority work from a high priority guest. This was done as open source modifications - I don't know of any commercial offerings that adopted this architecture.
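The general idea of such a hint interface can be sketched as follows. To be clear, this is not a reconstruction of the Cornell code, whose details aren't described here; it is a minimal illustration in which a guest's effective priority is the lower of its assigned priority and the importance it reports for its current work. All names are hypothetical.

```python
class HintingScheduler:
    """Toy scheduler: higher numbers mean higher priority."""

    def __init__(self):
        self.base = {}  # priority assigned to each guest by the administrator
        self.hint = {}  # importance the guest reports for its current work

    def add_guest(self, name, priority):
        self.base[name] = priority
        self.hint[name] = priority  # until told otherwise, assume important work

    def hint_importance(self, name, importance):
        """Called by a guest before it starts a new kind of work."""
        self.hint[name] = importance

    def effective_priority(self, name):
        # A guest can lower its own priority with a hint, never raise it.
        return min(self.base[name], self.hint[name])

    def pick_next(self):
        return max(self.base, key=self.effective_priority)


sched = HintingScheduler()
sched.add_guest("database", priority=10)
sched.add_guest("webserver", priority=5)

first = sched.pick_next()               # the database wins while doing real work
sched.hint_importance("database", 1)    # database starts its nightly backup
during_backup = sched.pick_next()       # now the web server goes first
sched.hint_importance("database", 10)   # back to transaction processing
afterwards = sched.pick_next()
print(first, during_backup, afterwards)
```

Taking the minimum means a guest cannot use the hint to jump ahead of its assigned priority, so the interface only ever gives capacity back to its neighbors.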
The worst case is when operating systems run "idle loops" when they have nothing to do. On real machines this makes some sense, because there is nothing else on the machine - but in a virtual machine it means that CPU cycles are burnt by idle guests when there is real work to be done elsewhere on the machine. The only answer for this is for guest operating systems to yield control and enter a wait state, enabled for interrupts that signal new work has arrived.
Virtual memory management
Finally, there's the problem of memory management. Systems like VMware and z/VM let you overcommit RAM. That is, the size of installed RAM can be much smaller than the combined memory sizes of the virtual machines that are running; working sets are kept resident in memory, while unused pages are saved on disk and fetched as needed.
This is the same thing that virtual memory managers do in standard OSes, with two important exceptions. First, operating systems have crummy locality of reference. They are designed to use all the RAM available on the box they run on, for buffers, for application binaries, etc. This makes complete sense when run on the real machine (otherwise the RAM would be wasted), but is counterproductive when there may be dozens of virtual machines all contending for the same RAM. In general, if you tell a guest OS that it has a virtual machine size of 512MB, it will use pretty close to 512MB.
Another annoyance is that nested memory managers don't play nicely together. The usual idea is to attempt to provide an "LRU" (Least Recently Used) policy, where the least recently used ("oldest") memory is displaced onto swap or page disk when there's a real memory shortage for other virtual memory locations. But, when a virtual memory system (the guest OS, whether it be Linux, z/OS, Solaris, or Windows) runs under a hypervisor that is also a virtual memory system (VMware, z/VM, etc) you can get double paging: the guest needs to page or swap out to disk from RAM in order to make room for a new request, but the hypervisor has already paged out that location. So, you need to do a page read just to get the contents that will be written out again and overwritten in memory. Ow.
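Double paging can be shown by counting disk operations. The following toy model assumes the hypervisor simply tracks which guest pages are resident in real memory; the function name and the operation labels are invented for this example.

```python
def guest_page_out(page, host_resident):
    """Return the disk operations needed when the guest pages out `page`.

    `host_resident` is the set of guest pages the hypervisor currently
    holds in real memory. If the hypervisor has already paged the victim
    out, it must read the page back in before the guest can write it to
    the guest's own swap space.
    """
    ops = []
    if page not in host_resident:
        ops.append(("hypervisor read", page))
        host_resident.add(page)
    ops.append(("guest write", page))
    return ops


resident = {1, 2}
print(guest_page_out(1, resident))  # page still in real memory: one write
print(guest_page_out(3, resident))  # page already paged out: a read, then a write
```

The second case is the painful one: two disk operations to move data the guest has decided it doesn't currently need.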
VMware addresses the general problem via a balloon technique: the hypervisor tells the guest when it's under memory pressure, and the guest obligingly uses less (the balloon is a buffer of RAM that isn't actually referred to) and reduces its memory footprint. This, and a similar feature from IBM for mainframe VM ("cooperative memory management"), help reduce the pain. What people do in practice is buy additional memory to prevent these problems in the first place, but unfortunately that raises the cost of virtualization solutions.
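Here is a rough sketch of the ballooning idea, and only of the idea: it does not reflect VMware's or IBM's actual interfaces. It assumes each guest has a floor below which it won't shrink, and that the hypervisor just asks guests in turn until it has reclaimed enough. All names are hypothetical.

```python
class BalloonGuest:
    def __init__(self, size_mb, floor_mb):
        self.size_mb = size_mb    # memory the guest was told it has
        self.floor_mb = floor_mb  # the guest won't shrink below this
        self.balloon_mb = 0       # memory held by the balloon and never touched

    def usable_mb(self):
        return self.size_mb - self.balloon_mb

    def inflate(self, request_mb):
        """Grow the balloon by up to `request_mb`; return the amount granted."""
        granted = max(0, min(request_mb, self.usable_mb() - self.floor_mb))
        self.balloon_mb += granted
        return granted


def reclaim(guests, needed_mb):
    """Ask guests in turn to inflate until `needed_mb` is freed, if possible."""
    freed = 0
    for guest in guests:
        if freed >= needed_mb:
            break
        freed += guest.inflate(needed_mb - freed)
    return freed


guests = [BalloonGuest(512, floor_mb=256), BalloonGuest(512, floor_mb=384)]
print(reclaim(guests, needed_mb=300), [g.usable_mb() for g in guests])
```

Because the balloon's pages are never referenced, the hypervisor can reuse the real memory behind them, and the guest's own memory manager gets to decide what to give up instead of the hypervisor guessing.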
It's hard out there
I hope that this closing entry for the year demonstrates why doing virtual machines is tricky business, with lots of technical problems that must be solved to make them work at all, and many more to make them work efficiently. Next year, I'll talk about why it frequently doesn't matter, and about some completely different approaches that make some of the problems simply disappear.
In the meantime, happy holidays and New Year to all!