By JoeBonasera on Jul 14, 2006
Three Moments of Xen
I've been quiet for a long time -- busily working on the version of OpenSolaris that runs on Xen. Our team has come a long way in just 4 months. You can now run OpenSolaris as dom0 under Xen, with OpenSolaris domUs as well. The domains can be either 32 bit or 64 bit (if dom0 is 64 bit), and multiprocessor domains work too. We've also done some simple testing of Linux domUs running on a system with a Solaris dom0, and that seems to work. Save/restore and live migration of domUs mostly work as well. HVM (i.e. hardware-supported virtualization) hasn't been tested yet, but it's one of the next things we'll be shaking out.
We've made the code drop to the outside world *very* early, so be prepared to stumble into quite a few bugs. We thought it would be best to let the rest of the world play with it now, rather than wait for us to polish off all the bugs. Now that the bulk of the coding is done, we hope to roll out mini-releases much more often, and soon we should be able to host a live Mercurial repository that gives the world full access to the code as it's written.
Check out the Xen community on OpenSolaris.org for the full scoop. Today I'll mention 3 interesting things I had to deal with over the last few months. Ok.. "interesting" like "beauty" is in the eye of the beholder.
Spurious Page Faults
One of the nastier problems we had in the last few months has been dealing with the problem of spurious page faults. In order to enforce isolation of the memory used by domains, Xen never allows an active page table in any domain to be writable. Spurious page faults are an artifact of the way that Xen deals with page table updates on multiprocessor domains. When a domain tries to modify a page table entry that maps a 4K page, Xen does the following:
- Verifies that the page being written to is an active page table
- Clears the PRESENT bit (bit 0) of the page directory entry that linked the page table into the current page table tree. This removes from the address space the entire 2Meg (or 4Meg for non-PAE mode) virtual region surrounding the virtual address mapped by the page table.
- Remaps the page table being written to as writable
- Returns to guest domain without reporting the pagefault
The interesting bit is what happens if 2 or more CPUs simultaneously try to dereference memory that is within the now missing 2 (or 4) Meg region of VA. Both CPUs will pagefault and trap into Xen. One of them will acquire the domain's memory lock and do the following, while the other spins waiting for the lock:
- Verifies that the faulting VA is in the current writable page table region.
- Verifies that all the updated PTEs in the page table are allowable mappings.
- Remaps the page table as not writable
- Sets the PRESENT bit (bit 0) of the page directory entry that linked the page table into the current page table tree.
- Drops the big memory lock
- Returns to guest domain without reporting a pagefault
To deal with this on Solaris we had to add a bit of code to cmntrap (in locore.s), trap.c and hat_i86.c to check if pagefaults were spurious. The basic idea is to check pagefaults that report a non-present kernel mapping and see if the actual page table entry is there or not. If the page table entry is there, we treat the fault as spurious and ignore it. Once this code was in, we started getting various failures in Xen, for example where the spurious fault happened inside Xen during a copyin/copyout operation. Keir Fraser promptly added some better code to Xen to deal with that after we e-mailed him about it.
We ran quite happily with that fix for quite a while, but eventually we discovered the following issue. When reporting a pagefault, Xen 3.0 passes the faulting address in a per-virtual-cpu structure. Since the OS has to handle nested pagefaults, our trap code has to quickly save that value on the stack, so that a subsequent pagefault doesn't overwrite the old fault value. Unfortunately the attempt to write the value to the stack may cause a spurious page fault itself. Solaris has always ensured that the pages containing the stack, the cpu structures, etc. are always mapped and won't recursively pagefault, but not the entire 2 (or 4) Meg regions surrounding those things. To avoid this we'd have to ensure that all of the data structures used in this code were not within 2 (or 4) Meg of a VA that could possibly have its mapping PTE changed. For Solaris this is not currently practical. Kernel stacks, cpu structures and other things are all dynamically allocated (and mapped) from the heap.
For now, our solution has fallen back to disabling some of the writable pagetable mechanism in Xen itself. In xen/arch/x86/mm.c, ptwr_do_page_fault() our version of Xen has the "goto emulate;" statement compiled in -- i.e. we take out the #if 0. This means that rather than remapping page tables as writable, Xen just emulates the instruction that writes to the pagetable and returns to the guest OS. In the future we'll investigate other solutions.
Kernel Physical Map
The amd64 version of OpenSolaris uses a large section of kernel VA to map all of physical memory. This region is called seg_kpm. It's used by the kernel to quickly access any physical address without having to manipulate page tables. For example, any given physical address (PA) is mapped at kpm_vbase + PA. These mappings are normally writable. In order to deal with Xen's limitations on writable mappings (page tables and descriptor tables must be read only), I had to rework quite a bit of the code that constructs and manages the seg_kpm mappings. In particular the kpm mappings are now constructed as read only at first. Later in startup we make them writable where possible. Code was added to the page table allocation and free paths, as well as the network driver path, to deal with kpm mappings that either have to become read-only or must be removed or changed.
In an OpenSolaris domain, the kernel uses a PFN (Page Frame Number, of type pfn_t) to refer to a pseudo-physical address. To get the actual Machine Frame Number (MFN, of type mfn_t), you must go through the mfn_list array, which is coordinated with Xen itself. Only a very few parts of Solaris actually have to know about MFN values. The HAT -- which manages page tables -- and the I/O DMA code are the major ones. We didn't want to add knowledge about MFNs to all the places and interfaces that deal with PFN values, so we introduced a mechanism where an MFN value can be assigned a PFN value and then passed around through the system as a PFN. We call these foreign PFNs or foreign pages. The MFN is turned into a PFN by setting a high order bit in the MFN value that puts it outside the range of an ordinary PFN. This lets us pass MFN values masquerading as PFNs through lots of driver (and other) interfaces without changing the interfaces or code.
The other interesting change around foreign pages was in the HAT. When tearing down a virtual mapping, the HAT needs to determine from the page table entry whether the mapping was to a pseudo-physical PFN or to a foreign MFN. At first I used one of the soft bits in the PTE to indicate this. This worked well, since all the HYPERVISOR interfaces allow for setting the non-PFN related PTE bits. Unfortunately the Xen grant table interfaces for creating mappings are not as well thought out. They provide no access to the PTE soft bits, caching bits, or NX bit. As such we had to fall back to attempting to guess from the value of the MFN whether this is a pseudo-physical PFN or not.
I call it guessing, as the code could get an incorrect answer. The idea is to take the MFN from the PTE, then check if mfn_list[machine_to_phys_mapping[mfn]] returns the same value as mfn. If the starting mfn is for a device mapping, then the lookup can access a memory location that is outside the bounds of machine_to_phys_mapping. That memory access could cause an unexpected pagefault, manipulation of a device or some other very bad thing. Unfortunately the existing Xen interfaces don't provide a defined way to know the limit of machine_to_phys_mapping. I've put in some heuristic checks, but they can't be 100% accurate. A simple change to the Xen interfaces to export the size of the table would solve the problem. I've mentioned this to Keir and hope he'll add it soon. Note that the Linux code that does a similar operation suffers from the same shortcomings.
The crippled grant table mapping interface also caused me to change the way that the HAT tracks page table usage. The HAT used to keep accurate counters of how many valid entries each page table had. It turns out those counters weren't used on any performance-critical paths, but were difficult to maintain in the face of having to use special interfaces (like the grant table ops) to establish and tear down mappings. I decided to take out the counters and recompute the values on the fly where needed. That code is very new and still has a lurking bug that I'm trying to diagnose. Hence if you run a domain with low memory and the pageout daemon starts actively paging, you're likely to see a panic or ASSERT failure somewhere in the HAT code.
All that said, I need to stop blogging and get back to working on those bugs....