More on Solaris x86 and page tables
By JoeBonasera on May 10, 2005
Tim Marsland has started blog entries about the Solaris project to port to x86-64. For my part, here's more low-level detail about Solaris 10 x86 page table management.
One issue that any x86 Operating System has to deal with is how to manage software access to page tables. The hardware does page table lookups using physical (not virtual) addresses. However, in order for an OS to create, modify or remove page table entries it has to have the page table mapped in virtual memory.
Solaris 9 stored page tables in the "user" part of virtual address space. Whenever the kernel had to access a pagetable entry, it would change %cr3 once to switch to the page table address space and then again to get back to the original address space. One of the ramifications of changing %cr3 on x86 is that the entire contents of the TLB may be invalidated.
In Solaris 10 we take a different approach to minimize the impact of page table accesses on the TLB. The kernel maintains 4K, page-aligned peep holes which are remapped on demand to access pagetables. Remapping a single page requires one INVLPG instruction, which can be much quicker than an entire address space change, TLB flush and subsequent TLB reloads. Solaris allocates a unique peep hole for each CPU, to avoid contention or interference between CPUs. To use the peep hole, the HAT (the kernel's hardware address translation layer) does the following:
- Disable thread preemption, so the thread won't migrate to a different CPU.
- Acquire a per-peep-hole spin lock, to avoid conflicting with interrupt code.
- If the peep hole doesn't already point to the desired physical page:
  - Update the PTE for the peep hole to point to the new page.
  - Issue an INVLPG for the peep hole address.
- Access the desired page table entry (usually one XCHG instruction).
- Release the spin lock.
- Enable preemption.
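The steps above can be sketched as a small user-space simulation. This is only an illustration, not the Solaris code: every name here (peephole_xchg_pte, struct peephole, phys_mem, preempt_count) is made up for the example, the spin lock is a plain flag, and the PTE update plus INVLPG is modeled by repointing an ordinary C pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NPAGES   8      /* simulated physical pages */
#define NENTRIES 512    /* 64-bit entries in one 4K page table */

typedef uint64_t pte_t;

/* Simulated physical memory; each page holds one page table. */
pte_t phys_mem[NPAGES][NENTRIES];

/* Hypothetical per-CPU peep hole state. */
struct peephole {
    int      locked;      /* stand-in for the per-peep-hole spin lock */
    int      mapped_pfn;  /* physical page currently mapped, -1 if none */
    pte_t   *va;          /* what the peep hole address currently resolves to */
    unsigned remaps;      /* number of simulated PTE update + INVLPG pairs */
};

int preempt_count;        /* stand-in for the thread's preemption counter */
struct peephole cpu_peephole = { 0, -1, NULL, 0 };

/* Store new_pte at entry `index` of the page table in physical page
 * `pfn` and return the previous entry, as one XCHG would. */
pte_t peephole_xchg_pte(struct peephole *ph, int pfn, int index, pte_t new_pte)
{
    pte_t old;

    preempt_count++;               /* disable preemption */
    assert(!ph->locked);           /* single-threaded sketch: never contended */
    ph->locked = 1;                /* acquire the spin lock */

    if (ph->mapped_pfn != pfn) {   /* remap only when the target page changes */
        ph->mapped_pfn = pfn;      /* update the PTE for the peep hole */
        ph->va = phys_mem[pfn];    /* models the effect of INVLPG */
        ph->remaps++;
    }

    old = ph->va[index];           /* access the desired entry */
    ph->va[index] = new_pte;

    ph->locked = 0;                /* release the spin lock */
    preempt_count--;               /* enable preemption */
    return old;
}
```

Two back-to-back updates to entries in the same page table only pay for one remap in this model, which is the point of checking whether the peep hole already maps the desired page.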
The spin lock comes into play when an interrupt happens during a page table access and the interrupt code also has to access a pagetable. In that case the spin lock acquisition allows the interrupt thread to yield back to the interrupted, or pinned, thread to allow it to free up the peep hole. The code to disable and enable preemption is very quick. On Solaris it's just an increment of a per-thread counter for disable and a decrement/compare for enable.
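To show why the preemption bookkeeping is so cheap, here is a minimal sketch of a counter-based scheme. It assumes the enable path decrements the counter and compares it against zero before honoring a deferred preemption request. The struct and function names are hypothetical, and the call into the dispatcher is replaced by a simple tally.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-thread state: nesting depth of preemption disables. */
struct thread {
    int preempt_depth;
};

/* Hypothetical per-CPU state. */
struct cpu {
    bool preempt_pending;  /* a preemption was requested while disabled */
    int  preemptions;      /* stand-in for calls into the dispatcher */
};

/* Disable: a single increment. */
void preempt_disable(struct thread *t)
{
    t->preempt_depth++;
}

/* Enable: decrement, then compare. Only the outermost enable may
 * act on a preemption that was deferred in the meantime. */
void preempt_enable(struct thread *t, struct cpu *c)
{
    assert(t->preempt_depth >= 1);
    if (--t->preempt_depth == 0 && c->preempt_pending) {
        c->preempt_pending = false;
        c->preemptions++;  /* the real kernel would switch threads here */
    }
}
```

Because the counter nests, code that disables preemption can safely call other code that does the same.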
When running the 64 bit kernel on a processor that has much more virtual address space than actual physical memory, this is all much easier. The kernel maintains a region of virtual address space, called seg_kpm (kernel physical map), that is mapped 1:1 to physical addresses. The pagetable code uses addresses in seg_kpm to access page tables instead of using the peep hole. This saves executing a lot of code and is much faster. One of the many benefits of a 64 bit operating system.
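With a 1:1 region like seg_kpm, finding the virtual address of a page table entry is plain arithmetic. The sketch below assumes the region is a linear mapping at a constant offset. KPM_BASE is an invented value, not the address Solaris uses, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define KPM_BASE   0xfffffe0000000000ULL  /* hypothetical start of the 1:1 region */
#define PAGE_SHIFT 12                     /* 4K pages */

/* Physical address -> virtual address inside the 1:1 region. */
uint64_t kpm_pa_to_va(uint64_t pa)
{
    return KPM_BASE + pa;
}

/* Virtual address inside the 1:1 region -> physical address. */
uint64_t kpm_va_to_pa(uint64_t va)
{
    assert(va >= KPM_BASE);
    return va - KPM_BASE;
}

/* Virtual address of entry `index` in the page table stored in
 * physical page frame `pfn`, assuming 8-byte entries. */
uint64_t kpm_pte_va(uint64_t pfn, unsigned index)
{
    return kpm_pa_to_va(pfn << PAGE_SHIFT) + (uint64_t)index * sizeof(uint64_t);
}
```

No lock, no remap and no INVLPG are involved, which is where the savings over the peep hole come from.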
The page tables for certain special purpose parts of the kernel address space are always maintained in virtual memory. This includes the PTEs that map peep holes as well as something called segmap which is used frequently in I/O transactions.
One note for anybody looking at the source once Open Solaris hits the streets: the code confusingly calls the peep holes "windows". For the purpose of this blog, the term peep hole seemed better. Maybe I'll get the code changed to match.
A final optimization to mention here is that the Solaris 32 and 64 bit kernels avoid allocating overhead pagetables for 32 bit user processes when using PAE. Since a 32 bit user process has at most 4 page table entries at level 2, the HAT stores the entries in part of the address space data structure. When a thread using that address space starts to run on a CPU, the 4 entries are copied to a per-CPU set of pagetables at the start of the current level 2 page table. This saves approximately 1 page for each 32 bit process in the 32 bit kernel and 3 pages for each 32 bit process in the 64 bit kernel using a consistent mechanism.
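As a rough sketch of that last step, the copy could look like the following. The struct layouts and names are invented for illustration and are not the actual HAT data structures. The per-CPU level 2 table is assumed to hold 512 eight-byte entries, of which only the first 4 belong to the user process.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define USER_TOP_ENTRIES 4    /* a 32 bit PAE process needs at most 4 */
#define PT_ENTRIES       512  /* entries in one 4K table of 8-byte PTEs */

typedef uint64_t pte_t;

/* Hypothetical address space structure: the 4 entries live here
 * instead of in a separately allocated page. */
struct addr_space {
    pte_t top_entries[USER_TOP_ENTRIES];
};

/* Hypothetical per-CPU pagetables; only the level 2 table is shown. */
struct cpu_pagetables {
    pte_t level2[PT_ENTRIES];
};

/* Called when a thread using `as` starts to run on this CPU: copy the
 * 4 user entries to the start of the CPU's level 2 table. */
void load_user_entries(struct cpu_pagetables *cpu, const struct addr_space *as)
{
    assert(cpu != NULL && as != NULL);
    memcpy(cpu->level2, as->top_entries, sizeof(as->top_entries));
}
```

Entries past the first 4 in the per-CPU table are left untouched by the copy, so anything else stored there survives a switch between 32 bit processes.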