Thursday Nov 06, 2008

kmdb in a Solaris VirtualBox Guest

How to enter the Solaris kernel debugger in a running virtual machine.
[Read More]

Wednesday Feb 13, 2008

32 bit VirtualBox and binary install help

Update --

VirtualBox for 32 bit x86 OpenSolaris host is now available for download at virtualbox.org

32 bit VirtualBox

My previous blog entry gave instructions for building a 64 bit version of VirtualBox on OpenSolaris. To get the 32 bit version some simple changes have to be applied to the instructions.

  • Do this on a system booted in 32 bit mode.
  • In the configure step for libqt, change "solaris-g++-64" to "solaris-g++".
  • Every place that says "/opt/qt64" should instead be "/opt/qt32", or whatever you'd rather use.

Binary Install Instructions

Also, people have been asking for directions on how to install the 64 bit binary download for OpenSolaris that is available at virtualbox.org. This should work:

joe% gunzip VirtualBox-opensolaris-amd64-1.5.51-r28040-beta1.gz
joe% su
Password:
# pkgadd -d ./VirtualBox-opensolaris-amd64-1.5.51-r28040-beta1
...
# exit

Now you just need to modify your environment and run VirtualBox:

joe% setenv PATH /opt/VirtualBox:/opt/VirtualBox/qtgcc/bin:${PATH} 
joe% setenv LD_LIBRARY_PATH /opt/VirtualBox:/opt/VirtualBox/qtgcc/lib:${LD_LIBRARY_PATH} 
joe% VirtualBox
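
The commands above assume csh (hence setenv); with a Bourne-style shell (sh, ksh, bash) the equivalent would be roughly:

joe$ PATH=/opt/VirtualBox:/opt/VirtualBox/qtgcc/bin:$PATH
joe$ LD_LIBRARY_PATH=/opt/VirtualBox:/opt/VirtualBox/qtgcc/lib:$LD_LIBRARY_PATH
joe$ export PATH LD_LIBRARY_PATH
joe$ VirtualBox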

Tuesday Feb 12, 2008

VirtualBox on OpenSolaris

VirtualBox on OpenSolaris [Read More]

Wednesday Sep 19, 2007

Detecting Hardware Virtualization support for xVM

Yesterday, the xVM project team integrated the OpenSolaris code to enable using a Hypervisor on Solaris x64. This supports running multiple OS instances on a single box. If you want to run an operating system that has been paravirtualized, such as OpenSolaris or Linux, almost any CPU that supports regular OpenSolaris is good enough.

However, to run an operating system that hasn't been paravirtualized, say one from Microsoft, you'll need a recent Intel or AMD CPU that contains hardware virtualization support, and the BIOS in your system will have to enable (or not disable) that support.

The following Solaris program can be compiled and run to determine if HVM support will work when a system is booted under an xVM dom0. Note that this should be run when the system is not using the hypervisor.

/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */

/*
 * Test to see if Intel VT-x or AMD-v is supported according to cpuid.
 */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <stdio.h>
#include <ctype.h>

static const char devname[] = "/dev/cpu/self/cpuid";

#define EAX     0
#define EBX     1
#define ECX     2
#define EDX     3

int
main(int argc, char **argv)
{
        int device;
        uint32_t func;
        uint32_t regs[4];
        uint32_t v;
        int r;
        int bit;
        int nbits;

        /*
         * open cpuid device
         */
        device = open(devname, O_RDONLY);
        if (device == -1)
                goto fail;

        func = 0x0;
        if (pread(device, regs, sizeof (regs), func) != sizeof (regs))
                goto fail;

        if (regs[EBX] == 0x68747541 &&
            regs[ECX] == 0x444d4163 &&
            regs[EDX] == 0x69746e65) { /* AuthenticAMD */

                func = 0x80000001;
                r = ECX;
                bit = 2;
                nbits = 1;

        } else if (regs[EBX] == 0x756e6547 &&
            regs[ECX] == 0x6c65746e &&
            regs[EDX] == 0x49656e69) { /* GenuineIntel */

                func = 1;
                r = ECX;
                bit = 5;
                nbits = 1;

        } else {
                goto fail;
        }

        if (pread(device, regs, sizeof (regs), func) != sizeof (regs))
                goto fail;

        v = regs[r] >> bit;
        if (nbits < 32 && nbits > 0)
                v &= (1 << nbits) - 1;

        if (v)
                printf("yes\\n");
        else
                printf("no\\n");

        (void) close(device);
        exit(0);

fail:
        printf("no\\n");
        (void) close(device);
        exit(1);
}
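
To try it, save the source to a file (the name hvm_check.c below is just an example) and compile it with cc or gcc:

joe% cc -o hvm_check hvm_check.c
joe% ./hvm_check
yes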

Friday Jul 14, 2006

I've got spur(ious page fault)s that jingle, jangle, jingle...

Three Moments of Xen

I've been quiet for a long time -- busily working on the version of OpenSolaris that runs on Xen. Our team has come a long way in just 4 months. You can now run OpenSolaris domains under Xen as dom0, with OpenSolaris domUs too. The domains can be either 32 bit or 64 bit (if dom0 is 64 bit), and multiprocessor works too. We've also done some simple testing of Linux domUs running on a system with a Solaris dom0, and that seems to work. Save/restore and live migration of domUs mostly work as well. HVM (i.e. hardware-assisted virtualization) hasn't been tested yet, but it's one of the next things we'll be shaking out.

We've made the code drop to the outside world *very* early, so be prepared to stumble into quite a few bugs. We thought it would be best to let the rest of the world play with it now, rather than wait for us to polish off all the bugs. Now that the bulk of the coding is done, we hope to roll out mini-releases much more often and soon we should be able to host a live mercurial repository that gives the world full access to the code as it's done.

Check out the Xen community on OpenSolaris.org for the full scoop. Today I'll mention 3 interesting things I had to deal with over the last few months. Ok.. "interesting" like "beauty" is in the eye of the beholder.

Spurious Page Faults

One of the nastier problems we had in the last few months has been dealing with spurious page faults. In order to enforce isolation of the memory used by domains, Xen never allows an active page table in any domain to be writable. Spurious page faults are an artifact of the way that Xen deals with page table updates on multiprocessor domains. When a domain tries to modify a page table entry that maps a 4K page, Xen does the following:

  • Verifies that the page being written to is an active page table
  • Clears the PRESENT bit (bit 0) of the page directory entry that linked the page table into the current page table tree. This removes an entire 2Meg (or 4Meg for non-PAE mode) virtual region surrounding the virtual address mapped by the page table from the address space.
  • Remaps the page table being written to as writable
  • Returns to guest domain without reporting the pagefault
The guest domain can now update as many entries in this page table as it wants without having to trap into the hypervisor. This is meant to speed up operations such as fork(), where the guest OS wants to make several changes at once to the same page table.

The interesting bit is what happens if 2 or more CPUs simultaneously try to dereference memory that is within the now missing 2 (or 4) Meg region of VA. Both CPUs will pagefault and trap into Xen. One of them will acquire the domain's memory lock and do the following, while the other spins waiting for the lock:

  • Verifies that the faulting VA is in the current writable page table region.
  • Verifies that all the updated PTEs in the page table are allowable mappings.
  • Remaps the page table as not writable
  • Sets the PRESENT bit (bit 0) of the page directory entry that linked the page table into the current page table tree.
  • Drops the big memory lock
  • Returns to guest domain without reporting a pagefault
The 2nd CPU will then succeed at getting the lock, but since the region is no longer an active page table region, Xen doesn't know what to do about this page fault. So, it sends a pagefault on to the guest domain to deal with. We call this kind of a pagefault spurious, since from the guest's perspective it looks like there is no reason that the pagefault should have happened. Hence, Xen currently puts the burden of determining that the pagefault was spurious on the paravirtualized guest OS.

To deal with this on Solaris we had to add a bit of code to cmntrap (in locore.s), trap.c and hat_i86.c to check if pagefaults were spurious. The basic idea is to check pagefaults that report a non-present kernel mapping and see if the actual page table entry is there or not. If the pagetable entry is there, we treat it as spurious and ignore the fault. Once this code was in, we started getting various failures in Xen, for example where the spurious fault happened inside Xen during a copyin/copyout operation. Keir Fraser promptly added some better code to Xen to deal with that after we e-mailed him about it.
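
A rough sketch of the check, with pte_lookup() standing in for the real HAT page table walk (this is not the actual trap.c code, just the shape of the idea):

#include <stdint.h>

#define	PT_VALID	0x1ULL	/* PRESENT bit in a page table entry */
#define	PF_ERR_P	0x1	/* P bit in the hardware error code */

/* stand-in for a HAT routine that walks the live page tables for va */
extern uint64_t pte_lookup(uintptr_t va);

/*
 * Return 1 if a kernel pagefault at 'va' is spurious: the hardware
 * reported a not-present fault, yet a walk of the page tables now finds
 * a valid mapping because Xen has relinked the page table. Such faults
 * are simply ignored and the faulting instruction retried.
 */
int
fault_is_spurious(uintptr_t va, uint32_t errcode)
{
	if (errcode & PF_ERR_P)		/* protection faults are never spurious */
		return (0);
	return ((pte_lookup(va) & PT_VALID) != 0);
}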

We ran quite happily with that fix for quite a bit, but eventually we discovered the following issue. When reporting a pagefault, Xen 3.0 passes the faulting address in a per-virtual cpu structure. Since the OS has to handle nested pagefaults, our trap code has to quickly save that value on the stack, so that a subsequent pagefault doesn't overwrite the old fault value. Unfortunately the attempt to write the value to the stack may cause a spurious page fault itself. Solaris has always ensured that the page containing the stack, the cpu structures, etc. are always mapped and won't recursively pagefault, but not the entire 2 or 4 meg regions surrounding those things. To avoid this we'd have to ensure that all of the data structures used in this code were not within 2 (or 4) Meg of a VA that could possibly have its mapping PTE changed. For Solaris this is not currently practical. Kernel stacks, cpu structures and other things are all dynamically allocated (and mapped) from the heap.

For now, our solution has fallen back to disabling some of the writable pagetable mechanism in Xen itself. In ptwr_do_page_fault() in xen/arch/x86/mm.c, our version of Xen has the "goto emulate;" statement compiled in -- i.e. we take out the #if 0. This means that rather than remapping page tables as writable, Xen just emulates the instruction that writes to the pagetable and returns to the guest OS. In the future we'll investigate other solutions.

Kernel Physical Map

The amd64 version of OpenSolaris uses a large section of kernel VA to map all of physical memory. This region is called seg_kpm. It's used by the kernel to quickly access any physical address without having to manipulate page tables. For example, any given physical memory address (PA) is mapped at kpm_vbase + PA. These mappings are normally writable. In order to deal with Xen's limitations on writable mappings (page tables and descriptor tables must be read-only), I had to rework quite a bit of the code that constructs and manages the seg_kpm mappings. In particular the kpm mappings are now constructed as read only at first. Later in startup we make them writable where possible. Code was added to the page table allocation and free paths as well as the network driver path to deal with kpm mappings that either have to become read-only or must be removed or changed.
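
Ignoring the read-only complications, the kpm translation is just an add; a minimal sketch (kpm_vbase is the variable named above, the helper name is made up):

#include <sys/types.h>

extern caddr_t kpm_vbase;	/* base of the seg_kpm region */

/* a physical address is visible to the kernel at kpm_vbase + PA */
static void *
kpm_pa_to_va(uint64_t pa)
{
	return ((void *)(kpm_vbase + pa));
}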

Foreign Pages

In an OpenSolaris domain, the kernel uses a PFN (Page Frame Number or type pfn_t) to refer to a pseudo-physical address. To get the actual Machine Frame Number (MFN or type mfn_t), you must go through the mfn_list[] array which is coordinated with Xen itself. Only a very few parts of Solaris actually have to know about MFN values. The HAT, which manages page tables, and the I/O DMA code are the major ones. We didn't want to add knowledge about MFNs to all the places and interfaces that deal with PFN values, so we introduced a mechanism where an MFN value can be assigned a PFN value and then passed around through the system as a PFN. We call these foreign PFNs or foreign pages. The MFN is turned into a PFN by setting a high order bit in the MFN value that puts it outside the range of an ordinary PFN. This lets us pass MFN values masquerading as PFNs through lots of driver (and other) interfaces without changing the interfaces or code.
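
As an illustration only (the bit position and names below are made up, not the actual OpenSolaris definitions), the encoding amounts to something like:

#include <stdint.h>

typedef uint64_t pfn_t;		/* pseudo-physical frame number (sketch) */
typedef uint64_t mfn_t;		/* machine frame number (sketch) */

/* hypothetical bit well above any real physical page number */
#define	PFN_IS_FOREIGN	(1ULL << 51)

/* wrap a machine frame number so it can travel through PFN interfaces */
static pfn_t
mfn_to_foreign_pfn(mfn_t mfn)
{
	return ((pfn_t)(mfn | PFN_IS_FOREIGN));
}

static int
pfn_is_foreign(pfn_t pfn)
{
	return ((pfn & PFN_IS_FOREIGN) != 0);
}

static mfn_t
foreign_pfn_to_mfn(pfn_t pfn)
{
	return ((mfn_t)(pfn & ~PFN_IS_FOREIGN));
}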

The other interesting change around foreign pages was in the HAT. When tearing down a virtual mapping, the HAT needs to determine from the page table entry if the mapping was to a pseudo-physical PFN or to a foreign MFN. At first I used one of the soft bits in the PTE to indicate this. This worked well, since all the HYPERVISOR interfaces allowed for setting the non-PFN related PTE bits. Unfortunately the Xen grant table interfaces for creating mappings are not as well thought out. They provide no access to the PTE soft bits, caching bits, nor NX bit. As such we had to fall back to attempting to guess from the value of the MFN whether this is a pseudo-physical PFN or not.

I call it guessing, as the code could get an incorrect answer. The idea is to take the MFN from the PTE then check if mfn_list[machine_to_phys_mapping[mfn]] returns the same value as mfn. If the starting mfn is for a device mapping, then it can access a memory location that is outside the bounds of machine_to_phys_mapping[]. That memory location could cause an unexpected pagefault, manipulation of a device or some other very bad thing. Unfortunately the existing Xen interfaces don't provide a defined way to know the limit of the machine_to_phys_mapping[]. I've put in some heuristic checks, but they can't be 100% accurate. A simple change to the Xen interfaces to export the size of the table would solve the problem. I've mentioned this to Keir and hope he'll add it soon. Note that the Linux code that does a similar operation suffers from the same shortcomings.
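
In outline the check looks something like this. The table names come from the Xen interface; max_m2p_entries and max_pfn stand in for the heuristic bound and the domain's page count, since Xen doesn't export the real table size:

#include <stdint.h>

typedef uint64_t pfn_t;		/* pseudo-physical frame number (sketch) */
typedef uint64_t mfn_t;		/* machine frame number (sketch) */

extern pfn_t machine_to_phys_mapping[];	/* Xen: machine -> pseudo-physical */
extern mfn_t *mfn_list;			/* pseudo-physical -> machine */
extern pfn_t max_pfn;			/* pages owned by this domain */
extern mfn_t max_m2p_entries;		/* heuristic limit, see text */

/*
 * Guess whether 'mfn' refers to one of this domain's pseudo-physical
 * pages. As discussed above, this can be fooled.
 */
static int
mfn_is_ours(mfn_t mfn)
{
	pfn_t pfn;

	if (mfn >= max_m2p_entries)	/* don't walk off the m2p table */
		return (0);
	pfn = machine_to_phys_mapping[mfn];
	if (pfn >= max_pfn)
		return (0);
	return (mfn_list[pfn] == mfn);
}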

The crippled grant table mapping interface also caused me to change the way that the HAT tracks page table usage. The HAT used to keep accurate counters for how many valid entries each page table had. It turns out those counters weren't used on any critical performance paths, but were difficult to maintain in the face of having to use special interfaces (like the grant table ops) to establish and tear down mappings. I decided to take out the counters and recompute the values on the fly where needed. That code is very new and still has a lurking bug that I'm still trying to diagnose. Hence if you run a domain with low memory and the pageout daemon starts actively paging, you're likely to see a panic or ASSERT failure somewhere in the HAT code.

All that said, I need to stop blogging and get back to working on those bugs....

Monday Feb 13, 2006

OpenSolaris on Xen

For the past year, I've been busy on the team that is porting OpenSolaris to run as a fully para-virtualized domain under the Xen hypervisor. The areas I've been concentrating on are changes to virtual and physical memory management and the mechanisms by which OpenSolaris gets loaded and started, aka boot.

Memory Management under Xen

The changes to physical memory management translate what OpenSolaris calls a Page Frame Number (PFN or pfn_t) into Machine Frame Numbers (MFNs) under Xen before using them in page tables, descriptor tables or programming DMA. Under Xen addresses derived from PFNs are referred to as pseudo-Physical addresses and are used in the kernel with the existing type paddr_t. Note that not all MFN values the kernel sees can be translated into PFNs, so a way to distinguish them was needed. Several routines were added to the kernel to deal with these translation issues.

The changes to virtual memory management are primarily around:

  • The HAT must translate PFNs into MFNs when creating page table entries and do the reverse translation, MFN to PFN, when examining pagetables.
  • Xen requires that page tables that are in active use be mapped read-only. The code to access page tables in the HAT is now aware of when it should be using read-only mappings.
  • The algorithm used for TLB shootdowns changed. Xen provides a single interface to simultaneously change a page table entry and invalidate TLB entries (see the sketch just after this list). To reduce the differences between Xen and non-Xen code, the HAT code was restructured.
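
A minimal sketch of that single interface, using the hypercall and flag names from the Xen 3.0 public headers (declarations, error handling and the surrounding HAT context are omitted):

/*
 * Change the mapping for 'va' to 'new_pte' and invalidate the stale
 * TLB entry on every CPU in the domain, all in one hypercall. On bare
 * hardware this would be a PTE write followed by cross-call driven
 * INVLPG instructions.
 */
static void
remap_and_shoot_down(uintptr_t va, uint64_t new_pte)
{
	if (HYPERVISOR_update_va_mapping(va, new_pte,
	    UVMF_INVLPG | UVMF_ALL) != 0)
		panic("HYPERVISOR_update_va_mapping failed");
}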

Some kmdb dcmds have been modified and new ones introduced to help manage the difference between PFN and MFNs during kernel development or crash analysis.

Booting the Kernel

The changes to the way OpenSolaris boots were extensive and complicated. The goal was to make the boot time code used on plain hardware and the code used under Xen as similar as possible. As part of that approach we decided to eliminate the separate boot loader found in /boot/multiboot altogether.

Review of Pre-Xen Boot

As a refresher, the pre-Xen version of OpenSolaris gets into memory in the following way on x64 hardware:

  • Grub is used to load the /boot/multiboot program and the boot_archive into memory.
  • multiboot then determines which version of unix in the boot_archive to boot based on what sort of hardware (32 or 64 bit) is present and any command line information passed to it in the menu.lst file.
  • multiboot builds an initial set of 32 bit page tables to enable it to load the unix executable at the appropriate place in virtual memory as described by the unix ELF file. When booting the 64 bit kernel, an optional 2nd layer is used to automatically double map the 32 bit virtual memory into the top of 64 bit virtual memory.
  • The "unix" executable is rather incomplete (ie. it won't run by itself) but has embedded in it a PT_INTERP section that points to the krtld (kernel runtime loader) module. multiboot combines krtld from the boot_archive with unix as it loads both into memory.
  • Execution actually starts in krtld. Additional modules needed by the kernel are loaded by krtld from the boot_archive. Once the kernel is complete enough to run, execution in the kernel finally begins.
  • multiboot continues to be used, via the BOP_*() interfaces, to manage virtual memory and console I/O until the kernel has initialized itself enough to take over.

New Approach to Boot

This seemed like a lot of code to port to Xen, especially since multiboot effectively is just a memory allocator and ELF file decoder. An additional problem was that multiboot was very much a 32 bit program, but on amd64 platforms the Xen domain is always entered in 64 bit mode. A lot of tedious clean up work would be required to make multiboot even compile, let alone work, as a 64 bit program. We decided to make the following changes to the way in which we build unix:

  • Link krtld (as well as enough other code) into the unix ELF file at build time. Hence, there is no more PT_INTERP section in unix.
  • We rely on grub to load the unix file directly. For amd64 kernels this relies on grub's a.out hack code to load the 64 bit ELF based on an embedded multiboot header.
  • The unix ELF file's text and data segments now have explicitly specified physical load addresses which are at 4 Meg and 8 Meg.
  • A third loadable segment was added to the unix ELF file. The code in this segment is compiled to load and run at address 12 Meg. The code always runs as a 32 bit executable on hardware, but is native when under Xen. It contains the ELF (or multiboot header) specified entry point. We call this code "dboot", short for Direct Boot.
If you want to understand these changes more completely, you can read the OpenSolaris makefiles (both Xen and pre-Xen). Another way to compare them is to run elfdump(1) on the unix files that result.

Using this new version of the unix file, the following happens at boot:

  • Grub loads the UNIX file, either as 32 bit ELF or 64 bit using the a.out hack and transfers control to the dboot code.
  • The dboot code builds page tables that exactly match what the booted kernel (64 bit, 32 bit PAE or 32 bit non-PAE) will use. The page table entries include mappings for the kernel text and data at the correct high virtual memory addresses.
  • For non-Xen, dboot activates paging mode
  • The dboot code finally jumps into unix kernel text.
  • The entry point in unix, _start, is provided by i86pc/os/fake_bop.c. As the name implies, this is kernel code which emulates the old BOP_*() interfaces that the rest of kernel startup relies on.

This new boot approach is much smaller and simpler. It also removes many artificial restrictions that startup.c had to deal with, like a 32 bit allocator in the 64 bit kernel. You can read more about these in Nils' blog.

As an additional clean up, the code to manage console I/O and to deal with boot time page table and memory management was made "common" source between the dboot code and what the kernel needed in early startup.

The big benefit for the Xen port was that the dboot code was easy to port to Xen. Since much of the code is now common between dboot and the rest of the kernel, it was designed to work from the beginning in a 64 bit environment.

menu.lst changes

The new way of booting requires you to specify the kernel you want to boot explicitly in your grub menu.lst file. You can see more of what is going on by adding prom_debug=true,kbm_debug=true to your menu.lst file. This is done with the -B option, as in the following examples:

title 32 bit OpenSolaris with boot time debug output
kernel /platform/i86pc/kernel/unix -B prom_debug=true,kbm_debug=true
module /platform/i86pc/boot_archive

title 64 bit OpenSolaris no debug output, but console I/O to serial port
kernel /platform/i86pc/kernel/amd64/unix -B console=ttya
module /platform/i86pc/boot_archive

Under Xen you include these settings in your domain builder configuration file in the "extra" property.
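
For example, a domU configuration file for the python domain builder might contain lines like the following (the kernel paths match the grub entries above; the rest of the file, such as memory and disk settings, is omitted):

kernel = "/platform/i86pc/kernel/amd64/unix"
ramdisk = "/platform/i86pc/boot_archive"
extra = "-B console=ttya,prom_debug=true,kbm_debug=true"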

Tuesday May 10, 2005

More on Solaris x86 and page tables

Tim Marsland has started blog entries about the project to port Solaris to 64 bit x86. For my part, here's more low level detail about the Solaris 10 x86 page table management.

One issue that any x86 Operating System has to deal with is how to manage software access to page tables. The hardware does page table lookups using physical (not virtual) addresses. However, in order for an OS to create, modify or remove page table entries it has to have the page table mapped in virtual memory.

Solaris 9 stored page tables in the "user" part of virtual address space. Whenever the kernel had to access a pagetable entry, it would change %cr3 once to switch to the page table address space and then again to get back to the original address space. One of the ramifications of changing %cr3 on x86 is that the entire contents of the TLB may be invalidated.

In Solaris 10 we take a different approach to minimize the impact of page table accesses on the TLB. The kernel maintains 4K page aligned peep holes which are remapped on demand to access pagetables. Remapping a single page requires one INVLPG instruction which can be much quicker than an entire address space change, TLB flush and subsequent TLB reloads. Solaris allocates a unique peep hole for each CPU, to avoid contention or interference between CPUs. To use the peep hole, the HAT does:

  • Disable thread preemption, so it won't migrate to a different CPU.
  • Acquire a per-peep-hole spin lock, to avoid conflicting with interrupt code
  • If the peep-hole doesn't already point to the desired physical page
    • Update the peep hole's PTE to point to the new page.
    • Issue an INVLPG for the peep hole address
  • Access the desired page table entry (usually one XCHG instruction)
  • Release the spin lock
  • Enable preemption

The spin lock comes into play when an interrupt happens during a page table access and the interrupt code also has to access a pagetable. In that case the spin lock acquisition allows the interrupt thread to yield back to the interrupted, or pinned, thread to allow it to free up the peep hole. The code to disable and enable preemption is very quick. On Solaris it's just an increment of a thread flag for disable and an increment/compare for enable.
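
In rough pseudo-C, the access sequence in the list above looks something like this (the cpu_peephole_* fields and the MAKE_PTE() helper are invented for the sketch; they are not the real names in the HAT source):

/*
 * Read one page table entry from the page table whose physical page
 * is 'pt_pfn', going through this CPU's private peep hole mapping.
 */
static x86pte_t
peephole_get_pte(pfn_t pt_pfn, uint_t entry)
{
	x86pte_t value;
	cpu_t *cpu;

	kpreempt_disable();			/* stay on this CPU */
	cpu = CPU;
	lock_set(&cpu->cpu_peephole_lock);	/* keep out interrupt threads */

	if (cpu->cpu_peephole_pfn != pt_pfn) {
		/* retarget the peep hole and flush just that one mapping */
		*cpu->cpu_peephole_pte = MAKE_PTE(pt_pfn, PT_VALID | PT_WRITABLE);
		mmu_tlbflush_entry(cpu->cpu_peephole_va);
		cpu->cpu_peephole_pfn = pt_pfn;
	}

	value = ((x86pte_t *)cpu->cpu_peephole_va)[entry];

	lock_clear(&cpu->cpu_peephole_lock);
	kpreempt_enable();
	return (value);
}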

When running the 64 bit kernel on a processor that has much more virtual address space than actual physical memory, this is all much easier. The kernel maintains a region of virtual address space that is mapped 1:1 to physical addresses called seg_kpm (kernel physical map). The pagetable code uses addresses in seg_kpm to access page tables instead of using the peep hole. This saves executing a lot of code and is much faster. One of the many benefits of a 64 bit operating system.

The page tables for certain special purpose parts of the kernel address space are always maintained in virtual memory. This includes the PTEs that map peep holes as well as something called segmap which is used frequently in I/O transactions.

One note for anybody looking at the source once OpenSolaris hits the streets: the code confusingly calls the peep holes "windows" - for the purpose of this blog, the term peep hole seemed better. Maybe I'll get the code changed to match.

A final optimization to mention here is that the Solaris 32 and 64 bit kernels avoid allocating overhead pagetables for 32 bit user processes when using PAE. Since a 32 bit user process has at most 4 page table entries at level 2, the HAT stores the entries in part of the address space data structure. When a thread using that address space starts to run on a CPU, the 4 entries are copied to a per-cpu set of pagetables at the start of the current level 2 page table. This saves approximately 1 page for each 32 bit process in the 32 bit kernel and 3 pages for each 32 bit process in the 64 bit kernel using a consistent mechanism.

Wednesday Apr 13, 2005

Mad Hatters

This is the start of a series of posts on my role in the x86 64 bit Solaris port. They'll be in several pieces, as I'll never get time to explain the whole thing at once. My largest single contribution to Solaris 10 was a complete rewrite of something called the x86 HAT layer. Though I started out alone, Nils Nieuwejaar later joined in and contributed some major parts of the effort as well as some much needed comic relief.

The HAT provides interfaces to the "common" Solaris virtual memory code that manage architectural dependent things like page tables and page mapping lists. If you're not pretty familiar with how x86 page tables look, the rest of these posts will make about as much sense as a gaggle of honking geese. A good reference is the AMD x86-64 Architecture Programmer's Manual, Volume 2 System Programming, Chapter 5 Page Translation and Protection.

You'll eventually also need to know a little bit about the HAT. The major interfaces exported by a Solaris HAT are:

  • hat_memload(address_space, virt_addr, phys_page, permissions, etc) - loads a translation for the given virtual address to the given physical page for an address space.
  • hat_devload(address_space, virt_addr, phys_addr, etc) - similar to above but generally used for device memory.
  • hat_memload_array(address_space, virt_addr, phys_page list, etc) - similar to hat_memload(), but allows for multiple or large page mappings in a single call.
  • hat_unload(address_space, virt_addr, length) - undoes the above, ie. removes mappings from an address space
  • hat_pageunload(phys_page) - given a physical page, remove all virtual mappings to that page from all address spaces.

The previous x86 HAT's design was rather tied up in the requirements of running in a 32 bit address space on small memory PCs that were typical of the early/mid 1990s. It contained quite a bit of special code to deal with memory allocation and address space manipulations in order to have large amounts of page tables and mapping list data structures even though the normal kernel virtual address range is limited to the top 1 Gigabyte (or so) of memory. The idea of 2 levels of page tables was pretty much hard coded into it with some sleight of hand #ifdef-ing to have the partial 3rd table needed in PAE mode.

At the start of the project we planned to just extend the old HAT code in order to get running in 64 bit mode as quickly as possible. The project started rather late in the release cycle for Solaris 10 and had a very tight schedule to meet. We expected to go back later and possibly rewrite much of the HAT for better 64 bit performance for Solaris 10 updates.

After a week of looking at what needed to be done, I proposed writing all new code from the start. If you're a kernel developer, your reaction to that statement should be the same as my project leaders' was at that time: that it was crazy to propose such a risky approach. But it made sense due to a new design for the HAT that made the code neutral to 32 bit non-PAE vs 32 bit PAE vs 64 bit environments. The new HAT would execute almost all the same code paths in all modes. Hence, I could write and debug it in the existing stable 32 bit version of Solaris with a reasonable expectation that the code should just recompile and work in the 64 bit environment. I'd be able to start coding and testing immediately, while the rest of the amd64 team was still working on other startup issues, like 64 bit compilers, boot loaders and other tasks.

The new design idea was to encode all the parameters about the paging hierarchy (ie, page tables, page directories, page directory pointer tables and page-map Level-4 tables) into an mmu description structure. The mmu description would be filled in early at boot once Solaris determines what mode the processor will run in. The HAT then always interprets this description when manipulating page tables.
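
For illustration, the description might carry fields along these lines; this is a sketch inferred from how the pseudo-code below uses it, not the exact structure in the source:

struct mmu_description {
	int		top_level;		/* 3 for 64 bit, 2 for PAE, 1 for non-PAE */
	int		top_pagesize_level;	/* highest level that can map a large page */
	uint_t		shift[4];		/* VA shift down to each level's index */
	uint_t		index_mask[4];		/* mask applied to each level's index */
	uint64_t	is_page_mask[4];	/* PAGESIZE bit, where large pages are legal */
};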

To illustrate the difference, I'll show some pseudo-code for a mythical HAT function which looks for a PTE for a given virtual address by walking down the page table hierarchy. I've tried to use "variable" names to make the code self-explanatory. First the old code if it were extended to 64 bits in the most obvious fashion:



#if defined(__amd64) || defined(PAE)
typedef uint64_t pte_t;
#else
typedef uint32_t pte_t;
#endif


pte_t
hat_probe(caddr_t address)
{
	pte_t pte;
	uintptr_t va = (uintptr_t)address;
	uint_t index;
	pte_t *ptable;

	ptable = find_top_table(current_addr_space);
	ASSERT(ptable != NULL);

#if defined(__amd64)
	/*
	 * 64 bit mode uses 4 levels of page tables
	 * MMU_PML4_SHIFT is 39
	 * MMU_PML4_MASK is (512 - 1)
	 */
	index = (va >> MMU_PML4_SHIFT) & MMU_PML4_MASK;
	pte = ptable_extract(ptable, index);
	if (pte == 0)
		return 0;
	ptable = find_PDP_table(pte & MMU_PAGEMASK);
	ASSERT(ptable != NULL);
#endif

#if defined(__amd64) || defined(PAE)
	/*
	 * 3rd level of pagetables.
	 * MMU_PDP_SHIFT is 30
	 * MMU_PDP_MASK is either (512 - 1) or (4 - 1)
	 */
	index = (va >> MMU_PDP_SHIFT) & MMU_PDP_MASK;
	pte = ptable_extract(ptable, index);
	if (pte == 0)
		return 0;
	ptable = find_PD_table(pte & MMU_PAGEMASK);
	ASSERT(ptable != NULL);
#endif

	/*
	 * 2nd level page tables
	 * MMU_PD_SHIFT is either 21 or 22
	 * MMU_PD_MASK is either (512 - 1) or (1024 - 1)
	 */
	index = (va >> MMU_PD_SHIFT) & MMU_PD_MASK;
	pte = ptable_extract(ptable, index);
	if (pte == 0)
		return 0;
	if (pte & PT_PAGESIZE_BIT)
		return pte;
	ptable = find_PT_table(pte & MMU_PAGEMASK);
	ASSERT(ptable != NULL);

	/*
	 * Lowest level page table
	 * MMU_PT_SHIFT is 12
	 * MMU_PT_MASK is either (512 - 1) or (1024 - 1)
	 */
	index = (va >> MMU_PT_SHIFT) & MMU_PT_MASK;
	pte = ptable_extract(ptable, index);
	return (pte);
}

Under the new scheme the same interface looks like this:



typedef uint64_t pte_t;
typedef void *ptable_t;
struct mmu_description {...} mmu;

pte_t
hat_probe(caddr_t address)
{
	pte_t pte;
	uintptr_t va = (uintptr_t)address;
	uint_t index;
	int level;
	ptable_t ptable;

	for (level = mmu.top_level; level >= 0; --level) {
		ptable = ptable_lookup(va, level, current_addr_space);
		ASSERT(ptable != NULL);
		index = (va >> mmu.shift[level]) & mmu.index_mask[level];
		pte = ptable_extract(ptable, index);
		if (pte == 0)
			return 0;
		if ((pte & mmu.is_page_mask[level]) != 0)
			return pte;
	}
	return 0;
}

The new code has a small amount of additional looping and memory reference overhead in exchange for its compactness and improved extensibility. If a future processor adds additional large pagesizes or more pagetable levels, the mmu description might change, but this code would just work. Another thing to note is that you would probably change the new version to use:

	for (level = mmu.top_pagesize_level; level >= 0; --level) {

as the loop boundaries to improve its performance.

The important thing for the project at the time was that the old style 64 bit code couldn't have been tested until we had a 64 bit kernel partially working. With the new style code, we could do a lot of testing on 32 bit platforms, long before any other part of the 64 bit kernel was ready, and be fairly confident that the code was correct. In the end this proved to be a great choice as the 64 bit HAT was almost never on the critical path for code development.

Saturday Oct 02, 2004

Baby's First Weblog

Hello, world! This is Joe Bonasera. I've been a software engineer in the Solaris Kernel Group for 4 1/2 years working mostly on core Virtual Memory support. My team's major project for Solaris 10 integrated this week, so I have some extra time available to initiate a weblog.

That project was the amd64 porting effort to allow Solaris 10 and applications to run in 64 bits on x86 platforms with either AMD 64 or Intel EM64T processors. To get the Solaris terminology right from the start:

  • x86 refers to the computer platform, be it a 32 or 32/64 bit capable system
  • i386 is for 32 bit OS, programs, ABI's, ... on x86
  • amd64 is for 64 bit OS, programs, ABI's, ... on x86

The "i" in i386 applies and "amd" in amd64 applies no matter who makes the CPU. There'll be much more news about Solaris and amd64 in coming posts.

I came to Sun after working on various types of software (mostly optimizing compilers and some parallel database engine work), so I have a long running interest in software performance and scalability issues in high performance computing. Some of my posts will include information about those issues for large memory programs on Solaris.
