Now that more of the boot code has made its way into the OpenSolaris tree, I can finally say a few words about the page translation mechanisms that make booting a 64-bit amd64 system possible, a topic Tim Marsland first alluded to back in May (!).

Essentially, the booter is a 32-bit binary, but the 64-bit kernel needs the booter to perform initial memory allocations for it until it can take over and do things for itself.

When the kernel needs to allocate memory for itself before its own allocation mechanism is set up, it calls the BOP_ALLOC() macro, which ends up calling the booter's registered bsys_alloc() routine, bkern_alloc(), via the bootops callback mechanism.

But this has an inherent problem: memory allocations are performed by the 32-bit booter, and the resulting mappings are made in the booter's 32-bit page tables. How can the 64-bit kernel use those mappings?

The answer is to use a translation layer interposed between the two, often called a thunking layer, and have the "thunking" code translate any 32-bit memory allocations made on the kernel's behalf into entries in the kernel's 64-bit page tables. The routines that do all this are now available for your viewing pleasure in the OpenSolaris source tree under /on/usr/src/psm/stand/boot/amd64, with the real meat located in the files ptops.c and ptxlate.c.

Whenever a mapping is made in the 32-bit page tables on the kernel's behalf, the routine amd64_xlate_legacy_va() translates it into a 64-bit mapping before control returns to the 64-bit kernel. This code walks the 32-bit page tables for the allocated memory range, decodes the information the booter placed there and makes equivalent entries in the kernel's 64-bit page tables.

Now this may sound simple, but in practice it's rather a pain to get right, and errors in this type of code are always difficult to find and debug, especially since when they occur the most common failure modes are:

  • Random corruption of memory
  • A hard hang of the machine
  • A complete reset of the machine
None of these is especially diagnosable without a good simulator and an excellent kernel debugger, like kmdb.

The heart of this code is the routine amd64_tbl_lookup(). Since 32-bit and 64-bit page tables are very closely related, by manipulating bit shift amounts and page table index sizes, we can use a single routine to do all needed lookups, whether in 32-bit or 64-bit page tables.

Looking at struct amd64_mmumode:

    typedef struct amd64_mmumode {
            uint8_t shift_base;     /* shift to start of page tables */
            uint8_t level_shift;    /* shift between page table levels */
            uint8_t map_level;      /* mapping level for AMD64_PAGESIZE pages */
            uint16_t tbl_entries;   /* number of entries per table level */
    } amd64_mmumode_t;
and the values in the mmus array:
    static amd64_mmumode_t mmus[] = {
            { 22, 10, 2, 1024 },    /* legacy 32-bit mode */
            { 39, 9, 4, 512 }       /* long 64-bit mode */
    };
you can begin to see how this works. For example, for 64-bit page tables, you shift a virtual address right by shift_base (39) bits to form an index into the level 1 (topmost) page table; each successive level's index is formed by shifting the virtual address level_shift (nine) fewer bits to the right; and a full 64-bit page table has map_level (four) levels.

(The tbl_entries field? Well, it's a bit of historical cruft left over from initial debugging that saved humans from having to calculate the value (1 << level_shift).)

In addition to the on-the-fly translation performed after each memory allocation, two routines are responsible for creating the initial 64-bit page tables used to bring the system up into 64-bit mode for the first time: amd64_init_longpt() and amd64_xlate_boot_tables().

You'll see that amd64_init_longpt() creates the 64-bit page tables with a single Page Map Level-4 Offset table (pml4), two Page-Directory Pointer Offset tables (pdpt and pdpt_hi), four contiguous Page-Directory Offset tables (pdt) and a full page table (pte_zero).

What's interesting here is what we decided to do to make memory operations easier: the entire 4G of the 32-bit address space is "reflected" into the top 4G of the 64-bit address space. This means that if the booter mapped an entry at virtual address 0x1000 in 32-bit space, the kernel should be able to access that memory at the 64-bit virtual address 0xffffffff00001000. The initial page tables are set up to do just that: to provide just enough page table entries to map an initial block of memory which boot has identity mapped (physical address = virtual address, 1:1) and the top 4G of the 64-bit address space.

(You may notice that we start our identity mapped pages at 0x1000; what about the low page of memory, from 0x0 - 0xfff? That range is purposely left unmapped to catch NULL pointer references.)

One optimization we make is to identity map the range from 0x200000 through magic_phys (currently set to 15M) using large (2M) pages. While that does map an "extra" 1M of memory, the magic_phys address is only significant internally within the booter, so it doesn't really matter if an extra 1M of memory is mapped. Note also that we save a considerable amount of memory by using 2M pages compared to mapping the range with conventional 4K pages, so it's all worth it.

Then, when all this is done, amd64_xlate_boot_tables() is called to convert all the mappings the booter has set up to date into 64-bit mappings suitable for use by the kernel.

So to sum up all this, when the booter is ready to hand off execution to the amd64 kernel, it calls amd64_handoff(), which in turn calls amd64_makectx64(), which then calls amd64_init_longpt() to set up the initial 64-bit page tables.

To give you a flavor of what the kernel needs to do to perform memory allocations until it is capable of handling things itself, it will:

  1. Trap back into the booter via the bootops callback mechanism.
  2. Begin execution in __vtrap_common(), which will switch the CPU from 64-bit mode back into 32-bit mode. Once the CPU is back in 32-bit mode, the code will call
  3. amd64_vtrap(), which will call
  4. amd64_invoke_bootop(). There memory will be allocated via the BOP_ALLOC() macro, which will call bkern_alloc() to allocate space in the 32-bit page tables the booter operates with.

    When that routine returns, control will return to
  5. amd64_invoke_bootop(), which will call
  6. amd64_xlate_legacy_va() to translate the mapping into an entry in the kernel's 64-bit page tables. When amd64_invoke_bootop() returns, it will return to
  7. amd64_vtrap(), which will switch the CPU back into 64-bit mode before returning to the kernel, which can now use the new entries in its page tables provided by the page table translation mechanism.
Simple, isn't it? :-)