Kernel Address Space Layout on x86/x64

To celebrate the launch of OpenSolaris, I thought I would spend a little time describing the layout of the kernel's virtual address space.

For reasons that will become clear, this cannot be done without first giving an overview of how a Solaris system is booted. (If you want to know any more about booting Solaris, you should read Tim Marsland's writeup. He talks a bit more about the mechanisms, rationale, and history of the Solaris boot process.)

As Joe and Kit have been (and presumably will be) discussing, there are a number of complications brought on by the limited amount of virtual address space available to the kernel in 32-bit mode. For most VM operations the 64-bit address space allows for a much simpler implementation. On the other hand, booting a 64-bit system is significantly more complicated than booting a 32-bit system.

When we boot an x86/x64 system, we do it in several stages. The primary boot code is an OS-independent routine in the BIOS that reads the OS-specific boot code. In Solaris, the code that the BIOS loads is called the 'primary bootstrap', which is responsible for reading the next stage of boot code (which we usually call the 'boot loader') from a fixed location on the disk or from the network. The boot loader has to identify the hardware in the system to find the boot device, send messages to the console or serial port, and so on. Once the hardware has been identified, the boot loader loads the Solaris kernel into memory and jumps to the first instruction of the _start() routine.

Sounds complicated, right? We're just getting started. It turns out that we need to keep the boot loader around for a while. Until we get a fair amount of initialization out of the way, the boot loader is the only one who knows which memory is available and how to talk to the various devices in the system. This means that any memory allocation, disk access, or console message requires calling back into the boot loader.

Having to call back into the boot loader is a minor nuisance on 32-bit systems, but on 64-bit systems it's a nightmare. The problem is that the OS is running in 64-bit mode, but the boot loader is a 32-bit program. So, every I/O access or memory allocation requires a mode-switch between 32-bit and 64-bit mode. Fortunately, implementing this particular nastiness fell to somebody else so I didn't have to think about it. What does affect us is that the 32-bit boot loader can only use the lower 4GB of the virtual address space. So, we are trying to load and execute a kernel in the top of a 64-bit address range, but the memory allocator only recognizes the bottom 4GB of memory.

Getting the two worldviews to make some kind of sense required a trick we called 'doublemapping'. Essentially, we would call into the 32-bit program to allocate a range of memory. The 32-bit program would reserve a bit of physical memory and set up the page tables to map that memory at address X in the bottom 4GB of virtual address space. During the transition from 32-bit mode back to 64-bit mode, we would map the same physical memory into the upper 4GB of the address space, at address 0xFFFFFFFF.00000000 + X. So a single piece of memory would be mapped at two different addresses. The boot loader lives in roughly the bottom 1GB of memory which, due to doublemapping, means it lives from 0x00000000.00000000 to 0x00000000.3FFFFFFF and from 0xFFFFFFFF.00000000 to 0xFFFFFFFF.3FFFFFFF.
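
The address arithmetic behind doublemapping can be sketched in a few lines of C. This is only an illustration of the "+ X" rule described above, not the actual boot code: the names DOUBLEMAP_BASE, FOUR_GB, and doublemap_high() are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Start of the upper 4GB of the 64-bit address space (value from the text). */
#define DOUBLEMAP_BASE	0xFFFFFFFF00000000ULL
/* Size of the window a 32-bit program can address. */
#define FOUR_GB		0x100000000ULL

/*
 * Hypothetical helper: given an address x handed back by the 32-bit
 * boot loader, return the second (high) virtual address at which the
 * same physical memory would also be mapped.
 */
static uint64_t
doublemap_high(uint64_t x)
{
	assert(x < FOUR_GB);	/* the boot loader only sees the low 4GB */
	return (DOUBLEMAP_BASE + x);
}
```

With these definitions, the last byte of the boot loader's 1GB, 0x3FFFFFFF, comes back as 0xFFFFFFFF.3FFFFFFF, matching the range quoted above.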

Finally, we get to discussing the actual layout of the kernel's address space. Early on in the AMD64 porting project we started with a blank sheet of paper and laid out a clean, sensible, elegant address space for Solaris on x64 systems. What you see below is what is left of the design once it ran headlong into the constraints and limitations imposed by the x86 and AMD64 architectures and by the Solaris boot process.

From startup.c:

 *		64-bit Kernel's Virtual memory layout. (assuming 64 bit app)
 *			+-----------------------+
 *			|	psm 1-1 map	|
 *			|	exec args area	|
 * 0xFFFFFFFF.FFC00000  |-----------------------|- ARGSBASE
 *			|	debugger 	|
 * 0xFFFFFFFF.FF800000  |-----------------------|- SEGDEBUGBASE
 *			|      unused    	|
 *			+-----------------------+
 *			|      Kernel Data	|
 * 0xFFFFFFFF.FBC00000  |-----------------------|
 *			|      Kernel Text	|
 * 0xFFFFFFFF.FB800000  |-----------------------|- KERNEL_TEXT
 * 			|     LUFS sinkhole	|
 * 0xFFFFFFFF.FB000000 -|-----------------------|- lufs_addr
 * ---                  |-----------------------|- valloc_base + valloc_sz
 * 			|   early pp structures	|
 * 			|   memsegs, memlists, 	|
 * 			|   page hash, etc.	|
 * ---                  |-----------------------|- valloc_base
 * 			|     ptable_va    	|
 * ---                  |-----------------------|- ptable_va
 * 			|      Core heap	| (used for loadable modules)
 * 0xFFFFFFFF.C0000000  |-----------------------|- core_base / ekernelheap
 *			|	 Kernel		|
 *			|	  heap		|
 * 0xFFFFFXXX.XXX00000  |-----------------------|- kernelheap (floating)
 *			|	 segkmap	|
 * 0xFFFFFXXX.XXX00000  |-----------------------|- segkmap_start (floating)
 *			|    device mappings	|
 * 0xFFFFFXXX.XXX00000  |-----------------------|- toxic_addr (floating)
 *			|	  segkp		|
 * ---                  |-----------------------|- segkp_base
 *			|	 segkpm		|
 * 0xFFFFFE00.00000000  |-----------------------|
 *			|	Red Zone	|
 * 0xFFFFFD80.00000000  |-----------------------|- KERNELBASE
 *			|     User stack	|- User space memory
 * 			|			|
 * 			| shared objects, etc	|	(grows downwards)
 *			:			:
 * 			|			|
 * 0xFFFF8000.00000000  |-----------------------|
 * 			|			|
 * 			| VA Hole / unused	|
 * 			|			|
 * 0x00008000.00000000  |-----------------------|
 *			|			|
 *			|			|
 *			:			:
 *			|	user heap	|	(grows upwards)
 *			|			|
 *			|	user data	|
 *			|-----------------------|
 *			|	user text	|
 * 0x00000000.04000000  |-----------------------|
 *			|	invalid		|
 * 0x00000000.00000000	+-----------------------+

Within the kernel's address space, there are only three fixed addresses.
  • KERNEL_TEXT: the address at which 'unix' is loaded.
  • BOOT_DOUBLEMAP_BASE: the lowest address the kernel can use while the boot loader is still in memory.
  • KERNELBASE: the bottom of the kernel's address space or, alternatively, the top of the user's address space.

At KERNEL_TEXT, we allocate two 4MB pages: one for the kernel text and one for kernel data. This value is hardcoded in two different places: in the kernel source (i86pc/sys/machparam.h) and in the unix ELF header (in i86pc/conf/Mapfile.amd64). This is a value that changes rarely, but it does happen. As Solaris 10 neared completion, one of the ISVs using the first release of Solaris Express to include AMD64 support found that putting kernel text and data in the top 64MB of the address range was interfering with the operation of a particular piece of system virtualization software. This interference did not cause any problems with correctness, but it did reduce performance. By lowering the kernel to 2^64-64MB-8MB we were able to remove the interference and eliminate the performance problem.
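
As a quick sanity check on that arithmetic, the sketch below confirms that the KERNEL_TEXT address shown in the layout diagram really does sit 64MB + 8MB below the top of the 64-bit address space. The constant name KERNEL_TEXT_VA and the helper bytes_below_top() are invented for this example; only the address itself comes from the diagram.

```c
#include <assert.h>
#include <stdint.h>

#define MB		(1024ULL * 1024ULL)
/* Address of KERNEL_TEXT as shown in the layout diagram. */
#define KERNEL_TEXT_VA	0xFFFFFFFFFB800000ULL

/*
 * Hypothetical helper: distance from va to the top of the 64-bit
 * address space, i.e. 2^64 - va.  Unsigned arithmetic wraps modulo
 * 2^64, so subtracting from zero gives exactly that value.
 */
static uint64_t
bytes_below_top(uint64_t va)
{
	assert(va != 0);	/* 2^64 itself does not fit in 64 bits */
	return ((uint64_t)0 - va);
}
```

Here bytes_below_top(KERNEL_TEXT_VA) works out to 72MB, and KERNEL_TEXT_VA plus one 4MB page lands on 0xFFFFFFFF.FBC00000, the start of the kernel data page in the diagram.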

Directly below the kernel text is an 8MB region which is used as a scratch area by the logging UFS filesystem during boot. According to the comments in the boot loader code, that address was chosen because it was safely hidden within segmap in the 32-bit x86 address space. Since segmap isn't touched by Solaris proper until after the boot loader is finished unrolling the UFS log, the Solaris address map in startup.c was not updated when the change was putback to the gate. After all, nobody would ever think of restructuring the kernel's address space, would they?

This feature was putback to the main Solaris gate while the AMD64 team was still in the early stages of bringing Solaris up on 64-bit x86 processors for the first time. When that code was putback, the entire AMD64 team was sequestered in a conference room in Colorado for a week of work uninterrupted by meetings, conference calls, or managers. At that time we were still struggling along without kadb, kmdb, or a working simulator, so the only debugging facility available to us was inserting printf()s into the code stream. Obviously this made debugging tedious, tricky, and time consuming. Anything that introduced new work and new debugging headaches was not well received. Hence the somewhat disparaging title this region was given in the address space map once we identified it as the cause of our new failures.

The moral of this story: don't use Magic Values. If you must use Magic Values, make sure they are used in a contained fashion and within a single layer. If you have Magic Values that must be shared between layers, document their every use, assumption, and dependency. If you can't be bothered to document things properly...well, come talk to me. And wear a helmet.

Below the LUFS sinkhole is a region used for some fundamental virtual memory data structures. For example:

  • The memlists represent the physical memory exported to us by the boot loader.
  • The memsegs describe regions of physical memory that have been incorporated by the operating system.
  • The page_hash is a hash table used to translate <vnode, offset> pairs into page structures. In other words, this is the core data structure used to translate virtual addresses into physical addresses.
  • The page_ts are the data structures used to track the state of every physical page in the system.
  • The page_freelists form the head of the linked lists of free pages in the system.
  • The page_cachelists form the head of the linked lists of cached pages in the system. These are pages that contain valid data but which are not currently being used by any applications. When your web browser crashes, the pages containing the program text end up on the cachelists where they can be quickly reclaimed when you restart the browser.
  • The page_counters are a little more complicated than the name would suggest - and more complicated than a little squib in a bootblog can do justice to. For now, let's just say that they are used to keep track of free contiguous physical memory.

All of the above consumes about 62MB on my laptop with 2GB of memory, which we round up to the nearest 4MB. On a system with 16GB in it, the page_t structures alone would consume 120MB. Since we cannot afford to use that much memory in this stage of boot, only the page_ts for the bottom 4GB of physical memory are allocated in this part of memory. The page_ts for the rest of memory are allocated once we have broken free of the boot loader.
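
The two sizing rules in that paragraph (round the early allocation up to a 4MB multiple, and only cover the bottom 4GB of physical memory with page_ts at this stage) can be sketched as follows. This is an illustration only: round_up() and early_page_count() are names invented here, and the 4KB base page size is the standard x86 value rather than something taken from the Solaris source.

```c
#include <assert.h>
#include <stdint.h>

#define MB		(1024ULL * 1024ULL)
#define GB		(1024ULL * MB)
#define FOUR_MB		(4 * MB)
#define PAGESIZE_4K	4096ULL		/* x86 base page size */

/* Round size up to the next multiple of align (a power of two). */
static uint64_t
round_up(uint64_t size, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return ((size + align - 1) & ~(align - 1));
}

/*
 * Hypothetical helper: number of physical pages that get a page_t
 * during early boot, given that only the bottom 4GB is covered.
 */
static uint64_t
early_page_count(uint64_t physmem_bytes)
{
	uint64_t covered = physmem_bytes < 4 * GB ? physmem_bytes : 4 * GB;

	return (covered / PAGESIZE_4K);
}
```

For the 62MB figure above, round_up(62 * MB, FOUR_MB) gives 64MB. A 16GB machine is capped at the same early page count as a 4GB machine, which is the point of deferring the remaining page_ts.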

Next we start loading kernel modules. Once again, the oddities of 64-bitness come into play. For performance reasons, modules generally use %rip-relative addressing to access their data. In order to use this addressing, there is a limit to how distant a relative the address can be. Specifically, the address you are accessing cannot be more than 2GB away from the instruction that is doing the accessing. As long as a system doesn't have too many modules, the text is loaded into the 4MB kernel text page and the data is allocated in the 4MB kernel data page. Since these pages are adjacent, there are no problems with %rip-relative addressing.
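
The 2GB limit comes from the fact that a %rip-relative reference encodes its target as a signed 32-bit displacement from the address of the next instruction. A minimal sketch of the reachability test follows; rip_reachable() and rip_disp32() are hypothetical helpers written for this example, not functions from the kernel's runtime linker.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return 1 if target can be reached from next_rip (the address of the
 * instruction following the reference) with a signed 32-bit
 * displacement, 0 otherwise.  The unsigned-to-signed conversion relies
 * on two's complement behaviour, which holds on x64 compilers.
 */
static int
rip_reachable(uint64_t next_rip, uint64_t target)
{
	int64_t disp = (int64_t)(target - next_rip);

	return (disp >= INT32_MIN && disp <= INT32_MAX);
}

/* Return the displacement that would be encoded in the instruction. */
static int32_t
rip_disp32(uint64_t next_rip, uint64_t target)
{
	assert(rip_reachable(next_rip, target));
	return ((int32_t)(int64_t)(target - next_rip));
}
```

Using addresses from the layout diagram, data at 0xFFFFFFFF.FBC00000 is reachable from text at 0xFFFFFFFF.FB800000 with a displacement of 0x400000, while an address down at 0xFFFFFE00.00000000 is far out of range.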

Once we run out of space on those two pages, we have to start allocating kernel heap space for the module text and data. To maintain the 2GB limit we subdivided the heap, introducing a 2GB region we call the 'core' heap. This subdivision can be seen in the code for kernelheap_init(), which grows the heap, and in the kernel's runtime linker.

Below the core heap is the general-purpose kernel heap, which is the memory region from which most kmem_alloc()s are satisfied. This is the last region to be created or used while we are still dependent on the boot loader's services. The astute reader will note that the top of the kernel heap is in the upper 4GB of memory but the bottom of the heap, while not specifically defined, is clearly not in the upper 4GB. As one might expect, this causes problems with the boot loader and our doublemapping trick. In fact, while we are still making use of the boot loader the kernelheap bottom is set to boot_kernelheap, which is defined as the bottom of the doublemapping region. When we unload the boot loader we call kernelheap_extend(), which grows the heap down to its true lower bound.

The regions below the kernel heap are created after Solaris has unloaded the boot loader and taken over responsibility for managing the full virtual address space. This post has already gotten much longer than I intended, so I'll have to leave their description to a later date. In the meantime, you can find a bit more about segkp at Prakash's blog. If you are interested in the topic, you should really be reading Solaris Internals.
