Mad Hatters

This is the start of a series of posts on my role in the 64 bit x86 Solaris port. They'll come in several pieces, as I'll never get time to explain the whole thing at once. My largest single contribution to Solaris 10 was a complete rewrite of something called the x86 HAT (Hardware Address Translation) layer. Though I started out alone, Nils Nieuwejaar later joined in and contributed some major parts of the effort as well as some much-needed comic relief.

The HAT provides interfaces to the "common" Solaris virtual memory code that manage architecture-dependent things like page tables and page mapping lists. If you're not pretty familiar with how x86 page tables look, the rest of these posts will make about as much sense as a gaggle of honking geese. A good reference is the AMD x86-64 Architecture Programmer's Manual, Volume 2: System Programming, Chapter 5, "Page Translation and Protection".

You'll eventually also need to know a little bit about the HAT. The major interfaces exported by a Solaris HAT are:

  • hat_memload(address_space, virt_addr, phys_page, permissions, etc) - loads a translation for the given virtual address to the given physical page for an address space.
  • hat_devload(address_space, virt_addr, phys_addr, etc) - similar to above but generally used for device memory.
  • hat_memload_array(address_space, virt_addr, phys_page list, etc) - similar to hat_memload(), but allows for multiple or large page mappings in a single call.
  • hat_unload(address_space, virt_addr, length) - undoes the above, i.e., removes mappings from an address space.
  • hat_pageunload(phys_page) - given a physical page, remove all virtual mappings to that page from all address spaces.
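
To make the division of labor concrete, here is a toy model of three of those operations. It is only a sketch: every `demo_` name is made up for illustration, the argument lists are simplified and are not the real Solaris signatures, and the flat table is scanned linearly, whereas a real HAT keeps page tables plus per-page mapping lists so that the equivalent of hat_pageunload() does not have to search.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_SLOTS   32
#define DEMO_NO_PAGE (~0UL)

/* One translation: (address space, virtual page) -> physical page. */
struct demo_mapping {
	int           in_use;
	int           as_id;   /* hypothetical stand-in for an address space */
	unsigned long vpage;   /* virtual page number */
	unsigned long ppage;   /* physical page number */
};

static struct demo_mapping demo_table[DEMO_SLOTS];

static struct demo_mapping *
demo_find(int as_id, unsigned long vpage)
{
	for (int i = 0; i < DEMO_SLOTS; i++) {
		struct demo_mapping *m = &demo_table[i];
		if (m->in_use && m->as_id == as_id && m->vpage == vpage)
			return (m);
	}
	return (NULL);
}

/* Plays the role of hat_memload(): map vpage to ppage in one space. */
static int
demo_memload(int as_id, unsigned long vpage, unsigned long ppage)
{
	struct demo_mapping *m = demo_find(as_id, vpage);

	/* No existing mapping to overwrite, so claim the first free slot. */
	for (int i = 0; m == NULL && i < DEMO_SLOTS; i++)
		if (!demo_table[i].in_use)
			m = &demo_table[i];
	if (m == NULL)
		return (-1);	/* toy table is full */
	m->in_use = 1;
	m->as_id = as_id;
	m->vpage = vpage;
	m->ppage = ppage;
	return (0);
}

/* Plays the role of hat_unload(): drop npages mappings from one space. */
static void
demo_unload(int as_id, unsigned long vpage, unsigned long npages)
{
	for (unsigned long p = vpage; p < vpage + npages; p++) {
		struct demo_mapping *m = demo_find(as_id, p);
		if (m != NULL)
			m->in_use = 0;
	}
}

/* Plays the role of hat_pageunload(): drop ppage from every space. */
static int
demo_pageunload(unsigned long ppage)
{
	int removed = 0;

	for (int i = 0; i < DEMO_SLOTS; i++) {
		if (demo_table[i].in_use && demo_table[i].ppage == ppage) {
			demo_table[i].in_use = 0;
			removed++;
		}
	}
	return (removed);
}

/* Translate, or return DEMO_NO_PAGE when nothing is mapped. */
static unsigned long
demo_lookup(int as_id, unsigned long vpage)
{
	struct demo_mapping *m = demo_find(as_id, vpage);

	return (m != NULL ? m->ppage : DEMO_NO_PAGE);
}
```

The point of the toy is the asymmetry between the last two removal calls: demo_unload() is keyed by address space and virtual address, while demo_pageunload() is keyed by physical page and has to reach into every address space that maps it.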

The previous x86 HAT's design was rather tied up in the requirements of running in a 32 bit address space on the small memory PCs that were typical of the early/mid 1990s. It contained quite a bit of special code to deal with memory allocation and address space manipulations in order to have large amounts of page tables and mapping list data structures even though the normal kernel virtual address range is limited to the top 1 Gigabyte (or so) of memory. The idea of 2 levels of page tables was pretty much hard coded into it, with some sleight of hand #ifdef-ing to have the partial 3rd table needed in PAE mode.

At the start of the project we planned to just extend the old HAT code so that we could run in 64 bit mode as quickly as possible. The project started rather late in the release cycle for Solaris 10 and had a very tight schedule to meet. We expected to go back later and possibly rewrite much of the HAT for better 64 bit performance for Solaris 10 updates.

After a week of looking at what needed to be done, I proposed writing all new code from the start. If you're a kernel developer, your reaction to that statement should be the same as my project leaders' was at that time: that it was crazy to propose such a risky approach. But it made sense due to a new design for the HAT that made the code neutral to 32 bit non-PAE vs 32 bit PAE vs 64 bit environments. The new HAT would execute almost all the same code paths in all modes. Hence, I could write and debug it in the existing stable 32 bit version of Solaris with a reasonable expectation that the code should just recompile and work in the 64 bit environment. I'd be able to start coding and testing immediately, while the rest of the amd64 team was still working on other startup issues, like 64 bit compilers, boot loaders and other tasks.

The new design idea was to encode all the parameters about the paging hierarchy (i.e., page tables, page directories, page directory pointer tables and page-map level-4 tables) into an mmu description structure. The mmu description would be filled in early at boot, once Solaris determines what mode the processor will run in. The HAT then always interprets this description when manipulating page tables.
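
As a rough sketch of what such a description could hold, the fragment below fills one in for each of the three modes. The shift and mask values are the standard x86 ones; the struct layout, the `mmu_mode` enum, and the `mmu_describe()` and `mmu_index()` helpers are assumptions made for this illustration, not the actual Solaris definitions.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LEVELS 4

/* Hypothetical layout, for illustration only. */
struct mmu_description {
	int      top_level;              /* level number of the top table */
	unsigned shift[MAX_LEVELS];      /* VA bits to shift off, per level */
	unsigned index_mask[MAX_LEVELS]; /* entries per table, minus one */
};

enum mmu_mode { MMU_32BIT, MMU_32BIT_PAE, MMU_64BIT };

/* Fill in the description once, when the processor mode is known. */
static void
mmu_describe(struct mmu_description *m, enum mmu_mode mode)
{
	static const struct mmu_description modes[] = {
		[MMU_32BIT]     = { 1, { 12, 22 },         { 1023, 1023 } },
		[MMU_32BIT_PAE] = { 2, { 12, 21, 30 },     { 511, 511, 3 } },
		[MMU_64BIT]     = { 3, { 12, 21, 30, 39 }, { 511, 511, 511, 511 } },
	};

	*m = modes[mode];
}

/* Index into the table at the given level for a virtual address. */
static unsigned
mmu_index(const struct mmu_description *m, uint64_t va, int level)
{
	assert(level >= 0 && level <= m->top_level);
	return ((unsigned)((va >> m->shift[level]) & m->index_mask[level]));
}
```

With these values, the address 0xC0801000 splits into indexes 770 and 1 under the 32 bit non-PAE description, and into 3, 4 and 1 under the PAE description, without any mode-specific code in mmu_index().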

To illustrate the difference, I'll show some pseudo-code for a mythical HAT function which looks for a PTE for a given virtual address by walking down the page table hierarchy. I've tried to use "variable" names that make the code self-explanatory. First the old code, if it were extended to 64 bits in the most obvious fashion:



#if defined(__amd64) || defined(PAE)
typedef uint64_t pte_t;
#else
typedef uint32_t pte_t;
#endif


pte_t
hat_probe(caddr_t address)
{
	pte_t pte;
	uintptr_t va = (uintptr_t)address;
	uint_t index;
	pte_t *ptable;

	ptable = find_top_table(current_addr_space);
	ASSERT(ptable != NULL);

#if defined(__amd64)
	/*
	 * 64 bit mode uses 4 levels of page tables
	 * MMU_PML4_SHIFT is 39
	 * MMU_PML4_MASK is (512 - 1)
	 */
	index = (va >> MMU_PML4_SHIFT) & MMU_PML4_MASK;
	pte = ptable_extract(ptable, index);
	if (pte == 0)
		return 0;
	ptable = find_PDP_table(pte & MMU_PAGEMASK);
	ASSERT(ptable != NULL);
#endif

#if defined(__amd64) || defined(PAE)
	/*
	 * 3rd level of pagetables.
	 * MMU_PDP_SHIFT is 30
	 * MMU_PDP_MASK is either (512 - 1) or (4 - 1)
	 */
	index = (va >> MMU_PDP_SHIFT) & MMU_PDP_MASK;
	pte = ptable_extract(ptable, index);
	if (pte == 0)
		return 0;
	ptable = find_PD_table(pte & MMU_PAGEMASK);
	ASSERT(ptable != NULL);
#endif

	/*
	 * 2nd level page tables
	 * MMU_PD_SHIFT is either 21 or 22
	 * MMU_PD_MASK is either (512 - 1) or (1024 - 1)
	 */
	index = (va >> MMU_PD_SHIFT) & MMU_PD_MASK;
	pte = ptable_extract(ptable, index);
	if (pte == 0)
		return 0;
	if (pte & PT_PAGESIZE_BIT)
		return pte;
	ptable = find_PT_table(pte & MMU_PAGEMASK);
	ASSERT(ptable != NULL);

	/*
	 * Lowest level page table
	 * MMU_PT_SHIFT is 12
	 * MMU_PT_MASK is either (512 - 1) or (1024 - 1)
	 */
	index = (va >> MMU_PT_SHIFT) & MMU_PT_MASK;
	pte = ptable_extract(ptable, index);
	return (pte);
}

Under the new scheme the same interface looks like this:



typedef uint64_t pte_t;
typedef void *ptable_t;
struct mmu_description {...} mmu;

pte_t
hat_probe(caddr_t address)
{
	pte_t pte;
	uintptr_t va = (uintptr_t)address;
	uint_t index;
	int level;
	ptable_t ptable;

	for (level = mmu.top_level; level >= 0; --level) {
		ptable = ptable_lookup(va, level, current_addr_space);
		ASSERT(ptable != NULL);
		index = (va >> mmu.shift[level]) & mmu.index_mask[level];
		pte = ptable_extract(ptable, index);
		if (pte == 0)
			return 0;
		if ((pte & mmu.is_page_mask[level]) != 0)
			return pte;
	}
	return 0;
}

The new code has a small amount of additional looping and memory reference overhead in exchange for its compactness and improved extensibility. If a future processor adds additional large pagesizes or more pagetable levels, the mmu description might change, but this code would just work. Another thing to note is that you would probably change the new version to use:

	for (level = mmu.top_pagesize_level; level >= 0; --level) {

as the loop boundaries to improve its performance.
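
To see the mode neutrality in action, here is a small user-level simulation of the table-driven walk. It is a sketch under several assumptions: page tables are rows of a static array, a non-leaf entry stores the row number of the next table in the bits above bit 11, and all the `toy_` names, the two description instances, and the `is_page_mask` values are invented for the example. Because this toy follows the links stored in the entries rather than looking tables up by address and level, it always has to start from the top level.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_LEVELS  4
#define PT_VALID    0x1ULL   /* x86 "present" bit */
#define PT_PAGESIZE 0x80ULL  /* x86 large page bit */
#define TOY_TABLES  16
#define TOY_ENTRIES 1024

typedef uint64_t pte_t;

struct mmu_description {
	int      top_level;
	unsigned shift[MAX_LEVELS];
	uint64_t index_mask[MAX_LEVELS];
	pte_t    is_page_mask[MAX_LEVELS]; /* bits marking an entry as a page */
};

static const struct mmu_description mmu_32 = {
	1, { 12, 22 }, { 1023, 1023 }, { PT_VALID, PT_PAGESIZE }
};
static const struct mmu_description mmu_64 = {
	3, { 12, 21, 30, 39 }, { 511, 511, 511, 511 },
	{ PT_VALID, PT_PAGESIZE, 0, 0 }
};

static pte_t toy_tables[TOY_TABLES][TOY_ENTRIES]; /* row 0 is the top table */
static unsigned toy_next_free = 1;

static void
toy_reset(void)
{
	memset(toy_tables, 0, sizeof (toy_tables));
	toy_next_free = 1;
}

static unsigned
toy_index(const struct mmu_description *m, uint64_t va, int level)
{
	return ((unsigned)((va >> m->shift[level]) & m->index_mask[level]));
}

/* Map va to a frame; leaf_level > 0 installs a large page. */
static void
toy_map(const struct mmu_description *m, uint64_t va, int leaf_level,
    uint64_t frame)
{
	unsigned table = 0;
	int level;

	for (level = m->top_level; level > leaf_level; --level) {
		pte_t *slot = &toy_tables[table][toy_index(m, va, level)];
		if (*slot == 0) {	/* allocate the next lower table */
			assert(toy_next_free < TOY_TABLES);
			*slot = ((pte_t)toy_next_free++ << 12) | PT_VALID;
		}
		table = (unsigned)(*slot >> 12);
	}
	toy_tables[table][toy_index(m, va, leaf_level)] =
	    (frame << 12) | PT_VALID | (leaf_level > 0 ? PT_PAGESIZE : 0);
}

/* The same loop serves every mode; only the description differs. */
static pte_t
toy_probe(const struct mmu_description *m, uint64_t va)
{
	unsigned table = 0;
	int level;

	for (level = m->top_level; level >= 0; --level) {
		pte_t pte = toy_tables[table][toy_index(m, va, level)];
		if (pte == 0)
			return (0);
		if ((pte & m->is_page_mask[level]) != 0)
			return (pte);
		table = (unsigned)(pte >> 12);
	}
	return (0);
}
```

Calling toy_map(&mmu_32, 0xC0801000, 0, 0x1234) and then toy_probe(&mmu_32, 0xC0801000) walks two levels; after toy_reset(), mapping a 2 MB page at level 1 with &mmu_64 and probing any address inside it walks three of the four levels and stops at the large page entry, all through the same toy_probe() body.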

The important thing for the project at the time was that the old style 64 bit code couldn't have been tested until we had a 64 bit kernel partially working. With the new style code, we could do a lot of testing on 32 bit platforms, long before any other part of the 64 bit kernel was ready, and be fairly confident that the code was correct. In the end this proved to be a great choice, as the 64 bit HAT was almost never on the critical path for code development.

About

JoeBonasera
