Page Fault Handling in Solaris

Welcome to OpenSolaris! In this entry, I'll walk through the page fault handling code, which is ground zero of the Solaris virtual memory subsystem. Because of where this code sits in the system, part of it (the lowest level, which interfaces with hardware registers) is machine dependent, while the rest is common code written in C. Hence, I will present this topic in three parts: the x64 machine dependent code, where the hardware does most of the work of handling TLB misses; the more complex SPARC machine dependent code, which relies on assembly code to handle TLB misses from trap context; and finally the common code, which executes from kernel context.

Part 1: x64 Machine Dependent Layer

Since all x86-class machines handle TLB misses with a hardware page table walk, the Hardware Address Translation, or HAT, layer for x64 systems is the simpler of the two that Solaris currently supports. Both the 32-bit x86 and 64-bit AMD64 processors use a page directory scheme to map each address space's virtual addresses to physical addresses. When a TLB miss occurs, the MMU (memory management unit) hardware walks the page tables looking for the page table entry (PTE) associated with the virtual address of the memory access, if one exists. In the page directory model, the virtual address is divided into several parts; each successive part of the virtual address forms an index into the corresponding level of the directory, and each higher-level directory entry points to the address in memory of the next lower-level directory table. Each directory table is 4K in size, which corresponds to the base page size of the processor. The pointer to the top-level page directory is programmed into the cr3 hardware register on context switch.

  [Figure: directory-based page tables]
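To make the directory walk concrete, here is a minimal user-level sketch of how a 64-bit virtual address splits into the four 9-bit directory indices and the 12-bit page offset used by AMD64 long-mode paging. The shifts describe the hardware format; the function name and example address are mine, not anything from the HAT source.

#include <stdio.h>
#include <stdint.h>

/*
 * Split a 64-bit virtual address into the four 9-bit directory indices and
 * the 12-bit page offset used by AMD64 long-mode (4-level) paging.
 */
static void
print_pt_indices(uint64_t va)
{
    unsigned int pml4 = (va >> 39) & 0x1ff;  /* top-level directory index */
    unsigned int pdpt = (va >> 30) & 0x1ff;  /* page-directory-pointer index */
    unsigned int pd   = (va >> 21) & 0x1ff;  /* page directory index */
    unsigned int pt   = (va >> 12) & 0x1ff;  /* page table index */
    unsigned int off  = va & 0xfff;          /* byte offset within the 4K page */

    printf("va 0x%llx -> pml4 %u, pdpt %u, pd %u, pt %u, offset 0x%x\n",
        (unsigned long long)va, pml4, pdpt, pd, pt, off);
}

int
main(void)
{
    print_pt_indices(0x7fffbfc12345ULL);
    return (0);
}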

Since we're discussing the page fault path in this blog entry, we are interested in the case where the processor fails to find a valid PTE at the lowest level of the directory. This results in a page fault exception (#pf), which passes control synchronously to a page fault handler running in trap context. This low-level handler is pftrap(), located in exception.s. The handler jumps to cmntrap() over in locore.s, which pushes the machine state onto the stack, switches to kernel context, and invokes the C kernel-side trap handler, trap() in trap.c, with a trap type of T_PGFLT. The trap() routine figures out that this is a user fault, since the faulting address lies below KERNELBASE, and calls pagefault() in vm_machdep.c. The pagefault() routine collects the necessary arguments for the common as_fault() routine and passes control to it.
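As an aside on what the hardware hands to the fault handler: on a #pf the CPU leaves the faulting linear address in %cr2 and pushes an error code whose low bits are architecturally defined. The little user-level sketch below decodes those bits; the macro and function names are illustrative and are not the ones used in trap.c.

#include <stdio.h>
#include <stdint.h>

/* Architecturally defined low bits of the #pf error code. */
#define PF_ERR_PRESENT  0x1  /* 0 = page not present, 1 = protection violation */
#define PF_ERR_WRITE    0x2  /* 0 = read access, 1 = write access */
#define PF_ERR_USER     0x4  /* 0 = fault in kernel mode, 1 = in user mode */

static void
describe_pagefault(uint64_t cr2, uint32_t err)
{
    printf("#pf at 0x%llx: %s, %s access, %s mode\n",
        (unsigned long long)cr2,
        (err & PF_ERR_PRESENT) ? "protection violation" : "page not present",
        (err & PF_ERR_WRITE) ? "write" : "read",
        (err & PF_ERR_USER) ? "user" : "kernel");
}

int
main(void)
{
    /* e.g. a user-mode write to a page that has no valid PTE yet */
    describe_pagefault(0x7fffbfc12000ULL, PF_ERR_WRITE | PF_ERR_USER);
    return (0);
}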

For more information regarding the x64 HAT layer, refer to Joe Bonasera's blog, where he has started writing about this subsystem, which he and Nils Nieuwejaar redesigned from the ground up for the AMD64 port in Solaris 10.

Part 2: SPARC Machine Dependent Layer

The UltraSPARC architecture -- the only SPARC architecture currently supported by Solaris -- relies entirely on software to handle TLB misses[1]. Hence, the HAT layer for SPARC is a bit more complex than the x64 one. To speed up handling of TLB miss traps, the processor provides a hardware-assisted lookup mechanism[2] called the Translation Storage Buffer (TSB). The TSB is a virtually indexed, direct-mapped, physically contiguous, and size-aligned region of physical memory which is used to cache recently used Translation Table Entries (TTEs) after they are retrieved from the page tables. When a TLB miss occurs, the hardware uses the virtual address of the miss, combined with the contents of a TSB base address register (pre-programmed on context switch), to calculate a pointer to the TSB entry corresponding to that virtual address. If the TSB entry's tag matches the virtual address of the miss, the TTE is loaded into the TLB by the TLB miss handler and the trapped instruction is retried; see DTLB_MISS() in trap_table.s and sfmmu_udtlb_slowpath in sfmmu_asm.s. If no match is found, the trap handler branches to a slow path routine called the TSB miss handler[3].
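To illustrate the shape of that hardware-assisted lookup, here is a rough user-level C model of a direct-mapped TSB lookup, assuming the UltraSPARC 8K base page size. The entry layout, tag format, and TSB size are simplified stand-ins of my own; the real definitions live in the sfmmu headers.

#include <stdio.h>
#include <stdint.h>

#define TSB_ENTRIES    512  /* example size only; real TSB sizes vary */
#define MMU_PAGESHIFT  13   /* UltraSPARC 8K base page */

struct tsb_entry {
    uint64_t tag;   /* which virtual page this entry caches */
    uint64_t data;  /* the cached TTE */
};

static struct tsb_entry *
tsb_lookup(struct tsb_entry *tsb_base, uint64_t va)
{
    uint64_t vpn = va >> MMU_PAGESHIFT;
    struct tsb_entry *e = &tsb_base[vpn & (TSB_ENTRIES - 1)];

    /* On a tag match, the TTE in e->data would be loaded into the TLB. */
    return (e->tag == vpn ? e : NULL);
}

int
main(void)
{
    static struct tsb_entry tsb[TSB_ENTRIES];
    uint64_t va = 0x10002a000ULL;

    /* Pretend a prior miss cached this translation, then look it up. */
    tsb[(va >> MMU_PAGESHIFT) & (TSB_ENTRIES - 1)].tag = va >> MMU_PAGESHIFT;
    printf("TSB %s\n", tsb_lookup(tsb, va) != NULL ? "hit" : "miss");
    return (0);
}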

The SPARC HAT layer (named sfmmu, for Spitfire MMU, after the first UltraSPARC MMU it supported) uses an open hashing technique to implement the page tables in software. The hash lookup is keyed by the struct hat pointer of the currently running process and the virtual address of the TLB miss. On a TSB miss, the function sfmmu_tsb_miss_tt in sfmmu_asm.s searches the hash for successive page sizes using the GET_TTE() assembly macro. If a match is found, the TTE is inserted into the TSB, loaded into the TLB, and the trapped instruction is re-issued. If a match is not found, or the access type does not match the permitted access for the mapping (e.g., a write is attempted to a read-only mapping), control is transferred to the sys_trap() routine in mach_locore.s after the appropriate fault type is set up. The sys_trap() routine (which is quite involved due to SPARC's register windows) saves the machine state to the stack, switches from trap context to kernel context, and invokes the kernel-side trap handler in C, trap() over in trap.c. The trap() routine recognizes the T_DATA_MMU_MISS trap code and branches to pagefault() in vm_dep.c. As its x64 counterpart does, pagefault() collects the appropriate arguments and invokes the common handler, as_fault().
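The lookup side of such an open-hashed software page table might look roughly like the sketch below. The bucket count, hash function, and structure are invented for illustration only; the real code hashes hme_blk structures by (sfmmup, vaddr, page size) and is considerably more involved.

#include <stdint.h>
#include <stddef.h>

#define NBUCKETS  4096  /* arbitrary bucket count for the sketch */

struct sw_pte {
    struct sw_pte *next;  /* hash chain link */
    void          *hat;   /* owning address space's hat */
    uint64_t       vpn;   /* virtual page number */
    uint64_t       tte;   /* the translation itself */
};

static struct sw_pte *hash_table[NBUCKETS];

static struct sw_pte *
sw_pte_lookup(void *hat, uint64_t va, int pageshift)
{
    uint64_t vpn = va >> pageshift;
    size_t bucket = ((uintptr_t)hat ^ (uintptr_t)vpn) % NBUCKETS;
    struct sw_pte *p;

    for (p = hash_table[bucket]; p != NULL; p = p->next) {
        if (p->hat == hat && p->vpn == vpn)
            return (p);  /* hit: insert into the TSB, then the TLB */
    }
    return (NULL);       /* miss: hand the fault up to sys_trap()/trap() */
}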

For more information about the sfmmu HAT layer, keep coming back -- this subsystem warrants a more in-depth tour in future blogs.

Part 3: Common Code Layer

The Solaris virtual memory (VM) subsystem uses a segmented model to map each process' address space, as well as the kernel itself. Each segment object maps a contiguous range of virtual memory with common attributes. The backing store for each segment may be device memory, a file, physical memory, etc., and each backing store type is handled by a different segment driver. The most commonly used segment driver is seg_vn, so named because it maps vnodes associated with files. Perhaps more interestingly, the seg_vn segment driver is also responsible for implementing anonymous memory, so called because it is private to a process and is backed by swap space rather than by a file object. Since seg_vn maps the majority of a process' address space, including all text, heap, and stack, I'll use it to illustrate the most common page fault path encountered by a process[4].
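The plumbing that makes this work is an ops-vector dispatch: each segment carries a pointer to its driver's operations, and the generic address space code calls through it. Here is a condensed sketch of that idea; the field names and argument lists are simplified relative to the real definitions in vm/seg.h.

/* Condensed, illustrative version of the segment-driver dispatch. */
typedef int faultcode_t;
enum fault_type { F_INVAL, F_PROT, F_SOFTLOCK, F_SOFTUNLOCK };
enum seg_rw { S_OTHER, S_READ, S_WRITE, S_EXEC };

struct seg;

struct seg_ops {
    faultcode_t (*fault)(struct seg *seg, char *addr, unsigned long len,
        enum fault_type type, enum seg_rw rw);
    /* ... dup, unmap, setprot, sync, and many more ... */
};

struct seg {
    char           *s_base;  /* base virtual address of the segment */
    unsigned long   s_size;  /* size of the mapped range in bytes */
    struct seg_ops *s_ops;   /* driver for this segment's backing store */
    void           *s_data;  /* driver-private data (e.g. segvn_data) */
};

/* The generic layer dispatches a fault to whichever driver maps the range. */
#define SEGOP_FAULT(seg, addr, len, type, rw) \
    ((seg)->s_ops->fault((seg), (addr), (len), (type), (rw)))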

Returning to the page fault path, assume that the page fault being examined has occurred in a virtual address range that corresponds to a process heap -- for instance, the first touch of new memory allocated by a brk() system call performed by the C library's malloc() routine. Such a fault will allocate process-private, anonymous memory which is pre-filled with zeros, known to VM geeks as a ZFOD fault -- short for zero fill on demand. In such a situation, the as_fault() routine (vm_as.c) will search the process' segment tree looking for the segment that maps the virtual address range containing the fault. If as_fault() discovers that no such segment exists, a fatal segmentation violation is signalled to the process, causing it to terminate. In our example, a segment is found whose seg_ops corresponds to segvn_ops (seg_vn.c). The SEGOP_FAULT() macro is called, which invokes the segvn_fault() routine in seg_vn.c. In our example, the backing store is swap, so segvn_faultpage() will find there is no vnode backing this range, but rather an anon object, and will allocate a page to back this virtual address through anon_zero() in vm_anon.c.
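If you want a victim to trace, a trivial program like the following will provoke a stream of exactly these ZFOD faults, one per page on first touch, and can be observed with the DTrace one-liners shown below. The allocation size and touch stride are arbitrary choices of mine.

#include <stdio.h>
#include <stdlib.h>

/*
 * Trivial ZFOD generator: malloc() obtains fresh anonymous memory, and the
 * first touch of each new page takes the as_fault() -> segvn_fault() ->
 * anon_zero() path described above.
 */
int
main(void)
{
    size_t sz = 64 * 1024 * 1024;  /* enough to need plenty of new pages */
    size_t i;
    char *p = malloc(sz);

    if (p == NULL) {
        perror("malloc");
        return (1);
    }

    /* Touch one byte per 8K so each touch hits a page for the first time. */
    for (i = 0; i < sz; i += 8192)
        p[i] = 1;

    printf("touched %u pages\n", (unsigned int)(sz / 8192));
    free(p);
    return (0);
}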

Here is a sample call stack into anon_zero() as viewed from DTrace on my workstation (a dual-CPU Opteron running the 64-bit Solaris kernel):
# dtrace -qn '::anon_zero:entry { stack(8); exit(0); }'

              genunix`segvn_faultpage+0x16c
              genunix`segvn_fault+0x647
              genunix`as_fault+0x3c8
              unix`pagefault+0x7e
              unix`trap+0x792
              unix`_cmntrap+0x83
Here is another sample call stack into anon_zero(), this time from an Ultra-Enterprise 10000:
# dtrace -qn '::anon_zero:entry { stack(8); exit(0); }'

              genunix`segvn_faultpage+0x238
              genunix`segvn_fault+0x920
              genunix`as_fault+0x4b8
              unix`pagefault+0xac
              unix`trap+0xc5c
              unix`utl0+0x4c
Note that, in both cases, we can only see back as far as where we switched to kernel context, since trap context uses only registers or scratch space for its work and does not save traceable stack frames for us.

For those of you who wish to trace the path of a process' zero fill page faults from beginning to end, you may do so quite easily by running this DTrace script as root. The script takes one argument, which is the exec name of the binary to trace. I recommend a simple one like "ls" since it is relatively small and short-lived.

[1] The details of SPARC TLB handling will take many blog entries to cover from beginning to end, so I'm skipping over many of them here for now. For the impatient, pick up a copy of Solaris Internals by Jim Mauro and Richard McDougall (ISBN 0-13-022496-0); though much of the material is dated now, many of the details are still accurate.

[2] With a little extra effort, this TSB lookup could be performed entirely by the hardware. No current sun4u systems do so, but some future systems may support the TSB lookup in hardware.

[3] I'm skipping a step here for the sake of brevity -- there are actually two TSB searches in the case of a process which is using large pages, since a separate 4M-indexed TSB is kept for large pages. If the process is using 4M or larger pages, the second TSB must also be searched before a TSB miss can be declared. This second search is performed using a software-generated TSB index, since the hardware assist only generates an 8K-indexed pointer into the first TSB. See sfmmu_udtlb_slowpath() in the source if you care to see what really happens... Go on, you really have the source now, so no excuses :)

[4] In some ways, this is unfortunate because the seg_vn segment driver is the most complicated of all the segment drivers in the Solaris VM subsystem, and as such has a very steep learning curve. Within Sun, we often joke that nobody understands how it all works, as it has evolved over a period of many, many years, and all of the original implementors have since moved on or are now part of Sun's upper management. While the spirit of the code hasn't changed significantly from the original SVR4 code, much of the complexity added over the years supports modern features, like superpages, that were not anticipated in the original design. This can make for a few twists and turns in the source, even when following the path of a simple example like our ZFOD fault.
