Introduction
As Ksplice engineers, we often have to look at completely different sub-systems of the Linux kernel to patch them, either to fix a vulnerability or to add trip wires. As a result, we gain a lot of knowledge in various areas. In this blog post, I’ll share my experience regarding pinning user pages in the kernel and the related page table walking internals, based on insights gained through a Known Exploit Detection Update.
For example, if a kernel mode driver wants to perform a DMA transfer from a device to a memory range that is mapped in user mode, the driver can call pin_user_pages to obtain an array of struct page structures representing that range of memory. Page frame numbers (PFNs) can be derived from each struct page and given to the DMA engine to transfer bytes directly from the device to the memory pages. Since these pages are already mapped in user mode, the user mode application can access the data right after the DMA transfer completes, without requiring the kernel to maintain a temporary buffer for the DMA engine and then copy the data back to user mode. Furthermore, the kernel mode driver can map the physical pages represented by the struct page structures in kernel mode and write data into them, and that data will be visible to user mode instantly.
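To make this concrete, here is a minimal sketch of that pattern, assuming a hypothetical driver that already knows the page-aligned user address and page count. pin_ubuf and fill_first_page are made-up helper names, and a real driver would DMA-map the pages rather than hand out raw PFNs:

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/sched.h>
#include <linux/string.h>

/*
 * Hypothetical helper: pin a page-aligned user buffer so a device (or the
 * kernel) can access it directly.  Simplified sketch only; a real driver
 * would also DMA-map the pages and handle partial pins more carefully.
 */
static long pin_ubuf(unsigned long ubuf, unsigned long nr_pages,
                     struct page **pages)
{
    long pinned;

    mmap_read_lock(current->mm);
    pinned = pin_user_pages(ubuf, nr_pages, FOLL_WRITE, pages);
    mmap_read_unlock(current->mm);

    if (pinned > 0 && pinned != nr_pages) {
        /* Could not pin the whole range: release what we got. */
        unpin_user_pages(pages, pinned);
        return -EFAULT;
    }
    return pinned;
}

static void fill_first_page(struct page **pages)
{
    /*
     * Writing through a kernel mapping of a pinned page is immediately
     * visible through the user mapping of the same page frame.
     */
    void *kaddr = kmap_local_page(pages[0]);

    memset(kaddr, 0xab, PAGE_SIZE);
    kunmap_local(kaddr);

    /* page_to_pfn(pages[0]) could likewise be handed to a DMA engine. */
}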
The functions pin_user_pages, get_user_pages, and their variants are collectively referred to as gup functions. These functions define the interface by which kernel drivers request an array of struct page structures for a given range of user mode virtual addresses and specify their intended operations on those pages.
gup functions are the opposite of calling mmap to map kernel memory into the user mode address space. mmap is mainly used for three subcategories: a) mapping anonymous pages (either shared or exclusive) in user mode, b) mapping specific physical memory ranges using the remap_pfn_range function to map iomem, and c) mapping filesystem data from page caches. On the other hand, gup functions are used to reference pages from user mode so that these pages can be mapped in kernel mode.
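For contrast, category b) above, the remap_pfn_range direction, looks roughly like the following sketch of a driver's .mmap handler. MY_DEVICE_PHYS_BASE is a made-up placeholder for whatever physical region the driver exposes:

#include <linux/fs.h>
#include <linux/mm.h>

#define MY_DEVICE_PHYS_BASE 0xfd000000UL   /* hypothetical device memory */

/* Sketch of an mmap handler that maps device memory into user space,
 * the opposite direction to gup. */
static int my_drv_mmap(struct file *file, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;
    unsigned long pfn = MY_DEVICE_PHYS_BASE >> PAGE_SHIFT;

    /*
     * Such mappings get VM_PFNMAP/VM_IO: there is no struct page behind
     * them, which is exactly why vm_normal_page() treats them specially
     * later in this post.
     */
    return remap_pfn_range(vma, vma->vm_start, pfn, size,
                           vma->vm_page_prot);
}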
I will explain how a range of page frames mapped in user mode is referenced in struct page form in kernel mode, and how the referenced struct page pointers are stored in an array so that kernel mode components can use these pointers to map the underlying pages or feed them directly to the DMA engine.
The scope of this blog post is to explore the gup functions, the page table walking code that obtains the struct page structures corresponding to the page table entries, the various vma and struct page flags, and finally how an array of struct page pointers is returned to the caller.
This blog post assumes that readers have a good understanding of paging in a 64-bit system. A useful resource to refresh the concept is available here.
Readers are also expected to have a basic understanding of the gup APIs and flags (i.e., FOLL_GET, FOLL_PIN and FOLL_LONGTERM), which can be found in the documentation here.
This blog post is based on Linux kernel version 6.8.
Overview
I am going to explain the pin_user_pages and get_user_pages functions together, as these are the most important of all the gup functions. Both follow a common path by internally calling the same function, and they employ different flags to handle specific cases where necessary.
The functions pin_user_pages and get_user_pages assume the caller has taken the mmap lock. They then call __get_user_pages_locked, which checks whether the FOLL_PIN flag was passed; if not, it sets the FOLL_GET flag. The FOLL_PIN and FOLL_GET flags are mutually exclusive. Both flags take references on pages, but the difference is that FOLL_GET only ensures the struct page is not freed, while FOLL_PIN ensures both the presence of the struct page and that its contents will not be moved or modified behind the caller's back. The kernel distinguishes between the two reference counting methods by using the higher bits of page->refcount for FOLL_PIN and the lower bits for FOLL_GET.
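To illustrate that bias-based scheme, the test the kernel uses to guess whether a page might be pinned boils down to something like the following simplified paraphrase of folio_maybe_dma_pinned (the helper name below is made up):

#include <linux/mm.h>

/*
 * Simplified paraphrase of the pin check: FOLL_PIN adds
 * GUP_PIN_COUNTING_BIAS (1024) per pin on a single-page folio, so a
 * reference count at or above the bias strongly suggests the page is
 * DMA-pinned, while plain FOLL_GET references only bump the low bits.
 * Large folios keep an exact pin count in a dedicated field instead.
 */
static inline bool page_probably_pinned(struct page *page)
{
    struct folio *folio = page_folio(page);

    if (folio_test_large(folio))
        return atomic_read(&folio->_pincount) > 0;

    return folio_ref_count(folio) >= GUP_PIN_COUNTING_BIAS;
}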
Control then enters a loop that repeatedly calls __get_user_pages to walk the page tables and obtain the corresponding struct page for the PFN in each PTE (Page Table Entry). Initially, the loop tries to get all the pages in the specified range (given by the starting virtual address and number of pages) at once. If that fails, it must be because a PTE is missing its page frame, or because the PTE is read-only when write access was requested. In such cases, the fault handler is invoked to populate the missing PTE, either with an anonymous page or a page backed by the page cache, or to correct the PTE by making it writable or creating a copy of the page. After the PTE is populated or corrected, the page table walk is retried to retrieve the struct pages from the PTEs and populate the pages array with them, as the user requested.
GUP Functions Initialization, Locking and Flag Handling
pin_user_pages and get_user_pages functions
The pin_user_pages function should be enclosed within mmap_read_lock and mmap_read_unlock calls to acquire and release the mmap lock, as shown in the code snippet below. Passing the FOLL_WRITE and FOLL_LONGTERM flags to pin_user_pages is optional. The array of struct page pointers is returned in the pages variable.
mmap_read_lock(current->mm);
ret = pin_user_pages(ubuf, nr_pages, FOLL_WRITE | FOLL_LONGTERM, pages);
mmap_read_unlock(current->mm);
The implementation of pin_user_pages is as follows:
long pin_user_pages(unsigned long start, unsigned long nr_pages,
                    unsigned int gup_flags, struct page **pages)
{
    int locked = 1;

    if (!is_valid_gup_args(pages, NULL, &gup_flags, FOLL_PIN))
        return 0;
    return __gup_longterm_locked(current->mm, start, nr_pages,
                                 pages, &locked, gup_flags);
}
The function initializes a local variable locked to 1 to indicate that the mmap lock is acquired. As previously mentioned, the caller is required to hold the mmap lock when calling pin_user_pages and get_user_pages.
The implementation of get_user_pages is as follows:
long get_user_pages(unsigned long start, unsigned long nr_pages,
                    unsigned int gup_flags, struct page **pages)
{
    int locked = 1;

    if (!is_valid_gup_args(pages, NULL, &gup_flags, FOLL_TOUCH))
        return -EINVAL;
    return __get_user_pages_locked(current->mm, start, nr_pages, pages,
                                   &locked, gup_flags);
}
This implementation is very similar to pin_user_pages, except that for get_user_pages the FOLL_GET flag is implicitly set later, whereas pin_user_pages initializes gup_flags with FOLL_PIN. It also calls __get_user_pages_locked directly instead of __gup_longterm_locked, which handles the long-term pinning functionality.
is_valid_gup_args
Next, I will explain the is_valid_gup_args function, which is invoked from both pin_user_pages and get_user_pages. This function validates the various FOLL_* flags passed by the caller and sets any other flags deemed necessary. During validation, it checks whether both the FOLL_GET and FOLL_PIN flags are set; this is an invalid combination, so it returns false if so. Also, the FOLL_LONGTERM flag can only be set together with FOLL_PIN, since it is more restrictive than FOLL_PIN alone. FOLL_LONGTERM indicates that the pages will be held for an extended period (e.g., an RDMA driver uses it to pin pages for transferring bytes).
Excerpts from the is_valid_gup_args function:
    /* FOLL_GET and FOLL_PIN are mutually exclusive. */
    if (WARN_ON_ONCE((gup_flags & (FOLL_PIN | FOLL_GET)) ==
                     (FOLL_PIN | FOLL_GET)))
        return false;

    /* LONGTERM can only be specified when pinning */
    if (WARN_ON_ONCE(!(gup_flags & FOLL_PIN) && (gup_flags & FOLL_LONGTERM)))
        return false;

    /* Pages input must be given if using GET/PIN */
    if (WARN_ON_ONCE((gup_flags & (FOLL_GET | FOLL_PIN)) && !pages))
        return false;

    /* We want to allow the pgmap to be hot-unplugged at all times */
    if (WARN_ON_ONCE((gup_flags & FOLL_LONGTERM) &&
                     (gup_flags & FOLL_PCI_P2PDMA)))
        return false;
If all these checks pass, the flags requested by the caller (combined with the flags the wrapper forces, such as FOLL_PIN or FOLL_TOUCH) are written back and the function returns true.
__gup_longterm_locked
The following is the implementation of the __gup_longterm_locked function, which is invoked by the pin_user_pages family of functions:
static long __gup_longterm_locked(struct mm_struct *mm,
                                  unsigned long start,
                                  unsigned long nr_pages,
                                  struct page **pages,
                                  int *locked,
                                  unsigned int gup_flags)
{
    unsigned int flags;
    long rc, nr_pinned_pages;

    if (!(gup_flags & FOLL_LONGTERM))
        return __get_user_pages_locked(mm, start, nr_pages, pages,
                                       locked, gup_flags);

    flags = memalloc_pin_save();
    do {
        nr_pinned_pages = __get_user_pages_locked(mm, start, nr_pages,
                                                  pages, locked,
                                                  gup_flags);
        if (nr_pinned_pages <= 0) {
            rc = nr_pinned_pages;
            break;
        }

        /* FOLL_LONGTERM implies FOLL_PIN */
        rc = check_and_migrate_movable_pages(nr_pinned_pages, pages);
    } while (rc == -EAGAIN);
    memalloc_pin_restore(flags);
    return rc ? rc : nr_pinned_pages;
}
The __gup_longterm_locked function handles the FOLL_LONGTERM case differently from the non-FOLL_LONGTERM case. The FOLL_LONGTERM flag is used when the caller intends to use the pages for a prolonged period. If this flag is not set, the function simply calls __get_user_pages_locked directly.

When FOLL_LONGTERM is set, the function first sets PF_MEMALLOC_PIN in the current process's flags using memalloc_pin_save (which also saves the original flag state). This flag ensures that the page allocation routines do not allocate memory from the movable zone, to prevent a situation where a page is returned to the gup caller but later moved by the memory manager while the caller is still using it. Once __get_user_pages_locked returns, check_and_migrate_movable_pages is invoked to verify whether all the pinned pages in the pages array are suitable for long-term pinning. A page is considered long-term pinnable if it is neither device coherent memory nor allocated from a movable zone. If any long-term unpinnable pages are present in the pages array, they are migrated (or moved from device coherent memory to regular memory) before the operation is retried. At the end of the function, the original process flag is restored using memalloc_pin_restore.
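From a caller's perspective, the long-term case is just one extra flag. A sketch of the typical pattern, here using the pin_user_pages_fast variant (which takes the mmap lock internally when it needs it), might look like this; longterm_pin is a made-up helper name:

#include <linux/mm.h>

/*
 * Sketch: pin a user buffer for long-term use (e.g. an RDMA-style
 * registration).  FOLL_LONGTERM tells the gup core to migrate any pages
 * that currently live in movable or device coherent memory before the
 * pin succeeds.
 */
static int longterm_pin(unsigned long uaddr, int nr_pages,
                        struct page **pages)
{
    int pinned;

    pinned = pin_user_pages_fast(uaddr, nr_pages,
                                 FOLL_WRITE | FOLL_LONGTERM, pages);
    if (pinned != nr_pages) {
        if (pinned > 0)
            unpin_user_pages(pages, pinned);
        return pinned < 0 ? pinned : -EFAULT;
    }

    /*
     * The caller now stores 'pages' for the lifetime of the registration
     * and releases them later with unpin_user_pages().
     */
    return 0;
}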
__get_user_pages_locked
Both pin_user_pages and get_user_pages eventually call __get_user_pages_locked after validating the input flags; this function handles re-taking the mmap lock in case it has been released by internal functions. It repeatedly calls __get_user_pages in a loop to populate the pages array with struct page pointers until all the pages are loaded or an error occurs. Initially, it attempts to retrieve all the struct pages at once by calling __get_user_pages with the total number of pages requested. __get_user_pages returns the number of pages it successfully retrieved by walking the page tables (possibly taking minor page faults along the way), and the pages array is filled with struct page pointers, except when a device driver returns the VM_FAULT_NOPAGE code.
If the fault handler is called from __get_user_pages and determines that a retry is necessary, it releases the mmap lock. Upon this retry request, __get_user_pages returns zero to its caller, __get_user_pages_locked. The __get_user_pages_locked function then checks whether the mmap lock is no longer held, which signals that a retry is needed. It re-acquires the mmap lock and calls __get_user_pages to retrieve a single page, assuming this was the only page that caused the fault. Once the fault is resolved, it proceeds to load the remaining pages.
Here’s the implementation of __get_user_pages_locked, where __get_user_pages is called in a loop. It decrements nr_pages to track the remaining number of pages to process and increments the pages_done variable by the number of pages processed so far. The pages array holds pointers to the retrieved struct page structures. If the mmap lock is still held after the call to __get_user_pages returns, as indicated by the locked variable, the function breaks out of the loop.
    pages_done = 0;
    for (;;) {
        ret = __get_user_pages(mm, start, nr_pages, flags, pages,
                               locked);
        if (!(flags & FOLL_UNLOCKABLE)) {
            /* VM_FAULT_RETRY couldn't trigger, bypass */
            pages_done = ret;
            break;
        }

        /* VM_FAULT_RETRY or VM_FAULT_COMPLETED cannot return errors */
        if (!*locked) {
            BUG_ON(ret < 0);
            BUG_ON(ret >= nr_pages);
        }

        if (ret > 0) {
            nr_pages -= ret;
            pages_done += ret;
            if (!nr_pages)
                break;
        }
        if (*locked) {
            /*
             * VM_FAULT_RETRY didn't trigger or it was a
             * FOLL_NOWAIT.
             */
            if (!pages_done)
                pages_done = ret;
            break;
        }
        /*
         * VM_FAULT_RETRY triggered, so seek to the faulting offset.
         * For the prefault case (!pages) we only update counts.
         */
        if (likely(pages))
            pages += ret;
        start += ret << PAGE_SHIFT;

        /* The lock was temporarily dropped, so we must unlock later */
        must_unlock = true;
This is the retry portion of the loop in __get_user_pages_locked, reached when the previous call to __get_user_pages could not get all the requested pages because a retry operation is required. Since the mmap lock was released by the fault handler, it needs to be re-acquired before calling __get_user_pages again for a single page. The loop then continues to process the remaining pages.
retry:
        /*
         * Repeat on the address that fired VM_FAULT_RETRY
         * with both FAULT_FLAG_ALLOW_RETRY and
         * FAULT_FLAG_TRIED. Note that GUP can be interrupted
         * by fatal signals or even common signals, depending on
         * the caller's request. So we need to check it before we
         * start trying again otherwise it can loop forever.
         */
        if (gup_signal_pending(flags)) {
            if (!pages_done)
                pages_done = -EINTR;
            break;
        }

        ret = mmap_read_lock_killable(mm);
        if (ret) {
            BUG_ON(ret > 0);
            if (!pages_done)
                pages_done = ret;
            break;
        }

        *locked = 1;
        ret = __get_user_pages(mm, start, 1, flags | FOLL_TRIED,
                               pages, locked);
        if (!*locked) {
            /* Continue to retry until we succeeded */
            BUG_ON(ret != 0);
            goto retry;
        }
        if (ret != 1) {
            BUG_ON(ret > 1);
            if (!pages_done)
                pages_done = ret;
            break;
        }
        nr_pages--;
        pages_done++;
        if (!nr_pages)
            break;
        if (likely(pages))
            pages++;
        start += PAGE_SIZE;
    }
__get_user_pages
Now I delve into the __get_user_pages function, which is responsible for finding the struct page structures and filling the pages array by walking the page tables. If a PTE is missing, or if the PTE permissions do not match the user's request, the function calls the memory manager's page fault handler to populate the PTE with an appropriate page.
The function begins by obtaining the virtual memory area (VMA) for the given address. It then calls follow_page_mask to start walking the page tables, find the page frame number (PFN), and retrieve the struct page. If it cannot obtain the required page, this indicates that either the PTE is missing, or the PTE is read-only when write was requested, or the page needs to be unshared, among other possibilities. This scenario triggers a page fault, which is handled by handle_mm_fault. handle_mm_fault resolves the fault by populating the correct PTE entry, either by allocating a page or by adjusting the existing PTE flags. Control is then returned to the caller to retry the operation.
The __get_user_pages function first retrieves the VMA for the virtual address currently being processed (held in the start variable) in a loop. It then verifies that the VMA is suitable for the current operation using check_vma_flags. One important check performed by check_vma_flags is that FOLL_LONGTERM cannot be requested for an address range that belongs to a filesystem whose backing pages require dirty tracking.
    do {
        /* first iteration or cross vma bound */
        if (!vma || start >= vma->vm_end) {
            vma = gup_vma_lookup(mm, start);
            [...]
            if (!vma) {
                ret = -EFAULT;
                goto out;
            }
            ret = check_vma_flags(vma, gup_flags);
            if (ret)
                goto out;
        }
The function then retrieves the struct page for the start address by calling follow_page_mask, which walks through the page tables. The retry label facilitates re-walking the page table entries after resolving a page fault.
retry:
        [...]
        page = follow_page_mask(vma, start, foll_flags, &ctx);
If there is a missing entry in the page table for the requested address, or if the caller has requested a write operation where the page table entry is read-only, the function calls faultin_page to handle the page fault scenario. Typically, the operation is retried after invoking the fault-in code, which corrects the page table entry if possible. The page table walking code is handled in follow_page_mask, which is why the retry label in the previous code block is useful.
The function returns zero to the caller when it receives the -EBUSY or -EAGAIN error codes from the fault handler. The caller should then check whether the mmap lock has been released before retrying the operation.

The condition that checks PTR_ERR(page) == -EEXIST signifies a valid PTE without a corresponding struct page for the virtual address. This situation occurs when the VMA corresponding to the virtual address has the VM_PFNMAP flag set.
        if (!page || PTR_ERR(page) == -EMLINK) {
            ret = faultin_page(vma, start, &foll_flags,
                               PTR_ERR(page) == -EMLINK, locked);
            switch (ret) {
            case 0:
                goto retry;
            case -EBUSY:
            case -EAGAIN:
                ret = 0;
                fallthrough;
            case -EFAULT:
            case -ENOMEM:
            case -EHWPOISON:
                goto out;
            }
            BUG();
        } else if (PTR_ERR(page) == -EEXIST) {
            /*
             * Proper page table entry exists, but no corresponding
             * struct page. If the caller expects **pages to be
             * filled in, bail out now, because that can't be done
             * for this page.
             */
            if (pages) {
                ret = PTR_ERR(page);
                goto out;
            }
        } else if (IS_ERR(page)) {
            ret = PTR_ERR(page);
            goto out;
        }
Page Table Walking
Page table walking is performed by calling follow_page_mask, starting at the first level of paging. The struct page is retrieved from the underlying PTE and returned to the caller if it meets certain criteria, which will be explained in the following sections.
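Before going level by level, here is a condensed sketch of the same walk that the following sections describe in detail. It is illustrative only: it takes no locks and handles neither huge pages nor swap entries, so it is not a substitute for follow_page_mask:

#include <linux/mm.h>
#include <linux/pgtable.h>

/*
 * Illustrative only: resolve a user virtual address to a struct page by
 * walking PGD -> P4D -> PUD -> PMD -> PTE.  No locking, no huge-page or
 * swap-entry handling, so this must not be used as-is.
 */
static struct page *toy_walk(struct mm_struct *mm, unsigned long addr)
{
    pgd_t *pgd = pgd_offset(mm, addr);      /* PML4 entry, bits 47:39  */
    p4d_t *p4d;
    pud_t *pud;
    pmd_t *pmd;
    pte_t *ptep, pte;
    struct page *page = NULL;

    if (pgd_none(*pgd) || pgd_bad(*pgd))
        return NULL;
    p4d = p4d_offset(pgd, addr);            /* no-op with 4-level paging */
    if (p4d_none(*p4d) || p4d_bad(*p4d))
        return NULL;
    pud = pud_offset(p4d, addr);            /* PDPT entry, bits 38:30 */
    if (pud_none(*pud) || pud_bad(*pud))
        return NULL;
    pmd = pmd_offset(pud, addr);            /* PD entry, bits 29:21 */
    if (pmd_none(*pmd) || pmd_trans_huge(*pmd))
        return NULL;                        /* huge pages not handled here */
    ptep = pte_offset_map(pmd, addr);       /* PT entry, bits 20:12 */
    if (!ptep)
        return NULL;
    pte = ptep_get(ptep);
    if (pte_present(pte))
        page = pte_page(pte);               /* pfn_to_page(pte_pfn(pte)) */
    pte_unmap(ptep);
    return page;
}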
Processing First Level of Paging – follow_page_mask
follow_page_mask handles the PML4 (Page Map Level 4) table, which is the topmost level of the page table hierarchy. The mm->pgd pointer contains the virtual address of the PML4 table (the physical address of the PML4 table corresponds to bits 51:12 of the CR3 register). The function uses the pgd_offset macro to calculate the virtual address of the PML4 entry (PML4E) and stores it as a pointer in a local variable named pgd. The pgd_offset macro uses bits 47:39 of the virtual address as an index to look up entries in mm->pgd (the PML4 table), where each entry is 64 bits in size. Since pgd holds the address of a PML4E, dereferencing it yields the physical address of the PDPT (Page Directory Pointer Table), which is the next level in the hierarchy.
After checking the presence of the pgd entry, it calls follow_p4d_mask.
    pgd = pgd_offset(mm, address);

    if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
        return no_page_table(vma, flags);

    return follow_p4d_mask(vma, address, pgd, flags, ctx);
Processing Second Level of Paging – follow_p4d_mask
follow_p4d_mask nominally covers the P4D level of the Linux page table hierarchy, which only exists as a separate hardware level with 5-level paging. The pgdp pointer contains the virtual address of the PML4E. With 4-level paging, the P4D is folded into the PGD, so the p4d_offset macro simply returns the pointer it was given: the local variable p4d refers to the same PML4E as pgdp, and no virtual address bits are consumed at this step. The second hardware level, the PDPT (Page Directory Pointer Table), is handled by follow_pud_mask below.
After checking the presence of the p4d entry, it calls follow_pud_mask.
static struct page *follow_p4d_mask(struct vm_area_struct *vma,
                                    unsigned long address, pgd_t *pgdp,
                                    unsigned int flags,
                                    struct follow_page_context *ctx)
{
    p4d_t *p4d;

    p4d = p4d_offset(pgdp, address);
    if (p4d_none(*p4d))
        return no_page_table(vma, flags);
    BUILD_BUG_ON(p4d_huge(*p4d));
    if (unlikely(p4d_bad(*p4d)))
        return no_page_table(vma, flags);

    return follow_pud_mask(vma, address, p4d, flags, ctx);
}
With 4-level paging, the PUD (Page Upper Directory) corresponds to the PDPT. The pud_offset macro uses bits 38:30 of the virtual address as an index into the PDPT, whose physical address is obtained by dereferencing p4dp (the PML4E), and stores the virtual address of the PDPT entry (PDPTE) in the local variable pud. Since pud holds the address of a PDPTE, dereferencing it yields the physical address of the PD (Page Directory), the next level in the hierarchy. follow_pud_mask then calls follow_pmd_mask to process the third level of the paging structure.
static struct page *follow_pud_mask(struct vm_area_struct *vma,
                                    unsigned long address, p4d_t *p4dp,
                                    unsigned int flags,
                                    struct follow_page_context *ctx)
{
    pud_t *pud;

    pud = pud_offset(p4dp, address);
    if (pud_none(*pud))
        return no_page_table(vma, flags);
    [...]
    if (unlikely(pud_bad(*pud)))
        return no_page_table(vma, flags);

    return follow_pmd_mask(vma, address, pud, flags, ctx);
}
Processing Third Level of Paging – follow_pmd_mask
follow_pmd_mask handles the PD (Page Directory), which is the third level of the page table hierarchy. As described above, the pudp pointer contains the virtual address of the PDPTE. The function uses the pmd_offset macro to calculate the virtual address of the PD entry (PDE) and stores it as a pointer in a local variable named pmd. In Linux terminology, the PD is referred to as the Page Middle Directory (PMD). The pmd_offset macro uses bits 29:21 of the virtual address as an index to look up entries in the PD (obtained by dereferencing pudp), where each entry is 64 bits in size. Since pmd holds the address of a PDE, dereferencing it yields the physical address of the PT (Page Table), which is the last level in the hierarchy.
In addition to calculating the PDE held in the pmd variable, the function checks whether the entry maps a 2MB page instead of pointing to a PT (Page Table), by examining bit 7 of the pmd entry (the PS bit on x86) using the pmd_trans_huge function. If it is a huge page, it calls follow_trans_huge_pmd, which derives the page from bits 51:21 of the pmd entry. The Linux kernel code always locks the final level page table entry to ensure it cannot be modified while it's being accessed.
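In effect, resolving an address inside a 2MB mapping amounts to taking the head page of the huge page and adding the sub-page index encoded in bits 20:12 of the virtual address, along the lines of this simplified sketch (the real logic lives in follow_trans_huge_pmd, which also grabs a reference on the page):

#include <linux/mm.h>
#include <linux/huge_mm.h>

/* Simplified: pmd maps a 2MB page, so pick the right 4KB subpage.
 * Requires CONFIG_TRANSPARENT_HUGEPAGE for HPAGE_PMD_MASK. */
static struct page *huge_pmd_subpage(pmd_t *pmd, unsigned long addr)
{
    struct page *page = pmd_page(*pmd);     /* head page of the 2MB frame */

    /* sub-page index comes from bits 20:12 of the virtual address */
    page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
    return page;
}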
If the PS bit is not set, then we can follow the last level of the paging structure by calling follow_page_pte, which dereferences bits 51:12 of the pmd entry to obtain the physical address of the PT.
The check for pmd_trans_huge is performed twice: the first time without holding the pmd lock, and the second time after acquiring it. The entry referenced by pmd might change between the first check and acquiring the lock, for example if a huge page is split or collapsed concurrently; the second check after acquiring the lock ensures the correct entry is used.
static struct page *follow_pmd_mask(struct vm_area_struct *vma,
                                    unsigned long address, pud_t *pudp,
                                    unsigned int flags,
                                    struct follow_page_context *ctx)
{
    pmd_t *pmd, pmdval;
    spinlock_t *ptl;
    struct page *page;
    struct mm_struct *mm = vma->vm_mm;

    pmd = pmd_offset(pudp, address);
    pmdval = pmdp_get_lockless(pmd);
    if (pmd_none(pmdval))
        return no_page_table(vma, flags);
    if (!pmd_present(pmdval))
        return no_page_table(vma, flags);
    [...]
    if (likely(!pmd_trans_huge(pmdval)))
        return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);

    if (pmd_protnone(pmdval) && !gup_can_follow_protnone(vma, flags))
        return no_page_table(vma, flags);

    ptl = pmd_lock(mm, pmd);
    if (unlikely(!pmd_present(*pmd))) {
        spin_unlock(ptl);
        return no_page_table(vma, flags);
    }
    if (unlikely(!pmd_trans_huge(*pmd))) {
        spin_unlock(ptl);
        return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
    }
    [...]
    page = follow_trans_huge_pmd(vma, address, pmd, flags);
    spin_unlock(ptl);
    ctx->page_mask = HPAGE_PMD_NR - 1;
    return page;
}
Processing Fourth Level of Paging – follow_page_pte
follow_page_pte handles the PT (Page Table), which is the last level of the page table hierarchy. The pmd pointer contains the virtual address of the PDE. The function uses the pte_offset_map_lock macro to calculate the virtual address of the PT entry (PTE) and stores it as a pointer in a local variable named ptep. The pte_offset_map_lock macro uses bits 20:12 of the virtual address as an index to look up entries in the PT (obtained by dereferencing pmd), where each entry is 64 bits in size. Since ptep holds the address of a PTE, dereferencing it yields the physical address of the page. The macro also takes the page table lock protecting this PTE.
The ptep_get macro returns the dereferenced PTE in the pte variable.
static struct page *follow_page_pte(struct vm_area_struct *vma,
                                    unsigned long address, pmd_t *pmd,
                                    unsigned int flags,
                                    struct dev_pagemap **pgmap)
{
    struct mm_struct *mm = vma->vm_mm;
    struct page *page;
    spinlock_t *ptl;
    pte_t *ptep, pte;
    int ret;

    /* FOLL_GET and FOLL_PIN are mutually exclusive. */
    if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
                     (FOLL_PIN | FOLL_GET)))
        return ERR_PTR(-EINVAL);

    ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
    if (!ptep)
        return no_page_table(vma, flags);
    pte = ptep_get(ptep);
    if (!pte_present(pte))
        goto no_page;
    if (pte_protnone(pte) && !gup_can_follow_protnone(vma, flags))
        goto no_page;
The function then calls vm_normal_page to retrieve the struct page corresponding to the PTE for a “normal” page. A “normal” page refers to a page with a valid struct page backing it, as opposed to special mappings like direct device memory mappings or pages without a backing struct page.
First, vm_normal_page checks whether the VMA has either the VM_PFNMAP or VM_MIXEDMAP flag set, indicating that the VMA contains pages that map physical addresses directly, for example for device or DMA usage.
The VM_PFNMAP flag is set for mappings whose PFNs do not have corresponding struct pages; such mappings are created by the remap_pfn_range function to map kernel memory into userspace. If the VM_PFNMAP flag is set, the function checks whether the translation between the virtual address and the PFN is linear, by comparing the PFN with the sum of the offset from vma->vm_start and vma->vm_pgoff. If this condition holds, no struct page is returned because the mapping is treated as special: no backing struct page exists, so no reference counting on a page is needed. If the translation is non-linear, a further check determines whether it is a COW (copy-on-write) mapping. Non-COW pages whose VMA has the VM_PFNMAP flag set are also treated as special pages (i.e., no corresponding struct page is returned).
If the VM_MIXEDMAP flag is set on a VMA, the function checks whether the PFN is valid by calling pfn_valid. The validity of PFNs depends on whether the system uses FLATMEM or SPARSEMEM. If the PFN is valid, it is backed by a struct page and this page is returned to the caller. The VM_MIXEDMAP flag thus allows a range of addresses to support COW mappings for which a struct page can be returned.
Thus, the main difference is that VM_PFNMAP treats a page as “special” (i.e., with no backing struct page) for linear mappings or non-COW pages, whereas VM_MIXEDMAP can return a struct page whenever the PFN falls into a valid physical range, without concern for COW status.
For pages backed by a file, a COW (copy-on-write) mapping is identified by examining vma->vm_flags for the presence of the VM_MAYWRITE flag and the absence of the VM_SHARED flag. VMAs with the VM_SHARED flag set mean that the pages are shared and cannot be COWed. For anonymous pages, COW is checked by verifying whether the anonymous page is exclusive, meaning the page is not shared with another process.
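The file-backed COW test is compact enough to quote in simplified form; it is essentially the kernel's is_cow_mapping helper:

/* Essentially the kernel's is_cow_mapping(): a private mapping that could
 * become writable (VM_MAYWRITE set, VM_SHARED clear) is a COW mapping. */
static inline bool is_cow_mapping_example(vm_flags_t flags)
{
    return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}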
Next, the function checks whether the PTE maps the zero PFN. The struct page for the zero PFN always exists and never goes away; therefore there is no need for reference counting on that page, and no struct page is returned.
After performing all these checks, the function returns the corresponding struct page for the PFN by calling pfn_to_page.
    if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
        if (vma->vm_flags & VM_MIXEDMAP) {
            if (!pfn_valid(pfn))
                return NULL;
            goto out;
        } else {
            unsigned long off;

            off = (addr - vma->vm_start) >> PAGE_SHIFT;
            if (pfn == vma->vm_pgoff + off)
                return NULL;
            if (!is_cow_mapping(vma->vm_flags))
                return NULL;
        }
    }

    if (is_zero_pfn(pfn))
        return NULL;

    [...]

out:
    return pfn_to_page(pfn);
Once vm_normal_page returns, the execution of follow_page_pte resumes. It checks whether the user requested writing to the page using the FOLL_WRITE flag and whether writing is allowed:
    if ((flags & FOLL_WRITE) &&
        !can_follow_write_pte(pte, page, vma, flags)) {
        page = NULL;
        goto out;
    }
The can_follow_write_pte function checks whether the PTE is writable. If it is, it returns true. Otherwise, it checks for the FOLL_FORCE flag, which is relevant for the ptrace path and is out of the scope of this blog post. If these checks return false, follow_page_pte returns NULL. Upon receiving NULL, the top-level __get_user_pages function treats this like a page fault and calls faultin_page to either populate the PTE, make the PTE writable, or handle the COW scenario.
Next, it checks whether the page needs to be “unshared”. This occurs for COW pages where the PTE is not writable and the caller is pinning the page without requesting write access. When a write later occurs on a COW page, a copy of the page is made and modifications are applied to the copy. With gup functions, however, we aim to access the original data rather than a later copy; therefore, we attempt to break the COW early, and gup_must_unshare returning true is what triggers unsharing of the page. Long-term pinning of COW pages backed by a file likewise requires the pages to be unshared. To request unsharing of a page, follow_page_pte returns -EMLINK to the caller.
    if (!pte_write(pte) && gup_must_unshare(vma, flags, page)) {
        page = ERR_PTR(-EMLINK);
        goto out;
    }
Next, it obtains a reference to the page by calling try_grab_page. This function retrieves the folio of the page (a folio represents a collection of pages, and the number of pages in a folio is always a power of 2). A folio is represented by struct folio, which contains a struct page within a union. Many fields of struct page must match corresponding fields in struct folio, and these are verified using the FOLIO_MATCH macro. When a struct page is part of a larger folio, the folio is found via the page->compound_head field, which points to the head page of the folio. Note that struct folio is larger than struct page when it represents multiple consecutive pages; otherwise, they are the same size.
The try_grab_folio function increases the reference count of the folio. For the FOLL_GET flag, it increments the page->refcount of the head page. For the FOLL_PIN flag, the reference counting mechanism differs based on the folio size. If it is a large folio (i.e., it contains more than one page), it increments both the head page's page->refcount and the folio's folio->_pincount. If the folio represents a single page, page->refcount is instead incremented by GUP_PIN_COUNTING_BIAS, effectively using the higher bits of the refcount as a pin counter. This trick saves space in page->refcount while still making it possible to tell whether a page is pinned.
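Putting that together, the pin-side accounting behaves roughly like the following paraphrase of the FOLL_PIN branch of try_grab_folio (simplified; the real function also special-cases the zero folio and handles FOLL_GET):

/* Paraphrase of the FOLL_PIN accounting in try_grab_folio(): large folios
 * track pins in a dedicated counter, while single-page folios fold the pin
 * count into the upper bits of the refcount via GUP_PIN_COUNTING_BIAS. */
static void record_pins(struct folio *folio, int refs)
{
    if (folio_test_large(folio)) {
        folio_ref_add(folio, refs);
        atomic_add(refs, &folio->_pincount);
    } else {
        folio_ref_add(folio, refs * GUP_PIN_COUNTING_BIAS);
    }
}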
When control returns to the __get_user_pages function, it checks whether the returned page is NULL or carries the -EMLINK error code, both of which indicate that faultin_page should be called to populate the PTE, possibly with a new page, as discussed in the next section.
When a valid page is returned, __get_user_pages fills the pages array and continues to populate it in a loop.
Handling Memory Fault
At this stage, no suitable page has been found, so the page fault handler needs to be invoked to either allocate a new page or modify the permission bits in the PTE. This section provides an overview of how the memory fault is handled.
Faulting in a Page
The internal memory management (mm) page fault routine uses FAULT_FLAG_* flags instead of the FOLL_* flags used in the gup routines. The first step to call the mm internal page fault handler is to convert the FOLL_* flags into their corresponding FAULT_FLAG_* equivalents.
The faultin_page function receives the unshare parameter as true from __get_user_pages if the -EMLINK error code was returned from follow_page_mask.
static int faultin_page(struct vm_area_struct *vma,
                        unsigned long address, unsigned int *flags,
                        bool unshare, int *locked)
{
    unsigned int fault_flags = 0;
    vm_fault_t ret;

    if (*flags & FOLL_NOFAULT)
        return -EFAULT;
    if (*flags & FOLL_WRITE)
        fault_flags |= FAULT_FLAG_WRITE;
    if (*flags & FOLL_REMOTE)
        fault_flags |= FAULT_FLAG_REMOTE;
    if (*flags & FOLL_UNLOCKABLE) {
        fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
        /*
         * FAULT_FLAG_INTERRUPTIBLE is opt-in. GUP callers must set
         * FOLL_INTERRUPTIBLE to enable FAULT_FLAG_INTERRUPTIBLE.
         * That's because some callers may not be prepared to
         * handle early exits caused by non-fatal signals.
         */
        if (*flags & FOLL_INTERRUPTIBLE)
            fault_flags |= FAULT_FLAG_INTERRUPTIBLE;
    }
    if (*flags & FOLL_NOWAIT)
        fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
    if (*flags & FOLL_TRIED) {
        /*
         * Note: FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_TRIED
         * can co-exist
         */
        fault_flags |= FAULT_FLAG_TRIED;
    }
    if (unshare) {
        fault_flags |= FAULT_FLAG_UNSHARE;
        /* FAULT_FLAG_WRITE and FAULT_FLAG_UNSHARE are incompatible */
        VM_BUG_ON(fault_flags & FAULT_FLAG_WRITE);
    }
Next, it calls handle_mm_fault, which is the actual mm fault handler routine.
    ret = handle_mm_fault(vma, address, fault_flags, NULL);
Explanation of the page fault handler is out of the scope of this blog and will be covered in the next blog post.
After handle_mm_fault completes, if a page was successfully populated for the address, the operation is retried from the __get_user_pages routine or from its caller. The retry walks the page tables again, obtains the struct page pointer, and populates the pages array.
Unpinning Pages
When the user is done with pages that were previously pinned, the unpin_user_pages function must be called to release the references to those pages.
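A short sketch of the release side: if the device or the kernel wrote into the pages, the unpin_user_pages_dirty_lock variant can mark them dirty while unpinning so the data is not lost for file-backed mappings (the helper name below is made up):

#include <linux/mm.h>

/* Sketch: release a previously pinned buffer.  'wrote' says whether the
 * pages were written (by DMA or the kernel) and must be marked dirty. */
static void release_pinned_buf(struct page **pages, unsigned long nr_pages,
                               bool wrote)
{
    if (wrote)
        unpin_user_pages_dirty_lock(pages, nr_pages, true);
    else
        unpin_user_pages(pages, nr_pages);
}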
unpin_user_pages
Let’s take a look at the implementation of unpin_user_pages:
void unpin_user_pages(struct page **pages, unsigned long npages)
{
    unsigned long i;
    struct folio *folio;
    unsigned int nr;

    [...]
    for (i = 0; i < npages; i += nr) {
        folio = gup_folio_next(pages, npages, i, &nr);
        gup_put_folio(folio, nr, FOLL_PIN);
    }
}
In this implementation of unpin_user_pages, the folio is retrieved by calling gup_folio_next on the pages array. If consecutive entries of the pages array belong to the same folio, the function skips ahead to the entry that starts the next folio. The variable nr holds the number of pages belonging to the current folio.
gup_folio_next
The implementation of gup_folio_next is as follows:
static inline struct folio *gup_folio_next(struct page **list,
                                           unsigned long npages,
                                           unsigned long i,
                                           unsigned int *ntails)
{
    struct folio *folio = page_folio(list[i]);
    unsigned int nr;

    for (nr = i + 1; nr < npages; nr++) {
        if (page_folio(list[nr]) != folio)
            break;
    }

    *ntails = nr - i;
    return folio;
}
In the above implementation of gup_folio_next, the folio is first retrieved using the page_folio function, which accepts a page as input; the folio is located through the page->compound_head member. Next, the function calculates how many consecutive pages in the list belong to the current folio by finding where the next folio starts. The ntails pointer receives the number of pages in the current folio.
gup_put_folio
Once the folio is retrieved, its reference count is decremented by calling gup_put_folio.
static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
{
    if (flags & FOLL_PIN) {
        if (is_zero_folio(folio))
            return;
        if (folio_test_large(folio))
            atomic_sub(refs, &folio->_pincount);
        else
            refs *= GUP_PIN_COUNTING_BIAS;
    }

    if (!put_devmap_managed_page_refs(&folio->page, refs))
        folio_put_refs(folio, refs);
}
When the FOLL_PIN flag is passed to gup_put_folio (i.e., when it is called from unpin_user_pages), it handles the specific case of dropping the references taken by pin_user_pages. For large folios, it decrements the folio's _pincount member. For a single-page folio, the reference count is instead decreased by GUP_PIN_COUNTING_BIAS per page, undoing the bias added at pin time. Finally, the page->refcount of the head page of the folio is decremented via folio_put_refs.
Conclusion
As Ksplice engineers, we often have to learn new areas of the kernel very rapidly to understand how a patch can be applied without rebooting the system. This work can be challenging, but it’s also rewarding to explore new areas of the kernel and understand how they interact with each other.
If you find this type of work interesting, consider applying for a job with the Ksplice team! Feel free to drop us a line at ksplice-support_ww@oracle.com.
Further Reading
The incremental development of pinning user pages, as presented at the annual Linux Storage, Filesystem, and Memory-Management Summits, is documented in several LWN articles. Interested readers can find detailed documentation at this link.