Introduction
As Ksplice engineers, we often have to look at completely different sub-systems of the Linux kernel to patch them, either to fix a vulnerability or to add trip wires. As a result, we gain a lot of knowledge in various areas. In this blog post, I’ll share my experience regarding pinning user pages in the kernel and the related page table walking internals, based on insights gained through a Known Exploit Detection Update.
For example, if a kernel mode driver wants to perform a DMA transfer from a device to a memory range that is mapped in user mode, the driver can call pin_user_pages to obtain an array of struct page structures representing that range of memory. Page frame numbers (PFNs) can be derived from each struct page and given to the DMA engine to transfer bytes directly from the device to the memory pages. Since these pages are already mapped in user mode, the user mode application can access the data right after the DMA transfer completes, without requiring the kernel to maintain a temporary buffer for the DMA engine and then copy the data back to user mode. Furthermore, the kernel mode driver can map the physical pages represented by the struct page structures in kernel mode and write data into them, and that data will be visible to user mode instantly.
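To make this concrete, here is a minimal sketch of that pattern, assuming a hypothetical driver that already knows the page-aligned user address and page count. pin_ubuf and fill_first_page are made-up helper names, and a real driver would DMA-map the pages rather than hand out raw PFNs:

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/sched.h>
#include <linux/string.h>

/*
 * Hypothetical helper: pin a page-aligned user buffer so a device (or the
 * kernel) can access it directly.  Simplified sketch only; a real driver
 * would also DMA-map the pages and handle partial pins more carefully.
 */
static long pin_ubuf(unsigned long ubuf, unsigned long nr_pages,
                     struct page **pages)
{
    long pinned;

    mmap_read_lock(current->mm);
    pinned = pin_user_pages(ubuf, nr_pages, FOLL_WRITE, pages);
    mmap_read_unlock(current->mm);

    if (pinned > 0 && pinned != nr_pages) {
        /* Could not pin the whole range: release what we got. */
        unpin_user_pages(pages, pinned);
        return -EFAULT;
    }
    return pinned;
}

static void fill_first_page(struct page **pages)
{
    /*
     * Writing through a kernel mapping of a pinned page is immediately
     * visible through the user mapping of the same page frame.
     */
    void *kaddr = kmap_local_page(pages[0]);

    memset(kaddr, 0xab, PAGE_SIZE);
    kunmap_local(kaddr);

    /* page_to_pfn(pages[0]) could likewise be handed to a DMA engine. */
}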
The functions pin_user_pages, get_user_pages, and their variants are collectively referred to as gup functions. These functions define the interface by which kernel drivers request an array of struct page structures for a given range of user mode virtual addresses and specify their intended operations on those pages.
gup functions are the opposite of calling mmap to map kernel memory into the user mode address space. mmap is mainly used for three subcategories: a) mapping anonymous pages (either shared or exclusive) in user mode, b) mapping specific physical memory ranges using the remap_pfn_range function to map iomem, and c) mapping filesystem data from page caches. On the other hand, gup functions are used to reference pages from user mode so that these pages can be mapped in kernel mode.
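For contrast, category b) above, the remap_pfn_range direction, looks roughly like the following sketch of a driver's .mmap handler. MY_DEVICE_PHYS_BASE is a made-up placeholder for whatever physical region the driver exposes:

#include <linux/fs.h>
#include <linux/mm.h>

#define MY_DEVICE_PHYS_BASE 0xfd000000UL   /* hypothetical device memory */

/* Sketch of an mmap handler that maps device memory into user space,
 * the opposite direction to gup. */
static int my_drv_mmap(struct file *file, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;
    unsigned long pfn = MY_DEVICE_PHYS_BASE >> PAGE_SHIFT;

    /*
     * Such mappings get VM_PFNMAP/VM_IO: there is no struct page behind
     * them, which is exactly why vm_normal_page() treats them specially
     * later in this post.
     */
    return remap_pfn_range(vma, vma->vm_start, pfn, size,
                           vma->vm_page_prot);
}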
I will explain how a range of page frames mapped in user mode is referenced in struct page form in kernel mode, and how the referenced struct page pointers are stored in an array so that kernel mode components can use these pointers to map the underlying pages or feed them directly to the DMA engine.
The scope of this blog post is to explore the gup functions, the page table walking code that obtains the struct page structures corresponding to the page table entries, the various vma and struct page flags, and finally how an array of struct page pointers is returned to the caller.
This blog post assumes that readers have a good understanding of paging in a 64-bit system. A useful resource to refresh the concept is available here.
Readers are also expected to have a basic understanding of the gup APIs and flags (i.e., FOLL_GET, FOLL_PIN and FOLL_LONGTERM), which can be found in the documentation here.
This blog post is based on Linux kernel version 6.8.
Overview
I am going to explain the pin_user_pages and get_user_pages functions together, as these are the most important of all the gup functions. Both follow a common path by internally calling the same function, and they employ different flags to handle specific cases where necessary.
The functions pin_user_pages and get_user_pages assume the caller has taken the mmap lock. They then call __get_user_pages_locked, which checks whether the FOLL_PIN flag was passed; if not, it sets the FOLL_GET flag. The FOLL_PIN and FOLL_GET flags are mutually exclusive. Both flags take references on pages, but the difference is that FOLL_GET only ensures the struct page is not freed, while FOLL_PIN ensures both the presence of the struct page and that its contents will not be moved or modified behind the caller's back. The kernel distinguishes between the two reference counting methods by using the higher bits of page->refcount for FOLL_PIN and the lower bits for FOLL_GET.
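To illustrate that bias-based scheme, the test the kernel uses to guess whether a page might be pinned boils down to something like the following simplified paraphrase of folio_maybe_dma_pinned (the helper name below is made up):

#include <linux/mm.h>

/*
 * Simplified paraphrase of the pin check: FOLL_PIN adds
 * GUP_PIN_COUNTING_BIAS (1024) per pin on a single-page folio, so a
 * reference count at or above the bias strongly suggests the page is
 * DMA-pinned, while plain FOLL_GET references only bump the low bits.
 * Large folios keep an exact pin count in a dedicated field instead.
 */
static inline bool page_probably_pinned(struct page *page)
{
    struct folio *folio = page_folio(page);

    if (folio_test_large(folio))
        return atomic_read(&folio->_pincount) > 0;

    return folio_ref_count(folio) >= GUP_PIN_COUNTING_BIAS;
}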
Control then enters a loop that repeatedly calls __get_user_pages to walk the page tables and obtain the corresponding struct page for the PFN in each PTE (Page Table Entry). Initially, the loop tries to get all the pages in the specified range (given by the starting virtual address and number of pages) at once. If that fails, it must be because a PTE is missing its page frame, or because the PTE is read-only when write access was requested. In such cases, the fault handler is invoked to populate the missing PTE, either with an anonymous page or a page backed by the page cache, or to correct the PTE by making it writable or creating a copy of the page. After the PTE is populated or corrected, the page table walk is retried to retrieve the struct pages from the PTEs and populate the pages array with them, as the user requested.
GUP Functions Initialization, Locking and Flag Handling
pin_user_pages and get_user_pages functions
The pin_user_pages function should be enclosed within mmap_read_lock and mmap_read_unlock calls to acquire and release the mmap lock, as shown in the code snippet below. Passing the FOLL_WRITE and FOLL_LONGTERM flags to pin_user_pages is optional. The array of struct page pointers is returned in the pages variable.
mmap_read_lock(current->mm);
ret = pin_user_pages(ubuf, nr_pages, FOLL_WRITE | FOLL_LONGTERM, pages);
mmap_read_unlock(current->mm);
The implementation of pin_user_pages is as follows:
long pin_user_pages(unsigned long start, unsigned long nr_pages,
                    unsigned int gup_flags, struct page **pages)
{
    int locked = 1;

    if (!is_valid_gup_args(pages, NULL, &gup_flags, FOLL_PIN))
        return 0;
    return __gup_longterm_locked(current->mm, start, nr_pages,
                                 pages, &locked, gup_flags);
}
The function initializes a local variable locked to 1 to indicate that the mmap lock is acquired. As previously mentioned, the caller is required to hold the mmap lock when calling pin_user_pages and get_user_pages.
The implementation of get_user_pages is as follows:
long get_user_pages(unsigned long start, unsigned long nr_pages,
                    unsigned int gup_flags, struct page **pages)
{
    int locked = 1;

    if (!is_valid_gup_args(pages, NULL, &gup_flags, FOLL_TOUCH))
        return -EINVAL;
    return __get_user_pages_locked(current->mm, start, nr_pages, pages,
                                   &locked, gup_flags);
}
This implementation is very similar to pin_user_pages, except that for get_user_pages the FOLL_GET flag is implicitly set later, whereas pin_user_pages initializes gup_flags with FOLL_PIN. It also calls __get_user_pages_locked directly instead of __gup_longterm_locked, which handles the long-term pinning functionality.
is_valid_gup_args
Next, I will explain the is_valid_gup_args function, which is invoked from both pin_user_pages and get_user_pages. This function validates the various FOLL_* flags passed by the caller and sets any other flags deemed necessary. During validation, it checks whether both the FOLL_GET and FOLL_PIN flags are set; this is an invalid combination, so it returns false if so. Also, the FOLL_LONGTERM flag can only be set together with FOLL_PIN, since it is more restrictive than FOLL_PIN alone. FOLL_LONGTERM indicates that the pages will be held for an extended period (e.g., an RDMA driver uses it to pin pages for transferring bytes).
Excerpts from the is_valid_gup_args function:
    /* FOLL_GET and FOLL_PIN are mutually exclusive. */
    if (WARN_ON_ONCE((gup_flags & (FOLL_PIN | FOLL_GET)) ==
                     (FOLL_PIN | FOLL_GET)))
        return false;

    /* LONGTERM can only be specified when pinning */
    if (WARN_ON_ONCE(!(gup_flags & FOLL_PIN) && (gup_flags & FOLL_LONGTERM)))
        return false;

    /* Pages input must be given if using GET/PIN */
    if (WARN_ON_ONCE((gup_flags & (FOLL_GET | FOLL_PIN)) && !pages))
        return false;

    /* We want to allow the pgmap to be hot-unplugged at all times */
    if (WARN_ON_ONCE((gup_flags & FOLL_LONGTERM) &&
                     (gup_flags & FOLL_PCI_P2PDMA)))
        return false;
If all these checks pass, the flags requested by the caller (combined with the flags the wrapper forces, such as FOLL_PIN or FOLL_TOUCH) are written back and the function returns true.
__gup_longterm_locked
The following is the implementation of the __gup_longterm_locked function, which is invoked by the pin_user_pages family of functions:
static long __gup_longterm_locked(struct mm_struct *mm,
                                  unsigned long start,
                                  unsigned long nr_pages,
                                  struct page **pages,
                                  int *locked,
                                  unsigned int gup_flags)
{
    unsigned int flags;
    long rc, nr_pinned_pages;

    if (!(gup_flags & FOLL_LONGTERM))
        return __get_user_pages_locked(mm, start, nr_pages, pages,
                                       locked, gup_flags);

    flags = memalloc_pin_save();
    do {
        nr_pinned_pages = __get_user_pages_locked(mm, start, nr_pages,
                                                  pages, locked,
                                                  gup_flags);
        if (nr_pinned_pages <= 0) {
            rc = nr_pinned_pages;
            break;
        }

        /* FOLL_LONGTERM implies FOLL_PIN */
        rc = check_and_migrate_movable_pages(nr_pinned_pages, pages);
    } while (rc == -EAGAIN);
    memalloc_pin_restore(flags);
    return rc ? rc : nr_pinned_pages;
}
The __gup_longterm_locked function handles the FOLL_LONGTERM case differently from the non-FOLL_LONGTERM case. The FOLL_LONGTERM flag is used when the caller intends to use the pages for a prolonged period. If this flag is not set, the function simply calls __get_user_pages_locked directly.

When FOLL_LONGTERM is set, the function first sets PF_MEMALLOC_PIN in the current process's flags using memalloc_pin_save (which also saves the original flag state). This flag ensures that the page allocation routines do not allocate memory from the movable zone, to prevent a situation where a page is returned to the gup caller but later moved by the memory manager while the caller is still using it. Once __get_user_pages_locked returns, check_and_migrate_movable_pages is invoked to verify whether all the pinned pages in the pages array are suitable for long-term pinning. A page is considered long-term pinnable if it is neither device coherent memory nor allocated from a movable zone. If any long-term unpinnable pages are present in the pages array, they are migrated (or moved from device coherent memory to regular memory) before the operation is retried. At the end of the function, the original process flag is restored using memalloc_pin_restore.
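From a caller's perspective, the long-term case is just one extra flag. A sketch of the typical pattern, here using the pin_user_pages_fast variant (which takes the mmap lock internally when it needs it), might look like this; longterm_pin is a made-up helper name:

#include <linux/mm.h>

/*
 * Sketch: pin a user buffer for long-term use (e.g. an RDMA-style
 * registration).  FOLL_LONGTERM tells the gup core to migrate any pages
 * that currently live in movable or device coherent memory before the
 * pin succeeds.
 */
static int longterm_pin(unsigned long uaddr, int nr_pages,
                        struct page **pages)
{
    int pinned;

    pinned = pin_user_pages_fast(uaddr, nr_pages,
                                 FOLL_WRITE | FOLL_LONGTERM, pages);
    if (pinned != nr_pages) {
        if (pinned > 0)
            unpin_user_pages(pages, pinned);
        return pinned < 0 ? pinned : -EFAULT;
    }

    /*
     * The caller now stores 'pages' for the lifetime of the registration
     * and releases them later with unpin_user_pages().
     */
    return 0;
}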
__get_user_pages_locked
Both pin_user_pages and get_user_pages eventually call __get_user_pages_locked after validating the input flags; this function handles re-taking the mmap lock in case it has been released by internal functions. It repeatedly calls __get_user_pages in a loop to populate the pages array with struct page pointers until all the pages are loaded or an error occurs. Initially, it attempts to retrieve all the struct pages at once by calling __get_user_pages with the total number of pages requested. __get_user_pages returns the number of pages it successfully retrieved by walking the page tables (possibly taking minor page faults along the way), and the pages array is filled with struct page pointers, except when a device driver returns the VM_FAULT_NOPAGE code.
If the fault handler is called from __get_user_pages and determines that a retry is necessary, it releases the mmap lock. Upon this retry request, __get_user_pages returns zero to its caller, __get_user_pages_locked. The __get_user_pages_locked function then checks whether the mmap lock is no longer held, which signals that a retry is needed. It re-acquires the mmap lock and calls __get_user_pages to retrieve a single page, assuming this was the only page that caused the fault. Once the fault is resolved, it proceeds to load the remaining pages.
Here’s the implementation of __get_user_pages_locked, where __get_user_pages is called in a loop. It decrements nr_pages to track the remaining number of pages to process and increments the pages_done variable by the number of pages processed so far. The pages array holds pointers to the retrieved struct page structures. If the mmap lock is still held after the call to __get_user_pages returns, as indicated by the locked variable, the function breaks out of the loop.
    pages_done = 0;
    for (;;) {
        ret = __get_user_pages(mm, start, nr_pages, flags, pages,
                               locked);
        if (!(flags & FOLL_UNLOCKABLE)) {
            /* VM_FAULT_RETRY couldn't trigger, bypass */
            pages_done = ret;
            break;
        }

        /* VM_FAULT_RETRY or VM_FAULT_COMPLETED cannot return errors */
        if (!*locked) {
            BUG_ON(ret < 0);
            BUG_ON(ret >= nr_pages);
        }

        if (ret > 0) {
            nr_pages -= ret;
            pages_done += ret;
            if (!nr_pages)
                break;
        }
        if (*locked) {
            /*
             * VM_FAULT_RETRY didn't trigger or it was a
             * FOLL_NOWAIT.
             */
            if (!pages_done)
                pages_done = ret;
            break;
        }
        /*
         * VM_FAULT_RETRY triggered, so seek to the faulting offset.
         * For the prefault case (!pages) we only update counts.
         */
        if (likely(pages))
            pages += ret;
        start += ret << PAGE_SHIFT;

        /* The lock was temporarily dropped, so we must unlock later */
        must_unlock = true;
This is the retry portion of the loop in __get_user_pages_locked, reached when the previous call to __get_user_pages could not get all the requested pages because a retry operation is required. Since the mmap lock was released by the fault handler, it needs to be re-acquired before calling __get_user_pages again for a single page. The loop then continues to process the remaining pages.
retry:
        /*
         * Repeat on the address that fired VM_FAULT_RETRY
         * with both FAULT_FLAG_ALLOW_RETRY and
         * FAULT_FLAG_TRIED. Note that GUP can be interrupted
         * by fatal signals or even common signals, depending on
         * the caller's request. So we need to check it before we
         * start trying again otherwise it can loop forever.
         */
        if (gup_signal_pending(flags)) {
            if (!pages_done)
                pages_done = -EINTR;
            break;
        }

        ret = mmap_read_lock_killable(mm);
        if (ret) {
            BUG_ON(ret > 0);
            if (!pages_done)
                pages_done = ret;
            break;
        }

        *locked = 1;
        ret = __get_user_pages(mm, start, 1, flags | FOLL_TRIED,
                               pages, locked);
        if (!*locked) {
            /* Continue to retry until we succeeded */
            BUG_ON(ret != 0);
            goto retry;
        }
        if (ret != 1) {
            BUG_ON(ret > 1);
            if (!pages_done)
                pages_done = ret;
            break;
        }
        nr_pages--;
        pages_done++;
        if (!nr_pages)
            break;
        if (likely(pages))
            pages++;
        start += PAGE_SIZE;
    }
__get_user_pages
Now I delve into the __get_user_pages function, which is responsible for finding the struct page structures and filling the pages array by walking the page tables. If a PTE is missing, or if the PTE permissions do not match the user's request, the function calls the memory manager's page fault handler to populate the PTE with an appropriate page.
The function begins by obtaining the virtual memory area (VMA) for the given address. It then calls follow_page_mask to start walking the page tables, find the page frame number (PFN), and retrieve the struct page. If it cannot obtain the required page, this indicates that either the PTE is missing, or the PTE is read-only when write was requested, or the page needs to be unshared, among other possibilities. This scenario triggers a page fault, which is handled by handle_mm_fault. handle_mm_fault resolves the fault by populating the correct PTE entry, either by allocating a page or by adjusting the existing PTE flags. Control is then returned to the caller to retry the operation.
The __get_user_pages function first retrieves the VMA for the virtual address currently being processed (held in the start variable) in a loop. It then verifies that the VMA is suitable for the current operation using check_vma_flags. One important check performed by check_vma_flags is that FOLL_LONGTERM cannot be requested for an address range that belongs to a filesystem whose backing pages require dirty tracking.
    do {
        /* first iteration or cross vma bound */
        if (!vma || start >= vma->vm_end) {
            vma = gup_vma_lookup(mm, start);
            [...]
            if (!vma) {
                ret = -EFAULT;
                goto out;
            }
            ret = check_vma_flags(vma, gup_flags);
            if (ret)
                goto out;
        }
The function then retrieves the struct page for the start address by calling follow_page_mask, which walks through the page tables. The retry label facilitates re-walking the page table entries after resolving a page fault.
retry:
        [...]
        page = follow_page_mask(vma, start, foll_flags, &ctx);
If there is a missing entry in the page table for the requested address, or if the caller has requested a write operation where the page table entry is read-only, the function calls faultin_page to handle the page fault scenario. Typically, the operation is retried after invoking the fault-in code, which corrects the page table entry if possible. The page table walking code is handled in follow_page_mask, which is why the retry label in the previous code block is useful.
The function returns zero to the caller when it receives the -EBUSY or -EAGAIN error codes from the fault handler. The caller should then check whether the mmap lock has been released before retrying the operation.

The condition that checks PTR_ERR(page) == -EEXIST signifies a valid PTE without a corresponding struct page for the virtual address. This situation occurs when the VMA corresponding to the virtual address has the VM_PFNMAP flag set.
        if (!page || PTR_ERR(page) == -EMLINK) {
            ret = faultin_page(vma, start, &foll_flags,
                               PTR_ERR(page) == -EMLINK, locked);
            switch (ret) {
            case 0:
                goto retry;
            case -EBUSY:
            case -EAGAIN:
                ret = 0;
                fallthrough;
            case -EFAULT:
            case -ENOMEM:
            case -EHWPOISON:
                goto out;
            }
            BUG();
        } else if (PTR_ERR(page) == -EEXIST) {
            /*
             * Proper page table entry exists, but no corresponding
             * struct page. If the caller expects **pages to be
             * filled in, bail out now, because that can't be done
             * for this page.
             */
            if (pages) {
                ret = PTR_ERR(page);
                goto out;
            }
        } else if (IS_ERR(page)) {
            ret = PTR_ERR(page);
            goto out;
        }
Page Table Walking
Page table walking is performed by calling follow_page_mask, starting at the first level of paging. The struct page is retrieved from the underlying PTE and returned to the caller if it meets certain criteria, which will be explained in the following sections.
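Before going level by level, here is a condensed sketch of the same walk that the following sections describe in detail. It is illustrative only: it takes no locks and handles neither huge pages nor swap entries, so it is not a substitute for follow_page_mask:

#include <linux/mm.h>
#include <linux/pgtable.h>

/*
 * Illustrative only: resolve a user virtual address to a struct page by
 * walking PGD -> P4D -> PUD -> PMD -> PTE.  No locking, no huge-page or
 * swap-entry handling, so this must not be used as-is.
 */
static struct page *toy_walk(struct mm_struct *mm, unsigned long addr)
{
    pgd_t *pgd = pgd_offset(mm, addr);      /* PML4 entry, bits 47:39  */
    p4d_t *p4d;
    pud_t *pud;
    pmd_t *pmd;
    pte_t *ptep, pte;
    struct page *page = NULL;

    if (pgd_none(*pgd) || pgd_bad(*pgd))
        return NULL;
    p4d = p4d_offset(pgd, addr);            /* no-op with 4-level paging */
    if (p4d_none(*p4d) || p4d_bad(*p4d))
        return NULL;
    pud = pud_offset(p4d, addr);            /* PDPT entry, bits 38:30 */
    if (pud_none(*pud) || pud_bad(*pud))
        return NULL;
    pmd = pmd_offset(pud, addr);            /* PD entry, bits 29:21 */
    if (pmd_none(*pmd) || pmd_trans_huge(*pmd))
        return NULL;                        /* huge pages not handled here */
    ptep = pte_offset_map(pmd, addr);       /* PT entry, bits 20:12 */
    if (!ptep)
        return NULL;
    pte = ptep_get(ptep);
    if (pte_present(pte))
        page = pte_page(pte);               /* pfn_to_page(pte_pfn(pte)) */
    pte_unmap(ptep);
    return page;
}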
Processing First Level of Paging – follow_page_mask
follow_page_mask handles the PML4 (Page Map Level 4) table, which is the topmost level of the page table hierarchy. The mm->pgd pointer contains the virtual address of the PML4 table (the physical address of the PML4 table corresponds to bits 51:12 of the CR3 register). The function uses the pgd_offset macro to calculate the virtual address of the PML4 entry (PML4E) and stores it as a pointer in a local variable named pgd. The pgd_offset macro uses bits 47:39 of the virtual address as an index to look up entries in mm->pgd (the PML4 table), where each entry is 64 bits in size. Since pgd holds the address of a PML4E, dereferencing it yields the physical address of the PDPT (Page Directory Pointer Table), which is the next level in the hierarchy.
After checking the presence of the pgd entry, it calls follow_p4d_mask.
    pgd = pgd_offset(mm, address);

    if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
        return no_page_table(vma, flags);

    return follow_p4d_mask(vma, address, pgd, flags, ctx);
Processing Second Level of Paging – follow_p4d_mask
follow_p4d_mask nominally covers the P4D level of the Linux page table hierarchy, which only exists as a separate hardware level with 5-level paging. The pgdp pointer contains the virtual address of the PML4E. With 4-level paging, the P4D is folded into the PGD, so the p4d_offset macro simply returns the pointer it was given: the local variable p4d refers to the same PML4E as pgdp, and no virtual address bits are consumed at this step. The second hardware level, the PDPT (Page Directory Pointer Table), is handled by follow_pud_mask below.
After checking the presence of the p4d entry, it calls follow_pud_mask.
static struct page *follow_p4d_mask(struct vm_area_struct *vma,
                                    unsigned long address, pgd_t *pgdp,
                                    unsigned int flags,
                                    struct follow_page_context *ctx)
{
    p4d_t *p4d;

    p4d = p4d_offset(pgdp, address);
    if (p4d_none(*p4d))
        return no_page_table(vma, flags);
    BUILD_BUG_ON(p4d_huge(*p4d));
    if (unlikely(p4d_bad(*p4d)))
        return no_page_table(vma, flags);

    return follow_pud_mask(vma, address, p4d, flags, ctx);
}
With 4-level paging, the PUD (Page Upper Directory) corresponds to the PDPT. The pud_offset macro uses bits 38:30 of the virtual address as an index into the PDPT, whose physical address is obtained by dereferencing p4dp (the PML4E), and stores the virtual address of the PDPT entry (PDPTE) in the local variable pud. Since pud holds the address of a PDPTE, dereferencing it yields the physical address of the PD (Page Directory), the next level in the hierarchy. follow_pud_mask then calls follow_pmd_mask to process the third level of the paging structure.
static struct page *follow_pud_mask(struct vm_area_struct *vma,
                                    unsigned long address, p4d_t *p4dp,
                                    unsigned int flags,
                                    struct follow_page_context *ctx)
{
    pud_t *pud;

    pud = pud_offset(p4dp, address);
    if (pud_none(*pud))
        return no_page_table(vma, flags);
    [...]
    if (unlikely(pud_bad(*pud)))
        return no_page_table(vma, flags);

    return follow_pmd_mask(vma, address, pud, flags, ctx);
}
Processing Third Level of Paging – follow_pmd_mask
follow_pmd_mask handles the PD (Page Directory), which is the third level of the page table hierarchy. As described above, the pudp pointer contains the virtual address of the PDPTE. The function uses the pmd_offset macro to calculate the virtual address of the PD entry (PDE) and stores it as a pointer in a local variable named pmd. In Linux terminology, the PD is referred to as the Page Middle Directory (PMD). The pmd_offset macro uses bits 29:21 of the virtual address as an index to look up entries in the PD (obtained by dereferencing pudp), where each entry is 64 bits in size. Since pmd holds the address of a PDE, dereferencing it yields the physical address of the PT (Page Table), which is the last level in the hierarchy.
In addition to calculating the PDE held in the pmd variable, the function checks whether the entry maps a 2MB page instead of pointing to a PT (Page Table), by examining bit 7 of the pmd entry (the PS bit on x86) using the pmd_trans_huge function. If it is a huge page, it calls follow_trans_huge_pmd, which derives the page from bits 51:21 of the pmd entry. The Linux kernel code always locks the final level page table entry to ensure it cannot be modified while it's being accessed.
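In effect, resolving an address inside a 2MB mapping amounts to taking the head page of the huge page and adding the sub-page index encoded in bits 20:12 of the virtual address, along the lines of this simplified sketch (the real logic lives in follow_trans_huge_pmd, which also grabs a reference on the page):

#include <linux/mm.h>
#include <linux/huge_mm.h>

/* Simplified: pmd maps a 2MB page, so pick the right 4KB subpage.
 * Requires CONFIG_TRANSPARENT_HUGEPAGE for HPAGE_PMD_MASK. */
static struct page *huge_pmd_subpage(pmd_t *pmd, unsigned long addr)
{
    struct page *page = pmd_page(*pmd);     /* head page of the 2MB frame */

    /* sub-page index comes from bits 20:12 of the virtual address */
    page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
    return page;
}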
If the PS bit is not set, then we can follow the last level of the paging structure by calling follow_page_pte, which dereferences bits 51:12 of the pmd entry to obtain the physical address of the PT.
The check for pmd_trans_huge is performed twice: the first time without holding the pmd lock, and the second time after acquiring it. The entry referenced by pmd might change between the first check and acquiring the lock, for example if a huge page is split or collapsed concurrently; the second check after acquiring the lock ensures the correct entry is used.
static struct page *follow_pmd_mask(struct vm_area_struct *vma,
                                    unsigned long address, pud_t *pudp,
                                    unsigned int flags,
                                    struct follow_page_context *ctx)
{
    pmd_t *pmd, pmdval;
    spinlock_t *ptl;
    struct page *page;
    struct mm_struct *mm = vma->vm_mm;

    pmd = pmd_offset(pudp, address);
    pmdval = pmdp_get_lockless(pmd);
    if (pmd_none(pmdval))
        return no_page_table(vma, flags);
    if (!pmd_present(pmdval))
        return no_page_table(vma, flags);
    [...]
    if (likely(!pmd_trans_huge(pmdval)))
        return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);

    if (pmd_protnone(pmdval) && !gup_can_follow_protnone(vma, flags))
        return no_page_table(vma, flags);

    ptl = pmd_lock(mm, pmd);
    if (unlikely(!pmd_present(*pmd))) {
        spin_unlock(ptl);
        return no_page_table(vma, flags);
    }
    if (unlikely(!pmd_trans_huge(*pmd))) {
        spin_unlock(ptl);
        return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
    }
    [...]
    page = follow_trans_huge_pmd(vma, address, pmd, flags);
    spin_unlock(ptl);
    ctx->page_mask = HPAGE_PMD_NR - 1;
    return page;
}
Processing Fourth Level of Paging – follow_page_pte
follow_page_pte handles the PT (Page Table), which is the last level of the page table hierarchy. The pmd pointer contains the virtual address of the PDE. The function uses the pte_offset_map_lock macro to calculate the virtual address of the PT entry (PTE) and stores it as a pointer in a local variable named ptep. The pte_offset_map_lock macro uses bits 20:12 of the virtual address as an index to look up entries in the PT (obtained by dereferencing pmd), where each entry is 64 bits in size. Since ptep holds the address of a PTE, dereferencing it yields the physical address of the page. The macro also takes the page table lock protecting this PTE.
The ptep_get macro returns the dereferenced PTE in the pte variable.
static struct page *follow_page_pte(struct vm_area_struct *vma,
                                    unsigned long address, pmd_t *pmd,
                                    unsigned int flags,
                                    struct dev_pagemap **pgmap)
{
    struct mm_struct *mm = vma->vm_mm;
    struct page *page;
    spinlock_t *ptl;
    pte_t *ptep, pte;
    int ret;

    /* FOLL_GET and FOLL_PIN are mutually exclusive. */
    if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
                     (FOLL_PIN | FOLL_GET)))
        return ERR_PTR(-EINVAL);

    ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
    if (!ptep)
        return no_page_table(vma, flags);
    pte = ptep_get(ptep);
    if (!pte_present(pte))
        goto no_page;
    if (pte_protnone(pte) && !gup_can_follow_protnone(vma, flags))
        goto no_page;
The function then calls vm_normal_page to retrieve the struct page corresponding to the PTE for a “normal” page. A “normal” page refers to a page with a valid struct page backing it, as opposed to special mappings like direct device memory mappings or pages without a backing struct page.
First, vm_normal_page checks whether the VMA has either the VM_PFNMAP or VM_MIXEDMAP flag set, indicating that the VMA contains pages that map physical addresses directly, for example for device or DMA usage.
The VM_PFNMAP flag is set for mappings whose PFNs do not have corresponding struct pages; such mappings are created by the remap_pfn_range function to map kernel memory into userspace. If the VM_PFNMAP flag is set, the function checks whether the translation between the virtual address and the PFN is linear, by comparing the PFN with the sum of the offset from vma->vm_start and vma->vm_pgoff. If this condition holds, no struct page is returned because the mapping is treated as special: no backing struct page exists, so no reference counting on a page is needed. If the translation is non-linear, a further check determines whether it is a COW (copy-on-write) mapping. Non-COW pages whose VMA has the VM_PFNMAP flag set are also treated as special pages (i.e., no corresponding struct page is returned).
If the VM_MIXEDMAP flag is set on a VMA, the function checks whether the PFN is valid by calling pfn_valid. The validity of PFNs depends on whether the system uses FLATMEM or SPARSEMEM. If the PFN is valid, it is backed by a struct page and this page is returned to the caller. The VM_MIXEDMAP flag thus allows a range of addresses to support COW mappings for which a struct page can be returned.
Thus, the main difference is that VM_PFNMAP treats a page as “special” (i.e., with no backing struct page) for linear mappings or non-COW pages, whereas VM_MIXEDMAP can return a struct page whenever the PFN falls into a valid physical range, without concern for COW status.
For pages backed by a file, a COW (copy-on-write) mapping is identified by examining vma->vm_flags for the presence of the VM_MAYWRITE flag and the absence of the VM_SHARED flag. VMAs with the VM_SHARED flag set mean that the pages are shared and cannot be COWed. For anonymous pages, COW is checked by verifying whether the anonymous page is exclusive, meaning the page is not shared with another process.
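The file-backed COW test is compact enough to quote in simplified form; it is essentially the kernel's is_cow_mapping helper:

/* Essentially the kernel's is_cow_mapping(): a private mapping that could
 * become writable (VM_MAYWRITE set, VM_SHARED clear) is a COW mapping. */
static inline bool is_cow_mapping_example(vm_flags_t flags)
{
    return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}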
Next, the function checks whether the PTE maps the zero PFN. The struct page for the zero PFN always exists and never goes away; therefore there is no need for reference counting on that page, and no struct page is returned.
After performing all these checks, the function returns the corresponding struct page for the PFN by calling pfn_to_page.
    if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
        if (vma->vm_flags & VM_MIXEDMAP) {
            if (!pfn_valid(pfn))
                return NULL;
            goto out;
        } else {
            unsigned long off;

            off = (addr - vma->vm_start) >> PAGE_SHIFT;
            if (pfn == vma->vm_pgoff + off)
                return NULL;
            if (!is_cow_mapping(vma->vm_flags))
                return NULL;
        }
    }

    if (is_zero_pfn(pfn))
        return NULL;

    [...]

out:
    return pfn_to_page(pfn);
Once vm_normal_page returns, the execution of follow_page_pte resumes. It checks whether the user requested writing to the page using the FOLL_WRITE flag and whether writing is allowed:
    if ((flags & FOLL_WRITE) &&
        !can_follow_write_pte(pte, page, vma, flags)) {
        page = NULL;
        goto out;
    }
The can_follow_write_pte function checks whether the PTE is writable. If it is, it returns true. Otherwise, it checks for the FOLL_FORCE flag, which is relevant for the ptrace path and is out of the scope of this blog post. If these checks return false, follow_page_pte returns NULL. Upon receiving NULL, the top-level __get_user_pages function treats this like a page fault and calls faultin_page to either populate the PTE, make the PTE writable, or handle the COW scenario.
Next, it checks whether the page needs to be “unshared”. This occurs for COW pages where the PTE is not writable and the caller is pinning the page without requesting write access. When a write later occurs on a COW page, a copy of the page is made and modifications are applied to the copy. With gup functions, however, we aim to access the original data rather than a later copy; therefore, we attempt to break the COW early, and gup_must_unshare returning true is what triggers unsharing of the page. Long-term pinning of COW pages backed by a file likewise requires the pages to be unshared. To request unsharing of a page, follow_page_pte returns -EMLINK to the caller.
    if (!pte_write(pte) && gup_must_unshare(vma, flags, page)) {
        page = ERR_PTR(-EMLINK);
        goto out;
    }
Next, it obtains a reference to the page by calling try_grab_page. This function retrieves the folio of the page (a folio represents a collection of pages, and the number of pages in a folio is always a power of 2). A folio is represented by struct folio, which contains a struct page within a union. Many fields of struct page must match corresponding fields in struct folio, and these are verified using the FOLIO_MATCH macro. When a struct page is part of a larger folio, the folio is found via the page->compound_head field, which points to the head page of the folio. Note that struct folio is larger than struct page when it represents multiple consecutive pages; otherwise, they are the same size.
The try_grab_folio function increases the reference count of the folio. For the FOLL_GET flag, it increments the page->refcount of the head page. For the FOLL_PIN flag, the reference counting mechanism differs based on the folio size. If it is a large folio (i.e., it contains more than one page), it increments both the head page's page->refcount and the folio's folio->_pincount. If the folio represents a single page, page->refcount is instead incremented by GUP_PIN_COUNTING_BIAS, effectively using the higher bits of the refcount as a pin counter. This trick saves space in page->refcount while still making it possible to tell whether a page is pinned.
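Putting that together, the pin-side accounting behaves roughly like the following paraphrase of the FOLL_PIN branch of try_grab_folio (simplified; the real function also special-cases the zero folio and handles FOLL_GET):

/* Paraphrase of the FOLL_PIN accounting in try_grab_folio(): large folios
 * track pins in a dedicated counter, while single-page folios fold the pin
 * count into the upper bits of the refcount via GUP_PIN_COUNTING_BIAS. */
static void record_pins(struct folio *folio, int refs)
{
    if (folio_test_large(folio)) {
        folio_ref_add(folio, refs);
        atomic_add(refs, &folio->_pincount);
    } else {
        folio_ref_add(folio, refs * GUP_PIN_COUNTING_BIAS);
    }
}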
When control returns to the __get_user_pages function, it checks whether the returned page is NULL or carries the -EMLINK error code, both of which indicate that faultin_page should be called to populate the PTE, possibly with a new page, as discussed in the next section.
When a valid page is returned, __get_user_pages fills the pages array and continues to populate it in a loop.
Handling Memory Fault
At this stage, no suitable page has been found, so the page fault handler needs to be invoked to either allocate a new page or modify the permission bits in the PTE. This section provides an overview of how the memory fault is handled.
Faulting in a Page
The internal memory management (mm) page fault routine uses FAULT_FLAG_* flags instead of the FOLL_* flags used in the gup routines. The first step to call the mm internal page fault handler is to convert the FOLL_* flags into their corresponding FAULT_FLAG_* equivalents.
The faultin_page function receives the unshare parameter as true from __get_user_pages if the -EMLINK error code was returned from follow_page_mask.
static int faultin_page(struct vm_area_struct *vma,
                        unsigned long address, unsigned int *flags,
                        bool unshare, int *locked)
{
    unsigned int fault_flags = 0;
    vm_fault_t ret;

    if (*flags & FOLL_NOFAULT)
        return -EFAULT;
    if (*flags & FOLL_WRITE)
        fault_flags |= FAULT_FLAG_WRITE;
    if (*flags & FOLL_REMOTE)
        fault_flags |= FAULT_FLAG_REMOTE;
    if (*flags & FOLL_UNLOCKABLE) {
        fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
        /*
         * FAULT_FLAG_INTERRUPTIBLE is opt-in. GUP callers must set
         * FOLL_INTERRUPTIBLE to enable FAULT_FLAG_INTERRUPTIBLE.
         * That's because some callers may not be prepared to
         * handle early exits caused by non-fatal signals.
         */
        if (*flags & FOLL_INTERRUPTIBLE)
            fault_flags |= FAULT_FLAG_INTERRUPTIBLE;
    }
    if (*flags & FOLL_NOWAIT)
        fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
    if (*flags & FOLL_TRIED) {
        /*
         * Note: FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_TRIED
         * can co-exist
         */
        fault_flags |= FAULT_FLAG_TRIED;
    }
    if (unshare) {
        fault_flags |= FAULT_FLAG_UNSHARE;
        /* FAULT_FLAG_WRITE and FAULT_FLAG_UNSHARE are incompatible */
        VM_BUG_ON(fault_flags & FAULT_FLAG_WRITE);
    }
Next, it calls handle_mm_fault, which is the actual mm fault handler routine.
    ret = handle_mm_fault(vma, address, fault_flags, NULL);
Explanation of the page fault handler is out of the scope of this blog and will be covered in the next blog post.
After handle_mm_fault completes, if a page was successfully populated for the address, the operation is retried from the __get_user_pages routine or from its caller. The retry walks the page tables again, obtains the struct page pointer, and populates the pages array.
Unpinning Pages
When the user is done with pages that were previously pinned, the unpin_user_pages function must be called to release the references to those pages.
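A short sketch of the release side: if the device or the kernel wrote into the pages, the unpin_user_pages_dirty_lock variant can mark them dirty while unpinning so the data is not lost for file-backed mappings (the helper name below is made up):

#include <linux/mm.h>

/* Sketch: release a previously pinned buffer.  'wrote' says whether the
 * pages were written (by DMA or the kernel) and must be marked dirty. */
static void release_pinned_buf(struct page **pages, unsigned long nr_pages,
                               bool wrote)
{
    if (wrote)
        unpin_user_pages_dirty_lock(pages, nr_pages, true);
    else
        unpin_user_pages(pages, nr_pages);
}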
unpin_user_pages
Let’s take a look at the implementation of unpin_user_pages:
void unpin_user_pages(struct page **pages, unsigned long npages)
{
    unsigned long i;
    struct folio *folio;
    unsigned int nr;

    [...]
    for (i = 0; i < npages; i += nr) {
        folio = gup_folio_next(pages, npages, i, &nr);
        gup_put_folio(folio, nr, FOLL_PIN);
    }
}
In this implementation of unpin_user_pages, the folio is retrieved by calling gup_folio_next on the pages array. If consecutive entries of the pages array belong to the same folio, the function skips ahead to the entry that starts the next folio. The variable nr holds the number of pages belonging to the current folio.
gup_folio_next
The implementation of gup_folio_next is as follows:
static inline struct folio *gup_folio_next(struct page **list,
                                           unsigned long npages,
                                           unsigned long i,
                                           unsigned int *ntails)
{
    struct folio *folio = page_folio(list[i]);
    unsigned int nr;

    for (nr = i + 1; nr < npages; nr++) {
        if (page_folio(list[nr]) != folio)
            break;
    }

    *ntails = nr - i;
    return folio;
}
In the above implementation of gup_folio_next, the folio is first retrieved using the page_folio function, which accepts a page as input; the folio is located through the page->compound_head member. Next, the function calculates how many consecutive pages in the list belong to the current folio by finding where the next folio starts. The ntails pointer receives the number of pages in the current folio.
gup_put_folio
Once the folio is retrieved, its reference count is decremented by calling gup_put_folio.
static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
{
    if (flags & FOLL_PIN) {
        if (is_zero_folio(folio))
            return;
        if (folio_test_large(folio))
            atomic_sub(refs, &folio->_pincount);
        else
            refs *= GUP_PIN_COUNTING_BIAS;
    }

    if (!put_devmap_managed_page_refs(&folio->page, refs))
        folio_put_refs(folio, refs);
}
When the FOLL_PIN flag is passed to gup_put_folio (i.e., when it is called from unpin_user_pages), it handles the specific case of dropping the references taken by pin_user_pages. For large folios, it decrements the folio's _pincount member. For a single-page folio, the reference count is instead decreased by GUP_PIN_COUNTING_BIAS per page, undoing the bias added at pin time. Finally, the page->refcount of the head page of the folio is decremented via folio_put_refs.
Conclusion
As Ksplice engineers, we often have to learn new areas of the kernel very rapidly to understand how a patch can be applied without rebooting the system. This work can be challenging, but it’s also rewarding to explore new areas of the kernel and understand how they interact with each other.
If you find this type of work interesting, consider applying for a job with the Ksplice team! Feel free to drop us a line at ksplice-support_ww@oracle.com.
Further Reading
The incremental development of pinning user pages, as presented at the annual Linux Storage, Filesystem, and Memory-Management Summits, is documented in several LWN articles. Interested readers can find detailed documentation at this link.