Linux shares data pages efficiently, but doesn’t yet share the mappings for those shared pages. This can result in significant overhead: in the worst case, mapping a 512 GB shared region with 4K pages (at 8 bytes per page) requires 1 GB per process just for page tables; with a thousand processes, that’s 1 TB of RAM! We usually mitigate this with huge pages (THP/hugetlbfs), reducing page-table footprint dramatically, but the underlying issue still scales with process count and shows up as duplicated page-table pages, per-process first-touch minor faults, and repeated work for operations like permission changes.

To remedy this we’re proposing ‘mshare’: a Linux kernel mechanism to make shared memory also share its mappings. mshare allows processes to share page tables for designated regions, effectively removing the per-process page-table cost of each new attachment and allowing cross-process actions like mprotect to be applied once to a shared page table rather than coordinated through inter-process communication. Bringing this capability to Linux promises more predictable memory overhead and lower kernel MM churn for workloads that share hundreds of gigabytes across hundreds or thousands of processes. The latest progress can be found in our upstream post here.

What is mshare?

The memory management needed to support shared memory in large applications can mean a significant increase in the resources required to run the application when memory is shared with a large number of processes. mshare aims to address this, and to open up new avenues of functionality in future Linux kernels, through ongoing development and by working with the upstream community.

There are many ways to share memory between processes. Some examples are shared anonymous memory, memory mapped files, tmpfs files, and hugetlbfs. With the exception of hugetlbfs, one thing each of these has in common is that every process accessing shared memory maps it with its own set of page table entries (PTEs). Each process generates a minor page fault the first time it accesses a shared page; the fault allocates page table pages as needed and populates them with entries. With small amounts of memory shared by a small number of processes this overhead is minimal, but as the number of shared pages and the number of processes sharing them grows, the amount of memory allocated for page tables becomes significant.

mshare takes the concept of shared memory a step further to include the sharing of kernel internal memory mapping resources, namely page tables. For large applications where hundreds of gigabytes may be shared between hundreds or even thousands of processes, sharing these resources can mean a substantial savings in memory and reduction in memory management overhead that contributes to higher application performance.

Doesn’t Linux already have page table sharing?

Yes, but it is limited to hugetlb pages. See here for details.

Mshare expands on this mechanism and brings the benefits of page table sharing to the other forms of shared memory.

Sharing page tables saves memory

As an example of how memory may be consumed by page tables, mapping 512GB of memory with 4k pages in a single process requires an additional 1GB of page table pages just to store all of the PTEs. Sharing the memory with hundreds of processes, each requiring more than 1GB for its page table pages, will quickly exhaust memory and is untenable.

Using transparent hugepages or hugetlb pages significantly reduces the number of page table pages needed: mapping 512GB of memory with 2MB pages requires only an additional 2MB of page table pages. Even so, a thousand processes mapping the shared memory can mean nearly 2GB of memory consumed just for page tables.

With mshare, the page table allocations needed to map 512GB of shared memory are capped at 1GB for the worst case of mapping the entire range with 4k pages, regardless of how many processes share the memory. Mapping the range with 2MB hugepages reduces the page table pages needed to 2MB.

Other Benefits

Updating access protections across processes is easier

What if hundreds of processes have mapped a region of shared memory with read/write access and the application wants to make a portion of that memory read-only? If page tables are not shared, communication must be coordinated with the processes so that each one can update its copy of the PTEs with the new permissions (e.g. call mprotect()). That’s one system call executed hundreds of times plus the overhead of IPC.

With mshare the shared page tables are updated once with the change immediately visible to all sharing processes. That’s one system call with no IPC needed. This is something that cannot be done with the existing hugetlb page table sharing functionality where a change of protections causes the shared PMD page table to be unshared and results in the allocation of additional page table pages.

Fewer minor page faults

When multiple processes share memory but not page tables, each process generates a page fault on the first access of any page in the shared region by that process. The number of minor faults needed for P processes to access N shared pages is calculated as P x N.

When page tables are shared, only one process generates a page fault on the first access of any page in the shared region. On the first access anywhere in the shared region the shared page table is linked with the process page table. A page table walk will then find entries already established, possibly by other processes. The number of minor faults needed is now just N.

Example Usage

Since the concept of mshare was first proposed to the community, the API has undergone numerous changes and continues to be refined. More recently, significant changes to the API are in progress as a result of feedback from the upstream community on the last patch series. What follows is intended to give an idea of how an application uses mshare, with the caveat that precise call names and arguments are subject to change.

1. Determine size and alignment requirements for an mshare region

The size of an mshare region determines the amount of virtual address space the region covers.

The start address and size must be properly aligned. This alignment and size requirement can be obtained by reading the file /sys/kernel/mm/mshare/mshare_align which returns a number in text format. On x86_64 the required alignment is 512GB since a PUD page table page in a 4- or 5-level page table covers 512GB of address space aligned on a 512GB boundary, and sharing a PUD page table page allows mappings for 4k, 2MB, and 1GB page sizes to be shared.

2. For the process creating an mshare region:

  • Create an empty, zero-length mshare region
       mshare_fd = memfd_mshare();
  • Establish the size of the region
       char req[128];

       fd = open("/sys/kernel/mm/mshare/mshare_align", O_RDONLY);
       read(fd, req, sizeof(req));
       alignsize = atoll(req);    /* 512GB overflows an int, so not atoi() */

       size = alignsize * 2;
       ftruncate(mshare_fd, size);
  • Map some memory in the region. Any combination of shared anonymous memory, hugetlb memory, and/or files can be mapped into an mshare region.
       /* Map a single 4k page of shared anonymous memory */
       mshare_mmap(mshare_fd, 0, 0x1000, PROT_READ | PROT_WRITE,
               MAP_ANONYMOUS | MAP_SHARED, -1, 0);
  • Attach the mshare region to the process at shmaddr1
       addr = mshare_attach(mshare_fd, shmaddr1, flags);
  • Write and read the mshare region normally
       *addr = 'm';

3. For other processes accessing the mshare region:

  • Acquire the file descriptor of an existing mshare region. The file descriptor can be passed from the creating process through inheritance via fork() or by using unix domain sockets.
  • Attach the mshare region to the process at shmaddr2
       addr = mshare_attach(mshare_fd, shmaddr2, flags);
  • Write and read the mshare region normally
       if (*addr == 'm')
               ...

Technical Challenges

A project like mshare has many technical challenges to overcome: Which page table walks in the kernel should walk a shared page table? Which should not? How does locking work? One challenge addressed recently that I’ll briefly discuss here is how to flush TLB entries for shared page tables.

First, some background

In the kernel every user process has a struct mm_struct (mm) associated with it to hold per-process memory management information. Among other things this structure holds the page tables and VMAs (virtual memory areas) representing the objects mapped in the process address space.

For typical memory mappings in a process when an address is accessed for the first time, a page fault is generated and the process mm and its VMAs are used to populate an entry in its page tables. On subsequent accesses, the hardware caches the page table translation in per-core caches known as TLBs. This speeds up later accesses by avoiding the need to repeatedly walk the page table.

Entries in a TLB are tagged with a context identifier that associates them with a particular user process or with the kernel. When a page table entry is updated or removed (e.g. when access permissions are changed or memory is unmapped), any entries cached in the TLBs must be flushed. For user memory the kernel takes care of this by using data in the process mm to limit the flush to entries of the affected process.

Sharing

The mshare implementation works by creating a separate mm to hold the shared page tables and VMAs for objects mapped into the shared region. A page fault in a shared region links the process page table to the shared page table and uses the mshare mm to populate entries in the shared page table.

This mm is associated with the mshare region and not with the process.

The challenge for mshare is that a shared page table entry accessed by multiple processes results in separate TLB entries for each of the processes. When a shared mapping is updated or removed, all associated TLB entries must be flushed, or processes risk accessing stale entries, leading to data corruption. There currently isn’t a straightforward way for the kernel to flush entries for multiple processes for a single mapping. The most recently published patch series handled this issue by simply flushing all entries from all TLBs every time a shared mapping is updated. This is okay for prototyping but disastrous for performance, since all processes and the kernel are affected.

A solution is to maintain a list of all mm’s that have attached a particular mshare region and to keep that list with the associated mshare mm. When a shared mapping is updated, walk the list and flush TLB entries as needed for each mm. Add a process mm to the list when the process attaches the mshare region and remove it when the region is detached. Because a process may attach or detach an mshare region at any time, use RCU to protect the list. This has been shown to avoid the collateral performance damage caused by repeatedly flushing TLB entries of the kernel and unrelated processes. Further architecture-specific optimizations may be possible as well and will be explored as the project progresses.
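In kernel terms, that scheme might look roughly like the sketch below. This is not code from the actual patch series: the structure and field names are invented for illustration, and only the standard list/RCU helpers and flush_tlb_mm() are real kernel APIs (a real implementation could use range-limited flushes instead).

```c
/* Sketch only: per-region list of attached mms, protected by RCU */
struct mshare_mm_entry {
        struct mm_struct *mm;           /* an mm attached to the region */
        struct list_head  node;
};

/* attach: register the process mm with the mshare region */
spin_lock(&region->mm_list_lock);
list_add_rcu(&entry->node, &region->mm_list);
spin_unlock(&region->mm_list_lock);

/* on a shared-mapping update: flush only the attached mms */
rcu_read_lock();
list_for_each_entry_rcu(entry, &region->mm_list, node)
        flush_tlb_mm(entry->mm);
rcu_read_unlock();
```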

Ongoing Work

Development of mshare is ongoing and the feature is not yet available in any Linux kernel, though the intrepid soul can find the most recently submitted upstream patch series here.

Stay tuned!