There is a struct page associated with every base page (4K) of system memory, regardless of use. This rather contentious data structure is 64 bytes long. While that looks small, one exists for every base page of system memory regardless of how process page tables (e.g. huge pages) map it. Thus, on x86, its overhead is about 1.5% of total physical memory (64 / 4096 ≈ 1.56%), or, for quick math, 16 GB per 1 TB. In systems where the Trusted Computing Base (TCB) is small and memory availability is a concern, 1.5% can quickly translate to tens of GB of memory that could otherwise be dedicated to guests. Let's extrapolate that number for a bit:
32 GB for a 2 TB machine
128 GB for an 8 TB machine
1 TB for a 64 TB machine
It's just 1.5%, heh?
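The arithmetic behind those numbers is easy to reproduce. Here is a minimal sketch, assuming the 64-byte struct page and 4K base page quoted above; the function name is made up for illustration.

```python
STRUCT_PAGE_SIZE = 64   # bytes per struct page on x86, as quoted in the text
BASE_PAGE_SIZE = 4096   # 4K base page

GIB = 1 << 30
TIB = 1 << 40

def struct_page_overhead(total_memory_bytes):
    """Bytes of struct page metadata needed to describe total_memory_bytes."""
    nr_base_pages = total_memory_bytes // BASE_PAGE_SIZE
    return nr_base_pages * STRUCT_PAGE_SIZE

# Print the overhead for the machine sizes used in the text.
for tib in (1, 2, 8, 64):
    overhead = struct_page_overhead(tib * TIB)
    print(f"{tib:>2} TiB machine: {overhead // GIB} GiB of struct pages")
```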
This structure is also the main tracking data structure when memory crosses subsystems, say for scatter-gather in the block or networking layers. It wouldn't be an oversimplification to say that without a struct page you are cut off from most kernel services.
Here we describe some of the options we are exploring to address this long-standing issue, specifically in the context of virtual machines and persistent memory.
One easy way to get that 1.5% capacity back is to "just" rip away the struct page, which leads to an interesting exercise on how much the hypervisor relies on this piece of metadata.
Today in Linux, the one way to use memory that is not backed by a struct page is via a special character device called /dev/mem, which is meant to provide direct access to physical memory. The user limits the amount of memory the kernel manages with either the memmap= or mem= kernel command-line option, and then calls mmap() on the device at a given starting offset. Doing so for a virtual machine requires a few non-trivial compromises:
Each mmap() performed on the device gives you one and only one contiguous chunk of memory: on a potentially fragmented hypervisor, one needs to pick several ranges to fulfill the necessary guest allocation.
/dev/mem gives you access to every PFN not mapped by the kernel, as long as it's not System RAM that the kernel manages or the first megabyte of memory. That also means giving access to the whole /dev/mem device and mmap-ing it multiple times to deal with fragmentation. Any added policing would require non-trivial policies for each mmap() intercepted by the VMM to restrict the offsets the caller can use.
There is no support for huge pages with /dev/mem, which can lead to less efficient use of the TLB and consequently lower performance.
/dev/mem can map any memory, and you have no way to give the memory back should the host find itself in an out-of-memory (OOM) situation it wants to get out of.
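To make the first compromise concrete, here is a rough sketch of how a VMM might back one guest with several /dev/mem mappings. The helper names and the list of free ranges are hypothetical. Actually mapping the ranges requires root and a kernel booted with mem= or memmap= so that the ranges are outside the kernel's view.

```python
import mmap
import os

def plan_mappings(free_ranges, guest_bytes):
    """Pick (offset, length) chunks from free_ranges until guest_bytes is covered.

    free_ranges is a list of page-aligned (physical_offset, length) tuples
    describing memory left out of the kernel's view (hypothetical input).
    """
    plan, remaining = [], guest_bytes
    for offset, length in free_ranges:
        if remaining == 0:
            break
        take = min(length, remaining)
        plan.append((offset, take))
        remaining -= take
    if remaining:
        raise MemoryError(f"{remaining} bytes of guest memory left unbacked")
    return plan

def map_ranges(plan, path="/dev/mem"):
    """mmap() every planned chunk: one call per contiguous range."""
    fd = os.open(path, os.O_RDWR | os.O_SYNC)
    try:
        return [mmap.mmap(fd, length, flags=mmap.MAP_SHARED,
                          prot=mmap.PROT_READ | mmap.PROT_WRITE, offset=offset)
                for offset, length in plan]
    finally:
        os.close(fd)  # the mappings stay valid after the descriptor is closed
```

Note that nothing in this sketch restricts which offsets are requested, which is exactly the policing problem described above.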
The alternative to /dev/mem is DAX, the other subsystem that provides the bare minimum to allow applications direct access to memory. Linux v5.10 contains newly added infrastructure that allows carving up memory for applications, as a better alternative to emulating label-less NVDIMMs on x86, supporting subdivision while handling fragmentation of the region. With a stricter version of the DAX device and daxctl v71+, users can manage the memory freely.
What remains to be done is tracking the cacheability of struct page-less ranges as writeback (mostly a problem for x86 PAT), supporting huge pages without struct page, and handling machine check exceptions (MCEs). We first explored this idea here and you can find a more up-to-date branch on GitHub.
Although ripping struct page away for guest memory acts as a quick way to address the metadata costs by removing it from the kernel mapping, it also means stripping away a lot of features from its users, and only works well in specialized environments with minimal dependence on kernel services.
It isn't a fix but rather a way of bypassing how memory is tracked in the kernel. The fundamental problem is that there are just a lot of these struct pages, regardless of how the memory is going to be reflected in page tables.
The memory model is the fundamental component that tracks kernel-managed system memory. There are three models available: flat, sparse, and discontiguous. The sparse memory model has metadata organized in sections (with size depending on architecture) and uses page flags or bits of the PFN to convert between a frame and its metadata (struct page). To make the lookup cheaper, the sparse memory model has an extension dubbed vmemmap that allocates a contiguous virtual address range indexing the struct pages of all the memory tracked in the system. This big array is itself called the vmemmap, and each frame allocated to back these struct page virtual addresses is called a vmemmap page/area. A PFN is an index into the vmemmap array, and subtracting the vmemmap base from a struct page pointer yields that index. Note that the vmemmap only works in terms of base pages: at that layer there is no knowledge of how different struct pages are related to each other, as the buddy allocator or other allocators have.
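The PFN to struct page conversion boils down to simple pointer arithmetic. Below is a small model of it, not kernel code: addresses are plain integers, and the base address is only illustrative (it is the default x86-64 vmemmap base with 4-level paging and no KASLR, used here as an assumption).

```python
VMEMMAP_BASE = 0xFFFFEA0000000000  # illustrative base of the vmemmap array
STRUCT_PAGE_SIZE = 64              # bytes per struct page on x86
PAGE_SHIFT = 12                    # 4K base pages

def pfn_to_page(pfn):
    """Virtual address of the struct page describing frame `pfn`."""
    return VMEMMAP_BASE + pfn * STRUCT_PAGE_SIZE

def page_to_pfn(page_addr):
    """Inverse lookup: subtract the array base and divide by the element size."""
    return (page_addr - VMEMMAP_BASE) // STRUCT_PAGE_SIZE

def phys_to_page(phys_addr):
    """A physical address maps to a struct page through its frame number."""
    return pfn_to_page(phys_addr >> PAGE_SHIFT)
```

Both directions are constant time, which is what makes the vmemmap cheaper than walking sections.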
The sparse memory model is the most common, and the vmemmap/altmap extensions are also used today in commodity architectures like x86 or ARM.
Each page-sized virtual address range of the vmemmap points to a physical frame that contains 64 struct pages (on x86, where each one is 64 bytes). These frames are unique for the set of memory online in the system. They often come from contiguous ranges of 4K pages, which allows fewer page table entries to represent the vmemmap.
The buddy allocator or hugetlbfs can group these pages to represent bigger chunks of memory such as 2M huge pages (an order-9 allocation, i.e. 512 base pages) or 1G huge pages (order-18). The subsystems then group these pages into one head page plus one or more tail pages that are treated as a "set". This "set" is called a compound page, and the number of pages it can group is independent of the CPU-supported page sizes.
One interesting property of these compound pages is that all the important information being tracked is stored in the head page, e.g. the address space that the page belongs to, the index within the mapping, or private page data. The tail pages act as a proxy and contain just enough information to point to the head page (e.g. the compound_head field). As you can imagine, there are going to be a lot of these tail pages, all of them carrying mostly duplicated information: a 2M compound page has 511 tail pages and a 1G huge page has 262143 tail pages.
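A toy model helps to see why tail pages are mostly redundant. The sketch below is an illustration only: the class and helper names are made up, the head link is a plain object reference rather than the kernel's actual encoding, and the order is counted in base pages (so a 2M page is order 9).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(eq=False)
class Page:
    mapping: Optional[str] = None   # address space, only meaningful on the head
    index: int = 0                  # index within the mapping, head only
    head: Optional["Page"] = None   # set on tail pages, None on the head

def compound_head(page):
    """Tail pages proxy to their head; a head (or base page) is its own head."""
    return page.head if page.head is not None else page

def nr_tail_pages(order):
    """A compound page of 2**order base pages has one head and the rest tails."""
    return (1 << order) - 1

def make_compound_page(order, mapping, index):
    """Build one head carrying the real data plus tails that only point to it."""
    head = Page(mapping=mapping, index=index)
    return [head] + [Page(head=head) for _ in range(nr_tail_pages(order))]
```

Every tail page in the list returned by make_compound_page() holds nothing but the link to the head, which is the duplication the following idea exploits.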
The kernel cannot make efficient use of the memory allocated for the virtual memory map, given that it needs to track memory at its finest granularity for the different memory allocators. But recently Muchun Song revisited this problem with a fresh look for hugetlbfs, making the following proposition in his series: if the tail pages are mainly used as a proxy, can we reuse their data across the many tail pages?
The sparse vmemmap model allows such a thing because the page address is virtual and it can point to any frame in the system. So the idea is simple:
allocate one page for the head and the first 63 tail pages (0 .. 63)
allocate another page for the 64 tail pages right after (64 .. 127)
point the remainder of the vmemmap PTEs to the frame allocated in the previous step (128 .. N)
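The three steps above can be modelled in a few lines. This is a sketch under the numbers given in the text (64-byte struct page, 4K frames); frames are represented by small integers and the function names are hypothetical, so it says nothing about how the kernel actually walks or rewrites the page tables.

```python
STRUCT_PAGE_SIZE = 64
BASE_PAGE_SIZE = 4096
STRUCT_PAGES_PER_FRAME = BASE_PAGE_SIZE // STRUCT_PAGE_SIZE  # 64

def vmemmap_layout(compound_size):
    """Return, for each vmemmap PTE of one compound page, the frame it maps.

    Frame 0 holds the head and the first 63 tails, frame 1 holds the next
    64 tails, and every remaining PTE is pointed at frame 1 again.
    """
    nr_struct_pages = compound_size // BASE_PAGE_SIZE
    nr_ptes = -(-nr_struct_pages // STRUCT_PAGES_PER_FRAME)  # ceiling division
    return [min(i, 1) for i in range(nr_ptes)]

def frames_freed(compound_size):
    """Frames that no longer need to back the vmemmap of one compound page."""
    layout = vmemmap_layout(compound_size)
    return len(layout) - len(set(layout))

def struct_pages_backed(compound_size):
    """Struct pages that still have their own memory behind them."""
    return len(set(vmemmap_layout(compound_size))) * STRUCT_PAGES_PER_FRAME

print(vmemmap_layout(2 << 20))   # the 8 vmemmap PTEs of a 2M compound page
```

Under these assumptions a 2M compound page keeps 2 of its 8 vmemmap frames, and a 1G one keeps 2 of 4096.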
Usually the majority of tail pages aren't used except for the first few tail pages that extend information in the head page. The rest of the tail pages are only used read-only in order to get to the head page. It is this one characteristic that makes it possible to do this with compound pages, as well as the fact that there's a reservation that needs to be done in advance with hugetlbfs or device-dax.
Things get slightly more complicated once we allocate or free these compound pages. Because the memory map is initialized at boot, we need to walk the existing vmemmap page tables and rearrange them to enable the reuse of tail page entries described in the last step above. Once that is done, we free the unused frames back to the kernel. The other hurdle occurs when we try to give these huge pages back to the page allocator as base pages: we may fail to allocate the vmemmap frames needed to do so.
Things get a lot simpler with ZONE_DEVICE, though.
ZONE_DEVICE is another foundational part of Linux MM that is used for P2P DMA and persistent memory. Persistent memory is one special case, often a contentious one, as vendors generally want machines with as much persistent memory as possible, while its metadata is stored in the fastest medium (RAM). So keeping the metadata overhead minimal is crucial to lowering RAM requirements.
Contrary to hugetlbfs, which starts with the boot vmemmap, ZONE_DEVICE has to explicitly allocate the struct page metadata and thus allocate vmemmap space. This particular difference makes it much easier to reuse tail pages, without having to change the existing virtual memory map or worry about giving the pages back in the right shape to the previous owner (i.e. the page allocator). But there are more advantages as well:
initializing the memory map takes significantly less time: there is a constant number of struct pages per compound page, and thus less to initialize. We only need to initialize 128 struct pages regardless of page size, whereas without tail page reuse we would initialize 512 struct pages for a 2M page or 262144 struct pages for a 1G huge page.
caching: there are a lot fewer struct pages, and therefore a lot less that needs to stay in cache. This increases the likelihood of having the head struct page in the cache, consequently speeding up the pinning of memory.
For more information have a look into the series for reusing tail pages for ZONE_DEVICE, specifically for a user like device-dax.
Naturally, such an approach is incompatible with splitting compound pages into smaller ones, as that requires changing the virtual memory map. With device-dax that is not a problem, as splitting only occurs at a runtime-configured device-dax page size, much like hugetlbfs. As such, poisoning a page also means poisoning the entire compound page. Finally, when used together with other drivers like dax_kmem, the memory map needs to be torn down regardless, before hotplugging the memory into the kernel.
The memory savings with tail page reuse are almost as good as removing the struct page. Its effectiveness is tied to the page size, as you still need 8 KB of memory for compound pages of order 7 (512K) or above. Rather than saving 1.5% as you do when removing the struct pages, you would save 1.17% with 2M pages and 1.4986% with 1G pages. Quantitatively this means:
With no struct page we save 16 GB per TB, but no kernel services are supported
With tail page reuse and 2M huge pages we save ~12 GB per TB
With tail page reuse and 1G huge pages we save ~16 GB (16368 MB) per TB
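For a back-of-the-envelope check of the 2M case, here is a short sketch. It assumes exactly two retained 4K vmemmap frames (8 KB) per compound page and ignores any other overhead, so treat results for other page sizes as approximations that may differ slightly from the figures quoted above. The function names are made up.

```python
STRUCT_PAGE_SIZE = 64
BASE_PAGE_SIZE = 4096
RETAINED_FRAMES = 2          # assumption: vmemmap frames kept per compound page

MIB = 1 << 20
GIB = 1 << 30
TIB = 1 << 40

def baseline_per_tib():
    """Struct page metadata for 1 TiB with no reuse at all."""
    return (TIB // BASE_PAGE_SIZE) * STRUCT_PAGE_SIZE

def saved_per_tib(compound_size):
    """Metadata saved per TiB when tail pages are reused."""
    nr_compound_pages = TIB // compound_size
    retained = nr_compound_pages * RETAINED_FRAMES * BASE_PAGE_SIZE
    return baseline_per_tib() - retained

saved = saved_per_tib(2 * MIB)
print(f"2M pages: {saved // GIB} GiB saved per TiB ({100 * saved / TIB:.2f}%)")
```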
In terms of initialization times, pmem takes about 268-358 ms for a 128GB DIMM when struct pages are placed in RAM. If tail struct pages are reused per compound page, with struct pages still placed in memory, initialization drops impressively to ~78-100 ms with 2M compound pages, and to less than 1 ms with 1G compound pages!
The blog post discusses the impact of page metadata from an efficiency standpoint, and how we can tidy up some subsystems. There are various advantages and there are still challenges and potentially other optimizations to tackle.
For example, one challenge that remains to be resolved is keeping certain pages out of the direct map, to get the same security boundary as is achieved by just not having a struct page around.
Of course, a lot of this is being discussed with the community, so there isn't yet an ultimate solution that makes everybody happy.
Thank you to the following folks (in alphabetical order) who have made this work possible:
Matthew R. Wilcox