One of the key duties of the Linux kernel is to manage resources and share them fairly and efficiently between processes and the kernel itself.

This blog post explores how we manage one of the most important resources a system possesses – its memory – and how ongoing work (spearheaded by Oracle Engineer Matthew Wilcox) has led to significant performance improvements and in the near future, significant memory savings.

Introduction to Physical Memory

Modern hardware allows memory to be mapped between ‘virtual’ memory addresses, which processes and the kernel operate upon, and ‘physical’ addresses, which tell the hardware where that memory actually resides.

In order for these mappings to be manageable, hardware divides them into power-of-2 sized, physically contiguous ranges known as ‘pages’ (physically contiguous meaning an unbroken run of physical addresses).

The smallest granularity hardware can work with is known as the ‘base page’ size. On many architectures this is configurable (arm64, for instance, supports base page sizes of 4 KB, 16 KB and 64 KB).

In order to manage this memory, the kernel must maintain metadata describing each page of physical memory.

In the past, this has been the job of the struct page data structure – currently 64 bytes in size (though it can vary with kernel configuration options).

There is a single struct page structure allocated to describe each page of memory, so for a 64 GB system with a base page size of 4 KB, that implies 1 GB of memory is used just to describe the memory, or around 1.6% of total memory.
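That arithmetic can be sketched as a small helper (the 64-byte figure is, as noted, configuration-dependent):

```c
#include <stdint.h>

/* Bytes of struct page metadata needed to describe ram_bytes of
 * physical memory, given the base page size and the size of a
 * single struct page (64 bytes here, but configuration-dependent). */
static uint64_t page_metadata_bytes(uint64_t ram_bytes,
				    uint64_t page_size,
				    uint64_t struct_page_size)
{
	return (ram_bytes / page_size) * struct_page_size;
}

/*
 * 64 GB of RAM divided into 4 KB pages gives 16,777,216 pages, each
 * requiring a 64-byte struct page: 16,777,216 * 64 = 1 GB of
 * metadata, i.e. 1/64th (~1.6%) of total memory.
 */
```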

What if we need more physical memory?

Now, things are further complicated by the fact that blocks of physically contiguous memory larger than the base page size are quite often required, as hardware that operates on physical memory needs to work with such ranges directly.

This is the case with, for instance, huge pages (whether via hugetlb or Transparent Huge Pages), or devices that need to address RAM directly.

To describe these, the kernel uses multiple struct page entries to cover a single contiguous range, which is (rather confusingly) known as a ‘compound page’.

The key metadata is then kept in the first of these struct page entries, termed the ‘head’ page, with the remaining pages termed ‘tail’ pages.

This is obviously a less than optimal situation, as it leads to a great deal of wasted space, as well as another rather thorny issue. Consider:

void do_something_with_phys_mem(struct page *page)
{
	/* Hang on, is 'page' a head page or a tail page? */
	struct page *head = compound_head(page);

	...
}

We don’t know if the input page is a head page, a tail page or a non-compound page, so we have to invoke a function to check and fetch the head page if necessary.

The compound_head() function dereferences struct page->compound_head and looks up the head page if necessary, adding overhead to any function which can’t rule out being passed a compound page.
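A simplified sketch of how this lookup works – the real kernel implementation has more to it, but the core idea is that a tail page stores the address of its head page with the bottom bit set:

```c
#include <stdint.h>

/* Simplified model: for a tail page, compound_head holds the address
 * of the head page with bit 0 set (struct page is well over 2-byte
 * aligned, so bit 0 is free to act as a marker). */
struct page {
	uintptr_t compound_head;
	/* ... many other fields in the real kernel ... */
};

static struct page *compound_head(struct page *page)
{
	uintptr_t head = page->compound_head;

	/* Bit 0 set means this is a tail page: mask the bit off to
	 * recover the head page pointer. Otherwise 'page' is already
	 * a head (or non-compound) page. */
	if (head & 1)
		return (struct page *)(head - 1);
	return page;
}

/* Build a two-page compound and check lookups resolve to the head. */
static int compound_head_works(void)
{
	struct page pages[2] = { { 0 }, { 0 } };

	pages[1].compound_head = (uintptr_t)&pages[0] | 1;

	return compound_head(&pages[0]) == &pages[0] &&
	       compound_head(&pages[1]) == &pages[0];
}
```

Note that even in the common case of a non-compound page, the caller still pays for the load and the branch – which is exactly the overhead described above.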

Folios: The Ottawa Interpretation

With this in mind, the concept of the ‘folio’ was born – a new data structure whose fields are identical to those of struct page, but which is guaranteed to be either a single page or the head page of a compound page.

This simple interpretation was established in Ottawa, so it is naturally referred to as the ‘Ottawa’ interpretation of the concept.

This addresses the performance overhead of dealing with compound pages, but not the issue of memory usage (we will come to that later).

To give a sense of the impact of this change, an early (and not particularly tuned) version of this work saw a 7% performance increase when building the kernel from this alone, which is very significant for a kernel-only optimisation like this.

Folios were first introduced in Linux v5.16.

Wider Benefits

There are significantly larger benefits to be had elsewhere however as a result of the introduction of folios.

By utilising a data structure that explicitly describes folios of a range of different sizes, logic that previously only dealt with pages at the smallest granularity, typically 4 KB, can use larger ranges of physical memory.

This is a hugely powerful concept, as it turns out a great many areas of the kernel are significantly impacted by this and some performance gains simply cannot be had without this capability.

This is especially important with respect to the ‘page cache’, the portion of RAM that caches on-disk data and through which nearly all disk operations on the system are performed.

Previously this could only operate at a 4 KB granularity, so operations against block devices capable of handling more than 4 KB at once had to be broken into 4 KB chunks.
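The cost of that chunking is easy to sketch: the larger the unit the page cache can operate in, the fewer operations a given I/O requires.

```c
#include <stdint.h>

/* Number of page cache operations needed for an I/O of io_bytes when
 * the cache operates in chunk-byte units (rounding up). */
static uint64_t cache_ops(uint64_t io_bytes, uint64_t chunk)
{
	return (io_bytes + chunk - 1) / chunk;
}

/*
 * A 1 MB read requires 256 separate operations when handled in 4 KB
 * pages, but only 16 when the page cache holds it in 64 KB folios.
 */
```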

This was a huge impediment. As per early discussions, performance improvements of between 20% and 500% could be had depending on hardware and use case through the use of larger I/O blocks – and that was in 2017.

Through the use of folios, we are able to utilise devices at their actual capacity, thereby utilising their full I/O throughput. This is a profound difference.

This also opened the door to filesystems being able to utilise a block size greater than 4 KB which was previously impossible.

Another key impact of this work was on reclaim. This maintains lists of active and inactive pages (now folios) and determines, based on which data is (approximately) least used, which memory to free in order to relieve memory pressure.

The limited granularity of the page cache therefore impacted reclaim as well – since each entry on these lists described only 4 KB, the lists were longer and reclaim took longer to traverse them.

Through use of folios we immediately fix this issue and, depending on large folio usage, reduce the size of the reclaim lists, thus improving reclaim performance.

The work which impacted the page cache and reclaim specifically was done in Linux v5.18 (see the relevant series for details).

This saw very significant throughput increases ranging from 24% to 241%.

We term folios larger than the base page size ‘large folios’, as it is at these larger sizings that the advantages of folios really come into play.

A case in point is memory mapping – even when not mapping huge pages, fewer page faults are required to map large folios, batching up operations and thus taking less time.

mTHP

Virtual memory, which allows for logical addresses belonging to each individual process to be mapped to physical memory, is only efficient as a result of aggressive caching between virtual and physical addresses.

This is because actually looking up the physical address from a virtual address is a relatively expensive operation in hardware.

This cache is known as the ‘Translation Lookaside Buffer’ or TLB. It is maintained at a granularity of the page size being mapped.

Modern hardware allows for page sizes larger than the base page size to be used, which are termed ‘huge pages’. These naturally map more memory per cache entry, reducing cache misses and thus (often significantly) reducing memory lookup latency.

In order to do this, the kernel must deal with physically contiguous ranges of the appropriate size to enable this functionality.

For instance, x86-64 allows 2 MB huge pages (as well as 1 GB, though 2 MB is more typically utilised), and thus the kernel maintains heuristics to reduce fragmentation of physical memory at a 2 MB granularity.
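The benefit can be sketched in terms of ‘TLB reach’ – the amount of memory the TLB can map at once. The 1,024-entry TLB below is a hypothetical figure chosen purely for illustration:

```c
#include <stdint.h>

/* Amount of memory a TLB with the given number of entries can map at
 * once ('TLB reach') for a given page size. */
static uint64_t tlb_reach(uint64_t entries, uint64_t page_size)
{
	return entries * page_size;
}

/*
 * With a (hypothetical) 1,024-entry TLB: 1,024 * 4 KB = 4 MB of
 * reach using base pages, versus 1,024 * 2 MB = 2 GB using x86-64
 * huge pages - 512 times more memory mapped per TLB entry.
 */
```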

However, arm64 permits huge pages at a variety of sizes between 4 KB and 2 MB – e.g. 16 KB, 64 KB, 256 KB – which exhibit similar performance improvements through reduced TLB pressure.

Without folios, which have an innate capacity to be split into smaller folios when required, supporting this would be either impossible or at least very difficult.

This functionality, termed ‘mTHP’ (multi-size THP), results in significant performance improvements ranging from 5% to 20% depending on use case.

Folios: The New York Interpretation

So, performance and latency improvements aside, we come back to the issue of the memory wasted in maintaining a struct page for each and every base page of physical memory.

In discussion with Johannes Weiner it was decided that in fact folios could address this issue also. Since he resides in New York state, this has subsequently been termed the ‘New York’ interpretation.

As a result, a key aim of folios is to reduce the size of struct page from 64 bytes to 8 bytes.

The intent is that each struct page will now simply contain a ‘memory descriptor’ (memdesc for short) which will either encode metadata directly within the 8-byte value, or contain a pointer to another data structure providing the relevant metadata.

While this represents an 87.5% reduction in the size of struct page objects, the structures pointed to by the memdesc must now be allocated separately. However, since fewer such objects are required across physically contiguous ranges, this still represents a significant memory saving.
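The precise encoding belongs to Matthew's memdesc design and may differ, but the general technique – tagging the low bits of an aligned pointer – can be sketched as follows. The type names and 4-bit tag here are illustrative assumptions, not the actual implementation:

```c
#include <stdint.h>

/* Illustrative sketch only: tag the low 4 bits of a 16-byte-aligned
 * descriptor pointer with a type, leaving the remaining bits as the
 * pointer itself. Type names and bit layout are assumptions for
 * illustration, not the kernel's actual memdesc encoding. */
enum memdesc_kind {
	MEMDESC_FOLIO = 1,	/* user memory */
	MEMDESC_SLAB = 2,	/* slab allocator memory */
	MEMDESC_PTDESC = 3,	/* page table memory */
};

#define MEMDESC_TYPE_MASK 0xfUL

static uintptr_t memdesc_pack(void *desc, enum memdesc_kind kind)
{
	return (uintptr_t)desc | (uintptr_t)kind;
}

static enum memdesc_kind memdesc_kind_of(uintptr_t memdesc)
{
	return (enum memdesc_kind)(memdesc & MEMDESC_TYPE_MASK);
}

static void *memdesc_ptr(uintptr_t memdesc)
{
	return (void *)(memdesc & ~MEMDESC_TYPE_MASK);
}

/* Pack and unpack a descriptor, checking the round trip. */
static int memdesc_roundtrip(void)
{
	static _Alignas(16) int obj;
	uintptr_t d = memdesc_pack(&obj, MEMDESC_SLAB);

	return memdesc_kind_of(d) == MEMDESC_SLAB &&
	       memdesc_ptr(d) == (void *)&obj;
}
```

The design choice this illustrates is that the tag makes the 8-byte word self-describing: the kernel can tell what kind of memory a page is without consulting any other structure, while the full descriptor is only allocated once per folio (or slab, etc.) rather than once per base page.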

Importantly, as a result of this change, folios themselves only provide metadata describing user memory, which is a significant divergence from the Ottawa interpretation.

Until memdescs are established for each type of kernel memory allocation, kernel memory still largely uses struct page directly.

The reduction in memory used is especially important because the memory allocated for struct page belongs to the kernel, meaning it cannot be reclaimed.

See Matthew’s Memdescs page for a detailed description of the design.