We are always looking to solve hard problems and enhance the Linux Kernel. In this blog post, Oracle Linux kernel engineer Liam Howlett introduces a new data structure that he and Matthew Wilcox are developing.
The Linux Memory Management layer supports the very common technique of virtual memory. Linux splits blocks of virtual memory into areas specified by the c structure vm_area_struct. Each vm_area_struct contain information associated with mapped memory and are used to find the associated pages of memory which contain the actual information. Virtual memory areas (VMAs) could be the contents of a file on disk, the memory that contains the program, or even the memory the program uses during execution. Literally everything that is run on Linux uses vm_area_struct for memory mapping. This vital area of the kernel needs to be quick and avoid contention whenever possible.
There are two main operations that are performed on the VMAs: lookups and modifications. Lookup time is important as it is used to find the actual memory or where to place the memory on pagefaults. Modifications are important when memory is no longer needed such as exec() or when new memory is needed during runtime such as stack expansion, heap expansion, mmap(), or loading a new library. When new libraries are loaded, the system searches for an available gap to place the library in the VMA. Depending on the architecture, that search will start from an address and walk up or down until a sufficient gap is found for the library. There are other calls that modify existing VMAs by changing the metadata of a VMA, but these can be viewed as modifications (remove/re-add, split, etc).
Linux stores the VMAs in an rbtree which can achieve O(log(n)) time for search, insert, and delete. One issue with rbtree is that walking the nodes in numerical order is not very efficient. Walking the Linux VMAs is done by creating a doubly linked list between the VMAs. The start of the VMA doubly linked list is stored in the mm_struct as mmap. A further optimization is to also track the gaps between the VMAs for faster searching of the VMA space for a sufficient gap. The rbtree requires the user to write their own search function which has led to very few other places of use beyond the memory management subsystem.
One of the other users of the rbtree in the kernel is the interval tree. The interval tree is used to track file operations by various tasks as it supports overlapping ranges. The interval tree comes with a search function which has led to the interval tree being used in many places that don't have overlaps at all - simply because it's easy and faster than a linked list.
The rbtree is tightly coupled with the VMA structures in the memory management subsystem, in fact the rbtree nodes are embedded within the vm_area_struct itself. This leads to faster allocation time as the vm_area_struct comes with what is needed.
Having the rbtree node integrated into the vm_area_struct means that any walking of the tree requires the entire VMA to be loaded into cache even if it's not a node of interest. This is also true of the doubly linked list - inserting a VMA into the list means the next and previous VMA are needed to be read and then written out. Currently, the vm_area_struct is 3 cache lines. Neither of these operations can take advantage of prefetching done by the hardware.
In recent years the processors have experienced growth in core counts which have pushed software to be multi-threaded and increased contention in the virtual memory data structure. The memory management subsystem uses the mmap_sem lock for write protection of the VMAs. Optimizing the mmap_sem lock into a rw-semaphore helped contention but did not solve the underlying issue. Even with a single threaded program and a well-intended system admin, contention does arise through proc file accesses for application monitoring.
The combination of the rbtree nodes being embedded in the VMA and the fact that rbtrees 'rotate' makes it extremely difficult if possible to write a lockless version of the rbtree. Furthermore, the doubly linked list and the calculated gaps ensure a lockless version cannot be written as it would require four pointers be replaced at the same time.
The answer to the mmap_sem contention would need a new data structure. A data structure that could track gaps, store ranges, and be implemented in an RCU compatible manner. Matthew Wilcox and I set out to design just that and have arrived at what is now known as the Maple Tree. The Maple Tree is a range-based B-tree designed to be RCU-safe and track gaps.
The tree stores the VMAs in nodes that are separate from the vm_area_struct so that the leaves can be copied and updated (C and U in RCU) without readers (R in RCU) having to leave the tree. Gaps are tracked in the same tree and report the largest gap in the sub-tree. Gap tracking is optional and decided on tree creation. Leaves store up to 16 entries, which means the next/prev are almost always in the same node.
Non-leaves have a branching factor of 10 when tracking gaps and 16 when not tracking gaps, while leaves have a branching factor of 16. The significant increase in branching factor means the tree is significantly shorter than the rbtree which has a branching factor of 2. Maple tree nodes are also only 2 cache lines, which is 50% smaller than the rbtree 3 cache lines.
One goal of the Maple Tree is to have a simple interface for users who just want to store data while having an 'advanced' interface for things such as the VMAs. By providing a simple interface, the users who just want to store data can have an RCU safe tree that does not have unused code to handle non-existent overlaps. The advanced interface allows for users to optimize the operations further and pre-allocate storage space - although the nodes are only 256 Bytes.
Another reason for the advanced interface is to encompass most of the common uses within the tree implementation for all tree users benefits and to make the kernel subsystem code more coherent. There are iterators that accept limits which will be used in place of doubly linked lists for the most part.
The RFC for the maple tree that was sent to LKML on 2020/12/10 still needs work; the doubly linked list remains, 32-bit support does not exist, nommu is not supported, and RCU is currently disabled. You can also find the current repository on GitHub.