Last week, Liam Howlett posted the first version of the Maple Tree to the linux-kernel mailing list. The Maple Tree, a collaboration between Liam Howlett and Matthew Wilcox, introduces a B-tree based range-locked tree which could significantly reduce unnecessary contention on the memory management subsystem — with the eventual goal of perhaps removing mmap_sem entirely. They have been working on this for a year over at github.com/oracle/linux-uek and I’m really excited to see this project sent out for comment and review!
Read the RFC at lore.kernel.org
The maple tree is an RCU-safe range based B-tree designed to use modern
processor cache efficiently. There are a number of places in the kernel that a
non-overlapping range-based tree would be beneficial, especially one with a
simple interface. The first user that is covered in this patch set is the
vm_area_struct rbtree in the mm_struct with the long term goal of reducing the
contention of the mmap_sem.
The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf nodes.
With the increased branching factor, it is significantly short than the rbtree
so it has fewer cache misses. As you can see below, the performance is getting
very close even without the vmacache. I am looking for ways to optimize the
vma code to make this even faster and would like input. As this tree requires
allocations to provide RCU functionality, it is worth avoiding expanding the
tree only to contract it in the next step.
This patch set is based on 5.10-rc1.
Please note that 35821 lines of this patch is the test code.
It is important to note that this is an RFC and there are limitations around
what is currently supported.
The current status of this release is:
- **No support for 32 bit or nommu builds**
- There is a performance regression with regards to kernel builds from -3% to
-5% of system time. Elapsed time is within 1 second of 5.10-rc1.
- Removal of the vmacache
- The mm struct uses the maple tree to store and retrieve all VMAs
- Decrease in performance in the following micro-benchmarks:
- will-it-scale brk1 which tests insert speed of a vma as it is written of
-25 to -33%. The test does not seem to test what it is meant to test.
- kernbuild (~-4 to -5% system, less that -1% elapsed Amean)
- Increase in performance in the following micro-benchmarks in Hmean:
- will-it-scale malloc1-processes (~1-9%)
- will-it-scale malloc1-threads (~29-71%)
- will-it-scale pthread_mutex1-threads (~0-16%)
- will-it-scale signal1-processes (2-17%)
- will-it-scale brk1-threads **This test doesn't make sense, disregard**
- Converted the following to use the advanced maple tree interface:
- brk()
- mmap_region()
- do_munmap()
- dup_mmap()
- Currently operating in non-RCU mode. Once enabled, RCU mode will be enabled
only when more than one thread is active on a task. At that point the maple
tree will enable RCU lookup of VMAs.
The long term goal of the maple tree is to reduce mmap_sem contention by
removing users that cause contention one by one. This may lead to the lock
being completely removed at some point.
The secondary goals is to provide the kernel with a fast RCU tree with a simple
interface and to clean up the mm code by removing the integration of the tree
within the code and structures themselves.
This patch set is based on 5.10-rc1.
Link: https://github.com/oracle/linux-uek/releases/tag/howlett%...
Implementation details:
Each node is 256 Bytes and has multiple node types. The current implementation
uses two node types and has a branching factor of 16 for leaf/non-alloc nodes
and 10 for internal alloc nodes. The advantage of the internal alloc nodes is
the ability to track the largest sub-tree gap for each entry.
Even with the removal of the vmacache, the benchmarks are getting close to the
rbtree, but I'd like some help in regards to making things faster. The tree
is at a disadvantage of needing to allocate storage space for internal use
where as the rbtree stored the data within the vm_area_struct. This is a
necessary trade off to move to RCU mode in the future. 