The Maple Tree Internals Introduction
The Maple Tree is a data structure created to efficiently store ranges, prioritising look up speed and data density. This article will provide a conceptual view of the internal storage layout and a guide to interpreting maple tree dumps produced by the debug code. The tree was designed to be very information dense and allow for effective use of the CPU cache.
For Maple Tree fundamentals, please see one of the articles or videos listed at the end of the article.
Storing Ranges
Entries are stored in slots with ranges defined by pivots. Each range ends at a pivot
and starts at the previous pivot + 1
.
This example fragment represents a range of 10 to 15 pointing to an entry of 0x7f3bc7b00020. :
10 0x7f3bc7b00020 15 └────────────────────┘
15 is the pivot
for this range.
The Maple Tree has an entry for all ranges, including NULL ranges: :
0 (nil) 9 10 0x7f3bc7b00020 15 16 (nil) ∞ └─────────┘ └──────────┘
The ∞
represents ULONG_MAX
, for the kernel implementation of the maple tree.
The newly added ranges are 0-9 and 16-∞ with (nil)
in the slots.
The root node always contains the range 0 – ∞. The internal representation does not need to store the upper and lower limits as they are implied by the parent. The upper limit is stored as long as there is sufficient space as an optimisation for other operations.
Overlapping ranges are not supported. Ranges are sequential and contiguous. The start of a range is known by adding one to the previous pivot
. The start of the ranges can be removed. :
0 (nil) 9 10 0x7f3bc7b00020 15 16 (nil) ∞ └─┘ └──┘ └──┘
Removing the unnecessary start of the ranges yields closer to what a Maple Tree node contains:
(nil) 9 0x7f3bc7b00020 15 (nil) ∞
Adding another range from 0-5 is represented below: :
0xfffff7b00050 5 (nil) 9 0x7f3bc7b00020 15 (nil) ∞ └────────────────┘
The ranges being: 0-5, 6-9, 10-15, and 16-∞.
The corresponding pivots being: 5, 9, 15, and ∞.
This is, essentially, the basic contents of a Maple Tree node. Maple Range 64 nodes (range64) have 16 slots, and 15 pivots. The slots contain the entries, while the pivots contain the upper limit for a slot. Each slot
and the corresponding pivot
are said to be at a particular offset
. In the example above, the pivot
is 5
and the slot
is 0xfffff7b00050
at offset 0
.
Searching nodes
Searching for an index in a node is accomplished by iterating through the pivots from the lowest to highest until a pivot
is greater than or equal to the desired index.
Entries are located in the area of the node called a slot
, at a particular offset
. range64 nodes contain 16 slots and 15 pivots, with the final pivot implied from the parent node.
The pivots and slots are the main contents of a node. The dump of an example node would contain the following output: :
contents: 0x0000DEADBEEF 5 (nil) 9 0x0000BADDECAF 15 (nil) 18446744073709551615 (nil) 0 (nil) 0 ...
Here is the same output with the ranges underlined, as they were in previous examples: :
contents: 0x0000DEADBEEF 5 (nil) 9 0x0000BADDECAF 15 (nil) 18446744073709551615 (nil) 0 (nil) 0 ... └──────────────┘ └─────┘ └───────────────┘ └────────────────────────┘
Where ...
denotes the removal (nil)
and 0
‘s for ease of readability.
Internal Nodes
Internal nodes are any node that is not a leaf node. Non-leaf nodes do not store nulls, so each entry and NULL are at the same height within the tree.
When a tree has multiple levels, the maximum of one node is stored in the pivot
of the parent, while the minimum of the sibling to the right is implied, and is pivot + 1
of the parent. It is important to note that the implied minimum pivot
is already the previous pivot + 1
, so if the first entry in a node is of size 1 then pivot 0
may be the same as the implied minimum. This holds true for the first node and the implied pivot
of 0; this means a node with the following contents is valid: :
contents: 0x7f8e9b300020 0 0x7f8e9b300020 1 0x7f8e9b300020 2 0x7f8e9b300020 ...
The contents of the node represents singletons (entries with ranges 1 in size):
`0: 0x7f8e9b300020`, `1: 0x7f8e9b300020`, `2: 0x7f8e9b300020`, `...`
The process for searching for an entry in a multiple level Maple Tree will search each node starting at the root. The node is searched in the same way at each level; finding the pivot
of greater or equal value. Internal nodes contain nodes that also need to be searched. The procedure restarts with the node being searched set to the entry stored in the slot
at offset
of the internal node, and the offset
reset to 0
– also referred to as descending the tree. Once a leaf node is encountered, the search result of the leaf node will be the desired entry.
Every child stores a parent pointer for tree operations such as previous and next, which may need to walk up the tree to get to the next leaf node.
Gaps
Some users of the Maple Tree need to find space to store an entry based on the size of the range for that entry. When a tree is created, you can specify that it should track gaps, this can be referred to as an allocation tracking tree. The NULL areas are called gaps and can be tracked within the same tree structure in allocation range nodes (arange64), as apposed to a more general range64 node.
The arange64 nodes exist to track the largest block of available range in the corresponding slot. They have a space to store the largest gap
for each entry; in other words each subtree (represented by the entry) has its largest gap
stored in the parent arange64. Each arange64 also stores the location of the largest gap
it contains.
The arange64 nodes are only internal nodes – never a leaf; leaves do not track gaps, the gaps in leaves are calculated during the walking of the node. arange64 nodes are always used as internal nodes when tracking gaps, otherwise the range64 nodes are used as internal nodes.
If an arange64 node had two nodes with gaps of 4
and 18446744073709551510
, respectively, it could be represented as follows: :
contents: 4 18446744073709551510 0 ... | ... | 0x61500000080c 45 0x61500000030c 18446744073709551615 (nil) 0 ... └┬┘└──────┬───────────┘└┬┘ └─────┬────────┘└┬┘└─────┬───────┘└──────┬────────────┘ │ │ │ │ │ │ │ └─ gap 0 └─ gap 1 └─ gap 2 └─ slot 0 │ └─ slot 1 │ └─ pivot 0 └─ pivot 1
Where ...
denotes the removal (nil)
, 0
‘s, and other information for ease of readability.
The gap
at offset 0
states that there is a NULL
of size 4 in the node (or in the subtree below the node) at slot 0
. In this example, the node 0x61500000080c
is at slot 0
. The gap
has the same offset
as the slot
it is describing. This is the same as the pivot
matching the slot
at a given offset
, except there is no implied gap
from the parent node.
The branching factor is an important metric for how many nodes need to be walked prior to arriving at the desired entry; a larger branching factor means less nodes need to be in the CPU cache. Some users prefer to reduce the branching factor for the benefit of tracking the gaps.
Tracking gaps has the trade off of reducing the branching factor of internal nodes from 16 to 10. The decreased branching factor is due to the node sizes being the same and the additional space required to store the maximum gaps. That is to say that arange64 nodes have 10 slots, 10 gaps, and 9 pivots but the node size remains 256 bytes, for cache line efficiency.
Other Information
A node has more information than the pivots and entries; it has information used for tree operations. Information stored within the tree includes:
- Maple Tree Structure
- Contains data about a specific tree
- Node type
- A variety of node types are supported
- Internal entry encodings
- Parent pointers
- Metadata
- hints for certain operations, if space in the node allows
Some items have been left off the list for future discussions, but can be safely ignored for now.
Density of information
Nodes need to store the pointers to the data (entries) or other nodes, and a pointer to a parent node (or the tree for the root node). The main factor that decided the size of the node was the CPU cache line size, typically 64 bytes. After testing various size nodes, the decision was made to use 256 bytes (4 cache lines) to keep a node relatively small while keeping the overhead of a parent pointer low. The node type and metadata have been cleverly encoded without impacting the size of the contents of the node; the node memory alignment allows for lower bits of the address to be utilised for other purposes.
The Maple Tree Structure
The Maple Tree structure itself contains three things:
- The write lock
- The pointer to the root node
- The Maple Tree flags
The Write Lock
The write lock is for writing as the tree only supports a single write operation at a time. The write lock also protects the Maple Tree flags.
The Root Node
The root node is the first node in the tree, which may be a leaf node, an internal node containing more nodes, a NULL pointer for an empty tree, or a simple pointer in the special case of a single entry at index 0.
The Maple Tree Flags
The Maple Tree flags tells the code about the current tree. This includes:
- The tree height
- If the tree is tracking gaps
- If the tree is in RCU mode
A tree of height 0 means there is no node at all, it’s just a pointer. If a store to 0 of size 1 occurs, then the entry will simply be stored where the root node would otherwise be stored. This is an optimisation for a single entry store at 0. Instead of having a root with the following contents: :
contents: 0x000DEADBEEF0 0 (nil) 18446744073709551615 (nil) 0 (nil) 0 ... └──────────────┘ └────────────────────────┘
The tree would simply point to 0x000DEADBEEF0 directly.
When walking the tree, the depth into the tree starts at 0, and not at 1. This differs from the height, which does start at 1 when there is a node. The depth would be 0 at the root node, and 1 at the children of the root node.
Node Type
One method of increasing the density of the information stored within a node is to use bits of the stored address to encode information. This is made possible by aligning the node memory addresses to 256 bytes, yielding 8 bits (0 – 7) for internal tree use. The lower 8 bits are used to encode information such as the type of node.
All nodes are the same size and the same alignment. The parent pointer within the child node also has room for 8 bits of encoded information. The encoding of the parent pointer and child pointer differ, but the memory address can easily be retrieved from both with the same bit mask.
It is important to note that the addresses stored within an internal node is not strictly a memory address when reading a tree dump. There are memory addresses and memory addresses with embedded encoded information; both are present in a dump.
An internal node 0x615000000600
may contain the contents 0x61500000080c
, which means the child node is at the memory address 0x615000000800
. The address can be obtained by stripping the encoded data.
Likewise, the child 0x615000000800
may have a parent pointer of 0x615000000606
, which does point back to the correct address after removing the encoding. The code leverages the compiler type checking to verify these addresses are not confused.
Metadata
Frequently the nodes are not completely full. If the nodes are not full, then the last slot
in a node may be used for metadata storage, depending on the node type. arange64 nodes have space for metadata already, so they do not need to use the last slot as a storage area for this data. Having this metadata is used to speed up some operations and does show up in tree dumps for debugging.
The added metadata allows for fewer instructions while iterating over a node by avoiding checking the pivot for maximums. The largest gap tracking allows for a more efficient means to update the parent and knowing when to stop walking up the tree for those updates.
The metadata contains the offset
of the end of the data in the node, arange64 nodes also store the location of the largest gap.
arange64 metadata example: :
contents: 4 18446744073709551510 ... | 01 01| 0x61500000080c 45 0x61500000030c 18446744073709551615 ... └┬┘└──────┬───────────┘ └┬┘└┬┘ └─────┬──────┘ └┬┘└──────┬─────┘ └───────┬───────────┘ │ │ │ │ │ │ │ │ └─ gap 0 └─ gap 1 │ │ └─ slot 0 │ └─ slot 1 │ │ └─ largest gap └─ pivot 0 └─ pivot 1 └─ end of data
Where ...
denotes the removal (nil)
and 0
‘s for ease of readability.
The end of the data is at offset 1
, which also contains the largest gap
. The two values in the metadata reflect these positions.
Dumping the Tree
Having a general idea of how information is stored within the tree is vital to understanding the output of a tree dump. The Maple Tree dump will be dense with information, since the tree itself is so densely packed.
The dump supports both printing the pivots in hexadecimal and decimal, using the defined values mt_dump_hex
and mt_dump_dec
, respectively. In the examples decimal is used.
Entries in Leaves
Leaf nodes are where the user data is stored. Other internal nodes serve to find the correct leaf quickly. Once at the leaf, the Maple Tree dump will output the ranges as well as the contents.
Looking at the example from before: :
0x7f4ac7b00020 5 (nil) 9 0x7f3bc7b00020 15 (nil) ∞ └────────────────┘└──────┘└────────────────┘└──────┘
Would produce a node summary of the contents
followed by the verbose output: :
0-5: 0x7f3bc7b00020 6-9: (nil) 10-15: 0x7f3bc7b00020 16-18446744073709551615: (nil)
Tree Dump Example
Creating a tree with gaps and ranges will produce a tree dump with most of the desired traits to examine.
void test_tree() { unsigned long i, max = 10; DEFINE_MTREE(tree); mt_init_flags(&tree, 0); /* Create a tree with gaps */ for (i = 0; i <= max; i++) mtree_store_range(&tree, i * 10, i * 10 + 5, &i, GFP_KERNEL); /* Dump the tree in decimal (hex available too) */ mt_dump(&tree, mt_dump_dec); /* Clean up */ mtree_destroy(&tree); }
The resulting tree dump: :
maple_tree(0x7f3bc7b00040) flags 8, height 2 root 0x615000000616 0-18446744073709551615: node 0x615000000600 depth 0 type 2 parent 0x7f3bc7b00041 contents: 0x61500000080c 45 0x61500000030c 18446744073709551615 (nil) 0 (nil) 0 (nil) 0 (nil) 0 (nil) 0 (nil) 0 (nil) 0 (nil) 0 (nil) 0 (nil) 0 (nil) 0 (nil) 0 (nil) 0 0x1 0-45: node 0x615000000800 depth 1 type 1 parent 0x615000000606 contents: 0x7f3bc7b00020 5 (nil) 9 0x7f3bc7b00020 15 (nil) 19 0x7f3bc7b00020 25 (nil) 29 0x7f3bc7b00020 35 (nil) 39 0x7f3bc7b00020 45 (nil) 0 (nil) 0 (nil) 0 (nil) 0 (nil) 0 (nil) 0 0x8 0-5: 0x7f3bc7b00020 6-9: (nil) 10-15: 0x7f3bc7b00020 16-19: (nil) 20-25: 0x7f3bc7b00020 26-29: (nil) 30-35: 0x7f3bc7b00020 36-39: (nil) 40-45: 0x7f3bc7b00020 46-18446744073709551615: node 0x615000000300 depth 1 type 1 parent 0x61500000060e contents: (nil) 49 0x7f3bc7b00020 55 (nil) 59 0x7f3bc7b00020 65 (nil) 69 0x7f3bc7b00020 75 (nil) 79 0x7f3bc7b00020 85 (nil) 89 0x7f3bc7b00020 95 (nil) 99 0x7f3bc7b00020 105 (nil) 18446744073709551615 (nil) 0 (nil) 0 0xc 46-49: (nil) 50-55: 0x7f3bc7b00020 56-59: (nil) 60-65: 0x7f3bc7b00020 66-69: (nil) 70-75: 0x7f3bc7b00020 76-79: (nil) 80-85: 0x7f3bc7b00020 86-89: (nil) 90-95: 0x7f3bc7b00020 96-99: (nil) 100-105: 0x7f3bc7b00020 106-18446744073709551615: (nil)
Examining the first line of output: :
maple_tree(0x7f3bc7b00040) flags 8, height 2 root 0x615000000616 └──────┬───────┘ └┬┘ └┬┘ └─────┬──────┘ │ │ │ └─ The encoded root node │ │ └─ There are two levels of the tree (0 and 1 depth) │ └─ The tree flags └─ The tree memory location
The next line contains the root node summary: :
0-18446744073709551615: node 0x615000000600 depth 0 type 2 parent 0x7f3bc7b00041 └─────────┬────────────┘ └──────┬───────┘ └┬┘ └┬┘ └──────┬─────┘ └─ The node range │ │ └─ node type └─ Encoded parent pointer │ │ │ └─ Depth of this node in the tree │ └─ node memory address (encoded node after bitmask)
The encoded parent pointer is close to the Maple Tree memory location of the previous line, but the encoded information is stating that this is the root node.
A tree of height 2 has nodes at depth 0 and 1. If the tree were of height 0, it would either be empty (if the root pointer is NULL), or have a single entry at index 0, length 1. This is an important optimisation for some users who usually have a single entry, but might have many entries.
The contents summary is next: :
contents: 0x61500000080c 45 0x61500000030c 18446744073709551615 (nil) 0 (nil) ... 0 (nil) 0 0x1 └──────┬──────┘└┬┘ └─────┬──────┘ └─────┬────────────┘ └┬┘ │ │ │ │ └─ Metadata │ │ │ └─ Pivot 1 │ │ │ │ │ └─ encoded node at offset 1 │ │ │ └─ Pivot 0 │ └─ encoded node at offset 0, node 0's address can be seen in the next line: 0-45: node 0x615000000800
Where ...
denotes the removal (nil)
and 0
‘s for ease of readability.
The slots
are the encoded nodes; the memory addresses with the encoded information in bits 0 – 7.
The metadata in this node type (range64) can be stored in slot 15 when pivot 14 is 0 or the implied maximum. See ma_data_end(). The metadata is specifying the end of the node is offset 1
.
As previously stated, the encoded pointers (both the parent pointer and slot
contents) rely on the nodes being 256 byte aligned and yields 8 bits (bits 0 – 7) of the address space for internal tree uses. The lower bits indicate information about the node type. The encoded parent pointer contains the offset
in the parent where the node is stored.
The encoded addresses can be seen in the output for the node in slot 0
, which is 0x61500000080c
in the root node: :
0-45: node 0x615000000800 depth 1 type 1 parent 0x615000000606 contents: ... └┬─┘ └─────┬────────┘ └───────┬──────┘ │ └─ Slot 0 decoded └─ encoded parent └─ Pivot 0 from parent
Singleton Tree Dump Example
The tree also supports a singletons; ranges with a size of 1:
void test_singleton_tree() { unsigned long i, max = 10; DEFINE_MTREE(tree); mt_init_flags(&tree, 0); /* Create a tree WITHOUT gaps */ for (i = 0; i <= max; i++) mtree_store_range(&tree, i, i, &i, GFP_KERNEL); /* Dump the tree in decimal (hex available too) */ mt_dump(&tree, mt_dump_dec); /* Clean up */ mtree_destroy(&tree); }
maple_tree(0x7f8e9b300040) flags 4, height 1 root 0x61500000010e 0-18446744073709551615: node 0x615000000100 depth 0 type 1 parent 0x7f8e9b300041 contents: 0x7f8e9b300020 0 0x7f8e9b300020 1 0x7f8e9b300020 2 0x7f8e9b300020 3 0x7f8e9b300020 4 0x7f8e9b300020 5 0x7f8e9b300020 6 0x7f8e9b300020 7 0x7f8e9b300020 8 0x7f8e9b300020 9 0x7f8e9b300020 10 (nil) 18446744073709551615 (nil) 0 (nil) 0 (nil) 0 0xb 0: 0x7f8e9b300020 1: 0x7f8e9b300020 2: 0x7f8e9b300020 3: 0x7f8e9b300020 4: 0x7f8e9b300020 5: 0x7f8e9b300020 6: 0x7f8e9b300020 7: 0x7f8e9b300020 8: 0x7f8e9b300020 9: 0x7f8e9b300020 10: 0x7f8e9b300020 11-18446744073709551615: (nil)
This example shows the implied pivot
is the same as the pivot 0
, and the metadata of 0xb
in slot 15
indicates the data ends at slot 11
.
Gap Tracking Tree Dump Example
Tracking gaps in the same tree allows for finding space for a store operation within a range relatively quickly, often with a single tree walk. When storing gaps, the branching factor is reduced to 10 slots which leaves room for metadata outside of the pivots. This means that the largest gap and end of the node can reliably be stored regardless of the node being full or not.
void test_alloc_tree() { unsigned long i, max = 10; DEFINE_MTREE(tree); /* The flags are new */ mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE); /* Create a tree with gaps */ for (i = 0; i <= max; i++) mtree_store_range(&tree, i * 10, i * 10 + 5, &i, GFP_KERNEL); /* Dump the tree in decimal (hex available too) */ mt_dump(&tree, mt_dump_dec); /* Clean up */ mtree_destroy(&tree); }
maple_tree(0x7ff25be00040) flags 9, height 2 root 0x61500000061e 0-18446744073709551615: node 0x615000000600 depth 0 type 3 parent 0x7ff25be00041 contents: 4 18446744073709551510 0 0 0 0 0 0 0 0 | 01 01| 0x61500000080c 45 0x61500000030c 18446744073709551615 (nil) 0 (nil) 0 (nil) 0 (nil) 0 (nil) 0 (nil) 0 (nil) 0 (nil) 0-45: node 0x615000000800 depth 1 type 1 parent 0x615000000606 contents: 0x7ff25be00020 5 (nil) 9 0x7ff25be00020 15 (nil) 19 0x7ff25be00020 25 (nil) 29 0x7ff25be00020 35 (nil) 39 0x7ff25be00020 45 (nil) 0 (nil) 0 (nil) 0 (nil) 0 (nil) 0 (nil) 0 0x8 0-5: 0x7ff25be00020 6-9: (nil) 10-15: 0x7ff25be00020 16-19: (nil) 20-25: 0x7ff25be00020 26-29: (nil) 30-35: 0x7ff25be00020 36-39: (nil) 40-45: 0x7ff25be00020 46-18446744073709551615: node 0x615000000300 depth 1 type 1 parent 0x61500000060e contents: (nil) 49 0x7ff25be00020 55 (nil) 59 0x7ff25be00020 65 (nil) 69 0x7ff25be00020 75 (nil) 79 0x7ff25be00020 85 (nil) 89 0x7ff25be00020 95 (nil) 99 0x7ff25be00020 105 (nil) 18446744073709551615 (nil) 0 (nil) 0 0xc 46-49: (nil) 50-55: 0x7ff25be00020 56-59: (nil) 60-65: 0x7ff25be00020 66-69: (nil) 70-75: 0x7ff25be00020 76-79: (nil) 80-85: 0x7ff25be00020 86-89: (nil) 90-95: 0x7ff25be00020 96-99: (nil) 100-105: 0x7ff25be00020 106-18446744073709551615: (nil)
The contents of the non-leaf node (the root node, in this case) differs when tracking gaps: :
contents: 4 18446744073709551510 0 0 0 0 0 0 0 0 | 01 01| 0x61500000080c 45 0x61500000030c 18446744073709551615 (nil) 0 (nil) 0 (nil) 0 (nil) 0 (nil) 0 (nil) 0 (nil) 0 (nil) └┬┘└──────┬───────────┘ └┬┘└┬┘ └──────┬─────┘ └┬┘└──────┬─────┘ └────────┬──────────┘ │ └─ Largest gap in slot 1 subtree │ │ │ │ │ └─ Pivot 1 │ │ │ │ │ │ └─ Largest gap in slot 0 subtree │ │ │ │ └─ Slot 1 (encoded node) │ │ │ └─ Pivot 0 │ │ │ │ │ └─ Slot 0 (encoded node) │ │ │ └─ Largest gap is in offset 1 │ └─ End of node data
When searching for space to store a given range, the gaps can be searched within a range at a higher level in the tree. This can reduce the number of cache lines occupied by reading the tree nodes, and reduce the number of entries searched for a specific gap size.
Conclusion
Though the Maple Tree data structure has a complex visualisation, the layout of nodes and the density of information are key in maximizing CPU cache utility while minimizing the cache footprint.
The debug output and the storage ideas are simple when presented individually. By approaching each element separately, the output is easier to read and continues to be a valuable tool in replacing existing data structures, and with the development of the Maple Tree itself.
Resources
- LPC 2019 – The Maple Tree – Not Just For Delicious Pancakes
- LPC 2021 – The Maple Tree – Enhances the Natural Flavour of Waffles
- LPC 2022 – The Maple Tree – Condensing 40 Liters of Data into 1
- Oracle Linux Blog – The Maple Tree, A Modern Data Structure for a Complex Problem
- LWN – Introducing maple trees