The Unbreakable Enterprise Kernel (UEK) 7 was based on the 5.15 upstream Linux kernel, released on 31-Oct-21. UEK8 is based on the 6.12 long term stable (LTS) kernel, released on 17-Nov-24. That is just over a three-year gap, and what a difference it makes in terms of functionality! This blog highlights just a few of the changes made in the intervening kernels that the Oracle Linux core kernel team worked on or finds interesting or notable. The changes fall into the areas of memory management, the scheduler, and cgroups.

Memory Management

Folios data structure introduced

A folio is a new data structure that represents one or more pages of memory. There are two main reasons why the folio structure is important: type safety and reduced memory overhead. This is a large project with many parts.

Folios get around the multiple meanings of “page” that led to a class of bugs caused by type confusion, where overloaded structure members were used in the wrong capacity. The added type safety also protects against passing the wrong ‘page’ (for example, a tail page of a compound page) into set_page_dirty() or other functions.

The longer-term goal is to reduce the memory used to track pages by shrinking struct page itself. As filesystem and mm developers remove uses of struct page members, the amount of memory needed to track pages shrinks. Each reduction of struct page can be viewed and tackled as its own project. Expect to hear more about the savings folios provide in future releases.
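
To make the shift concrete, here is a minimal sketch of how code that once took a struct page moves to the folio interface. The helper function itself is hypothetical; the folio calls (page_folio(), folio_mark_dirty(), folio_nr_pages()) are the upstream interfaces.

```c
/*
 * Illustrative sketch only: how code that operated on struct page moves to
 * the folio API. The helper is hypothetical; the folio calls are upstream.
 */
#include <linux/mm.h>
#include <linux/pagemap.h>

static void example_mark_dirty(struct page *page)
{
    /* Resolve the folio that contains this page (the head page for compound pages). */
    struct folio *folio = page_folio(page);

    /*
     * folio_mark_dirty() cannot be handed a tail page by mistake,
     * unlike the old set_page_dirty(page) call.
     */
    folio_mark_dirty(folio);

    /* A folio may span multiple base pages. */
    pr_debug("dirtied %ld pages\n", folio_nr_pages(folio));
}
```

Because the compiler now distinguishes struct folio from struct page, a whole class of tail-page mix-ups is caught at build time rather than at runtime.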

Maple Tree

Virtual memory area (VMA) handling in the Linux kernel has long suffered lock contention on a read-write semaphore called the mmap lock (or mmap_sem). The maple tree was developed to reduce that contention by allowing readers to operate while writers work in parallel.

The maple tree is an RCU-safe B-tree that efficiently stores ranges. It replaces the three data structures previously used to track VMAs: an rbtree, a doubly linked list, and the vmacache.

The RCU-safe design of the maple tree, along with the removal of those extra data structures, has been leveraged in the design of per-VMA locking, which avoids taking the mmap lock in most page faults. From kernel 6.4 through 6.6, handling for more page fault types was added to further reduce lock contention. Other areas of the kernel (such as certain userfaultfd uses) have followed suit to reduce the contention even more.

The simplified tree interface also means more kernel users can stop rolling their own linked lists and interval trees (using an interval tree when ranges never overlap is a common anti-pattern).
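
As an illustration, here is a minimal sketch of the in-kernel maple tree interface. The tree name, index range, and stored value are arbitrary examples, and error handling is reduced to the essentials.

```c
/*
 * Minimal sketch of the in-kernel maple tree interface from
 * <linux/maple_tree.h>; the tree, range, and payload are example values.
 */
#include <linux/maple_tree.h>
#include <linux/gfp.h>

static DEFINE_MTREE(example_mt);
static int example_data = 42;

static int example_maple_tree_use(void)
{
    void *entry;
    int ret;

    /* Store one entry covering the whole index range [0x1000, 0x1fff]. */
    ret = mtree_store_range(&example_mt, 0x1000, 0x1fff,
                            &example_data, GFP_KERNEL);
    if (ret)
        return ret;

    /* Any index inside the range returns the stored entry. */
    entry = mtree_load(&example_mt, 0x1800);

    /* Remove the entry covering this index, then free the tree. */
    mtree_erase(&example_mt, 0x1000);
    mtree_destroy(&example_mt);

    return entry ? 0 : -ENOENT;
}
```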

Because the tree tracks ranges, it offers significant space savings over the radix tree when dealing with sparse data. Future developments will leverage these savings in other areas.

Multi-generational LRU (MGLRU)

Linux traditionally maintains two lists of least recently used (LRU) pages. The LRU is a way to prioritize which pages should be kept in memory for fast access, effectively predicting what will most likely be needed soon.

The multi-generational LRU is an alternative way to track these pages so that reclaim works better when memory pressure arises. Instead of a single pair of lists, pages are tracked in multiple generations, each with multiple tiers. Some performance tests show hot pages being identified more accurately and less time spent during reclaim.

Anonymous VMA naming

This feature allows anonymous Virtual Memory Areas (VMAs) to be named, making it easier to identify and debug memory-related issues.

Previously, anonymous VMAs were unnamed, making it challenging to determine which part of an application was responsible for a particular memory mapping. With anonymous VMA naming, a process can assign a descriptive name to each anonymous VMA, providing valuable information for debugging and troubleshooting. This feature is particularly useful for developers and system administrators who need to diagnose memory leaks, optimize memory usage, and improve overall system performance. By providing more detailed information about anonymous VMAs, UEK8 offers better support for memory-intensive workloads, such as scientific simulations, data analytics, and machine learning.
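
The naming is done by the process itself through prctl(). Below is a minimal userspace sketch; it assumes kernel headers that define PR_SET_VMA_ANON_NAME and a kernel built with CONFIG_ANON_VMA_NAME, and the mapping size and name are arbitrary examples.

```c
/* Sketch: naming an anonymous mapping via prctl(PR_SET_VMA, ...). */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <linux/prctl.h>

int main(void)
{
    size_t len = 16 * 4096;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    /* The name appears as "[anon:my-cache]" in /proc/self/maps. */
    if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
              (unsigned long)p, len, (unsigned long)"my-cache"))
        perror("prctl(PR_SET_VMA)");

    return 0;
}
```

After the call, the mapping shows up in /proc/&lt;pid&gt;/maps with the given label, so tools that parse maps or smaps can attribute the memory to the right allocation site.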

Per-VMA locking

Per-VMA locks are a significant step forward in memory management scalability. Per-VMA locking allows each Virtual Memory Area (VMA) to have its own lock, rather than relying on a single per-process lock (the mmap lock) for all of that process's VMAs. With per-VMA locks, UEK8 can handle memory management operations, such as page faults and memory mapping, in a more fine-grained and efficient manner, reducing contention and improving concurrency.

This change enables better support for multi-threaded workloads and large-memory systems, and brings performance and scalability benefits to a wide range of use cases, from cloud computing and data centers to high-performance computing and embedded systems. With per-VMA locks, UEK8 scales more efficiently, making it an even more attractive choice for systems that require high levels of performance, reliability, and concurrency.

Introduction of struct ptdesc

UEK8's page table handling has been reworked with the introduction of “struct ptdesc”. The page table descriptor, struct ptdesc, allows for more standardized and flexible management of page tables. By splitting struct ptdesc out from struct page, the kernel can handle page tables across architectures more safely using a standardized API.

This change also opens the door to future optimizations, such as more efficient page fault handling and better support for TLB optimizations. Overall, splitting struct ptdesc out of struct page is a major step in the evolution of the kernel's memory management system and paves the way for performance and scalability benefits across a wide range of workloads.

Handle hugetlb faults under per-vma lock

UEK8's hugetlb memory management has also seen a significant improvement. Previously, the hugetlb fault path required the process's mmap lock to be held. Hugetlb faults are now handled while holding the per-VMA (Virtual Memory Area) lock instead, reducing lock contention. The benefits include improved performance, better scalability, and more efficient fault handling for workloads that make heavy use of hugetlb pages.

Multi-size THP for anonymous memory

Multi-size transparent huge pages (mTHP) enable the allocation of folios larger than the base page size but smaller than PMD size for backing anonymous memory. This benefits performance through fewer page faults and, on some architectures, fewer TLB misses. mTHP is disabled by default but can be enabled at runtime.
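
Each supported mTHP size has its own sysfs knob. The following sketch enables one size; the hugepages-64kB directory and the "madvise" policy are example choices, and the available sizes depend on the architecture and base page size.

```c
/*
 * Sketch: enabling one mTHP size at runtime via its sysfs knob.
 * The hugepages-64kB path and "madvise" policy are example choices.
 */
#include <stdio.h>

int main(void)
{
    const char *knob =
        "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled";
    FILE *f = fopen(knob, "w");

    if (!f) {
        perror("fopen");
        return 1;
    }
    /* Other accepted values include "always", "inherit", and "never". */
    fputs("madvise", f);
    fclose(f);
    return 0;
}
```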

Split underused THPs

Under-utilized THPs are now split to recover unused memory when memory pressure occurs. This benefits configurations with the THP policy set to always, where numerous THPs can end up backing sparsely accessed memory regions.

MADV_COLLAPSE

The MADV_COLLAPSE advice for madvise() provides a way for a user process to directly (best effort) collapse eligible ranges of memory into transparent hugepages. This gives processes more control over THP utilization.
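
A minimal userspace sketch of the call is below; it assumes headers that define MADV_COLLAPSE (added upstream in 6.1), and the 2 MiB size and alignment are example values.

```c
/*
 * Sketch: asking the kernel to (best effort) collapse a range into THPs
 * with madvise(MADV_COLLAPSE).
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25    /* value from <linux/mman.h> on older userspace */
#endif

int main(void)
{
    size_t len = 2UL << 20;             /* one PMD-sized (2 MiB) region */
    void *p = aligned_alloc(len, len);  /* 2 MiB aligned */

    if (!p)
        return 1;
    memset(p, 0, len);                  /* fault the pages in first */

    /* Best effort: may fail if the range is not eligible for collapse. */
    if (madvise(p, len, MADV_COLLAPSE))
        perror("madvise(MADV_COLLAPSE)");

    free(p);
    return 0;
}
```

The kernel may still decline to collapse some or all of the range, so callers should treat the advice as a hint rather than a guarantee.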

MADV_DONTNEED works on hugepages

MADV_DONTNEED now works on hugetlbfs pages. This may be useful for freeing privately mapped hugetlb pages without tearing down the mapping.
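
For example, a process holding a private hugetlb mapping can now return individual huge pages to the hugetlb pool while keeping the mapping in place. The sketch below assumes 2 MiB hugetlb pages have already been reserved (for example via vm.nr_hugepages).

```c
/* Sketch: freeing a privately mapped hugetlb range with MADV_DONTNEED. */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 2UL << 20;     /* one 2 MiB huge page */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }

    /* Touch the page so it is actually backed by a huge page... */
    *(volatile char *)p = 1;

    /* ...then drop it; the huge page goes back to the hugetlb pool. */
    if (madvise(p, len, MADV_DONTNEED))
        perror("madvise(MADV_DONTNEED)");

    munmap(p, len);
    return 0;
}
```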

memcg: further decouple v1 code from v2

The memory cgroup related code for v1 was moved to a separate file and a config option, CONFIG_MEMCG_V1, was added to allow a kernel to be compiled without v1 support.

Userspace controls soft-offline pages

UEK8 adds a new sysctl to the memory correctable error (CE) handling toolbox for Linux system administrators. The new ‘enable_soft_offline’ sysctl complements the existing CE handling capability and is enabled by default. If an administrator decides that reducing latency from CE handling is worthwhile in order to complete a mission-critical task, the kernel's soft-offline capability can be disabled for a period of time. Disabling soft-offline indefinitely without monitoring the CE situation is not recommended.
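
A minimal sketch of such a temporary override is shown below; it assumes the sysctl is exposed at /proc/sys/vm/enable_soft_offline, with 1 meaning enabled (the default) and 0 meaning disabled.

```c
/*
 * Sketch: temporarily disabling soft-offline during a latency-critical
 * window, then restoring the default.
 */
#include <stdio.h>

static int write_sysctl(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");

    if (!f)
        return -1;
    fputs(val, f);
    return fclose(f);
}

int main(void)
{
    const char *knob = "/proc/sys/vm/enable_soft_offline";

    write_sysctl(knob, "0");    /* disable for the critical task */
    /* ... run the latency-sensitive workload here ... */
    write_sysctl(knob, "1");    /* restore the default behavior */
    return 0;
}
```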

hugetlb: alloc/free gigantic folios

This patch series reduced the hugetlb code by more than 200 lines, simplified the gigantic hugetlb page allocation process, and improved performance. Prior to the series, the hugetlb subsystem called alloc_gigantic_folio(), which called cma_alloc() to allocate a series of contiguous small pages that were then bundled into a hugetlb page.

With the change, instead of calling cma_alloc(), alloc_gigantic_folio() calls a new variant, cma_alloc_folio(), passing __GFP_COMP to it. The __GFP_COMP flag is carried through to alloc_contig_range_noprof(), which then bypasses the page split step and goes straight to prep_new_page().

Scheduler

Deadline Server

By default, the Linux kernel restricts realtime processes to no more than 95% of available CPU time. The remaining 5% is reserved for lower-priority processes, typically processes needed to keep the system alive. If the lower-priority processes do not consume the entire 5% allocation, the CPU becomes idle; the extra cycles cannot be given back to the realtime processes.

To address this, the deadline server is now available in UEK8. Because the deadline scheduling class is higher priority than realtime, the deadline server can consume the reserved 5% of CPU. The deadline server runs the lower-priority processes, and when there are no lower-priority processes left to run, the CPU is handed back to the realtime processes. This allows realtime to consume as much CPU as possible without starving lower-priority processes.

EEVDF Scheduler

For 16 years, the default scheduling algorithm in the Linux kernel was the Completely Fair Scheduler (CFS). CFS tracked the amount of CPU runtime consumed by each process; processes that had not run recently, or had accumulated less CPU time, were more likely to be scheduled next. CFS did not, however, have a mechanism for specifying a process's latency requirements.

For UEK8, Earliest Eligible Virtual Deadline First (EEVDF) replaces CFS. EEVDF tracks the difference between the amount of CPU time a process should have been allocated and the amount it has actually been granted; this difference is called lag. EEVDF also calculates a virtual deadline, the point by which a process should next receive its requested time slice. Processes with the most pressing lag and virtual deadline values run before processes with less pressing values. EEVDF's design should provide lower latency than CFS for processes with short time slices.
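
To illustrate the two quantities, the following toy calculation compares a latency-sensitive task that requests short slices with a batch task that requests long ones. This is a simplified userspace model with made-up numbers, not the kernel's implementation.

```c
/*
 * Simplified illustration of EEVDF's lag and virtual deadline bookkeeping.
 * The tasks, weights, and times are invented for the example.
 */
#include <stdio.h>

struct task {
    const char *name;
    double weight;      /* share derived from the nice level */
    double received;    /* CPU time actually granted (ms) */
    double slice;       /* requested time slice (ms) */
};

int main(void)
{
    double elapsed = 100.0;     /* ms of contended CPU time so far */
    struct task t[] = {
        { "latency-sensitive", 1.0, 40.0,  3.0 },
        { "batch",             1.0, 60.0, 30.0 },
    };
    double total_weight = t[0].weight + t[1].weight;

    for (int i = 0; i < 2; i++) {
        /* lag: what the task was owed minus what it actually got */
        double ideal = elapsed * t[i].weight / total_weight;
        double lag = ideal - t[i].received;

        /* virtual deadline: eligible time plus slice scaled by weight */
        double vdeadline = elapsed + t[i].slice / t[i].weight;

        printf("%-17s lag=%+6.1fms vdeadline=%6.1fms\n",
               t[i].name, lag, vdeadline);
    }
    return 0;
}
```

With equal weights, the task requesting the shorter slice ends up with the earlier virtual deadline, which is why EEVDF can favor latency-sensitive tasks without a separate tuning knob.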

sched-ext

Extensible and runtime-modifiable BPF schedulers were recently added to the Linux kernel in a feature called sched-ext. sched-ext is in UEK8 but is currently disabled. Expect more information in subsequent UEK8 updates.

Other scheduler changes

The scheduler in UEK8 is now aware of cluster-level shared resources on modern hardware. This knowledge allows it to better place processes to utilize shared cache resources.

The sched_setaffinity() syscall has been part of the kernel since Linux 2.5.8. Previously, a user-specified CPU affinity could be overwritten by other activity on the system (such as CPU hotplug). In UEK8, the user's requested affinity is preserved across these external events.
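
As a reminder of the interface, here is a minimal sketch that pins the calling thread to a single CPU; the CPU number is an arbitrary example.

```c
/* Sketch: pinning the calling thread to CPU 2 with sched_setaffinity(). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(2, &set);       /* request CPU 2 only */

    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(set), &set)) {
        perror("sched_setaffinity");
        return 1;
    }
    return 0;
}
```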

The UEK8 kernel now reports scheduler statistics for processes utilizing the deadline and realtime scheduler classes.

The energy-aware scheduling logic in UEK8 has been improved and should result in better energy efficiency.

cgroups

The cpuset cgroup v2 controller in UEK8 can now create exclusive cpusets that are not subject to load balancing. Processes in these cgroups should be manually bound to individual CPUs for optimal performance. See cpuset.cpus.partition for more details.
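
A minimal sketch of creating such a partition through the cgroup v2 files follows. It assumes cgroup2 is mounted at /sys/fs/cgroup, that the cpuset controller is already enabled in the parent's cgroup.subtree_control, and that CPUs 2-3 can be dedicated; the group name rt-part is an example.

```c
/* Sketch: carving out an isolated cpuset partition via cgroup v2 files. */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

static int write_file(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");

    if (!f)
        return -1;
    fputs(val, f);
    return fclose(f);
}

int main(void)
{
    mkdir("/sys/fs/cgroup/rt-part", 0755);

    /* Give the partition its own exclusive CPUs... */
    write_file("/sys/fs/cgroup/rt-part/cpuset.cpus", "2-3");

    /* ...and detach them from the scheduler's load balancing. */
    write_file("/sys/fs/cgroup/rt-part/cpuset.cpus.partition", "isolated");

    return 0;
}
```

Processes moved into rt-part can then be bound to CPU 2 or 3 individually, as recommended above.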

The cpuset controller is now aware of CPUs listed in the “isolcpus” boot command line parameter.

A kernel boot command line parameter, cgroup_favordynmods, has been added to UEK8. This feature is useful for workloads that frequently change cgroup parameters, turn cgroup controllers on/off, and migrate processes between cgroups. Note that this makes forks and exits more expensive.

Two new cgroup watermark files, pids.peak and misc.peak, were added in UEK8. Both show the maximum recorded value of that particular resource.

Two local event files, pids.events.local and misc.events.local, were added in UEK8. Both report local events for their respective controllers. (As in previous UEK kernels, the hierarchical forms of these files, pids.events and misc.events, are also available.)
