Introduction
Last year, I covered features in Linux kernel 5.0 that we thought were worth highlighting. Unbreakable Enterprise Kernel 6 is based on stable kernel 5.4 and was recently made available as a developer preview. So, now is as good a time as any to review developments that have occurred since 5.0. While the features below are roughly in chronological order, there is no significance to the order otherwise.
BPF spinlock patches
BPF (Berkeley Packet Filter) spinlock patches give BPF programs increased control over concurrency. Learn more about BPF and how to use it in this seven-part series by Oracle developer Alan Maguire.
Btrfs ZSTD compression
The Btrfs filesystem now supports the use of multiple ZSTD (Zstandard) compression levels. See this commit for some information about the feature and the performance characteristics of the various levels.
Memory compaction improvements
Memory compaction has been reworked, resulting in significant improvements in compaction success rates and in the CPU time required. In benchmarks that try to allocate Transparent HugePages in deliberately fragmented virtual memory, the number of pages scanned for migration was reduced by 65% and free-scanner activity was reduced by 97.5%.
io_uring API for high-performance async IO
The io_uring API has been added, providing a new (and hopefully better) way of achieving high-performance asynchronous I/O.
Build improvements to avoid unnecessary retpolines
GCC can compile switch statements into indirect jumps, which can end up using retpolines on x86 systems. The resulting slowdown is evidently inspiring some developers to recode switch statements as long if-then-else sequences. In 5.1, the compiler’s case-values-threshold parameter is raised to 20 for builds using retpolines, meaning that GCC will not create indirect jumps for switch statements with fewer than 20 branches. This addresses the performance issue without the need for code changes that might well slow things down on non-retpoline systems. See patch.
Improved fanotify() to efficiently detect changes on large filesystems
fanotify is a mechanism for monitoring filesystem events. This improvement enables watching the superblock root, so that a listener is notified when any file changes anywhere on the filesystem. See patch.
Higher frequency Pressure Stall Information monitoring
First introduced in 4.20, Pressure Stall Information (PSI) tells a system administrator how much wall clock time an application spends, on average, waiting for system resources such as memory or CPU. This view into how resource-constrained a system is can help prevent catastrophe. Whereas previously PSI only reported averages for fixed, relatively large time windows, these improvements enable user-defined and more fine-grained measurements as well as mechanisms to be notified when thresholds are reached. For more information, see this article.
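To make the averages concrete, here is a minimal sketch of reading a PSI file. It assumes the line format described in the kernel's PSI documentation ("some"/"full" lines carrying avg10, avg60, avg300 and total fields); the helper name parse_psi and the sample numbers are ours, purely for illustration.

```python
"""Parse the contents of a PSI file such as /proc/pressure/memory."""

# Illustrative sample only; these numbers are made up.
SAMPLE = (
    "some avg10=0.12 avg60=0.05 avg300=0.01 total=4321\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=1234\n"
)


def parse_psi(text):
    """Return a dict like {"some": {"avg10": 0.12, ..., "total": 4321}, ...}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        values = {}
        for pair in pairs:
            key, _, raw = pair.partition("=")
            # "total" is cumulative stall time in microseconds; the avg*
            # fields are percentages over 10, 60 and 300 second windows.
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


if __name__ == "__main__":
    try:
        with open("/proc/pressure/memory") as psi_file:
            print(parse_psi(psi_file.read()))
    except OSError:
        # PSI is missing or disabled on this kernel: fall back to the sample.
        print(parse_psi(SAMPLE))
```

The same parser should work for the cpu and io files in /proc/pressure, since they share the format.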
devlink health notifications
The new “devlink health” mechanism provides notifications when an interface device has problems. See this merge commit and this documentation for details.
BPF optimizations
The BPF verifier has seen some optimization work that yields a 20x speedup on large programs. That has enabled an increase in the maximum program size (for the root user) from 4096 instructions to 1,000,000. Read more about the BPF Verifier here.
Pressure stall monitors
Pressure stall monitors, which allow user space to detect and respond quickly to memory pressure, have been added. See this commit for documentation and a sample program.
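As a rough illustration of how a monitor is registered, the sketch below follows the flow in the kernel's PSI documentation: write a trigger string to a pressure file, then poll the descriptor for POLLPRI. The function names are ours, the window limits (500 ms to 10 s) are taken from that documentation, and on many systems registering a trigger needs elevated privileges, so treat this as a starting point rather than a finished tool.

```python
import os
import select

# Window limits as stated in the kernel PSI documentation (assumption:
# unchanged on the kernel you are running).
MIN_WINDOW_US = 500_000       # 500 ms
MAX_WINDOW_US = 10_000_000    # 10 s


def make_trigger(kind, stall_us, window_us):
    """Build a trigger such as "some 150000 1000000"."""
    if kind not in ("some", "full"):
        raise ValueError("kind must be 'some' or 'full'")
    if not MIN_WINDOW_US <= window_us <= MAX_WINDOW_US:
        raise ValueError("window must be between 500 ms and 10 s")
    if not 0 < stall_us <= window_us:
        raise ValueError("stall threshold must fit inside the window")
    return f"{kind} {stall_us} {window_us}"


def wait_for_pressure(path, trigger, timeout_ms):
    """Register the trigger on path; return True if it fires before the timeout."""
    fd = os.open(path, os.O_RDWR | os.O_NONBLOCK)
    try:
        # The documented C example writes the terminating NUL as well.
        os.write(fd, trigger.encode() + b"\0")
        poller = select.poll()
        poller.register(fd, select.POLLPRI)
        for _fd, event in poller.poll(timeout_ms):
            if event & select.POLLERR:
                raise OSError("PSI event source is no longer available")
            if event & select.POLLPRI:
                return True
        return False
    finally:
        os.close(fd)


if __name__ == "__main__":
    # 150 ms of partial stall within any 1 s window.
    print(make_trigger("some", 150_000, 1_000_000))
```

A caller would pass the trigger to wait_for_pressure("/proc/pressure/memory", trigger, timeout_ms) and react (shed load, free caches) when it returns True.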
MM optimizations to reduce unnecessary cache line movements/TLB misses
Optimizations to the memory management code reduce unnecessary cache line movements and TLB (translation lookaside buffer) misses. More details in this commit.
Control Group v2 enhancements
Control Group, or cgroup, is a kernel feature that enables hierarchical grouping of processes such that their use of system resources (memory, CPU, I/O, etc.) can be controlled, monitored and limited.
Version 1 of this feature has been in the kernel for a long time and is a crucial element of the implementation of containers in Linux.
Version 2 or cgroup v2 is a re-work of control group, under development since version 4 of the kernel, that intends to remove inconsistencies and enable better resource isolation and better management for containers. Some of its characteristics include:
- unified hierarchy
- better support for rootless, unprivileged containers
- secure delegation of cgroups
See also this documentation.
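One visible consequence of the unified hierarchy is how a process's membership shows up in /proc/self/cgroup. According to the cgroups(7) manual page, the cgroup v2 entry has hierarchy ID 0 and an empty controller list (a line of the form "0::/path"). The sketch below picks that entry out; the function name and the sample text are ours.

```python
# Illustrative sample of a "hybrid" system with both v1 and v2 hierarchies.
SAMPLE_HYBRID = (
    "12:memory:/user.slice\n"
    "1:name=systemd:/user.slice/session-1.scope\n"
    "0::/user.slice/session-1.scope\n"
)


def cgroup_v2_path(proc_cgroup_text):
    """Return the cgroup v2 path from /proc/<pid>/cgroup text, or None."""
    for line in proc_cgroup_text.splitlines():
        # Each line is "hierarchy-ID:controller-list:cgroup-path".
        parts = line.split(":", 2)
        if len(parts) == 3 and parts[0] == "0" and parts[1] == "":
            return parts[2]
    return None


if __name__ == "__main__":
    try:
        with open("/proc/self/cgroup") as cgroup_file:
            print(cgroup_v2_path(cgroup_file.read()))
    except OSError:
        # Not on Linux, or /proc is unavailable: use the sample instead.
        print(cgroup_v2_path(SAMPLE_HYBRID))
```

On a system that only mounts v1 hierarchies the function returns None, which is a cheap way to check whether v2 is in use at all.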
Power efficient userspace waiting
The x86 umonitor, umwait, and tpause instructions are now available to user-space code; they make it possible to execute small delays efficiently, without busy loops, on Intel “Tremont” chips. Thus, applications can employ short waits while using less power and with reduced impact on the performance of other hyperthreads. A tunable has been provided to allow system administrators to control the maximum period for which the CPU can be paused.
pidfd_open() system call
The pidfd_open() system call has been added; it allows a process to obtain a pidfd for another, existing process. It is also now possible to use poll() on a pidfd to get notification when the associated process dies.
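Here is a small sketch of the poll() usage, written against Python's os.pidfd_open() wrapper (available in Python 3.9 and later on Linux). The function name wait_for_exit is ours, and returning None when the call is unavailable is just a convenience for this example.

```python
import os
import select
import subprocess
import sys


def wait_for_exit(pid, timeout_ms):
    """Return True if pid exits within timeout_ms, False on timeout,
    or None if pidfd_open() is unavailable here."""
    if not hasattr(os, "pidfd_open"):
        return None                      # Python < 3.9 or not Linux
    try:
        pidfd = os.pidfd_open(pid)
    except ProcessLookupError:
        raise                            # no such process: a real error
    except OSError:
        return None                      # old kernel, or blocked by a sandbox
    try:
        poller = select.poll()
        # A pidfd becomes readable once the process has terminated.
        poller.register(pidfd, select.POLLIN)
        return bool(poller.poll(timeout_ms))
    finally:
        os.close(pidfd)


if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    print("child exited:", wait_for_exit(child.pid, 10_000))
    child.wait()
```

Because the pidfd is an ordinary file descriptor, it can be added to an existing poll or epoll loop alongside sockets and pipes.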
kdump support for AMD Secure Memory Encryption (SME)
See this article for more details.
Exposing knfsd state to userspace
The NFSv4 server now creates a directory under /proc/fs/nfsd/clients with information about current NFS clients, including which files they have open. Previously, it was not possible to get information about files held open by NFSv4 clients. See patch.
haltpoll CPU idle governor
The “haltpoll” CPU idle governor has been merged. This governor will poll for a while before halting an otherwise idle CPU; it is intended for virtualized guest applications where it can improve performance by avoiding exits to the hypervisor. See this commit for some more information.
New madvise() commands
There are two new madvise() commands to force the kernel to reclaim specific pages. MADV_COLD moves the indicated pages to the inactive list, essentially marking them unused and suitable targets for page reclaim. A stronger variant is MADV_PAGEOUT, which causes the pages to be reclaimed immediately.
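The two hints can be tried from Python through mmap.madvise() (Python 3.8 and later). The standard mmap module may not export these constants, so the sketch below falls back to the numeric values from the Linux UAPI headers (20 and 21); that fallback, and the helper name hint, are our assumptions. On kernels older than 5.4 the call fails and the helper simply reports False.

```python
import mmap

# Values from the Linux UAPI header mman-common.h, used only if the
# mmap module does not provide the names itself.
MADV_COLD = getattr(mmap, "MADV_COLD", 20)
MADV_PAGEOUT = getattr(mmap, "MADV_PAGEOUT", 21)


def hint(mapping, advice):
    """Apply advice to the whole mapping; return True if the kernel accepted it."""
    if not hasattr(mapping, "madvise"):
        return False                     # Python < 3.8
    try:
        mapping.madvise(advice)
        return True
    except OSError:
        return False                     # e.g. EINVAL on kernels before 5.4


if __name__ == "__main__":
    buf = mmap.mmap(-1, mmap.PAGESIZE * 4,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    buf[:5] = b"hello"
    print("MADV_COLD accepted:", hint(buf, MADV_COLD))
    print("MADV_PAGEOUT accepted:", hint(buf, MADV_PAGEOUT))
    # Both hints only affect where the pages live, never their contents.
    print(buf[:5])
    buf.close()
```

Neither hint discards data, which is what separates them from MADV_DONTNEED: the pages remain valid and are faulted back in on the next access.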
dm-clone target
The new dm-clone target makes a copy of an existing read-only device. “The main use case of dm-clone is to clone a potentially remote, high-latency, read-only, archival-type block device into a writable, fast, primary-type device for fast, low-latency I/O”. More information can be found in this commit.
virtio-fs
Virtio-fs is a shared file system that lets virtual machines access a directory tree on the host. See this document and this commit message for more information.
Kernel lockdown
Kernel lockdown seeks to strengthen guarantees that a system is running the software intended by its owner. The idea is to build on protections offered at boot time (e.g. by UEFI secure boot) and extend them so that no program can modify the running kernel. This has recently been implemented as a security module.
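If you want to check what mode a running system is in, the lockdown module exposes its state through securityfs. The sketch below assumes the file is at /sys/kernel/security/lockdown and lists the modes with the active one in square brackets (for example "none [integrity] confidentiality"), as described in the kernel_lockdown(7) manual page; the function name is ours.

```python
import re


def active_lockdown_mode(text):
    """Return the bracketed mode from the lockdown file's contents, or None."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


if __name__ == "__main__":
    try:
        with open("/sys/kernel/security/lockdown") as lockdown_file:
            print(active_lockdown_mode(lockdown_file.read()))
    except OSError:
        # securityfs not mounted, or a kernel without the lockdown module.
        print("lockdown state not available")
```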
Improved AMD EPYC scheduler/load balancing
Fixes ensure that the scheduler properly balances load across NUMA nodes on different sockets. See commit message.
Preparations for realtime preemption
Those who need realtime support in Linux have to this day had to settle for using the out-of-tree patchset PREEMPT_RT. 5.4 saw a number of patches preparing the kernel for native PREEMPT_RT support.
pidfd API
pidfd is a new concept in the kernel that represents a process as a file descriptor. As described in this article, the primary purpose is to prevent the delivery of signals to the wrong process should the target exit and be replaced, at the same ID, by an unrelated process. This problem is known as PID recycling.
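To show what signalling through a pidfd looks like in practice, here is a minimal sketch using Python's os.pidfd_open() and signal.pidfd_send_signal() wrappers (both added in Python 3.9, Linux only). The helper name and its None/True/False return convention are ours.

```python
import os
import signal
import subprocess
import sys


def signal_via_pidfd(pid, sig=signal.SIGTERM):
    """Send sig to pid through a pidfd.

    Returns True if the signal was sent, False if the process was already
    gone, or None if the pidfd calls are unavailable on this system.
    """
    if not (hasattr(os, "pidfd_open") and hasattr(signal, "pidfd_send_signal")):
        return None
    try:
        pidfd = os.pidfd_open(pid)
    except ProcessLookupError:
        return False
    except OSError:
        return None                      # old kernel, or blocked by a sandbox
    try:
        # The pidfd refers to this exact process. If it exits and its PID is
        # reused, the call fails instead of hitting the new process.
        signal.pidfd_send_signal(pidfd, sig)
        return True
    except ProcessLookupError:
        return False
    finally:
        os.close(pidfd)


if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    print("signal sent:", signal_via_pidfd(child.pid))
    child.kill()                         # make sure the demo child never lingers
    child.wait()
```

The window between looking up a PID and signalling it still exists when the pidfd is opened, so the strongest guarantee comes from obtaining the pidfd as early as possible and holding on to it.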
Conclusion
In slightly less than a year, a lot has happened in mainline kernel development. While the features covered here represent a mere subset of all the work that went into the kernel since 5.0, we thought they were noteworthy. If there are features you think we missed, please let us know in the comments!
Acknowledgments
- lwn.net
- kernelnewbies.org
- Chuck Anderson, Oracle
- Scott Davenport, Oracle