Introduction

Last year, I covered features in Linux kernel 5.0 that we thought were worth highlighting. Unbreakable Enterprise Kernel 6 is based on stable kernel 5.4 and was recently made available as a developer preview. So, now is as good a time as any to review developments that have occurred since 5.0. While the features below are roughly in chronological order, there is no significance to the order otherwise.

BPF spinlock patches

BPF (Berkeley Packet Filter) spinlock patches give BPF programs increased control over concurrency. Learn more about BPF and how to use it in this seven-part series by Oracle developer Alan Maguire.

Btrfs ZSTD compression

The Btrfs filesystem now supports the use of multiple ZSTD (Zstandard) compression levels. See this commit for some information about the feature and the performance characteristics of the various levels.

Memory compaction improvements

Memory compaction has been reworked, resulting in significant improvements in compaction success rates and in the CPU time required. In benchmarks that try to allocate Transparent HugePages in deliberately fragmented virtual memory, the number of pages scanned for migration was reduced by 65% and scanning by the free scanner was reduced by 97.5%.

io_uring API for high-performance async IO

The io_uring API has been added, providing a new (and hopefully better) way of achieving high-performance asynchronous I/O.

Build improvements to avoid unnecessary retpolines

GCC can compile switch statements into indirect jumps, and on x86 systems those jumps can end up going through retpolines. The resulting slowdown has evidently inspired some developers to recode switch statements as long if-then-else sequences. In 5.1, the compiler’s case-values-threshold was raised to 20 for builds using retpolines, meaning that GCC will not create indirect jumps for switch statements with fewer than 20 branches. This addresses the performance issue without the need for code changes that might well slow things down on non-retpoline systems. See patch.

Improved fanotify() to efficiently detect changes on large filesystems

fanotify is a mechanism for monitoring filesystem events. This improvement makes it possible to watch a filesystem's superblock root and be notified when any file anywhere on that filesystem changes. See patch.

Higher frequency Pressure Stall Information monitoring

First introduced in 4.20, Pressure Stall Information (PSI) tells a system administrator how much wall clock time an application spends, on average, waiting for system resources such as memory or CPU. This view into how resource-constrained a system is can help prevent catastrophe. Whereas previously PSI only reported averages for fixed, relatively large time windows, these improvements enable user-defined and more fine-grained measurements as well as mechanisms to be notified when thresholds are reached. For more information, see this article.
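To make the averages concrete, here is a minimal sketch of reading them, assuming the kernel exposes PSI under /proc/pressure/ in the documented "some"/"full" line format. The helper names parse_psi and read_psi are made up for this illustration and are not part of any kernel or library API.

```python
from typing import Dict


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse the contents of a /proc/pressure/<resource> file.

    Each line is assumed to look like:
        some avg10=0.12 avg60=0.05 avg300=0.01 total=123456
    """
    result: Dict[str, Dict[str, float]] = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        # Turn "avg10=0.12" style fields into {"avg10": 0.12, ...}.
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def read_psi(resource: str = "memory") -> Dict[str, Dict[str, float]]:
    """Read live PSI data; resource is "cpu", "memory", or "io"."""
    with open(f"/proc/pressure/{resource}") as f:
        return parse_psi(f.read())
```

On a kernel built with PSI support, read_psi("memory")["some"]["avg10"] would give the percentage of the last 10 seconds in which at least one task was stalled on memory.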

devlink health notifications

The new “devlink health” mechanism provides notifications when an interface device has problems. See this merge commit and this documentation for details.

BPF optimizations

The BPF verifier has seen some optimization work that yields a 20x speedup on large programs. That has enabled an increase in the maximum program size (for the root user) from 4096 instructions to 1,000,000. Read more about the BPF verifier here.

Pressure stall monitors

Pressure stall monitors, which allow user space to detect and respond quickly to memory pressure, have been added. See this commit for documentation and a sample program.
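As a rough sketch of how user space can use a monitor, the code below registers a trigger and waits for it with poll(). It follows my reading of the kernel's PSI documentation: a trigger is a string of the form "<some|full> <stall µs> <window µs>" written to a /proc/pressure/ file, and the window must lie between 500 ms and 10 s. The function names and the validation rules are assumptions made for this example.

```python
import os
import select

MIN_WINDOW_US = 500_000       # 500 ms, lower limit per the PSI documentation
MAX_WINDOW_US = 10_000_000    # 10 s, upper limit per the PSI documentation


def psi_trigger(kind: str, stall_us: int, window_us: int) -> bytes:
    """Build a trigger: notify when stalls exceed stall_us within window_us."""
    if kind not in ("some", "full"):
        raise ValueError("kind must be 'some' or 'full'")
    if not MIN_WINDOW_US <= window_us <= MAX_WINDOW_US:
        raise ValueError("window must be between 500 ms and 10 s")
    if not 0 < stall_us <= window_us:
        raise ValueError("stall threshold must be positive and fit in the window")
    # The kernel's sample program writes the terminating NUL as well.
    return f"{kind} {stall_us} {window_us}".encode() + b"\0"


def wait_for_pressure(resource: str, trigger: bytes, timeout_ms: int) -> bool:
    """Return True if the trigger fired before the timeout expired."""
    fd = os.open(f"/proc/pressure/{resource}", os.O_RDWR | os.O_NONBLOCK)
    try:
        os.write(fd, trigger)
        poller = select.poll()
        poller.register(fd, select.POLLPRI)
        for _, event in poller.poll(timeout_ms):
            if event & select.POLLERR:
                raise OSError("PSI event source is no longer available")
            if event & select.POLLPRI:
                return True
        return False
    finally:
        os.close(fd)
```

For example, wait_for_pressure("memory", psi_trigger("some", 150_000, 1_000_000), 5000) would wait up to five seconds for 150 ms of partial memory stall within any one-second window. Registering a trigger may require elevated privileges, depending on the kernel version.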

MM optimizations to reduce unnecessary cache line movements/TLB misses

Optimizations to the memory management code reduce TLB (translation lookaside buffer) misses. More details in this commit.

Control Group v2 enhancements

Control Group, or cgroup, is a kernel feature that enables hierarchical grouping of processes such that their use of system resources (memory, CPU, I/O, etc.) can be controlled, monitored, and limited.

Version 1 of this feature has been in the kernel for a long time and is a crucial element of the implementation of containers in Linux.

Version 2, or cgroup v2, is a rework of control groups that has been under development since version 4 of the kernel. It is intended to remove inconsistencies and to enable better resource isolation and better management for containers. Some of its characteristics include:

  • unified hierarchy
  • better support for rootless, unprivileged containers
  • secure delegation of cgroups

See also this documentation.
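One practical consequence of the unified hierarchy is that it is easy to tell which version a system is using. The sketch below assumes the conventional mount point /sys/fs/cgroup and relies on the cgroup.controllers file, which exists at the root of a cgroup v2 hierarchy. The function names are invented for this example.

```python
import os
from typing import List


def is_cgroup_v2(root: str = "/sys/fs/cgroup") -> bool:
    """Return True if `root` looks like the root of a cgroup v2 hierarchy."""
    # cgroup.controllers is a cgroup v2 interface file; v1 mounts lack it.
    return os.path.isfile(os.path.join(root, "cgroup.controllers"))


def enabled_controllers(root: str = "/sys/fs/cgroup") -> List[str]:
    """List the controllers available in the cgroup at `root`."""
    with open(os.path.join(root, "cgroup.controllers")) as f:
        # The file holds a single space-separated line, e.g. "cpu io memory".
        return f.read().split()
```

Taking the path as a parameter keeps the helpers usable for delegated sub-hierarchies as well as for the root cgroup.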

Power efficient userspace waiting

The x86 umonitor, umwait, and tpause instructions are available in user-space code; they make it possible to efficiently execute small delays without the need for busy loops on Intel “Tremont” chips. Thus, applications can employ short waits while using less power and with reduced impact on the performance of other hyperthreads. A tunable has been provided to allow system administrators to control the maximum period for which the CPU can be paused.

pidfd_open() system call

The pidfd_open() system call has been added; it allows a process to obtain a pidfd for another, existing process. It is also now possible to use poll() on a pidfd to get notification when the associated process dies.
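Here is a small sketch of the poll() usage described above. It uses Python's os.pidfd_open() wrapper, which needs Python 3.9 or later in addition to a kernel that provides the system call. The wait_for_exit helper is a name made up for this illustration.

```python
import os
import select


def wait_for_exit(pid: int, timeout_s: float) -> bool:
    """Return True if process `pid` terminates within `timeout_s` seconds."""
    # Raises OSError if the kernel lacks pidfd_open() or the PID is gone.
    pidfd = os.pidfd_open(pid)
    try:
        poller = select.poll()
        # A pidfd becomes readable once the process it refers to has exited.
        poller.register(pidfd, select.POLLIN)
        return bool(poller.poll(int(timeout_s * 1000)))
    finally:
        os.close(pidfd)
```

Unlike waitpid(), this approach also works for processes that are not children of the caller, and the file descriptor can be added to an existing event loop alongside sockets and pipes.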

kdump support for AMD Secure Memory Encryption (SME)

See this article for more details.

Exposing knfsd state to userspace

The NFSv4 server now creates a directory under /proc/fs/nfsd/clients with information about current NFS clients, including which files they have open. Previously, it was not possible to get information about files held open by NFSv4 clients. See patch.

haltpoll CPU idle governor

The “haltpoll” CPU idle governor has been merged. This governor will poll for a while before halting an otherwise idle CPU; it is intended for virtualized guest applications where it can improve performance by avoiding exits to the hypervisor. See this commit for some more information.

New madvise() commands

There are two new madvise() commands to force the kernel to reclaim specific pages. MADV_COLD moves the indicated pages to the inactive list, essentially marking them unused and suitable targets for page reclaim. A stronger variant is MADV_PAGEOUT, which causes the pages to be reclaimed immediately.
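The following sketch shows how the two commands might be issued from Python through mmap.madvise() (Python 3.8 or later). Python's mmap module may not export these constants, so the code falls back to the numeric values 20 and 21, which I am taking from the generic Linux UAPI headers; treat those values, and the hint_reclaim helper, as assumptions of this example.

```python
import mmap

# Fall back to the values from Linux's asm-generic/mman-common.h (assumed).
MADV_COLD = getattr(mmap, "MADV_COLD", 20)
MADV_PAGEOUT = getattr(mmap, "MADV_PAGEOUT", 21)


def hint_reclaim(buf: mmap.mmap, advice: int) -> bool:
    """Apply `advice` to the whole mapping; return False if unsupported."""
    try:
        buf.madvise(advice)
        return True
    except (OSError, AttributeError):
        # OSError: the kernel rejected the advice (for example, too old).
        # AttributeError: this Python version has no mmap.madvise().
        return False
```

A caller holding a large cache in an anonymous mapping could use hint_reclaim(buf, MADV_COLD) to mark it as a reclaim candidate, or hint_reclaim(buf, MADV_PAGEOUT) to push it out right away. Both hints are non-destructive: the data is still there when the pages are touched again.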

dm-clone target

The new dm-clone target makes a copy of an existing read-only device. “The main use case of dm-clone is to clone a potentially remote, high-latency, read-only, archival-type block device into a writable, fast, primary-type device for fast, low-latency I/O”. More information can be found in this commit.

virtio-fs

Virtio-fs is a shared file system that lets virtual machines access a directory tree on the host. See this document and this commit message for more information.

Kernel lockdown

Kernel lockdown seeks to improve on guarantees that a system is running software intended by its owner. The idea is to build on protections offered at boot time (e.g. by UEFI secure boot) and extend them such that no program can modify the running kernel. This has recently been implemented as a security module.
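To check whether lockdown is active on a given machine, the security module exposes its state through securityfs. The sketch below assumes the file /sys/kernel/security/lockdown lists the available modes on one line with the active one in square brackets, for example "[none] integrity confidentiality". The function names are invented for this example.

```python
import re


def active_lockdown_mode(text: str) -> str:
    """Extract the bracketed (active) mode from the lockdown file contents."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError("no active mode marked in lockdown state")
    return match.group(1)


def read_lockdown_mode(path: str = "/sys/kernel/security/lockdown") -> str:
    """Read the current mode; the file is absent if lockdown is not built in."""
    with open(path) as f:
        return active_lockdown_mode(f.read())
```

Reading the file typically requires root privileges and a mounted securityfs.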

Improved AMD EPYC scheduler/load balancing

This release includes fixes to ensure that the scheduler properly balances load across NUMA nodes on different sockets. See the commit message.

Preparations for realtime preemption

Those who need realtime support in Linux have to this day had to settle for using the out-of-tree patchset PREEMPT_RT. 5.4 saw a number of patches preparing the kernel for native PREEMPT_RT support.

pidfd API

pidfd is a new concept in the kernel that represents a process as a file descriptor. As described in this article, the primary purpose is to prevent the delivery of signals to the wrong process should the target exit and be replaced, at the same ID, by an unrelated process. This problem is also known as PID recycling.

Conclusion

In slightly less than a year, a lot has happened in mainline kernel development. While the features covered here represent a mere subset of all the work that went into the kernel since 5.0, we thought they were noteworthy. If there are features you think we missed, please let us know in the comments!

Acknowledgments