Thanks to Chuck Anderson, Linux kernel developer, Oracle, for compiling the information in this post.

Enhancements to mainline Linux continue at a steady pace, though we don’t always hear a lot about this work. With the 5.1 release upon us, we wanted to give a shout out to some important and notable additions that the 5.0 release brought to bear. Some we chose because our kernel developers are directly involved, some because they affect Oracle workloads, and others simply because they piqued our interest. Here are our top picks. For a complete overview of Linux kernel 5.0 new features, see LWN.

Valuable New Features:

arm64 support

The arm64 architecture has gained support for a number of features including the kexec_file_load() system call, 52-bit virtual address support for user space, hotpluggable memory, per-thread stack canaries, and pointer authentication (for user space only at this point). This commit has some documentation for the pointer-authentication feature.

Retpoline-elimination

The first two of the retpoline-elimination mechanisms described in this article have been merged, improving performance in core parts of the DMA mapping and networking layers.

Core Kernel Changes:

The long-awaited energy-aware scheduling patches have found their way into the mainline. This code adds a new energy model that allows the scheduler to determine the relative power cost of scheduling decisions. This enables the mainline scheduler to get better results on mobile devices and should reduce or eliminate the scheduler patching that various vendors engage in now.

The cpuset controller now works (with reduced features) under the version-2 control-group API. See the documentation updates in this commit for details.

There is also a new “dynamic events” interface to the tracing subsystem. It unifies the three distinct interfaces (for kprobes, uprobes, and synthetic events) into a single control file. See this patch posting for a brief overview of how this interface works.
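As a rough illustration of what a single control file means in practice, the sketch below builds a kprobe event definition and the matching removal command as plain strings. It assumes that tracefs is mounted at /sys/kernel/tracing and that the file is named dynamic_events; the helper names (`kprobe_definition`, `removal_command`, `add_event`) and the event name `myopen` are hypothetical. Only the basic `p:GROUP/EVENT SYMBOL` and `-:GROUP/EVENT` forms from the kprobe event syntax are used.

```python
import re

# Assumed tracefs location; older setups may use /sys/kernel/debug/tracing.
DYNAMIC_EVENTS = "/sys/kernel/tracing/dynamic_events"

_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check(name):
    """Reject names that would not be valid group or event identifiers."""
    if not _NAME.match(name):
        raise ValueError(f"invalid name: {name!r}")
    return name


def kprobe_definition(event, symbol, group="kprobes"):
    """Return a definition line such as 'p:kprobes/myopen do_sys_open'."""
    return f"p:{_check(group)}/{_check(event)} {symbol}"


def removal_command(event, group="kprobes"):
    """Return the line that deletes a previously defined event."""
    return f"-:{_check(group)}/{_check(event)}"


def add_event(line, path=DYNAMIC_EVENTS):
    """Append one command to the control file (needs root and tracefs)."""
    with open(path, "a") as control:
        control.write(line + "\n")
```

Calling `add_event(kprobe_definition("myopen", "do_sys_open"))` as root would register the probe on a kernel that provides the file; the string builders themselves have no side effects and can be inspected safely.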

Improving idle behavior in tickless systems

Lead paragraph: “Most processors spend a great deal of their time doing nothing, waiting for devices and timer interrupts. In these cases, they can switch to idle modes that shut down parts of their internal circuitry, especially stopping certain clocks. This lowers power consumption significantly and avoids draining device batteries. There are usually a number of idle modes available; the deeper the mode is, the less power the processor needs. The tradeoff is that the cost of switching to and from deeper modes is higher; it takes more time and the content of some caches is also lost. In the Linux kernel, the cpuidle subsystem has the task of predicting which choice will be the most appropriate. Recently, Rafael Wysocki proposed a new governor for systems with tickless operation enabled that is expected to be more accurate than the existing menu governor.”

Ringing in a new asynchronous I/O API: io_uring

io_uring is a new asynchronous I/O kernel interface whose development we’ve been watching with great interest, not only because it promises to deliver buffered asynchronous I/O via a simplified interface, but especially for its efficiency, scalability and the performance gains that come with it. For more background, see this article by the architect and lead io_uring developer, Jens Axboe.

Lead paragraph: “While the kernel has had support for asynchronous I/O (AIO) since the 2.5 development cycle, it has also had people complaining about AIO for about that long. The current interface is seen as difficult to use and inefficient; additionally, some types of I/O are better supported than others. That situation may be about to change with the introduction of a proposed new interface from Jens Axboe called “io_uring”. As might be expected from the name, io_uring introduces just what the kernel needed more than anything else: yet another ring buffer.”
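The "ring buffer" idea mentioned in the quote can be illustrated with a small model. This is a conceptual sketch only, not the io_uring API: the `Ring` class and its methods are hypothetical. It shows the usual arrangement in which the producer advances only the tail index, the consumer advances only the head index, and both indices are masked by a power-of-two size to find a slot.

```python
class Ring:
    """Toy single-producer/single-consumer ring (hypothetical, not the kernel ABI)."""

    def __init__(self, entries):
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.mask = entries - 1
        self.slots = [None] * entries
        self.head = 0  # advanced only by the consumer
        self.tail = 0  # advanced only by the producer

    def push(self, item):
        """Producer side: store an item, or return False if the ring is full."""
        if self.tail - self.head > self.mask:
            return False
        self.slots[self.tail & self.mask] = item
        self.tail += 1
        return True

    def pop(self):
        """Consumer side: return the oldest item, or None if the ring is empty."""
        if self.head == self.tail:
            return None
        item = self.slots[self.head & self.mask]
        self.head += 1
        return item


def demo():
    """Move three fake requests from a submission ring to a completion ring."""
    submissions, completions = Ring(4), Ring(4)
    for offset in (0, 4096, 8192):
        submissions.push(("read", offset))
    request = submissions.pop()
    while request is not None:  # stand-in for the kernel side
        completions.push((request, "done"))
        request = submissions.pop()
    return completions
```

Because each index has exactly one writer, the two sides can exchange entries without taking a lock; a real shared-memory implementation would additionally need memory barriers, which this model omits.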

Pressure stall monitors

Lead paragraph: “One of the useful features added during the 4.20 development cycle was the availability of pressure-stall information, which provides visibility into how resource-constrained the system is. Interest in using this information has spread beyond the data-center environment where it was first implemented, but it turns out that there are some shortcomings in the current interface that affect other use cases. Suren Baghdasaryan has posted a patch set aimed at making pressure-stall information more useful for the Android use case — and, most likely, for many other use cases as well.”
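For context, the existing pressure-stall interface that the quote refers to is a set of small text files under /proc/pressure/ (cpu, memory, and io) on kernels built with PSI support. The sketch below parses that text format; the function names are hypothetical, and the sample values in the usage note are made up for illustration.

```python
def parse_pressure(text):
    """Parse PSI text such as 'some avg10=0.00 avg60=0.00 avg300=0.00 total=0'."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]  # kind is 'some' or 'full'
        values = {}
        for pair in pairs:
            key, _, raw = pair.partition("=")
            # 'total' is a cumulative stall time in microseconds; the
            # avg* fields are percentages over 10, 60 and 300 seconds.
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


def read_pressure(resource="memory"):
    """Read and parse /proc/pressure/<resource>; requires PSI in the kernel."""
    with open(f"/proc/pressure/{resource}") as source:
        return parse_pressure(source.read())
```

Given the made-up input `"some avg10=1.50 avg60=0.25 avg300=0.00 total=12345"`, `parse_pressure` returns a dictionary whose `"some"` entry holds `avg10` as 1.5 and `total` as 12345. Polling such a file in a loop is the kind of usage the proposed monitors aim to make unnecessary.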

Persistent memory for transient data

Lead paragraph: “Arguably, the most notable characteristic of persistent memory is that it is persistent: it retains its contents over power cycles. One other important aspect of these persistent-memory arrays that, we are told, will soon be everywhere, is their sheer size and low cost; persistent memory is a relatively inexpensive way to attach large amounts of memory to a system. Large, cheap memory arrays seem likely to be attractive to users who may not care about persistence and who can live with slower access speeds. Supporting such users is the objective of a pair of patch sets that have been circulating in recent months.”

Concurrency management in BPF

Lead paragraph: “In the beginning, programs run on the in-kernel BPF virtual machine had no persistent internal state and no data that was shared with any other part of the system. The arrival of eBPF and, in particular, its maps functionality, has changed that situation, though, since a map can be shared between two or more BPF programs as well as with processes running in user space. That sharing naturally leads to concurrency problems, so the BPF developers have found themselves needing to add primitives to manage concurrency (the “exchange and add” or XADD instruction, for example). The next step is the addition of a spinlock mechanism to protect data structures, which has also led to some wider discussions on what the BPF memory model should look like.”

io_uring, SCM_RIGHTS, and reference-count cycles

Lead paragraph: “The io_uring mechanism that was described here in January has been through a number of revisions since then; those changes have generally been fixing implementation issues rather than changing the user-space API. In particular, this patch set seems to have received more than the usual amount of security-related review, which can only be a good thing. Security concerns became a bit of an obstacle for io_uring, though, when virtual filesystem (VFS) maintainer Al Viro threatened to veto the merging of the whole thing. It turns out that there were some reference-counting issues that required his unique experience to straighten out.”

Per-vector software-interrupt masking

Lead paragraph: “Software interrupts (or “softirqs”) are one of the oldest deferred-execution mechanisms in the kernel, and that age shows at times. Some developers have been occasionally heard to mutter about removing them, but softirqs are too deeply embedded into how the kernel works to be easily ripped out; most developers just leave them alone. So the recent per-vector softirq masking patch set from Frederic Weisbecker is noteworthy as an exception to that rule. Weisbecker is not getting rid of softirqs, but he is trying to reduce their impact and improve their latency.”

Memory-mapped I/O without mysterious macros

Lead paragraph: “Concurrency is hard even when the hardware’s behavior is entirely deterministic; it gets harder in situations where operations can be reordered in seemingly random ways. In these cases, developers tend to reach for barriers as a way of enforcing ordering, but explicit barriers are tricky to use and are often not the best way to think about the problem. It is thus common to see explicit barriers removed as code matures. That now seems to be happening with an especially obscure type of barrier used with memory-mapped I/O (MMIO) operations.”

Reimplementing printk()

Lead paragraph: “The venerable printk() function has been part of Linux since the very beginning, though it has undergone a fair number of changes along the way. Now, John Ogness is proposing to fundamentally rework printk() in order to get rid of a handful of issues that currently plague it. The proposed code does this by adding yet another ring-buffer implementation to the kernel; this one is aimed at making printk() work better from hard-to-handle contexts. For a task that seems conceptually simple—printing messages to the console—printk() is actually a rather complex beast; that won’t change if these patches are merged, though many of the problems with the current implementation will be removed.”

The RCU API, 2019 edition

Lead paragraph: “Read-copy update (RCU) is a synchronization mechanism that was added to the Linux kernel in October 2002. RCU is most frequently described as a replacement for reader-writer locking, but has also been used in a number of other ways. RCU is notable in that readers do not directly synchronize with updaters, which makes RCU read paths extremely fast; that also permits RCU readers to accomplish useful work even when running concurrently with updaters. Although the basic idea behind RCU has not changed in decades following its introduction into DYNIX/ptx, the API has evolved significantly over the five years since the 2014 edition of the RCU API, to say nothing of the nine years since the 2010 edition of the RCU API.”

Containers as kernel objects — again

Lead paragraph: “Linus Torvalds once famously said that there is no design behind the Linux kernel. That may be true, but there are still some guiding principles behind the evolution of the kernel; one of those, to date, has been that the kernel does not recognize “containers” as objects in their own right. Instead, the kernel provides the necessary low-level features, such as namespaces and control groups, to allow user space to create its own container abstraction. This refusal to dictate the nature of containers has led to a diverse variety of container models and a lot of experimentation. But that doesn’t stop those who would still like to see the kernel recognize containers as first-class kernel-supported objects.”

Internal Kernel Changes:

There is a new “software node” concept that is meant to be analogous to the “firmware nodes” created in ACPI or device-tree descriptions. See this commit for some additional information.

The software-tag-based mode for KASAN has been added for the arm64 architecture.

The switch to using JSON schemas for device-tree bindings has begun with the merging of the core infrastructure and the conversion of a number of binding files.

The long-deprecated SUBDIRS= build option is going away in the 5.3 merge window; users will start seeing a warning as of 5.0. The M= option should be used instead.

The venerable access_ok() function, which verifies that an address lies within the user-space region, has lost its first argument. This argument was either VERIFY_READ or VERIFY_WRITE depending on the type of access, but no implementation of access_ok() actually used that information.

Filesystems and Block Layer Changes:

The Btrfs filesystem has regained the ability to host swap files, though with a lot of limitations (no copy-on-write, must be stored on a single device, and no compression allowed, for example).

The fanotify() mechanism supports a new FAN_OPEN_EXEC request to receive notifications when a file is opened to be executed.

The legacy (non-multiqueue) block layer code has been removed, now that no drivers require it. The legacy I/O schedulers (including CFQ and deadline) have been removed as well.

Networking Changes:

Generic receive offload (GRO) can now be enabled on plain UDP sockets. Benchmark numbers in this commit show a significant increase in receive bandwidth and a large reduction in the number of system calls required.
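A minimal sketch of opting a UDP socket in to GRO follows. It assumes the `UDP_GRO` socket option value from `<linux/udp.h>` (104), which Python's `socket` module may not export, so it is defined by hand here; the helper names are hypothetical. On kernels without the feature the call is expected to fail, which the helper reports as `False`.

```python
import socket

# Value taken from <linux/udp.h>; defined here because the socket module
# may not provide a UDP_GRO constant.
UDP_GRO = 104


def enable_udp_gro(sock):
    """Try to enable GRO on a UDP socket; return True if the kernel accepted it."""
    try:
        sock.setsockopt(socket.IPPROTO_UDP, UDP_GRO, 1)
    except OSError:
        return False  # older kernel or unsupported platform
    return True


def demo():
    """Create a throwaway UDP socket and report whether GRO could be enabled."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return enable_udp_gro(sock)
```

With the option enabled, a single receive call may return several coalesced datagrams, so a real receiver would also need to read the segment size that the kernel reports through ancillary data; that part is omitted from this sketch.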

ICMP error handling for UDP tunnels is now supported.

The MSG_ZEROCOPY option is now supported for UDP sockets.

Security Changes:

Support for the Streebog hash function (also known as GOST R 34.11-2012) has been added to the cryptographic subsystem.

The kernel is now able to support non-volatile memory arrays with built-in security features; see Documentation/nvdimm/security.txt for details.

A small piece of the secure-boot lockdown patch set has landed in the form of additional control over the kexec_file_load() system call. There is a new keyring (called .platform) for keys provided by the platform; it cannot be updated by a running system. Keys in this ring can be used to control which images may be run via kexec_file_load(). It has also become possible for security modules to prevent calls to kexec_load(), which cannot be verified in the same manner.

The secure computing (seccomp) mechanism can now defer policy decisions to user space. See this new documentation for details on the final version of the API.

The fscrypt filesystem encryption subsystem has gained support for the Adiantum encryption mode (which was added earlier in the merge window).

The semantics of the mincore() system call have changed. In this commit, Linus Torvalds explains how the new semantics of this system call restrict access to pages that are mapped by the calling process.
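For readers unfamiliar with mincore(), the sketch below calls it through `ctypes` on an anonymous mapping owned by the calling process, a case that should behave the same under the old and new semantics. The helper name `resident_pages` is hypothetical, and the code assumes a Linux system whose C library exports `mincore`.

```python
import ctypes
import ctypes.util
import mmap


def resident_pages(mapping):
    """Return one bool per page of `mapping`: True if mincore() reports it resident."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    libc.mincore.argtypes = [
        ctypes.c_void_p,
        ctypes.c_size_t,
        ctypes.POINTER(ctypes.c_ubyte),
    ]
    length = len(mapping)
    pages = (length + mmap.PAGESIZE - 1) // mmap.PAGESIZE
    vec = (ctypes.c_ubyte * pages)()
    # Borrow the mapping's buffer only long enough to learn its address.
    view = ctypes.c_char.from_buffer(mapping)
    try:
        rc = libc.mincore(ctypes.addressof(view), length, vec)
        err = ctypes.get_errno()
    finally:
        del view  # release the buffer so the mapping can be closed later
    if rc != 0:
        raise OSError(err, "mincore() failed")
    # Only the least significant bit of each byte is defined.
    return [bool(byte & 1) for byte in vec]


def demo():
    """Touch the first of two anonymous pages and query residency for both."""
    with mmap.mmap(-1, 2 * mmap.PAGESIZE) as mapping:
        mapping[0] = 1  # fault in the first page
        return resident_pages(mapping)
```

The side channel discussed in the linked material arises when the same query is made against file-backed pages that other processes brought into the page cache, which is the case the changed semantics are meant to restrict.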

An ancient OpenSSH vulnerability

Lead paragraph: “An advisory from Harry Sintonen describes several vulnerabilities in the scp clients shipped with OpenSSH, PuTTY, and others. “Many scp clients fail to verify if the objects returned by the scp server match those it asked for. This issue dates back to 1983 and rcp, on which scp is based. A separate flaw in the client allows the target directory attributes to be changed arbitrarily. Finally, two vulnerabilities in clients may allow server to spoof the client output.” The outcome is that a hostile (or compromised) server can overwrite arbitrary files on the client side. There do not yet appear to be patches available to address these problems.”

Defending against page-cache attacks

Lead paragraph: “The kernel’s page cache works to improve performance by minimizing disk I/O and increasing the sharing of physical memory. But, like other performance-enhancing techniques that involve resources shared across security boundaries, the page cache can be abused as a way to extract information that should be kept secret. A recent paper [PDF] by Daniel Gruss and colleagues showed how the page cache can be targeted for a number of different attacks, leading to an abrupt change in how the mincore() system call works at the end of the 5.0 merge window. But subsequent discussion has made it clear that mincore() is just the tip of the iceberg; it is unclear what will really need to be done to protect a system against page-cache attacks or what the performance cost might be.”

Fixing page-cache side channels, second attempt

Lead paragraph: “The kernel’s page cache, which holds copies of data stored in filesystems, is crucial to the performance of the system as a whole. But, as has recently been demonstrated, it can also be exploited to learn about what other users in the system are doing and extract information that should be kept secret. In January, the behavior of the mincore() system call was changed in an attempt to close this vulnerability, but that solution was shown to break existing applications while not fully solving the problem. A better solution will have to wait for the 5.1 development cycle, but the shape of the proposed changes has started to come into focus.”

A proposed API for full-memory encryption

Lead paragraph: “Hardware memory encryption is, or will soon be, available on multiple generic CPUs. In its absence, data is stored — and passes between the memory chips and the processor — in the clear. Attackers may be able to access it by using hardware probes or by directly accessing the chips, which is especially problematic with persistent memory. One new memory-encryption offering is Intel’s Multi-Key Total Memory Encryption (MKTME) [PDF]; AMD’s equivalent is called Secure Encrypted Virtualization (SEV). The implementation of support for this feature is in progress for the Linux kernel. Recently, Alison Schofield proposed a user-space API for MKTME, provoking a long discussion on how memory encryption should be exposed to the user, if at all.”

Other Developments of Note:

Snowpatch: continuous-integration testing for the kernel

Lead paragraph: “Many projects use continuous-integration (CI) testing to improve the quality of the software they produce. By running a set of tests after every commit, CI systems can identify problems quickly, before they find their way into a release and bite unsuspecting users. The Linux kernel project lags many others in its use of CI testing for a number of reasons, including a fundamental mismatch with how kernel developers tend to manage their workflows. At linux.conf.au 2019, Russell Currey described a CI system called Snowpatch that, he hopes, will bridge the gap and bring better testing to the kernel development process.”

The Firecracker virtual machine monitor

The Firecracker virtual machine monitor is not strictly speaking a Linux kernel 5.0 feature but it does use the KVM API.

Lead paragraph: “Cloud computing services that run customer code in short-lived processes are often called “serverless”. But under the hood, virtual machines (VMs) are usually launched to run that isolated code on demand. The boot times for these VMs can be slow. This is the cause of noticeable start-up latency in a serverless platform like Amazon Web Services (AWS) Lambda. To address the start-up latency, AWS developed Firecracker, a lightweight virtual machine monitor (VMM), which it recently released as open-source software. Firecracker emulates a minimal device model to launch Linux guest VMs more quickly. It’s an interesting exploration of improving security and hardware utilization by using a minimal VMM built with almost no legacy emulation.”

As covered above, there are many interesting developments in mainline Linux kernel 5.0, some of which we believe are particularly relevant to Oracle customers. As of this writing, 5.0.10 is considered stable. We will continue to monitor developments in upcoming kernels, so look for a blog post with highlights in the next few months.

Additional Resources:

For more on mainline Linux and other related topics, see:

  • LWN
  • Oracle’s Linux Kernel Development blog