Reasons to install Unbreakable Enterprise Kernel release 7 (UEK7) on Oracle Linux

July 8, 2022 | 13 minute read

Introduction

At Oracle, we believe that Linux runs best when the code is as close to upstream Linux as possible. We’re excited to announce Release 7 of the Unbreakable Enterprise Kernel (UEK7), based on upstream Linux 5.15. UEK7 allows customers to take advantage of all the latest features as they exist in upstream Linux without relying on complex backporting or re-implementation of features onto older Enterprise kernels. UEK7 contains the latest Linux features, security patches, performance improvements and innovations that enable UEK to power Oracle’s Cloud Infrastructure and the Oracle Exadata Database Platform in addition to our thousands of Oracle Linux customers.

One reason we’re so committed to bringing upstream Linux to Enterprise customers is because Oracle is a top contributor to the Linux kernel and many code changes came from Oracle teams. Oracle’s contributions are particularly strong in the areas of filesystems, memory management and other core kernel subsystems. Kernel developers at Oracle are encouraged to get their changes into upstream Linux first to ensure the long-term viability of their changes, thereby ensuring that customers can upgrade to modern kernels without losing the features they depend on for their business. UEK7 also tracks upstream LTS fixes, allowing customers rapid access to the security bug fixes being tracked upstream.

UEK7 will be available for customers on both Oracle Linux 8 and Oracle Linux 9. UEK7 brings the best upstream Linux kernel capabilities to Oracle Linux 8 where the current Red Hat Compatible Kernel (RHCK) is still at version 4.18 (and UEK6 is 5.4). By releasing the kernel on both OL8 and OL9 platforms, customers have a choice. Although we encourage customers to upgrade to Oracle Linux 9, customers can get many of the benefits of upgrading their OS just by upgrading the kernel without having to reinstall their whole environment.

Customers who move to UEK7 will get kernel improvements like core scheduling for improved process security and io_uring for high-performance asynchronous I/O; filesystem features like XFS Direct Access (DAX) support, NFS eager write support for improved performance and Btrfs fs-verity support for improved data integrity; networking features like WireGuard VPN and the latest RDS code; and plenty more. As we were building this list, we noted that a number of these features are already available in UEK6 via our backports and can be used on OL8. However, for all these features, UEK7 is the first kernel where the code is resident in its upstream unmodified form.

UEK7 includes a new and exciting feature to help improve diagnostics and debugging on Oracle Linux systems – bundled CTF data. CTF, the Compact C-type Format, allows the 600 MB debuginfo to be stripped and compressed into just 10 MB, allowing us to bundle this data along with the default UEK7 install. Having CTF available means DTrace works out of the box with no kernel modifications, and unlocks additional complex diagnostic capabilities. We’ll be posting more details and step-by-step instructions in a future blog post!

Oracle Linux with UEK7 is supported on the x86-64 and 64-bit Arm (aarch64) architectures.

Architecture-specific Arm

A number of new security features for Arm are included in UEK7.

Arm pointer authentication is a mechanism for applying cryptographic signatures to pointers used in running code. This feature helps prevent a variety of potential security vulnerabilities by detecting and rejecting pointers that have been modified unexpectedly by attackers.

A new mechanism has been introduced to prevent attackers (or bad code) from hijacking the return addresses for function calls on the stack. Return address signing uses Arm pointer authentication to verify the signature on the return stack. Should the verification fail, a kernel oops will be generated and the running process will be killed, protecting against attacks that hijack the return address via stack overwriting and other means.

Branch-Target Identification (BTI) is a mechanism that can be used to validate wild or indirect jumps. Attackers often try to trick the kernel to jump to a random address, for instance, by overwriting a structure full of function pointers called by the kernel. Enabling BTI provides one means of protecting against such attacks. These features coupled with pointer authentication provide some excellent defense against vulnerabilities such as buffer-overflow attacks.
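On a running 64-bit Arm system, one rough way to see whether the CPU advertises these features is to look at the Features line of /proc/cpuinfo. The Python sketch below assumes the kernel reports pointer authentication as paca and pacg and BTI as bti; the helper name and the sample text are hypothetical and only illustrate the parsing.

```python
def arm_security_features(cpuinfo_text):
    """Return which pointer-authentication / BTI flags appear in cpuinfo text."""
    wanted = {"paca", "pacg", "bti"}  # assumed flag names
    found = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            found |= wanted & set(value.split())
    return found


# Made-up sample; on a real system pass open("/proc/cpuinfo").read() instead.
sample = "processor\t: 0\nFeatures\t: fp asimd paca pacg bti\n"
print(sorted(arm_security_features(sample)))
```

A flag being present only means the hardware offers the feature; whether a given binary or kernel build actually uses it is a separate question.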

Architecture-specific x86_64

A notable performance feature to arrive is Split-lock Detection, which detects when atomic CPU instructions cross cache lines. Split-locks can cause performance degradation on other cores and are disallowed on architectures other than x86. UEK7 introduces Split-lock detection to help developers remove split locks from their code.
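To make "crossing cache lines" concrete, here is a small arithmetic sketch in Python. It assumes a 64-byte cache line, which is typical for x86-64 but is an assumption here, and the function name is made up for illustration.

```python
CACHE_LINE = 64  # bytes; assumed line size


def crosses_cache_line(address, size, line=CACHE_LINE):
    """True if the byte range [address, address + size) spans two cache lines."""
    first_line = address // line
    last_line = (address + size - 1) // line
    return first_line != last_line


# An 8-byte access that ends exactly at a line boundary stays in one line...
print(crosses_cache_line(0x1038, 8))
# ...but shifting it by 4 bytes makes it straddle two lines.
print(crosses_cache_line(0x103C, 8))
```

An atomic operation on the second kind of address is what split-lock detection is meant to flag; naturally aligned data avoids the problem.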

UEK7 will have native Linux support for Intel Software Guard Extensions (SGX)’s hardware protected enclaves.

This kernel also enables userspace to start using Intel Ivy Bridge’s FSGSBASE instruction set, allowing the direct manipulation of FS and GS segment base registers. This is significant as it paves the way for some substantial performance improvements by allowing userspace to set registers directly, avoiding a kernel call.

Core Kernel

io_uring

io_uring and its high-performance asynchronous I/O were introduced in UEK6; however, many notable improvements are now available in UEK7. Asynchronous buffered reads work better than kernel thread offload, providing some great performance improvements. With asynchronous buffered reads, a sample application doing 4G worth of random reads had a 33% speed up, with CPU usage dropping from about 82% to 52%. Other optimizations related to request recycling and task_work have also resulted in 10-20% improvements for workloads that are mostly inline.

io_uring adds a “multishot” mode to its poll operation. Normally, a poll (like any other io_uring operation) is removed from the ring once it generates a completion event. A multishot poll instead remains active and, uniquely, can generate multiple completion events from a single submission. Traditional Unix processes use file descriptors to handle open files. io_uring, however, does not require them: it now supports opening files directly into the fixed-file index table, which significantly speeds up some types of operations. Another performance improvement comes from a new BIO recycling mechanism that removes some internal memory-management overhead and reportedly provides a 10% increase in the number of I/O operations per second that io_uring can sustain.

A new security-related API has been added which facilitates the secure sharing of rings. Operation restrictions for io_uring allow a process to share a ring with less trusted processes, so that untrusted applications or guests can safely consume io_uring. io_uring is now fully integrated with memory control groups, which means all memory usage is properly accounted for and regulated, providing better security and accountability.

Berkeley Packet Filter (BPF)

Berkeley Packet Filter (BPF) was initially developed to analyse network traffic; however, it has since evolved to become one of the key tools for observing the Linux operating system. Dynamic tracing of an operating system can be costly, and BPF has introduced some specific features to help address this, such as BPF Trampoline, Compile Once Run Everywhere, BPF Type Format (BTF), and kernel function access for BPF programs. BPF Trampoline is a mechanism which provides a bridge between BPF programs and kernel functions, so kernel code can now call BPF programs with virtually zero overhead. Compile Once Run Everywhere coupled with BPF Type Format (BTF) ensures BPF tracing is both safer and faster by allowing the BPF verifier to type check BPF assembly code. BPF programs can now make direct calls to kernel functions. This capability was originally implemented to support TCP congestion-control algorithms written in BPF, but it can now be used more broadly. This direct access is limited to explicitly whitelisted functions and helps to improve code efficiency.

Sleepable BPF programs are now allowed for the tracing and security-module BPF program types, allowing them to sleep or block. This expands the capabilities of BPF programs, for example by catering for blocking operations such as reading userspace memory. The new BPF_PROG_TYPE_SK_LOOKUP BPF program type allows for more flexible packet steering. Programs of this type run when the kernel is looking for an open socket for an incoming connection, and can then decide which socket should receive the connection. This simplifies handling traffic destined for a range of addresses or port numbers by allowing a single socket to serve that whole range.

Core Scheduling

The Spectre class of security vulnerabilities has been widely publicised, detailing how private data belonging to one process could be accessed by another process running on the same core. This only affected CPUs with Simultaneous Multithreading (SMT) enabled, and the initial solution was simply to disable SMT in order to mitigate these Microarchitectural Data Sampling (MDS) bugs. However, this comes with a significant performance impact, and that’s where Core Scheduling can help. One of the many uses of core scheduling is to segregate the processes that share a core, ensuring data integrity and privacy between them. This allows SMT to remain enabled, avoiding much of the performance impact.
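As a rough illustration of how a process might opt in, upstream Linux exposes core scheduling through prctl(). The Python sketch below asks the kernel to give the calling thread its own core-scheduling cookie, so it will only share a core with tasks carrying the same cookie. The numeric constants are copied from the upstream linux/prctl.h as an assumption, the helper name is hypothetical, and the call is expected to fail harmlessly on kernels or hardware without SMT or core-scheduling support.

```python
import ctypes
import errno

# Assumed values from upstream <linux/prctl.h>.
PR_SCHED_CORE = 62
PR_SCHED_CORE_CREATE = 1
PR_SCHED_CORE_SCOPE_THREAD = 0


def create_core_cookie(pid=0):
    """Request a new core-scheduling cookie for `pid` (0 = calling thread).

    Returns None on success, or the errno name (a string) on failure.
    """
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        prctl = libc.prctl
    except (OSError, AttributeError, TypeError):
        return "ENOSYS"  # no usable prctl() on this platform
    ctypes.set_errno(0)
    rc = prctl(
        PR_SCHED_CORE,
        ctypes.c_ulong(PR_SCHED_CORE_CREATE),
        ctypes.c_ulong(pid),
        ctypes.c_ulong(PR_SCHED_CORE_SCOPE_THREAD),
        ctypes.c_ulong(0),
    )
    if rc == 0:
        return None
    err = ctypes.get_errno()
    return errno.errorcode.get(err, str(err))


outcome = create_core_cookie()
print("cookie created" if outcome is None else f"not available: {outcome}")
```

In practice a launcher would create the cookie once and then share it with the group of mutually trusting tasks that are allowed to run on sibling threads of the same core.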

Memory Management

Proactive compaction aims to make Huge Page allocation faster by automatically triggering memory compaction before an allocation is requested. Huge Pages are used by the kernel to improve performance; however, their allocation requires large amounts of contiguous free memory. When a huge page needs to be allocated and there is insufficient contiguous memory available, memory compaction (defragmentation) is triggered, and this can be very costly and time consuming. Proactively compacting memory ahead of these requests attempts to alleviate this expense.

The kernel now contains Kernel Electric-Fence (KFENCE), a new memory-debugging and memory-safety error detection tool with near-zero performance overhead, making it ideal for running on production kernels. KFENCE is less thorough than the already available Kernel Address Sanitizer (KASAN) memory error detector; however, it can still improve kernel reliability by detecting issues such as heap out-of-bounds access, use-after-free and invalid-free errors. Also newly available is the Data Access MONitor subsystem (DAMON), which allows for the monitoring and optimization of memory usage by userspace applications. You can now determine which memory regions a userspace application is accessing, allowing you to optimize your application’s memory behaviour.

Event Handling

The new general notification mechanism gives userspace the welcome ability to receive notifications from the kernel. Instead of having to poll the kernel, userspace can now create a standard pipe into which the kernel splices notification messages for userspace consumption. When creating the pipe, the userspace application can inform the kernel of the specific resources it would like to observe. Source types and subevents to be ignored can also be specified by placing additional filters on the pipe. The ability to use higher-resolution timeouts via the new epoll_pwait2() system call will be welcomed by datacenter networking operators: specifying more granular timeouts helps to reduce latency.
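The granularity difference is easy to see with a little arithmetic. epoll_wait() takes its timeout as a whole number of milliseconds, while epoll_pwait2() takes a timespec with a nanosecond field. The Python sketch below only does the unit conversion and does not call either system call; the helper names are made up for illustration.

```python
NSEC_PER_SEC = 1_000_000_000
NSEC_PER_MSEC = 1_000_000


def to_epoll_wait_ms(timeout_ns):
    """Round up to whole milliseconds, the finest unit epoll_wait() accepts."""
    return -(-timeout_ns // NSEC_PER_MSEC)


def to_timespec(timeout_ns):
    """Split into the (tv_sec, tv_nsec) pair that epoll_pwait2() accepts."""
    return divmod(timeout_ns, NSEC_PER_SEC)


wanted_ns = 250_000  # a 250 microsecond timeout
print(to_epoll_wait_ms(wanted_ns))  # 1  (four times longer than requested)
print(to_timespec(wanted_ns))       # (0, 250000)
```

With millisecond timeouts, a caller that wants to wait 250 microseconds has to choose between not waiting at all and waiting four times too long; the timespec form expresses the value exactly.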

CGroups

Reducing memory usage is always welcome, and that’s exactly what the new cgroup slab memory controller promises. The original cgroup slab memory controller replicated slab allocator internals for each memory cgroup, meaning slab memory was not shared between cgroups. With the new cgroup slab memory controller, slab memory can now be shared between memory cgroups, providing a substantial reduction in memory usage. Savings of up to 45% in the kernel’s memory footprint have been observed, which is particularly significant given that the cgroup memory controller is enabled by default in the kernel. Another resource-saving feature is that entire control groups can now be placed into the SCHED_IDLE scheduling class. This ensures the cgroup in its entirety will only run when the CPU is idle, improving the utilisation of CPU resources.
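As a sketch of what marking a group idle could look like, upstream Linux 5.15 exposes this as a cpu.idle file in each cgroup v2 directory. The Python below assumes that interface; the helper name is hypothetical, and the demo writes to a scratch directory standing in for a real cgroup so that it can run anywhere.

```python
import os
import tempfile


def set_cgroup_idle(cgroup_dir, idle=True):
    """Write 1 (idle) or 0 (normal) to the cgroup's cpu.idle file."""
    with open(os.path.join(cgroup_dir, "cpu.idle"), "w") as f:
        f.write("1" if idle else "0")


# Scratch directory used as a stand-in for something like /sys/fs/cgroup/<group>.
demo_dir = tempfile.mkdtemp()
set_cgroup_idle(demo_dir)
with open(os.path.join(demo_dir, "cpu.idle")) as f:
    print(f.read())
```

On a real system the write needs appropriate privileges and a cgroup with the cpu controller enabled.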

Filesystem

NFS

New features of note for NFS related to performance are eager writes and cross-device offloading of copy operations. Eager writes is a mount-time option that, when enabled, ensures file writes are sent immediately to the server instead of remaining in the page cache. This has been shown to reduce memory pressure on the client whilst also providing immediate notification when the filesystem is full. Cross-device offloading of copy operations provides efficient server-to-server file copies using Server-Side Copy (SSC), meaning that files are copied directly from one server to another, where previously they would have to be copied to the client system first. Another notable feature is that support for Extended File Attributes (RFC 8276) has arrived for the NFS client. This feature is particularly welcome as most modern filesystems have provided support for Extended File Attributes for some time. It provides NFS users with the ability to associate metadata that is not interpreted by the filesystem with files on the filesystem.
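From an application's point of view, extended attributes look the same whichever filesystem provides them. The Python sketch below attaches a user attribute to a temporary file and reads it back using os.setxattr() and os.getxattr(). The attribute name and value are made up, and the helper returns None when the underlying filesystem (or platform) does not support user attributes.

```python
import os
import tempfile


def tag_file(path, name, value):
    """Attach a user xattr and read it back; return None if unsupported."""
    if not hasattr(os, "setxattr"):
        return None  # platform without the Linux xattr calls
    try:
        os.setxattr(path, name, value)
        return os.getxattr(path, name)
    except OSError:
        return None  # filesystem does not accept user.* attributes


with tempfile.NamedTemporaryFile() as tmp:
    result = tag_file(tmp.name, "user.project", b"uek7-demo")
    print(result)
```

Pointing the same helper at a file on an NFS mount would exercise the new client support, provided the server also supports RFC 8276.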

Btrfs

Btrfs fs-verity support has arrived providing transparent integrity and authenticity protection of read-only files. fs-verity is essentially a way to hash a file in constant time and provides integrity protection by detecting accidental (non-malicious) corruption. Administrators are provided with multiple per-filesystem checksum algorithms to choose from, including xxhash64, blake2b and sha256 hash algorithms, with tradeoffs between integrity and performance. Btrfs has also introduced new rescue mount options aimed at improving data recovery on a corrupted filesystem. The RAID1 implementation in Btrfs increases the number of devices that can be used for replication from just two up to three or four.
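The idea behind fs-verity is a Merkle tree: the file is split into blocks, each block is hashed, and the hashes are themselves hashed in groups until a single root remains. The Python sketch below is a simplified illustration of that construction using SHA-256 and 4096-byte blocks. It is not byte-compatible with the real fs-verity format and the function name is made up.

```python
import hashlib

BLOCK_SIZE = 4096
HASH_SIZE = hashlib.sha256().digest_size  # 32 bytes


def merkle_root(data, block_size=BLOCK_SIZE):
    """Return a simplified Merkle-tree root hash over `data`."""
    # Leaf level: one hash per data block, zero-padding the final block.
    level = [
        hashlib.sha256(data[i:i + block_size].ljust(block_size, b"\0")).digest()
        for i in range(0, len(data), block_size)
    ] or [hashlib.sha256(b"").digest()]

    # Upper levels: pack child hashes into blocks and hash each block.
    per_block = block_size // HASH_SIZE
    while len(level) > 1:
        level = [
            hashlib.sha256(
                b"".join(level[i:i + per_block]).ljust(block_size, b"\0")
            ).digest()
            for i in range(0, len(level), per_block)
        ]
    return level[0]


original = b"x" * 10_000
tampered = b"x" * 9_999 + b"y"
print(merkle_root(original).hex())
print(merkle_root(original) == merkle_root(tampered))
```

Because each block can be checked against its path to the root, a reader only needs to verify the blocks it actually touches, and a change to any single byte produces a different root.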

Btrfs uses asynchronous SSD trimming to improve performance. Trimming informs the SSD when data blocks are no longer in use because files have been removed. These operations were previously executed synchronously; with asynchronous SSD trimming support, the actual discard I/O requests have been moved into a worker thread instead of being part of the transaction commit, thus improving commit latency significantly. Btrfs also improved the speed of parallel fsync operations for reflinked files and reduced the number of checksum tree lookups. These improvements reduced runtime by roughly 30% while gaining 50% more throughput.

Initial support has been added to Btrfs for zoned storage devices, allowing enterprise data centers to use Btrfs to help them architect more efficient, highly scalable data storage tiers. Support has also been added for ID-mapped mount points, which provide the ability to map the user and group IDs of one mount to another, providing greater flexibility in setting mount point ownership and permissions. Another feature related to mounting and compatibility is support for block sizes smaller than the machine’s page size. Previously, a Btrfs instance created on a machine with a 4k page size was not mountable on machines with a larger page size; that’s no longer the case, which provides greater compatibility across machines with varying page sizes.

XFS

XFS now supports Direct Access (DAX) operations on a per-file/per-directory level, thus permitting direct access to byte-addressable persistent memory and avoiding the latency of traditional block I/O conventions where data is copied through the page cache. A number of other optimizations have also been integrated, such as various cache flush changes and updating the log recovery code to operate on contiguous bit ranges instead of walking the log bit-by-bit. The latter is particularly beneficial when using large directory block sizes.

XFS online filesystem checking (fsck) continues to be a Tech Preview, but we hope to make this production in a future release of UEK7. This will allow users to verify the consistency of their filesystems at runtime without requiring the filesystem to be unmounted – and also prevent fsck from slowing down the reboot process. We’re also working on online filesystem repair, but that feature will appear in a future version of UEK.

XFS has solved the Year 2038 (Y2038) bug by extending its timestamp representation to 64-bit nanosecond counters, thereby supporting timestamps out to 2 July 2486. Another feature now available in XFS is Delay Ready Attributes, which allows attribute operations (set and remove) to be logged and committed in the same way as other delayed operations. This facilitates the breaking up of complex operations (like parent pointers) into multiple smaller transactions.
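The 2486 date can be reproduced with a short calculation. The Python sketch below assumes the counter is an unsigned 64-bit number of nanoseconds that starts at the lower bound of the old signed 32-bit format (2^31 seconds before 1970, in December 1901). It ignores any rounding the on-disk format applies, so treat the exact second as approximate.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Old format: signed 32-bit seconds, so the range ends at 2**31 - 1.
old_limit = UNIX_EPOCH + timedelta(seconds=2**31 - 1)

# New format (assumed): unsigned 64-bit nanoseconds counted from -2**31 seconds.
counter_seconds = 2**64 // 1_000_000_000
new_limit = UNIX_EPOCH + timedelta(seconds=-(2**31) + counter_seconds)

print(old_limit.date())  # 2038-01-19
print(new_limit.date())  # 2486-07-02
```

Roughly 584 years of nanoseconds fit in 64 bits, which is why the range stretches from late 1901 to mid 2486.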

Networking

NAPI Network Device Polling

Most network drivers rely on interrupts to signal packet arrival. NAPI (New API) instead polls the network device for new packets at frequent intervals, processing all waiting packets at once, so a single poll can potentially replace dozens of hardware interrupts. The polling runs in software interrupt (softirq) context, where the task scheduler cannot see it, which makes tuning the system for maximum performance quite difficult. UEK7 adds a new feature that allows this polling to be performed by a kernel thread instead. The thread is managed by the task scheduler and is therefore visible to userspace administrators, allowing them to tune their systems by pinning individual kernel threads to specific CPUs. This feature is not guaranteed to provide performance improvements, though, so users should enable this behavior with care.
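Upstream Linux exposes the threaded polling switch as a per-device sysfs attribute named threaded. The Python sketch below assumes that layout and simply reports the current setting for each interface that has the attribute, without changing anything; the helper name is hypothetical.

```python
import os


def napi_threaded_state(sys_class_net="/sys/class/net"):
    """Map interface name -> contents of its 'threaded' attribute, if present."""
    state = {}
    if not os.path.isdir(sys_class_net):
        return state
    for dev in sorted(os.listdir(sys_class_net)):
        path = os.path.join(sys_class_net, dev, "threaded")
        try:
            with open(path) as f:
                state[dev] = f.read().strip()
        except OSError:
            continue  # attribute missing or unreadable for this entry
    return state


print(napi_threaded_state())
```

Writing 1 to the same file as root is how the behaviour would be switched on for a device. Given the caveat above, it is worth measuring before and after.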

WireGuard

Now officially supported in UEK7 is WireGuard, a relatively new secure VPN tunnel that aims to be faster and less complex than IPsec solutions. WireGuard configuration is really straightforward: it is as simple as configuring and deploying SSH. Just like SSH, initiating a VPN connection is achieved by exchanging simple public keys, and WireGuard transparently handles everything else. The administrative burden and complexity of managing connections and daemons, or being concerned about state, can all be set aside as WireGuard handles all of this for you. Rather than offering a menu of ciphers to choose from, WireGuard relies on a fixed set of modern cryptographic building blocks: the Noise protocol framework, Curve25519, ChaCha20, Poly1305, BLAKE2, SipHash24, HKDF, and secure trusted constructions. It was designed with ease of use and simplicity in mind from the outset, which means it can be implemented with a minimal amount of code. This makes it more secure due to its minimal attack surface, especially when compared to the likes of *Swan/IPsec or OpenVPN. As WireGuard lives inside the kernel and uses the latest extremely fast cryptographic primitives, high-speed secure networking can be achieved. Despite its young age and high level of ongoing development, it is already considered by some to be the most secure, fastest and simplest VPN solution available.
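To give a feel for how little configuration is involved, the Python sketch below renders a minimal configuration file in the INI-style format read by the wg-quick tool: one Interface section for the local side and one Peer section for the remote side. Every key, address and endpoint shown is a placeholder, and the helper name is made up. Real keys would come from the wg genkey and wg pubkey commands.

```python
def render_wg_conf(private_key, address, peer_public_key, endpoint, allowed_ips):
    """Return the text of a minimal one-peer WireGuard configuration."""
    return "\n".join([
        "[Interface]",
        f"PrivateKey = {private_key}",
        f"Address = {address}",
        "",
        "[Peer]",
        f"PublicKey = {peer_public_key}",
        f"Endpoint = {endpoint}",
        f"AllowedIPs = {allowed_ips}",
        "",
    ])


conf = render_wg_conf(
    private_key="<local-private-key>",       # placeholder
    address="10.0.0.2/24",                   # placeholder tunnel address
    peer_public_key="<remote-public-key>",   # placeholder
    endpoint="192.0.2.10:51820",             # documentation-range address
    allowed_ips="10.0.0.0/24",
)
print(conf)
```

The whole trust relationship is the pair of public keys, much like an SSH authorized_keys entry.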

Security

Many security-related features and improvements have been integrated into UEK7, as discussed elsewhere in this article; however, there is one more worth mentioning: Randomized Stack Offsets. This feature, now available in UEK7, greatly increases the level of protection for the Linux kernel against stack-based vulnerabilities. Many stack-based attacks rely on a deterministic stack structure, so randomizing stack offsets makes it much harder for an attack to reliably land in any particular place on the thread stack.

Linux Integrity Management Architecture (IMA)

Oracle Linux 9 has introduced signing for the files in its user space packages: all of these files are now signed with IMA signatures. These signatures can be used with Linux IMA to verify the integrity of the files during their execution, based on a policy. The policies that are set can ensure the machine is used as intended. Oracle Linux does not install a default policy. The user has the option to enable a policy that is best suited to their needs and have that enforced.

Virtualization

VirtualBox Guest Shared Folders

The VirtualBox Shared Folder VFS driver is now available in UEK7. This effectively adds a new filesystem vboxsf to the Linux kernel and facilitates the sharing of folders between Oracle VM VirtualBox Linux guests and the host operating system. Shared folders are similar to network shares in Windows networks, except they are much simpler as they do not require networking.

Conclusion

This list represents a small fraction of the contributions and features that make UEK7 the best choice for enterprise systems. We hope this article has given you insight into what to expect with the latest Linux kernel, and encourage you to give it a try on Oracle Linux 8 and Oracle Linux 9!

This post has been updated to reflect a feature that has been disabled.

Greg Marsden

Matt Keenan

