UEK R8 is based on the upstream long-term stable (LTS) Linux kernel v6.12 and introduces many new features, improvements, and bug fixes compared with the previous release, UEK R7, which was based on upstream LTS kernel v5.15.

In this note, we look at what has been improved in the UEK R8 NFS & RPC client & server implementations.

NFSv4 Courteous Server

The NFSv4 protocol uses the concept of a lease duration, which indicates the valid lifetime of all the state that an NFSv4 server holds for an NFS client, e.g. the state relating to open files and file locks. By default, the NFSv4 lease duration is 90 seconds.

An NFSv4 client must renew its lease at some point during the lease duration, either by performing an operation that uses some of that state or, if the client is otherwise quiescent, by an explicit RENEW operation.
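
On a Linux NFS server, the lease duration can be inspected (and, before nfsd is started, changed) via the nfsd procfs file /proc/fs/nfsd/nfsv4leasetime. A minimal sketch that reads the current value:

    /* Sketch: read the NFSv4 lease duration from the nfsd procfs file.
     * Assumes the nfsd module is loaded; the value is in seconds and
     * defaults to 90. Writing a new value to the same file (before the
     * NFS server is started) changes the lease duration. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/fs/nfsd/nfsv4leasetime", "r");
        int lease;

        if (f == NULL) {
            perror("/proc/fs/nfsd/nfsv4leasetime");
            return 1;
        }
        if (fscanf(f, "%d", &lease) == 1)
            printf("NFSv4 lease duration: %d seconds\n", lease);
        fclose(f);
        return 0;
    }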

Previously, if an NFSv4 client did not renew its lease within the 90-second lease duration, the NFS server would immediately remove all the state relating to that NFS client, i.e. all the client’s file open and locking state. This could happen during an extended network outage, for example, leading to NFS client and application errors once the client eventually regained contact with the NFS server after the network connection was restored.

With the new Courteous Server feature, the NFSv4 server will not remove state for a client that is no longer renewing its lease, as long as there is no competing access to that client’s file opens and locks from another NFS client. This allows a client suffering from an extended network partition to seamlessly resume its operation with the NFS server after the network is restored.

Note that the Courteous Server feature was previously backported to the UEK7 kernel; it is covered in more detail in this article.

NFSv4 server support for Write Delegations

The NFSv4 server has for some time offered its clients Read delegations, which allow an NFS client to perform local opens for reading without needing to check with the NFS server whether a file’s content or attributes have changed, in the absence of competing access from other NFS clients. Write operations, however, still had to go to the NFS server.

The NFSv4 server now also offers its clients Write delegations, which extend this model to allow the client to perform local writes of a file’s content, and of some attributes, without those writes needing to go to the NFS server (again, in the absence of competing access from other clients).

A write delegation enables a client to cache data and metadata for a single file more aggressively, reducing network round trips and server workload.

A further optimisation is the NFSv4 server’s new ability to send a CB_GETATTR callback request to the client, to retrieve the client’s definitive view of a write-delegated file’s attributes. In some situations, this may remove the need for the NFS server to recall the write delegation.

Kerberos AES-SHA2 RPCSEC_GSS encryption types

The NFS client and server now support the FIPS 140-compliant Kerberos AES-SHA2 encryption types, aes128-cts-hmac-sha256-128 and aes256-cts-hmac-sha384-192 (as per RFC 8009), and the Camellia encryption types (RFC 6803).

The encryption types DES and Triple-DES (3DES/DES3) have been removed. These encryption types were deprecated by RFCs 6649 & 8429 because they are known to be insecure.

RPC with TLS support

The NFS client and NFS server now support RPC with TLS, as per RFC 9289. This provides a (host-)authenticated, encrypted connection for NFS traffic that is simple to configure, without the need to set up a complex Kerberos installation.
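
On the NFS client, RPC with TLS is requested with the xprtsec=tls mount option, i.e. the equivalent of mount -t nfs4 -o xprtsec=tls server:/export /mnt. Below is a minimal sketch using the raw mount(2) system call; the server name, export path, and address are placeholders:

    /* Sketch: mount an NFS export with RPC over TLS (xprtsec=tls). */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* Placeholders: a real mount would use your server's name,
         * export path, and IP address. mount.nfs normally resolves
         * the hostname; at the raw mount(2) level the kernel expects
         * the server's address in the addr= option. */
        const char *opts = "addr=192.0.2.1,xprtsec=tls";

        /* The TLS handshake itself is performed on the kernel's
         * behalf by a userspace daemon (tlshd), which must be
         * running on the client. */
        if (mount("server.example.com:/export", "/mnt/nfs",
                  "nfs4", 0, opts) != 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }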

Note that the RPC with TLS feature was previously backported to the UEK7 kernel; it will be covered in more detail in an upcoming article.

Performance

General

  • work has continued to further improve the performance of NFS & RPC over RDMA transports.

  • support for the NFSv4.2 READ_PLUS operation, which allows for efficient handling of sparse files. These are files with holes, i.e. extended ranges of unallocated data. Such ranges could be represented on storage simply by writing blocks of zero bytes, but it is more efficient to record that a range of the file is unallocated, so that those blocks need not be stored at all; similarly, READ_PLUS allows the server to describe a hole to the client, rather than transferring the corresponding zero bytes over the network. This reduces storage space requirements for large sparse files, and improves performance when reading & writing them.
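
    As a brief illustration of sparseness itself (independent of NFS), the following sketch creates a file with a 1MiB hole, then locates the hole with lseek(2)’s SEEK_HOLE, which reports similar information to that which READ_PLUS can now convey over the wire:

        /* Sketch: create a sparse file and locate its hole with lseek(2).
         * The 1MiB gap between the two writes is never allocated on disk;
         * the exact hole offset reported depends on the filesystem's
         * block granularity. */
        #define _GNU_SOURCE             /* for SEEK_DATA / SEEK_HOLE */
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            int fd = open("sparse.dat", O_CREAT | O_TRUNC | O_RDWR, 0644);
            off_t hole;

            if (fd < 0) { perror("open"); return 1; }

            pwrite(fd, "begin", 5, 0);               /* data at offset 0       */
            pwrite(fd, "end", 3, 5 + 1024 * 1024);   /* data after a 1MiB gap  */

            hole = lseek(fd, 0, SEEK_HOLE);          /* first hole >= offset 0 */
            printf("first hole at byte %lld\n", (long long)hole);

            close(fd);
            return 0;
        }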

NFS clients

  • enhancement of the GETATTR/READDIRPLUS heuristic for the case where a user runs ls -l while there are multiple readers of the directory.

  • further READDIR enhancements relating to caching of large directories with multiple readers and writers, and also the case where find is used to walk the directory tree.

  • rsize/wsize: previously, the NFS client mount options that set the read and write transfer sizes, rsize/wsize, were limited to power-of-2 values, for historical reasons related to the UDP network transport. Now, on transports other than UDP, this limitation is relaxed, and rsize & wsize may be set to any multiple of the page size (4KiB). Note that power-of-2 sizes are still available when setting rsize & wsize to a value smaller than the page size, which may be useful in some particular scenarios.
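
    For example, using the same raw mount(2) approach as in the RPC with TLS sketch earlier, a client may now request a transfer size of 20 pages (81920 bytes), which is a multiple of the 4KiB page size but not a power of two; server name, export path, and address are again placeholders:

        /* Sketch: request a non-power-of-two transfer size. */
        #include <stdio.h>
        #include <sys/mount.h>

        int main(void)
        {
            /* 81920 = 20 x 4KiB pages: a multiple of the page size,
             * but not a power of two. Server name, export path, and
             * addr= value are placeholders, as before. */
            const char *opts = "addr=192.0.2.1,rsize=81920,wsize=81920";

            if (mount("server.example.com:/export", "/mnt/nfs",
                      "nfs4", 0, opts) != 0) {
                perror("mount");
                return 1;
            }
            return 0;
        }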

  • the nconnect mount option is now available for NFS over RDMA mounts. This enables an NFS client to set up multiple connections to the same NFS server, potentially improving total throughput.

  • Improved throughput for random buffered writes.

  • Memory management: folios

    Traditionally, the Linux kernel manages memory in chunks known as pages. Memory pages may be various sizes, but on the common x86_64 & arm64 architectures they are generally 4KiB. This can lead to a large amount of page-handling overhead when dealing with memory operations involving dozens or hundreds of pages, reducing performance.

    NFS has previously used Linux kernel “compound pages”, which group multiple contiguous memory pages behind a single head page. However, compound-page usage can be awkward, and introduces further overhead in other areas, especially when kernel code needs to know whether a particular page it has been passed is a compound head page, one of its tail pages, or an ordinary single page.

    The Linux kernel now supports a new, much improved way to manage large groups of pages, called folios. Using folios, filesystems and the page cache can efficiently operate on large numbers of pages together, similarly to compound pages but without the overhead of resolving the head/tail page ambiguity, leading to noticeable increases in performance. The more consistent folio API also means the code is better structured, and more easily verified to be correct.

    For NFS in particular, the NFS client now uses folios in the code paths that deal with larger chunks of memory, i.e. for handling operations such as READ, WRITE, & READDIR.
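
    As a flavour of the folio API, here is a kernel-context sketch (not a standalone program, and not NFS’s actual code; the names used are from the mainline folio API, with locking and error handling elided). It walks a file’s page cache one folio at a time:

        #include <linux/pagemap.h>
        #include <linux/pagevec.h>

        /* Sum the bytes currently cached for a file range, one folio at
         * a time. Each folio may cover many contiguous pages, so this
         * loop makes far fewer trips than a per-page walk would. */
        static size_t cached_bytes(struct address_space *mapping,
                                   pgoff_t start, pgoff_t end)
        {
            struct folio_batch fbatch;
            size_t total = 0;
            unsigned int i;

            folio_batch_init(&fbatch);
            while (filemap_get_folios(mapping, &start, end, &fbatch)) {
                for (i = 0; i < folio_batch_count(&fbatch); i++) {
                    /* folio_size() may be 4KiB, or much larger */
                    total += folio_size(fbatch.folios[i]);
                }
                folio_batch_release(&fbatch);
            }
            return total;
        }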

NFS servers

  • The NFS server caches open file state on the NFS server itself, i.e. its own opens of files on the exported underlying filesystem. This filecache exists because opening a file is an expensive operation. It is used for all NFSv4 opens, and for files recently used by NFSv3 clients (as NFSv3 does not have an OPEN operation). Previously, this cache did not scale well to large numbers of open files, causing performance problems, and it also misbehaved in some cases, e.g. when an NFSv3 client attempted to execute a recently-opened file. Now, the filecache is much more scalable, and the related issues have been corrected.

  • Locking changes have improved scalability across CPUs, and thus performance, for NFSv4 operations such as file lookup, creation, rename and removal.

  • When the NFSv4 server receives a competing access request for a file that has been delegated to another client, it must first recall the delegation from that client before allowing the competing request to complete. Previously, the NFS server would immediately return an NFS4ERR_DELAY error to the competing client whilst waiting for the delegation to be returned. Now, the NFS server waits a short while for the delegation to be returned, potentially removing the need to send the DELAY at all if the delegation is returned quickly. This avoids unnecessary network requests.

  • The NFSv4 server imposes a limit on the number of NFS operations it will accept in a single COMPOUND procedure. This limit was previously 16, but has now been raised to 50. Some NFSv4 clients perform the mount operation in such a way that more than 16 operations may be needed to mount an exported path containing many components.

  • The NFSv4 server will now send a CB_RECALL_ANY request to ask clients to return some delegations, when the NFS server is low on resources, e.g. memory. The client may choose exactly what to return, based on its knowledge of what it is currently using.

  • Latency of NFSv4 operations in general has been improved by a new file handle hashing mechanism.

  • The NFS RDMA server code now ensures optimal NUMA memory allocations.

  • Improvements to the RPC server thread scheduler decrease scheduling latencies when selecting threads to wake up to handle & progress NFS client operations.

Diagnostics & Control

  • The NFS sysfs namespace, i.e. the files appearing under /sys/fs/nfs/, has been improved, and now includes the ability to stop a particular kernel RPC client. This provides a way of avoiding hangs on umount when an NFS server is known to be permanently unavailable.

  • NFSv4 servers are now able to revoke an NFS client’s file open and locking state. This has been available for NFSv3 for some time.

    A future article will explore these and other diagnostic and control options in more detail, showing how they may be used.

  • Continuing the improvements to debugging and diagnosis, a large number of ftrace events have been added. Work continues on having a subset of these events optionally enabled during normal production, to aid fixing problems on first occurrence without adversely impacting performance.
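
    For example, the NFS-related ftrace events live under the tracefs events directory, in groups such as nfs, nfs4, nfsd & sunrpc. A minimal sketch (run as root) that enables the server-side nfsd event group and streams the resulting trace records:

        /* Sketch: enable the nfsd ftrace event group and stream the
         * trace buffer. Requires root, and tracefs mounted at the
         * usual /sys/kernel/tracing location. */
        #include <stdio.h>

        int main(void)
        {
            FILE *en = fopen("/sys/kernel/tracing/events/nfsd/enable", "w");
            FILE *pipe;
            int c;

            if (en == NULL) { perror("enable"); return 1; }
            fputs("1", en);
            fclose(en);

            /* trace_pipe blocks until events arrive, then streams them */
            pipe = fopen("/sys/kernel/tracing/trace_pipe", "r");
            if (pipe == NULL) { perror("trace_pipe"); return 1; }
            while ((c = fgetc(pipe)) != EOF)
                putchar(c);
            return 0;
        }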

Miscellaneous

NFS clients

  • The NFS client can now detect when the NFS server is exporting a case-insensitive or case-preserving filesystem, and ensure that it handles this correctly, e.g. when dealing with directory-entry (dentry) caches.

  • Normally, if an open file is removed on an NFS client, the NFS client code silently renames the file to a name of the form “.nfsXXXX”, rather than simply removing it, so that the file is correctly preserved on the NFS server; once all open references to the file are closed, the NFS client then removes the .nfsXXXX file. This often led to confusion when users noticed such a file. Now, if an NFSv4 server indicates to the NFS client that it will preserve a removed file while the client still has open references to it (including across an NFS server restart), the client does not need to rename the file, and may safely remove it. This requires the NFS server to support this client notification, which is not yet widely available.

  • The file access cache is now cleared upon user login, so that the NFS client sees any user supplementary group changes that have occurred on the NFS server since the last user login.

  • Previously, NFS & RPC statistics viewed from within Linux Containers showed global statistics in some cases. Now, these statistics are solely for the container’s own network namespace.

  • The pNFS Block Layout client now supports NVMe device names, i.e. nvme-eui. path names, as per RFC 9561 (the pNFS SCSI layout for NVMe devices).

  • noalignwrite mount option. Some applications wish to write concurrently, from multiple clients, to non-overlapping regions of a file, avoiding the need for file locking (since the writes do not conflict). This has not previously worked reliably over NFS, because the NFS client extends non-aligned write operations to page boundaries, which can make them overlap. The new noalignwrite mount option disables that extension, allowing non-overlapping writes to proceed correctly without file locking. NOTE that this option should be used with great care: conflicting concurrent writes from multiple NFS clients, i.e. without file locking, may result in file data corruption if not done correctly.
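
    As a sketch of the access pattern this enables (the file path is a placeholder on a mount made with -o noalignwrite, and client_id distinguishes the cooperating clients), each client writes only its own, deliberately non-page-aligned, record:

        /* Sketch: non-overlapping, non-page-aligned writes from
         * cooperating clients. Each client owns a disjoint 1000-byte
         * record, so no file locking is needed; with noalignwrite the
         * NFS client no longer extends these writes to page
         * boundaries, so they cannot be made to overlap. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>

        #define RECORD_SIZE 1000   /* deliberately not page-aligned */

        int main(int argc, char **argv)
        {
            int client_id = (argc > 1) ? atoi(argv[1]) : 0;
            char record[RECORD_SIZE];
            int fd;

            /* placeholder path on a mount made with -o noalignwrite */
            fd = open("/mnt/nfs/shared.dat", O_WRONLY);
            if (fd < 0) { perror("open"); return 1; }

            memset(record, 'A' + client_id, sizeof(record));
            if (pwrite(fd, record, sizeof(record),
                       (off_t)client_id * RECORD_SIZE) < 0) {
                perror("pwrite");
                return 1;
            }
            close(fd);
            return 0;
        }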

  • NFSv4 client support for attribute delegations. An attribute delegation permits an NFS client to manage a file’s modification time attribute (mtime), rather than flushing dirty data to the NFS server so that the file’s mtime reflects the last write, which is considerably slower. Not all NFSv4 servers currently support attribute delegations.

NFS servers

  • The NFS server now supports returning the btime “birth time” attribute of the statx(2) system call, i.e. a file’s creation time.
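
    On the client side, the birth time may be queried with statx(2), as in the following sketch; since STATX_BTIME is only a request, the returned mask must be checked to see whether the server (and its exported filesystem) actually supplied the attribute:

        /* Sketch: query a file's birth (creation) time via statx(2).
         * STATX_BTIME is a request; the server or filesystem may not
         * supply it, so the result mask must be checked. */
        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/stat.h>

        int main(int argc, char **argv)
        {
            const char *path = (argc > 1) ? argv[1] : ".";
            struct statx stx;

            if (statx(AT_FDCWD, path, 0, STATX_BTIME, &stx) != 0) {
                perror("statx");
                return 1;
            }
            if (stx.stx_mask & STATX_BTIME)
                printf("%s: btime = %lld\n", path,
                       (long long)stx.stx_btime.tv_sec);
            else
                printf("%s: birth time not available\n", path);
            return 0;
        }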

Summary

In this note we’ve looked at the changes and new features relating to NFS & RPC, for both clients and servers, in the latest Unbreakable Enterprise Kernel Release 8.