
What’s new for NFS in Unbreakable Enterprise Kernel Release 6?

Oracle Linux kernel engineer Calum Mackay provides some insight into the new features for NFS in release 6 of the Unbreakable Enterprise Kernel (UEK).

 

UEK R6 is based on the upstream long-term stable Linux kernel v5.4, and introduces many new features compared to the previous version UEK R5, which is based on the upstream stable Linux kernel v4.14.

In this blog, we look at what has been improved in the UEK R6 NFS client & server implementations.

Server-side Copy (NFSv4.2 clients & servers)

UEK R6 adds initial experimental support for parts of the NFSv4.2 server-side copy (SSC) mechanism.

This is a feature that considerably increases efficiency when copying a file between two locations on a server, via NFS. Without SSC, this operation requires that the NFS client use READ requests to read all the file's data, then WRITE requests to write it back to the server as a new file, with every byte travelling over the network twice.

With SSC, the NFS client may use one of two new NFSv4.2 operations to ask the server to perform the copy locally, on the server itself, without the file data needing to traverse the network at all. Obviously this will be enormously faster.

1. NFS COPY

NFS COPY is a new operation which can be used by the client to request that the server locally copy a range of bytes from one file to another, or indeed the entire file.

However, NFS COPY requires that the client application use the copy_file_range system call. Currently, no bundled utilities in Linux distributions appear to make use of this system call, but an application may easily be written or converted to use it. Once client utilities gain support for copy_file_range, they will be able to make use of the NFSv4.2 COPY operation.

Note that NFS COPY does not require any special support within the NFS server filesystem itself.

2. NFS CLONE

The new NFS CLONE operation allows clients to ask the server to use the exported filesystem's reflink mechanism to create a copy-on-write clone of the file, elsewhere within the same server filesystem.

NFS CLONE requires the use of client utilities that support reflink; currently cp includes this support, with its --reflink option.

In addition, NFS CLONE requires that the NFS server filesystem supports the reflink operation. The filesystems available in Oracle Linux NFS servers that support the reflink operation are btrfs, OCFS2 & XFS.

NFS CLONE is even faster than NFS COPY, since it uses copy-on-write on the NFS server to clone the file, provided the source and destination files are within the same filesystem. Note that in some cases the server filesystem may need to have been originally created with reflink support, especially if it was created on Oracle Linux 7 or earlier.
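
As a quick illustration (the paths are hypothetical, and this assumes an NFSv4.2 mount at /mnt whose exported server filesystem supports reflink):

# command line: ask the server to create a copy-on-write clone; with
# --reflink=always, cp fails rather than falling back to a full data copy
cp --reflink=always /mnt/data/source.img /mnt/data/clone.img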

The NFSv4.2 SSC design specifies both intra-server and inter-server operations. UEK R6 supports intra-server operations, i.e. the source and destination files exist on the same NFS server. Support for inter-server SSC (copies between two NFS servers) will be added in the future.

Use of these features requires that both NFS client and server support NFSv4.2 SSC; currently server support for SSC is only available with Linux NFS servers.

As an illustration of the performance gains possible with NFSv4.2 SSC, here are the timings for copying a 2GB file between two locations on an NFS server, over a relatively slow network:

 

Method                                                         Time
Traditional NFS READ/WRITE                                     5 mins 22 seconds
NFSv4.2 COPY (via custom app using copy_file_range syscall)    12 seconds

 

SSC is specific to NFS version 4.2 or greater. In Oracle Linux 7, NFSv4.2 is supported (provided the latest UEK kernel and userland packages are installed), but it is not the default NFS version used by an NFS client when mounting filesystems.

By default, an OL7 NFS client will mount using NFSv4.1 (provided the NFS server supports it). An NFSv4.2 mount may be performed on an OL7 client, as follows:

# command line
mount -o vers=4.2 server:/export /mnt

# /etc/fstab
server:/export  /mnt    nfs noauto,vers=4.2         0   0

An Oracle Linux 8 NFS client will mount using NFSv4.2 by default. Just like OL7, if the NFS server does not support that, the OL8 client will try to successively lower NFS versions until it finds one that the server supports.
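
To confirm which NFS version a client actually negotiated for a given mount, the options in effect can be inspected on the client, for example:

# command line: show the mount options in effect, including vers=
nfsstat -m

# the same information is visible in the kernel's mount table
grep nfs /proc/mounts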

Multiple TCP connections per NFS server (NFSv4.1 and later clients)

For NFSv4.1 and later mounts over TCP, a new nconnect mount option enables an NFS client to set up multiple TCP connections, using the same client network interface, to the same NFS server.

This may improve total throughput in some cases, particularly with bonded networks. Multiple transports allow hardware parallelism on the network path to be fully exploited. However, there are improvements even when using just one NIC, thanks to various efficiency savings.

Using multiple connections will help most when a single TCP connection is saturated while the network itself and the server still have capacity. It will not help if the network itself is saturated, and throughput will still be bounded by the performance of the storage at the NFS server.
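
For example, a client might request several connections like this (the value shown is arbitrary; the kernel typically accepts values up to 16):

# command line: open 4 TCP connections to the same server
mount -o vers=4.1,nconnect=4 server:/export /mnt

# /etc/fstab
server:/export  /mnt    nfs noauto,vers=4.1,nconnect=4  0   0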

Enhanced statistics reporting has been added to report on all transports when using multiple connections.

Improved handling of soft mounts (NFSv4 clients)

NOTE: we do not recommend the use of the soft and rw mount options together (and remember that rw is the default) unless you fully understand the implications, including possible data loss or corruption.

By default, i.e. without the soft mount option, NFS mounts are described as hard, which means that NFS operations will not time out in the case of an unresponsive NFS server or a network partition. NFS operations, including READ and WRITE, will wait indefinitely, until the NFS server is again reachable. In particular, this means that any such affected NFS filesystem cannot be unmounted, and the NFS client system itself cannot be cleanly shut down, until the NFS server responds.

When an NFS filesystem is mounted with the soft mount option, NFS operations will time out after a certain period (based on the timeo and retrans mount options) and the associated system calls (e.g. read, write, fsync, etc.) will return an EIO error to the application. The NFS filesystem may be unmounted, and the NFS client system may be cleanly shut down.
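
Purely to illustrate how these options fit together (server name, export and values are arbitrary, and the mount is read-only given the caveats above):

# command line: read-only soft mount; timeo is in tenths of a second, and with
# soft the request fails with EIO after retrans retransmissions have been sent
mount -o ro,soft,timeo=100,retrans=3 server:/export /mnt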

This might sound like a useful feature, but it can cause problems, which can be especially severe in the case of rw (read-write) filesystems, because of the following:

  • Client applications often don't expect to get EIO from file I/O request system calls, and may not handle them appropriately.

  • NFS uses asynchronous I/O, which means that the first client write system call isn't necessarily the one that returns the error; the error may instead be reported by a subsequent write, or close, or perhaps only by a subsequent fsync, which the client might not even perform. Nor is close guaranteed to report the error. Obviously, reporting the error via a subsequent write or fsync makes it harder for the application to deal with correctly.

  • Write interleaving may mean that the NFS client kernel can't always precisely track which file descriptors are involved, so the error is not guaranteed to be delivered at all, or may not be delivered via the right descriptor on close/fsync.

It's important to realize that the above issues may result in NFS WRITE operations being lost, when using the soft mount option, resulting in file corruption and data loss, depending on how well the client application handles these situations.

For that reason, it is dangerous to use the soft mount option with rw mounted filesystems, even with UEK R6, unless you are fully aware of how your application(s) handle EIO errors from file I/O request system calls.

In UEK R6, the handling of soft mounts with NFSv4 has been improved, in particular:

  • Reducing the risk of false-positive timeouts, e.g. in the case where the NFS server is merely congested.

  • Faster failover of NFS READ and WRITE operations after a timeout.

  • Better association of errors with process/fd.

  • A new optional softerr mount option, which returns ETIMEDOUT (instead of EIO) to the application after a timeout. Applications written to be aware of this can then differentiate the timeout case, e.g. to drive a failover response, from other I/O errors, and choose a different recovery action for each. Mounts using only the soft mount option still see the other improvements, but timeout errors continue to be returned to the application as EIO.
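
As a sketch of how this might be requested at mount time (server name, export and option values are arbitrary; softerr provides soft-style timeout behaviour but reports ETIMEDOUT):

# command line: timed-out operations return ETIMEDOUT rather than EIO;
# the data-integrity caveats above still apply to rw mounts
mount -o vers=4.2,softerr,timeo=100,retrans=3 server:/export /mnt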

Existing applications that have not been written specifically to deal with ETIMEDOUT/EIO will still benefit to an extent from the general improvements when using the soft mount option with NFSv4, as follows:

  • The client kernel will give the server longer to reply, without returning EIO to the application, as long as the network connection remains connected, for example if the server is badly congested.

  • Swifter handling of real timeouts, and better correlation of error to process file descriptor.

Be aware that the same caveats still apply: it's still dangerous to use soft with rw mounts unless you fully understand the implications, and all client applications are correctly written to handle the issues. If you are in any doubt about whether your applications behave correctly in the face of EIO & ETIMEDOUT errors, do not use soft rw mounts.

New knfsd file descriptor cache (NFSv3 servers)

UEK R6 NFSv3 servers benefit from a new knfsd file descriptor cache, so that the NFS server's kernel doesn't have to perform internal open and close calls for each NFSv3 READ or WRITE. This can speed up I/O in some cases. It also replaces the readahead cache.

When an NFSv3 READ or WRITE request comes in to an NFS server, knfsd initially opens a new file descriptor, then it performs the read/write, and finally it closes the fd. While this is often a relatively inexpensive thing to do for most local server filesystems, it is usually more costly for FUSE, clustered, networked and other filesystems with a slow open routine that are being exported by knfsd.

This improvement attempts to reduce some of that cost by caching open file descriptors so that they may be reused by other incoming NFSv3 READ/WRITE requests for the same file.

Performance

General

  • Much work has been done to further improve the performance of NFS & RPC over RDMA transports.

  • The performance of RPC requests has been improved, by removing BH (bottom-half soft IRQ) spinlocks.

NFS clients

  • Optimization of the default readahead size, to suit modern NFS block sizes and server disk latencies.

  • RPC client parallelization optimizations.

  • Performance optimizations for NFSv4 LOOKUP operations, and delegations, including not unnecessarily returning NFSv4 delegations, and locking improvements.

  • Support the statx mask & query flags to enable optimizations when the user is requesting only attributes that are already up to date in the inode cache, or is specifying AT_STATX_DONT_SYNC.

NFS servers

  • Remove the artificial limit on NFSv4.1 performance caused by capping the number of outstanding RPC requests from a single client.

  • Increase the limit on the number of concurrent NFSv4.1 clients, i.e. stop a few greedy clients from using up all of the NFS server's session slots.

Diagnostics

  • To improve debugging and diagnosability, a large number of ftrace events have been added. Work to follow will include having a subset of these events optionally enabled during normal production, to aid fixing problems on first occurrence without adversely impacting performance.

  • Expose information about NFSv4 state held by servers on behalf of clients. This is especially important for NFSv4 OPEN calls, which are currently invisible to user space on the server, unlike locks (/proc/locks) and local processes' opens (/proc/pid/). A new directory (/proc/fs/nfsd/clients/) is added, with subdirectories for each active NFSv4 client. Each subdirectory has an info file with some basic information to help identify the client and a states/ directory that lists the OPEN state held by that client. This also allows forced revocation of client state. See the example following this list.

  • (NFSv3/NLM) Cleanup and modify lock code to show the pid of lockd as the owner of NLM locks.
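
As an illustration, the new interface might be explored on the NFS server as follows (the exact contents depend on the clients currently connected):

# on the NFS server: one subdirectory per active NFSv4 client
ls /proc/fs/nfsd/clients/

# basic identifying information for each client
cat /proc/fs/nfsd/clients/*/info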

Miscellaneous

NFS clients

  • Finer-grained NFSv4 attribute checking.

  • For NFS mounts over RDMA, the port=20049 (sic) mount option is now the default.
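
For illustration (server name and export are hypothetical), an RDMA mount no longer needs the port given explicitly:

# command line: port=20049 is now implied for NFS-over-RDMA mounts
mount -o proto=rdma,vers=4.2 server:/export /mnt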

NFS servers

  • Locking and data structure improvements for duplicate request cache (DRC) and other caches.

  • Improvements for running NFS servers in containers, including replacing the global duplicate reply cache with separate caches per network namespace; it is now possible to run separate NFS server processes in each network namespace, each with their own set of exports.

NFSv3 clients and servers

  • Improved handling of correctness and reporting of NFS WRITE errors, on both NFSv3 clients and servers. This is especially important given that NFS WRITE operations are generally done asynchronously to application write system calls.

Summary

In this blog we've looked at the changes and new features relating to NFS & RPC, for both clients and servers available in the latest Unbreakable Enterprise Kernel Release 6.
