This blog entry is by Linux kernel developer Chuck Lever. tcpdump and rpcdebug have historically been the go-to tools for NFS client troubleshooting. Lately they are showing their age. At Oracle, we want to identify existing tools and provide new ones that can observe NFS client operation under heavy workloads and on high performance network fabrics without affecting timing or risking loss of capture information.
Historical NFS and RPC Debugging Facilities
Unless network traffic is encrypted, NFS operations are exposed to anyone snooping the network between NFS clients and an NFS server. NFS operations can be captured on either a client or a server by snooping tools such as tcpdump, wireshark, or snoop. These tools can dissect the NFS operations contained in a network capture (pcap) file, save them to a file for later analysis, or display them in a human-readable parsed format.
This is one of the most valuable tools that an NFS developer can use, but it suffers from some drawbacks:
- The tool has to be running when a problem occurs to capture the most useful data. Data captured after a problem occurs is generally not useful.
- A capture tool can generate an enormous amount of data very quickly. Clever use of packet filtering and capture into a ring of files can help mitigate the amount of data that is left to be analyzed.
- Because of the high data rate, the tool can introduce significant overhead on the system where the capture is being done. Timing can be altered, preventing the problem from being reproduced. Or the system may not have enough horsepower to keep up, and critical packets can be missing from the capture.
- User data is also captured by this tool, swelling the size of the resulting capture file, and creating privacy concerns when NFS is being used to store sensitive data.
Getting the capture configuration exactly right is difficult, and a capture often has to be run repeatedly until the correct set of circumstances allows a clean capture of the underlying issue.
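As a concrete illustration of the filtering and ring-of-files mitigations mentioned above, a capture invocation might look like the following sketch. The interface name, file sizes, and output path are assumptions to adjust for your system; port 2049 is the standard NFS port.

```shell
# Capture NFS traffic (port 2049) into a ring of eight 100 MB files,
# overwriting the oldest file when the ring fills (-C size, -W count).
# The interface name (eth0) and output path are placeholders.
# -s 256 truncates each packet to 256 bytes, preserving protocol
# headers while leaving most user data out of the capture file.
tcpdump -i eth0 -s 256 -C 100 -W 8 -w /tmp/nfs-capture.pcap port 2049
```

Truncating the snapshot length also reduces both the privacy exposure and the capture volume noted earlier, at the cost of losing the I/O payloads.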
Linux has an internal tracing mechanism built into the NFS and RPC client and server stacks. This is known as the “rpcdebug” facility, named for the tool that is used to control it. When enabled, a message is added to the kernel log (usually in /var/log/messages) for specific events, such as an NFS READ being sent, or for each step of RPC execution.
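Controlling this facility looks roughly like the following sketch (it requires root privileges, and the exact set of debug flags varies by kernel version):

```shell
# Enable all NFS client debugging messages; they appear in the
# kernel log (e.g. /var/log/messages or the journal).
rpcdebug -m nfs -s all

# ... reproduce the problem, then turn the messages off again:
rpcdebug -m nfs -c all

# The RPC layer is a separate module with its own flags, for
# example transport and call events:
rpcdebug -m rpc -s xprt call
```

Running rpcdebug with only a module argument (for example, "rpcdebug -m nfs") prints the flags currently set for that module.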
This mechanism suffers from similar shortcomings as network capture. More critically, it writes into the kernel log and onto the system console rather than into user files. This has significant drawbacks:
- The system console is often a real serial port. This restricts the rate at which debugging messages can be generated, and slows NFS operation to a crawl. It is not appropriate to use rpcdebug on a production system running a heavy NFS workload.
- Modern Linux distributions have rate-limiting built into their kernel log, so that after a few hundred debugging messages appear in a short period of time, the rate-limiter throws new messages away. During problem reproduction, rate-limiting often occurs well before the issue is reproduced.
In addition, the granularity of the debugging is quite coarse. To see each NFS GETATTR operation, for example, requires enabling debugging messages for all NFS operations, which generates a large volume of data.
Introducing static trace points
Given the frequent use of NFS for large workloads, on systems with many concurrent users of NFS, or on increasingly fast networks, the traditional debugging mechanisms are no longer viable. Instead, a mechanism that has low overhead and flexible event filtering is needed. The Linux NFS community has adopted static trace points for this purpose.
These trace points are a fixed part of the source code, added by developers at points where important information (RPC XID, user ID, size of I/O, and error codes, for example) is available. Trace points are always available, and have low overhead whether enabled or disabled. In addition, each individual trace point can be toggled on or off as needed. Trace data is directed to a user file and stored in a compact format. User data is not exposed by trace points, further reducing the size of captured data.
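Toggling an individual trace point can be done directly through the kernel's tracefs interface, as sketched below. This assumes a kernel that provides the nfs_initiate_read trace point (v4.14 or later) and root privileges; on newer distributions the tracefs mount point may be /sys/kernel/tracing rather than /sys/kernel/debug/tracing.

```shell
# Location of the tracefs mount point (adjust for your system).
TRACING=/sys/kernel/debug/tracing

# Enable only the nfs_initiate_read trace point:
echo 1 > $TRACING/events/nfs/nfs_initiate_read/enable

# Stream events as they are generated (Ctrl-C to stop):
cat $TRACING/trace_pipe

# Disable the trace point again when finished:
echo 0 > $TRACING/events/nfs/nfs_initiate_read/enable
```

Because each event is enabled individually, only the operations of interest generate data, in contrast to the all-or-nothing behavior of rpcdebug.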
There are already multiple NFS trace points in the Linux kernel NFS client. These capture various operational aspects of NFS client and RPC layer operation. However, Oracle recently contributed a set of trace points that specifically capture NFS I/O operations. The new trace points first appeared in the v4.14 Linux kernel NFS client.
This includes the start of every NFS READ, WRITE, and COMMIT operation, and the operation’s arguments (except for the data payload); and the end of every NFS READ, WRITE, and COMMIT operation, including any error that resulted. Each of these six trace points can be enabled separately or all at once. Each generated event includes timestamp information and information about the CPU core and process that requested the operation.
These trace points can be used in sophisticated ways, such as:
- Recording a series of individual NFS READ operations, then using the recorded timestamps to generate deep statistical outlier analysis
- During a production workload, watching NFS WRITE results for a specific error condition
- Verifying that an NFS COMMIT works correctly during a server reboot, even with a heavy workload
- Examining how NFSv4 state recovery affects NFS I/O operations
NFS trace points are controlled, just like other static trace points in the kernel, by using the trace-cmd tool. Use “trace-cmd list | grep nfs:” to discover trace points related to the kernel NFS client. The trace points introduced to the NFS client in v4.14 include nfs:nfs_initiate_read, nfs:nfs_readpage_done, nfs:nfs_initiate_write, nfs:nfs_writeback_done, nfs:nfs_initiate_commit, and nfs:nfs_commit_done.
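A typical session with these trace points might look like the sketch below. The NFS mount path and the workload command are placeholders; trace-cmd must be run as root.

```shell
# Discover the NFS client trace points available on this kernel:
trace-cmd list | grep nfs:

# Record read and write events while running a workload; trace-cmd
# enables the named events, runs the command, then disables them.
# /mnt/nfs/file is a placeholder for a file on an NFS mount.
trace-cmd record -e nfs:nfs_initiate_read -e nfs:nfs_readpage_done \
                 -e nfs:nfs_initiate_write -e nfs:nfs_writeback_done \
                 dd if=/mnt/nfs/file of=/dev/null bs=1M count=100

# Display the recorded events, including timestamps and the CPU
# core and process that issued each operation:
trace-cmd report
```

The events land in a compact trace.dat file in the current directory, which can be copied elsewhere and analyzed offline with trace-cmd report.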
Oracle recognizes how critical good troubleshooting and observability tools are for system administrators responsible for NFS deployments. Our intention is to continue to contribute facilities to Linux that provide deep visibility without interfering with normal operation.