Chuck Lever is a Linux Kernel Architect working with the Oracle Linux and Unbreakable Enterprise Kernel team at Oracle. He contributed this article about replacing printk debugging with static trace points in the kernel.
These days, kernel developers have several choices when it comes to reporting exceptional events: a console message, a static trace point, DTrace, or a Berkeley Packet Filter script. Of the kernel debugging techniques currently in vogue, arguably the best choice for building an observability framework into the kernel is the judicious use of static trace points. Here's why.
Years ago, IBM coined the term First Failure Data Capture (FFDC): capture enough data about a failure, as it occurs the first time, that reproducing the failure becomes all but unnecessary. An observability framework is a set of tools that enables system administrators to monitor and troubleshoot systems running in production without interfering with efficient operation. In other words, it captures enough data about any failure that occurs that the failure can be root-caused, and possibly even fixed, without the need to reproduce it in vitro.
Of course, FFDC is an aspirational goal. There will always be a practical limit to how much data can be collected, managed, and analyzed without impacting normal operation. The key is to identify important exceptional events and place hooks in those areas to record those events as they happen. These exceptional events are hopefully rare enough that the captured data is manageable. And the hooks themselves must introduce little or no overhead to a running system.
The trace point facility, part of the kernel's ftrace tracing infrastructure, has existed in the Linux kernel for over a decade. Each static trace point is an individually enabled call-out that records a set of data as a structured record into a circular buffer. An area expert determines where each trace point is placed, what data is stored in the structured record, and how the stored record should be displayed (i.e., a print format specifier string). The format of the structured record acts as a kernel API, and it is much simpler to parse than the free-form strings output by printk(). User space tools can filter trace data based on values contained in the fields (e.g., show only trace events where "status != 0").
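To make this concrete, here is a minimal sketch of a trace point definition using the kernel's TRACE_EVENT() convention. The event name (sample_io_done), its arguments, and its fields are hypothetical; the TP_STRUCT__entry() field layout and the TP_printk() format specifier correspond to the structured record and display format described above.

```c
#undef TRACE_SYSTEM
#define TRACE_SYSTEM sample

#if !defined(_TRACE_SAMPLE_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_SAMPLE_H

#include <linux/tracepoint.h>

TRACE_EVENT(sample_io_done,
	/* Arguments passed by the caller at the trace point site */
	TP_PROTO(unsigned int id, int status),
	TP_ARGS(id, status),

	/* Layout of the structured record placed in the ring buffer */
	TP_STRUCT__entry(
		__field(unsigned int, id)
		__field(int, status)
	),

	/* How the caller's arguments are copied into that record */
	TP_fast_assign(
		__entry->id = id;
		__entry->status = status;
	),

	/* Print format specifier used when the record is displayed */
	TP_printk("id=%u status=%d", __entry->id, __entry->status)
);

#endif /* _TRACE_SAMPLE_H */

/* This part must be outside the include guard */
#include <trace/define_trace.h>
```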
Each trace point is always available to use, as it is built into the code. When triggered, a trace point can do more than capture the values of a few variables: it also records a timestamp, whether interrupts are enabled, which CPU the event occurred on, and the PID and name of the running executable. A trace point can also enable or disable other trace points, or capture a stack trace. DTrace and eBPF scripts can attach to a trace point, and histogram triggers are also possible.
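Here is a sketch of the corresponding call site for the hypothetical event defined above. The trace_sample_io_done() helper is generated by the TRACE_EVENT() macro; following the usual kernel convention, exactly one C file defines CREATE_TRACE_POINTS before including the trace header so that the trace point itself is instantiated.

```c
#define CREATE_TRACE_POINTS
#include "trace_sample.h"	/* the hypothetical header sketched above */

static void sample_io_complete(unsigned int id, int status)
{
	/* ... normal completion handling ... */

	/* Fires only when the trace point is enabled; otherwise nearly free */
	trace_sample_io_done(id, status);
}
```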
Trace point buffers are allocated per CPU to eliminate memory contention and lock waiting when a trace event is triggered. There is a default set of buffers ready from system boot onward. However, trace point events can be directed into separate buffers. This permits several different tracing operations to occur concurrently without interfering with each other. These buffers can be recorded into files, transmitted over the network, or read from a pipe. If a system crash should occur, captured trace records still reside in these buffers and can be examined using crash dump analysis tools.
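As a small illustration of the "read from a pipe" case, the user-space sketch below drains formatted trace records from the default buffers through tracefs. The mount point is an assumption; on older systems the same file appears under /sys/kernel/debug/tracing.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	ssize_t n;
	int fd;

	fd = open("/sys/kernel/tracing/trace_pipe", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* read() blocks until new trace records arrive, then consumes them */
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		fwrite(buf, 1, n, stdout);

	close(fd);
	return 0;
}
```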
Trace points can be safely placed in code that runs in interrupt context as well as code that runs in process context, unlike printk(). Also unlike printk(), trace points can be enabled individually, rather than enabling every printk() at a particular log level. Groups of trace points can be conveniently enabled or disabled with a single operation, and can be combined with other, more advanced ftrace facilities such as function_graph tracing.
Trace points are designed to be low overhead, especially when they are disabled. The code that structures trace point data and inserts it into a trace buffer is out-of-line, so a disabled trace point adds very little instruction cache footprint. The actual code at the call site is nothing more than a load and a conditional branch. This is unlike some debugging mechanisms that place no-ops in the code and then patch the code when they are enabled; code patching is not effective when the executable resides in read-only memory, but a trace point in that same code continues to work.
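Conceptually, the inline helper generated for each event boils down to something like the simplified sketch below. This is not the kernel's actual implementation (which uses static keys and jump labels), but the call-site cost has the same shape: a load and a conditional branch around an out-of-line recording path.

```c
#define unlikely(x)	__builtin_expect(!!(x), 0)

/* Simplified stand-in for the per-event "enabled" state the kernel keeps */
struct tracepoint {
	int enabled;				/* flipped when the event is switched on */
	void (*probe)(unsigned int id, int status);
};

static struct tracepoint __tracepoint_sample_io_done;

static inline void trace_sample_io_done(unsigned int id, int status)
{
	/* Load the flag and branch; the disabled case falls straight through */
	if (unlikely(__tracepoint_sample_io_done.enabled))
		__tracepoint_sample_io_done.probe(id, status);	/* out-of-line record path */
}
```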
In contrast, printk() logs messages onto the system console and directly into the kernel's log file (typically /var/log/messages). In recent Linux distributions, kernel log output is rate-limited, which means an important but noisy stream of printk() messages can be temporarily disabled by software just before that one critical log message comes out. In addition, in lights-out environments, the console can be a serial device set to a relatively low baud rate. A copious stream of printk() messages can trigger workqueue stalls or worse as the console device struggles to keep up.
We've described a good way to identify and record exceptional events, using static trace points. How are the captured events recorded to a file for analysis? The trace-cmd(1) tool permits a privileged user to specify a set of events to enable and direct the output to a file or across a network, and then filter and display the captured data. This tool is packaged and available for download in Oracle Linux RPM channels. A graphical front-end for trace-cmd called kernelshark is also available.
In addition, Oracle has introduced a facility for continuous monitoring of trace point events called Flight Data Recorder (FDR for short). FDR is started by systemd and enables a configured set of trace points to monitor. It captures event data to a round-robin set of files, limiting the amount of data so it does not overrun the local root filesystem. A configuration file allows administrators to adjust the set of trace points that are monitored. Because this facility is always on, it can capture events leading up to a crash. The captured trace point data is available in files, or it can be examined with crash analysis tools.
To keep this article short, we've left out plenty of other benefits and details about static trace points. You can read more about them by following the links below. These links point to articles about trace point-related user space tools, clever tips and tricks, how to insert trace points into your code, and much much more.
There are several links to lwn.net (http://lwn.net/) above. lwn.net is such a valuable resource to the Linux community. I encourage everyone to consider a subscription!