Here we are going to describe what tracepoints are, how they are defined and finally demonstrate the various ways they can be used. By fleshing out all of the steps, I’m hoping others may find this process a bit easier.
As always it’s a good idea to start with the Linux kernel documentation : https://www.kernel.org/doc/Documentation/trace/tracepoints.txt
What are tracepoints?
In kernel tracing, we have various classes of events we can trace.
We can trace kernel function entry and return with kprobes, but the problem is functions appear and disappear as the kernel evolves, so this isn’t a very stable way of tracing functionality. Further, kernel functions may be inlined sometimes. If you’re not famililar, inlining occurs when the compiler (perhaps with an inline suggestion from the developer or indeed an insistence using __always_inline) replaces a call to a function with the body of that function every time it is called. This can increase the kernel image size, but can be a saving too, avoiding the dance required when the function is called.
If a function is inlined, it cannot be kprobe’d as easily. Note that it’s not impossible, but in such cases we’d need to figure out the offset in the function body where inline code begins (kprobes can be placed on most instructions, not just function entry).
So all of the above is a long-winded way of establishing that we need something else other than kprobes in kernel tracing. Tracepoints are that something else.
Tracepoints simply represent hooks to events of interest. We’ll talk about what we hook to them later on, but the idea is to represent events which may occur in possibly multiple places in the code rather than have something so tied to implementation details such as function entry.
What tracepoints are available?
Take a look at /sys/kernel/debug/tracing/events. The directories here represent the tracing subsystems that are available.
On a 5.3 kernel this looks like
$ sudo ls /sys/kernel/debug/tracing/events alarmtimer ftrace kmem page_pool sunrpc bcache gpio kvm percpu swiotlb block hda kvmmmu power syscalls bpf_test_run hda_controller kyber preemptirq task bridge hda_intel libata printk tcp cfg80211 header_event lock qdisc thermal cgroup header_page mac80211 random timer clk huge_memory mac80211_msg ras tlb cma hwmon mce raw_syscalls udp compaction hyperv mdio rcu v4l2 context_tracking i2c mei regmap vb2 cpuhp i915 migrate regulator vmscan devfreq initcall module rpcgss vsyscall devlink intel_ish mpx rpm wbt dma_fence iommu msr rseq workqueue drm irq napi rtc writeback enable irq_matrix neigh sched x86_fpu exceptions irq_vectors net scsi xdp fib iwlwifi nfsd signal xen fib6 iwlwifi_data nmi skb xfs filelock iwlwifi_io oom smbus xhci-hcd filemap iwlwifi_msg page_isolation snd_pcm fs_dax iwlwifi_ucode pagemap sock
An individual subsystem may define multiple events; for example in the net subsystem we see:
$ sudo ls /sys/kernel/debug/tracing/events/net enable netif_receive_skb filter netif_receive_skb_entry napi_gro_frags_entry netif_receive_skb_exit napi_gro_frags_exit netif_receive_skb_list_entry napi_gro_receive_entry netif_receive_skb_list_exit napi_gro_receive_exit netif_rx net_dev_queue netif_rx_entry net_dev_start_xmit netif_rx_exit net_dev_xmit netif_rx_ni_entry net_dev_xmit_timeout netif_rx_ni_exit
The directories correspond to recognizable events in generic network stack processing.
What about the enable and filter file entries above? If we look we notice they are marked rw. As you might guess, the enable file is used to switch on the probe or set of probes. We can echo 1 to the appropriate enable file to enable events. For more details please refer to: https://www.kernel.org/doc/Documentation/trace/events.txt
We can also filter events using filter expressions which operate on trace data fields. We’ll describe both these features below.
If we look in the directory of a specific trace event – say netif_rx – we see:
$ sudo ls -al /sys/kernel/debug/tracing/events/net/netif_rx total 0 drwxr-xr-x. 2 root root 0 Oct 18 10:20 . drwxr-xr-x. 20 root root 0 Oct 18 10:20 .. -rw-r--r--. 1 root root 0 Oct 18 10:20 enable -rw-r--r--. 1 root root 0 Oct 18 10:20 filter -r--r--r--. 1 root root 0 Oct 18 10:20 format -r--r--r--. 1 root root 0 Oct 18 10:20 id -rw-r--r--. 1 root root 0 Oct 18 10:20 trigger
So we see some additional files here.
id is a unique identifier that can be used to attach to a specific tracepoint:
$ sudo cat /sys/kernel/debug/tracing/events/net/netif_rx/id 1320
The format describes the fields used by the tracepoint. Each tracepoint prints a message which can be displayed by tracing tools, and to help it to do so it defines a set of fields which are populated by the raw tracing arguments.
$ sudo cat /sys/kernel/debug/tracing/events/net/netif_rx/format
name: netif_rx
ID: 1320
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1;signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:void * skbaddr; offset:8; size:8; signed:0;
field:unsigned int len; offset:16; size:4; signed:0;
field:__data_loc char[] name; offset:20; size:4; signed:1;
print fmt: "dev=%s skbaddr=%p len=%u", __get_str(name), REC->skbaddr, REC->len
Here we see how the formatted message displays the device name, the address of the skb, etc.
Finding tracepoints in the Linux source
Taking the netif_rx tracepoint, let’s see where it’s placed. The convention is that tracepoints are called via trace_<tracepoint_name>, so we need to look for trace_netif_rx().
If we look in net/core/dev.c we see:
static int netif_rx_internal(struct sk_buff *skb)
{
int ret;
net_timestamp_check(netdev_tstamp_prequeue, skb);
trace_netif_rx(skb);
...
We see the tracepoint is called with a struct sk_buff * as an argument. From that sk_buff *, the fields described in the format are filled out, and those field values can be used in the formatted trace event output.
How do we define a tracepoint? We’ll describe that briefly here.
Tracepoint definition
Tracepoints are defined in header files under include/trace/events.
Our netif_rx tracepoint is defined in include/trace/events/net.h.
Each tracepoint definition consists of a description of
- TP_PROTO, the function prototype used in calling the tracepoint. In the case of netif_rx(), that’s simply struct sk_buff *skb
-
TP_ARGS, the argument names; (skb)
-
TP_STRUCT__entry field definitions; these correspond to the fields which are assigned when the tracepoint is triggered
-
TP_fast_assign statements which take the raw argument to the tracepoint (the skb) and set the associated field values (skb len, skb pointer etc)
-
TP_printk which is responsible for using those field values to display a relevant tracing message
Note: tracepoints defined via TRACE_EVENT specify all of the above, whereas we can also define an event class which shares fields, assignments and messages. In fact netif_rx is of event class net_dev_template, so our field assignments and message come from that event class.
Using tracepoints 1: debugfs
We can use the debugfs enable file associated with a tracepoint to switch it on, and harvest the trace data from /sys/kernel/debug/tracing/trace_pipe.
$ echo 1 > /sys/kernel/debug/tracing/events/net/netif_receive_skb/enable
$ cat /sys/kernel/debug/tracing/trace_pipe
<idle>-0 [030] ..s. 17429.831685: netif_receive_skb: dev=eth0 skbaddr=ffff99a3b231d600 len=52
<idle>-0 [030] .Ns. 17429.831694: netif_receive_skb: dev=eth0 skbaddr=ffff99a3b231d600 len=52
The messages we see above come from the netif_receive_skb tracepoint, which is defined as being of the net_dev_template class of events; the latter defines the message format:
TP_printk("dev=%s skbaddr=%p len=%u",
__get_str(name), __entry->skbaddr, __entry->len)
Next we can use the filter to limit event reporting to larger skbs;
$ echo "len > 128" > filter
$ cat filter
len > 128
$ cat /sys/kernel/debug/tracing/trace_pipe
<idle>-0 [038] ..s. 578.249650: netif_receive_skb: dev=eth0 skbaddr=ffff9fa2043c6e00 len=156
<idle>-0 [038] ..s. 579.249051: netif_receive_skb: dev=eth0 skbaddr=ffff9fa2043c6e00 len=156
<idle>-0 [038] ..s. 580.250971: netif_receive_skb: dev=eth0 skbaddr=ffff9fa2043c6e00 len=156
Using tracepoints 2: perf
perf record -e can be used to trace tracepoint entry, and the associated tracing messages will be recorded. Here we record the net subsystem’s netif_receive_skb tracepoint, and while doing so we ping the system doing the recording to trigger events:
$ perf record -e net:netif_receive_skb $ perf report Samples: 8 of event 'net:netif_receive_skb', Event count (approx.): 8 Overhead Trace output 50.00% dev=eth0 skbaddr=0xffff99a3b1c10400 len=84 12.50% dev=eth0 skbaddr=0xffff999a9658ef00 len=76 12.50% dev=eth0 skbaddr=0xffff99a3b26de400 len=52 12.50% dev=eth0 skbaddr=0xffff99a3b2765500 len=76 12.50% dev=eth0 skbaddr=0xffff99a7026f0300 len=46
The messages we see above again come from the TP_printk associated with the netif_receive_skb event.
Using tracepoints 3: bpf
BPF gives us a few ways to connect to tracepoints via different program types:
- BPF_PROG_TYPE_TRACEPOINT: this program type gives access to the TP_STRUCT_entry data available at tracepoint entry; i.e. the data assigned from the raw tracepoint arguments via the TP_fast_assign() section in the tracepoint definition. The tricky part with these tracepoints is the context definition must match that described in the TP_STRUCT_entry definition exactly. For example, in samples/bpf/xdp_redirect_cpu_kern.c the tracepoint format for xdp_redirect tracepoints is defined as follows
/* Tracepoint format: /sys/kernel/debug/tracing/events/xdp/xdp_redirect/format
* Code in: kernel/include/trace/events/xdp.h
*/
struct xdp_redirect_ctx {
u64 __pad; // First 8 bytes are not accessible by bpf code
int prog_id; // offset:8; size:4; signed:1;
u32 act; // offset:12 size:4; signed:0;
int ifindex; // offset:16 size:4; signed:1;
int err; // offset:20 size:4; signed:1;
int to_ifindex; // offset:24 size:4; signed:1;
u32 map_id; // offset:28 size:4; signed:0;
int map_index; // offset:32 size:4; signed:1;
}; // offset:36
Then the tracepoint program uses this context:
static __always_inline
int xdp_redirect_collect_stat(struct xdp_redirect_ctx *ctx)
{
u32 key = XDP_REDIRECT_ERROR;
struct datarec *rec;
int err = ctx->err;
if (!err)
key = XDP_REDIRECT_SUCCESS;
...
- BPF_PROG_TYPE_RAW_TRACEPOINT: BPF programs are in this case provided with the raw arguments to the tracepoint, i.e. before the fast assign is done. The context to such programs is defined as
struct bpf_raw_tracepoint_args {
__u64 args[0];
};
…so the __u64 representation of the raw arguments can be accessed from args[].
Using tracepoints 4: DTrace
DTrace in Oracle Linux supports a perf provider which, similar to BPF, allows us to access the raw tracepoint arguments.
We can see the available probes via dtrace -l:
$ modprobe sdt $ dtrace -l -P perf ID PROVIDER MODULE FUNCTION NAME 656 perf vmlinux syscall_trace_enter sys_enter 657 perf vmlinux syscall_slow_exit_work sys_exit 658 perf vmlinux emulate_vsyscall emulate_vsy scall 665 perf vmlinux trace_xen_mc_flush_reason xen_mc_flus h_reason ...
We can gather stack traces, count events etc using DTrace; here we count how many times the tracepoint fires with a specific stack trace, and see we get 6 events with the same stack trace:
$ dtrace -n 'perf:::netif_receive_skb { @c[stack()] = count(); }'
dtrace: description 'perf:::netif_receive_skb ' matched 1 probe
^C
vmlinux`do_invalid_op+0x20
vmlinux`invalid_op+0x11a
vmlinux`__netif_receive_skb_core+0x3f
vmlinux`__netif_receive_skb+0x18
vmlinux`netif_receive_skb_internal+0x45
vmlinux`napi_gro_receive+0xd8
virtio_net`__dta_receive_buf_110+0xf2
virtio_net`__dta_virtnet_poll_91+0x134
vmlinux`net_rx_action+0x289
vmlinux`__do_softirq+0xd9
vmlinux`irq_exit+0xe6
vmlinux`do_IRQ+0x59
vmlinux`common_interrupt+0x1c2
vmlinux`native_safe_halt+0x12
vmlinux`default_idle+0x1e
cpuidle_haltpoll`haltpoll_enter_idle+0xa9
vmlinux`cpuidle_enter_state+0xa4
vmlinux`cpuidle_enter+0x17
vmlinux`call_cpuidle+0x23
vmlinux`do_idle+0x172
4
$
Next steps
Now that we’ve seen some of the basics around tracepoints and how to gather information from them, we’re ready to create a new one!