Here we are going to describe what tracepoints are, how they are defined and finally demonstrate the various ways they can be used. By fleshing out all of the steps, I'm hoping others may find this process a bit easier.
As always, it's a good idea to start with the Linux kernel documentation: https://www.kernel.org/doc/Documentation/trace/tracepoints.txt
In kernel tracing, we have various classes of events we can trace.
We can trace kernel function entry and return with kprobes, but the problem is that functions appear and disappear as the kernel evolves, so this isn't a very stable way of tracing functionality. Further, kernel functions may sometimes be inlined. If you're not familiar, inlining occurs when the compiler (perhaps prompted by an inline suggestion from the developer, or indeed an insistence via __always_inline) replaces a call to a function with the body of that function at each call site. This can increase the kernel image size, but it can also be a saving, avoiding the call-and-return dance otherwise required each time the function is invoked.
If a function is inlined, it cannot be kprobe'd as easily. It's not impossible, but in such cases we'd need to figure out the offset within the containing function where the inlined code begins (kprobes can be placed on most instructions, not just at function entry).
All of the above is a long-winded way of establishing that kernel tracing needs something other than kprobes. Tracepoints are that something else.
Tracepoints simply represent hooks to events of interest. We'll talk about what we hook to them later on, but the idea is to represent events which may occur in multiple places in the code, rather than something tied to an implementation detail such as a particular function's entry point.
Take a look at /sys/kernel/debug/tracing/events. The directories here represent the tracing subsystems that are available.
On a 5.3 kernel this looks like:

$ sudo ls /sys/kernel/debug/tracing/events
alarmtimer  ftrace  kmem  page_pool  sunrpc
bcache  gpio  kvm  percpu  swiotlb
block  hda  kvmmmu  power  syscalls
bpf_test_run  hda_controller  kyber  preemptirq  task
bridge  hda_intel  libata  printk  tcp
cfg80211  header_event  lock  qdisc  thermal
cgroup  header_page  mac80211  random  timer
clk  huge_memory  mac80211_msg  ras  tlb
cma  hwmon  mce  raw_syscalls  udp
compaction  hyperv  mdio  rcu  v4l2
context_tracking  i2c  mei  regmap  vb2
cpuhp  i915  migrate  regulator  vmscan
devfreq  initcall  module  rpcgss  vsyscall
devlink  intel_ish  mpx  rpm  wbt
dma_fence  iommu  msr  rseq  workqueue
drm  irq  napi  rtc  writeback
enable  irq_matrix  neigh  sched  x86_fpu
exceptions  irq_vectors  net  scsi  xdp
fib  iwlwifi  nfsd  signal  xen
fib6  iwlwifi_data  nmi  skb  xfs
filelock  iwlwifi_io  oom  smbus  xhci-hcd
filemap  iwlwifi_msg  page_isolation  snd_pcm
fs_dax  iwlwifi_ucode  pagemap  sock
An individual subsystem may define multiple events; for example in the net subsystem we see:
$ sudo ls /sys/kernel/debug/tracing/events/net
enable                   netif_receive_skb
filter                   netif_receive_skb_entry
napi_gro_frags_entry     netif_receive_skb_exit
napi_gro_frags_exit      netif_receive_skb_list_entry
napi_gro_receive_entry   netif_receive_skb_list_exit
napi_gro_receive_exit    netif_rx
net_dev_queue            netif_rx_entry
net_dev_start_xmit       netif_rx_exit
net_dev_xmit             netif_rx_ni_entry
net_dev_xmit_timeout     netif_rx_ni_exit
The directories correspond to recognizable events in generic network stack processing.
What about the enable and filter file entries above? If we look, we notice they are marked rw. As you might guess, the enable file is used to switch on a probe or set of probes; we can echo 1 to the appropriate enable file to enable events. For more details, refer to https://www.kernel.org/doc/Documentation/trace/events.txt
We can also filter events using filter expressions which operate on trace data fields. We'll describe both these features below.
If we look in the directory of a specific trace event - say netif_rx - we see:
$ sudo ls -al /sys/kernel/debug/tracing/events/net/netif_rx
total 0
drwxr-xr-x.  2 root root 0 Oct 18 10:20 .
drwxr-xr-x. 20 root root 0 Oct 18 10:20 ..
-rw-r--r--.  1 root root 0 Oct 18 10:20 enable
-rw-r--r--.  1 root root 0 Oct 18 10:20 filter
-r--r--r--.  1 root root 0 Oct 18 10:20 format
-r--r--r--.  1 root root 0 Oct 18 10:20 id
-rw-r--r--.  1 root root 0 Oct 18 10:20 trigger
So we see some additional files here.
id is a unique identifier that can be used to attach to a specific tracepoint:
$ sudo cat /sys/kernel/debug/tracing/events/net/netif_rx/id
1320
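One concrete use of the id: it is the value a perf_event_open() caller supplies as the event config when attaching to this specific tracepoint. A minimal sketch of such a call (an illustrative helper, not from the original post):

/* Sketch: open a perf event for a tracepoint, given the value read
 * from its debugfs id file.
 */
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_tracepoint_event(uint64_t tracepoint_id)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_TRACEPOINT;       /* tracepoint-based event */
        attr.size = sizeof(attr);
        attr.config = tracepoint_id;            /* value from the id file */
        attr.sample_type = PERF_SAMPLE_RAW;     /* raw tracepoint data */
        attr.sample_period = 1;

        /* pid = -1, cpu = 0: all processes, CPU 0 only */
        return (int)syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
}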
The format file describes the fields used by the tracepoint. Each tracepoint emits a message which tracing tools can display, and to do so it defines a set of fields which are populated from the raw tracing arguments.
$ sudo cat /sys/kernel/debug/tracing/events/net/netif_rx/format
name: netif_rx
ID: 1320
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:void * skbaddr;   offset:8;       size:8; signed:0;
        field:unsigned int len; offset:16;      size:4; signed:0;
        field:__data_loc char[] name;   offset:20;      size:4; signed:1;

print fmt: "dev=%s skbaddr=%p len=%u", __get_str(name), REC->skbaddr, REC->len
Here we see how the formatted message displays the device name, the address of the skb, etc.
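One wrinkle in the format above is the name field, declared as __data_loc char[]: the 4 bytes at offset 20 are not the string itself but a locator, with the offset of the string within the record in the low 16 bits and its length in the high 16 bits. A consumer of raw trace data decodes it roughly like this (an illustrative helper, not from the original post):

/* Sketch: resolve a __data_loc field to the string it refers to.
 * rec is the start of the raw event record; dataloc is the 32-bit
 * value read at the field's offset (20 for "name" above).
 */
static const char *get_data_loc_str(const void *rec, unsigned int dataloc)
{
        unsigned int offset = dataloc & 0xffff; /* where the string lives */
        unsigned int len = dataloc >> 16;       /* its length, including NUL */

        (void)len;      /* length is available for bounds checking */
        return (const char *)rec + offset;
}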
Taking the netif_rx tracepoint, let's see where it's placed. The convention is that tracepoints are called via trace_<tracepoint_name>, so we need to look for trace_netif_rx().
If we look in net/core/dev.c we see:
static int netif_rx_internal(struct sk_buff *skb)
{
        int ret;

        net_timestamp_check(netdev_tstamp_prequeue, skb);

        trace_netif_rx(skb);
        ...
We see the tracepoint is called with a struct sk_buff * as an argument. From that sk_buff *, the fields described in the format are filled out, and those field values can be used in the formatted trace event output.
How do we define a tracepoint? We'll describe that briefly here.
Tracepoints are defined in header files under include/trace/events.
Our netif_rx tracepoint is defined in include/trace/events/net.h.
Each tracepoint definition consists of:
TP_PROTO, the prototype for the tracepoint call (struct sk_buff *skb);
TP_ARGS, the argument names (skb);
TP_STRUCT__entry field definitions; these correspond to the fields which are assigned when the tracepoint is triggered;
TP_fast_assign statements, which take the raw arguments to the tracepoint (the skb) and set the associated field values (skb length, skb pointer, etc.);
TP_printk, which is responsible for using those field values to display a relevant tracing message.
Note: a tracepoint defined via TRACE_EVENT specifies all of the above itself, whereas we can also define an event class which shares field definitions, assignments and message format across multiple events. In fact netif_rx belongs to the event class net_dev_template, so its field assignments and message come from that event class.
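For reference, the relevant definitions in include/trace/events/net.h look roughly like this (lightly abridged here):

DECLARE_EVENT_CLASS(net_dev_template,

        TP_PROTO(struct sk_buff *skb),

        TP_ARGS(skb),

        TP_STRUCT__entry(
                __field(void *,         skbaddr)
                __field(unsigned int,   len)
                __string(name,          skb->dev->name)
        ),

        TP_fast_assign(
                __entry->skbaddr = skb;
                __entry->len = skb->len;
                __assign_str(name, skb->dev->name);
        ),

        TP_printk("dev=%s skbaddr=%p len=%u",
                  __get_str(name), __entry->skbaddr, __entry->len)
);

DEFINE_EVENT(net_dev_template, netif_rx,
        TP_PROTO(struct sk_buff *skb),
        TP_ARGS(skb)
);

Here DEFINE_EVENT simply stamps out netif_rx as an instance of the net_dev_template class, inheriting its fields, assignments and message.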
We can use the debugfs enable file associated with a tracepoint to switch it on, and harvest the trace data from /sys/kernel/debug/tracing/trace_pipe.
$ echo 1 > /sys/kernel/debug/tracing/events/net/netif_receive_skb/enable
$ cat /sys/kernel/debug/tracing/trace_pipe
<idle>-0     [030] ..s. 17429.831685: netif_receive_skb: dev=eth0 skbaddr=ffff99a3b231d600 len=52
<idle>-0     [030] .Ns. 17429.831694: netif_receive_skb: dev=eth0 skbaddr=ffff99a3b231d600 len=52
The messages we see above come from the netif_receive_skb tracepoint, which is defined as being of the net_dev_template class of events; the latter defines the message format:
TP_printk("dev=%s skbaddr=%p len=%u", __get_str(name), __entry->skbaddr, __entry->len)
Next we can use the filter to limit event reporting to larger skbs:

$ echo "len > 128" > filter
$ cat filter
len > 128
$ cat /sys/kernel/debug/tracing/trace_pipe
<idle>-0     [038] ..s.   578.249650: netif_receive_skb: dev=eth0 skbaddr=ffff9fa2043c6e00 len=156
<idle>-0     [038] ..s.   579.249051: netif_receive_skb: dev=eth0 skbaddr=ffff9fa2043c6e00 len=156
<idle>-0     [038] ..s.   580.250971: netif_receive_skb: dev=eth0 skbaddr=ffff9fa2043c6e00 len=156
perf record -e can be used to trace tracepoint entry, and the associated tracing messages will be recorded. Here we record the net subsystem's netif_receive_skb tracepoint, and while doing so we ping the system doing the recording to trigger events:
$ perf record -e net:netif_receive_skb
$ perf report
Samples: 8  of event 'net:netif_receive_skb', Event count (approx.): 8
Overhead  Trace output
  50.00%  dev=eth0 skbaddr=0xffff99a3b1c10400 len=84
  12.50%  dev=eth0 skbaddr=0xffff999a9658ef00 len=76
  12.50%  dev=eth0 skbaddr=0xffff99a3b26de400 len=52
  12.50%  dev=eth0 skbaddr=0xffff99a3b2765500 len=76
  12.50%  dev=eth0 skbaddr=0xffff99a7026f0300 len=46
The messages we see above again come from the TP_printk associated with the netif_receive_skb event.
BPF gives us a few ways to connect to tracepoints via different program types. The first is BPF_PROG_TYPE_TRACEPOINT, where the program is passed a context structure whose layout mirrors the fields in the tracepoint's format file. For example, for the xdp_redirect tracepoint such a context can be defined as:
/* Tracepoint format: /sys/kernel/debug/tracing/events/xdp/xdp_redirect/format
 * Code in:            kernel/include/trace/events/xdp.h
 */
struct xdp_redirect_ctx {
        u64 __pad;              // First 8 bytes are not accessible by bpf code
        int prog_id;            //      offset:8;  size:4; signed:1;
        u32 act;                //      offset:12  size:4; signed:0;
        int ifindex;            //      offset:16  size:4; signed:1;
        int err;                //      offset:20  size:4; signed:1;
        int to_ifindex;         //      offset:24  size:4; signed:1;
        u32 map_id;             //      offset:28  size:4; signed:0;
        int map_index;          //      offset:32  size:4; signed:1;
};                              // offset:36
Then the tracepoint program uses this context:
static __always_inline
int xdp_redirect_collect_stat(struct xdp_redirect_ctx *ctx)
{
        u32 key = XDP_REDIRECT_ERROR;
        struct datarec *rec;
        int err = ctx->err;

        if (!err)
                key = XDP_REDIRECT_SUCCESS;
        ...
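The entry points that actually get attached to the kernel are then declared by naming the tracepoint in the program's ELF section, roughly along these lines (a sketch following the pattern used in samples/bpf; the function name here is illustrative):

SEC("tracepoint/xdp/xdp_redirect")
int trace_xdp_redirect(struct xdp_redirect_ctx *ctx)
{
        /* ctx mirrors the tracepoint's format file fields shown above */
        return xdp_redirect_collect_stat(ctx);
}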
The second is BPF_PROG_TYPE_RAW_TRACEPOINT, where instead of a cooked context structure the program is handed the raw tracepoint arguments:

struct bpf_raw_tracepoint_args {
        __u64 args[0];
};
...so the __u64 representation of the raw arguments can be accessed from args[].
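To illustrate, here is a minimal sketch (not from the original post; the program name is illustrative and a libbpf-style toolchain is assumed) of a raw tracepoint program attached to netif_receive_skb. Since that tracepoint's TP_PROTO is (struct sk_buff *skb), args[0] carries the skb pointer:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Attach by section name; for raw tracepoints the name is just the
 * tracepoint name, with no subsystem prefix.
 */
SEC("raw_tracepoint/netif_receive_skb")
int rawtp_netif_receive_skb(struct bpf_raw_tracepoint_args *ctx)
{
        /* args[0] is the first raw tracepoint argument: the skb pointer */
        void *skb = (void *)ctx->args[0];

        bpf_printk("netif_receive_skb skbaddr=%p\n", skb);
        return 0;
}

char _license[] SEC("license") = "GPL";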
DTrace in Oracle Linux supports a perf provider which, similar to BPF, allows us to access the raw tracepoint arguments.
We can see the available probes via dtrace -l:
$ modprobe sdt
$ dtrace -l -P perf
   ID   PROVIDER            MODULE                          FUNCTION NAME
  656       perf           vmlinux               syscall_trace_enter sys_enter
  657       perf           vmlinux            syscall_slow_exit_work sys_exit
  658       perf           vmlinux                  emulate_vsyscall emulate_vsyscall
  665       perf           vmlinux         trace_xen_mc_flush_reason xen_mc_flush_reason
...
We can gather stack traces, count events, etc. using DTrace; here we count how many times the tracepoint fires with a specific stack trace, and see how many events share the same stack trace:
$ dtrace -n 'perf:::netif_receive_skb { @c[stack()] = count(); }'
dtrace: description 'perf:::netif_receive_skb ' matched 1 probe
^C

              vmlinux`do_invalid_op+0x20
              vmlinux`invalid_op+0x11a
              vmlinux`__netif_receive_skb_core+0x3f
              vmlinux`__netif_receive_skb+0x18
              vmlinux`netif_receive_skb_internal+0x45
              vmlinux`napi_gro_receive+0xd8
              virtio_net`__dta_receive_buf_110+0xf2
              virtio_net`__dta_virtnet_poll_91+0x134
              vmlinux`net_rx_action+0x289
              vmlinux`__do_softirq+0xd9
              vmlinux`irq_exit+0xe6
              vmlinux`do_IRQ+0x59
              vmlinux`common_interrupt+0x1c2
              vmlinux`native_safe_halt+0x12
              vmlinux`default_idle+0x1e
              cpuidle_haltpoll`haltpoll_enter_idle+0xa9
              vmlinux`cpuidle_enter_state+0xa4
              vmlinux`cpuidle_enter+0x17
              vmlinux`call_cpuidle+0x23
              vmlinux`do_idle+0x172
                4
$
Now that we've seen some of the basics around tracepoints and how to gather information from them, we're ready to create a new one!