X

News, tips, partners, and perspectives for the Oracle Linux operating system and upstream Linux kernel work

Taming Tracepoints in the Linux Kernel

Have you always wanted to learn how to implement tracepoints in the Linux Kernel? Then this blog is for you. Oracle Linux kernel engineer Alan Maguire explains how to implement a tracepoint in the Linux kernel.

 

Here we are going to describe what tracepoints are, how they are defined and finally demonstrate the various ways they can be used. By fleshing out all of the steps, I'm hoping others may find this process a bit easier.

As always it's a good idea to start with the Linux kernel documentation : https://www.kernel.org/doc/Documentation/trace/tracepoints.txt

What are tracepoints?

In kernel tracing, we have various classes of events we can trace.

We can trace kernel function entry and return with kprobes, but the problem is functions appear and disappear as the kernel evolves, so this isn't a very stable way of tracing functionality. Further, kernel functions may be inlined sometimes. If you're not famililar, inlining occurs when the compiler (perhaps with an inline suggestion from the developer or indeed an insistence using __always_inline) replaces a call to a function with the body of that function every time it is called. This can increase the kernel image size, but can be a saving too, avoiding the dance required when the function is called.

If a function is inlined, it cannot be kprobe'd as easily. Note that it's not impossible, but in such cases we'd need to figure out the offset in the function body where inline code begins (kprobes can be placed on most instructions, not just function entry).

So all of the above is a long-winded way of establishing that we need something else other than kprobes in kernel tracing. Tracepoints are that something else.

Tracepoints simply represent hooks to events of interest. We'll talk about what we hook to them later on, but the idea is to represent events which may occur in possibly multiple places in the code rather than have something so tied to implementation details such as function entry.

What tracepoints are available?

Take a look at /sys/kernel/debug/tracing/events. The directories here represent the tracing subsystems that are available.

On a 5.3 kernel this looks like

$ sudo ls /sys/kernel/debug/tracing/events
alarmtimer        ftrace          kmem            page_pool     sunrpc
bcache            gpio            kvm             percpu        swiotlb
block             hda             kvmmmu          power         syscalls
bpf_test_run      hda_controller  kyber           preemptirq    task
bridge            hda_intel       libata          printk        tcp
cfg80211          header_event    lock            qdisc         thermal
cgroup            header_page     mac80211        random        timer
clk               huge_memory     mac80211_msg    ras           tlb
cma               hwmon           mce             raw_syscalls  udp
compaction        hyperv          mdio            rcu           v4l2
context_tracking  i2c             mei             regmap        vb2 
cpuhp             i915            migrate         regulator     vmscan
devfreq           initcall        module          rpcgss        vsyscall
devlink           intel_ish       mpx             rpm           wbt
dma_fence         iommu           msr             rseq          workqueue
drm               irq             napi            rtc           writeback
enable                   irq_matrix      neigh           sched         x86_fpu
exceptions        irq_vectors     net             scsi          xdp
fib               iwlwifi         nfsd            signal        xen
fib6              iwlwifi_data    nmi             skb           xfs
filelock          iwlwifi_io      oom             smbus         xhci-hcd
filemap           iwlwifi_msg     page_isolation  snd_pcm
fs_dax            iwlwifi_ucode   pagemap         sock

An individual subsystem may define multiple events; for example in the net subsystem we see:

$ sudo ls /sys/kernel/debug/tracing/events/net
enable                             netif_receive_skb
filter                  netif_receive_skb_entry
napi_gro_frags_entry    netif_receive_skb_exit
napi_gro_frags_exit     netif_receive_skb_list_entry
napi_gro_receive_entry  netif_receive_skb_list_exit
napi_gro_receive_exit   netif_rx
net_dev_queue           netif_rx_entry
net_dev_start_xmit      netif_rx_exit
net_dev_xmit            netif_rx_ni_entry
net_dev_xmit_timeout    netif_rx_ni_exit

The directories correspond to recognizable events in generic network stack processing.

What about the enable and filter file entries above? If we look we notice they are marked rw. As you might guess, the enable file is used to switch on the probe or set of probes. We can echo 1 to the appropriate enable file to enable events. For more details please refer to: https://www.kernel.org/doc/Documentation/trace/events.txt

We can also filter events using filter expressions which operate on trace data fields. We'll describe both these features below.

If we look in the directory of a specific trace event - say netif_rx - we see:

$ sudo ls -al /sys/kernel/debug/tracing/events/net/netif_rx
total 0
drwxr-xr-x.  2 root root 0 Oct 18 10:20 .
drwxr-xr-x. 20 root root 0 Oct 18 10:20 ..
-rw-r--r--.  1 root root 0 Oct 18 10:20 enable
-rw-r--r--.  1 root root 0 Oct 18 10:20 filter
-r--r--r--.  1 root root 0 Oct 18 10:20 format
-r--r--r--.  1 root root 0 Oct 18 10:20 id
-rw-r--r--.  1 root root 0 Oct 18 10:20 trigger

So we see some additional files here.

id is a unique identifier that can be used to attach to a specific tracepoint:

$ sudo cat /sys/kernel/debug/tracing/events/net/netif_rx/id
1320

The format describes the fields used by the tracepoint. Each tracepoint prints a message which can be displayed by tracing tools, and to help it to do so it defines a set of fields which are populated by the raw tracing arguments.

$ sudo cat /sys/kernel/debug/tracing/events/net/netif_rx/format
name: netif_rx
ID: 1320
format:
    field:unsigned short common_type;   offset:0;   size:2; signed:0;
    field:unsigned char common_flags;   offset:2;   size:1; signed:0;
    field:unsigned char common_preempt_count;   offset:3;   size:1;signed:0;
    field:int common_pid;   offset:4;   size:4; signed:1;

    field:void * skbaddr;   offset:8;   size:8; signed:0;
    field:unsigned int len; offset:16;  size:4; signed:0;
    field:__data_loc char[] name;   offset:20;  size:4; signed:1;

print fmt: "dev=%s skbaddr=%p len=%u", __get_str(name), REC->skbaddr, REC->len

Here we see how the formatted message displays the device name, the address of the skb, etc.

Finding tracepoints in the Linux source

Taking the netif_rx tracepoint, let's see where it's placed. The convention is that tracepoints are called via trace_<tracepoint_name>, so we need to look for trace_netif_rx().

If we look in net/core/dev.c we see:

static int netif_rx_internal(struct sk_buff *skb)
{
        int ret;

        net_timestamp_check(netdev_tstamp_prequeue, skb);

        trace_netif_rx(skb);

...

We see the tracepoint is called with a struct sk_buff * as an argument. From that sk_buff *, the fields described in the format are filled out, and those field values can be used in the formatted trace event output.

How do we define a tracepoint? We'll describe that briefly here.

Tracepoint definition

Tracepoints are defined in header files under include/trace/events.

Our netif_rx tracepoint is defined in include/trace/events/net.h.

Each tracepoint definition consists of a description of

  • TP_PROTO, the function prototype used in calling the tracepoint. In the case of netif_rx(), that's simply struct sk_buff *skb
  • TP_ARGS, the argument names; (skb)

  • TP_STRUCT__entry field definitions; these correspond to the fields which are assigned when the tracepoint is triggered

  • TP_fast_assign statements which take the raw argument to the tracepoint (the skb) and set the associated field values (skb len, skb pointer etc)

  • TP_printk which is responsible for using those field values to display a relevant tracing message

Note: tracepoints defined via TRACE_EVENT specify all of the above, whereas we can also define an event class which shares fields, assignments and messages. In fact netif_rx is of event class net_dev_template, so our field assignments and message come from that event class.

Using tracepoints 1: debugfs

We can use the debugfs enable file associated with a tracepoint to switch it on, and harvest the trace data from /sys/kernel/debug/tracing/trace_pipe.

$ echo 1 > /sys/kernel/debug/tracing/events/net/netif_receive_skb/enable

$ cat /sys/kernel/debug/tracing/trace_pipe
          <idle>-0     [030] ..s. 17429.831685: netif_receive_skb: dev=eth0 skbaddr=ffff99a3b231d600 len=52
          <idle>-0     [030] .Ns. 17429.831694: netif_receive_skb: dev=eth0 skbaddr=ffff99a3b231d600 len=52

The messages we see above come from the netif_receive_skb tracepoint, which is defined as being of the net_dev_template class of events; the latter defines the message format:

TP_printk("dev=%s skbaddr=%p len=%u",
                __get_str(name), __entry->skbaddr, __entry->len)

Next we can use the filter to limit event reporting to larger skbs;

$ echo "len > 128" > filter
$ cat filter
len > 128
$ cat /sys/kernel/debug/tracing/trace_pipe
          <idle>-0     [038] ..s.   578.249650: netif_receive_skb: dev=eth0 skbaddr=ffff9fa2043c6e00 len=156
          <idle>-0     [038] ..s.   579.249051: netif_receive_skb: dev=eth0 skbaddr=ffff9fa2043c6e00 len=156
          <idle>-0     [038] ..s.   580.250971: netif_receive_skb: dev=eth0 skbaddr=ffff9fa2043c6e00 len=156

Using tracepoints 2: perf

perf record -e can be used to trace tracepoint entry, and the associated tracing messages will be recorded. Here we record the net subsystem's netif_receive_skb tracepoint, and while doing so we ping the system doing the recording to trigger events:

$ perf record -e net:netif_receive_skb

$ perf report
Samples: 8  of event 'net:netif_receive_skb', Event count (approx.): 8
Overhead  Trace output
  50.00%  dev=eth0 skbaddr=0xffff99a3b1c10400 len=84
  12.50%  dev=eth0 skbaddr=0xffff999a9658ef00 len=76
  12.50%  dev=eth0 skbaddr=0xffff99a3b26de400 len=52
  12.50%  dev=eth0 skbaddr=0xffff99a3b2765500 len=76
  12.50%  dev=eth0 skbaddr=0xffff99a7026f0300 len=46                            

The messages we see above again come from the TP_printk associated with the netif_receive_skb event.

Using tracepoints 3: bpf

BPF gives us a few ways to connect to tracepoints via different program types:

  • BPF_PROG_TYPE_TRACEPOINT: this program type gives access to the TP_STRUCT_entry data available at tracepoint entry; i.e. the data assigned from the raw tracepoint arguments via the TP_fast_assign() section in the tracepoint definition. The tricky part with these tracepoints is the context definition must match that described in the TP_STRUCT_entry definition exactly. For example, in samples/bpf/xdp_redirect_cpu_kern.c the tracepoint format for xdp_redirect tracepoints is defined as follows
  /* Tracepoint format: /sys/kernel/debug/tracing/events/xdp/xdp_redirect/format
   * Code in:                kernel/include/trace/events/xdp.h
   */
  struct xdp_redirect_ctx {
        u64 __pad;      // First 8 bytes are not accessible by bpf code
        int prog_id;    //      offset:8;  size:4; signed:1;
        u32 act;        //      offset:12  size:4; signed:0;
        int ifindex;    //      offset:16  size:4; signed:1;
        int err;        //      offset:20  size:4; signed:1;
        int to_ifindex; //      offset:24  size:4; signed:1;
        u32 map_id;     //      offset:28  size:4; signed:0;
        int map_index;  //      offset:32  size:4; signed:1;
  };                    //      offset:36

Then the tracepoint program uses this context:

  static __always_inline
  int xdp_redirect_collect_stat(struct xdp_redirect_ctx *ctx)
  {
        u32 key = XDP_REDIRECT_ERROR;
        struct datarec *rec;
        int err = ctx->err;

        if (!err)
                key = XDP_REDIRECT_SUCCESS;
...
  • BPF_PROG_TYPE_RAW_TRACEPOINT: BPF programs are in this case provided with the raw arguments to the tracepoint, i.e. before the fast assign is done. The context to such programs is defined as
  struct bpf_raw_tracepoint_args {
    __u64 args[0];
  };

...so the __u64 representation of the raw arguments can be accessed from args[].

Using tracepoints 4: DTrace

DTrace in Oracle Linux supports a perf provider which, similar to BPF, allows us to access the raw tracepoint arguments.

We can see the available probes via dtrace -l:

$ modprobe sdt
$ dtrace -l -P perf
  ID   PROVIDER            MODULE                          FUNCTION NAME
  656       perf           vmlinux               syscall_trace_enter sys_enter
  657       perf           vmlinux            syscall_slow_exit_work sys_exit
  658       perf           vmlinux                  emulate_vsyscall emulate_vsy
scall
  665       perf           vmlinux         trace_xen_mc_flush_reason xen_mc_flus
h_reason
...

We can gather stack traces, count events etc using DTrace; here we count how many times the tracepoint fires with a specific stack trace, and see we get 6 events with the same stack trace:

$ dtrace -n 'perf:::netif_receive_skb { @c[stack()] = count(); }'
dtrace: description 'perf:::netif_receive_skb ' matched 1 probe
^C
              vmlinux`do_invalid_op+0x20
              vmlinux`invalid_op+0x11a
              vmlinux`__netif_receive_skb_core+0x3f
              vmlinux`__netif_receive_skb+0x18
              vmlinux`netif_receive_skb_internal+0x45
              vmlinux`napi_gro_receive+0xd8
              virtio_net`__dta_receive_buf_110+0xf2
              virtio_net`__dta_virtnet_poll_91+0x134
              vmlinux`net_rx_action+0x289
              vmlinux`__do_softirq+0xd9
              vmlinux`irq_exit+0xe6
              vmlinux`do_IRQ+0x59
              vmlinux`common_interrupt+0x1c2
              vmlinux`native_safe_halt+0x12
              vmlinux`default_idle+0x1e
              cpuidle_haltpoll`haltpoll_enter_idle+0xa9
              vmlinux`cpuidle_enter_state+0xa4
              vmlinux`cpuidle_enter+0x17
              vmlinux`call_cpuidle+0x23
              vmlinux`do_idle+0x172


                4
$

Next steps

Now that we've seen some of the basics around tracepoints and how to gather information from them, we're ready to create a new one!

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.