Data-centric tracing

September 27, 2022 | 20 minute read
Text Size 100%:

Here we continue our series of BPF blog entries by looking at observability improvements the Oracle Linux team have worked on with the upstream BPF community.

In particular, we will discuss how BPF - and libbpf in particular - can facilitate observability of kernel function execution, showing arguments and return values in a similar way to that supported in debuggers.

You can see the eBPF summit presentation on this topic here:

https://youtu.be/PRiJYuDMtEQ

Tracing control flow

We have long focused debugging and tracing efforts on control flow. We have instrumentation such as kprobes and tracepoints for the kernel, and breakpoints for debuggers. These facilities can intercept events of interest, jump to carry out instrumentation and return.

In a rather beautiful idea, ftrace even looks like code when it is instrumenting kernel function execution:

# cd /sys/kernel/debug/tracing/
# echo function_graph > current_tracer
# cat /sys/kernel/debug/tracing/trace_pipe
...
 0)               |                pty_write() {
 0)   0.140 us    |                  _raw_spin_lock_irqsave();
 0)               |                  tty_insert_flip_string_fixed_flag() {
 0)   0.133 us    |                    __tty_buffer_request_room();
 0)   1.791 us    |                  }
 0)   0.207 us    |                  _raw_spin_unlock_irqrestore();
 0)               |                  tty_flip_buffer_push() {
 0)   0.215 us    |                    queue_work_on();
 0)   0.619 us    |                  }
 0)   3.642 us    |                }
...

But of course code execution is not the whole story; that code has to operate on data. This used to be easier. It used to be the case that when we called a function, the convention was to push the function arguments onto the stack. This was largely a result of not having many registers to play with, so using the stack for function arguments impacted less on register usage. However it also had the nice side-effect that since it was easy to read data from stack frames, we could still easily see what arguments that function was called with by looking at the stack when carrying out instrumentation. We had a frozen-in-time view of the function when it was called, and for all of its ancestors on the stack!

However, as often happens, speed beats simplicity. Consider the fact that when I call a function I need to push all of its arguments onto the stack, and when I execute it I have to read them again. This is why modern 64-bit calling conventions (x86_64, aarch64) that operate on systems with more registers moved to using registers for function arguments; no costly interactions with the stack are required. This is good for minimizing function call overheads; it is bad for observability. Registers which were used to store function arguments are often reused or overwritten during function execution, so if we are observing a function anywhere but at a function entry boundary, it becomes an exercise in detective work to figure out what values a given function was called with.

For a great discussion on how to do this see here:

https://blogs.oracle.com/linux/post/extracting-kernel-stack-function-arguments-from-linux-x86-64-kernel-crash-dumps

The view from the debugger

Once we have a reference to our data, debuggers tend to have methods to display data of interest. We can use gdb to print data structures via the print command:

(gdb) print test_cases

$1 = {root = {rb_node = 0xffff8800d1d16800}, size = 1,

  keycompare = 0xffffffffa0bdbe10 <ktf_map_name_compare>}

Tracing kernel data structures via BTF and libbpf

The focus of my recent work has been to bring this capability to BPF tracing also.

The availability of BPF Type Format (BTF) information for live kernels means we can now carry out function boundary tracing by consulting BTF to determine types of function arguments and return values. We can then use the information about those types to guide typed data display.

To that end, libbpf now has support for type-based dump of data structures. The relevant API is:

int
btf_dump__dump_type_data(struct btf_dump *d, __u32 id,
    const void *data, size_t data_sz,
    const struct btf_dump_type_data_opts *opts);

..and the associated options look like this:

struct btf_dump_type_data_opts {
    /* size of this struct, for forward/backward compatibility */
    size_t sz;
    const char *indent_str;
    int indent_level;
    /* below match "show" flags for bpf_show_snprintf() */
    bool compact;       /* no newlines/indentation */
    bool skip_names;    /* skip member/type names */
    bool emit_zeroes;   /* show 0-valued fields */
    size_t :0;
};

So we can control the display by specifying a custom indent string (default is a space) and initial indent level. We can also make the output compact, which makes it harder to read, but this can be useful sometimes. Finally, by default we emit type and field names and skip zero-valued fields, but we can override these choices with the last two options.

A data-centric tracing example - ksnoop

Ideally, tracers will adopt these features over time, but to aid with that I wrote a simple tracer called ksnoop, which is now available in the set of libbpf-tools provided with bcc (the BPF Compiler Collection). libbpf-tools are neat because they will run on multiple kernels due to their use of BPF Compile Once - Run Everywhere (CO-RE), and because they are self-contained, you only need the binary ksnoop on a system. bpftool skeletons (see the previous BPF blog post) embed the BPF bytecode.

ksnoop is available from libbpf-tools; for Oracle Linux the v0.23 is available in the UEKR7 repository. ksnoop is a standalone binary which has BPF program built-in via BPF skeleton support.

ksnoop is a simple tracer that traces functions with arguments and return values, using btf_dump__dump_type_data() to display the associated kernel data. We start by looking at the function signature:

# bpf-ksnoop info ip_send_skb

  int  ip_send_skb(struct net  * net, struct sk_buff  * skb);

Now we can trace that function, and by default we see function entry and return:

# bpf-ksnoop trace ip_send_skb

              TIME  CPU      PID FUNCTION/ARGS
    78101668506811    1     2813 ip_send_skb(
                                     net = *(0xffffffffb5959840)
                                      (struct net){
                                       .passive = (refcount_t){
                                        .refs = (atomic_t){
                                         .counter = (int)0x2,
                                        },
                                       },
                                       .dev_base_seq = (unsigned int)0x18,
                                       .ifindex = (int)0xf,
                                       .list = (struct list_head){
                                        .next = (struct list_head *)0xffff9895440dc120,
                                        .prev = (struct list_head *)0xffffffffb595a8d0,
                                       },
                                     ...

    79561322965250    1     2813 ip_send_skb(
                                     return =
                                      (int)0x0
                                 );

Next we might want to trace a specific argument only, reducing the amount of output. Use the function argument name to do this:

# bpf-ksnoop trace "ip_send_skb(skb)"

             TIME  CPU      PID FUNCTION/ARGS
    78142420834537    1     2813 ip_send_skb(
                                     skb = *(0xffff989750797c00)
                                      (struct sk_buff){
                                       (union){
                                        .sk = (struct sock *)0xffff98966ce19200,
                                        .ip_defrag_offset = (int)0x6ce19200,
                                       },
                                       (union){
                                        (struct){
                                         ._skb_refdst = (long unsigned int)0xffff98981dde2d80,
                                         .destructor = (void (*)(struct sk_buff *))0xffffffffb3e1beb0,
                                        },
    ...

We do not see return traces since we asked for an entry argument only; we can specify return value using the “return” name:

# bpf-ksnoop trace "ip_send_skb(return)"

             TIME  CPU      PID FUNCTION/ARGS
    78178228354796    1     2813 ip_send_skb(
                                     return =
                                      (int)0x0
                                 );

We can also dig a bit deeper (one level only is supported) by dereferencing a pointer argument member; for example:

# bpf-ksnoop trace "ip_send_skb(skb->sk)"

              TIME  CPU      PID FUNCTION/ARGS
    78207649138829    2     2813 ip_send_skb(
                                     skb->sk = *(0xffff98966ce19200)
                                      (struct sock){
                                       .__sk_common = (struct sock_common){
                                        (union){
                                         .skc_addrpair = (__addrpair)0x1701a8c017d38f8d,
                                         (struct){
                                          .skc_daddr = (__be32)0x17d38f8d,
                                          .skc_rcv_saddr = (__be32)0x1701a8c0,
                                         },
                                        },
    ...

…and we can filter by pid:

# bpf-ksnoop -p 2813 "ip_rcv(dev)"

              TIME  CPU      PID FUNCTION/ARGS
    78254803164920    1     2813 ip_rcv(
                                     dev = *(0xffff9895414cb000)
                                      (struct net_device){
                                       .name = (char[16])[
                                        'l',
                                        'o',
                                       ],
                                       .name_node = (struct netdev_name_node *)0xffff989541515ec0,
    ...

We can also trace multiple functions at once:

# bpf-ksnoop tcp_sendmsg __tcp_transmit_skb  ip_output

             TIME  CPU      PID FUNCTION/ARGS
    71827770952903    1     4777 __tcp_transmit_skb(
                                     sk = *(0xffff9852460a2300)
                                      (struct sock)
    ...

We sometimes want to filter using specific criteria. For example to trace skbs of length > 100:

# bpf-ksnoop "ip_send_skb(skb->len > 100, skb)"

              TIME  CPU      PID FUNCTION/ARGS
    39267395709745    1     2955 ip_send_skb(
                                     skb->len =
                                      (unsigned int)0x89,
                                     skb = *(0xffff89c8be81e500)
                                      (struct sk_buff){
                                       (union){
                                        .sk = (struct sock *)0xffff89c6c59e5580,
                                        .ip_defrag_offset = (int)0xc59e5580,
                                       },

Lastly, we sometimes are interested in tracing a specific stack signature, where function A calls function B which calls function C. ksnoop supports such stack-specific tracing via the -s option. Note that function A does not have to call function B directly for tracing to occur, the functions just need to have been called in order, not directly:

# bpf-ksnoop -s tcp_sendmsg __tcp_transmit_skb  ip_output
               TIME  CPU      PID FUNCTION/ARGS
    246287022690137    2   884013 tcp_sendmsg(
                                      sk = *(0xffff9b2a328cd580)
                                       (struct sock){
    ...
    246287022693163    2   884013 __tcp_transmit_skb(
                                      sk = *(0xffff9b2a328cd580)
                                       (struct sock){
                                        .__sk_common = (struct sock_common){
    ...
    246287022697230    2   884013 ip_output(
                                      net = *(0xffffffffbc7c14c0)
                                       (struct net){
                                        .passive = (refcount_t){
                                         .refs = (atomic_t){
                                          .counter = (int)2,
    ...

All of this is done by creating BPF map entries specifying what the BPF program associated with the function has to trace (which we determined via a combination of BTF and the user-specified trace); when it fires the data is copied to user-space and we use libbpf to dump it out.

For more details, see:

Note that ksnoop is really only a toy tracer; it is intended to serve as a proof-of-concept for other tracers to help them adopt typed data display support. However I find myself using it daily when debugging kernel issues.

Summary

Sometimes it is enough to understand control flow, but often when we need to dig deeper, having access to function arguments and return values can really help. ksnoop demonstrates that with the new libbpf interfaces, this is possible in a tracing context as well as in a debugger context. Critically, having this support in libbpf itself allows consumers to make use of these features in their own tools; ksnoop is simply an example which will hopefully inspire other tracing tools to adopt this functionality.

References

Inferring function arguments from a core dump can be hard work; a great description of the process can be found here:

As noted above, ksnoop can be found in

My eBPF summit on this topic can be found here:

Previous Blogs

Alan Maguire


Previous Post

Backporting patches using git

Vegard Nossum | 15 min read

Next Post


Join Oracle Linux, Virtualization, and Verrazzano teams at Oracle CloudWorld 2022

Parnian Taidi | 6 min read