From kernel to user-space tracing

November 15, 2022 | 13 minute read

We continue our series of BPF blog entries by looking at how userspace can be observed, and how this can be facilitated by adding Userspace Statically Defined Tracing (USDT) probes to a library or program.

In this blog entry, we will describe how user-space tracing works under the hood, and then provide a simple USDT tracing example that uses those mechanisms.

To get there, it will be necessary to

  1. Describe how kprobes and kretprobes work:
    • What they are
    • How they take control; and
    • How to use them
  2. Examine how these methods were adapted for userspace with uprobes/uretprobes:
    • Again, what these probes are
    • How they take control; and
    • How to use them
  3. Finally we can look at how USDT tracing works and give some simple examples.

Each step builds upon the previous one, so hopefully following these steps will help clarify the mechanisms used to support user-space tracing.

Most of the following discussion focuses on x86_64, but the same concepts broadly apply across different architectures, with different instructions and registers doing the same sorts of work.

Sample code with simple BPF programs illustrating the kprobe, uprobe and USDT cases is available at:

kprobes

kprobes are used to instrument kernel code.

kprobes can be used to instrument nearly every instruction in the kernel - not just function entry/return - and are thus a very powerful instrumentation method. The downside is that they also incur some overhead, for reasons we will see shortly.

How do they assume control?

When we add a kprobe, we copy the instruction at the target address into a “struct kprobe”, and replace the first byte of the original with an INT3 trap instruction. Recall that a trap is a type of exception (essentially an interrupt you deliberately place in the code) that is triggered explicitly by executing the trap instruction. Execution of the INT3 instruction triggers a breakpoint exception, and the kernel's breakpoint handler is run. When this trap fires, registers are saved and kprobe handling is invoked via the notifier call chain mechanism. Kprobe handling finds the associated “struct kprobe” in order to call its handler(s). The first handler - run prior to executing the traced instruction - is called the “pre-handler”. Then the original instruction is single-stepped. kprobes also have an optional “post-handler” which, as you have probably guessed, runs after the single step. At this point, normal execution proceeds.

Note that this is not a trampoline, since we use a trap to take control for instrumentation. A trampoline is a (series of) jump(s) - hence the name - to a routine that does the setup before jumping on to the code required to accomplish the task at hand. This also shows why a trampoline performs better than a trap: the trap requires executing the trap handler, with the overheads associated with that, plus the overhead of finding the kprobe associated with the trap. It essentially means handling an interrupt every time we gather trace information, whereas jumping to a trampoline address does not incur these costs.

Note however that there is an ftrace optimization: when we place a kprobe at an ftrace site (function entry), the ftrace trampoline is used for kprobe handling instead of the trap.

kprobes can also use a jump optimization at sites other than function entry. A set of conditions needs to be satisfied at the to-be-probed address; these all boil down to checking

  • If there is space; and
  • If it is safe

…to replace the instruction with a jump.

The target of the jump optimization is a prepared detour buffer: code that saves registers, calls a trampoline (which in turn calls the probe handlers), restores registers and then jumps back. In other words, we create code to do all those things and jump to it, instead of issuing an expensive trap.

How do I use them?

There are many ways to make use of kprobes:

  • tracefs has support for enabling kprobe activation.
  • A kernel module can be written using the kprobe interfaces; a minimal sketch follows this list.
  • BPF programs can also be attached to kprobes.
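
A minimal sketch of the kernel-module approach follows; this module is illustrative (it is not part of the sample code mentioned earlier) and simply logs calls to uptime_proc_show() - the function we will also use in the demo below - from its pre- and post-handlers:

#include <linux/module.h>
#include <linux/kprobes.h>

/* Pre-handler: runs before the probed instruction is single-stepped. */
static int up_pre(struct kprobe *p, struct pt_regs *regs)
{
    pr_info("uptime_proc_show entered, ip 0x%lx\n", instruction_pointer(regs));
    return 0;
}

/* Optional post-handler: runs after the single step completes. */
static void up_post(struct kprobe *p, struct pt_regs *regs, unsigned long flags)
{
    pr_info("uptime_proc_show probe post-handler\n");
}

static struct kprobe kp = {
    .symbol_name = "uptime_proc_show",
    .pre_handler = up_pre,
    .post_handler = up_post,
};

static int __init kp_example_init(void)
{
    return register_kprobe(&kp);
}

static void __exit kp_example_exit(void)
{
    unregister_kprobe(&kp);
}

module_init(kp_example_init);
module_exit(kp_example_exit);
MODULE_LICENSE("GPL");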

Demo

We will illustrate how the kernel text changes when a kprobe is added. In this example - because of the ftrace optimization mentioned above - we will avoid probing the function entry itself, so that the trap-based mechanism is used and we can see it at work.

For this example - since we are using gdb - we will need the vmlinux binary associated with the kernel. For Oracle Linux, we install the “kernel-uek-debuginfo” package; other distributions have similar packages containing kernel debugging-related components.

First we will take a look at our to-be-traced function using gdb:

# gdb /usr/lib/debug/usr/lib/modules/`uname -r`/vmlinux /proc/kcore
(gdb) disassemble uptime_proc_show
Dump of assembler code for function uptime_proc_show:
   0xffffffff8140d360 <+0>: e8 5b fb c6 ff  callq  0xffffffff8107cec0 <__fentry__>
   0xffffffff8140d365 <+5>: 55  push   %rbp
   0xffffffff8140d366 <+6>: 48 89 e5    mov    %rsp,%rbp
   0xffffffff8140d369 <+9>: 41 55   push   %r13
...

Note that the above is the code in the vmlinux binary, not the code the kernel is currently running, so any modifications to program text will not show up in that disassembly listing.

/proc/kallsyms shows us where the function actually resides (kernel address space layout randomization means it will not be the address from the vmlinux symbol table):

$ grep uptime_proc_show /proc/kallsyms
ffffffff9840d360 t uptime_proc_show

For now we are going to instrument uptime_proc_show+0x9, i.e.

   0xffffffff8140d369 <+9>:     41 55   push   %r13

Via /proc/kallsyms, we saw that the address to be instrumented will be 0xffffffff9840d369 (0xffffffff9840d360 + 0x9), i.e. 9 bytes into the uptime_proc_show() function.

So we examine the memory at that byte before enabling the kprobe:

(gdb) x/xb 0xffffffff9840d369
0xffffffff9840d369: 0x41

Currently it is a push instruction, as we would expect - we can see the opcode in the disassembly above.

Next we enable the kprobe, trigger it, and check it fired:

$ echo 'p:uptime_trace uptime_proc_show+0x9' > /sys/kernel/debug/tracing/kprobe_events
$ echo 1 > /sys/kernel/debug/tracing/events/kprobes/uptime_trace/enable
$ uptime 
 07:14:46 up 2 days, 19:50,  1 user,  load average: 0.09, 0.04, 0.01
$ cat /sys/kernel/debug/tracing/trace_pipe
          uptime-927083  [002] d.Z.. 244211.532838: uptime_trace: (uptime_proc_show+0x9/0x131)

Okay, so our probe fired. We now examine the instruction we traced to see how it has changed:

(gdb) x/xb 0xffffffff9840d369
0xffffffff9840d369: 0xcc
(gdb) 

Now, there is our INT3 instruction!

Finally we disable the probe and see if the instruction changed again…

$ echo 0  > /sys/kernel/debug/tracing/events/kprobes/uptime_trace/enable
...
(gdb) x/xb 0xffffffff9840d369
0xffffffff9840d369: 0x41

Yep, back to normal - it is a push instruction again.

As mentioned above, BPF programs can also be attached to kprobes. Here is an example that prints “called do_sysinfo” when the sysinfo system call is used:

#include "vmlinux.h"

#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

char _license[] SEC("license") = "GPL";

SEC("kprobe/do_sysinfo")
int BPF_PROG(do_sysinfo, struct sysinfo *sysinfo)
{
    __bpf_printk("called do_sysinfo\n");
    return 0;
}

libbpf can use the section name “kprobe/do_sysinfo” to:

  • Look up the address of do_sysinfo in /proc/kallsyms;
  • Enable the kprobe; and
  • Attach the program to it
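
As a minimal sketch of the user-space side, the following loader relies on that auto-attach behaviour; the object file name kprobe_sysinfo.bpf.o is illustrative and assumes the BPF program above was compiled to that file:

#include <unistd.h>
#include <bpf/libbpf.h>

int main(void)
{
    struct bpf_object *obj;
    struct bpf_program *prog;

    /* Open and load the compiled BPF object (file name is illustrative). */
    obj = bpf_object__open_file("kprobe_sysinfo.bpf.o", NULL);
    if (libbpf_get_error(obj))
        return 1;
    if (bpf_object__load(obj))
        return 1;

    /* bpf_program__attach() uses the SEC("kprobe/do_sysinfo") name to
     * resolve the attach point and enable the kprobe. */
    bpf_object__for_each_program(prog, obj) {
        if (libbpf_get_error(bpf_program__attach(prog)))
            return 1;
    }

    pause();    /* keep the probe attached until interrupted */
    return 0;
}

While it runs, the bpf_printk() output appears in /sys/kernel/debug/tracing/trace_pipe whenever do_sysinfo() is called.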

For more info

Also see:

…which describes kprobe-based event tracing.

kretprobe instrumentation

kretprobes fire on function return, just prior to returning control to the calling function. They use kprobes under the hood to work, so the core mechanisms are the same as above.

How do they assume control?

Recall that when we call a function, we push the return address to the stack. kretprobes use a neat trick to instrument function return.

kretprobes establish a kprobe at function entry using the mechanisms described above. When that probe fires, the kprobe handler stores the return address that was pushed to the stack on function entry. Then that return address on the stack is replaced with the address of the kretprobe handler.

What this means is when the function “returns”, it actually jumps to the kretprobe handler instead! From there it can carry out kretprobe actions and then do the “actual” return.

Because a common pattern is measuring something like execution time across a function call, kretprobes also support an entry handler, which is triggered by that entry probe we mentioned above. Using the entry and return handlers together is a neat way to observe changes across a function's lifetime.

How do I use them?

Similarly to kprobes:

  • tracefs has support for enabling kretprobe activation.
  • A kernel module can be written using the kprobe interfaces to enable kretprobes.
  • BPF programs can also be attached to kretprobes; a sketch follows this list.
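
As a minimal BPF sketch of that entry-plus-return pattern, the following pairs a kprobe and a kretprobe on do_sysinfo() to report how long each call took (this is illustrative; do_sysinfo is just a convenient target, and here we use separate kprobe/kretprobe programs rather than the in-kernel entry handler):

#include "vmlinux.h"

#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 8192);
    __type(key, u32);
    __type(value, u64);
} start SEC(".maps");

/* Entry probe: remember when this thread entered do_sysinfo(). */
SEC("kprobe/do_sysinfo")
int BPF_KPROBE(do_sysinfo_entry)
{
    u32 tid = (u32)bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();

    bpf_map_update_elem(&start, &tid, &ts, 0);
    return 0;
}

/* Return probe: compute and report the elapsed time. */
SEC("kretprobe/do_sysinfo")
int BPF_KRETPROBE(do_sysinfo_return)
{
    u32 tid = (u32)bpf_get_current_pid_tgid();
    u64 *tsp = bpf_map_lookup_elem(&start, &tid);

    if (tsp) {
        bpf_printk("do_sysinfo took %llu ns\n", bpf_ktime_get_ns() - *tsp);
        bpf_map_delete_elem(&start, &tid);
    }
    return 0;
}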

For more info

As above,

uprobes

As mentioned above, it might seem odd that the kernel is used to instrument userspace. However, the methods used closely mirror those used for kprobes.

How do they assume control?

Remember our friend, the INT3 trap instruction? It is again used, but this time to instrument user-space code.

The user provides an offset in a binary, and the uprobe registration mechanism records the original instruction and then places a software breakpoint (an INT3 on x86_64) at the specified offset in each VMA (Virtual Memory Area) associated with the specified inode. As a result, when code execution arrives at the INT3, the trap is generated and the associated uprobe is located.

How do I use them?

We can again use tracefs interfaces; there is a clear example in:

Using uprobe interfaces requires specification of:

  • The binary to be traced
  • The raw offset within that object (which must be calculated)
  • Any info required at trace time (register values)

We can also use BPF, as we will describe next.

Demo

As we saw, the tracefs uprobe interfaces require explicitly specifying an offset in a binary for uprobe instrumentation. The Oracle Linux team recently worked with the upstream community to simplify aspects of uprobe support in libbpf. Specifically, we wanted a kprobe-like experience, where the function name is all that is required in the section definition to specify the point of attachment. We also wanted to simplify binary specification, allowing a binary name rather than a full path; libbpf then looks in standard and environment-determined locations for the program/library. Note that these changes are available in libbpf 0.8 and later.

So for example to sum malloc()ations by process:

#include "vmlinux.h"

#include <bpf/bpf_core_read.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Running total of requested allocation sizes, keyed by pid. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 8192);
    __type(key, int);
    __type(value, int);
} alloc_map SEC(".maps");

SEC("uprobe/libc.so.6:malloc")
int malloc_counter(struct pt_regs *ctx)
{
    int pid = bpf_get_current_pid_tgid() >> 32;
    int *szp, sz = 0;

    szp = bpf_map_lookup_elem(&alloc_map, &pid);
    if (szp)
        sz += *szp;
    /* The first argument to malloc() is the requested size. */
    sz += PT_REGS_PARM1_CORE(ctx);
    bpf_map_update_elem(&alloc_map, &pid, &sz, 0);
    return 0;
}

char _license[] SEC("license") = "GPL";

This is a lot more portable than:

SEC("uprobe//usr/lib64/libc.so.6:0x4085740")

…and since uprobe searches LD_LIBRARY_PATH/PATH when passed a library or binary name, adding search paths is easy too!
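
To read those per-process totals back, a user-space loader can walk alloc_map once the object above has been opened, loaded and attached (much as in the kprobe loader sketch earlier); here is a minimal sketch of just the map-walking part:

#include <stdio.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

/* Print the per-pid totals accumulated by the malloc uprobe above. */
static void dump_allocs(struct bpf_object *obj)
{
    int fd = bpf_object__find_map_fd_by_name(obj, "alloc_map");
    int key, next_key, sz;
    void *prev = NULL;

    if (fd < 0)
        return;
    /* Passing NULL as the previous key retrieves the first key. */
    while (bpf_map_get_next_key(fd, prev, &next_key) == 0) {
        if (bpf_map_lookup_elem(fd, &next_key, &sz) == 0)
            printf("pid %d requested %d bytes via malloc()\n", next_key, sz);
        key = next_key;
        prev = &key;
    }
}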

For more info

uretprobes

uretprobes have the same relationship to uprobes that kretprobes have to kprobes: an entry probe is used under the hood to replace the return address on function entry with that of the uretprobe dispatcher, so that function return can be traced.

USDT probes

USDT tracing relies on the uprobe mechanism, and much as uprobes are the userspace analogue to kprobes, USDT probes are the userspace analogue to tracepoints.

They are used to explicitly trace particular events of interest, so - again similarly to tracepoints:

  • We are not tied to function entry and return
  • We can place probes for the same event in multiple places where that is appropriate
  • We can assemble arguments, and thus provide more stable interfaces than userspace functions that may come and go
  • We can also use is-enabled probing, which minimizes the overhead of probe argument preparation by skipping that preparation when no probes are enabled

USDT tracing works by adding information about these points of interest to an ELF notes section named ‘.note.stapsdt’; for example:

$ objdump -h /usr/sbin/libvirtd  |grep .note.stapsdt
 27 .note.stapsdt 00000364  0000000000000000  0000000000000000  0007c188  2**2

The ELF notes for USDT probes contain:

  • A string specifying a provider name
  • A string specifying a probe name
  • A probe address
  • The address of the associated semaphore (optional)
  • Argument info

The semaphore is used for is-enabled probe points, where the semaphore value determines whether the probe code is executed. This is useful when preparing arguments for probes is expensive enough that we want to avoid doing so when not tracing.

How do they assume control?

The uprobe mechanism is used to replace the NOP at the point of instrumentation with an INT3 trap. If the is-enabled form is used, the USDT instrumentation method must also increment the semaphore value to ensure the trap instruction is reached.

How do I use them?

We can see available USDT probes using perf list:

$ perf list|grep sdt
  sdt_libc:lll_futex_wake                            [SDT event]
  sdt_libc:lll_lock_wait_private                     [SDT event]
  sdt_libc:longjmp                                   [SDT event]
  sdt_libc:longjmp_target                            [SDT event]
  sdt_libc:memory_arena_new                          [SDT event]
  sdt_libc:memory_arena_retry                        [SDT event]
  sdt_libc:memory_arena_reuse                        [SDT event]
  ...

Additional probes can be added to the list using “perf buildid-cache”. For example:

$ perf buildid-cache --add /usr/sbin/libvirtd
$ perf list |grep sdt_libvirt
  sdt_libvirt:rpc_server_client_auth_allow           [SDT event]
  sdt_libvirt:rpc_server_client_auth_deny            [SDT event]
  sdt_libvirt:rpc_server_client_auth_fail            [SDT event]

Support for USDT tracing was added to libbpf 0.8 by Andrii Nakryiko; this is a great feature, closing one of the last gaps between libbpf and bcc tracing support. See the USDT-related tests in tools/testing/selftests/bpf/prog[_tests] for examples of how to use this support.

Demo

We can add simple USDT probes to a program as follows. /usr/include/sys/sdt.h is required, since it provides the definitions that add the ELF note info for probes; it is available in the systemtap-sdt-devel package.

#include <stdio.h>
#include <sys/sdt.h>

int main(int argc, char *argv[])
{
  int nargs = argc;

  getc(stdin);

  nargs = argc;

  /* USDT probe: provider "example", probe "args", one argument. */
  DTRACE_PROBE1(example, args, nargs);

  return 0;
}

Compile the above, then add the probe info via perf and trace it:

$ /usr/bin/perf buildid-cache --add ./usdt_example
$ /usr/bin/perf list |grep sdt_example

  sdt_example:args                                [SDT event]
$ /usr/bin/perf probe sdt_example:args
$ /usr/bin/perf record -e sdt_example:args -aR ./usdt_example

…and press a key. You should see:

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.499 MB perf.data (1 samples) ]

Running perf report we see:

Samples: 1  of event 'sdt_example:args', Event count (approx.): 1
Overhead  Command       Shared Object  Symbol
 100.00%  usdt_example  usdt_example   [.] main

With libbpf >= 0.8, we can also write a BPF program to capture data from the USDT probe. In this example we record the number of arguments in the global "got_nargs".

#include "vmlinux.h"

#include <bpf/usdt.bpf.h>

int got_nargs = 0;

char _license[] SEC("license") = "GPL";

SEC("usdt//proc/self/exe:example:args")
int BPF_USDT(args, int nargs)
{
        /* BPF_USDT() retrieves the USDT probe's arguments; nargs is the
         * probe's first (and only) argument. */
        got_nargs = nargs;

        return 0;
}
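
The SEC() name above targets /proc/self/exe, which is convenient when the traced program loads the BPF object itself; if a separate loader process is used, /proc/self/exe would resolve to that loader, so attaching explicitly to the usdt_example binary is clearer. A minimal sketch using bpf_program__attach_usdt() (libbpf >= 0.8; the object file name is illustrative):

#include <unistd.h>
#include <bpf/libbpf.h>

int main(void)
{
    struct bpf_object *obj;
    struct bpf_program *prog;
    struct bpf_link *link;

    /* Object file name is illustrative. */
    obj = bpf_object__open_file("usdt_example.bpf.o", NULL);
    if (libbpf_get_error(obj) || bpf_object__load(obj))
        return 1;

    prog = bpf_object__find_program_by_name(obj, "args");
    if (!prog)
        return 1;

    /* Attach to provider "example", probe "args" in ./usdt_example,
     * for any process (-1). */
    link = bpf_program__attach_usdt(prog, -1, "./usdt_example",
                                    "example", "args", NULL);
    if (libbpf_get_error(link))
        return 1;

    pause();    /* run ./usdt_example and press a key meanwhile */
    return 0;
}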

Summary

So we have seen how trap instructions can be used to instrument both kernel and userspace code; kprobes, uprobes and USDT probes all rely on this trapping mechanism to trigger instrumentation.

However, over time, with other instrumentation technologies, the focus has shifted towards minimizing overhead, and ftrace in particular has been a driving force in innovating methods for low-overhead kernel instrumentation via clever use of trampolines. More recently, BPF has begun to make use of these methods to support low-overhead, highly flexible instrumentation that - as well as supporting classic instrumentation targets like functions and tracepoints - also supports attaching to other BPF programs! In a future blog post, we will cover some of these developments.

References

Sample code detailed above is available from:

Description of kprobe internals, and kprobe tracefs interfaces:

Description of u(ret)probes:

Alan Maguire

