Intro to Kernel and Userspace Tracing Using BCC, Part 3 of 3

August 24, 2021 | 51 minute read

Welcome to the third and last part of this 3-part series on Kernel and Userspace Tracing Using BCC. In this final part, we're going to look at a much more involved BCC script example. This example shows how to use both dynamic and static tracing in QEMU and the Linux kernel, and exercises more of BCC's real capabilities. This page also assumes you're comfortable navigating large source repositories such as QEMU and the Linux kernel, since for both we'll be reading, editing, and rebuilding the source code.

Example 3: Virtio in QEMU and Kernel

Let's take a look at adding some observability to virtio, both in QEMU (userspace) and in the kernel. In this example, we'll introduce uprobes, uretprobes, USDT tracepoints, and kernel tracepoints. The goal is not only to show you how to use each of these tracing technologies, but also to show how we can get full visibility into a system, choose the data we want, and choose how we see it.

Uprobes, Uretprobes, and Histograms

First, we'll get some uprobes and uretprobes instrumented for virtio-net. More specifically, for virtio-net's RX virtqueue, let's create a log2 histogram of receiving packet sizes. For virtio-net's TX virtqueue, we'll create a log2 histogram of the number of packets per transmission.

Recall QEMU's main TX and RX handler functions:

// RX
static ssize_t virtio_net_receive_rcu(NetClientState *nc,
                                      const uint8_t *buf,
                                      size_t size) {
    ...
}
// TX
static int32_t virtio_net_flush_tx(VirtIONetQueue *q) {
    ...
}


For instrumenting virtio_net_receive_rcu, we'll enable a uprobe and a uretprobe and create separate histograms from size_t size (the size of the incoming packet is both an argument and the return value). The two histograms should come out identical.

For instrumenting virtio_net_flush_tx, we'll just enable a uretprobe, since the return value is the number of packets transmitted and we can histogram it directly.

To get virtio-net on your guest, you can include the following parameters in the QEMU invocation:

-device virtio-net-pci,netdev=tapnet
-netdev tap,id=tapnet,vhost=off,ifname=tap0,script=/etc/qemu-ifup,downscript=no


Examine the following BPF program, and notice that there's no BPF_PERF_OUTPUT defined anywhere, since we read directly from the BPF histogram maps. (Note: you'll have to change the path in the variable dev_qemu_path to point to your own QEMU binary, and that binary must retain its symbol table so BCC can resolve the function names.)

#!/usr/bin/python

from bcc import BPF
from time import sleep

bpf_txt = """
#include <uapi/linux/ptrace.h>

BPF_HISTOGRAM(rx_entry_hist);
BPF_HISTOGRAM(rx_retval_hist);
BPF_HISTOGRAM(tx_retval_hist);

int start_vnet_rec_rcu(struct pt_regs *ctx) {
    u64 size;
    size = PT_REGS_PARM3(ctx);
    rx_entry_hist.increment(bpf_log2l(size));
    return 0;
}

int end_vnet_rec_rcu(struct pt_regs *ctx) {
    int retval;
    retval = PT_REGS_RC(ctx);
    rx_retval_hist.increment(bpf_log2l(retval));
    return 0;
}

int end_vnet_tx(struct pt_regs *ctx) {
    int retval;
    retval = PT_REGS_RC(ctx);
    tx_retval_hist.increment(bpf_log2l(retval));
    return 0;
}
"""

# Load BPF program
dev_qemu_path = \
    "/mnt/jonpalme-bv1/repos/oracle/qemu/build/x86_64-softmmu/qemu-system-x86_64"

bpf_ctx = BPF(text=bpf_txt)
bpf_ctx.attach_uprobe(name=dev_qemu_path,
                      sym="virtio_net_receive_rcu",
                      fn_name="start_vnet_rec_rcu")
bpf_ctx.attach_uretprobe(name=dev_qemu_path,
                         sym="virtio_net_receive_rcu",
                         fn_name="end_vnet_rec_rcu")
bpf_ctx.attach_uretprobe(name=dev_qemu_path,
                         sym="virtio_net_flush_tx",
                         fn_name="end_vnet_tx")

# Print header
print("Aggregating data from virtio-net RX & TX virtqueues... "
      "Hit Ctrl-C to end.")

# Wait for Ctrl-C
while 1:
    try:
        sleep(1)
    except KeyboardInterrupt:
        print("\n")
        break

# Output
print("Virtio-net RX (entry) log2 packet size histogram")
print("---------------------------------------------")
bpf_ctx["rx_entry_hist"].print_log2_hist("size")
print("\n")
print("Virtio-net RX (return) log2 packet size histogram")
print("----------------------------------------------")
bpf_ctx["rx_retval_hist"].print_log2_hist("retSize")
print("\n")
print("virtio-net TX (return) log2 number of packets transmitted")
bpf_ctx["tx_retval_hist"].print_log2_hist("num packets")
print("\n")


Open another window to run this script, then get some network traffic going on your guest. In this example and the next, I simply wget a file to generate network traffic. After a few moments, Ctrl-C the script and you should have something like the following:

[root@jonpalme-ol7-bm: my_scripts]# ./qemu_uprobes.py
Aggregating data from virtio-net RX & TX virtqueues... Hit Ctrl-C to end.
^C

Virtio-net RX (entry) log2 packet size histogram
---------------------------------------------
     size                : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 7        |                                        |
        64 -> 127        : 17       |                                        |
       128 -> 255        : 1        |                                        |
       256 -> 511        : 1        |                                        |
       512 -> 1023       : 5        |                                        |
      1024 -> 2047       : 47906    |****                                    |
      2048 -> 4095       : 30912    |***                                     |
      4096 -> 8191       : 100188   |*********                               |
      8192 -> 16383      : 411462   |****************************************|
     16384 -> 32767      : 278      |                                        |

Virtio-net RX (return) log2 packet size histogram
----------------------------------------------
     retSize             : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 7        |                                        |
        64 -> 127        : 17       |                                        |
       128 -> 255        : 1        |                                        |
       256 -> 511        : 1        |                                        |
       512 -> 1023       : 5        |                                        |
      1024 -> 2047       : 47906    |****                                    |
      2048 -> 4095       : 30912    |***                                     |
      4096 -> 8191       : 100188   |*********                               |
      8192 -> 16383      : 411462   |****************************************|
     16384 -> 32767      : 278      |                                        |

virtio-net TX (return) log2 number of packets transmitted
     num packets         : count     distribution
         0 -> 1          : 782827   |****************************************|
         2 -> 3          : 78190    |***                                     |
         4 -> 7          : 740      |                                        |
         8 -> 15         : 24       |                                        |
        16 -> 31         : 1        |                                        |


Note: If you did not receive any output from the script, make sure virtio-net is active and that the guest has a working network connection.

Looking at the output from this script, we see that the histograms for RX's entry and return are exactly the same, as we expected.

For TX, we notice that the bulk of transmit operations send only 1 packet at a time. You probably also noticed the slight nuance between enabling kprobes/kretprobes and uprobes/uretprobes: user-level probes must name the binary or library containing the symbol, while kernel probes only need the symbol name.

Let's take a closer look at some of the new things we see here:

BPF_HISTOGRAM(name): Creates a histogram map named name, with optional parameters.

  • Full signature with defaults: BPF_HISTOGRAM(name, key_type=int, size=64)
  • See BPF_HISTOGRAM in the BCC reference guide

PT_REGS_PARMn(ctx): Retrieves the n'th argument of the function being instrumented.

  • To retrieve a kprobe/uprobe's arguments, you can use this macro, or the method from the previous example where we added the probed function's parameters to our BPF C function's parameter list (both approaches appear in the sketch after this list)

hist.increment(bpf_log2l(val)): Increments the histogram slot indexed by the log-2 of val.

attach_uprobe(name='...', sym='...', fn_name='...'): Instruments the user-level function sym in the library or binary named by name using user-level dynamic tracing of the function entry, and attaches our C-defined function (fn_name) to be called whenever the user-level function is called.

  • In our example, I use the full path to the QEMU binary (dev_qemu_path) that I'm running the guests on
  • You'll have to put in your own path for this to work
  • See attach_uprobe in the BCC reference guide

attach_uretprobe(name='...', sym='...', fn_name='...'): Instruments the return of the user-level function sym in the library or binary named by name using user-level dynamic tracing of the function return, and attaches our C-defined function (fn_name) to be called whenever the user-level function returns.

bpf_ctx["hist_name"].print_log2_hist("label"): Prints the BPF histogram table named hist_name as a log-2 histogram in ASCII.

  • The label of the first column is labled as label
  • The table must be stored as log-2 (see hist.increment(bpf_log2l(val)))
  • print_log2_hist
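
To make these concrete, here's a small sketch (not part of the script above; the names lat_hist and probe_vnet_rx are illustrative) showing a histogram with an explicit key type and slot count, alongside the parameter-declaration style of reading probe arguments:

BPF_HISTOGRAM(lat_hist, u64, 128);   /* explicit key type and slot count */

/* Instead of PT_REGS_PARM3(ctx), declare the probed function's
   parameters after ctx and let BCC rewrite the accesses for us: */
int probe_vnet_rx(struct pt_regs *ctx, void *nc, const char *buf,
                  size_t size) {
    lat_hist.increment(bpf_log2l(size));
    return 0;
}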

Kernel Tracepoints for Vhost-Net

Dynamic instrumentation (uprobes, kprobes, uretprobes, kretprobes) is great in the sense that the probe points are already there, ready to be enabled. On the other hand, it has drawbacks of its own. For example, what if an argument we rely on in one of these probes changes? Or what if we want information from inside the function body?

Let's take a look at the vhost-net driver code inside of the kernel (linux-uek/drivers/vhost/net.c), more specifically, the functions handle_tx and handle_rx:

static void handle_rx(struct vhost_net *net) {
    ...
    do {
        ...
    } while (likely(!vhost_exceeds_weight(vq, ++recv_pkts, total_len)));
    ...
}

static void handle_tx(struct vhost_net *net) {
    ...
    do {
        ...
    } while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));
    ...
}


Of course we can still use kprobes and kretprobes here, but a kretprobe won't give us much information (there's no return value), and with a kprobe we only have access to the data structure struct vhost_net *net. While this vhost_net structure can provide a plethora of information, it's not much help if we're trying to find, for example, the number of packets received and transmitted (recv_pkts and sent_pkts).

How would we do this? One answer is kernel tracepoints. Recall that kernel tracepoints are a form of static instrumentation, so we need to define a tracepoint and place it directly in the code.

I won't go into full detail on how these kernel tracepoints are implemented, but I highly encourage you to check out LWN.net's 3-part article on setting up kernel tracepoints with the TRACE_EVENT() macro (linked in the References section at the end of this page).

Instead, I'll highlight some of the key points:

Aside: Implementing Kernel Tracepoints

Essentially, what we want is this (let's trace total_len too):

static void handle_rx(struct vhost_net *net) {
    ...
    do {
        ...
    } while (likely(!vhost_exceeds_weight(vq, ++recv_pkts, total_len)));

    trace_vhost_net_rx_end(recv_pkts, total_len);
    ...
}

static void handle_tx(struct vhost_net *net) {
    ...
    do {
        ...
    } while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));

    trace_vhost_net_tx_end(sent_pkts, total_len);
    ...
}


Here we want to define and implement the kernel tracepoints trace_vhost_net_rx_end and trace_vhost_net_tx_end, passing in recv_pkts, sent_pkts, and total_len so we can trace them.

Now the question is: where and how do we define these kernel tracepoints? They can technically be defined anywhere, but they typically live in the kernel tracepoint header files under linux-uek/include/trace/events. Each header file can contain many kernel tracepoints, and the convention is to keep all of a subsystem's tracepoints defined in its corresponding header file.

Since we're observing vhost-net's handle_rx and handle_tx functions (and there are no tracepoints defined for vhost-net), we'll make our own kernel tracepoint header file called vhost_net.h.

That is, we create the following file linux-uek/include/trace/events/vhost_net.h:

/* Kernel tracepoints for vhost_net virtio functions */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM vhost_net

#if !defined(_TRACE_VHOST_NET_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_VHOST_NET_H

#include <linux/tracepoint.h>
#include <linux/vhost.h>

TRACE_EVENT(vhost_net_rx_end,

        TP_PROTO(int recv_pkts, int total_len),

        TP_ARGS(recv_pkts, total_len),

        TP_STRUCT__entry(
            __field(    int,    recv_pkts   )
            __field(    int,    total_len   )
        ),

        TP_fast_assign(
            __entry->recv_pkts = recv_pkts;
            __entry->total_len = total_len;
        ),

        TP_printk("recv_pkts=%d, total_len=%d",
            __entry->recv_pkts,
            __entry->total_len)
);

TRACE_EVENT(vhost_net_tx_end,

        TP_PROTO(int sent_pkts, int total_len),

        TP_ARGS(sent_pkts, total_len),

        TP_STRUCT__entry(
            __field(    int,    sent_pkts   )
            __field(    int,    total_len   )
        ),

        TP_fast_assign(
            __entry->sent_pkts = sent_pkts;
            __entry->total_len = total_len;
        ),

        TP_printk("sent_pkts=%d, total_len=%d",
            __entry->sent_pkts,
            __entry->total_len)
);

#endif /*_TRACE_VHOST_NET_H */

/* This part must be outside protection */
#include <trace/define_trace.h>


Again, you'll want to check the LWN.net articles linked in the References section below to understand all the details of what's going on here. In any case, you'll end up with something very similar to this (you might define your printk differently).

Also notice that we've defined the tracepoint names without a trace_ prefix. The TRACE_EVENT() macro generates the callable function with the trace_ prefix, so when we place a tracepoint in the code we call it as trace_vhost_net_rx_end() or trace_vhost_net_tx_end().

Next, we'll add the kernel tracepoint calls (trace_vhost_net_rx_end and trace_vhost_net_tx_end) for RX and TX right after each do-while loop, as shown in the snippet at the start of this aside.

Lastly, add the following lines to the top of net.c:

// Other header files...
...
#define CREATE_TRACE_POINTS
#include <trace/events/vhost_net.h>
...


At this point, the kernel tracepoints are defined and placed. The only catch is that static instrumentation is compiled into the corresponding binary, which means we'll have to rebuild and reinstall the modified kernel.

In the top directory of the kernel repo (e.g. linux-uek), build and install the new kernel:

[root@jonpalme-ol7-bm: linux-uek]# make -j 144
  CHK     include/config/kernel.release
  CHK     include/generated/uapi/linux/version.h
  DESCEND  objtool
  ...

[root@jonpalme-ol7-bm: linux-uek]# make modules_install
  INSTALL arch/x86/crypto/aesni-intel.ko
  INSTALL arch/x86/crypto/blowfish-x86_64.ko
  INSTALL arch/x86/crypto/camellia-aesni-avx-x86_64.ko
  ...
  DEPMOD  4.14.35+

[root@jonpalme-ol7-bm: linux-uek]# make install
sh ./arch/x86/boot/install.sh 4.14.35+ arch/x86/boot/bzImage \
    System.map "/boot"
bash
nss-softokn
...


Now that the new kernel is ready to be booted into, you'll want to open a serial connection to your machine and reboot over that connection (so you can select the new kernel to boot into).

Hint: if you want more time to select a kernel version at startup, set GRUB_TIMEOUT=n in /etc/default/grub, where n is a number of seconds, then regenerate the GRUB configuration (e.g. grub2-mkconfig -o /boot/grub2/grub.cfg on a BIOS-based system). This will give you n seconds to select your kernel.

Once you've rebooted into your new kernel, spawn a guest with vhost enabled (note vhost=on):

-device virtio-net-pci,netdev=tapnet
-netdev tap,id=tapnet,vhost=on,ifname=tap0,script=/etc/qemu-ifup,downscript=no


To check that your kernel tracepoints are implemented and traceable, you can use BCC's premade tool tplist.py, which lists available kernel tracepoints and USDT tracepoints. You can find it at the default path /usr/share/bcc/tools/tplist.

If you run it as ./tplist.py | grep vhost, you should see our two kernel tracepoints:

[root@jonpalme-ol7-bm: tools]# ./tplist.py | grep vhost
vhost_net:vhost_net_rx_end
vhost_net:vhost_net_tx_end

Kernel Tracepoints for Vhost-Net (Cont.)

Now that we've implemented our kernel tracepoints, and we've checked that they're there, all that's left to do is build the BPF program.

We'll mirror what we did in QEMU for virtio-net's RX and TX virtqueues, except that we now also have the total length for the TX virtqueue. So we'll have 4 histograms: for each of the RX and TX virtqueues, one histogram for the number of received/sent packets and one for their total length.

Examine the code below, taking note of the struct our_ktp_args and the format of attach_tracepoint:

#!/usr/bin/python

from bcc import BPF
from time import sleep

bpf_txt = """

BPF_HISTOGRAM(rx_pkts_hist);
BPF_HISTOGRAM(rx_len_hist);
BPF_HISTOGRAM(tx_pkts_hist);
BPF_HISTOGRAM(tx_len_hist);

struct our_ktp_args {
    u64 __unused__;
    int packets;
    int total_length;
};

int rx_end(struct our_ktp_args *args) {
    int recv_pkts = args->packets;
    int len = args->total_length;

    rx_pkts_hist.increment(bpf_log2l(recv_pkts));
    rx_len_hist.increment(bpf_log2l(len));
    return 0;
}

int tx_end(struct our_ktp_args *args) {
    int sent_pkts = args->packets;
    int len = args->total_length;

    tx_pkts_hist.increment(bpf_log2l(sent_pkts));
    tx_len_hist.increment(bpf_log2l(len));
    return 0;
}
"""
# Load BPF program
bpf_ctx = BPF(text=bpf_txt)
bpf_ctx.attach_tracepoint(tp="vhost_net:vhost_net_rx_end",
                          fn_name="rx_end")
bpf_ctx.attach_tracepoint(tp="vhost_net:vhost_net_tx_end",
                          fn_name="tx_end")

# Print header
print("Aggregating data from vhost-net RX & TX virtqueues... "
      "Hit Ctrl-C to end.")

# Wait for Ctrl-C
while 1:
    try:
        sleep(1)
    except KeyboardInterrupt:
        print("\n")
        break

# Output
print("Vhost-net RX log2 received packets histogram")
print("--------------------------------------------")
bpf_ctx["rx_pkts_hist"].print_log2_hist("recv pkts")
print("\n")

print("Vhost-net RX log2 total length histogram")
print("----------------------------------------")
bpf_ctx["rx_len_hist"].print_log2_hist("total len")
print("\n")

print("-------------------------------------------------")
print("\n")

print("Vhost-net TX log2 sent packets histogram")
print("----------------------------------------")
bpf_ctx["tx_pkts_hist"].print_log2_hist("sent pkts")
print("\n")

print("Vhost-net TX log2 total length histogram")
print("----------------------------------------")
bpf_ctx["tx_len_hist"].print_log2_hist("total len")
print("\n")


To get some output, start up a guest with vhost enabled, run the script, and in another window connect to the guest and generate some network traffic. Then Ctrl-C the script:

[root@jonpalme-ol7-bm: my_scripts]# ./kernel_tracepts.py
Aggregating data from vhost-net RX & TX virtqueues... Hit Ctrl-C to end.
^C

Vhost-net RX log2 received packets histogram
--------------------------------------------
     recv pkts           : count     distribution
         0 -> 1          : 482291   |****************************************|
         2 -> 3          : 40457    |***                                     |
         4 -> 7          : 534      |                                        |
         8 -> 15         : 4        |                                        |
        16 -> 31         : 2        |                                        |

Vhost-net RX log2 total length histogram
----------------------------------------
     total len           : count     distribution
         0 -> 1          : 330      |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 8        |                                        |
        64 -> 127        : 32       |                                        |
       128 -> 255        : 9        |                                        |
       256 -> 511        : 8        |                                        |
       512 -> 1023       : 30       |                                        |
      1024 -> 2047       : 15275    |*                                       |
      2048 -> 4095       : 15516    |*                                       |
      4096 -> 8191       : 57986    |*****                                   |
      8192 -> 16383      : 433015   |****************************************|
     16384 -> 32767      : 1060     |                                        |
     32768 -> 65535      : 14       |                                        |
     65536 -> 131071     : 4        |                                        |
    131072 -> 262143     : 1        |                                        |

-------------------------------------------------

Vhost-net TX log2 sent packets histogram
----------------------------------------
     sent pkts           : count     distribution
         0 -> 1          : 264948   |****************************************|
         2 -> 3          : 8358     |*                                       |
         4 -> 7          : 199      |                                        |
         8 -> 15         : 1        |                                        |

Vhost-net TX log2 total length histogram
----------------------------------------
     total len           : count     distribution
         0 -> 1          : 18       |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 8        |                                        |
        64 -> 127        : 264919   |****************************************|
       128 -> 255        : 7998     |*                                       |
       256 -> 511        : 558      |                                        |
       512 -> 1023       : 5        |                                        |


For the above output, the same wget command was run on the same file, and the BPF program was terminated once the download had finished.

Comparing this output to our previous program, which measured virtio-net's RX and TX virtqueues, the size/length of the RX virtqueue's packets and the number of packets sent from the TX virtqueue are roughly similar here with vhost-net enabled. Not only that, we also got more observability (specifically, the number of packets received by the RX virtqueue and the total length of the packets sent by the TX virtqueue) than we were able to get with dynamic tracing.

Let's also take note of some of the new things we see here:

struct our_ktp_args: A self-defined data structure that holds the arguments of the kernel tracepoints.

  • In order to retrieve the arguments from our kernel tracepoints (e.g. sent_pkts/recv_pkts and total_len), we must define the data structure they're delivered in, exactly as its format is laid out
  • The format of the vhost_net_tx_end kernel tracepoint, for example, can be found in /sys/kernel/debug/tracing/events/vhost_net/vhost_net_tx_end/format:

[root@jonpalme-ol7-bm: my_scripts]# cat /sys/kernel/debug/tracing/events/vhost_net/vhost_net_tx_end/format
name: vhost_net_tx_end
ID: 1958
format:
    field:unsigned short common_type;           offset:0;   size:2; signed:0;
    field:unsigned char common_flags;           offset:2;   size:1; signed:0;
    field:unsigned char common_preempt_count;   offset:3;   size:1; signed:0;
    field:int common_pid;                       offset:4;   size:4; signed:1;

    field:int sent_pkts;    offset:8;   size:4; signed:1;
    field:int total_len;    offset:12;  size:4; signed:1;

print fmt: "sent_pkts=%d, total_len=%d", REC->sent_pkts, REC->total_len

  • There are 4 predefined fields: common_type, common_flags, common_preempt_count, and common_pid
    • These can be used or ignored, but the correct amount of space must be reserved for them either way
  • In our example, we ignore them; a single u64 placeholder covers their combined 8 bytes (you don't have to call it __unused__, just make sure it's a u64 or equivalent; see the sketch below)
  • Since both tracepoints vhost_net_rx_end and vhost_net_tx_end pass in the same types, we can use one tracepoint argument data structure (struct our_ktp_args) for both
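
For illustration, here's the same structure with the common fields spelled out instead of hidden behind the placeholder (a hypothetical equivalent; the offsets follow the format file above):

struct our_ktp_args_explicit {
    unsigned short common_type;          /* offset:0   size:2 */
    unsigned char  common_flags;         /* offset:2   size:1 */
    unsigned char  common_preempt_count; /* offset:3   size:1 */
    int            common_pid;           /* offset:4   size:4 */
    int            packets;              /* sent_pkts/recv_pkts, offset:8 */
    int            total_length;         /* total_len, offset:12 */
};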

Note: If we weren't using a histogram to output to userspace (and instead used perf_submit(), for example), we'd use the struct our_ktp_args *args in place of struct pt_regs *ctx when submitting data to perf's ring buffer, e.g. events.perf_submit(args, &data, sizeof(data)).
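
As a sketch of that variant (the names data_t and events are illustrative, not part of the scripts in this article):

struct data_t {
    int sent_pkts;
    int total_len;
};
BPF_PERF_OUTPUT(events);

int tx_end(struct our_ktp_args *args) {
    struct data_t data = {};

    data.sent_pkts = args->packets;
    data.total_len = args->total_length;
    /* The tracepoint args pointer takes the place of struct pt_regs *ctx */
    events.perf_submit(args, &data, sizeof(data));
    return 0;
}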

attach_tracepoint(tp='...', fn_name='...'): Instruments the kernel tracepoint tp and, when it's hit, runs the BPF function fn_name.

  • Kernel tracepoints are named in the format system-name:tracepoint-name
  • We set TRACE_SYSTEM to vhost_net in our include/trace/events/vhost_net.h, so that is the system-name
    • The tracepoint-name is the tracepoint name as it's defined in vhost_net.h
  • Thus, tp="vhost_net:vhost_net_rx_end" and tp="vhost_net:vhost_net_tx_end"
  • See attach_tracepoint in the BCC reference guide
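
Incidentally, BCC also provides the TRACEPOINT_PROBE(category, event) macro as an alternative to attach_tracepoint(); it generates the argument structure from the tracepoint's format file and attaches automatically, so no hand-written struct our_ktp_args or explicit attach call is needed. A sketch of the TX handler in that style:

TRACEPOINT_PROBE(vhost_net, vhost_net_tx_end) {
    /* args is auto-generated from the tracepoint's format file */
    tx_pkts_hist.increment(bpf_log2l(args->sent_pkts));
    tx_len_hist.increment(bpf_log2l(args->total_len));
    return 0;
}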

USDT Tracepoints in QEMU

Now that we've gotten better observability on vhost-net's RX and TX virtqueues using kernel tracepoints, let's see if we can do the same in userspace for virtio-net's RX and TX virtqueues, this time using USDT tracepoints.

Aside: Implementing USDT Tracepoints

Implementing USDT tracepoints in a userspace application can be much simpler and faster than implementing kernel tracepoints. It's worth mentioning that QEMU has its own accepted way of defining USDT tracepoints (similar in fashion to how a kernel tracepoint is defined). We won't cover those here; instead, we'll cover a more general USDT tracepoint mechanism that you can use in any userspace C application: the DTRACE_PROBEn() macro.

The DTRACE_PROBEn() macro is very easy to use. All you need is the systemtap-sdt-devel package, which provides the sys/sdt.h header file that defines the DTRACE_PROBEn() macros.

There's a great LWN.net article on using userspace tracepoints with BCC (linked in the References section), particularly its "Adding USDT probes to your application" section. As with kernel tracepoints, I won't go into detail; I'll just highlight the key parts of implementing USDT tracepoints with SystemTap's API.

To show how the macro works, I'll use a small C program with two USDT tracepoints in it. Examine the program below:

#include <sys/sdt.h>

int main(int argc, char **argv) {
    int x = 0;

    while (x < 1000000) {
        x++;
        if (x % 100000 == 0) {
            DTRACE_PROBE1(usdt-examples, usdt1, x);
        }
    }
    DTRACE_PROBE3(usdt-examples, usdt2, x, "Hello!", 5);
    return 0;
}


We'll compile this program (e.g. gcc -o usdt_example usdt_example.c; no extra libraries are needed, since each probe compiles down to a nop instruction plus an ELF note) and confirm that the USDT tracepoints are in place and usable via one of BCC's multipurpose tools, /usr/share/bcc/tools/trace. This trace script acts as a frontend that can run one-line trace commands like the ones below; I like to use it to quickly trace things when I need to.

First, we'll enable and start tracing these probes like so:

./trace.py 'u:path/to/usdt_example_binary:probe_name "x=%d", arg1'


For my setup, using both USDT tracepoints, the command line looks like this:

./trace.py \
  'u:/mnt/jonpalme-bv1/repos/bcc/my_scripts/usdt_example:usdt1 "x=%d", arg1' \
  'u:/mnt/jonpalme-bv1/repos/bcc/my_scripts/usdt_example:usdt2 "x=%d, arg2=%s, arg3=%d", arg1, arg2, arg3'


Run this command on your machine with the correct path to your compiled USDT example binary. Then, in a separate window, run the C program (e.g. ./usdt_example) and observe the output from the trace tool:

[root@jonpalme-ol7-bm: tools]# ./trace.py \
  'u:/mnt/jonpalme-bv1/repos/bcc/my_scripts/usdt_example:usdt1 "x=%d", arg1' \
  'u:/mnt/jonpalme-bv1/repos/bcc/my_scripts/usdt_example:usdt2 "x=%d, arg2=%s, arg3=%d", arg1, arg2, arg3'
PID     TID     COMM            FUNC             -
15543   15543   usdt_example    usdt1            x=100000
15543   15543   usdt_example    usdt1            x=200000
15543   15543   usdt_example    usdt1            x=300000
15543   15543   usdt_example    usdt1            x=400000
15543   15543   usdt_example    usdt1            x=500000
15543   15543   usdt_example    usdt1            x=600000
15543   15543   usdt_example    usdt1            x=700000
15543   15543   usdt_example    usdt1            x=800000
15543   15543   usdt_example    usdt1            x=900000
15543   15543   usdt_example    usdt1            x=1000000
15543   15543   usdt_example    usdt2            x=1000000, arg2=Hello!, arg3=5


When implementing USDT tracepoints with SystemTap's API, you only need two things: #include <sys/sdt.h> and the DTRACE_PROBEn() macro:

#include <sys/sdt.h>: This is the header file from the systemtap-sdt-devel package and is required to use the DTRACE_PROBEn() macro.

DTRACE_PROBEn(provider_name, probe_name, arg1, arg2, ..., argn): SystemTap's USDT tracepoint implementation macro.

  • DTRACE_PROBEn: the n must match the number of arguments passed to the macro
  • provider_name: lets you create a namespace for your probes
    • Typically named after the application or library, but can be named anything
  • probe_name: the name of this probe
    • Typically named after an "event", but can also be named anything
  • arg1, arg2, ..., argn: arguments to the USDT tracepoint (the values we want to trace)
    • There can be any number of them

trace / trace.py: One of BCC's in-house BPF tools, with all kinds of tracing functionality.

  • See BCC's documentation on this very useful tool

USDT Tracepoints in QEMU (Cont.)

Recall that with our kernel tracepoint implementation, we got observability on the number of received/sent packets and the total length of each request to the RX and TX virtqueues. Also recall that with uprobes and uretprobes, we were only able to trace the total length for the RX virtqueue and the number of sent packets for the TX virtqueue.

Now let's use USDT tracepoints to see both the received/sent packets and the total length of each request for virtio-net's RX and TX virtqueues.

If we take another look at vhost-net's handler functions for its RX and TX virtqueues (handle_rx, handle_tx) and compare them to virtio-net's handler functions (virtio_net_receive_rcu, virtio_net_flush_tx), you'll notice they're similar in structure; but, unlike vhost-net's handlers, virtio-net's TX handler doesn't keep a running total of the size of the request. We'll have to improvise to collect this data.

Looking at virtio-net's RX handler (virtio_net_receive_rcu), we know the size of the request is passed in as a parameter and returned, so that will be easy to get. Looking more closely, we also notice that the local variable size_t i can serve as the number of received packets. We'll add our USDT tracepoint at the end of the function like so:

static ssize_t virtio_net_receive_rcu(NetClientState *nc,
                                      const uint8_t *buf,
                                      size_t size) {
    ...
    while (offset < size) {
        ...
    }
    ...
    DTRACE_PROBE2(qemu, virtio-net-rx-end, i, size);
    return size;
}


Looking at virtio-net's TX handler (virtio_net_flush_tx), we know the number of sent packets is num_packets, which the function returns. However, there's no running total of the length/size as there was in vhost-net's TX handler, so we'll have to improvise here as well.

That is, we'll total up the return values from qemu_sendv_packet_async and report the total once we've finished handling the entire TX request (using a USDT tracepoint together with a uretprobe). We'll add our USDT tracepoint at the end of the for loop like so:

static int32_t virtio_net_flush_tx(VirtIONetQueue *q) {
    ...
    for (;;) {
        ...
        DTRACE_PROBE1(qemu, virtio-net-tx-get-len, ret);
    }
    return num_packets;
}


Once the USDT tracepoints are in, we'll add the header #include <sys/sdt.h> to the top of virtio-net.c and then rebuild QEMU.

Go to your QEMU's build directory and type make:

[root@jonpalme-ol7-bm: qemu]# cd build
[root@jonpalme-ol7-bm: build]# make
make[1]: Entering directory `/mnt/jonpalme-bv1/repos/oracle/qemu/slirp'
make[1]: Nothing to be done for `all'.
make[1]: Leaving directory `/mnt/jonpalme-bv1/repos/oracle/qemu/slirp'
  CC      x86_64-softmmu/hw/net/virtio-net.o
  LINK    x86_64-softmmu/qemu-system-x86_64


Now all we need to do is create the BPF program!

We'll instrument our USDT tracepoint in virtio_net_flush_tx to total up the length of each descriptor popped. Then we'll instrument our uretprobe (also on virtio_net_flush_tx) to retrieve its return value num_packets along with the running total produced by our USDT tracepoint (via BPF_HASH()). Lastly, we'll reset the total by deleting the hash entry and increment our TX histograms.

Examine the code below, taking note of the algorithms in tx_handler and tx_handler_ret to see how we retrieve, store, update, and reset the total length for each TX request. Also take note of how we set up USDT tracepoints in the BPF program:

#!/usr/bin/python

from bcc import BPF, USDT
from time import sleep
from subprocess import check_output

bpf_txt = """
#include <uapi/linux/ptrace.h>

BPF_HISTOGRAM(rx_pkts_hist);
BPF_HISTOGRAM(rx_len_hist);
BPF_HISTOGRAM(tx_pkts_hist);
BPF_HISTOGRAM(tx_len_hist);

BPF_HASH(length_hash, u64, size_t);

int rx_handler(struct pt_regs *ctx) {
    size_t recv_pkts, total_len;
    bpf_usdt_readarg(1, ctx, &recv_pkts);
    bpf_usdt_readarg(2, ctx, &total_len);

    rx_pkts_hist.increment(bpf_log2l(recv_pkts));
    rx_len_hist.increment(bpf_log2l(total_len));
    return 0;
}

int tx_handler(struct pt_regs *ctx) {
    u64 hash_key = 123;  /* arbitrary fixed key; the map holds one running total */
    size_t *len_ptr, len, total_len;
    bpf_usdt_readarg(1, ctx, &len);

    /* Accumulate this descriptor's length; tx_handler_ret reads and
       clears the running total when the flush returns. */
    len_ptr = length_hash.lookup(&hash_key);
    if (len_ptr != NULL) {
        total_len = len + *len_ptr;
        length_hash.delete(&hash_key);
        length_hash.update(&hash_key, &total_len);
    }
    else {
        length_hash.update(&hash_key, &len);
    }
    return 0;
}

int tx_handler_ret(struct pt_regs *ctx) {
    u64 hash_key = 123;
    int32_t sent_pkts;
    size_t *len_ptr, total_len;
    sent_pkts = PT_REGS_RC(ctx);

    len_ptr = length_hash.lookup(&hash_key);
    if (len_ptr != NULL) {
        total_len = *len_ptr;
        length_hash.delete(&hash_key);
    }
    else return 0;

    tx_pkts_hist.increment(bpf_log2l(sent_pkts));
    tx_len_hist.increment(bpf_log2l(total_len));

    return 0;
}
"""

# Load BPF program
dev_qemu_path = \
    "/mnt/jonpalme-bv1/repos/oracle/qemu/build/x86_64-softmmu/qemu-system-x86_64"
qemu_pid = 0
try:
    qemu_pid = check_output(["pidof", "qemu-system-x86_64"])
except Exception:
    print("Error: process qemu-system-x86_64 not found")
    exit()

usdt_ctx = USDT(pid=int(qemu_pid))
usdt_ctx.enable_probe(probe="virtio-net-rx-end",
                      fn_name="rx_handler")
usdt_ctx.enable_probe(probe="virtio-net-tx-get-len",
                      fn_name="tx_handler")
bpf_ctx = BPF(text=bpf_txt, usdt_contexts=[usdt_ctx])
bpf_ctx.attach_uretprobe(name=dev_qemu_path,
                         sym="virtio_net_flush_tx",
                         fn_name="tx_handler_ret")

# Print header
print("Aggregating data from virtio-net RX & TX virtqueues... "
      "Hit Ctrl-C to end.")

# Wait for Ctrl-C
while 1:
    try:
        sleep(1)
    except KeyboardInterrupt:
        print("\n")
        break

# Output
print("Virtio-net RX log2 received packets histogram")
print("---------------------------------------------")
bpf_ctx["rx_pkts_hist"].print_log2_hist("recv pkts")
print("\n")

print("Virtio-net RX log2 total length histogram")
print("-----------------------------------------")
bpf_ctx["rx_len_hist"].print_log2_hist("len")
print("\n")

print("--------------------------------------------------")
print("\n")

print("Virtio-net TX log2 sent packets histogram")
print("-----------------------------------------")
bpf_ctx["tx_pkts_hist"].print_log2_hist("sent pkts")
print("\n")

print("Virtio-net TX log2 total length histogram")
print("-----------------------------------------")
bpf_ctx["tx_len_hist"].print_log2_hist("len")
print("\n")


To generate output, start up a guest with vhost disabled (vhost=off), run the script, and in another window connect to the guest and generate some network traffic. Ctrl-C the script after a few moments. You should get output similar to the output below:

[root@jonpalme-ol7-bm: my_scripts]# ./usdt_probes.py
Aggregating data from virtio-net RX & TX virtqueues... Hit Ctrl-C to end.
^C

Virtio-net RX log2 received packets histogram
---------------------------------------------
     recv pkts           : count     distribution
         0 -> 1          : 38820    |***                                     |
         2 -> 3          : 499564   |****************************************|
         4 -> 7          : 2200     |                                        |
         8 -> 15         : 21       |                                        |

Virtio-net RX log2 total length histogram
-----------------------------------------
     len                 : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 4        |                                        |
        64 -> 127        : 6        |                                        |
       128 -> 255        : 1        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 4        |                                        |
      1024 -> 2047       : 17564    |*                                       |
      2048 -> 4095       : 21277    |*                                       |
      4096 -> 8191       : 69310    |******                                  |
      8192 -> 16383      : 432271   |****************************************|
     16384 -> 32767      : 157      |                                        |
     32768 -> 65535      : 11       |                                        |

--------------------------------------------------

Virtio-net TX log2 sent packets histogram
-----------------------------------------
     sent pkts           : count     distribution
         0 -> 1          : 376922   |****************************************|
         2 -> 3          : 58685    |******                                  |
         4 -> 7          : 455      |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 2        |                                        |

Virtio-net TX log2 total length histogram
-----------------------------------------
     len                 : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 376921   |****************************************|
       128 -> 255        : 58541    |******                                  |
       256 -> 511        : 584      |                                        |
       512 -> 1023       : 16       |                                        |
      1024 -> 2047       : 2        |                                        |


For the output above, as in the previous examples, I ran wget on a large file and stopped the BPF program once the file had completely downloaded.

The results look fairly similar to the data we collected for vhost-net, except for RX's received packets: virtio-net typically receives 2-3 packets per request, whereas vhost-net typically receives 1.

Also note that my dev_qemu_path will be different from yours, so make sure that your path is correct for your machine.

Let's go over the new things we see here:

bpf_usdt_readarg(index, ctx, &addr): BPF's method of reading arguments from USDT tracepoints.

  • index is the argument number and addr is the variable that will receive that argument's value
  • See bpf_usdt_readarg in the BCC reference guide
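
Note that bpf_usdt_readarg() copies the raw argument value, so for a pointer argument (like the string passed to usdt2 earlier) BCC provides bpf_usdt_readarg_p() to copy the pointed-to bytes instead. A sketch, with a hypothetical handler name:

int usdt2_handler(struct pt_regs *ctx) {
    char msg[8] = {};

    /* Copy up to sizeof(msg) bytes of the string behind argument 2 */
    bpf_usdt_readarg_p(2, ctx, &msg, sizeof(msg));
    bpf_trace_printk("msg=%s\n", msg);
    return 0;
}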

usdt_ctx = USDT(pid=int(qemu_pid)): Creates a USDT context object for instrumenting USDT tracepoints.

  • Requires from bcc import BPF, USDT
  • You can either specify the process PID (as in our example) or the path to the application's binary (e.g. USDT(path=dev_qemu_path))
    • When using a PID, make sure to convert it with int(), as USDT will not accept it otherwise
  • See USDT in the BCC reference guide

USDT.enable_probe(probe='...', fn_name='...'): Attaches the BPF C function fn_name to the USDT tracepoint probe.

  • Enabling USDT tracepoints is similar to enabling other tracing technologies, except that you must enable them through a USDT context (usdt_ctx) before creating the BPF object
  • See USDT.enable_probe in the BCC reference guide

BPF(text=bpf_txt, usdt_contexts=[usdt_ctx]): Creates a BPF object with a USDT context.

  • You must pass in your USDT context(s) when creating your BPF object if you are using USDT probes
  • See BPF in the BCC reference guide

References

BPF Performance Tools by Brendan Gregg (O'Reilly e-book)

BCC GitHub repository

LWN.net kernel tracepoints guide (the TRACE_EVENT() macro)

LWN.net: using userspace tracepoints with BPF (the DTRACE_PROBEn() macro)

Other LWN.net tracing / BPF / BCC articles

Brendan Gregg's blog / website

Other resources

Jonah Palmer

