Introduction

eBPF is a powerful technology that was introduced in the Linux kernel in version 3.18. It allows developers to write custom code that is loaded safely and dynamically into the kernel, either to change kernel behavior or to trace system events such as function calls for performance analysis.

In this blog we discuss an example architecture for implementing an I/O filter BPF program for block devices using libbpf and the eBPF tracing framework built into the Linux kernel. The main purpose is to show that libbpf makes it simple to implement a BPF program. libbpf is a C-based library containing a BPF loader that takes compiled BPF object files, prepares them and loads them into the Linux kernel. We will discuss the usage of libbpf interfaces and kernel interfaces for this use case. It is assumed that you are familiar with eBPF technology and its Linux kernel support.

Why libbpf?

libbpf enables “CO-RE” (Compile Once – Run Everywhere), which allows a BPF program to be compiled once and to work with multiple Linux kernels that support the required eBPF framework. libbpf defines macros such as BPF_CORE_READ() that mark field accesses so that libbpf can relocate them to the actual field offsets of the running kernel. This is achieved by a combination of BPF program compilation, libbpf and the kernel. So, there is no need to recompile the BPF program for different kernels or to include vmlinux.h (a header containing all kernel data types, generated by bpftool from the running kernel). Clang emits the BPF CO-RE relocations that allow libbpf to adapt your BPF code to the host kernel.
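
As a rough userspace analogy (this is not libbpf code), the sketch below shows why such field accesses need to be relocatable: the same field sits at a different offset in two layouts, and a reader that takes the offset as a parameter works with both. The struct names bio_v1 and bio_v2 and the helper read_u32_at() are made up for illustration. Real CO-RE relocations are resolved by libbpf from type information when the program is loaded.

```c
#include <stddef.h>
#include <string.h>

/* Two hypothetical layouts of the "same" structure in different kernels. */
struct bio_v1 {
    unsigned int bi_opf;
    unsigned short bi_ioprio;
};

struct bio_v2 {
    void *bi_next;              /* an extra field shifts what follows */
    unsigned int bi_opf;
    unsigned short bi_ioprio;
};

/* Read an unsigned int field at an offset that is only known at "load time". */
static unsigned int read_u32_at(const void *base, size_t offset)
{
    unsigned int value;

    memcpy(&value, (const char *)base + offset, sizeof(value));
    return value;
}
```

Calling read_u32_at() with offsetof(struct bio_v1, bi_opf) or offsetof(struct bio_v2, bi_opf) returns the field from either layout without recompiling the reader.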

Block I/O filter use case example

The simple use case for an I/O filter (iofilter) BPF program is to alter the behavior of I/O on block devices to either allow or fail specific I/O operations (READ/WRITE) depending on some conditions defined by the filter program. The model used for this use case is to create a BPF map for block devices of interest with information for applying the desired filter condition when the BPF program is invoked.

Usage of BPF map

The userspace program that loads the BPF program communicates with it through BPF maps, which provide various data types such as hash maps, arrays and event-based structures. Both the userspace program and the kernel space BPF program can access and update data in the map for processing, analysis or whatever else the specific use case requires. The userspace program that loads the BPF program also creates the required maps.

libbpf provides helper functions to manipulate the BPF map elements, for example:

int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags);
int bpf_map_lookup_elem(int fd, const void *key, void *value);
int bpf_map_lookup_and_delete_elem(int fd, const void *key, void *value);
int bpf_map_lookup_and_delete_elem_flags(int fd, const void *key, void *value, __u64 flags);
int bpf_map_delete_elem(int fd, const void *key);

The bpf_map_type enum in bpf.h defines all the supported map types (BPF_MAP_TYPE_HASH, BPF_MAP_TYPE_ARRAY, etc.).

For the I/O filter use case example, we use the BPF_MAP_TYPE_HASH type, which provides general purpose hash map storage. Both the key and the value can be structs, allowing composite keys and values. For the example use case, here are the types used:

/* I/O filter map key type */
struct iofilter_dev {
    unsigned int major;
    unsigned int minor;
};

/*
 * Map value type used for this use case is struct iofilter{},
 * which is the map element to store filter info for the disk.
 */
struct sector_range {
        unsigned long sector_start;
        unsigned long sector_end;
        unsigned int opf;               /* which operations filtered? */
        union {
            unsigned int pminor;        /* parent minor number */
            unsigned int cminor;        /* child minor number */
        };
};

struct iofilter {
        struct sector_range sector_ranges[MAX_SECTOR_RANGES];
        unsigned long disk_start_sector; /* offset w.r.t whole disk */
        unsigned int num_sector_ranges; /* # of ranges in sector_ranges[] used */
};

In the above example data types, a block device may be a partition on a disk, so the filtered block range must also be applied to I/O operations issued against the whole disk. We therefore also need an element in the map for the whole disk. The pminor field is used to manage the whole disk (parent) map element and the cminor field to manage the partition map element.
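
To make the partition versus whole disk relationship concrete, here is a minimal sketch of translating a partition-relative sector range into whole-disk sectors. It assumes that disk_start_sector holds the partition's first sector on the parent disk, as the "offset w.r.t whole disk" comment suggests. The struct range type and the to_disk_range() helper are hypothetical names, not part of the iofilter sources.

```c
/* A sector range relative to some block device; end is inclusive. */
struct range {
    unsigned long start;
    unsigned long end;
};

/*
 * Translate a partition-relative range into whole-disk sectors.
 * disk_start_sector is assumed to be the partition's first sector
 * on the parent disk.
 */
static struct range to_disk_range(struct range part,
                                  unsigned long disk_start_sector)
{
    struct range disk = {
        .start = part.start + disk_start_sector,
        .end = part.end + disk_start_sector,
    };

    return disk;
}
```

With a partition that starts at sector 2048, the partition-relative range 0 to 99 becomes 2048 to 2147 on the whole disk.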

Kernel Interfaces Used

eBPF programs are event driven and execute when the kernel passes a specific hook point. For this use case, the iofilter BPF program is attached to the should_fail_bio() kernel function, which was introduced in Linux kernel 4.18 (commit 30abb3a67f4b). It is part of the BPF error injection mechanism, which is available when the Linux kernel configuration parameter CONFIG_BPF_KPROBE_OVERRIDE is enabled.

For this use case, the goal is to have the attached BPF program result in should_fail_bio() returning true for cases where the I/O operation should fail. This can be achieved in either of the following ways:

  1. Kprobes

    Kprobes, short for Kernel Probes, is a feature of the Linux kernel that allows dynamic tracing and debugging of kernel functions. BPF_PROG_TYPE_KPROBE is the program type used to attach the eBPF program, with the ELF section name ‘kprobe’.

    A kprobe type program calls bpf_override_return(), which makes the probed kernel function return the given value without executing the rest of its body.

    SEC("kprobe/should_fail_bio")
    int BPF_KPROBE(iofilter_kprobe, struct bio *bio)

    SEC("kprobe/should_fail_bio") indicates that the eBPF program is associated with the kprobe event for the should_fail_bio() kernel function.

  2. fmod_ret tracing interface

    BPF_PROG_TYPE_TRACING is the program type, BPF_MODIFY_RETURN is the attachment type and fmod_ret is the ELF section name. BPF_PROG_TYPE_TRACING was introduced in Linux kernel 5.5 and offers better performance than raw tracepoints; the BPF_MODIFY_RETURN attachment type followed in 5.7.

    fmod_ret type programs are similar to kprobe programs in that they modify the return value of the kernel function. This method is preferred over kprobe due to its lower overhead, but older systems may need to fall back to the kprobe method.

    SEC("fmod_ret/should_fail_bio")
    int BPF_PROG(iofilter_fmod_ret, struct bio *bio)

For the iofilter program use case, we try the fmod_ret attachment interface first and fall back to kprobe if it fails. This covers kernels that do not support the fmod_ret tracing interface.
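
The try-then-fall-back order can be sketched independently of libbpf. In the minimal sketch below, the attach_fn callback type, the attach_mode enum and the two stub functions are hypothetical stand-ins for the real load and attach steps shown later in this post.

```c
#include <stddef.h>

/* Hypothetical callback type: returns 0 on success, negative on failure. */
typedef int (*attach_fn)(void *ctx);

enum attach_mode { ATTACH_NONE = 0, ATTACH_FMOD_RET, ATTACH_KPROBE };

/* Stand-ins for real attach attempts, used only for demonstration. */
static int stub_attach_ok(void *ctx)   { (void)ctx; return 0; }
static int stub_attach_fail(void *ctx) { (void)ctx; return -1; }

/* Try fmod_ret first; fall back to kprobe only if that fails. */
static enum attach_mode attach_with_fallback(attach_fn try_fmod_ret,
                                             attach_fn try_kprobe,
                                             void *ctx)
{
    if (try_fmod_ret && try_fmod_ret(ctx) == 0)
        return ATTACH_FMOD_RET;
    if (try_kprobe && try_kprobe(ctx) == 0)
        return ATTACH_KPROBE;
    return ATTACH_NONE;
}
```

Returning which mode succeeded lets the caller record which link to pin and report the chosen interface to the user.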

Program Implementation Architecture

The I/O filter program has two components, a userspace program and a kernel bpf program.

  • Userspace program will:
    • Load the bpf program and attach it to the kernel function should_fail_bio()
    • Set up the map object for keeping information about block devices that are of interest. The device major:minor pair is used as the key for the map elements.
    • Pin the bpf program and the map.
    • Run again at any later time to open the pinned map and update (add, modify or delete) the information.
  • Kernel space bpf program when triggered:
    • Searches the map for the block device element and applies the filter conditions if the element exists.
    • Returns 0 or -EIO depending on the filter condition check.
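
As an illustration of the major:minor key, the following minimal sketch builds the map key in userspace from a dev_t, for example the st_rdev value that stat() reports for a block device node. The helper name iofilter_key_from_rdev() is an assumption; major() and minor() are the glibc macros from <sys/sysmacros.h>.

```c
#include <sys/types.h>
#include <sys/sysmacros.h>

/* Map key type, as defined in iofilter.h. */
struct iofilter_dev {
    unsigned int major;
    unsigned int minor;
};

/* Build the map key from a userspace dev_t (e.g. st_rdev from stat()). */
static struct iofilter_dev iofilter_key_from_rdev(dev_t rdev)
{
    struct iofilter_dev key = {
        .major = major(rdev),
        .minor = minor(rdev),
    };

    return key;
}
```

Storing the decoded numbers rather than the raw dev_t keeps the key independent of how userspace and the kernel each encode device numbers.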

Program files organization (file names as examples)

  • iofilter.h

    Contains common data type definitions used by both the userspace program and the bpf program in managing iofilter map data. Here are example data types used for this use case.

    Map key type for identifying the block device.

    struct iofilter_dev {
            unsigned int major;
            unsigned int minor;
    };

    Data types for the iofilter data stored in the BPF map element associated with the block device. In this example, multiple disk ranges on the block device may be specified for filtering I/O. It could be the whole block range of the device.

    #define MAX_SECTOR_RANGES       10
    
    /* operations filtered */
    #define FILTER_READ     0x1
    #define FILTER_WRITE    0x2
    
    struct sector_range {
            unsigned long sector_start;
            unsigned long sector_end;
            unsigned int opf;               /* which operations filtered? */
            union {
                    unsigned int pminor;        /* parent minor number */
                    unsigned int cminor;        /* child minor number */
            };
    };

    Data type for BPF map element.

    /* iofilter map element type used to store device details */
    struct iofilter {
            struct sector_range sector_ranges[MAX_SECTOR_RANGES];
            unsigned long disk_start_sector; /* offset w.r.t whole disk */
            unsigned int num_sector_ranges;
    };
  • iofilter.skel.h

    This is generated by the bpftool program and contains bpf code for iofilter.bpf.c and helper functions to load and manage the iofilter bpf program. This header file is included in iofilter.c. For example, the following commands generate this file:

    $ clang -g -D__TARGET_ARCH_x86 -O2 -target bpf -I. -I/usr/include/bpf \
        -c iofilter.bpf.c -o iofilter.bpf.o && llvm-strip -g iofilter.bpf.o
    $ bpftool gen skeleton iofilter.bpf.o > iofilter.skel.h  

    For example, the helper functions and data types generated into this file include:

    struct iofilter_bpf {...}
    static void iofilter_bpf__destroy()
    static int iofilter_bpf__create_skeleton()
    static struct iofilter_bpf *iofilter_bpf__open_opts()
    static struct iofilter_bpf *iofilter_bpf__open()
    static int iofilter_bpf__load()
    static struct iofilter_bpf *iofilter_bpf__open_and_load()
    static int iofilter_bpf__attach()
    static void iofilter_bpf__detach()
  • iofilter.c

    The userspace program for loading/pinning the bpf program and map data object. It is also used to update or maintain (add/update/delete) the iofilter map elements. libbpf functions and helper functions defined in iofilter.skel.h are used to manage the iofilter bpf program and map data.

  • iofilter.bpf.c

    This provides the BPF program code that gets attached to the kernel function should_fail_bio(). It applies the filter condition to the I/O operation if the device is in the bpf map data and returns zero if the operation can proceed or -EIO if the operation is not allowed. It effectively alters the return value of the should_fail_bio() function.

    Any kernel data type definitions needed by the bpf program are added or copied into this file, so there is no need to generate and include vmlinux.h. libbpf resolves the correct offsets for the data fields from the data types of the running kernel. Here are the example data types that are used:

    struct pt_regs {...};   /* needed by BPF_KPROBE() */
    enum req_opf {
            /* read sectors from the device */
            REQ_OP_READ             = 0,
            /* write sectors to the device */
            REQ_OP_WRITE            = 1,
    };
    typedef __u32 dev_t;
    typedef __u64 sector_t;
    struct block_device {
            dev_t bd_dev;
    };
    
    struct bvec_iter {
            sector_t bi_sector;
            unsigned int bi_size;
    };
    struct bio {
            unsigned int bi_opf;
            struct block_device *bi_bdev;
            unsigned short bi_ioprio;
            struct bvec_iter bi_iter;
    };

    For the block I/O filter case, the bpf program sections are defined as follows:

    /* apply iofilter condition on this I/O operation */
    static __always_inline int iofilter_apply(struct bio *bio)
    {
        ...
        /*
         * This looks up the map element for this block device
         * and if present it applies the filter conditions.
         */
        ...
    }

    SEC("fmod_ret/should_fail_bio")
    int BPF_PROG(iofilter_fmod_ret, struct bio *bio)
    {
            return iofilter_apply(bio);
    }

    SEC("kprobe/should_fail_bio")
    int BPF_KPROBE(iofilter_kprobe, struct bio *bio)
    {
            int ret = iofilter_apply(bio);

            if (ret)
                    bpf_override_return(ctx, ret);

            return 0;
    }

    iofilter map object section:

    struct {
            __uint(type, BPF_MAP_TYPE_HASH);
            __uint(max_entries, MAX_IOFILTER_DEV);
            __type(key, struct iofilter_dev);
            __type(value, struct iofilter);
    } iofilter_map SEC(".maps");

Building the iofilter program

Tools required: clang, llvm-strip, bpftool, gcc compiler.

Steps to build the program:

  • Compile the bpf C program with clang to generate the .o file.

    $ clang -g -D__TARGET_ARCH_x86 -O2 -target bpf \
            -I. -I/usr/include/bpf -c iofilter.bpf.c -o iofilter.bpf.o && \
        llvm-strip -g iofilter.bpf.o
  • Generate the skeleton header file that embeds BPF code and helper functions to manage the bpf program.

    $ bpftool gen skeleton iofilter.bpf.o > iofilter.skel.h
  • Compile the userspace program that includes the skeleton header file.

    $ gcc -g -Wall -I. -I/usr/include/bpf -c iofilter.c -o iofilter.o
    $ gcc -g -Wall iofilter.o -lbpf -o iofilter

Implementation code snippets using libbpf

Userspace program steps and example code snippets

Check whether the pinned iofilter map already exists.

    int map_fd = -1;
    /* path where the map is pinned (see the pinning step below) */
    char map_path[PATH_MAX] = "/sys/fs/bpf/iofilter/map";

    map_fd = bpf_obj_get(map_path);
    if (map_fd >= 0) {
        /* map exists */
    }

Create the skeleton for the bpf program:

    struct iofilter_bpf *skel = NULL;

    skel = iofilter_bpf__open();

If needed, adjust the map size from the default value used at compile time. Note that the size of the map cannot be changed once the map is created.

    ret = bpf_map__set_max_entries(skel->maps.iofilter_map, maxdev);

Set the filter type for the operations that are filtered.

    skel->rodata->opf = FILTER_WRITE;

Load and attach the bpf program.

    /*
     * To use fmod_ret interface set autoload option to the
     * iofilter_fmod_ret BPF program to true.
     */
    fmod_ret_prog = skel->progs.iofilter_fmod_ret;
    kprobe_prog = skel->progs.iofilter_kprobe;
    bpf_program__set_autoload(fmod_ret_prog, true);
    bpf_program__set_autoload(kprobe_prog, false);
    /* Load the BPF program */
    ret = iofilter_bpf__load(skel);
    /* Attach the BPF program */
    err = iofilter_bpf__attach(skel);

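If loading or attaching fmod_ret_prog fails, we can use the kprobe_prog interface. The skeleton has to be reopened after it is destroyed.

    iofilter_bpf__destroy(skel);
    /* reopen: the old skel and its program pointers are no longer valid */
    skel = iofilter_bpf__open();
    fmod_ret_prog = skel->progs.iofilter_fmod_ret;
    kprobe_prog = skel->progs.iofilter_kprobe;
    bpf_program__set_autoload(fmod_ret_prog, false);
    bpf_program__set_autoload(kprobe_prog, true);
    /* Load the BPF program */
    ret = iofilter_bpf__load(skel);
    /* Attach the BPF program */
    err = iofilter_bpf__attach(skel);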

Pin the attached bpf program and the map object under the required path in the BPF filesystem.

    char pin_path[PATH_MAX] = "/sys/fs/bpf/iofilter";
    char path[PATH_MAX];

    snprintf(path, sizeof(path), "%s/prog", pin_path);
    /*
     * Pin the BPF program through its link; link is the attached link,
     * e.g. skel->links.iofilter_fmod_ret or skel->links.iofilter_kprobe.
     */
    err = bpf_link__pin(link, path);
    /* pin the BPF map */
    snprintf(path, sizeof(path), "%s/map", pin_path);
    err = bpf_map__pin(skel->maps.iofilter_map, path);

Kernel BPF program steps and example code snippets

In the example use case, iofilter_apply() is the function that gets called when triggered by any I/O operation on block devices.

Check if the operation is filtered for this device. For example, when the filter applies only to WRITE operations:

    enum req_opf opf;
    opf = BPF_CORE_READ(bio, bi_opf) & REQ_OP_MASK;
    if (opf != REQ_OP_WRITE) {
        /* not an op we're interested in */
        return 0;
    }
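
The same check can be generalized to the FILTER_READ and FILTER_WRITE flags from iofilter.h. The sketch below is plain C rather than BPF code. The REQ_OP_* values mirror the kernel's blk_types.h as found in recent kernels, and the helper name op_is_filtered() is an assumption made for illustration.

```c
/* Values mirroring the kernel's blk_types.h (assumed, recent kernels). */
#define REQ_OP_BITS   8
#define REQ_OP_MASK   ((1U << REQ_OP_BITS) - 1)
#define REQ_OP_READ   0U
#define REQ_OP_WRITE  1U

/* Filter flags from iofilter.h. */
#define FILTER_READ   0x1
#define FILTER_WRITE  0x2

/* Return 1 if the operation encoded in bi_opf is selected by filter_mask. */
static int op_is_filtered(unsigned int bi_opf, unsigned int filter_mask)
{
    unsigned int op = bi_opf & REQ_OP_MASK;  /* strip request flag bits */

    if (op == REQ_OP_READ)
        return (filter_mask & FILTER_READ) != 0;
    if (op == REQ_OP_WRITE)
        return (filter_mask & FILTER_WRITE) != 0;
    return 0;  /* other operations (flush, discard, ...) are not filtered */
}
```

Masking with REQ_OP_MASK matters because bi_opf also carries request flags in its upper bits, so comparing the raw value against REQ_OP_WRITE would miss flagged writes.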

Now, look up the map element for this block device to see if it is monitored.

    struct iofilter_dev dev_key = {};
    struct iofilter *iofilter = NULL;
    dev_t devt;

    devt = BPF_CORE_READ(bio, bi_bdev, bd_dev);
    dev_key.major = MAJOR(devt);
    dev_key.minor = MINOR(devt);

    iofilter = bpf_map_lookup_elem(&iofilter_map, &dev_key);
    if (!iofilter) {
        /* device is not monitored; allow the operation */
        return 0;
    }
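
The MAJOR() and MINOR() macros are not provided by libbpf, so the BPF program has to define them itself. A minimal sketch follows, assuming the kernel's internal dev_t encoding from include/linux/kdev_t.h (20 minor bits). The helper key_from_kernel_devt() is a hypothetical name used only to show the decoding.

```c
/* Kernel-internal dev_t encoding (assumed from include/linux/kdev_t.h). */
#define MINORBITS       20
#define MINORMASK       ((1U << MINORBITS) - 1)
#define MAJOR(dev)      ((unsigned int)((dev) >> MINORBITS))
#define MINOR(dev)      ((unsigned int)((dev) & MINORMASK))
#define MKDEV(ma, mi)   (((ma) << MINORBITS) | (mi))

/* Map key type, as defined in iofilter.h. */
struct iofilter_dev {
    unsigned int major;
    unsigned int minor;
};

/* Decode a kernel dev_t (such as bd_dev) into the map key. */
static struct iofilter_dev key_from_kernel_devt(unsigned int devt)
{
    struct iofilter_dev key = {
        .major = MAJOR(devt),
        .minor = MINOR(devt),
    };

    return key;
}
```

This encoding differs from the dev_t layout that glibc uses in userspace, which is one reason the map key stores the decoded major and minor numbers rather than a raw dev_t.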

Get the information on the I/O size and starting block sector number.

    __u64 start_sector, end_sector;
    unsigned int io_size;

    io_size = BPF_CORE_READ(bio, bi_iter.bi_size);
    start_sector = BPF_CORE_READ(bio, bi_iter.bi_sector);
    /* bi_size is in bytes; shift by 9 to convert to 512-byte sectors */
    end_sector = start_sector + (io_size >> 9) - 1;
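
The sector arithmetic can be checked in isolation. The sketch below is plain C, and bio_last_sector() is a hypothetical helper name. Treating a zero-length I/O as touching only its start sector is an assumption added here to avoid the underflow that "start + 0 - 1" would produce. It is not something the original snippet handles.

```c
#define SECTOR_SHIFT 9  /* 512-byte sectors */

/* Return the last sector (inclusive) touched by an I/O of size_bytes. */
static unsigned long long bio_last_sector(unsigned long long start_sector,
                                          unsigned int size_bytes)
{
    unsigned long long nr_sectors = size_bytes >> SECTOR_SHIFT;

    if (nr_sectors == 0)
        return start_sector;  /* assumption: zero-length I/O, start only */
    return start_sector + nr_sectors - 1;
}
```

For example, a 4096-byte I/O starting at sector 100 covers 8 sectors, so its last sector is 107.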

Now, check if the I/O overlaps a sector range that is being filtered.

    for (i = 0;
         i < iofilter->num_sector_ranges && i < MAX_SECTOR_RANGES; i++) {
        range_start = iofilter->sector_ranges[i].sector_start;
        range_end = iofilter->sector_ranges[i].sector_end;
        /* the I/O overlaps the range, including when it spans all of it */
        if (start_sector <= range_end && end_sector >= range_start) {
            /*
             * The operation falls into a sector range where we want to
             * apply the desired filter condition. Based on the check,
             * return -EIO if the operation should not be allowed.
             */
        }
    }
    /* allow the operation to proceed */
    return 0;
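
A complete overlap test also has to catch an I/O that starts before a filtered range and ends after it. The following minimal sketch, in plain C rather than BPF code, shows one way to write the check. The helpers ranges_overlap() and io_hits_filtered_range() are hypothetical names, and the struct is reduced to the two fields the check needs.

```c
#define MAX_SECTOR_RANGES 10

/* Reduced version of struct sector_range: only the bounds (inclusive). */
struct sector_range {
    unsigned long sector_start;
    unsigned long sector_end;
};

/* Two inclusive ranges overlap if each starts no later than the other ends. */
static int ranges_overlap(unsigned long a_start, unsigned long a_end,
                          unsigned long b_start, unsigned long b_end)
{
    return a_start <= b_end && b_start <= a_end;
}

/* Return 1 if the I/O [start, end] overlaps any of the nr filtered ranges. */
static int io_hits_filtered_range(const struct sector_range *ranges,
                                  unsigned int nr,
                                  unsigned long start, unsigned long end)
{
    unsigned int i;

    for (i = 0; i < nr && i < MAX_SECTOR_RANGES; i++) {
        if (ranges_overlap(start, end,
                           ranges[i].sector_start, ranges[i].sector_end))
            return 1;
    }
    return 0;
}
```

The explicit bound on MAX_SECTOR_RANGES mirrors the loop in the BPF program, where a constant upper bound helps the verifier prove that the loop terminates.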

Conclusion

The example use case shows the simplicity and flexibility of writing eBPF programs with libbpf interfaces that work across different kernels. When the iofilter program is loaded and active, there is a small overhead on every block I/O operation for checking whether the device is in the list of devices of interest.

References

  1. https://ebpf.io/what-is-ebpf/#maps
  2. https://nakryiko.com/posts/bpf-core-reference-guide/
  3. https://blog.aquasec.com/vmlinux.h-ebpf-programs
  4. https://liuhangbin.netlify.app/post/bpf-skeleton/
  5. https://nakryiko.com/posts/bcc-to-libbpf-howto-guide/#bpf-skeleton-and-bpf-app-lifecycle
  6. https://ebpf.io/what-is-ebpf/
  7. https://blogs.oracle.com/linux/post/bpf-in-depth-communicating-with-userspace