In this blog entry, we will discuss the concept of split BPF Type Format information (BTF) and why it is valuable. Next we will show the problems that a split BTF model causes for kernel modules that are not built every time the kernel is. Finally we introduce the solution the BPF community arrived at to solve this problem. Happily this solution is now available in upstream kernels and UEK8, the Oracle Linux kernel based on the upstream stable 6.12 kernel.
We have discussed many aspects of BPF previously here; see the references below for more info.
First however, we will describe BTF itself.
A BTF primer
BTF is the BPF Type Format, a compact representation of type and function information associated with an object, where the object can be a kernel, a kernel module or a BPF program. The compact aspect is critical: unlike DWARF, where debug information covers types, functions, tracking of local variable values on the stack and in registers, inline functions and so forth, the content of BTF is intentionally quite minimal.
For the Linux kernel and its modules, BTF is generated during the kernel build process, using pahole to generate BTF from DWARF debug information. We are also working on having the kernel build use the BTF that gcc can emit directly. For now, to enable BTF generation for your kernel build, ensure you have a copy of pahole available, and specify
CONFIG_DEBUG_INFO_BTF=y
CONFIG_DEBUG_INFO_BTF_MODULES=y
Once the BTF is generated it is stored in a “.BTF” ELF section in the vmlinux kernel image and in each module. When you boot into the kernel, the data in those .BTF sections is made available by the kernel under /sys/kernel/btf ; so /sys/kernel/btf/vmlinux contains the kernel BTF and /sys/kernel/btf/my_module contains the module BTF. Running
$ bpftool btf dump file vmlinux
will allow us to see that BTF, and if we want to generate a header for our BPF programs to use that includes all kernel types we can run
$ bpftool btf dump file vmlinux format c > vmlinux.h
Many BPF-related makefiles do this as part of the BPF program build process.
BTF is critical for many aspects of BPF – kfuncs, fentry/fexit programs, even aspects of XDP now use BTF-based descriptions!
So we have seen that BTF is generated during the kernel build process; BTF is also generated for BPF programs themselves by the compiler during the BPF program compilation process. Next we will examine the actual format of this BTF information.
BPF programs and the kernel and modules contain BTF information in a .BTF ELF section which consists of a header:
struct btf_header {
    __u16 magic;
    __u8 version;
    __u8 flags;
    __u32 hdr_len;
    /* All offsets are in bytes relative to the end of this header */
    __u32 type_off; /* offset of type section */
    __u32 type_len; /* length of type section */
    __u32 str_off; /* offset of string section */
    __u32 str_len; /* length of string section */
};
The header is followed by the actual BTF data.
The header gives the offset of the type information (type_off), relative to the end of the header; together with type_len, that tells us which bytes to parse as types.
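As a rough illustration of that arithmetic, the sketch below locates the type and string sections inside a raw .BTF blob. It assumes native-endian data and uses a local mirror of struct btf_header; the names btf_hdr, btf_layout and btf_locate_sections are hypothetical helpers invented for this example, not part of libbpf or the kernel.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BTF_MAGIC 0xeB9F

/* Local mirror of struct btf_header so the sketch builds without kernel headers. */
struct btf_hdr {
    uint16_t magic;
    uint8_t  version;
    uint8_t  flags;
    uint32_t hdr_len;
    uint32_t type_off;
    uint32_t type_len;
    uint32_t str_off;
    uint32_t str_len;
};

/* Hypothetical helper type: byte ranges of the two sections in the blob. */
struct btf_layout {
    size_t type_start, type_end;
    size_t str_start, str_end;
};

/* Returns 0 on success, -1 if the data does not look like native-endian BTF. */
static int btf_locate_sections(const void *data, size_t len, struct btf_layout *out)
{
    struct btf_hdr hdr;

    if (len < sizeof(hdr))
        return -1;
    memcpy(&hdr, data, sizeof(hdr));    /* copy to avoid unaligned access */
    if (hdr.magic != BTF_MAGIC || hdr.hdr_len < sizeof(hdr) || hdr.hdr_len > len)
        return -1;

    /* Offsets are relative to the end of the header, so add hdr_len. */
    out->type_start = (size_t)hdr.hdr_len + hdr.type_off;
    out->type_end   = out->type_start + hdr.type_len;
    out->str_start  = (size_t)hdr.hdr_len + hdr.str_off;
    out->str_end    = out->str_start + hdr.str_len;
    if (out->type_end > len || out->str_end > len)
        return -1;
    return 0;
}
```

Using hdr_len rather than sizeof(struct btf_header) as the base means the calculation keeps working if a newer header carries extra fields.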
Names of types, members, functions etc are stored in the string section at str_off; this allows a type to just use a string section offset to reference its name. The strings section is simply a set of ASCII, null-terminated strings of the form
\0foo\0bar\0baz\0...
So the empty string is at offset 0, “foo” at offset 1, “bar” at offset 5 and so on.
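Looking a name up is then just pointer arithmetic plus a bounds check. The function below is a minimal sketch; btf_str_at is a hypothetical name chosen for this example, not a libbpf call.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Return the string that starts at name_off in a BTF string section, or NULL
 * if the offset is out of range or no terminating null byte is found before
 * the end of the section.
 */
static const char *btf_str_at(const char *strs, size_t str_len, uint32_t name_off)
{
    if (name_off >= str_len)
        return NULL;
    if (memchr(strs + name_off, '\0', str_len - name_off) == NULL)
        return NULL;
    return strs + name_off;
}
```

With the example string section above, offset 5 yields "bar" and offset 0 yields the empty string used for anonymous types.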
Each entity in the type section is represented by a struct btf_type:
struct btf_type {
    __u32 name_off;
    /* "info" bits arrangement
     * bits  0-15: vlen (e.g. # of struct's members)
     * bits 16-23: unused
     * bits 24-28: kind (e.g. int, ptr, array...etc)
     * bits 29-30: unused
     * bit     31: kind_flag, currently used by
     *             struct, union and fwd
     */
    __u32 info;
    /* "size" is used by INT, ENUM, STRUCT, UNION and DATASEC.
     * "size" tells the size of the type it is describing.
     *
     * "type" is used by PTR, TYPEDEF, VOLATILE, CONST, RESTRICT,
     * FUNC, FUNC_PROTO and VAR.
     * "type" is a type_id referring to another type.
     */
    union {
        __u32 size;
        __u32 type;
    };
};
BTF information uses a number of different kinds, for example in /usr/include/linux/btf.h you will see
#define BTF_KIND_UNKN 0 /* Unknown */
#define BTF_KIND_INT 1 /* Integer */
#define BTF_KIND_PTR 2 /* Pointer */
#define BTF_KIND_ARRAY 3 /* Array */
#define BTF_KIND_STRUCT 4 /* Struct */
#define BTF_KIND_UNION 5 /* Union */
#define BTF_KIND_ENUM 6 /* Enumeration */
#define BTF_KIND_FWD 7 /* Forward */
#define BTF_KIND_TYPEDEF 8 /* Typedef */
#define BTF_KIND_VOLATILE 9 /* Volatile */
#define BTF_KIND_CONST 10 /* Const */
#define BTF_KIND_RESTRICT 11 /* Restrict */
#define BTF_KIND_FUNC 12 /* Function */
#define BTF_KIND_FUNC_PROTO 13 /* Function Proto */
...
So as we parse each struct btf_type in the type section in BTF, we use the kind in bits 24-28 of info to discover how to interpret the current type and to find where the next struct btf_type starts.
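The bit layout of info can be unpacked with three small helpers. This is a sketch with hypothetical function names; the 0x1f kind mask follows the BTF_INFO_KIND() macro in recent linux/btf.h headers, where the kind field is five bits wide so that kinds numbered 16 and above fit.

```c
#include <stdint.h>

/* Number of members, parameters or enum values (bits 0-15). */
static inline uint32_t btf_info_vlen(uint32_t info)
{
    return info & 0xffff;
}

/* BTF_KIND_* value; mask taken from BTF_INFO_KIND() in recent headers. */
static inline uint32_t btf_info_kind(uint32_t info)
{
    return (info >> 24) & 0x1f;
}

/* kind_flag (bit 31). */
static inline uint32_t btf_info_kflag(uint32_t info)
{
    return info >> 31;
}
```

For instance, an info value of 0x04000007 decodes to kind 4 (BTF_KIND_STRUCT) with a vlen of 7 and kind_flag clear.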
Some kinds are base types – int, enum, float for example.
Some kinds are “reference kinds” and refer in turn to other types, for example BTF_KIND_PTR specifies a target type. Such reference kinds refer to other kinds using type identifiers. Each type is assigned a type id based on its order in the type section; so the first type has type id 1, second type id 2, and so on. We start at 1 because type id 0 – void – is implicit; it is not described in the type section and has no associated struct btf_type there. So a reference kind for void * would be a BTF_KIND_PTR with a type reference of 0.
Split BTF does not start at type id 1 however; we will see this below.
Some kinds need more than struct btf_type can hold, so they are followed by additional information such as
- function parameter info in the case of BTF_KIND_FUNC_PROTO;
- enumerated value/name mappings in the case of BTF_KIND_ENUM;
- struct member information in the case of BTF_KIND_STRUCT.
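Because the trailing data has a different size for each kind, a parser has to compute it before it can step to the next struct btf_type. The sketch below does that and counts the records in a type section; since ids are assigned in order, the Nth record visited has type id N. The record sizes are those of the corresponding uapi structs (struct btf_member, struct btf_enum and so on); the helper names are hypothetical, and native-endian data is assumed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct btf_type: name_off, info, then the size/type union. */
struct btf_type_rec {
    uint32_t name_off;
    uint32_t info;
    uint32_t size_or_type;
};

/* Kind numbers from linux/btf.h (only kinds that carry trailing data). */
enum {
    KIND_INT = 1, KIND_ARRAY = 3, KIND_STRUCT = 4, KIND_UNION = 5,
    KIND_ENUM = 6, KIND_FUNC_PROTO = 13, KIND_VAR = 14,
    KIND_DATASEC = 15, KIND_DECL_TAG = 17, KIND_ENUM64 = 19,
};

/* Bytes of kind-specific data that follow a struct btf_type with this info. */
static size_t btf_trailing_size(uint32_t info)
{
    uint32_t kind = (info >> 24) & 0x1f;
    size_t vlen = info & 0xffff;

    switch (kind) {
    case KIND_INT:        return 4;          /* one __u32 of encoding bits */
    case KIND_ARRAY:      return 12;         /* struct btf_array */
    case KIND_STRUCT:
    case KIND_UNION:      return vlen * 12;  /* struct btf_member each */
    case KIND_ENUM:       return vlen * 8;   /* struct btf_enum each */
    case KIND_FUNC_PROTO: return vlen * 8;   /* struct btf_param each */
    case KIND_VAR:        return 4;          /* struct btf_var */
    case KIND_DATASEC:    return vlen * 12;  /* struct btf_var_secinfo each */
    case KIND_DECL_TAG:   return 4;          /* struct btf_decl_tag */
    case KIND_ENUM64:     return vlen * 12;  /* struct btf_enum64 each */
    default:              return 0;          /* PTR, TYPEDEF, FUNC, ... */
    }
}

/* Walk a type section; returns the number of types, or -1 if it is truncated. */
static int btf_count_types(const uint8_t *types, size_t len)
{
    size_t off = 0;
    int nr = 0;

    while (off < len) {
        struct btf_type_rec t;
        size_t extra;

        if (len - off < sizeof(t))
            return -1;
        memcpy(&t, types + off, sizeof(t));
        extra = btf_trailing_size(t.info);
        if (len - off - sizeof(t) < extra)
            return -1;
        off += sizeof(t) + extra;
        nr++;
    }
    return nr;
}
```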
We can dump BTF for the kernel – it is embedded in the kernel image in a .BTF section, and made available to users in /sys/kernel/btf :
$ bpftool btf dump file /sys/kernel/btf/vmlinux
[1] INT 'long unsigned int' size=8 bits_offset=0 nr_bits=64 encoding=(none)
[2] INT 'char' size=1 bits_offset=0 nr_bits=8 encoding=(none)
[3] INT 'unsigned int' size=4 bits_offset=0 nr_bits=32 encoding=(none)
[4] INT 'signed char' size=1 bits_offset=0 nr_bits=8 encoding=SIGNED
[5] INT 'unsigned char' size=1 bits_offset=0 nr_bits=8 encoding=(none)
[6] INT 'short int' size=2 bits_offset=0 nr_bits=16 encoding=SIGNED
[7] INT 'short unsigned int' size=2 bits_offset=0 nr_bits=16 encoding=(none)
[8] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED
[9] INT 'long long int' size=8 bits_offset=0 nr_bits=64 encoding=SIGNED
[10] TYPEDEF '__u64' type_id=11
[11] INT 'long long unsigned int' size=8 bits_offset=0 nr_bits=64 encoding=(none)
[12] TYPEDEF 'u64' type_id=10
[13] ENUM '(anon)' encoding=UNSIGNED size=4 vlen=2
'false' val=0
'true' val=1
[14] INT 'long int' size=8 bits_offset=0 nr_bits=64 encoding=SIGNED
[15] INT '__int128' size=16 bits_offset=0 nr_bits=128 encoding=SIGNED
[16] INT '__int128 unsigned' size=16 bits_offset=0 nr_bits=128 encoding=(none)
[17] TYPEDEF 'bool' type_id=18
[18] INT '_Bool' size=1 bits_offset=0 nr_bits=8 encoding=BOOL
[19] PTR '(anon)' type_id=0
...
So here we can see how we construct more complex types from simpler ones.
For example type id 19 is a PTR to type id 0 – i.e. void *. Similarly type id 10 is a ‘TYPEDEF’ called __u64 which is of the type specified by type id 11; i.e. equivalent to the declaration
typedef long long unsigned int __u64;
So we see how using references to other types we can construct descriptions of types in a compact form.
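Following such references is a simple loop. The sketch below uses a deliberately simplified in-memory table (struct simple_type is hypothetical and carries only a kind, a referenced id and a name) to resolve a typedef or qualifier chain down to the type it ultimately names.

```c
#include <stdint.h>

enum {
    KIND_INT = 1, KIND_PTR = 2, KIND_TYPEDEF = 8,
    KIND_VOLATILE = 9, KIND_CONST = 10, KIND_RESTRICT = 11,
};

/* Simplified view of a type: its kind, the id it refers to, and its name. */
struct simple_type {
    uint32_t kind;
    uint32_t ref_id;
    const char *name;
};

/*
 * Follow typedefs and qualifiers until some other kind (or void, id 0) is
 * reached. types[] is indexed by type id; nr is the number of entries.
 * The depth limit guards against malformed, cyclic input.
 */
static uint32_t resolve_type(const struct simple_type *types, uint32_t nr, uint32_t id)
{
    uint32_t depth;

    for (depth = 0; depth < 32 && id != 0 && id < nr; depth++) {
        uint32_t k = types[id].kind;

        if (k != KIND_TYPEDEF && k != KIND_VOLATILE &&
            k != KIND_CONST && k != KIND_RESTRICT)
            break;
        id = types[id].ref_id;
    }
    return id;
}
```

Applied to the dump above, resolving type id 12 (u64) passes through id 10 (__u64) and stops at id 11, the 'long long unsigned int' INT.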
However, we face a problem; if you dump the contents of /sys/kernel/btf/vmlinux, you will notice the core kernel needs over 100,000 type ids to represent the various types, functions etc. In fact this number would be much larger were it not for the fact that we remove duplicate types in a process known as de-duplication; for a great guide to this see here. When dealing with DWARF as the input to BTF generation, each compilation unit may duplicate types, so in order to honour its commitment to be compact, BTF has to remove such duplicates.
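To give a feel for the idea only, here is a deliberately tiny sketch that deduplicates integer types by name and size, as might be needed when several compilation units each describe 'int'. Real BTF deduplication is far more involved, since it must also establish equivalence of structs, pointers and other types that reference each other; struct int_type and dedup_ints are hypothetical names, and the quadratic search is chosen for brevity, not speed.

```c
#include <stdint.h>
#include <string.h>

/* A leaf integer type as one compilation unit might describe it. */
struct int_type {
    const char *name;
    uint32_t size;
};

/*
 * For each input type, canon[i] receives the index of the first identical
 * type (its canonical representative). Returns the number of unique types.
 */
static uint32_t dedup_ints(const struct int_type *in, uint32_t n, uint32_t *canon)
{
    uint32_t unique = 0;

    for (uint32_t i = 0; i < n; i++) {
        canon[i] = i;
        for (uint32_t j = 0; j < i; j++) {
            /* Only compare against canonical entries. */
            if (canon[j] == j && in[j].size == in[i].size &&
                strcmp(in[j].name, in[i].name) == 0) {
                canon[i] = j;
                break;
            }
        }
        if (canon[i] == i)
            unique++;
    }
    return unique;
}
```

After this pass, every reference to a duplicate would be rewritten to its canonical index and the duplicates dropped.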
One challenge we have when it comes to kernel modules is that such modules use a lot of kernel types but also have their own module-specific types. We would not wish to redefine all the kernel types for each module BTF representation, so what do we do?
Split BTF to the rescue
Split BTF represents a way of minimizing the BTF representations needed for kernel modules, or indeed for any case where the child object shares a significant amount of type information with the parent. In the case of kernel modules, it would be very expensive if every module needed to define its own type representations for void *, struct sk_buff and so on.
So the solution is to make use of the deduplication process we described earlier: modules share the same base BTF (that of the kernel) and as a result only have to add the types that are unique to the module. To avoid id clashes, type ids start at
last_base_BTF_type_id + 1
If they need to refer to base types, they can do so easily, and we can easily spot such references; they have type ids <= last_base_BTF_type_id.
Split BTF is created by deduplicating split (module) types with the base (kernel) types; this leaves us with no redundancy between split and base BTF.
When loading split BTF then, we need to specify the associated base BTF, since the split BTF will have references to it, and its type id references will only make sense in the context of that exact base BTF.
For example, taking a small BTF representation for the module xt_time:
$ bpftool btf dump -B /sys/kernel/btf/vmlinux file /sys/kernel/btf/xt_time
[155461] CONST '(anon)' type_id=115631
[155462] ARRAY '(anon)' type_id=155461 index_type_id=8 nr_elems=9
[155463] CONST '(anon)' type_id=155462
[155464] CONST '(anon)' type_id=18378
[155465] STRUCT 'xt_time_info' size=24 vlen=7
'date_start' type_id=55 bits_offset=0
'date_stop' type_id=55 bits_offset=32
'daytime_start' type_id=55 bits_offset=64
'daytime_stop' type_id=55 bits_offset=96
'monthdays_match' type_id=55 bits_offset=128
'weekdays_match' type_id=49 bits_offset=160
'flags' type_id=49 bits_offset=168
[155466] CONST '(anon)' type_id=155465
[155467] ENUM '(anon)' encoding=UNSIGNED size=4 vlen=6
'XT_TIME_LOCAL_TZ' val=1
'XT_TIME_CONTIGUOUS' val=2
'XT_TIME_ALL_MONTHDAYS' val=4294967294
'XT_TIME_ALL_WEEKDAYS' val=254
'XT_TIME_MIN_DAYTIME' val=0
'XT_TIME_MAX_DAYTIME' val=86399
[155468] STRUCT 'xtm' size=12 vlen=7
'month' type_id=18377 bits_offset=0
'monthday' type_id=18377 bits_offset=8
'weekday' type_id=18377 bits_offset=16
'hour' type_id=18377 bits_offset=24
'minute' type_id=18377 bits_offset=32
'second' type_id=18377 bits_offset=40
'dse' type_id=3 bits_offset=64
[155469] ARRAY '(anon)' type_id=155464 index_type_id=8 nr_elems=12
[155470] CONST '(anon)' type_id=155469
[155471] ENUM '(anon)' encoding=UNSIGNED size=4 vlen=2
'DSE_FIRST' val=2039
'SECONDS_PER_DAY' val=86400
[155472] ARRAY '(anon)' type_id=155464 index_type_id=8 nr_elems=70
[155473] CONST '(anon)' type_id=155472
[155474] PTR '(anon)' type_id=155466
[155475] PTR '(anon)' type_id=155468
[155476] FUNC 'time_mt' type_id=108285 linkage=static
[155477] FUNC 'time_mt_check' type_id=108283 linkage=static
[155478] FUNC 'time_mt_exit' type_id=124 linkage=static
[155479] FUNC 'time_mt_init' type_id=122 linkage=static
Here we see the type ids start at last_vmlinux_BTF_id + 1 (155460 + 1), and some refer to base vmlinux types; for example:
[155461] CONST '(anon)' type_id=115631
This represents a const struct modversion_info, making use of struct modversion_info from base BTF:
[115631] STRUCT 'modversion_info' size=64 vlen=2
In some cases of course, the BTF refers to other split BTF ids:
[155463] CONST '(anon)' type_id=155462
In both cases, we need the exact same base BTF for these references to make sense; in the case of base BTF references, we clearly need a base BTF where type id 115631 is struct modversion_info; and we also need a base BTF with exactly 155460 types for these internal module type id references to be valid.
Split BTF is a huge win because we have no need to redefine types like struct modversion_info and all their associated references; however, we also see that it is brittle in the sense that the split BTF will only work with the exact same base BTF.
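The id arithmetic behind this can be written down in two lines. In the sketch below (hypothetical helper names), a reference is classified as base or split purely by comparing it with the last base type id, which is exactly why a different base breaks it.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if id refers to a base (vmlinux) type; id 0 (void) counts as base. */
static bool is_base_type_id(uint32_t id, uint32_t last_base_id)
{
    return id <= last_base_id;
}

/*
 * Position of a split type within the module's own type section (0-based).
 * Only meaningful when is_base_type_id() is false for this id.
 */
static uint32_t split_type_index(uint32_t id, uint32_t last_base_id)
{
    return id - last_base_id - 1;
}
```

With the xt_time dump above and a last base id of 155460, id 115631 is a base reference, while id 155462 is the second type (index 1) in the module's own BTF. If the base gained even one type, the same numbers would point somewhere else.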
Where brittleness matters – split BTF and Linux kernel distributions
With Oracle Linux, we release a new kernel based on a Linux stable kernel release every few years; the latest is the 6.12-based UEK8. As stable fixes and other changes land in that kernel, the BTF associated with the kernel changes. This is not a problem for the modules that we ship, since they are built when that kernel is built: they generate their BTF relative to the base vmlinux, and because everything is built together, any changes in the base are reflected in the module split BTF generated at the same time.
However, when customers or partners want to build modules against a distro kernel, problems with BTF arise. Normally we would like a customer to be able to build their module with BTF and have that module work for the lifetime of a distro release. However, as we have seen, the module BTF is built in reference to the kernel at the time the module is built; an updated kernel (with stable fixes say) with different BTF makes the module BTF useless.
Consider the xt_time module representation above. If struct modversion_info is assigned a different type id – as can happen when a kernel is rebuilt to incorporate fixes during a stable release – the BTF no longer works. Worse, if a different number of type ids is used in the base BTF, all the split BTF references will become invalid.
So can we fix this?
Towards a more resilient split BTF
It is important to stress that the solution finally arrived at evolved over time, through discussions, feedback, advice and contributions from the BPF community.
The initial approach was to simply build BTF that is not split for out-of-tree modules. Such BTF would simply start at type id 1 and not be deduplicated with the core kernel.
The issue with this – apart from larger module BTF size – is that for BPF and the verifier, certain kernel type ids are special. So if we have standalone BTF for a module, it makes handling such type ids very difficult. How would the kernel know that a module sk_buff is the exact same as the kernel sk_buff? BTF deduplication is the answer, but it is a highly complex activity; we really do not want to have to do it in-kernel.
So the idea we ultimately pursued was to ship a limited base BTF – in a .BTF.base section – that would help clarify the base BTF references in the split BTF.
The process for such “distilled” BTF generation is
- Generate split BTF as usual
- For each type reference from split to base BTF, create a “distilled” base BTF representation for our .BTF.base section; the rules for this are:
  - base types int, float and fwd are represented as-is in distilled base
  - named structs, unions and enumerated types are added as empty named structs, unions and enumerated types to the distilled base
  - all other base types are added to the split BTF
  - add all the split types to the split BTF and update their references to point at .BTF.base ids or updated split BTF ids where applicable
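The rules above amount to a three-way decision per referenced base type. The sketch below encodes just that decision; the enum and function names are hypothetical, and it covers only the kinds named in the rules, so it should be read as an illustration of the policy rather than of the code in libbpf or pahole.

```c
#include <stdbool.h>
#include <stdint.h>

/* Kind numbers from linux/btf.h. */
enum {
    KIND_INT = 1, KIND_PTR = 2, KIND_STRUCT = 4, KIND_UNION = 5,
    KIND_ENUM = 6, KIND_FWD = 7, KIND_FLOAT = 16,
};

enum distill_action {
    DISTILL_KEEP_AS_IS,     /* copied unchanged into .BTF.base */
    DISTILL_EMPTY_STUB,     /* kept by name only, with no members or values */
    DISTILL_ADD_TO_SPLIT,   /* copied into the module's split BTF */
};

/* Decide what happens to a base type that the split BTF refers to. */
static enum distill_action distill_action_for(uint32_t kind, bool is_named)
{
    switch (kind) {
    case KIND_INT:
    case KIND_FLOAT:
    case KIND_FWD:
        return DISTILL_KEEP_AS_IS;
    case KIND_STRUCT:
    case KIND_UNION:
    case KIND_ENUM:
        /* Anonymous types cannot be found again by name later. */
        return is_named ? DISTILL_EMPTY_STUB : DISTILL_ADD_TO_SPLIT;
    default:
        return DISTILL_ADD_TO_SPLIT;
    }
}
```

The split on is_named reflects why names matter here: a named type can be looked up again in whatever vmlinux the module is later loaded against, while an anonymous one cannot.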
Now we carry a stripped down representation of the base BTF in the .BTF.base section and the split BTF that refers to it is in the .BTF section as usual.
When we load the module, a BTF relocation process takes place where
- the type ids that refer to the .BTF.base ids are updated to point at the current vmlinux type; so a reference to a struct sk_buff is updated to point at the struct sk_buff in the current vmlinux
- we renumber the split types to start at the current vmlinux type_id + 1, so all split BTF references are renumbered appropriately
So in other words, providing the additional context in .BTF.base allows us to relocate the BTF references to work with a changed kernel.
This process has limits; however, the intent is to support the lifetime of a stable-based kernel release, and in such releases we do not see a massive amount of type churn.
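The two relocation steps can be sketched as follows. This is a simplified model, not the kernel or libbpf implementation: it matches distilled types to vmlinux types by kind and name only, uses a linear search, and all names (struct named_type, find_base_id, build_distilled_map, relocate_id) are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Minimal description of a type: enough to match it by kind and name. */
struct named_type {
    uint32_t kind;
    const char *name;
};

/* Find the id (1..nr_base) of the base type with this kind and name; 0 if none. */
static uint32_t find_base_id(const struct named_type *base, uint32_t nr_base,
                             uint32_t kind, const char *name)
{
    for (uint32_t id = 1; id <= nr_base; id++)
        if (base[id].kind == kind && strcmp(base[id].name, name) == 0)
            return id;
    return 0;
}

/*
 * Step 1: map each distilled base id (1..nr_dist) to the matching id in the
 * running kernel's BTF. Arrays are indexed by type id, so entry 0 is unused.
 * Returns -1 if a distilled type has no counterpart.
 */
static int build_distilled_map(const struct named_type *dist, uint32_t nr_dist,
                               const struct named_type *base, uint32_t nr_base,
                               uint32_t *id_map)
{
    id_map[0] = 0;  /* void stays void */
    for (uint32_t d = 1; d <= nr_dist; d++) {
        id_map[d] = find_base_id(base, nr_base, dist[d].kind, dist[d].name);
        if (id_map[d] == 0)
            return -1;
    }
    return 0;
}

/* Step 2: rewrite one type id reference found in the split BTF. */
static uint32_t relocate_id(uint32_t id, const uint32_t *id_map,
                            uint32_t nr_dist, uint32_t nr_base)
{
    if (id <= nr_dist)
        return id_map[id];          /* points into .BTF.base: use the map */
    return id - nr_dist + nr_base;  /* split type: renumber after vmlinux */
}
```

Plugging in the numbers from the simplemod example below (three distilled types, and a vmlinux whose last type id is 132496), split id 4 relocates to 132497 and split id 10 to 132503, which matches the relocated dump.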
Example
To illustrate this, let us build a simple module:
#include <linux/init.h>
#include <linux/module.h>
#include <linux/skbuff.h>
#include <linux/proc_fs.h>
/* Module-specific type built from base kernel types */
struct foo {
    int f1;
    void *f2;
    struct sk_buff f3;
};
MODULE_DESCRIPTION("resilient btf mod");
MODULE_LICENSE("GPL");
static int __init simplemod_init(void)
{
    struct foo f = { .f1 = 1 };
    printk(KERN_INFO "mod inited %d.\n", f.f1);
    return 0;
}
static void __exit simplemod_exit(void)
{
    printk(KERN_INFO "mod exited.\n");
}
module_init(simplemod_init);
module_exit(simplemod_exit);
We see it refers to some base kernel types in its own struct foo.
If we build it as follows:
$ make -C /lib/modules/$(uname -r)/build M=$(pwd) modules
CC [M] simplemod.o
MODPOST Module.symvers
LD [M] simplemod.ko
BTF [M] simplemod.ko
…we will see it contains a .BTF and a .BTF.base section:
$ objdump -h simplemod.ko |grep BTF
37 .BTF 000000df 0000000000000000 0000000000000000 0002c538 2**0
38 .BTF.base 0000005d 0000000000000000 0000000000000000 0002c617 2**0
We can dump BTF either in reference to the .BTF.base (in its distilled form):
$ bpftool btf dump file simplemod.ko
[4] STRUCT 'foo' size=256 vlen=3
'f1' type_id=1 bits_offset=0
'f2' type_id=10 bits_offset=64
'f3' type_id=2 bits_offset=128
[5] CONST '(anon)' type_id=3
[6] ARRAY '(anon)' type_id=5 index_type_id=1 nr_elems=4
[7] CONST '(anon)' type_id=6
[8] FUNC 'simplemod_exit' type_id=12 linkage=static
[9] FUNC 'simplemod_init' type_id=11 linkage=static
[10] PTR '(anon)' type_id=0
[11] FUNC_PROTO '(anon)' ret_type_id=1 vlen=0
[12] FUNC_PROTO '(anon)' ret_type_id=0 vlen=0
…or relative to vmlinux, where split BTF has been relocated:
$ bpftool btf dump -B /sys/kernel/btf/vmlinux file simplemod.ko
[132497] STRUCT 'foo' size=256 vlen=3
'f1' type_id=21 bits_offset=0
'f2' type_id=132503 bits_offset=64
'f3' type_id=903 bits_offset=128
[132498] CONST '(anon)' type_id=11747
[132499] ARRAY '(anon)' type_id=132498 index_type_id=21 nr_elems=4
[132500] CONST '(anon)' type_id=132499
[132501] FUNC 'simplemod_exit' type_id=132505 linkage=static
[132502] FUNC 'simplemod_init' type_id=132504 linkage=static
[132503] PTR '(anon)' type_id=0
[132504] FUNC_PROTO '(anon)' ret_type_id=21 vlen=0
[132505] FUNC_PROTO '(anon)' ret_type_id=0 vlen=0
Note for example that f1 refers to type id 21 now:
[21] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED
…and f3 refers to type id 903:
[903] STRUCT 'sk_buff' size=240 vlen=29
The latter is how the BTF looks when it has been relocated by the kernel after the module is loaded:
$ sudo insmod ./simplemod.ko
$ bpftool btf dump file /sys/kernel/btf/simplemod
[132497] STRUCT 'foo' size=256 vlen=3
'f1' type_id=21 bits_offset=0
'f2' type_id=132503 bits_offset=64
'f3' type_id=903 bits_offset=128
[132498] CONST '(anon)' type_id=11747
[132499] ARRAY '(anon)' type_id=132498 index_type_id=21 nr_elems=4
[132500] CONST '(anon)' type_id=132499
[132501] FUNC 'simplemod_exit' type_id=132505 linkage=static
[132502] FUNC 'simplemod_init' type_id=132504 linkage=static
[132503] PTR '(anon)' type_id=0
[132504] FUNC_PROTO '(anon)' ret_type_id=21 vlen=0
[132505] FUNC_PROTO '(anon)' ret_type_id=0 vlen=0
Note that the module builder does not need to do anything different; we silently detect an out-of-tree module build and add distilled base BTF accordingly.
This approach provides two key benefits:
- it allows split BTF to be more resilient
- crucially it also preserves type IDs of important kernel types via relocation. This is important for kfuncs and verification of programs that expect specific types.
Summary
Split BTF presented a problem for out-of-tree modules; any time the kernel is updated, the BTF associated with the module becomes invalid. Distros like to support a model where a module is compiled once and works across the lifetime of the release.
Generating a minimal base BTF representation that travels with the module split BTF allows us to relocate references in the split BTF even if the underlying kernel type info has changed.
To make all of this work, changes were needed in
- libbpf to support distilled base BTF generation
- pahole to carry out distilled BTF generation
- bpftool to display module BTF with a .BTF.base section
- kbuild infrastructure to trigger pahole distilled base generation selectively for out-of-tree modules
- the kernel to handle BTF relocation on module load
Huge thanks to the BPF community for all their help, advice and contributions to this work; as I mentioned, we tried multiple iterations but wound up with a solution that satisfied the requirements while being invisible to users.
Since most users' interaction with BPF is via distros, having a solution to this problem was important. To support distilled base BTF in your environment, ensure you have
- kernel >= 6.12 (UEK8 is based on 6.12)
- pahole >= 1.28
- libbpf >= 1.5