X

News, tips, partners, and perspectives for the Oracle Linux operating system and upstream Linux kernel work

BPF In Depth: The BPF Bytecode and the BPF Verifier

Notes on BPF (5) - BPF bytecodes and the BPF verifier

Oracle Linux kernel developer Alan Maguire presents this six-part series on BPF, wherein he presents an in depth look at the kernel's "Berkeley Packet Filter" -- a useful and extensible kernel function for much more than packet filtering.

Previously, we've described

  • what sorts of BPF programs can be used;
  • the BPF helper functions those programs can call
  • the ways BPF programs can communicate with userpace
  • how to set up a BPF build environment.

Now we've got one more topic to cover before we're ready to start writing BPF programs. How does BPF ensure that programs are safe?

When working with BPF, the first wall you are likely to hit - after compiling your program and trying to load it - is a BPF verifier complaint such as this one I came across recently when loading a BPF-based tc classifier:

from 1545 to 1615: R0=inv0 R6=inv2 R7=ctx(id=0,off=0,imm=0) R8=inv(id=0,umin_value=28,umax_value=1048,var_off=(0x0; 0x7fc)) R9=inv(id=0,umax_value=1020,var_off=(0x0; 0x3fc)) R10=fp0 fp-248=inv fp-232=map_value
1615: (61) r1 = *(u32 *)(r7 +76)
1616: (79) r2 = *(u64 *)(r10 -152)
1617: (0f) r1 += r2

math between pkt pointer and register with unbounded min value is not allowed

To understand what all this means, we need to describe the BPF instruction set and what the verifier does.

BPF instruction set

As mentioned previously, the eBPF instruction set extended the set of bytecodes available, moved to 64-bit registers and in general created an instruction set that looks quite like x86_64.This isn't a coincidence; the aim was to support just-in-time (JIT) compilation as a speedier alternative to interpreting bytecodes. JIT compilation can be enabled via

# sysctl net/core/bpf_jit_enable=1

The instruction set is documented at

https://www.kernel.org/doc/Documentation/networking/filter.txt

...but be sure to look at the "BPF kernel internals" section and later, as above that is a description of "classic" BPF, the initial instruction set used for packet filtering. Classic BPF is translated in-kernel to support existing filtering mechanisms in wireshark, tcpdump etc.

BPF Registers

Register Function x86_64 equiv
R0 return value from in-kernel function/exit value for prog rax
R1 first arg to in-kernel function/scratch variable rdi
R2 second arg to in-kernel function/scratch variable rsi
R3 third arg to in-kernel function/scratch variable rdx
R4 fourth arg to in-kernel function/scratch variable rcx
R5 fifth arg to in-kernel function/scratch variable r8
R6 callee saved registers that in-kernel function preserves rbx
R7 callee saved registers that in-kernel function preserves r13
R8 callee saved registers that in-kernel function preserves r14
R9 callee saved registers that in-kernel function preserves r15
R10 read-only frame pointer to access stack rbp

As we can see, the maximum number of function register arguments BPF can use is 5 (x86_64 supports more).

So just-in-time compilation can simply use the x86_64 equivalents when creating a mapping from BPF instructions to the x86_64 ISA. The x86_64 implementation of JIT compilation for 4.14 is found at

https://github.com/oracle/linux-uek/blob/uek5/master/arch/x86/net/bpf_jit_comp.c

bpf_int_jit_compile() makes multiple passes over the bytecodes, shrinking the image each time until no more shrinking occurs. do_jit() carries out the mapping, cycling through the instructions in a big switch() statement.

There is no great need to describe the various supported BPF opcodes, as the output from the verifier is rendered in a quite human-readable form.

Returning to our verifier complaint, it showed us a few snippets of our program:

1615: (61) r1 = *(u32 *)(r7 +76)
1616: (79) r2 = *(u64 *)(r10 -152)
1617: (0f) r1 += r2

From the above, we know that r1 and r2 are registers used to pass arguments to BPF functions. So on 1615, we are setting r1 to the u32 value pointed at by (r7 + 76). And on 1616, we are setting r2 to an offset from the frame pointer, i.e. a local variable on the stack. Finally we add both together. The last piece of the puzzle is to describe what the verifier context information

"from 1545 to 1615: R0=inv0 R6=inv2 R7=ctx(id=0,off=0,imm=0) R8=inv(id=0,umin_value=28,umax_value=1048,var_off=(0x0; 0x7fc)) R9=inv(id=0,umax_value=1020,var_off=(0x0; 0x3fc)) R10=fp0 fp-248=inv fp-232=map_value"

and error

"math between pkt pointer and register with unbounded min value is not allowed"

...mean. To describe that, we need a bit more information on what the verifier does.

The BPF verifier

What does the verifier do?

At a high-level, the BPF verifier ensures that BPF programs are safe - i.e. they will not crash the system, access invalid memory addresses, etc.

The verifier code is pretty well commented; I'd recommend starting with https://github.com/oracle/linux-uek/blob/uek5/master/kernel/bpf/verifier.c :

Here's what it says

bpf_check() is a static code analyzer that walks eBPF program instruction by instruction and updates register/stack state. All paths of conditional branches are analyzed until 'bpf_exit' insn.


The first pass is depth-first-search to check that the program is a DAG. It rejects the following programs:

- larger than BPF_MAXINSNS insns

- if loop is present (detected via back-edge)

- unreachable insns exist (shouldn't be a forest. program = one function)

- out of bounds or malformed jumps


The second pass is all possible path descent from the 1st insn. Since it's analyzing all pathes through the program, the length of the analysis is limited to 64k insn, which may be hit even if total number of insn is less then 4K, but there are too many branches that change stack/regs.

Number of 'branches to be analyzed' is limited to 1k

DAG is a "directed acyclic graph" - we want to ensure our program has no backward branches. It is a directed graph because we always branch forwards. However, multiple places in the code can branch forward to the same destination, so it's not a tree.

First pass verifier complaints

So from the above, when we considering verifier errors, we have a few different classes of problem. In the first pass we can get verifier errors if we pass in

  • programs that are too big (larger than BPF_MAXINSNS instructions). In Linux 4.14 BPF_MAXINSNS is set to 4096.
  • programs with loops. Note that we can unroll simple loops in clang via "#pragma clang loop unroll(full)", but such loops should be simple as unrolling can fail. In particular loops which are bounded by a variable value (such as a value retrieved from a packet header) are risky. In addiition, adding complex predicates within a loop body can be problematic also.
  • programs which call other functions (that are not BPF helpers or defined as __always_inline).
  • unreachable instructions. Hard to see how this could happen in a restricted C environment, using raw BPF it could be a risk of course.

Most of these problems are reasonably easy to eliminate.

Second pass verifier complaints

The second pass is trickier. The verifier will try all paths, tracking types of registers used as input to instructions, and updating resulting type via register state values. For example, PTR_TO_PACKET + SCALAR_VALUE → PTR_TO_PACKET. Certain operations are forbidden, e.g. adding two pointer values together gives an invalid value.

The bpf verifier explores all paths of the program. For conditional jumps, a stack is used, so one path is explored while the instruction for the other path is pushed onto the stack. So we do a depth-first search of the instruction set. When we arrive at an instruction with a state equivalent to an earlier instruction state analysis (see the states_equal() function for how this is determined), we can prune the search. If we reach bpf_exit() without any complaints and a valid R0 value (the return value of the BPF program), a state is marked safe. We then backtrack to the first pushed instruction and repeat the cycle until the stack is empty and we're done.

In my experience, the verifier can find quite subtle issues in code, and while you will spend a considerable amount of time tracking them down, it is usually a genuine bug! There are cases where compiler optimizations can confuse the verifier, so it's always best to run "llvm-objdump" (see below) to examine your program.

So the "math between pkt pointer and register with unbounded min value is not allowed" error we saw above is essentially the verifier figuring out the set of actions in your program can lead to a situation where it is possible to run off the start of the packet by subtracting a value from the packet pointer. It is important to ensure values we add to packet pointers are unsigned, e.g. __u16. Also watch out for overflows when bit-shifting, adding, subtracting or multiplying.

Ensure same for accesses on stack - for that we would get a similar error but it would mention the fp (frame pointer) instead.

So what about

"from 1545 to 1615: R0=inv0 R6=inv2 R7=ctx(id=0,off=0,imm=0) R8=inv(id=0,umin_value=28,umax_value=1048,var_off=(0x0; 0x7fc)) R9=inv(id=0,umax_value=1020,var_off=(0x0; 0x3fc)) R10=fp0 fp-248=inv fp-232=map_value"

?

With the above info, we can see that this is the output of the verifier analysis, and it is telling us the state of the registers as per the verifier when processing a given chunk of instructions.

Direct packet access, non-linear SKBs and verifier complaints

For BPF programs which support sk_buff access, there are two modes with which we can retrieve/store information from/to a packet. The first is to use bpf_skb_load|store_bytes(). The advantage of this interface is that it works for linear and non-linear sk_buffs. Packet data is stored in the sk_buff structure, and it can hold that data in an initial "linear" section from skb→data to skb→data + skb→end (If this all sounds confusing, I'd recommend reading "How SKBs work" by David Miller). However, additional packet data can also be stored in fragments associated with the packet. These are referenced via the skb_shared_info structure. In BPF, we get a modified version of the skb, "struct __sk_buff" which contains pointers "data" and "data_end" which point at the start and end of the linear portion of the packet. We can directly read and write packet data using these pointers, but the BPF verifier requires that we first ensure that the packet data we wish to read/write is between "data" and "data_end". So a lot of BPF code which uses direct packet access looks like this:

struct eth_hdr *eth;

if (data + sizeof(*eth) > data_end)
   return TC_ACT_OK;

eth = data;
if (bpf_ntohs(eth→h_proto) == ETH_P_IP) {
...

This leads naturally to a question - what if some of our packet data falls into the non-linear portion? I've encountered situations - particulary in VMs which use multiple layers of packet encapsulation - where some packet header data falls into the non-linear part of the skb. The best approach is to test for the condition where additional packet data is present and not in the linear portion. If that is the case, we can call bpf_skb_pull_data(skb, data_len) to ensure that data_len bytes will be in the linear portion.David Miller writes about direct packet access here.

However, you may get a bunch more verifier warnings if you do this. Why? Well, many BPF functions such as bpf_skb_store_bytes(), bpf_skb_pull_data(), bpf_skb_adjust_room() etc will invalidate the data/data_end pointers and any checks done on them. So when using direct packet access, we need to retrieve data/data_end from the skb again and ensure that we verify the data we read/write falls between them.

Examining your program with llvm-objdump

If you get verifier errors, you will want to figure out where they occur in your restricted C code. We can dump our BPF program with annotated source if we run

# llvm-objdump -S -no-show-raw-insn program.o

Ensure the original program was compiled with -g to get source annotations. The result will look something like this:

program.o: file format ELF64-BPF

Disassembly of section program_handle_egress:
program_handle_egress:
; {
       0:       r7 = r1
; {
       1:       r6 = 0
; void *data_end = (void *)(long)skb->data_end;
       2:       r2 = *(u32 *)(r7 + 80)
; void *data = (void *)(long)skb->data;
       3:       r1 = *(u32 *)(r7 + 76)
; if (data + sizeof(*eth) > data_end)
       4:       r3 = r1
       5:       r3 += 14
       6:       if r3 > r2 goto 570

You can see the source code interspersed with the BPF program, so it's a neat way to figure out what's happening around wherever the BPF is complaining about.

Learning more about BPF

Thanks for reading this installment of our series on BPF. We hope you found it educational and useful. Questions or comments? Use the comments field below!

Stay tuned for the final installment in this series, BPF Packet Transformation.

Be sure to visit the previous installments of this series on BPF, here, and stay tuned for our next blog posts! 1. BPF program types 2. BPF helper functions for those programs 3. BPF userspace communication 4. BPF program build environment 5. BPF bytecodes and verifier

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.Captcha