Oracle Linux kernel developer Alan Maguire presents this six-part series on BPF, wherein he presents an in depth look at the kernel's "Berkeley Packet Filter" -- a useful and extensible kernel function for much more than packet filtering.
Previously, we've described
Now we've got one more topic to cover before we're ready to start writing BPF programs. How does BPF ensure that programs are safe?
When working with BPF, the first wall you are likely to hit - after compiling your program and trying to load it - is a BPF verifier complaint such as this one I came across recently when loading a BPF-based tc classifier:
from 1545 to 1615: R0=inv0 R6=inv2 R7=ctx(id=0,off=0,imm=0) R8=inv(id=0,umin_value=28,umax_value=1048,var_off=(0x0; 0x7fc)) R9=inv(id=0,umax_value=1020,var_off=(0x0; 0x3fc)) R10=fp0 fp-248=inv fp-232=map_value
1615: (61) r1 = *(u32 *)(r7 +76)
1616: (79) r2 = *(u64 *)(r10 -152)
1617: (0f) r1 += r2
math between pkt pointer and register with unbounded min value is not allowed
To understand what all this means, we need to describe the BPF instruction set and what the verifier does.
As mentioned previously, the eBPF instruction set extended the set of bytecodes available, moved to 64-bit registers and in general created an instruction set that looks quite like x86_64.This isn't a coincidence; the aim was to support just-in-time (JIT) compilation as a speedier alternative to interpreting bytecodes. JIT compilation can be enabled via
# sysctl net/core/bpf_jit_enable=1
The instruction set is documented at
https://www.kernel.org/doc/Documentation/networking/filter.txt
...but be sure to look at the "BPF kernel internals" section and later, as above that is a description of "classic" BPF, the initial instruction set used for packet filtering. Classic BPF is translated in-kernel to support existing filtering mechanisms in wireshark, tcpdump etc.
Register | Function | x86_64 equiv |
---|---|---|
R0 | return value from in-kernel function/exit value for prog | rax |
R1 | first arg to in-kernel function/scratch variable | rdi |
R2 | second arg to in-kernel function/scratch variable | rsi |
R3 | third arg to in-kernel function/scratch variable | rdx |
R4 | fourth arg to in-kernel function/scratch variable | rcx |
R5 | fifth arg to in-kernel function/scratch variable | r8 |
R6 | callee saved registers that in-kernel function preserves | rbx |
R7 | callee saved registers that in-kernel function preserves | r13 |
R8 | callee saved registers that in-kernel function preserves | r14 |
R9 | callee saved registers that in-kernel function preserves | r15 |
R10 | read-only frame pointer to access stack | rbp |
As we can see, the maximum number of function register arguments BPF can use is 5 (x86_64 supports more).
So just-in-time compilation can simply use the x86_64 equivalents when creating a mapping from BPF instructions to the x86_64 ISA. The x86_64 implementation of JIT compilation for 4.14 is found at
https://github.com/oracle/linux-uek/blob/uek5/master/arch/x86/net/bpf_jit_comp.c
bpf_int_jit_compile() makes multiple passes over the bytecodes, shrinking the image each time until no more shrinking occurs. do_jit() carries out the mapping, cycling through the instructions in a big switch() statement.
There is no great need to describe the various supported BPF opcodes, as the output from the verifier is rendered in a quite human-readable form.
Returning to our verifier complaint, it showed us a few snippets of our program:
1615: (61) r1 = *(u32 *)(r7 +76)
1616: (79) r2 = *(u64 *)(r10 -152)
1617: (0f) r1 += r2
From the above, we know that r1 and r2 are registers used to pass arguments to BPF functions. So on 1615, we are setting r1 to the u32 value pointed at by (r7 + 76). And on 1616, we are setting r2 to an offset from the frame pointer, i.e. a local variable on the stack. Finally we add both together. The last piece of the puzzle is to describe what the verifier context information
"from 1545 to 1615: R0=inv0 R6=inv2 R7=ctx(id=0,off=0,imm=0) R8=inv(id=0,umin_value=28,umax_value=1048,var_off=(0x0; 0x7fc)) R9=inv(id=0,umax_value=1020,var_off=(0x0; 0x3fc)) R10=fp0 fp-248=inv fp-232=map_value"
and error
"math between pkt pointer and register with unbounded min value is not allowed"
...mean. To describe that, we need a bit more information on what the verifier does.
At a high-level, the BPF verifier ensures that BPF programs are safe - i.e. they will not crash the system, access invalid memory addresses, etc.
The verifier code is pretty well commented; I'd recommend starting with https://github.com/oracle/linux-uek/blob/uek5/master/kernel/bpf/verifier.c :
Here's what it says
bpf_check() is a static code analyzer that walks eBPF program instruction by instruction and updates register/stack state. All paths of conditional branches are analyzed until 'bpf_exit' insn.
The first pass is depth-first-search to check that the program is a DAG. It rejects the following programs:
- larger than BPF_MAXINSNS insns
- if loop is present (detected via back-edge)
- unreachable insns exist (shouldn't be a forest. program = one function)
- out of bounds or malformed jumps
The second pass is all possible path descent from the 1st insn. Since it's analyzing all pathes through the program, the length of the analysis is limited to 64k insn, which may be hit even if total number of insn is less then 4K, but there are too many branches that change stack/regs.
Number of 'branches to be analyzed' is limited to 1k
DAG is a "directed acyclic graph" - we want to ensure our program has no backward branches. It is a directed graph because we always branch forwards. However, multiple places in the code can branch forward to the same destination, so it's not a tree.
So from the above, when we considering verifier errors, we have a few different classes of problem. In the first pass we can get verifier errors if we pass in
Most of these problems are reasonably easy to eliminate.
The second pass is trickier. The verifier will try all paths, tracking types of registers used as input to instructions, and updating resulting type via register state values. For example, PTR_TO_PACKET + SCALAR_VALUE → PTR_TO_PACKET. Certain operations are forbidden, e.g. adding two pointer values together gives an invalid value.
The bpf verifier explores all paths of the program. For conditional jumps, a stack is used, so one path is explored while the instruction for the other path is pushed onto the stack. So we do a depth-first search of the instruction set. When we arrive at an instruction with a state equivalent to an earlier instruction state analysis (see the states_equal() function for how this is determined), we can prune the search. If we reach bpf_exit() without any complaints and a valid R0 value (the return value of the BPF program), a state is marked safe. We then backtrack to the first pushed instruction and repeat the cycle until the stack is empty and we're done.
In my experience, the verifier can find quite subtle issues in code, and while you will spend a considerable amount of time tracking them down, it is usually a genuine bug! There are cases where compiler optimizations can confuse the verifier, so it's always best to run "llvm-objdump" (see below) to examine your program.
So the "math between pkt pointer and register with unbounded min value is not allowed" error we saw above is essentially the verifier figuring out the set of actions in your program can lead to a situation where it is possible to run off the start of the packet by subtracting a value from the packet pointer. It is important to ensure values we add to packet pointers are unsigned, e.g. __u16. Also watch out for overflows when bit-shifting, adding, subtracting or multiplying.
Ensure same for accesses on stack - for that we would get a similar error but it would mention the fp (frame pointer) instead.
So what about
"from 1545 to 1615: R0=inv0 R6=inv2 R7=ctx(id=0,off=0,imm=0) R8=inv(id=0,umin_value=28,umax_value=1048,var_off=(0x0; 0x7fc)) R9=inv(id=0,umax_value=1020,var_off=(0x0; 0x3fc)) R10=fp0 fp-248=inv fp-232=map_value"
?
With the above info, we can see that this is the output of the verifier analysis, and it is telling us the state of the registers as per the verifier when processing a given chunk of instructions.
For BPF programs which support sk_buff access, there are two modes with which we can retrieve/store information from/to a packet. The first is to use bpf_skb_load|store_bytes(). The advantage of this interface is that it works for linear and non-linear sk_buffs. Packet data is stored in the sk_buff structure, and it can hold that data in an initial "linear" section from skb→data to skb→data + skb→end (If this all sounds confusing, I'd recommend reading "How SKBs work" by David Miller). However, additional packet data can also be stored in fragments associated with the packet. These are referenced via the skb_shared_info structure. In BPF, we get a modified version of the skb, "struct __sk_buff" which contains pointers "data" and "data_end" which point at the start and end of the linear portion of the packet. We can directly read and write packet data using these pointers, but the BPF verifier requires that we first ensure that the packet data we wish to read/write is between "data" and "data_end". So a lot of BPF code which uses direct packet access looks like this:
struct eth_hdr *eth;
if (data + sizeof(*eth) > data_end)
return TC_ACT_OK;
eth = data;
if (bpf_ntohs(eth→h_proto) == ETH_P_IP) {
...
This leads naturally to a question - what if some of our packet data falls into the non-linear portion? I've encountered situations - particulary in VMs which use multiple layers of packet encapsulation - where some packet header data falls into the non-linear part of the skb. The best approach is to test for the condition where additional packet data is present and not in the linear portion. If that is the case, we can call bpf_skb_pull_data(skb, data_len) to ensure that data_len bytes will be in the linear portion.David Miller writes about direct packet access here.
However, you may get a bunch more verifier warnings if you do this. Why? Well, many BPF functions such as bpf_skb_store_bytes(), bpf_skb_pull_data(), bpf_skb_adjust_room() etc will invalidate the data/data_end pointers and any checks done on them. So when using direct packet access, we need to retrieve data/data_end from the skb again and ensure that we verify the data we read/write falls between them.
If you get verifier errors, you will want to figure out where they occur in your restricted C code. We can dump our BPF program with annotated source if we run
# llvm-objdump -S -no-show-raw-insn program.o
Ensure the original program was compiled with -g to get source annotations. The result will look something like this:
program.o: file format ELF64-BPF
Disassembly of section program_handle_egress:
program_handle_egress:
; {
0: r7 = r1
; {
1: r6 = 0
; void *data_end = (void *)(long)skb->data_end;
2: r2 = *(u32 *)(r7 + 80)
; void *data = (void *)(long)skb->data;
3: r1 = *(u32 *)(r7 + 76)
; if (data + sizeof(*eth) > data_end)
4: r3 = r1
5: r3 += 14
6: if r3 > r2 goto 570
You can see the source code interspersed with the BPF program, so it's a neat way to figure out what's happening around wherever the BPF is complaining about.
Thanks for reading this installment of our series on BPF. We hope you found it educational and useful. Questions or comments? Use the comments field below!
Stay tuned for the final installment in this series, BPF Packet Transformation.
Be sure to visit the previous installments of this series on BPF, here, and stay tuned for our next blog posts! 1. BPF program types 2. BPF helper functions for those programs 3. BPF userspace communication 4. BPF program build environment 5. BPF bytecodes and verifier
Next Post