Notes on BPF (6) – BPF packet transformation using tc

Oracle Linux kernel developer Alan Maguire presents this six-part series on BPF, an in-depth look at the kernel’s “Berkeley Packet Filter”, a useful and extensible kernel function for much more than packet filtering.

In earlier blog entries, we’ve tried to run through some of the concepts in BPF and hopefully now we’re ready to try writing some BPF programs.

One of the great use cases for BPF is network packet handling. Here we will do some magic using BPF: we’re going to turn IPv4 packets received on the wire into IPv6 packets before the Linux networking stack sees them, and then reverse the trick on outbound. So the system running BPF will only see IPv6 in its networking stack, while IPv4 traffic is what’s seen on the wire. Specifically, we’ll do this for an ICMP echo request (ping): we convert an inbound IPv4 ping into an ICMPv6 echo request, then take the ICMPv6 echo reply and convert it into an IPv4 echo reply. So the remote ping application thinks it’s talking to an IPv4 endpoint, while the local Linux TCP/IP stack thinks it’s talking to a remote IPv6 ping client!

So on inbound, what happens is this:

    +---->  3. IPv6 packet is processed by TCP/IP stack
    |
+-----> 2. BPF ingress (inbound) filter transforms it into IPv6
|
1. IPv4 inbound packet arrives

Similarly for outbound packets:

    +-----  1. IPv6 packet is sent by TCP/IP stack
    V   
+-------2. BPF egress (outbound) filter transforms it into IPv4
|
3. IPv4 outbound packet is sent on wire.

Why do this? Mostly because it’s a non-trivial example of using BPF to do packet transformation, and I couldn’t find any existing examples that do IPv4 -> IPv6 transformation. As a reminder though, the samples/bpf directory in the kernel tree has a bunch of different examples that are useful if you’re trying to learn how to write BPF programs.

If you want to see the fully worked example, check out

https://github.com/alan-maguire/bpf-test/blob/master/bpf/test_bpf_helper_bpf_skb_change_proto_kern.c

It’s part of a repo which does unit tests of various BPF helpers. This one covers the bpf_skb_change_proto() helper function, which allows us to turn an IPv4 packet into IPv6 and vice versa. The test converts IPv4 ICMP echo requests (pings) into IPv6 echo requests on ingress, and takes IPv6 echo replies on egress and converts them into IPv4 echo replies. So the remote system pings an IPv4 address and BPF translates things so that the echo request is processed as an IPv6 ping. Doing all this allows us to test that the protocol change helper works.

Converting IPv4 to IPv6 – a quick primer

To convert between the protocols, we need to remind ourselves what the differences are between IPv4 and IPv6. As always, consult the RFCs for full details, but to summarize the key details we need to care about:

IPv6 headers carry no checksum, while IPv4 checksums the IPv4 header.
IPv6 headers are 40 bytes in size while IPv4 headers are 20 bytes (without options), largely because…
IPv6 addresses are 128 bits in size rather than 32 for IPv4.
IPv6 uses extension headers, while IPv4 uses options which are tacked on the end of the header.

Note that for higher-level protocols, we also need to consider the concept of a pseudo-header. When checksumming TCP, UDP and ICMPv6, we checksum the TCP, UDP or ICMPv6 packet content, but also add a pseudo-header consisting of the source/destination addresses, payload length and protocol type. Again consult the RFCs for full details, but the consequence for BPF is this: when moving between IPv4 and IPv6, we also need to modify layer 4 checksums, because changing the IP addresses (from v4 to v6 or vice versa) changes the pseudo-header and thus the checksum calculation.
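To make the pseudo-header concrete, here is a minimal userspace sketch (plain C, not BPF) of the IPv6 pseudo-header layout from RFC 8200 section 8.1: source address, destination address, a 32-bit upper-layer length, three zero bytes and the next-header value. The helper name build_pseudo6() and the PSEUDO6_LEN constant are made up for illustration and do not appear in the referenced test code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PSEUDO6_LEN 40  /* 16 + 16 + 4 + 3 + 1 bytes */

/* Hypothetical helper: fill out[] with the IPv6 pseudo-header that is
 * summed together with the TCP, UDP or ICMPv6 message when checksumming.
 */
void build_pseudo6(uint8_t out[PSEUDO6_LEN],
                   const uint8_t src[16], const uint8_t dst[16],
                   uint32_t upper_len, uint8_t next_hdr)
{
        assert(out && src && dst);

        memcpy(out, src, 16);           /* source address */
        memcpy(out + 16, dst, 16);      /* destination address */

        /* Upper-layer packet length, 32 bits, network byte order. */
        out[32] = (uint8_t)(upper_len >> 24);
        out[33] = (uint8_t)(upper_len >> 16);
        out[34] = (uint8_t)(upper_len >> 8);
        out[35] = (uint8_t)upper_len;

        /* Three zero bytes, then the next-header (protocol) value. */
        out[36] = out[37] = out[38] = 0;
        out[39] = next_hdr;
}
```

The pseudo-header is never transmitted; it only feeds into the checksum, which is why swapping the addresses forces a layer 4 checksum fix-up.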

Another pain point is that ICMPv6 != ICMP; types and codes are different, even for simple packet data like ping echo requests/replies. So if we’re converting ICMPv4 to ICMPv6 we will need to modify these fields too. And ICMPv4 does not use a pseudo-header, so we need to take that into account in checksum calculations.
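As a small illustration of the type difference, the sketch below maps the two echo types between ICMPv4 and ICMPv6. The numeric values are the assigned ones (ICMP echo request 8 and reply 0, ICMPv6 echo request 128 and reply 129); the SK_ prefixed names and the function icmp_echo_4to6() are hypothetical, chosen to avoid clashing with system headers.

```c
#include <assert.h>
#include <stdint.h>

/* Assigned type values for ICMP (RFC 792) and ICMPv6 (RFC 4443). */
enum {
        SK_ICMP_ECHOREPLY      = 0,
        SK_ICMP_ECHO           = 8,
        SK_ICMPV6_ECHO_REQUEST = 128,
        SK_ICMPV6_ECHO_REPLY   = 129,
};

/* Rewrite the type byte of a 2-byte type/code pair in place.
 * Returns 0 on success, or -1 (leaving the bytes untouched) for any
 * ICMPv4 type this sketch does not translate.
 */
int icmp_echo_4to6(uint8_t type_code[2])
{
        assert(type_code);

        switch (type_code[0]) {
        case SK_ICMP_ECHO:
                type_code[0] = SK_ICMPV6_ECHO_REQUEST;
                return 0;
        case SK_ICMP_ECHOREPLY:
                type_code[0] = SK_ICMPV6_ECHO_REPLY;
                return 0;
        default:
                return -1;
        }
}
```

For echo messages the code byte is 0 in both protocols, so only the type needs rewriting.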

This all seems kind of daunting, but the great news is that BPF provides helpers to do checksum calculations, convert IPv4 to IPv6 and vice versa, and so on.

Choosing our BPF program type

When we initially described the various program types in BPF, we talked about when the BPF program associated with the program type is run. For this case, we have two requirements:

We need to be able to run it on ingress for inbound traffic and on egress for outbound traffic.
It needs to process the packet on ingress prior to handing it off to the TCP/IP networking stack, and on egress prior to handing it to the driver for transmission.

There are a few options for us to choose from, but a “tc” bpf program makes most sense. tc supports symmetric (ingress and egress) program attach, and the advantage of using XDP – not having to allocate packet metadata – doesn’t really buy us much here, since we want to pass our packet upstream to the Linux TCP/IP stack. If we were doing some form of firewalling or DDoS mitigation where we were dropping a lot of the received packets, doing that without the overhead of skbuff packet metadata allocation in XDP is ideal.

Userspace interactions?

In the real world, you’d likely want to restrict such conversions to a specific IP address or port, so you could store those in a BPF hash map. In the case of our tests, we use a BPF array map to store test status for each test; this allows us to mark a test case failed from within our BPF program and to be able to pick that up in the userspace program that launches the test.

Beware of offload functionality!

If you are doing anything involving tunnel encapsulation/de-encapsulation, it can be difficult to get that functionality working with generic segmentation offload (GSO) and generic receive offload (GRO). As a reminder, GSO lets the stack pass a large packet down and defer segmenting it into individual under-MTU-sized packets until just before it reaches the device (or hand that work to the device itself where hardware offload is available). If we are prepending tunnel headers and so on, we may need to switch off such functionality, as we want each packet to have the tunnel header prepended. I haven’t had much luck with getting these offload features to work with BPF so I generally turn them off with ethtool, but your experience may be different.

Direct packet access versus bpf_skb_load/store_bytes

Initially the way to read and write packet data in BPF was to use bpf_skb_load_bytes() and bpf_skb_store_bytes(). These interfaces were useful because they handled cases where the packet is what is known as non-linear. This means that the buffers storing packet data are not contiguous. In general packet headers are in the linear portion of an sk_buff, but I’ve come across cases (in heavily encapsulated traffic for VMs) where header data falls into non-linear parts of packet data. For a review of how sk_buff data structures work, see David Miller’s “How SKBs work”:

http://vger.kernel.org/~davem/skb_data.html

Later, direct packet access was added to BPF, which meant we could use the __sk_buff “data” pointer to access packet data like a normal pointer. However, for safety BPF requires that we first test we have not reached the end of the linear portion of the packet (data_end). So most packet accesses have to be prefixed with checks for this condition. If we fall off the end of the packet we can explicitly call bpf_skb_pull_data() to request that the desired amount of data be in the linear portion.
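The check-before-access pattern can be tried out in ordinary userspace C. The sketch below is only an analogue of what a BPF program does with skb->data and skb->data_end: it refuses to read the EtherType unless the whole 14-byte Ethernet header lies inside the buffer. The names read_ethertype() and SK_ETH_HLEN are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_ETH_HLEN 14  /* dst MAC (6) + src MAC (6) + EtherType (2) */

/* Return the EtherType in host byte order, or -1 if the Ethernet header
 * does not fit within [data, data_end).
 */
int read_ethertype(const uint8_t *data, const uint8_t *data_end)
{
        assert(data && data_end && data <= data_end);

        /* In a BPF program this is written as
         *         if (data + sizeof(*eth) > data_end)
         * Comparing lengths instead keeps the userspace version strictly
         * valid C even for very short buffers.
         */
        if ((size_t)(data_end - data) < SK_ETH_HLEN)
                return -1;

        /* Safe: bytes 12 and 13 are known to be inside the buffer. */
        return (data[12] << 8) | data[13];
}
```

In BPF the verifier enforces this discipline: a load through the data pointer that is not covered by an earlier comparison against data_end is rejected at program load time.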

Writing our ingress filter

Our goal is to process an IPv4 inbound ICMPv4 echo request packet and convert it into ICMPv6. I’ve chosen ICMP because it’s harder to do than TCP or UDP – for those protocols, L4 checksum modification is done for the changed IP addresses only. For ICMPv4->ICMPv6 we also need to change ICMP type and take into account the fact that ICMPv6 has a pseudo header whereas ICMPv4 does not. So to adapt this example to TCP/UDP, you will just need to modify the checksum computations and the checksum offset.

Verify our packet is IPv4/ICMP

We define our ingress ELF section, and we use direct packet access (hence the initial checks) to ensure we’ve got an IPv4 (ETH_P_IP) packet, and moreover that it’s an ICMP echo request (ICMP_ECHO).

Note we could do an explicit bpf_skb_pull_data() for these cases, but since it’s unlikely that the first few bytes of the packet are non-linear we just pass such packets up to Linux intact (by returning TC_ACT_OK).

SEC("ipv4toipv6_ingress")
int ipv4toipv6_ingress(struct __sk_buff *skb)
{
        /* We use an icmp hdr for icmp6 because we only want type/code/check */
        struct icmphdr *icmph, icmp6h = { 0 };
        void *data_end = (void *)(long)skb->data_end;
        void *data = (void *)(long)skb->data;
        struct eth_hdr *eth = data, eth_copy;
        /* Locals used by the later steps of this handler. */
        __u32 oldsum, newsum;
        struct ipv6hdr ip6h;
        __u16 payload_len;
        struct iphdr *iph;
        int ret;

        /* Each direct packet access must be preceded by a data_end check. */
        if (data + sizeof(*eth) > data_end)
                return TC_ACT_OK;

        if (bpf_ntohs(eth->h_proto) != ETH_P_IP)
                return TC_ACT_OK;

        if (data + sizeof(*eth) + sizeof(*iph) > data_end)
                return TC_ACT_OK;

        iph = data + sizeof(*eth);
        if (iph->protocol != IPPROTO_ICMP)
                return TC_ACT_OK;

        if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*icmph) > data_end)
                return TC_ACT_OK;
        icmph = data + sizeof(*eth) + sizeof(*iph);
        if (icmph->type != ICMP_ECHO)
                return TC_ACT_OK;

Also note that if IP options were present, we’d need to adjust offsets accordingly, but we will keep things simple here.
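For readers who do want to handle options, the adjustment boils down to reading the IHL field rather than assuming sizeof(struct iphdr). The following is a userspace sketch under that assumption; ipv4_hdr_len() is a hypothetical helper, not part of the referenced test code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the IPv4 header length in bytes (20 with no options, up to 60
 * with options), or -1 if the buffer does not hold a plausible header.
 * 'avail' is the number of readable bytes starting at 'ip'.
 */
int ipv4_hdr_len(const uint8_t *ip, size_t avail)
{
        unsigned int ihl;

        assert(ip);

        if (avail < 20)
                return -1;
        if ((ip[0] >> 4) != 4)          /* version nibble */
                return -1;

        ihl = ip[0] & 0x0f;             /* header length in 32-bit words */
        if (ihl < 5 || (size_t)ihl * 4 > avail)
                return -1;

        return (int)(ihl * 4);
}
```

The layer 4 header then starts at the Ethernet header length plus this value, which is the same quantity the handler later computes as iph->ihl << 2 when deriving the payload length.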

Copy our ethernet header, extract needed info from IP header

When we convert from IPv4 to IPv6, we need 20 bytes extra space for the IPv6 header. The bpf helper bpf_skb_change_proto() will reserve extra headroom in the sk_buff for us to do this, but at the cost of overwriting the existing ethernet header. So let’s copy that out and modify the protocol to ETH_P_IPV6.

        /* Copy original ethernet header, as it must be moved. */
        ret = bpf_skb_load_bytes(skb, 0, &eth_copy, sizeof(eth_copy));
        if (ret) {
                bpf_debug("bpf_skb_load_bytes returned %d\n", ret);
                return TC_ACT_OK;
        }
        eth_copy.h_proto = bpf_htons(ETH_P_IPV6);

        /* IPv6 payload len does not include header len. */
        payload_len = bpf_ntohs(iph->tot_len) - (iph->ihl << 2);

Construct our ICMPv6 and IPv6 headers

Here we use hardcoded IPv6 addresses along with a simple __always_inline function to set 4 32-bit values comprising an IPv6 address:

static __always_inline void ipv6_addr_set(struct in6_addr *addr,
                                          __be32 w1, __be32 w2,
                                          __be32 w3, __be32 w4)
{
        addr->in6_u.u6_addr32[0] = w1;
        addr->in6_u.u6_addr32[1] = w2;
        addr->in6_u.u6_addr32[2] = w3;
        addr->in6_u.u6_addr32[3] = w4;
}

The “__always_inline” is needed to ensure the function gets into our ingress ELF section.
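One detail worth keeping in mind with a helper like this is byte order: the four words are stored as-is, so they must already be in network byte order. The userspace sketch below mirrors the idea and prints the result with inet_ntop(); the names addr6_from_words() and addr6_to_text() and the 2001:db8:: documentation prefix are illustrative assumptions, not values from the test code.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Userspace analogue of ipv6_addr_set(): copy four 32-bit words, which
 * must already be in network byte order, into the 16 address bytes.
 */
void addr6_from_words(struct in6_addr *addr,
                      uint32_t w1, uint32_t w2, uint32_t w3, uint32_t w4)
{
        uint32_t words[4] = { w1, w2, w3, w4 };

        assert(addr);
        memcpy(addr->s6_addr, words, sizeof(words));
}

/* Format the address as text; returns buf, or NULL on failure. */
const char *addr6_to_text(const struct in6_addr *addr, char *buf, size_t len)
{
        assert(addr && buf);
        return inet_ntop(AF_INET6, addr, buf, (socklen_t)len);
}
```

Calling addr6_from_words(&a, htonl(0x20010db8), 0, 0, htonl(2)) yields the address 2001:db8::2; passing the words without htonl() on a little-endian machine would scramble the bytes within each word.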

Back to our ingress handler:

        /* Time to construct ICMPv6 header. */
        icmp6h.type = ICMPV6_ECHO_REQUEST;
        icmp6h.code = icmph->code;

        /* Time to construct IPv6 header and copy it. */
        __builtin_memset(&ip6h, 0, sizeof(ip6h));
        ip6h.version = 6;
        ip6h.payload_len = bpf_htons(payload_len);
        ip6h.nexthdr = IPPROTO_ICMPV6;
        ip6h.hop_limit = 8;
        ipv6_addr_set(&ip6h.saddr, BPF_HELPER_IPV6_PREFIX, 0, 0,
                      BPF_HELPER_IPV6_REMOTE_SUFFIX);
        ipv6_addr_set(&ip6h.daddr, BPF_HELPER_IPV6_PREFIX, 0, 0,
                      BPF_HELPER_IPV6_LOCAL_SUFFIX);

Calculate value for ICMPv6 checksum

Internet checksums have some really nice mathematical properties. One key property is that if a field covered by the checksum changes, we can recalculate the checksum without traversing all of the data again, provided we know the old and new values. We take advantage of that here. In moving from ICMP over IPv4 to ICMPv6 over IPv6, we need to add a pseudo-header to the checksum, which means summing over the IPv6 addresses, the payload length and the protocol (IPPROTO_ICMPV6). We also need to account for the difference between the old ICMP type (ICMP_ECHO) and its ICMPv6 equivalent (ICMPV6_ECHO_REQUEST).
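The incremental property can be checked in a few lines of userspace C. This sketch follows the update rule from RFC 1624 (HC' = ~(~HC + ~m + m')) for a single changed 16-bit word and compares it against recomputing the checksum from scratch. The function names are hypothetical, and everything is done in explicit big-endian arithmetic so the result does not depend on host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into 16 bits using end-around carry. */
static uint16_t csum_fold(uint32_t sum)
{
        while (sum >> 16)
                sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)sum;
}

/* Checksum over len bytes (len must be even), read as big-endian 16-bit
 * words.  Any checksum field inside the buffer must be zero beforehand.
 */
uint16_t csum_full(const uint8_t *p, size_t len)
{
        uint32_t sum = 0;

        assert(p && len % 2 == 0);
        for (size_t i = 0; i < len; i += 2)
                sum += ((uint32_t)p[i] << 8) | p[i + 1];
        return (uint16_t)~csum_fold(sum);
}

/* RFC 1624: update 'check' after one covered word changes from
 * old_word to new_word, without looking at the rest of the data.
 */
uint16_t csum_update16(uint16_t check, uint16_t old_word, uint16_t new_word)
{
        uint32_t sum = (uint16_t)~check;

        sum += (uint16_t)~old_word;
        sum += new_word;
        return (uint16_t)~csum_fold(sum);
}
```

For example, lowering the TTL in an IPv4 header changes one 16-bit word (TTL and protocol), and csum_update16() gives the same answer as csum_full() over the modified header while touching only three values.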

We need a function to generate the sum of 16-bit values; so we use Clang’s loop-unrolling feature to define sum16():

static __always_inline __u32 sum16(__u16 *addr, __u8 len)
{
        __u32 sum = 0;
        int i;

#pragma clang loop unroll(full)
        for (i = 0; i < len; i++)
                sum += *addr++;

        return sum;
}

…and then use it to sum up the checksum value changes in adding the pseudo-header and modifying the ICMP type values:

        /* Fix up our checksum. Source/destination addresses have changed, and
         * so has ICMP type.  Note that ICMPv6 also has a pseudo-header, so
         * we also need to add payload length and ICMPv6 protocol to newsum,
         * but do not add IPv4 equivalents to oldsum because ICMPv4 does not
         * use a pseudo-header in checksum calculation.  Only thing that changes
         * for oldsum is ICMP type.
         */
        oldsum = icmph->type;
        newsum = sum16((__u16 *)&ip6h.saddr, sizeof(ip6h.saddr) >> 1);
        newsum += sum16((__u16 *)&ip6h.daddr, sizeof(ip6h.daddr) >> 1);
        newsum += icmp6h.type + bpf_htons(payload_len) +
                  bpf_htons(IPPROTO_ICMPV6);

Later we will use these values to modify the checksum.
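To convince ourselves that this old/new bookkeeping is right, here is a userspace sketch that starts from a valid ICMPv4 echo request checksum, swaps the type word and adds the pseudo-header sum, then compares the result with an ICMPv6 checksum computed from scratch. It is not the BPF code: the function names, the 2001:db8:: addresses and the 8-byte echo message are assumptions for illustration, and the arithmetic uses explicit big-endian words rather than the raw in-memory values the BPF program sums.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum a buffer as big-endian 16-bit words (len must be even). */
static uint32_t sum_be16(const uint8_t *p, size_t len)
{
        uint32_t sum = 0;

        assert(p && len % 2 == 0);
        for (size_t i = 0; i < len; i += 2)
                sum += ((uint32_t)p[i] << 8) | p[i + 1];
        return sum;
}

/* Fold a 32-bit accumulator into 16 bits using end-around carry. */
static uint16_t fold(uint32_t sum)
{
        while (sum >> 16)
                sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)sum;
}

/* Unfolded sum of the IPv6 pseudo-header fields. */
uint32_t pseudo6_sum(const uint8_t src[16], const uint8_t dst[16],
                     uint16_t payload_len, uint8_t next_hdr)
{
        return sum_be16(src, 16) + sum_be16(dst, 16) + payload_len + next_hdr;
}

/* From scratch: msg must have its checksum field zeroed.  Pass
 * pseudo = 0 for ICMPv4, which has no pseudo-header.
 */
uint16_t icmp_csum_full(const uint8_t *msg, size_t len, uint32_t pseudo)
{
        return (uint16_t)~fold(sum_be16(msg, len) + pseudo);
}

/* Incremental: take the existing ICMPv4 checksum, replace the old
 * type/code word with the new one, and add the pseudo-header sum.
 */
uint16_t icmp6_csum_incremental(uint16_t icmp4_check, uint8_t old_type,
                                uint8_t new_type, uint8_t code,
                                uint32_t pseudo)
{
        uint16_t old_word = (uint16_t)((old_type << 8) | code);
        uint16_t new_word = (uint16_t)((new_type << 8) | code);
        uint32_t sum = (uint16_t)~icmp4_check;

        sum += (uint16_t)~old_word;
        sum += new_word;
        sum += pseudo;
        return (uint16_t)~fold(sum);
}
```

Nothing from the IPv4 side other than the type appears in the "old" half of the calculation, matching the comment in the handler: ICMPv4 never included addresses in its checksum, so there is nothing to subtract.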

Change from IPv4 -> IPv6 and store our new ethernet, IPv6 and ICMPv6 data

We also update the checksum via bpf_l4_csum_replace(), specifying our oldsum and newsum values from above:

        /* Convert skb to IPv6 and adjust headroom to allow for space for
         * IPv6 header.
         */
        ret = bpf_skb_change_proto(skb, bpf_htons(ETH_P_IPV6), 0);
        if (ret) {
                bpf_debug("bpf_skb_change_proto returned %d\n", ret);
                return TC_ACT_OK;
        }
        /* Store our copied ethernet header at new start of packet. */
        ret = bpf_skb_store_bytes(skb, 0, &eth_copy, sizeof(eth_copy), 0);
        if (ret) {
                bpf_debug("bpf_skb_store_bytes returned %d\n", ret);
                return TC_ACT_SHOT;
        }
        /* Store our IPv6 header after the copied ether header.  Offsets use
         * sizeof(eth_copy), the size of the header itself; eth is a pointer,
         * so sizeof(eth) would give the pointer size instead.
         */
        ret = bpf_skb_store_bytes(skb, sizeof(eth_copy), &ip6h, sizeof(ip6h),
                                  0);
        if (ret) {
                bpf_debug("bpf_skb_store_bytes returned %d\n", ret);
                return TC_ACT_SHOT;
        }
        /* Only two bytes type/code change */
        ret = bpf_skb_store_bytes(skb, sizeof(eth_copy) + sizeof(ip6h),
                                  &icmp6h, 2, 0);
        if (ret) {
                bpf_debug("bpf_skb_store_bytes returned %d\n", ret);
                return TC_ACT_SHOT;
        }
        /* Lastly, recompute L4 checksum. */
        ret = bpf_l4_csum_replace(skb, sizeof(eth_copy) + sizeof(ip6h) +
                                  offsetof(struct icmphdr, checksum),
                                  oldsum, newsum,
                                  BPF_F_PSEUDO_HDR | sizeof(newsum));
        if (ret) {
                bpf_debug("bpf_l4_csum_replace returned %d\n", ret);
                return TC_ACT_SHOT;
        }

Note that in failure cases, we return TC_ACT_SHOT since we’ve modified the packet in bpf_skb_change_proto() such that it’s not in a proper state if something goes wrong.

Writing our egress filter

This is mostly reversing the above, with the caveat that we need to calculate the IPv4 checksum. Again see the referenced example for a fully worked-out version:

https://github.com/alan-maguire/bpf-test/blob/master/bpf/test_bpf_helper_bpf_skb_change_proto_kern.c
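The IPv4 header checksum that the egress path has to produce is simple enough to sketch in userspace C. The version below fills in and verifies the checksum field of a raw header; it illustrates the calculation only and is not how the referenced BPF program does it. The names ipv4_set_checksum() and ipv4_checksum_ok() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of the header (length taken from IHL), read as
 * big-endian 16-bit words and folded to 16 bits.
 */
static uint16_t ipv4_hdr_sum(const uint8_t *hdr)
{
        size_t len = (size_t)(hdr[0] & 0x0f) * 4;
        uint32_t sum = 0;

        for (size_t i = 0; i < len; i += 2)
                sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
        while (sum >> 16)
                sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)sum;
}

/* Zero the checksum field (bytes 10-11), then store the complement of
 * the sum over the rest of the header.
 */
void ipv4_set_checksum(uint8_t *hdr)
{
        uint16_t check;

        assert(hdr && (hdr[0] & 0x0f) >= 5);
        hdr[10] = hdr[11] = 0;
        check = (uint16_t)~ipv4_hdr_sum(hdr);
        hdr[10] = (uint8_t)(check >> 8);
        hdr[11] = (uint8_t)check;
}

/* A header carrying a valid checksum sums to 0xffff. */
int ipv4_checksum_ok(const uint8_t *hdr)
{
        assert(hdr && (hdr[0] & 0x0f) >= 5);
        return ipv4_hdr_sum(hdr) == 0xffff;
}
```

Because the egress filter builds a brand new IPv4 header, there is no previous checksum to update incrementally, so the sum has to run over the whole 20 bytes.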

Conclusion

BPF is an extremely flexible environment in which to do packet processing. We didn’t touch on encapsulation/de-encapsulation here, but we can handle cases like that with the helper bpf_skb_adjust_room() to add/remove headroom in a packet. Hopefully the above demonstrates that we can do some interesting things in BPF!

Be sure to visit the previous installments of this series on BPF, and stay tuned for our next blog posts!

1. BPF program types
2. BPF helper functions for those programs
3. BPF userspace communication
4. BPF program build environment
5. BPF bytecodes and verifier
6. BPF Packet Transformation
