The Power of XDP

Oracle Linux kernel developer Alan Maguire talks about XDP, the eXpress Data Path, which uses BPF to accelerate packet processing. For more background on BPF, see his earlier BPF series, an in-depth look at the kernel's "Berkeley Packet Filter" -- a useful and extensible kernel facility for much more than packet filtering.

[Important note: the BPF blog series referred to BPF functionality available in the 4.14 kernel. The functionality described here is for the most part present in that kernel also, but a few of the libbpf functions used in the example program and the layout of the xdp_md metadata structure have changed, and here we refer to the up-to-date (as of the 5.2 kernel) versions.]

In previous blog entries I gave a general description of BPF and applied BPF concepts to building tc-bpf programs. Such programs are attached to tc ingress and egress hooks, where they can carry out packet transformation and other activities.

However, that processing happens after the packet metadata - in Linux, a "struct sk_buff" - has been allocated, which suggests that BPF could usefully operate at earlier intervention points in the receive path.

The goal of XDP is to offer comparable performance to kernel bypass solutions while working with the existing kernel networking stack. For example, we may drop or forward packets directly using XDP, or perhaps simply pass them through the network stack for normal processing.

XDP metadata

As mentioned in the first article of the BPF series, XDP allows us to attach BPF programs early in packet receive codepaths. A key focus of the design is to minimize overheads, so each packet uses a minimal metadata descriptor:

/* user accessible metadata for XDP packet hook
 * new fields must be added to the end of this structure
 */
struct xdp_md {
        __u32 data;
        __u32 data_end;
        __u32 data_meta;
        /* Below access go through struct xdp_rxq_info */
        __u32 ingress_ifindex; /* rxq->dev->ifindex */
        __u32 rx_queue_index;  /* rxq->queue_index  */
};

Contrast this to the struct sk_buff definition as described here:

https://www.netdevconf.org/2.2/slides/miller-datastructurebloat-keynote.pdf

Each sk_buff requires an allocation of at least 216 bytes of metadata. This translates into observable performance costs.
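To see how a program consumes this minimal descriptor, consider the following sketch (a hypothetical program for illustration, not from the kernel tree). The data and data_end values are converted to pointers, and the verifier insists that every packet access be explicitly bounds-checked against data_end before it happens:

/* Minimal sketch of packet access from XDP; hypothetical program,
 * not part of the kernel tree.
 */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include "bpf_helpers.h"

SEC("xdp")
int xdp_parse_eth(struct xdp_md *ctx)
{
        void *data_end = (void *)(long)ctx->data_end;
        void *data = (void *)(long)ctx->data;
        struct ethhdr *eth = data;

        /* The verifier rejects the program unless we prove the
         * Ethernet header lies within [data, data_end).
         */
        if (data + sizeof(*eth) > data_end)
                return XDP_PASS;

        /* eth->h_proto etc. can now be dereferenced safely. */
        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";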

XDP program execution

XDP comes in two flavours:

  • native XDP requires driver support, and packets are processed before sk_buffs are allocated, allowing us to realize the benefits of the minimal metadata descriptor. The hook comprises a call to bpf_prog_run_xdp, and after calling this function the driver must handle the possible return values - see below for a description of these. As an example, the bnxt driver's bnxt_rx_pkt function calls bnxt_rx_xdp, which verifies whether an XDP program has been loaded for the RX ring, and if so sets up the metadata buffer and calls bpf_prog_run_xdp. bnxt_rx_pkt is called directly from the device polling functions, which run via net_rx_action for both interrupt processing and polling; in short, we get our hands on the packet as early as possible in the receive codepath.

  • generic XDP, where the XDP hooks are called from within the networking stack after the sk_buff has been allocated. Generic XDP gives us the benefits of XDP - though at a slightly higher performance cost - without underlying driver support. In this case bpf_prog_run_xdp is called via netdev's netif_receive_generic_xdp function, i.e. after the skb has been allocated and set up. To ensure that XDP processing works, the skb has to be linearized (made contiguous rather than chunked in data fragments) - again this can cost performance. A sketch of how the flavour is chosen at attach time follows this list.
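The flavour is chosen when attaching the program. As a sketch, using the libbpf API of the 5.2 era (the same bpf_set_link_xdp_fd call used by xdping below; the attach_xdp wrapper is our own), we can request native mode and fall back to generic mode if the driver lacks support:

#include <bpf/libbpf.h>
#include <linux/if_link.h>

/* Sketch: try native (driver) XDP first, fall back to generic. */
int attach_xdp(int ifindex, int prog_fd)
{
        if (!bpf_set_link_xdp_fd(ifindex, prog_fd, XDP_FLAGS_DRV_MODE))
                return 0;
        return bpf_set_link_xdp_fd(ifindex, prog_fd, XDP_FLAGS_SKB_MODE);
}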

XDP actions

XDP programs signal the desired behaviour by returning one of the following actions (a small filtering sketch follows the list):

  • XDP_DROP: drop the packet; drops with XDP are fast, since the buffer is simply recycled to the RX ring queue
  • XDP_PASS: pass the packet to the normal networking stack, possibly after modification
  • XDP_TX: send the packet back out of the NIC it arrived on, after modifying it
  • XDP_REDIRECT: redirect the ingress frame to another XDP-enabled netdev
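As a taste of how these verdicts combine, here is a hedged sketch of a tiny DDoS-style filter - a hypothetical program, not part of xdping, with port 7777 an arbitrary choice - that drops UDP traffic to one port and passes everything else:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <linux/in.h>
#include "bpf_helpers.h"
#include "bpf_endian.h"

/* Hypothetical filter: drop UDP packets to port 7777, pass the rest.
 * For simplicity we assume no IP options, i.e. ihl == 5.
 */
SEC("xdp")
int xdp_drop_udp7777(struct xdp_md *ctx)
{
        void *data_end = (void *)(long)ctx->data_end;
        void *data = (void *)(long)ctx->data;
        struct ethhdr *eth = data;
        struct iphdr *iph;
        struct udphdr *udph;

        /* Bounds-check the full eth + IP + UDP header chain up front. */
        if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udph) > data_end)
                return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
                return XDP_PASS;
        iph = data + sizeof(*eth);
        if (iph->protocol != IPPROTO_UDP)
                return XDP_PASS;
        udph = data + sizeof(*eth) + sizeof(*iph);
        if (udph->dest == bpf_htons(7777))
                return XDP_DROP;

        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";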

Adding support for XDP to a driver requires adding a receive hook which calls bpf_prog_run_xdp and handles the various outcomes, and adding setup/teardown functions which dedicate buffer rings to XDP.
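To make the driver side concrete, here is a schematic sketch of such a hook. It is not taken from any particular driver; struct my_rx_ring, my_xmit_back and my_recycle_buffer are hypothetical stand-ins for driver-specific machinery, while bpf_prog_run_xdp, xdp_do_redirect and the tracing helpers are the real kernel APIs:

#include <linux/netdevice.h>
#include <linux/filter.h>
#include <linux/bpf_trace.h>

/* Hypothetical ring structure standing in for driver specifics. */
struct my_rx_ring {
        struct bpf_prog *xdp_prog;
        struct net_device *netdev;
};

void my_xmit_back(struct my_rx_ring *ring, struct xdp_buff *xdp);
void my_recycle_buffer(struct my_rx_ring *ring, struct xdp_buff *xdp);

/* Schematic driver RX hook; returns true if XDP consumed the packet. */
static bool my_run_xdp(struct my_rx_ring *ring, struct xdp_buff *xdp)
{
        struct bpf_prog *prog = READ_ONCE(ring->xdp_prog);
        u32 act;

        if (!prog)
                return false;                   /* no XDP program attached */

        act = bpf_prog_run_xdp(prog, xdp);
        switch (act) {
        case XDP_PASS:
                return false;                   /* build an skb, up the stack */
        case XDP_TX:
                my_xmit_back(ring, xdp);        /* out the same NIC */
                return true;
        case XDP_REDIRECT:
                xdp_do_redirect(ring->netdev, xdp, prog);
                return true;
        default:
                bpf_warn_invalid_xdp_action(act);
                /* fall through to drop */
        case XDP_DROP:
                my_recycle_buffer(ring, xdp);   /* recycle to the RX ring */
                return true;
        }
}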

An example - xdping

From the above set of actions, and the desire to minimize per-packet overhead, we can see that use cases such as Distributed Denial of Service mitigation and load balancing make sense. To help illustrate the key concepts in XDP, here we present a fully-worked example of our own. This example is available in recent bpf-next kernels; see

https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/xdping.c

...for the userspace program;

https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/xdping.h

...for the shared header; and

https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/xdping_kern.c

...for the BPF program.

xdping is a C program that uses XDP, BPF maps and the ping program to measure round-trip times (RTT) in a similar manner to ping. With xdping, however, we measure round-trip time from XDP itself, instead of invoking all the additional layers of IP, ICMP and user-space-to-kernel interaction. The idea is that by presenting round-trip times as measured in XDP versus those measured via a traditional ping, we can

  • see how much processing traffic in XDP directly can save us in terms of response latency
  • eliminate variations in RTT due to the additional processing layers

xdping can operate in either client or server mode.

  • As a client, it is responsible for generating ICMP requests and receiving ICMP replies, measuring the RTT and saving the results in a BPF map. It does this by receiving a ping-generated ICMP reply, turning it back into an ICMP request, noting the time and sending it out again. When the reply to that request arrives, the RTT can be calculated.
  • As a server, it is responsible for receiving ICMP requests and turning them back into replies, which are sent back out.

Note that the above approach is necessary because XDP is receive-driven; i.e. the XDP hooks are in the receive codepaths. With AF_XDP - the topic of our next XDP blog entry - transmission is also possible, but here we stick to core XDP.

Let's see what the program looks like!

# ./xdping -I eth4 192.168.55.7
Setting up xdp for eth4, please wait...
Normal ping RTT data:
PING 192.168.55.7 (192.168.55.7) from 192.168.55.8 eth4: 56(84) bytes of data.
64 bytes from 192.168.55.7: icmp_seq=1 ttl=64 time=0.206 ms
64 bytes from 192.168.55.7: icmp_seq=2 ttl=64 time=0.165 ms
64 bytes from 192.168.55.7: icmp_seq=3 ttl=64 time=0.162 ms
64 bytes from 192.168.55.7: icmp_seq=8 ttl=64 time=0.470 ms
--- 192.168.55.7 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3065ms
rtt min/avg/max/mdev = 0.162/0.250/0.470/0.129 ms
XDP RTT data:
64 bytes from 192.168.55.7: icmp_seq=5 ttl=64 time=0.03003 ms
64 bytes from 192.168.55.7: icmp_seq=6 ttl=64 time=0.02665 ms
64 bytes from 192.168.55.7: icmp_seq=7 ttl=64 time=0.02453 ms
64 bytes from 192.168.55.7: icmp_seq=8 ttl=64 time=0.02633 ms

Note that - unlike with ping, where it is optional - we must specify an interface to ping from; we need to know where to load the XDP program. Note also that the RTT measurements from XDP are significantly quicker than those reported by ping. Now, ping has support for timestamping, where the network stack processing can use IP timestamps to get more accurate numbers, but not all systems have timestamping enabled.

Finally, notice one other thing: each ICMP echo packet has an associated sequence number, and we see these reported in the ping output. However, the final sequence number is icmp_seq=8, not 4 as we might expect. This is because our XDP program took that 4th reply, rewrote it as a request with sequence number 5 and sent it out. When it got that reply and measured the RTT, it did the same again for sequence number 6, and so on until it received the 8th reply, realized it had all the measurements it needed (by default we do 4 requests; that can be changed with the "-c count" option to xdping) and, instead of returning XDP_TX ("send out this modified packet"), the program returned XDP_PASS ("pass this packet to the networking stack"). So the ping program finally sees ICMP reply number 8, hence the output.

To store RTTs we need a common data structure which we will keep in a BPF map, keyed by the target (remote) IP address. xdping.h defines this structure and is included by both the userspace and kernel programs:

/* SPDX-License-Identifier: GPL-2.0 */
/* Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. */

#define XDPING_MAX_COUNT        10
#define XDPING_DEFAULT_COUNT    4

struct pinginfo {
        __u64   start;
        __be16  seq;
        __u16   count;
        __u32   pad;
        __u64   times[XDPING_MAX_COUNT];
};

We store the number of ICMP requests to make ("count"), the start time for the current request ("start"), the current sequence number ("seq") and the RTTs ("times").
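In xdping_kern.c, the map itself is declared using the struct bpf_map_def convention of the 5.2-era selftests, keyed by the remote IPv4 address; it looks along these lines:

struct bpf_map_def SEC("maps") ping_map = {
        .type = BPF_MAP_TYPE_HASH,
        .key_size = sizeof(__u32),              /* remote IPv4 address */
        .value_size = sizeof(struct pinginfo),
        .max_entries = 64,
};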

Next, here is the implementation of the ping client code for the BPF program, xdping_kern.c:

SEC("xdpclient")
int xdping_client(struct xdp_md *ctx)
{
        void *data_end = (void *)(long)ctx->data_end;
        void *data = (void *)(long)ctx->data;
        struct pinginfo *pinginfo = NULL;
        struct ethhdr *eth = data;
        struct icmphdr *icmph;
        struct iphdr *iph;
        __u64 recvtime;
        __be32 raddr;
        __be16 seq;
        int ret;
        __u8 i;

        ret = icmp_check(ctx, ICMP_ECHOREPLY);

        if (ret != XDP_TX)
                return ret;

        iph = data + sizeof(*eth);
        icmph = data + sizeof(*eth) + sizeof(*iph);
        raddr = iph->saddr;

        /* Record time reply received. */
        recvtime = bpf_ktime_get_ns();
        pinginfo = bpf_map_lookup_elem(&ping_map, &raddr);
        if (!pinginfo || pinginfo->seq != icmph->un.echo.sequence)
                return XDP_PASS;

        if (pinginfo->start) {
#pragma clang loop unroll(full)
                for (i = 0; i < XDPING_MAX_COUNT; i++) {
                        if (pinginfo->times[i] == 0)
                                break;
                }
                /* verifier is fussy here... */
                if (i < XDPING_MAX_COUNT) {
                        pinginfo->times[i] = recvtime -
                                             pinginfo->start;
                        pinginfo->start = 0;
                        i++;
                }
                /* No more space for values? */
                if (i == pinginfo->count || i == XDPING_MAX_COUNT)
                        return XDP_PASS;
        }

        /* Now convert reply back into echo request. */
        swap_src_dst_mac(data);
        iph->saddr = iph->daddr;
        iph->daddr = raddr;
        icmph->type = ICMP_ECHO;
        seq = bpf_htons(bpf_ntohs(icmph->un.echo.sequence) + 1);
        icmph->un.echo.sequence = seq;
        icmph->checksum = 0;
        icmph->checksum = ipv4_csum(icmph, ICMP_ECHO_LEN);

        pinginfo->seq = seq;
        pinginfo->start = bpf_ktime_get_ns();

        return XDP_TX;
}
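The program leans on a few helpers defined earlier in xdping_kern.c: swap_src_dst_mac() swaps the Ethernet source and destination addresses, and ipv4_csum() recomputes the ICMP checksum. The key one is icmp_check(), which validates that the packet really is the ICMP type we expect; condensed, it looks along these lines:

static __always_inline int icmp_check(struct xdp_md *ctx, int type)
{
        void *data_end = (void *)(long)ctx->data_end;
        void *data = (void *)(long)ctx->data;
        struct ethhdr *eth = data;
        struct icmphdr *icmph;
        struct iphdr *iph;

        /* Bounds-check the full eth + IP + ICMP header chain. */
        if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*icmph) > data_end)
                return XDP_PASS;

        if (eth->h_proto != bpf_htons(ETH_P_IP))
                return XDP_PASS;

        iph = data + sizeof(*eth);
        if (iph->protocol != IPPROTO_ICMP)
                return XDP_PASS;

        icmph = data + sizeof(*eth) + sizeof(*iph);
        if (icmph->type != type)
                return XDP_PASS;

        /* Looks like the ICMP packet we want; caller may transmit. */
        return XDP_TX;
}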

In the full program, there are two ELF sections: one for client mode (turn replies into requests and send them out, measuring RTT), and one for the server (turn requests into replies and send them out). The server side is simpler since it need not track state; a condensed version is shown below.
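SEC("xdpserver")
int xdping_server(struct xdp_md *ctx)
{
        void *data_end = (void *)(long)ctx->data_end;
        void *data = (void *)(long)ctx->data;
        struct ethhdr *eth = data;
        struct icmphdr *icmph;
        struct iphdr *iph;
        __be32 raddr;
        int ret;

        ret = icmp_check(ctx, ICMP_ECHO);
        if (ret != XDP_TX)
                return ret;

        iph = data + sizeof(*eth);
        icmph = data + sizeof(*eth) + sizeof(*iph);
        raddr = iph->saddr;

        /* Convert the echo request into an echo reply. */
        swap_src_dst_mac(data);
        iph->saddr = iph->daddr;
        iph->daddr = raddr;
        icmph->type = ICMP_ECHOREPLY;
        icmph->checksum = 0;
        icmph->checksum = ipv4_csum(icmph, ICMP_ECHO_LEN);

        return XDP_TX;
}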

Finally, the user-space program loads the XDP program, initializes the map used by it and kicks off the ping. Here is the main() function that sets up XDP and runs the ping:

int main(int argc, char **argv)
{
        __u32 mode_flags = XDP_FLAGS_DRV_MODE | XDP_FLAGS_SKB_MODE;
        struct addrinfo *a, hints = { .ai_family = AF_INET };
        struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
        __u16 count = XDPING_DEFAULT_COUNT;
        struct pinginfo pinginfo = { 0 };
        const char *optstr = "c:I:NsS";
        struct bpf_program *main_prog;
        int prog_fd = -1, map_fd = -1;
        struct sockaddr_in rin;
        struct bpf_object *obj;
        struct bpf_map *map;
        char *ifname = NULL;
        char filename[256];
        int opt, ret = 1;
        __u32 raddr = 0;
        int server = 0;
        char cmd[256];

        while ((opt = getopt(argc, argv, optstr)) != -1) {
                switch (opt) {
                case 'c':
                        count = atoi(optarg);
                        if (count < 1 || count > XDPING_MAX_COUNT) {
                                fprintf(stderr,
                                        "min count is 1, max count is %d\n",
                                        XDPING_MAX_COUNT);
                                return 1;
                        }
                        break;
                case 'I':
                        ifname = optarg;
                        ifindex = if_nametoindex(ifname);
                        if (!ifindex) {
                                fprintf(stderr, "Could not get interface %s\n",
                                        ifname);
                                return 1;
                        }
                        break;
                case 'N':
                        xdp_flags |= XDP_FLAGS_DRV_MODE;
                        break;
                case 's':
                        /* use server program */
                        server = 1;
                        break;
                case 'S':
                        xdp_flags |= XDP_FLAGS_SKB_MODE;
                        break;
                default:
                        show_usage(basename(argv[0]));
                        return 1;
                }
        }

        if (!ifname) {
                show_usage(basename(argv[0]));
                return 1;
        }
        if (!server && optind == argc) {
                show_usage(basename(argv[0]));
                return 1;
        }

        if ((xdp_flags & mode_flags) == mode_flags) {
                fprintf(stderr, "-N or -S can be specified, not both.\n");
                show_usage(basename(argv[0]));
                return 1;
        }

        if (!server) {
                /* Only supports IPv4; see hints initialization above. */
                if (getaddrinfo(argv[optind], NULL, &hints, &a) || !a) {
                        fprintf(stderr, "Could not resolve %s\n", argv[optind]);
                        return 1;
                }
                memcpy(&rin, a->ai_addr, sizeof(rin));
                raddr = rin.sin_addr.s_addr;
                freeaddrinfo(a);
        }

        if (setrlimit(RLIMIT_MEMLOCK, &r)) {
                perror("setrlimit(RLIMIT_MEMLOCK)");
                return 1;
        }

        snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);

        if (bpf_prog_load(filename, BPF_PROG_TYPE_XDP, &obj, &prog_fd)) {
                fprintf(stderr, "load of %s failed\n", filename);
                return 1;
        }

        main_prog = bpf_object__find_program_by_title(obj,
                                                      server ? "xdpserver" :
                                                               "xdpclient");
        if (main_prog)
                prog_fd = bpf_program__fd(main_prog);
        if (!main_prog || prog_fd < 0) {
                fprintf(stderr, "could not find xdping program");
                return 1;
        }

        map = bpf_map__next(NULL, obj);
        if (map)
                map_fd = bpf_map__fd(map);
        if (!map || map_fd < 0) {
                fprintf(stderr, "Could not find ping map");
                goto done;
        }

        signal(SIGINT, cleanup);
        signal(SIGTERM, cleanup);

        printf("Setting up XDP for %s, please wait...\n", ifname);

        printf("XDP setup disrupts network connectivity, hit Ctrl+C to quit\n");

        if (bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags) < 0) {
                fprintf(stderr, "Link set xdp fd failed for %s\n", ifname);
                goto done;
        }

        if (server) {
                close(prog_fd);
                close(map_fd);
                printf("Running server on %s; press Ctrl+C to exit...\n",
                       ifname);
                do { } while (1);
        }
        /* Start xdping-ing from last regular ping reply, e.g. for a count
         * of 10 ICMP requests, we start xdping-ing using reply with seq number
         * 10.  The reason the last "real" ping RTT is much higher is that
         * the ping program sees the ICMP reply associated with the last
         * XDP-generated packet, so ping doesn't get a reply until XDP is done.
         */
        pinginfo.seq = htons(count);
        pinginfo.count = count;

        if (bpf_map_update_elem(map_fd, &raddr, &pinginfo, BPF_ANY)) {
                fprintf(stderr, "could not communicate with BPF map: %s\n",
                        strerror(errno));
                cleanup(0);
                goto done;
        }

        /* We need to wait for XDP setup to complete. */
        sleep(10);

        snprintf(cmd, sizeof(cmd), "ping -c %d -I %s %s",
                 count, ifname, argv[optind]);

        printf("\nNormal ping RTT data\n");
        printf("[Ignore final RTT; it is distorted by XDP using the reply]\n");

        ret = system(cmd);

        if (!ret)
                ret = get_stats(map_fd, count, raddr);

        cleanup(0);

done:
        if (prog_fd > 0)
                close(prog_fd);
        if (map_fd > 0)
                close(map_fd);

        return ret;
}
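For completeness, the cleanup() function registered for SIGINT/SIGTERM (and called on the normal exit path) detaches the XDP program by installing fd -1 on the interface; condensed, it looks along these lines:

static void cleanup(int sig)
{
        /* Installing fd -1 removes the attached XDP program. */
        bpf_set_link_xdp_fd(ifindex, -1, xdp_flags);
        if (sig)
                exit(1);
}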

Conclusion

We've talked about XDP programs: where they run and what they can do, and we've provided a code example. I hope this inspires you to play around with XDP! Next time we'll cover AF_XDP, a new socket type which uses XDP to support a more complete range of kernel-bypass functionality.

Be sure to visit our series on BPF, and stay tuned for our next blog posts!

  1. BPF program types
  2. BPF helper functions for those programs
  3. BPF userspace communication
  4. BPF program build environment
  5. BPF bytecodes and verifier
  6. BPF Packet Transformation
