BPF has evolved to support multiple use cases in the kernel, including tracing, network packet handling and, most recently, custom TCP congestion control and scheduler policies via sched_ext. This article looks at another use case of interest: giving an application a way to monitor the abnormal exit or death of its processes and threads.
An application can have many processes/threads that need to be monitored for abnormal exit so that the necessary cleanup can be performed. Normally, process exit is detected through the parent-child relationship, either via the SIGCHLD signal or by having a thread wait for the child’s exit. Individual thread exits are handled within a process.
In a large application, there can be many multithreaded processes that do not follow a proper process hierarchy, so handling abnormal exits centrally can be a challenge. One way would be to have a dedicated monitor process periodically scan /proc or send signal 0 with kill() to check whether a process/thread is still alive, which is inefficient. It also has to deal with PID recycling. A mechanism that lets the monitor process receive notifications when the application’s processes/threads exit abnormally would be useful.
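For comparison, the polling approach mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not code from the application: `pid_exists` is a hypothetical helper name, and the check relies on kill() with signal 0 performing the existence and permission checks without delivering a signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Hypothetical helper: returns true if a process with this PID currently
 * exists. Signal 0 is never delivered; kill() only validates the target.
 */
static bool pid_exists(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return true;
    /* EPERM means the process exists but we are not allowed to signal it. */
    return errno == EPERM;
}
```

A monitor would have to call such a helper in a loop for every PID it tracks. A positive answer does not prove that it is still the same process, since the PID may have been recycled in the meantime.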
Netlink proc connector
Netlink’s proc connector provides notifications of process events, including process exit events, so it can be used to receive process exit notifications. Recent changes to the proc connector allow unprivileged users to collect process events, and add support for filtering process exit events on a non-zero exit code. This can potentially be useful for receiving only abnormal exit events.
However, there are some limitations with Netlink proc connector-based exit notification, which make it not quite suitable for the use case described above.
For instance, the Netlink proc connector provides process events for all processes in the system, which is more than is needed by an application that is only interested in the exit of its own processes/threads. Also, a thread cannot provide a non-zero exit code, i.e., the exit code of a thread will always be zero, so non-zero exit filtering does not help for threads. The application would still have to determine whether a thread exited abnormally.
BPF based exit notification
Exit notification can be implemented entirely using BPF as per the application’s requirement, without needing any additional kernel support or changes to Netlink. Here is an example.
BPF program description
A BPF program is loaded and attached to the tracepoint ‘sched:sched_process_exit’; this tracepoint fires on process exit. Tracepoints are generally preferred for BPF attachment since they are more stable attach sites than kernel functions.
Two BPF maps are set up in user space: a ring buffer map (BPF_MAP_TYPE_RINGBUF) and a shared hashmap (BPF_MAP_TYPE_HASH), which allow BPF programs and user space to communicate via shared memory. The ring buffer lets the BPF program send messages to user space, while the hashmap can be populated, updated or read from either user space or the kernel side. For ring buffer details, please refer to the kernel’s BPF ring buffer documentation.
The ring buffer map is used to send exit notifications from the BPF program to the monitor process. The shared hashmap is accessed by the monitor process and by all other processes/threads of the application that need to be monitored. The threads register themselves in the hashmap as soon as they start running. When a thread exits, the kernel BPF program checks whether the exiting thread is registered in the hashmap and, if so, sends a notification to the monitor process through the ring buffer.
In order to have only abnormal exits be notified, the processes/threads would register themselves in the hashmap when they start, and unregister before exiting in case of a normal exit. Therefore no notifications will be sent for normal exits, as the kernel probe will not find a registration.
Libbpf APIs are used to access the hashmap to add (register) and remove (unregister) entries. The ring buffer map fd is pollable. The monitor process can epoll the ring buffer fd and consume notifications using libbpf’s ring_buffer__poll() API.
The hashmap can be shared amongst the application’s processes using its map ID. However, iterating over BPF map IDs is only available to privileged users, whereas pinning a map to bpffs under /sys/fs/bpf makes it accessible to unprivileged users via standard file permissions. Pinned pathnames help in sharing BPF objects between processes, and also ensure that the resources persist when the application restarts.
An event structure describes the event sent to the monitor program, including the pid and tid of the exited thread. It also includes a 64-bit data field that carries the value from the hashmap entry. The application thread supplies this value when it registers itself in the hashmap, which lets it associate some data to be delivered to the monitor program along with the exit notification.
The event structure can be defined in “common.h”, referenced by both kernel and user space code.
Event structure used for notification
    /* LICENSE: GPLv2 */
    struct event {
        int pid;
        int tid;
        __u64 data; /* Application data */
    };
The following BPF object is compiled into a skeleton, pexit.skel.h.
A BPF skeleton is a bpftool-generated header file containing the bytecode of the BPF object’s programs, along with APIs to create the object and to attach, detach and free it. The header can be included in the user space component of the program, which avoids having to manage a separate BPF object file and simplifies interaction with it.
    /* LICENSE: GPLv2 */
    #include "vmlinux.h"          /* kernel types, e.g. the tracepoint context */
    #include <bpf/bpf_helpers.h>  /* SEC(), __uint(), __type(), BPF helpers */
    #include "common.h"

    char LICENSE[] SEC("license") = "GPL";

    /* BPF ringbuf map */
    struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 256 * 1024 /* 256 KB */);
    } rb SEC(".maps");

    /* BPF hashmap */
    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1024);
        __type(key, __u64); /* pid << 32 | tid */
        __type(value, __u64);
    } ptable SEC(".maps");

    /* BPF probe routine */
    SEC("tp/sched/sched_process_exit")
    int handle_exit(struct trace_event_raw_sched_process_template *args)
    {
        struct event *pe;
        __u64 *v;
        /* upper 32 bits: pid (thread group id), lower 32 bits: tid */
        __u64 pidtid = bpf_get_current_pid_tgid();

        /* only registered threads generate a notification */
        v = bpf_map_lookup_elem(&ptable, &pidtid);
        if (!v)
            return 0;

        pe = bpf_ringbuf_reserve(&rb, sizeof(*pe), 0);
        if (!pe)
            return 0;

        pe->data = *v;
        pe->tid = pidtid & 0xFFFFFFFF;
        pe->pid = pidtid >> 32;
        bpf_ringbuf_submit(pe, 0);
        return 0;
    }
Load the BPF program
A loader program loads the BPF program and pins the ring buffer and the hashmap to pathnames under /sys/fs/bpf. Having a separate loader program is useful because it alone can be given the capabilities/privileges needed to load the BPF program, which the other processes of the application need not have. The loader sets up the pin pathnames with the required owner and group id to give the application processes access.
    /* LICENSE: GPLv2 */
    #include <limits.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <bpf/bpf.h>
    #include <bpf/libbpf.h>
    #include "common.h"
    #include "pexit.skel.h"

    char pathname[PATH_MAX] = "/sys/fs/bpf/pexit";
    char pinpath[PATH_MAX + 8];
    struct pexit_bpf *skel;
    struct bpf_program *prog;
    int prog_fd, err;

    /* pathname, size, uid & gid can be specified */

    /* load BPF probe */
    skel = pexit_bpf__open();
    ..
    /* resize */
    bpf_map__set_max_entries(skel->maps.rb, ringbuf_size);
    bpf_map__set_max_entries(skel->maps.ptable, hashmap_size);
    ..
    err = pexit_bpf__load(skel);

    /* pin path names to bpf program, hashmap and ringbuf and change ownership */
    ..
    prog = bpf_object__find_program_by_name(skel->obj, "handle_exit");
    ..
    prog_fd = bpf_program__fd(prog);
    snprintf(pinpath, sizeof(pinpath), "%s/prog", pathname);
    bpf_obj_pin(prog_fd, pinpath);
    ..
    snprintf(pinpath, sizeof(pinpath), "%s/map", pathname);
    err = bpf_map__pin(skel->maps.ptable, pinpath);
    chown(pinpath, uid, gid);

    snprintf(pinpath, sizeof(pinpath), "%s/rb", pathname);
    err = bpf_map__pin(skel->maps.rb, pinpath);
    chown(pinpath, uid, gid);
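Since the loader builds several pin paths from a configurable base directory, it can be worth checking that each path actually fits its buffer before pinning. The following is a small sketch of such a check; `make_pin_path` is a hypothetical helper name and not part of libbpf or of the original loader.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: builds "<dir>/<name>" into buf.
 * Returns 0 on success, -1 if the result was truncated or formatting failed.
 */
static int make_pin_path(char *buf, size_t buf_sz,
                         const char *dir, const char *name)
{
    int n = snprintf(buf, buf_sz, "%s/%s", dir, name);

    if (n < 0 || (size_t)n >= buf_sz)
        return -1;
    return 0;
}
```

Checking for truncation up front means that a path that is too long is reported as an error instead of being pinned under a shortened name.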
Monitor program
    /* LICENSE: GPLv2 */
    #include <limits.h>
    #include <stdio.h>
    #include <bpf/bpf.h>
    #include <bpf/libbpf.h>
    #include "common.h"
    #include "pexit.skel.h"

    /* event handler associated with ring buffer poll */
    int handle_event(void *ctx, void *data, size_t data_sz)
    {
        struct event *pe = data;

        /* process exit event */
        printf("Pid %d, Tid %d, data %llu\n", pe->pid, pe->tid,
               (unsigned long long)pe->data);
        ..
    }

    char path[PATH_MAX];

    /* open ring buffer given pinpath */
    snprintf(path, sizeof(path), "%s/rb", pinpath);
    ringfd = bpf_obj_get(path);
    rb = ring_buffer__new(ringfd, handle_event, NULL, NULL);

    /* Create a ring buffer poll fd and epoll if needed */
    rbpollfd = ring_buffer__epoll_fd(rb);
    ..
    while (!exiting) {
        ..
        /* consume events - will call handle_event */
        err = ring_buffer__poll(rb, 0);
    }
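The monitor’s callback receives a raw pointer and a size, so a defensive variant can verify the sample size before casting. The sketch below is an illustration under assumptions rather than the original monitor code: the event layout is mirrored locally with uint64_t standing in for __u64, and `struct monitor_stats` and `handle_event_checked` are hypothetical names. The callback keeps the `(ctx, data, data_sz)` shape used by handle_event above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Local mirror of the event layout from common.h (assumption for this sketch). */
struct event {
    int pid;
    int tid;
    uint64_t data; /* Application data */
};
static_assert(sizeof(struct event) == 16, "unexpected padding in struct event");

/* Hypothetical per-monitor state, passed in through the ctx pointer. */
struct monitor_stats {
    unsigned long events;
    unsigned long malformed;
};

/* Callback with the same shape as handle_event, plus a size check. */
static int handle_event_checked(void *ctx, void *data, size_t data_sz)
{
    struct monitor_stats *stats = ctx;
    const struct event *pe = data;

    if (data_sz < sizeof(*pe)) {
        /* Too short to be an exit event: count it and keep consuming. */
        stats->malformed++;
        return 0;
    }

    stats->events++;
    printf("Pid %d, Tid %d, data %llu\n", pe->pid, pe->tid,
           (unsigned long long)pe->data);
    return 0;
}
```

To use it, the monitor would pass a pointer to a `struct monitor_stats` as the third (ctx) argument of ring_buffer__new() instead of NULL. Returning 0 for a malformed sample lets consumption continue; a negative return value would make libbpf stop processing and report the error to the caller.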
Other threads/processes that need to be monitored would do the following.
    /* LICENSE: GPLv2 */
    pid_t pid, tid;
    __u64 key, value;
    int hmapfd;
    ..
    hmapfd = bpf_obj_get(hashmap_path);

    /* register */
    pid = getpid();
    tid = gettid();
    key = ((__u64)pid << 32) | (__u32)tid;
    value = 0xFFF8;
    bpf_map_update_elem(hmapfd, &key, &value, BPF_ANY);
    ..
    /* in case of normal exit */
    bpf_map_delete_elem(hmapfd, &key);
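The user-space key has to match, bit for bit, the value that bpf_get_current_pid_tgid() produces in the BPF program: the process id in the upper 32 bits and the thread id in the lower 32 bits. A minimal sketch of helpers for building and splitting such a key is shown below; the names `make_key`, `key_pid` and `key_tid` are hypothetical and only serve the illustration.

```c
#include <stdint.h>

/* Pack pid (upper 32 bits) and tid (lower 32 bits) into one 64-bit key. */
static uint64_t make_key(uint32_t pid, uint32_t tid)
{
    return ((uint64_t)pid << 32) | tid;
}

/* Extract the process id from the upper half of the key. */
static uint32_t key_pid(uint64_t key)
{
    return (uint32_t)(key >> 32);
}

/* Extract the thread id from the lower half of the key. */
static uint32_t key_tid(uint64_t key)
{
    return (uint32_t)(key & 0xFFFFFFFFu);
}
```

Taking the ids as unsigned 32-bit values avoids accidental sign extension when a signed pid_t is widened to 64 bits before the OR. For a single-threaded process, pid and tid are equal, so both halves of the key hold the same number.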
Summary
The BPF based exit notification shown here can help simplify the application’s handling of abnormal process/thread exits. In general, BPF is becoming useful in extending kernel functionality where possible, to suit application requirements.
Note: All code