Understanding Unix Garbage Collection and its Interaction with io_uring

May 7, 2024 | 22 minute read


As Ksplice engineers, we often have to look at completely different subsystems of the Linux kernel to patch them, either to fix a vulnerability or to add trip wires. As a result, we gain a lot of knowledge in various areas. This time we’ll share our experience with a use-after-free issue stemming from the interaction between Unix garbage collection (GC) and the io_uring component, insights we gained while working on a Known Exploit Detection update.

To explain the interaction between various components, we will begin by exploring the kernel implementation for sending file descriptors. Additionally, we will examine the function responsible for registering file descriptors with io_uring using the IORING_REGISTER_FILES opcode. Next, we’ll take a closer look at the detection and various methods for cleaning up cycles, including the Unix garbage collection code. Following this exploration, we’ll discuss a use-after-free scenario that results from the interaction between Unix GC and io_uring.

This article is based on Linux kernel version 5.15. Throughout the article, we will use the term ‘socket’ to refer to Unix domain sockets.

Sending File Descriptors over Unix Domain Sockets

File descriptors can be transmitted from one socket to another using the sendmsg system call with the SCM_RIGHTS message type. Upon receiving the message, the kernel opens the file descriptors in the receiving process.

Here’s an example of how to send a file descriptor from one socket to another from userspace.

int scmrights_send_fd(int s, struct sockaddr_un *receiver, int fd) {
    struct msghdr msg;
    struct cmsghdr *cmsg;
    char buffer[1024];
    int fds[1] = { fd };

    memset(&msg, 0, sizeof(msg));
    memset(buffer, 0, sizeof(buffer));

    msg.msg_name = receiver;
    msg.msg_namelen = sizeof(struct sockaddr_un);

    msg.msg_control = buffer; 
    msg.msg_controllen = sizeof(buffer);
    cmsg = CMSG_FIRSTHDR(&msg); 
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(fds));
    memcpy(CMSG_DATA(cmsg), fds, sizeof(fds));

    msg.msg_controllen = CMSG_SPACE(sizeof(fds));

    return sendmsg(s, &msg, 0);
}

Let’s take a deep dive into how this is done in kernel mode. During the sending process, the file objects associated with the file descriptors are referenced and placed into an array. An skb (socket buffer) holding a pointer to the file object array is allocated and added to the receive queue of the receiving socket. If the file descriptors being sent from userspace belong to the Unix domain socket type, then those sockets are also added to a global linked list called gc_inflight_list to facilitate garbage collection.

unix_dgram_sendmsg is a function pointed to by the sendmsg function pointer in unix_dgram_ops. This function is invoked whenever a message is sent using the sendmsg system call on a Unix datagram socket.

static const struct proto_ops unix_dgram_ops = {
    .sendmsg =  unix_dgram_sendmsg,

The following data structure stores the file object pointers; we’re going to examine it shortly.

struct scm_fp_list {
    struct file     *fp[SCM_MAX_FD];

unix_dgram_sendmsg calls scm_send->__scm_send to process socket-level control messages (SCM). To handle the SCM_RIGHTS message type, where the file descriptors are encoded in the cmsghdr from userspace, it calls the scm_fp_copy function. This function allocates memory for an scm_fp_list, takes a reference on each file object behind the file descriptors, and stores the file object pointers in scm_fp_list->fp.

The pointer to the allocated memory is stored in a local variable scm.fp in function unix_dgram_sendmsg.

static int scm_fp_copy(struct cmsghdr *cmsg, struct scm_fp_list **fplp)
{
    int *fdp = (int*)CMSG_DATA(cmsg);
    struct scm_fp_list *fpl = *fplp;
    struct file **fpp;
    int i, num;

    num = (cmsg->cmsg_len - sizeof(struct cmsghdr))/sizeof(int);

    if (num <= 0)
        return 0;

    if (num > SCM_MAX_FD)
        return -EINVAL;

    if (!fpl) {
        fpl = kmalloc(sizeof(struct scm_fp_list), GFP_KERNEL_ACCOUNT);
        if (!fpl)
            return -ENOMEM;
        *fplp = fpl;
        fpl->count = 0;
        fpl->max = SCM_MAX_FD;
        fpl->user = NULL;
    }
    fpp = &fpl->fp[fpl->count];

    if (fpl->count + num > fpl->max)
        return -EINVAL;

    /*
     *  Verify the descriptors and increment the usage count.
     */

    for (i = 0; i < num; i++) {
        int fd = fdp[i];
        struct file *file;

        if (fd < 0 || !(file = fget_raw(fd)))
            return -EBADF;
        *fpp++ = file;
        fpl->count++;
    }

    if (!fpl->user)
        fpl->user = get_uid(current_user());

    return num;
}

unix_dgram_sendmsg allocates an skb (of type struct sk_buff).

    skb = sock_alloc_send_pskb(sk, len - data_len, data_len,
                               msg->msg_flags & MSG_DONTWAIT, &err,
                               PAGE_ALLOC_COSTLY_ORDER);

An skb includes a field called cb, which is a character array. In the case of Unix domain sockets, this cb field is typecast to struct unix_skb_parms, which contains a member named fp of type struct scm_fp_list *, used to store the array of file object pointers. Therefore, an skb can indirectly access the file object pointers through its cb field.

struct sk_buff {
    /* This is the control buffer. It is free to use for every
     * layer. Please put your private variables there. If you
     * want to keep them across layers you have to do a skb_clone()
     * first. This is owned by whoever has the skb queued ATM.
     */
    char            cb[48] __aligned(8);
};

struct unix_skb_parms {
    struct scm_fp_list  *fp;        /* Passed files     */
};

To facilitate pointer access to the cb, the following macro is defined:

#define UNIXCB(skb) (*(struct unix_skb_parms *)&((skb)->cb))

The array of file object pointers in scm->fp is copied into a newly allocated scm_fp_list by the function scm_fp_dup, and the result is stored in UNIXCB(skb).fp.

struct scm_fp_list *scm_fp_dup(struct scm_fp_list *fpl);

    UNIXCB(skb).fp = scm_fp_dup(scm->fp);

It then iterates over the file object array, and if a referenced file object is an instance of a Unix domain socket, the unix_inflight function increments the socket’s “inflight count” and adds it to the global linked list gc_inflight_list. Here, “inflight” refers to the file objects of sockets which have not been received yet.

The reason only Unix domain sockets are appended to gc_inflight_list, and not regular files or other socket types, is that a cycle (which we will explore later in this blog) can be formed only by Unix domain sockets. That’s why the Unix GC code tracks only the Unix sockets being sent.

void unix_inflight(struct user_struct *user, struct file *fp)
{
    struct sock *s = unix_get_socket(fp);

    spin_lock(&unix_gc_lock);

    if (s) {
        struct unix_sock *u = unix_sk(s);

        if (atomic_long_inc_return(&u->inflight) == 1) {
            BUG_ON(!list_empty(&u->link));
            list_add_tail(&u->link, &gc_inflight_list);
        } else {
            BUG_ON(list_empty(&u->link));
        }
        unix_tot_inflight++;
    }
    user->unix_inflight++;
    spin_unlock(&unix_gc_lock);
}

At the beginning of the function unix_dgram_sendmsg, it obtains the receiver socket, represented by the ‘other’ variable.

If the socket is paired with another peer (e.g., socketpair), it retrieves the ‘other’ socket using the following code:

    other = unix_peer_get(sk);

Otherwise, if it is targeting a specific address where another socket is bound, the code looks like this:

    other = unix_find_other(net, sunaddr, namelen, sk->sk_type,
                            hash, &err);

We then add the skb to the receive queue of the receiver socket:

    skb_queue_tail(&other->sk_receive_queue, skb);

Now, it’s the responsibility of the receiver socket to consume the message using the recvmsg system call. When it does, the kernel iterates over the file objects in the scm_fp_list attached to the skb in the sk_receive_queue, and file descriptors are created from each of the file objects in the scm_fp_list array. It also drops the inflight reference on each file object, and if the object is an instance of a Unix domain socket, removes it from the global gc_inflight_list to indicate that Unix GC no longer needs to track it.

Registering File Descriptors with io_uring

io_uring is a framework that facilitates asynchronous and parallel I/O operations, minimizing kernel-mode blocking and enhancing user-mode processing. You can learn more about io_uring from this blog post.

The io_uring_register system call, when used with the IORING_REGISTER_FILES opcode, registers file descriptors with io_uring. The underlying file objects are referenced and placed in an internal array. Later, an I/O operation can target a registered file via the index at which it was registered, when preparing a submission queue entry (for example, with the liburing helper io_uring_get_sqe).

In kernel mode, io_uring initializes an internal Unix domain socket when setting up a new io_uring context. The function responsible for handling the registration of file descriptors is io_sqe_files_register, which in turn invokes io_sqe_files_scm->__io_sqe_files_scm. __io_sqe_files_scm references the file objects associated with the respective file descriptors passed to io_uring for registration. Additionally, if a file object is an instance of a Unix domain socket, it increments the socket’s “inflight” count and appends it to the global linked list gc_inflight_list, similar to the mechanism used for sending file descriptors with SCM_RIGHTS, as explained earlier. The array of file object pointers is stored in an allocated skb, which is queued to the receive queue of the internal socket.

static int __io_sqe_files_scm(struct io_ring_ctx *ctx, int nr, int offset)
{
    struct sock *sk = ctx->ring_sock->sk;
    struct scm_fp_list *fpl;
    struct sk_buff *skb;
    int i, nr_files;

    fpl = kzalloc(sizeof(*fpl), GFP_KERNEL);
    if (!fpl)
        return -ENOMEM;

    skb = alloc_skb(0, GFP_KERNEL);
    if (!skb) {
        kfree(fpl);
        return -ENOMEM;
    }

    skb->sk = sk;

    nr_files = 0;
    fpl->user = get_uid(current_user());
    for (i = 0; i < nr; i++) {
        struct file *file = io_file_from_index(ctx, i + offset);

        if (!file)
            continue;
        fpl->fp[nr_files] = get_file(file);
        unix_inflight(fpl->user, fpl->fp[nr_files]);
        nr_files++;
    }

    if (nr_files) {
        fpl->max = SCM_MAX_FD;
        fpl->count = nr_files;
        UNIXCB(skb).fp = fpl;
        skb->destructor = unix_destruct_scm;
        refcount_add(skb->truesize, &sk->sk_wmem_alloc);
        skb_queue_head(&sk->sk_receive_queue, skb);

        for (i = 0; i < nr_files; i++)
            fput(fpl->fp[i]);
    } else {
        kfree_skb(skb);
        free_uid(fpl->user);
        kfree(fpl);
    }

    return 0;
}

Unix Garbage Collection Internals

Unix GC tracks file objects sent via the sendmsg system call with the SCM_RIGHTS message type where the receiver socket has not yet consumed the message. If the file descriptors are closed without consuming the messages, and the file object references persist within SCM messages, Unix GC steps in to free these objects.

A cycle of Unix domain sockets is formed by sending one Unix domain socket to another using the SCM_RIGHTS message type and then sending the latter socket back to the former using the same technique. A cycle can be broken by consuming the messages from the receiving sockets, which opens the file descriptors in the receiving process. However, if a user closes the file descriptors of the receiver sockets without consuming the messages, then the only remaining references to those file objects will be from the file object arrays in skbs (socket buffers) belonging to sockets on the gc_inflight_list.

Later, when Unix GC is invoked, it iterates over the Unix domain sockets in gc_inflight_list and checks for cycles without any external references (i.e., the message cannot be consumed, and file objects cannot be freed by the close system call from userspace). The Unix GC code detects this condition and forms a hitlist of skbs containing arrays of file objects without references from open file descriptors, so that those file objects can be freed.

Cycles are detected and cleaned up if needed by the Unix GC routine when triggered. However, it’s important to note that cycles form well before the garbage collection process initiates.

Cycle Formation

Consider the following example involving three Unix domain sockets: one sender socket and two receiver sockets (the sender socket is optional).

    int sender;
    int receiver1, receiver2;

    sender = socket(AF_UNIX, SOCK_DGRAM, 0);
    receiver1 = socket(AF_UNIX, SOCK_DGRAM, 0);
    receiver2 = socket(AF_UNIX, SOCK_DGRAM, 0);

Either the sender socket sends the first receiver socket to the second receiver socket or the first receiver socket sends itself to the second receiver socket. Then, the first receiver socket’s “inflight count” will be 1, and the second receiver socket’s receive queue will have an skb that stores a pointer to an array of file objects, and one of the file objects in that array will be the file object of the first receiver socket. The first receiver socket will also be added to the global inflight list.

The sender socket sends the first receiver socket to the second receiver socket:

    struct sockaddr_un receiver1_addr;
    struct sockaddr_un receiver2_addr;

#define RECEIVER2_ADDR "/tmp/Receiver2.sock"
    memset(&receiver2_addr, 0, sizeof(receiver2_addr));
    receiver2_addr.sun_family = AF_UNIX;
    strncpy(receiver2_addr.sun_path, RECEIVER2_ADDR, sizeof(receiver2_addr.sun_path) - 1);

    bind(receiver2, (struct sockaddr *)&receiver2_addr,
         sizeof(receiver2_addr));

    scmrights_send_fd(sender, &receiver2_addr, receiver1);

scmrights_send_fd sends file descriptors using the SCM_RIGHTS message type and is defined in the “Sending File Descriptors over Unix Domain Sockets” section.

This will not form a cycle yet. The cycle will be formed if, in addition to the steps in the above paragraph and example, the sender socket sends the second receiver socket to the first receiver socket or the second receiver socket sends itself to the first receiver socket. Then the second receiver socket’s file object will be placed in the file object array and the pointer to the array will be stored in an skb of the first receiver socket’s receive queue.

The sender socket sends the second receiver socket to the first receiver socket, completing the cycle:

#define RECEIVER1_ADDR "/tmp/Receiver1.sock"
    memset(&receiver1_addr, 0, sizeof(receiver1_addr));
    receiver1_addr.sun_family = AF_UNIX;
    strncpy(receiver1_addr.sun_path, RECEIVER1_ADDR, sizeof(receiver1_addr.sun_path) - 1);

    bind(receiver1, (struct sockaddr *)&receiver1_addr,
         sizeof(receiver1_addr));

    scmrights_send_fd(sender, &receiver1_addr, receiver2);

So, the first receiver socket has a reference to the second receiver socket through its receive queue, and the second receiver socket has a reference to the first receiver socket through its receive queue, effectively making a cycle.

If the file descriptors for first and second receivers are closed, then the only references to the associated file objects would exist through the sockets’ receive queues.


Cycle Detection

In order to detect a cycle, the unix_gc function moves every Unix domain socket in gc_inflight_list whose file object reference count equals its inflight count (indicating that the only references to the file object come from the Unix GC tracking code and the corresponding file descriptors are closed) to a candidate list. In our case, the list contains the first and second receiver sockets.

Unix GC then processes the first receiver socket from the candidate list. It inspects all the file objects from the skbs in its receive queue, finds the file object of the second receiver socket stored there, and decrements its inflight count.

It then processes the second receiver socket from the candidate list in the same way: it inspects all the file objects from the skbs in its receive queue, finds the file object of the first receiver socket, and decrements its inflight count.

It will then iterate over all the sockets from the candidate list whose inflight count is greater than zero; it will add them to the not-cycle list.

The remaining sockets in the candidate list are part of a cycle and can only be freed by Unix GC.

If the first receiver socket was sent to the second receiver socket, but the second receiver socket was not sent to the first, then no cycle is formed.

Breaking the Cycle

Breaking of the cycles can happen in two ways:

  • Receiving the SCM_RIGHTS message.

  • Triggering the Unix GC when no external reference to the file objects exist (i.e. file descriptors are closed).

Message Reception

In this section, we will explore how the cycle is broken by consuming the SCM_RIGHTS message with the recvmsg system call. For Unix datagram sockets, recvmsg invokes unix_dgram_recvmsg->__unix_dgram_recvmsg. This function, in turn, calls unix_detach_fds, which iterates over the file object pointer array attached to the skb in the receiver socket’s receive queue and calls unix_notinflight to unlink each inflight socket from the global linked list gc_inflight_list.

void unix_notinflight(struct user_struct *user, struct file *fp)
{
    struct sock *s = unix_get_socket(fp);

    spin_lock(&unix_gc_lock);

    if (s) {
        struct unix_sock *u = unix_sk(s);

        BUG_ON(!atomic_long_read(&u->inflight));
        BUG_ON(list_empty(&u->link));

        if (atomic_long_dec_and_test(&u->inflight))
            list_del_init(&u->link);
        unix_tot_inflight--;
    }
    user->unix_inflight--;
    spin_unlock(&unix_gc_lock);
}

After that, it installs the file object(s) as the new file descriptor(s) for the receiving process.

int __receive_fd(struct file *file, int __user *ufd, unsigned int o_flags)
{
    int new_fd;
    int error;

    error = security_file_receive(file);
    if (error)
        return error;

    new_fd = get_unused_fd_flags(o_flags);
    if (new_fd < 0)
        return new_fd;

    if (ufd) {
        error = put_user(new_fd, ufd);
        if (error) {
            put_unused_fd(new_fd);
            return error;
        }
    }

    fd_install(new_fd, get_file(file));
    __receive_sock(file);
    return new_fd;
}

Stack trace of __receive_fd:

#0  __receive_fd (file=0xffffc90000297cc8, ufd=0xffffc90000297ec0, o_flags=36300032) at fs/file.c:1205
#1  0xffffffff816f9fd4 in receive_fd_user (o_flags=<optimized out>, ufd=<optimized out>, file=<optimized out>) at ./include/linux/file.h:105
#2  scm_detach_fds (msg=0xffffc90000297ec0, scm=0xffffc90000297cc8) at net/core/scm.c:318
#3  0xffffffff817e82b7 in scm_recv (sock=0xffff88810229e500, msg=0xffffc90000297ec0, scm=0xffffc90000297cc8, flags=<optimized out>) at ./include/net/scm.h:140
#4  0xffffffff817eab59 in __unix_dgram_recvmsg (sk=0xffff88810219b180, msg=0xffffc90000297ec0, size=18446683600572742856, flags=0) at net/unix/af_unix.c:2394
Triggering the Unix Garbage Collection (GC)

The Unix Garbage Collection (GC) code serves two main purposes: detecting cycles and removing file objects that lack external references, such as open file descriptors.

Unix GC is triggered upon closing a Unix domain socket.

    close(socket(AF_UNIX, SOCK_DGRAM, 0));

The subsequent section will dive deeply into the functionality.

Implementation of unix_gc Routine

The unix_gc routine first makes a list of possible cycles of Unix domain sockets whose file objects’ refcount is equal to the inflight count. This new list is called gc_candidates.

    list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
        long total_refs;
        long inflight_refs;

        total_refs = file_count(u->sk.sk_socket->file);
        inflight_refs = atomic_long_read(&u->inflight);

        if (total_refs == inflight_refs) {
            list_move_tail(&u->link, &gc_candidates);
            __set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
            __set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
        }
    }

Next, it calls the scan_children->scan_inflight functions to decrement the inflight count of every socket referenced from the receive queues of the sockets in the gc_candidates list. This makes the inflight count zero for the sockets that are purely “inflight” (i.e., whose SCM_RIGHTS messages have not been consumed yet).

    list_for_each_entry(u, &gc_candidates, link)
        scan_children(&u->sk, dec_inflight, NULL);

The scan_inflight function iterates over all the skbs in the receive queue of the socket passed from scan_children. Each skb carries an array of file object pointers in its fp member, reached through the UNIXCB macro as explained before. During this iteration, the function decrements the inflight count of any Unix domain socket found in the file object array of the currently processed skb.

static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
                          struct sk_buff_head *hitlist)
{
    struct sk_buff *skb;
    struct sk_buff *next;

    spin_lock(&x->sk_receive_queue.lock);
    skb_queue_walk_safe(&x->sk_receive_queue, skb, next) {
        /* Do we have file descriptors ? */
        if (UNIXCB(skb).fp) {
            bool hit = false;
            /* Process the descriptors of this socket */
            int nfd = UNIXCB(skb).fp->count;
            struct file **fp = UNIXCB(skb).fp->fp;

            while (nfd--) {
                /* Get the socket the fd matches if it indeed does so */
                struct sock *sk = unix_get_socket(*fp++);

                if (sk) {
                    struct unix_sock *u = unix_sk(sk);

                    /* Ignore non-candidates, they could
                     * have been added to the queues after
                     * starting the garbage collection
                     */
                    if (test_bit(UNIX_GC_CANDIDATE, &u->gc_flags)) {
                        hit = true;

                        func(u);
                    }
                }
            }
            if (hit && hitlist != NULL) {
                __skb_unlink(skb, &x->sk_receive_queue);
                __skb_queue_tail(hitlist, skb);
            }
        }
    }
    spin_unlock(&x->sk_receive_queue.lock);
}

It again iterates over the gc_candidates list to check whether any socket’s inflight count is still greater than zero. Such sockets are moved to the not_cycle_list, and the inflight counts of the sockets referenced from their receive queues are incremented again (as they were decremented in the previous step).

    list_add(&cursor, &gc_candidates);
    while (cursor.next != &gc_candidates) {
        u = list_entry(cursor.next, struct unix_sock, link);

        /* Move cursor to after the current position. */
        list_move(&cursor, &u->link);

        if (atomic_long_read(&u->inflight) > 0) {
            list_move_tail(&u->link, &not_cycle_list);
            __clear_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
            scan_children(&u->sk, inc_inflight_move_tail, NULL);
        }
    }
    list_del(&cursor);

Now gc_candidates contains only the sockets with no open external handles, and these are part of a cycle.

It will then form a list of skbs called the hitlist. To construct the hitlist, it iterates through all the skbs in the receive queue of each socket in gc_candidates. For each skb, it checks whether any file object from the attached file object pointer array is an instance of a Unix domain socket. If so, it adds that skb to the hitlist. The inflight counts that were previously decremented are also incremented again to maintain consistency.

    list_for_each_entry(u, &gc_candidates, link)
        scan_children(&u->sk, inc_inflight, &hitlist);

Finally, the sockets in the not_cycle_list are moved back to gc_inflight_list:

    /* not_cycle_list contains those sockets which do not make up a
     * cycle.  Restore these to the inflight list.
     */
    while (!list_empty(&not_cycle_list)) {
        u = list_entry(not_cycle_list.next, struct unix_sock, link);
        __clear_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
        list_move_tail(&u->link, &gc_inflight_list);
    }

The hitlist of skbs is then purged: the skbs are deleted, and the references they held on the file objects in their file object pointer arrays are dropped. A file object is freed once its f_count reaches zero.


Use-after-free Resulting From the Interaction Between Unix GC and io_uring

A malicious user can exploit the “cycle” of inflight sockets when a file object, associated with a file descriptor registered with an io_uring context, is also present in the receive queue of one of the sockets in the cycle. Unix GC then frees the file object prematurely, resulting in a use-after-free.

The use-after-free scenario involves a regular file object that, after undergoing a write permission check in the kernel, awaits an inode lock in the write path invoked from io_uring. The file descriptor is registered with io_uring, and the corresponding file object is referenced and stored in a file pointer array attached to an skb belonging to the internal Unix domain socket of the io_uring context. While the io_uring write operation is initiated from a work queue task and awaits the inode lock, Unix GC frees the file object because a “cycle” of sockets exists whose file descriptors are already closed. Subsequently, another thread opens a file with supervisor rights in read-only mode, and its file object is allocated at the same memory location as the recently freed one. When the io_uring writer thread wakes up, it uses the newer privileged file object to begin writing user data.

If any skb in the hitlist belongs to an io_uring context’s internal socket, then the scm_fp_list of that skb holds at least one file object pointer registered with io_uring using the IORING_REGISTER_FILES opcode. When the skb is purged as part of cleaning up the hitlist, this file object is freed prematurely, even while an I/O operation on it is still pending.


As Ksplice engineers, we often have to learn new areas of the kernel very rapidly to understand how a patch can be applied without rebooting the system. This work can be challenging, but it’s also rewarding to explore new areas of the kernel and understand how they interact with each other.

If you find this type of work interesting, consider applying for a job with the Ksplice team! Feel free to drop us a line at ksplice-support_ww@oracle.com.

Shoily Rahman
