Wednesday Jul 22, 2015

Fixing Security Vulnerabilities in Linux

Security vulnerabilities are some of the hardest bugs to discover yet they can have the largest impact. At Ksplice, we spend a lot of time looking at security vulnerabilities and seeing how they are fixed. We use automated tools such as the Trinity syscall fuzzer and the Kernel Address Sanitizer (KASan) to aid our process. In this blog post we'll go over some case studies of recent vulnerabilities and show you how you can avoid them in your code.

CVE-2013-7339 and CVE-2014-2678

These two are very similar NULL pointer dereferences that occur when trying to bind an RDS socket without having an RDS device. This kind of oversight happens quite often in hardware-specific code in the kernel. It is easy for developers to assume that a piece of hardware always exists since all their dev machines have it, but that assumption sometimes leaves other possible hardware configurations untested. In this example the code makes the seemingly reasonable assumption that using RDS sockets without RDS hardware doesn't really make sense.

The issue is pretty simple as we can see from this fix:

diff --git a/net/rds/ib.c b/net/rds/ib.c
index b4c8b00..ba2dffe 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -338,7 +338,8 @@ static int rds_ib_laddr_check(__be32 addr)
        ret = rdma_bind_addr(cm_id, (struct sockaddr *)&sin);
        /* due to this, we will claim to support iWARP devices unless we
           check node_type. */
-       if (ret || cm_id->device->node_type != RDMA_NODE_IB_CA)
+       if (ret || !cm_id->device ||
+           cm_id->device->node_type != RDMA_NODE_IB_CA)
                ret = -EADDRNOTAVAIL;

        rdsdebug("addr %pI4 ret %d node type %d\n",

Generally we are allowed to bind an address without a physical device, so we can reach this code without any RDS hardware. Sadly, this code wrongly assumes that a device exists at this point and that cm_id->device is not NULL, which leads to a NULL pointer dereference.
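The pattern in the fix can be sketched outside the kernel. The following is a minimal userspace model, assuming hypothetical stand-in types (struct device, struct cm_id, laddr_check() and the NODE_* constants are illustrative and are not the real RDMA structures). It shows only the idea of testing an optional pointer before dereferencing it:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real RDMA structures. */
enum node_type { NODE_IB_CA = 1, NODE_OTHER = 2 };

struct device {
    enum node_type node_type;
};

struct cm_id {
    struct device *device;  /* NULL when an address is bound without hardware */
};

/*
 * Returns 0 if the bound address is backed by an IB device and
 * -EADDRNOTAVAIL otherwise. bind_ret models the result of the bind call.
 */
int laddr_check(const struct cm_id *id, int bind_ret)
{
    assert(id != NULL);

    /* || short-circuits, so node_type is only read when device is non-NULL. */
    if (bind_ret || !id->device || id->device->node_type != NODE_IB_CA)
        return -EADDRNOTAVAIL;

    return 0;
}
```

Calling laddr_check() with a cm_id whose device is NULL returns -EADDRNOTAVAIL instead of crashing, which mirrors the behavior the patch introduces.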

These types of issues are usually caught early in -next, as that exposes the code to various users and hardware configurations, but this one managed to slip through somehow.

There are many variations of the scenario where hardware-specific and other kernel code doesn't handle cases which "don't make sense". Another recent example is dlmfs. The kernel would panic when trying to create a directory on it, something that doesn't happen in regular usage of dlmfs.


This one is interesting and very difficult to stumble upon by accident. It's a race condition that is only possible during the migration of huge pages between NUMA nodes, so the window of opportunity is *very* small. It can be triggered by trying to dump the NUMA maps of a process while its memory is being moved around. What happens is that the code trying to dump memory makes invalid memory accesses because it does not check the presence of the memory beforehand.

When we dump NUMA maps we need to walk memory entries using walk_page_range():

        /*
         * Handle hugetlb vma individually because pagetable
         * walk for the hugetlb page is dependent on the
         * architecture and we can't handled it in the same
         * manner as non-huge pages.
         */
        if (walk->hugetlb_entry && (vma->vm_start <= addr) &&
            is_vm_hugetlb_page(vma)) {
                if (vma->vm_end < next)
                        next = vma->vm_end;
                /*
                 * Hugepage is very tightly coupled with vma,
                 * so walk through hugetlb entries within a
                 * given vma.
                 */
                err = walk_hugetlb_range(vma, addr, next, walk);
                if (err)
                        break;
                pgd = pgd_offset(walk->mm, next);
                continue;
        }

When walk_page_range() detects a hugepage it calls walk_hugetlb_range(), which calls the proc's callback (provided by walk->hugetlb_entry()) for each page individually:

        pte_t *pte;
        int err = 0;

        do {
                next = hugetlb_entry_end(h, addr, end);
                pte = huge_pte_offset(walk->mm, addr & hmask);
                if (pte && walk->hugetlb_entry)
                        err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
                if (err)
                        return err;
        } while (addr = next, addr != end);

Note that the callback is executed for each pte, even for those that are not present in memory (pte_present(*pte) would return false in that case). The walker code does this because it was assumed that callback functions might want to handle that scenario for some reason. In the past there was no way for a huge pte to be absent, but that changed when hugepage migration was introduced. During page migration unmap_and_move_huge_page() removes huge ptes:

        if (page_mapped(hpage)) {
                [...]
                page_was_mapped = 1;
        }


Unfortunately, some callbacks were not changed to deal with this new possibility. A notable example is gather_pte_stats(), which tries to lock a non-existent pte:

        orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);

This can cause a panic if it happens during a tiny window inside unmap_and_move_huge_page().
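The defensive pattern for such callbacks can be sketched in userspace. This is a simplified model, assuming hypothetical types and names (struct entry, walk_entries() and gather_stats() are illustrative and are not the kernel's page-walk API). The walker calls the callback for every entry, so the callback itself has to check presence before touching anything else:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a page-table entry; not the kernel's pte_t. */
struct entry {
    bool present;  /* false while the page is being migrated */
    int node;      /* NUMA node holding the page, meaningful only if present */
};

#define MAX_NODES 4

struct stats {
    unsigned long pages[MAX_NODES];
};

typedef int (*entry_cb)(const struct entry *e, void *priv);

/* The walker invokes the callback for every entry, present or not. */
int walk_entries(const struct entry *tbl, size_t n, entry_cb cb, void *priv)
{
    assert(tbl != NULL && cb != NULL);

    for (size_t i = 0; i < n; i++) {
        int err = cb(&tbl[i], priv);
        if (err)
            return err;
    }
    return 0;
}

/* The callback checks presence first and skips absent entries. */
int gather_stats(const struct entry *e, void *priv)
{
    struct stats *st = priv;

    if (!e->present)
        return 0;   /* nothing is mapped here right now */
    if (e->node < 0 || e->node >= MAX_NODES)
        return -1;  /* unexpected node number: stop the walk */
    st->pages[e->node]++;
    return 0;
}
```

In this model an absent entry carrying a garbage node number is simply skipped, so the walk completes and only present entries are counted.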

Dumping NUMA maps doesn't happen too often and is mainly used for testing/debugging, so this bug has lived there for quite a while and was made visible only recently when hugepage migration was added.

It's also common for a new userspace interface to expose many issues when it lets users trigger kernel code that was rarely called before. This happened recently when the firmware loading code was exposed to userspace.


This one also falls into the category of "doesn't make sense" because it involves repeated page faulting of memory that we marked as unwanted. When this happens shmem tries to remove a block of memory, but since it's getting faulted over and over again shmem will hang waiting until it's available for removal. Meanwhile other filesystem operations will be blocked, which is bad because that memory may never become available for removal.

When we're faulting a shmem page in, shmem_fault() would grab a reference to the page:

        static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
        {
                [...]
                error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);

But shmem_fallocate() holds i_mutex while it waits to free that page, and if the page keeps getting faulted back in, it can wait forever. This, in turn, means that the filesystem is stuck waiting for shmem_fallocate() to complete.

Beyond that, punching holes in files and marking memory as unwanted are not common operations, especially not on a shmem filesystem. This means that those code paths are poorly tested.
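One possible way to break this cycle is to have the fault path notice that a hole punch is in progress and back off instead of taking a new page reference. The following is a single-threaded userspace model of that idea only. All names (struct punch_range, struct inode_model, fault_page()) are hypothetical, locking and wakeups are omitted, and it is not presented as the actual kernel change:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model; names and fields are illustrative, not shmem's. */
struct punch_range {
    unsigned long start;  /* first page offset being removed */
    unsigned long next;   /* one past the last page offset being removed */
};

struct inode_model {
    const struct punch_range *punching;  /* non-NULL while a hole punch runs */
    unsigned int page_refs;              /* references taken by faults */
};

enum fault_result { FAULT_OK, FAULT_RETRY };

/*
 * Fault in the page at pgoff. If that offset lies inside a range that is
 * currently being punched out, take no reference and ask the caller to
 * retry later, so the hole punch is never kept waiting by new faults.
 */
enum fault_result fault_page(struct inode_model *inode, unsigned long pgoff)
{
    const struct punch_range *p;

    assert(inode != NULL);
    p = inode->punching;

    if (p && pgoff >= p->start && pgoff < p->next)
        return FAULT_RETRY;

    inode->page_refs++;
    return FAULT_OK;
}
```

Faults outside the punched range proceed as usual, while faults inside it leave page_refs untouched, which is the property that lets the removal finish.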


This is a privilege escalation which was found using KASan. We noticed that, as a result of a call to a PPPOL2TP ioctl, an uninitialized address inside a struct was being read. Further investigation showed that this was the result of confusion about the type of the struct that was being accessed.

When we call setsockopt from userspace on a PPPOL2TP socket, we end up in pppol2tp_setsockopt(), which looks at the level parameter to see if the sockopt operation is intended for PPPOL2TP or for the underlying UDP socket:

   if (level != SOL_PPPOL2TP)
      return udp_prot.setsockopt(sk, level, optname, optval, optlen);

PPPOL2TP tries to be helpful here and allows userspace to set UDP sockopts rather than just PPPOL2TP ones. The problem here is that UDP's setsockopt expects a udp_sock:

        int udp_lib_setsockopt(struct sock *sk, int level, int optname,
                               char __user *optval, unsigned int optlen,
                               int (*push_pending_frames)(struct sock *))
        {
                struct udp_sock *up = udp_sk(sk);

But instead it's getting just a sock struct.

It's possible to leverage this struct confusion to achieve privilege escalation. We can overwrite the function pointer in the struct to point to code of our choice. Then we can trigger the execution of this code by making another socket operation. The piece of code that allowed for this vulnerability was added for convenience, but no one ever needed it, and it was never tested.
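The underlying hazard is an unchecked downcast from a generic struct to a more specific one. As a rough userspace illustration, assuming hypothetical types (sock_model, udp_sock_model, pppox_sock_model and the kind tag are invented for this sketch and do not exist in the kernel), a checked cast refuses to reinterpret a struct of the wrong type:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical models of the socket structs; not the kernel definitions. */
enum sock_kind { KIND_UDP, KIND_PPPOX };

struct sock_model {
    enum sock_kind kind;  /* type tag assumed for this sketch */
};

struct udp_sock_model {
    struct sock_model sk;  /* first member, so the cast below is valid C */
    int corkflag;
};

struct pppox_sock_model {
    struct sock_model sk;
    void (*handler)(void);  /* unrelated data that must not be overwritten */
};

/* Checked downcast: returns NULL rather than reinterpreting a foreign struct. */
struct udp_sock_model *udp_sk_checked(struct sock_model *sk)
{
    assert(sk != NULL);

    if (sk->kind != KIND_UDP)
        return NULL;
    return (struct udp_sock_model *)sk;
}

/* Sets a UDP-only option, rejecting sockets that are not UDP sockets. */
int udp_setsockopt_model(struct sock_model *sk, int val)
{
    struct udp_sock_model *up = udp_sk_checked(sk);

    if (!up)
        return -EINVAL;
    up->corkflag = val;
    return 0;
}
```

With the check in place, passing the embedded sock of a pppox_sock_model yields -EINVAL and leaves its function pointer untouched, whereas an unchecked cast would write through a pointer to the wrong struct type.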


We hope that this exposition of both straightforward and more subtle kernel bugs will remind readers of the importance of looking at code from a new perspective, and will encourage the developer community to create and contribute to new tools and methodologies for detecting and preventing bugs in the kernel.



