Wednesday Jul 22, 2015

Fixing Security Vulnerabilities in Linux

Security vulnerabilities are some of the hardest bugs to discover yet they can have the largest impact. At Ksplice, we spend a lot of time looking at security vulnerabilities and seeing how they are fixed. We use automated tools such as the Trinity syscall fuzzer and the Kernel Address Sanitizer (KASan) to aid our process. In this blog post we'll go over some case studies of recent vulnerabilities and show you how you can avoid them in your code.

CVE-2013-7339 and CVE-2014-2678

These two are very similar NULL pointer dereferences when trying to bind an RDS socket without having an RDS device. This is an oversight that happens quite often in hardware-specific code in the kernel. It is easy for developers to assume that a piece of hardware always exists since all their dev machines have it, but that sometimes leads to other possible hardware configurations left untested. In this example the code makes a seemingly reasonable assumption that using RDS sockets without RDS hardware doesn't really make sense.

The issue is pretty simple as we can see from this fix:

diff --git a/net/rds/ib.c b/net/rds/ib.c
index b4c8b00..ba2dffe 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -338,7 +338,8 @@ static int rds_ib_laddr_check(__be32 addr)
   ret = rdma_bind_addr(cm_id, (struct sockaddr *)&sin);
   /* due to this, we will claim to support iWARP devices unless we
      check node_type. */
-     if (ret || cm_id->device->node_type != RDMA_NODE_IB_CA)
+     if (ret || !cm_id->device ||
+         cm_id->device->node_type != RDMA_NODE_IB_CA)
                                   ret = -EADDRNOTAVAIL;

                                   rdsdebug("addr %pI4 ret %d node type %d\n",

Generally we are allowed to bind an address without a physical device so we can reach this code without any RDS hardware. Sadly, this code wrongly assumes that a devices exists at this point and that cm_id->device is not NULL leading to a NULL pointer dereference.

These type of issues are usually caught early in -next as that exposes the code to various users and hardware configurations, but this one managed to slip through somehow.

There are many variations of the scenario where the hardware specific and other kernel code doesn't handle cases which "don't make sense". Another recent example is dlmfs. The kernel would panic when trying to create a directory on it - something that doesn't happen in regular usage of dlmfs.


This one is interesting and very difficult to stumble upon by accident. It's a race condition that is only possible during the migration of huge pages between NUMA nodes, so the window of opportunity is *very* small. It can be triggered by trying to dump the NUMA maps of a process while its memory is being moved around. What happens is that the code trying to dump memory makes invalid memory accesses because it does not check the presence of the memory beforehand.

When we dump NUMA maps we need to walk memory entries using walk_page_range():

         * Handle hugetlb vma individually because pagetable
         * walk for the hugetlb page is dependent on the
         * architecture and we can't handled it in the same
         * manner as non-huge pages.
        if (walk->hugetlb_entry && (vma->vm_start <= addr) &&
            is_vm_hugetlb_page(vma)) {
                if (vma->vm_end < next)
                        next = vma->vm_end;
                 * Hugepage is very tightly coupled with vma,
                 * so walk through hugetlb entries within a
                 * given vma.
                err = walk_hugetlb_range(vma, addr, next, walk);
                if (err)
                pgd = pgd_offset(walk->mm, next);

When walk_page_range() detects a hugepage it calls walk_hugetlb_range(), which calls the proc's callback (provided by walk->hugetlb_entry()) for each page individually:

        pte_t *pte;
        int err = 0;

        do {
                next = hugetlb_entry_end(h, addr, end);
                pte = huge_pte_offset(walk->mm, addr & hmask);
                if (pte && walk->hugetlb_entry)
                        err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
                if (err)
                        return err;
        } while (addr = next, addr != end);

Note that the callback is executed for each pte; even for those that are not present in memory (pte_present(*pte) would return false in that case). This is done by the walker code because it was assumed that callback functions might want to handle that scenario for some reason. In the past there was no way for a huge pte to be absent, but that changed when the hugepage migration was introduced. During page migration unmap_and_move_huge_page() removes huge ptes:

if (page_mapped(hpage)) {


page_was_mapped = 1;


Unfortunately, some callbacks were not changed to deal with this new possibility. A notable example is gather_pte_stats(), which tries to lock a non-existent pte:

        orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);

This can cause a panic if it happens during a tiny window inside unmap_and_move_huge_page().

Dumping NUMA maps doesn't happen too often and is mainly used for testing/debugging, so this bug has lived there for quite a while and was made visible only recently when hugepage migration was added.

It's also common that adding userspace interfaces to trigger kernel code which doesn't get called often exposes many issues. This happened recently when the firmware loading code was exposed to userspace.


This one also falls into the category of "doesn't make sense" because it involves repeated page faulting of memory that we marked as unwanted. When this happens shmem tries to remove a block of memory, but since it's getting faulted over and over again shmem will hang waiting until it's available for removal. Meanwhile other filesystem operations will be blocked, which is bad because that memory may never become available for removal.

When we're faulting a shmem page in, shmem_fault() would grab a reference to the page:

static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf) { [...] error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);

But because shmem_fallocate() holds i_mutex this means that shmem_fallocate() can wait forever until it can free up that page. This, in turn means that that filesystem is stuck waiting for shmem_fallocate() to complete.

Beyond that, punching holes in files and marking memory as unwanted are not common operations; especially not on a shmem filesystem. This means that those code paths are very untested.


This is a privilege escalation which was found using KASan. We've noticed that as a result of a call to a PPPOL2TP ioctl an uninitialized address inside a struct was being read. Further investigation showed that this is the result of a confusion about the type of the struct that was being accessed.

When we call setsockopt from userspace on a PPPOL2TP socket in userspace we'll end up in pppol2tp_setsockopt() which will look at the level parameter to see if the sockopt operation is intended for PPPOL2TP or the underlying UDP socket:

   if (level != SOL_PPPOL2TP)
      return udp_prot.setsockopt(sk, level, optname, optval, optlen);

PPPOL2TP tries to be helpful here and allows userspace to set UDP sockopts rather than just PPPOL2TP ones. The problem here is that UDP's setsockopt expects a udp_sock:

 int udp_lib_setsockopt(struct sock *sk, int level, int optname,
                        char __user *optval, unsigned int optlen,
                        int (*push_pending_frames)(struct sock *))
         struct udp_sock *up = udp_sk(sk);

But instead it's getting just a sock struct.

It's possible to leverage this struct confusion to achieve privilege escalation. We can overwrite the function pointer in the struct to point to code of our choice. Then we can trigger the execution of this code by making another socket operation. The piece of code that allowed for this vulnerability was added for convenience, but no one ever needed it, and it was never tested.


We hope that this exposition of straightforward and more subtle kernel bugs will remind of the importance of looking at code from a new perspective and encourage the developer community to contribute to and create new tools and methodologies for detecting and preventing bugs in the kernel.

Wednesday Jan 29, 2014

Ksplice SNMP Plugin

The Ksplice team is happy to announce the release of an SNMP plugin for Ksplice, available today on the Unbreakable Linux Network. The plugin will let you use Oracle Enterprise Manager to monitor the status of Ksplice on all of your systems, but it will also work with any monitoring solution that is SNMP compatible.


You'll find the plugin on Ksplice channel for your distribution and architecture. For Oracle Linux 6 on x86_64 that's ol6_x86_64_ksplice. Install the plugin by running (as root):

yum install ksplice-snmp-plugin


If you haven't set up SNMP before, you'll need to do a little bit of configuration. Included below is a sample /etc/snmp/snmpd.conf file to try out the plugin:

# Setting up permissions
# ======================
com2sec local localhost public
com2sec mynet public

group local v1 local
group local v2c local
group local usm local
group mynet v1  mynet
group mynet v2c mynet
group mynet usm mynet

view all included .1 80

access mynet "" any noauth exact all none none
access local "" any noauth exact all all none

syslocation Oracle Linux 6
syscontact sysadmin <root@localhost>

# Load the plugin
# ===============
dlmod kspliceUptrack /usr/lib64/ksplice-snmp/


You'll want to replace the IP address next to com2sec with the address of your local network, or the address of your SNMP monitoring software. If you are running on a 32 bit architecture x86), replace lib64 in the dlmod path with lib. The above configuration is an example, intended only for testing purposes. For more information about configuring SNMP, check out the SNMP documentation, including the man pages for snmpd and snmpd.conf.


You can test out your configuration by using the snmpwalk command and verifying the responses. Some examples:

Displaying the installed version of Ksplice:

$ snmpwalk -v 1 -c public -O e localhost KSPLICE-UPTRACK-MIB::kspliceVersion
KSPLICE-UPTRACK-MIB::kspliceVersion.0 = STRING: 1.2.12

Checking if a kernel has all available updates installed:

$ snmpwalk -v 1 -c public -O e localhost KSPLICE-UPTRACK-MIB::kspliceStatus
KSPLICE-UPTRACK-MIB::kspliceStatus.0 = STRING: outofdate

Displaying and comparing the kernel installed on disk with the Ksplice effective version:

$ snmpwalk -v 1 -c public -O e localhost KSPLICE-UPTRACK-MIB::kspliceBaseKernel
KSPLICE-UPTRACK-MIB::kspliceBaseKernel.0 = STRING: 2.6.18-274.3.1.el5

$ snmpwalk -v 1 -c public -O e localhost KSPLICE-UPTRACK-MIB::kspliceEffectiveKernel
KSPLICE-UPTRACK-MIB::kspliceEffectiveKernel.0 = STRING: 2.6.18-274.3.1.el5

Displaying a list of all installed updates:

$ snmpwalk -v 1 -c public -O e localhost KSPLICE-UPTRACK-MIB::ksplicePatchTable

In this case, there are none. This is why the base kernel version and effective kernel version are the same, and why this kernel is out of date.

Displaying a list of updates that can be installed right now, including their description:

$ snmpwalk -v 1 -c public -O e localhost KSPLICE-UPTRACK-MIB::kspliceAvailTable
KSPLICE-UPTRACK-MIB::kspliceavailIndex.0 = INTEGER: 0
KSPLICE-UPTRACK-MIB::kspliceavailIndex.1 = INTEGER: 1
KSPLICE-UPTRACK-MIB::kspliceavailIndex.2 = INTEGER: 2
KSPLICE-UPTRACK-MIB::kspliceavailIndex.3 = INTEGER: 3
KSPLICE-UPTRACK-MIB::kspliceavailIndex.4 = INTEGER: 4
KSPLICE-UPTRACK-MIB::kspliceavailIndex.5 = INTEGER: 5
KSPLICE-UPTRACK-MIB::kspliceavailIndex.6 = INTEGER: 6
KSPLICE-UPTRACK-MIB::kspliceavailIndex.7 = INTEGER: 7
KSPLICE-UPTRACK-MIB::kspliceavailIndex.8 = INTEGER: 8
KSPLICE-UPTRACK-MIB::kspliceavailIndex.9 = INTEGER: 9
KSPLICE-UPTRACK-MIB::kspliceavailIndex.10 = INTEGER: 10
KSPLICE-UPTRACK-MIB::kspliceavailIndex.11 = INTEGER: 11
KSPLICE-UPTRACK-MIB::kspliceavailIndex.12 = INTEGER: 12
KSPLICE-UPTRACK-MIB::kspliceavailIndex.13 = INTEGER: 13
KSPLICE-UPTRACK-MIB::kspliceavailIndex.14 = INTEGER: 14
KSPLICE-UPTRACK-MIB::kspliceavailIndex.15 = INTEGER: 15
KSPLICE-UPTRACK-MIB::kspliceavailIndex.16 = INTEGER: 16
KSPLICE-UPTRACK-MIB::kspliceavailIndex.17 = INTEGER: 17
KSPLICE-UPTRACK-MIB::kspliceavailIndex.18 = INTEGER: 18
KSPLICE-UPTRACK-MIB::kspliceavailIndex.19 = INTEGER: 19
KSPLICE-UPTRACK-MIB::kspliceavailIndex.20 = INTEGER: 20
KSPLICE-UPTRACK-MIB::kspliceavailIndex.21 = INTEGER: 21
KSPLICE-UPTRACK-MIB::kspliceavailIndex.22 = INTEGER: 22
KSPLICE-UPTRACK-MIB::kspliceavailIndex.23 = INTEGER: 23
KSPLICE-UPTRACK-MIB::kspliceavailIndex.24 = INTEGER: 24
KSPLICE-UPTRACK-MIB::kspliceavailIndex.25 = INTEGER: 25
KSPLICE-UPTRACK-MIB::kspliceavailName.0 = STRING: [urvt04qt]
KSPLICE-UPTRACK-MIB::kspliceavailName.1 = STRING: [7jb2jb4r]
KSPLICE-UPTRACK-MIB::kspliceavailName.2 = STRING: [ot8lfoya]
KSPLICE-UPTRACK-MIB::kspliceavailName.3 = STRING: [f7pwmkto]
KSPLICE-UPTRACK-MIB::kspliceavailName.4 = STRING: [nxs9cwnt]
KSPLICE-UPTRACK-MIB::kspliceavailName.5 = STRING: [i8j4bdkr]
KSPLICE-UPTRACK-MIB::kspliceavailName.6 = STRING: [5jr9aom4]
KSPLICE-UPTRACK-MIB::kspliceavailName.7 = STRING: [iifdtqom]
KSPLICE-UPTRACK-MIB::kspliceavailName.8 = STRING: [6yagfyh1]
KSPLICE-UPTRACK-MIB::kspliceavailName.9 = STRING: [bqc6pn0b]
KSPLICE-UPTRACK-MIB::kspliceavailName.10 = STRING: [sy14t1rw]
KSPLICE-UPTRACK-MIB::kspliceavailName.11 = STRING: [ayo20d8s]
KSPLICE-UPTRACK-MIB::kspliceavailName.12 = STRING: [ur5of4nd]
KSPLICE-UPTRACK-MIB::kspliceavailName.13 = STRING: [ue4dtk2k]
KSPLICE-UPTRACK-MIB::kspliceavailName.14 = STRING: [wy52x339]
KSPLICE-UPTRACK-MIB::kspliceavailName.15 = STRING: [qsajn0ce]
KSPLICE-UPTRACK-MIB::kspliceavailName.16 = STRING: [5tx9tboo]
KSPLICE-UPTRACK-MIB::kspliceavailName.17 = STRING: [2nve5xek]
KSPLICE-UPTRACK-MIB::kspliceavailName.18 = STRING: [w7ik1ka8]
KSPLICE-UPTRACK-MIB::kspliceavailName.19 = STRING: [9ky2kan5]
KSPLICE-UPTRACK-MIB::kspliceavailName.20 = STRING: [zjr4ahvv]
KSPLICE-UPTRACK-MIB::kspliceavailName.21 = STRING: [j0mkxnwg]
KSPLICE-UPTRACK-MIB::kspliceavailName.22 = STRING: [mvu2clnk]
KSPLICE-UPTRACK-MIB::kspliceavailName.23 = STRING: [rc8yh417]
KSPLICE-UPTRACK-MIB::kspliceavailName.24 = STRING: [0zfhziax]
KSPLICE-UPTRACK-MIB::kspliceavailName.25 = STRING: [ns82h58y]
KSPLICE-UPTRACK-MIB::kspliceavailDesc.0 = STRING: Clear garbage data on the kernel stack when handling signals.
KSPLICE-UPTRACK-MIB::kspliceavailDesc.1 = STRING: CVE-2011-1160: Information leak in tpm driver.
KSPLICE-UPTRACK-MIB::kspliceavailDesc.2 = STRING: CVE-2011-1585: Authentication bypass in CIFS.
KSPLICE-UPTRACK-MIB::kspliceavailDesc.3 = STRING: CVE-2011-2484: Denial of service in taskstats subsystem.
KSPLICE-UPTRACK-MIB::kspliceavailDesc.4 = STRING: CVE-2011-2496: Local denial of service in mremap().
KSPLICE-UPTRACK-MIB::kspliceavailDesc.5 = STRING: CVE-2009-4067: Buffer overflow in Auerswald usb driver.
KSPLICE-UPTRACK-MIB::kspliceavailDesc.6 = STRING: CVE-2011-2695: Off-by-one errors in the ext4 filesystem.
KSPLICE-UPTRACK-MIB::kspliceavailDesc.7 = STRING: CVE-2011-2699: Predictable IPv6 fragment identification numbers.
KSPLICE-UPTRACK-MIB::kspliceavailDesc.8 = STRING: CVE-2011-2723: Remote denial of service vulnerability in gro.
KSPLICE-UPTRACK-MIB::kspliceavailDesc.9 = STRING: CVE-2011-2942: Regression in bridged ethernet devices.
KSPLICE-UPTRACK-MIB::kspliceavailDesc.10 = STRING: CVE-2011-1833: Information disclosure in eCryptfs.
KSPLICE-UPTRACK-MIB::kspliceavailDesc.11 = STRING: CVE-2011-3191: Memory corruption in CIFSFindNext.
KSPLICE-UPTRACK-MIB::kspliceavailDesc.12 = STRING: CVE-2011-3209: Denial of Service in clock implementation.
KSPLICE-UPTRACK-MIB::kspliceavailDesc.13 = STRING: CVE-2011-3188: Weak TCP sequence number generation.
KSPLICE-UPTRACK-MIB::kspliceavailDesc.14 = STRING: CVE-2011-3363: Remote denial of service in cifs_mount.
KSPLICE-UPTRACK-MIB::kspliceavailDesc.15 = STRING: CVE-2011-4110: Null pointer dereference in key subsystem.
KSPLICE-UPTRACK-MIB::kspliceavailDesc.16 = STRING: CVE-2011-1162: Information leak in TPM driver.
KSPLICE-UPTRACK-MIB::kspliceavailDesc.17 = STRING: CVE-2011-2494: Information leak in task/process statistics.
KSPLICE-UPTRACK-MIB::kspliceavailDesc.18 = STRING: CVE-2011-2203: Null pointer dereference mounting HFS filesystems.
KSPLICE-UPTRACK-MIB::kspliceavailDesc.19 = STRING: CVE-2011-4077: Buffer overflow in xfs_readlink.
KSPLICE-UPTRACK-MIB::kspliceavailDesc.20 = STRING: CVE-2011-4132: Denial of service in Journaling Block Device layer.
KSPLICE-UPTRACK-MIB::kspliceavailDesc.21 = STRING: CVE-2011-4330: Buffer overflow in HFS file name translation logic.
KSPLICE-UPTRACK-MIB::kspliceavailDesc.22 = STRING: CVE-2011-4324: Denial of service vulnerability in NFSv4.
KSPLICE-UPTRACK-MIB::kspliceavailDesc.23 = STRING: CVE-2011-4325: Denial of service in NFS direct-io.
KSPLICE-UPTRACK-MIB::kspliceavailDesc.24 = STRING: CVE-2011-4348: Socking locking race in SCTP.
KSPLICE-UPTRACK-MIB::kspliceavailDesc.25 = STRING: CVE-2011-1020, CVE-2011-3637: Information leak, DoS in /proc.

And here's what happens after you run uptrack-upgrade -y, using Ksplice to fully upgrade your kernel:

$ snmpwalk -v 1 -c public -O e localhost KSPLICE-UPTRACK-MIB::kspliceStatus
KSPLICE-UPTRACK-MIB::kspliceStatus.0 = STRING: uptodate

$ snmpwalk -v 1 -c public -O e localhost KSPLICE-UPTRACK-MIB::kspliceAvailTable

$ snmpwalk -v 1 -c public -O e localhost KSPLICE-UPTRACK-MIB::ksplicePatchTable
KSPLICE-UPTRACK-MIB::ksplicepatchIndex.0 = INTEGER: 0
KSPLICE-UPTRACK-MIB::ksplicepatchIndex.1 = INTEGER: 1
KSPLICE-UPTRACK-MIB::ksplicepatchIndex.2 = INTEGER: 2
KSPLICE-UPTRACK-MIB::ksplicepatchIndex.3 = INTEGER: 3
[ . . . ]

The plugin displays that the kernel is now up-to-date.

SNMP and Enterprise Manager

Once the plugin is up and running, you can monitor your system using Oracle Enterprise Manager. Specifically, you can create an SNMP Adapter to allow Enterprise Manager Management Agents to query the status of Ksplice on each system with the plugin installed. Check out our documentation on SNMP support in Enterprise Manager to get started, including section 22.6, "About Metric Extensions".

This plugin represents the first step in greater functionality between Ksplice and Enterprise Manager and we're excited about what is coming up. If you have any questions about the plugin or suggestions for future development, leave a comment below or drop us a line at


Notes about Enterprise Manager 12c

When connecting Enterprise Manager 12c (EM) as an SNMP client to the target Ksplice interface, there is an extra step required to fix a known issue where EM is not able to query an SNMP property, e.g:

If your target is "", run the following command on your OMS:

 $ emcli modify_target -type="host" -properties="" -on_agent 

That command will populate the missing value of the SNMPHostname property. Then stop and re-start your OMS to make sure the changes take effect.

Thursday Aug 08, 2013

CVE-2013-2224: Denial of service in sendmsg().

In September 2012, CVE-2012-3552 was reported which could allow an attacker to corrupt slab memory which could lead to a denial-of-service or possible privilege escalation depending on the target machine workload.  This bug had originally been fixed in the mainline kernel in April 2011 and was a fairly large patch for a security fix.  The RedHat backport for this fix introduced a new bug which has been assigned CVE-2013-2224 which again could allow for a denial-of-service or possible privilege escalation.  Rack911 & Tortoiselabs created a reproducer in June 2013 which would allow an unprivileged user to cause a denial-of-service.

RedHat have not yet released a kernel with this CVE fixed, but CentOS have released a custom kernel with the vendor fix for CentOS 6.

We have just released a Ksplice update to address this issue for releases 5 and 6 of Oracle Linux, RedHat Enterprise Linux, CentOS and Scientific Linux.  We recommend that all users of Ksplice on these distributions install this zero-downtime update.

[Read More]

Friday May 31, 2013

CVE-2013-2850: Remote heap buffer overflow in iSCSI target subsystem.

We have just released a rebootless update to deal with a critical security vulnerability:

CVE-2013-2850: Remote heap buffer overflow in iSCSI target subsystem.

If an iSCSI target is configured and listening on the network, a remote
attacker can corrupt heap memory, and gain kernel execution control over
the system and gain kernel code execution.

As this vulnerability is exploitable by remote users, Ksplice is issuing an update for all affected kernels immediately.

This update was embargoed for release until today (May 30th), when the information regarding this vulnerability has been made public. We are pushing updates for Ubuntu Precise, Quantal, and Raring, as well as for Debian Wheezy, Fedora 17 and Fedora 18. This bug was introduced in version 3.1 of the Linux kernel and so does not affect Oracle UEK kernels, or any RedHat 6 derivatives or earlier.

We recommend Oracle Linux Premier Support for receiving rebootless kernel updates via Ksplice.

Wednesday May 15, 2013

Ksplice update for CVE-2013-2094

This is a 0-day local privilege escalation found by Tommi Rantala while fuzzing the kernel using Trinity. The cause of that oops was patched in 3.8.10 in commit 8176cced706b5e5d15887584150764894e94e02f.

'spender' on Reddit has an interesting writeup on the details of this exploit.

We've already shipped this for Fedora 17 and 18 for the 3.8 kernel, and an update for Ubuntu 13.04 will ship as soon as Canonical releases their kernel.

We have a policy of only shipping updates that the vendor has shipped, but in this case we are shipping an update for this CVE for Oracle's UEK2 kernel early. Oracle is in the process of preparing an updated UEK2 kernel with the same fix and will be released through the normal channels.

All customers with Oracle Linux Premier Support should use Ksplice to update their kernel as soon as possible.

[EDITED 2013-05-15]: We have now released an early update for Oracle RHCK 6, RedHat Enterprise Linux 6, Scientific Linux 6 and CentOS 6.

[EDITED 2013-05-15]: We have released an early update for Wheezy. Additionally, Ubuntu Raring, Quantal and Precise have released their kernel, so we have released updates for them.

Tuesday Apr 02, 2013

Ksplice Inspector

With so many kernel updates released, it can be difficult to keep track. At Oracle, we monitor kernels on a daily basis and provide bug and security updates administrators can apply without a system reboot. To help out, the Ksplice team at Oracle has produced the Ksplice Inspector, a web tool to show you the updates Ksplice can apply to your kernel with zero downtime.

The Ksplice Inspector is freely available to everyone. If you're running any Ksplice supported kernel, whether it is Oracle's Unbreakable Enterprise Kernel, a Red Hat compatible kernel, or the kernel of one of our supported desktop distributions, visit and follow the instructions and you'll see a list of all the available Ksplice updates for your kernel.

But what if you're running a system without a browser, or one without a GUI at all? We've got you covered: you can get the same information from our API through the command line. Just run the following command:

(uname -s; uname -m; uname -r; uname -v) | \
curl \
-L -H "Accept: text/text" --data-binary @-

Once you've seen all the updates available for your kernel, you can quickly patch them all with Ksplice. If you're an Oracle Linux Premier Support customer, access to Ksplice is included with your subscription and available through the Unbreakable Linux Network. If you're running Red Hat Enterprise Linux and you would like to check it out, you can try Ksplice free for 30 days.

Let us know what you think by commenting below or sending us an email at

Friday Nov 09, 2012

Introducing RedPatch

The Ksplice team is happy to announce the public availability of one of our git repositories, RedPatch. RedPatch contains the source for all of the changes Red Hat makes to their kernel, one commit per fix and we've published it on With RedPatch, you can access the broken-out patches using git, browse them online via gitweb, and freely redistribute the source under the terms of the GPL. This is the same policy we provide for Oracle Linux and the Unbreakable Enterprise Kernel (UEK). Users can freely access the source, view the commit logs and easily identify the changes that are relevant to their environments.

To understand why we've created this project we'll need a little history. In early 2011, Red Hat changed how they released their kernel source, going from a tarball that had individual patch files to shipping the kernel source as one giant tarball with a single patch for all Red Hat-introduced changes. For most people who work in the kernel this is merely an inconvenience; driver developers and other out-of-kernel module developers can see the end result to make sure their module still performs as expected.

For Ksplice, we build individual updates for each change and rely on source patches that are broken-out, not a giant tarball. Otherwise, we wouldn’t be able to take the right patches to create individual updates for each fix, and to skip over the noise — like a change that speeds up bootup — which is unnecessary for an already-running system. We’ve been taking the monolithic Red Hat patch tarball and breaking it into smaller commits internally ever since they introduced this change.

At Oracle, we feel everyone in the Linux community can benefit from the work we already do to get our jobs done, so now we’re sharing these broken-out patches publicly. In addition to RedPatch, the complete source code for Oracle Linux and the Oracle Unbreakable Enterprise Kernel (UEK) is available from both ULN and our public yum server, including all security errata.

Check out RedPatch and subscribe to for discussion about the project. Also, drop us a line and let us know how you're using RedPatch!

Tuesday Oct 18, 2011

The Ksplice Pointer Challenge

Back when Ksplice was just a research project at MIT, we all spent a lot of time around the student computing group, SIPB. While there, several precocious undergrads kept talking about how excited they were to take 6.828, MIT's operating systems class.

"You really need to understand pointers for this class," we cautioned them. "Reread K&R Chapter 5, again." Of course, they insisted that they understood pointers and didn't need to. So we devised a test.

Ladies and gentlemen, I hereby do officially present the Ksplice Pointer Challenge, to be answered without the use of a computer:

What does this program print?

#include <stdio.h>
int main() {
  int x[5];
  printf("%p\n", x);
  printf("%p\n", x+1);
  printf("%p\n", &x);
  printf("%p\n", &x+1);
  return 0;

This looks simple, but it captures a surprising amount of complexity. Let's break it down.

To make this concrete, let's assume that x is stored at address 0x7fffdfbf7f00 (this is a 64-bit system). We've hidden each entry so that you have to click to make it appear -- I'd encourage you to think about what the line should output, before revealing the answer.

printf("%p\n", x);
What will this print?

Well, x is an array, right? You're no stranger to array syntax: x[i] accesses the ith element of x.

If we search back in the depths of our memory, we remember that x[i] can be rewritten as *(x+i). For that to work, x must be the memory location of the start of the array.

Result: printf("%p\n", x) prints 0x7fffdfbf7f00. Alright.

printf("%p\n", x+1);
What will this print?

So, x is 0x7fffdfbf7f00, and therefore x+1 should be 0x7fffdfbf7f01, right?

You're not fooled. You remember that  in C, pointer arithmetic is special and magical. If you have a pointer p to an int, p+1 actually adds sizeof(int)to p. It turns out that we need this behavior if *(x+i) is properly going to end up pointing us at the right place -- we need to move over enough to pass one entry in the array. In this case, sizeof(int) is 4.

Result: printf("%p\n", x) prints 0x7fffdfbf7f04. So far so good.

printf("%p\n", &x);
What will this print?

Well, let's see. & basically means "the address of", so this is like asking "Where does x live in memory?" We answered that earlier, didn't we? x lives at 0x7fffdfbf7f00, so that's what this should print.

But hang on a second... if &x is 0x7fffdfbf7f00, that means that it lives at 0x7fffdfbf7f00. But when we print x, we also get 0x7fffdfbf7f00. So x == &x.

How can that possibly work? If x is a pointer that lives at 0x7fffdfbf7f00, and also points to 0x7fffdfbf7f00, where is the actual array stored?

Thinking about that, I draw a picture like this:

That can't be right.

So what's really going on here? Well, first off, anyone who ever told you that a pointer and an array were the same thing was lying to you. That's our fallacy here. If x were a pointer, and x == &x, then yes, we would have something like the picture above. But x isn't a pointer -- x is an array!

And it turns out that in certain situations, an array can automatically "decay" into a pointer. Into &x[0], to be precise. That's what's going on in examples 1 and 2. But not here. So &x does indeed print the address of x.

Result: printf("%p\n", &x) prints 0x7fffdfbf7f00.

Aside: what is the type of &x[0]? Well, x[0] is an int, so &x[0] is "pointer to int". That feels right.

printf("%p\n", &x+1);
What will this print?

Ok, now for the coup de grace. x may be an array, but &x is definitely a pointer. So what's &x+1?

First, another aside: what is the type of &x? Well... &x is a pointer to an array of 5 ints. How would you declare something like that?

Let's fire up cdecl and find out:

cdecl> declare y as array 5 of int;
int y[5]
cdecl> declare y as pointer to array 5 of int;
int (*y)[5]

Confusing syntax, but it works:
int (*y)[5] = &x; compiles without error and works the way you'd expect.

But back to the question at hand. Pointer arithmetic tells us that &x+1 is going to be the address of x + sizeof(x). What's sizeof(x)? Well, it's an array of 5 ints. On this system, each int is 4 bytes, so it should be 20 bytes, or 0x14.

Result &x+1 prints 0x7fffdfbf7f14.

And thus concludes the Ksplice pointer challenge.

What's the takeaway? Arrays are not pointers (though they sometimes pretend to be!). More generally, C is subtle. Oh, and 6.828 students, if you're having trouble with Lab 5, it's probably because of a bug in your Lab 2.

P.S. If you're interested in hacking on low-level systems at a place where your backwards-and-forwards knowledge of C semantics will be more than just an awesome party trick, we're looking to hire kernel hackers for the Ksplice team.

We're based in beautiful Cambridge, Mass., though working remotely is definitely an option. Send me an email at with a resume and/or a github link if you're interested!

Monday Jun 27, 2011

Building a physical CPU load meter

I built this analog CPU load meter for my dev workstation:

Physical CPU load meter

All I did was drill a few holes into the CPU and probe the power supply lines...

Okay, I lied. This is actually a fun project that would make a great intro to embedded electronics, or a quick afternoon hack for someone with a bit of experience.

The parts

The main components are:

  • Current meter: I got this at MIT Swapfest. The scale printed on the face is misleading: the meter itself measures only about 600 microamps in each direction. (It's designed for use with a circuit like this one). We can determine the actual current scale by connecting (in series) the analog meter, a variable resistor, and a digital multimeter, and driving them from a 5 volt power supply. This lets us adjust and reliably measure the current flowing through the analog meter.

  • Arduino: This little board has a 16 MHz processor, with direct control of a few dozen input/output pins, and a USB serial interface to a computer. In our project, it will take commands over USB and produce the appropriate amount of current to drive the meter. We're not even using most of the capabilities of the Arduino, but it's hard to beat as a platform for rapid development.

  • Resistor: The Arduino board is powered over USB; its output pins produce 5 volts for a logic 'high'. We want this 5 volt potential to push 600 microamps of current through the meter, according to the earlier measurement. Using Ohm's law we can calculate that we'll need a resistance of about 8.3 kilohms. Or you can just measure the variable resistor from earlier.

We'll also use some wire, solder, and tape.

Building it

The resistor goes in series with the meter. I just soldered it directly to the back:

Back of the meter

Some tape over these components prevents them from shorting against any of the various junk on my desk. Those wires run to the Arduino, hidden behind my monitor, which is connected to the computer by USB:

The Arduino

That's it for hardware!

Code for the Arduino

The Arduino IDE will compile code written in a C-like language and upload it to the Arduino board over USB. Here's our program:

#define DIRECTION 2
#define MAGNITUDE 3

void setup() {

void loop() {
    int x =;
    if (x == -1)

    if (x < 128) {
        digitalWrite(DIRECTION, LOW);
        analogWrite (MAGNITUDE, 2*(127 - x));
    } else {
        digitalWrite(DIRECTION, HIGH);
        analogWrite (MAGNITUDE, 255 - 2*(x - 128));

When it turns on, the Arduino will execute setup() once, and then call loop() over and over, forever. On each iteration, we try to read a byte from the serial port. A value of -1 indicates that no byte is available, so we return and try again a moment later. Otherwise, we translate a byte value between 0 to 255 into a meter deflection between −600 and 600 microamps.

Pins 0 and 1 are used for serial communication, so I connected the meter to pins 2 and 3, and named them DIRECTION and MAGNITUDE respectively. When we call analogWrite on the MAGNITUDE pin with a value between 0 and 255, we get a proportional voltage between 0 and 5 volts. Actually, the Arduino fakes this by alternating between 0 and 5 volts very rapidly, but our meter is a slow mechanical object and won't know the difference.

Suppose the MAGNITUDE pin is at some intermediate voltage between 0 and 5 volts. If the DIRECTION pin is low (0 V), conventional current will flow from MAGNITUDE to DIRECTION through the meter. If we set DIRECTION high (5 V), current will flow from DIRECTION to MAGNITUDE. So we can send current through the meter in either direction, and we can control the amount of current by controlling the effective voltage at MAGNITUDE. This is all we need to make the meter display whatever reading we want.

Code for the Linux host

On Linux we can get CPU load information from the proc special filesystem:

keegan@lyle$ head -n 1 /proc/stat
cpu  965348 22839 479136 88577774 104454 5259 24633 0 0

These numbers tell us how much time the system's CPUs have spent in each of several states:

  1. user: running normal user processes
  2. nice: running user processes of low priority
  3. system: running kernel code, often on behalf of user processes
  4. idle: doing nothing because all processes are sleeping
  5. iowait: doing nothing because all runnable processes are waiting on I/O devices
  6. irq: handling asynchronous events from hardware
  7. softirq: performing tasks deferred by irq handlers
  8. steal: not running, because we're in a virtual machine and some other VM is using the physical CPU
  9. guest: acting as the host for a running virtual machine

The numbers in /proc/stat are cumulative totals since boot, measured in arbitrary time units. We can read the file twice and subtract, in order to get a measure of where CPU time was spent recently. Then we'll use the fraction of time spent in states other than idle as a measure of CPU load, and send this to the Arduino.

We'll do all this with a small Python script. The pySerial library lets us talk to the Arduino over USB serial. We'll configure it for 57,600 bits per second, the same rate specified in the Arduino's setup() function. Here's the code:

#!/usr/bin/env python

import serial
import time

port = serial.Serial('/dev/ttyUSB0', 57600)

old = None
while True:
    with open('/proc/stat') as stat:
        new = map(float, stat.readline().strip().split()[1:])
    if old is not None:
        diff = [n - o for n, o in zip(new, old)]
        idle = diff[3] / sum(diff)
        port.write(chr(int(255 * (1 - idle))))
    old = new

That's it!

That's all it takes to make a physical, analog CPU meter. It's been done before and will be done again, but we're interested in what you'd do (or have already done!) with the concept. You could measure website hits, or load across a whole cluster, or your profits from trading Bitcoins. One standard Arduino can run at least six meters of this type (being the number of pins which support analogWrite), and a number of switches, knobs, buzzers, and blinky lights besides. If your server room has a sweet control panel, we'd love to see a picture!


Saturday May 07, 2011

Improving your social life with git

I've used RCS, CVS, Subversion, and Perforce. Then I discovered distributed version control systems. Specifically, I discovered git. Lightweight branching? Clean histories? And you can work offline? It seemed to be too good to be true.

But one important question remained unanswered: Can git help improve my social life? Today, we present to you the astonishing results: why yes, yes it can. Introducing gitionary. The brainchild of Liz Denys and Nelson Elhage, it's what you get when you mash up Pictionary, git, and some of your nerdiest friends.

Contestants get randomly assigned git commands and are asked to illustrate them. They put this bold idea to the test quite some time ago in an experiment/party known only as git drunk (yes, some alcohol may have been involved), and we've reproduced selected results below. Each drawing is signed with the artist's username, and the time it took our studio audience to correctly guess the git command.

Suffice it to say, git is complicated. Conceptually, git is best modeled as a set of transformations on a directed acyclic graph, and sometimes it's easiest to illustrate it that way, as with this illustration of git rebase:

On other occasions, a more literal interpretation works best:

But not always:

A little creativity never hurts, though, which is why this is my personal favorite from the evening:

You can find more details (and the full photoset from the evening) at Liz's writeup of the event. Download the gitionary cards, print them double-sided on card stock, and send us pictures of your own gitionary parties!


Thursday Mar 31, 2011

Security Advisory: Plumber Injection Attack in Bowser's Castle

Advisory Name:
Plumber Injection Attack in Bowser's Castle
Release Date:
Bowser's Castle
Affected Versions:
Super Mario Bros., Super Mario Bros.: The Lost Levels
Advisory URL:

Vulnerability Overview

Multiple versions of Bowser's Castle are vulnerable to a plumber injection attack. An Italian plumber could exploit this bug to bypass security measures (walk through walls) in order to rescue Peach, to defeat Bowser, or for unspecified other impact.


This vulnerability is demonstrated by "happylee-supermariobros,warped.fm2". Attacks using this exploit have been observed in the wild, and multiple other exploits are publicly available.

Affected Versions

Versions of Bowser's Castle as shipped in Super Mario Bros. and Super Mario Bros.: The Lost Levels are affected.


An independently developed patch is available:

--- a/smb.asm   1985-09-13 12:00:00.000000000 +0900
+++ b/smb.asm   2011-04-01 12:00:00.000000000 -0400
@@ -12009,12015 +12009,12015 @@
         ldy $04
         cpy #$05
         bcc *+$09
-        lda $45
+        lda #$01
         sta $00
         jmp $df4b
         jsr $dec4

A binary hot patch to apply the update to an existing version is also available.

All users are advised to upgrade.


For users unable to apply the recommended fix, a number of mitigations are possible to reduce the impact of the vulnerability.


Potential mitigations include:


The vulnerability was originally discovered by Mario and Luigi, of Mario Bros. Security Research.

The provided patch and this advisory were prepared by Lakitu Cloud Security, Inc. The hot patch was developed in collaboration with Ksplice, Inc.

Product Overview

Bowser's Castle is King Bowser's home and the base of operations for the Koopa Troop. Bowser's Castle is the final defense against assaults by Mario to kidnap Princess Peach, and is guarded by Bowser's most powerful minions.


Wednesday Mar 16, 2011

disown, zombie children, and the uninterruptible sleep

PID 1 Riding, by Albrecht Dürer

It's the end of the day on Friday. On your laptop, in an ssh session on a work machine, you check on, which has been running all day and has another 8 or 9 hours to go. You start to close your laptop.

You freeze for a second and groan.

This was supposed to be running under a screen session. You know that if you kill the ssh connection, that'll also kill What are you going to do? Leave your laptop for the weekend? Kill the job, losing the last 8 hours of work?

You think about what does for a minute, and breathe a sigh of relief. The output is written to a file, so you don't care about terminal output. This means you can use disown.

How does this little shell built-in let your jobs finish even when you kill the parent process (the shell inside the ssh connection)?

Dissecting disown

As we'll see, disown synthesizes 3 big UNIX concepts: signals, process states, and job control.

The point of disowning a process is that it will continue to run even when you exit the shell that spawned it. Getting this to work requires a prelude. The steps are:

  1. suspend the process with ctl-Z.
  2. background with bg.
  3. disown the job.

What does each of these steps accomplish?

First, here's a summary of the states that a process can be in, from the ps man page:

       Here are the different values that the s, stat and state output specifiers (header "STAT" or "S")
       will display to describe the state of a process.
       D    Uninterruptible sleep (usually IO)
       R    Running or runnable (on run queue)
       S    Interruptible sleep (waiting for an event to complete)
       T    Stopped, either by a job control signal or because it is being traced.
       W    paging (not valid since the 2.6.xx kernel)
       X    dead (should never be seen)
       Z    Defunct ("zombie") process, terminated but not reaped by its parent.

       For BSD formats and when the stat keyword is used, additional characters may be displayed:
       <    high-priority (not nice to other users)
       N    low-priority (nice to other users)
       L    has pages locked into memory (for real-time and custom IO)
       s    is a session leader
       l    is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
       +    is in the foreground process group

And here is a transcript of the steps to disown To the right of each step is some useful ps output, in particular the parent process id (PPID), what process state our long job is in (STAT), and the controlling terminal (TT). I've highlighted the interesting changes:

Shell 1: disown Shell 2: monitor with ps
1. Start program
$ sh
$ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
26298 26145 S+   pts/0    sh
2. Suspend program with Ctl-z
[1]+  Stopped     sh
$ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
26298 26145 T    pts/0    sh
3. Resume program in background
$ bg
[1]+ sh &
$ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
26298 26145 S    pts/0    sh
4. disown job 1, our program
$ disown %1
$ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
26298 26145 S    pts/0    sh
5. Exit the shell
$ exit
$ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
26298     1 S    ?        sh

Putting this information together:

  1. When we run from the command line, its parent is the shell (PID 26145 in this example). Even though it looks like it is running as we watch it in the terminal, it mostly isn't; is waiting on some resource or event, so it is in process state S for interruptible sleep. It is in fact in the foreground, so it also gets a +.
  2. First, we suspend the program with Ctl-z. By ``suspend'', we mean send it the SIGTSTP signal, which is like SIGSTOP except that you can install your own signal handler for or ignore it. We see proof in the state change: it's now in T for stopped.
  3. Next, bg sets our process running again, but in the background, so we get the S for interruptible sleep, but no +.
  4. Finally, we can use disown to remove the process from the jobs list that our shell maintains. Our process has to be active when it is removed from the list or it'll get reaped when we kill the parent shell, which is why we needed the bg step.
  5. When we exit the shell, we are sending it a SIGHUP, which it propagates to all children in the jobs table**. By default, a SIGHUP will terminate a process. Because we removed our job from the jobs table, it doesn't get the SIGHUP and keeps on running (STAT S). However, since its parent the shell died, and the shell was the session leader in charge of the controlling tty, it doesn't have a tty anymore (TT ?). Additionally, our long job needs a new parent, so init, with PID 1, becomes the new parent process.
**This is not always true, as it turns out. In the bash shell, for example, there is a huponexit shell option. If this option is disabled, a SIGHUP to the shell isn't propagated to the children. This means if you have a backgrounded, active process (you followed steps 1, 2, and 3 above, or you started the process backgrounded with ``&'') and you exit the shell, you don't have to use disown for the process to keep running. You can check or toggle the huponexit shell option with the shopt shell built-in.

And that is disown in a nutshell.

What else can we learn about process states?

Dissecting disown presents enough interesting tangents about signals, process states, and job control for a small novel. Focusing on process states for this post, here are a few such tangents:

1. There are a lot of process states and modifiers. We saw some interruptible sleeps and suspended processes with disown, but what states are most common?

Using my laptop as a data source and taking advantage of ps format specifiers, we can get counts for the different process states:

jesstess@aja:~$ ps -e h -o stat | sort | uniq -c | sort -rn
     90 S
     31 Sl
     17 Ss
      9 Ss+
      8 Ssl
      4 S<
      3 S+
      2 SNl
      1 S<sl
      1 S<s
      1 SN
      1 SLl
      1 R+

So the vast majority are in an interruptible sleep (S), and a few processes are extra nice (N) and extra mean (<).

We can drill down on process ``niceness'', or scheduling priority, with the ni format specifier to ps:

jesstess@aja:~$ ps -e h -o ni | sort -n | uniq -c
      1 -11
      2  -5
      1  -4
      2  -2
      4   -
    156   0
      1   1
      1   5
      1  10

The numbers range from 19 (super friendly, low scheduling priority) to -20 (a total bully, high scheduling priority). The 6 processes with negative numbers are the 6 with a < process state modifier in the ``ps -e h -o stat'' output, and the 3 with positive numbers have the Ns. Most processes don't run under a special scheduling priority.

Why is almost nothing actually running?

In the ``ps -e h -o stat'' output above, only 1 process was marked as R running or runnable. This is a multi-processor machine, and there are over 150 other processes, so why isn't something running on the other processor?

The answer is that on an unloaded system, most processes really are waiting on an event or resource, so they can't run. On the laptop where I ran these tests, uptime tells us that we have a load average under 1:

jesstess@aja:~$ uptime
 13:09:10 up 16 days, 14:09,  5 users,  load average: 0.92, 0.87, 0.82

So we'd only expect to see 1 process in the R state at any given time for that load.

If we hop over to a more loaded machine -- a shell machine at MIT -- things are a little more interesting:

dr-wily:~> ps -e -o stat,cmd | awk '{if ($1 ~/R/) print}'
R+   /mit/barnowl/arch/i386_deb50/bin/barnowl.real.zephyr3
R+   ps -e -o stat,cmd
R+   w
dr-wily:~> uptime
 23:23:16 up 22 days, 20:09, 132 users,  load average: 3.01, 3.66, 3.43
dr-wily:~> grep processor /proc/cpuinfo
processor	: 0
processor	: 1
processor	: 2
processor	: 3

The machine has 4 processors. On average, 3 or 4 processors have processes running (in the R state). To get a sense of how the running processes change over time, throw the ps line under watch:

watch -n 1 "ps -e -o stat,cmd | awk '{if (\$1 ~/R/) print}'"

We get something like:

watching the changing output of ps

2. What about the zombies?

Noticeably absent in the process state summaries above are zombie processes (STAT Z) and processes in uninterruptible sleep (STAT D).

A process becomes a zombie when it has completed execution but hasn't been reaped by its parent. If a program produces long-lived zombies, this is usually a bug; zombies are undesirable because they take up process IDs, which are a limited resource.

I had to dig around a bit to find real examples of zombies. The winners were old barnowl zephyr clients (zephyr is a popular instant messaging system at MIT):

jesstess@linerva:~$ ps -e h -o stat,cmd | awk '{if ($1 ~/Z/) print}'
Z+   [barnowl] <defunct>
Z+   [barnowl] <defunct>

However, since all it takes to produce a zombie is a child exiting without the parent reaping it, it's easy to construct our own zombies of limited duration:

jesstess@aja:~$ cat zombie.c 
#include <sys/types.h>

int main () {
    pid_t child_pid = fork();
    if (child_pid > 0) {
    return 0;
jesstess@aja:~$ gcc -o zombie zombie.c
jesstess@aja:~$ ./zombie
[1]+  Stopped                 ./zombie
jesstess@aja:~$ ps -o stat,cmd $(pgrep -f zombie)
T    ./zombie
Z    [zombie] <defunct>

When you run this script, the parent dies after 60 seconds, init becomes the zombie child's new parent, and init quickly reaps the child by making a wait system call on the child's PID, which removes it from the system process table.

3. What about the uninterruptible sleeps?

A process is put in an uninterruptible sleep (STAT D) when it needs to wait on something (typically I/O) and shouldn't be handling signals while waiting. This means you can't kill it, because all kill does is send it signals. This might happen in the real world if you unplug your NFS server while other machines have open network connections to it.

We can create our own uninterruptible processes of limited duration by taking advantage of the vfork system call. vfork is like fork, except the address space is not copied from the parent into the child, in anticipation of an exec which would just throw out the copied data. Conveniently for us, when you vfork the parent waits uninterruptibly (by way of wait_on_completion) on the child's exec or exit:

jesstess@aja:~$ cat uninterruptible.c 
int main() {
    return 0;
jesstess@aja:~$ gcc -o uninterruptible uninterruptible.c
jesstess@aja:~$ echo $$
jesstess@aja:~$ ./uninterruptible

and in another shell:

jesstess@aja:~$ ps -o ppid,pid,stat,cmd $(pgrep -f uninterruptible)

13291  1972 D+   ./uninterruptible
 1972  1973 S+   ./uninterruptible

We see the child (PID 1973, PPID 1972) in an interruptible sleep and the parent (PID 1972, PPID 13291 -- the shell) in an uninterruptible sleep while it waits for 60 seconds on the child.

One neat (mischievous?) thing about this script is that processes in an uninterruptible sleep contribute to the load average for a machine. So you could run this script 100 times to temporarily give a machine a load average elevated by 100, as reported by uptime.

It's a family affair

Signals, process states, and job control offer a wealth of opportunities for exploration on a Linux system: we've already disowned children, killed parents, witnessed adoption (by init), crafted zombie children, and more. If this post inspires fun tangents or fond memories, please share in the comments!

*Albrecht had some help from Adam and Photoshop Elements. Larger version here.
Props to Nelson for his boundless supply of sysadmin party tricks, which includes this vfork example.


Monday Feb 21, 2011

Mapping your network with nmap

If you run a computer network, be it home WiFi or a global enterprise system, you need a way to investigate the machines connected to your network. When ping and traceroute won't cut it, you need a port scanner.

nmap is the port scanner. It's a powerful, sophisticated tool, not to mention a movie star. The documentation on nmap is voluminous: there's an entire book, with a free online edition, as well as a detailed manpage. In this post I'll show you just a few of the cool things nmap can do.

The law and ethics of port scanning are complex. A network scan can be detected by humans or automated systems, and treated as a malicious act, resulting in real costs to the target. Depending on the options you choose, the traffic generated by nmap can range from "completely innocuous" to "watch out for admins with baseball bats". A safe rule is to avoid scanning any network without the explicit permission of its administrators — better yet if that's you.

You'll need root privileges on the scanning system to run most interesting nmap commands, because nmap likes to bypass the standard network stack when synthesizing esoteric packets.

A firm handshake

Let's start by scanning my home network for web and SSH servers:

root@lyle# nmap -sS -p22,80
Nmap scan report for
22/tcp filtered ssh
80/tcp open     http

Nmap scan report for
22/tcp filtered ssh
80/tcp filtered http

Nmap scan report for
22/tcp open   ssh
80/tcp closed http

Nmap done: 256 IP addresses (3 hosts up) scanned in 6.05 seconds

We use -p22,80 to ask for a scan of TCP ports 22 and 80, the most popular ports for SSH and web servers respectively. If you don't specify a -p option, nmap will scan the 1,000 most commonly-used ports. You can give a port range like -p1-5000, or even use -p- to scan all ports, but your scan will take longer.

We describe the subnet to scan using CIDR notation. We could equivalently write

The option -sS requests a TCP SYN scan. nmap will start a TCP handshake by sending a SYN packet. Then it waits for a response. If the target replies with SYN/ACK, then some program is accepting our connection. A well-behaved client should respond with ACK, but nmap will simply record an open port and move on. This makes an nmap SYN scan both faster and more stealthy than a normal call to connect().

If the target replies with RST, then there's no service on that port, and nmap will record it as closed. Or we might not get a response at all. Perhaps a firewall is blocking our traffic, or the target host simply doesn't exist. In that case the port state is recorded as filtered after nmap times out.

You can scan UDP ports by passing -sU. There's one important difference from TCP: Since UDP is connectionless, there's no particular response required from an open port. Therefore nmap may show UDP ports in the ambiguous state open|filtered, unless you can prod the target application into sending you data (see below).

To save time, nmap tries to confirm that a target exists before performing a full scan. By default it will send ICMP echo (the ubiquitous "ping") as well as TCP SYN and ACK packets. You can use the -P family of options to customize this host-discovery phase.

Weird packets

nmap has the ability to generate all sorts of invalid, useless, or just plain weird network traffic. You can send a TCP packet with no flags at all (null scan, -sN) or one that's lit up "like a Christmas tree" (Xmas scan, -sX). You can chop your packets into little fragments (--mtu) or send an invalid checksum (--badsum). As a network administrator, you should know if the bad guys can confuse your security systems by sending weird packets. As the manpage advises, "Let your creative juices flow".

There's a second benefit to sending weird traffic: We can identify the target's operating system by seeing how it responds to unusual situations. nmap will perform this OS detection if you specify the -O flag:

root@lyle# nmap -sS -O
Nmap scan report for
Not shown: 998 filtered ports
23/tcp   closed telnet
80/tcp   open   http
MAC Address: 00:1C:10:33:6B:99 (Cisco-Linksys)
Device type: WAP|broadband router
Running: Linksys embedded, Netgear embedded, Netgear VxWorks 5.X
Nmap scan report for
Not shown: 998 filtered ports
139/tcp   open  netbios-ssn
445/tcp   open  microsoft-ds
MAC Address: 00:1F:3A:7F:7C:26 (Hon Hai Precision Ind.Co.)
Warning: OSScan results may be unreliable because we could not find
  at least 1 open and 1 closed port
Device type: general purpose
Running (JUST GUESSING) : Microsoft Windows Vista|2008|7 (98%)
Nmap scan report for
All 1000 scanned ports on are closed
MAC Address: 7C:61:93:53:9F:E5 (Unknown)
Too many fingerprints match this host to give specific OS details
TCP/IP fingerprint:

Since the first target has both an open and a closed port, nmap has many protocol corner cases to explore, and it easily recognizes a Linksys home router. With the second target, there's no port in the closed state, so nmap isn't as confident. It guesses a Windows OS, which seems especially plausible given the open NetBIOS ports. In the last case nmap has no clue, and gives us some raw findings only. If you know the OS of the target, you can contribute this fingerprint and help make nmap even better.

Behind the port

It's all well and good to discover that port 1234 is open, but what's actually listening there? nmap has a version detection subsystem that will spam a host's open ports with data in hopes of eliciting a response. Let's pass -sV to try this out:

root@lyle# nmap -sS -sV
Nmap scan report for
Not shown: 998 closed ports
443/tcp  open  ssh     OpenSSH 5.5p1 Debian 6 (protocol 2.0)
8888/tcp open  http    thttpd 2.25b 29dec2003

nmap correctly spotted an HTTP server on non-standard port 8888. The SSH server on port 443 (usually HTTPS) is also interesting. I find this setup useful when connecting from behind a restrictive outbound firewall. But I've also had network admins send me worried emails, thinking my machine has been compromised.

nmap also gives us the exact server software versions, straight from the server's own responses. This is a great way to quickly audit your network for any out-of-date, insecure servers.

Since a version scan involves sending application-level probes, it's more intrusive and can cause more trouble. From the book:

In the nmap-service-probes included with Nmap the only ports excluded are TCP port 9100 through 9107. These are common ports for printers to listen on and they often print any data sent to them. So a version detection scan can cause them to print many pages full of probes that Nmap sends, such as SunRPC requests, help statements, and X11 probes.

This behavior is often undesirable, especially when a scan is meant to be stealthy.

Trusting the source

It's a common (if questionable) practice for servers or firewalls to trust certain traffic based on where it appears to come from. nmap gives you a variety of tools for mapping these trust relationships. For example, some firewalls have special rules for traffic originating on ports 53, 67, or 20. You can set the source port for nmap's TCP and UDP packets by passing --source-port.

You can also spoof your source IP address using -S, and the target's responses will go to that fake address. This normally means that nmap won't see any results. But these responses can affect the unwitting source machine's IP protocol state in a way that nmap can observe indirectly. You can read about nmap's TCP idle scan for more details on this extremely clever technique. Imagine making any machine on the Internet — or your private network — port-scan any other machine, while you collect the results in secret. Can you use this to map out trust relationships in your network? Could an attacker?

Bells and whistles

So that's an overview of a few cool nmap features. There's a lot we haven't covered, such as performance tuning, packet traces, or nmap's useful output modes like XML or ScRipT KIdd|3. There's even a full scripting engine with hundreds of useful plugins written in Lua.


Tuesday Feb 15, 2011

Happy Birthday Ksplice Uptrack!

One year ago, we announced the general availability of Ksplice Uptrack, a subscription service for rebootless kernel updates on Linux. Since then, a lot has happened!


Over 600 companies have deployed Ksplice Uptrack on more than 100,000 production systems, on all 7 continents (Antarctica was the last hold-out). More than 2 million rebootless updates have been installed!

Ksplice at the South Pole

You see the greatest system administration time and cost savings when you use Ksplice Uptrack on all of your machines, and we've designed and priced Uptrack with this large-scale usage in mind. We've been very happy to see that our customers share this view: most of our customers either rolled out Ksplice across the board upon sign-up or have substantially expanded their usage after first purchasing for a particular environment.

We've introduced several features this year to make managing large Uptrack installations easier:

  • autoinstall: the Uptrack client can be configured to install rebootless updates on its own as they become available, making kernel updates fully automatic. Autoinstall is now our most popular configuration.
  • the Uptrack API: programmatically query the state of your machines through our RESTful API.
  • Nagios plugins: easily integrate Uptrack monitoring into your existing Nagios infrastructure with plugins that monitor for out of date, inactive, and unsupported machines.

Supported kernels

We continue to expand the distributions and flavors we support. You can now use Ksplice Uptrack on (deep breath) RHEL 4 & 5, CentOS 4 & 5 (and CentOS Plus), Debian 5 & 6, Ubuntu 8.04, 9.10, 10.04, & 10.10, Virtuozzo 3 & 4, OpenVZ for EL5 and Debian, CloudLinux 5, and Fedora 13 & 14. Within these distributions, we continue to support more flavors and older kernels; our launch may have been last year, but we support kernels back to 2007 and earlier!

You've always been able to use Ksplice in virtualized environments like VMWare, Xen, and Virtuozzo, This year, we've given you even more virtualization options by adding support for virtualization kernels like Debian OpenVZ and Debian Xen. Starting with Ubuntu Maverick, we support all Ubuntu kernel flavors, including the Maverick private cloud/EC2 kernel.

Thanks to some changes at Amazon and Rackspace, you can now even use Ksplice Uptrack on stock Linux kernels in Amazon EC2 and Rackspace Cloud.

As always, you can roll out a free trial on any of these distributions, for an unlimited number of systems.

This year, we also added Fedora Desktop to our free version options, so now you can use Ksplice Uptrack on both Ubuntu Desktop and Fedora Desktop completely for free.

Reboots saved

Did you know that between RHEL 4 and 5, Red Hat released 22 new kernels this past year?

chart: reboots on RHEL this past year

Without Ksplice, that's coordination and downtime for 22 reboots, including the hat trick of font-size: 125%; ">3 kernels in 3 weeks for RHEL 5. Gartner estimates that 90% of exploited systems are exploited using known, patched vulnerabilities, so if you're not rebooting and you're not using Ksplice, you are putting your servers (and your customers) at risk.

Looking forward

What do you want to see from Ksplice this year? What features can we add to help you deploy and monitor your Uptrack installations? Tell us what you want, and we'll do our best to deliver!


Monday Jan 24, 2011

8 gdb tricks you should know

Despite its age, gdb remains an amazingly versatile and flexible tool, and mastering it can save you huge amounts of time when trying to debug problems in your code. In this post, I'll share 10 tips and tricks for using GDB to debug most efficiently.

I'll be using the Linux kernel for examples throughout this post, not because these examples are necessarily realistic, but because it's a large C codebase that I know and that anyone can download and take a look at. Don't worry if you aren't familiar with Linux's source in particular -- the details of the examples won't matter too much.

  1. break WHERE if COND

    If you've ever used gdb, you almost certainly know about the "breakpoint" command, which lets you break at some specified point in the debugged program.

    But did you know that you can set conditional breakpoints? If you add if CONDITION to a breakpoint command, you can include an expression to be evaluated whenever the program reaches that point, and the program will only be stopped if the condition is fulfilled. Suppose I was debugging the Linux kernel and wanted to stop whenever init got scheduled. I could do:

    (gdb) break context_switch if next == init_task

    Note that the condition is evaluated by gdb, not by the debugged program, so you still pay the cost of the target stopping and switching to gdb every time the breakpoint is hit. As such, they still slow the target down in relation to to how often the target location is hit, not how often the condition is met.

  2. command

    In addition to conditional breakpoints, the command command lets you specify commands to be run every time you hit a breakpoint. This can be used for a number of things, but one of the most basic is to augment points in a program to include debug output, without having to recompile and restart the program. I could get a minimal log of every mmap() operation performed on a system using:

    (gdb) b do_mmap_pgoff 
    Breakpoint 1 at 0xffffffff8111a441: file mm/mmap.c, line 940.
    (gdb) command 1
    Type commands for when breakpoint 1 is hit, one per line.
    End with a line saying just "end".
    >print addr
    >print len
    >print prot
  3. gdb --args

    This one is simple, but a huge timesaver if you didn't know it. If you just want to start a program under gdb, passing some arguments on the command line, you can just build your command-line like usual, and then put "gdb --args" in front to launch gdb with the target program and the argument list both set:

    [~]$ gdb --args pizzamaker --deep-dish --toppings=pepperoni
    (gdb) show args
    Argument list to give program being debugged when it is started is
      " --deep-dish --toppings=pepperoni".
    (gdb) b main
    Breakpoint 1 at 0x45467c: file oven.c, line 123.
    (gdb) run

    I find this especially useful if I want to debug a project that has some arcane wrapper script that assembles lots of environment variables and possibly arguments before launching the actual binary (I'm looking at you, libtool). Instead of trying to replicate all that state and then launch gdb, simply make a copy of the wrapper, find the final "exec" call or similar, and add "gdb --args" in front.

  4. Finding source files

    I run Ubuntu, so I can download debug symbols for most of the packages on my system from, and I can get source using apt-get source. But how do I tell gdb to put the two together? If the debug symbols include relative paths, I can use gdb's directory command to add the source directory to my source path:

    [~/src]$ apt-get source coreutils
    [~/src]$ sudo apt-get install coreutils-dbgsym
    [~/src]$ gdb /bin/ls
    GNU gdb (GDB) 7.1-ubuntu
    (gdb) list main
    1192    ls.c: No such file or directory.
        in ls.c
    (gdb) directory ~/src/coreutils-7.4/src/
    Source directories searched: /home/nelhage/src/coreutils-7.4:$cdir:$cwd
    (gdb) list main
    1192        }
    1193    }
    1195    int
    1196    main (int argc, char **argv)
    1197    {
    1198      int i;
    1199      struct pending *thispend;
    1200      int n_files;

    Sometimes, however, debug symbols end up with absolute paths, such as the kernel's. In that case, I can use set substitute-path to tell gdb how to translate paths:

    [~/src]$ apt-get source linux-image-2.6.32-25-generic
    [~/src]$ sudo apt-get install linux-image-2.6.32-25-generic-dbgsym
    [~/src]$ gdb /usr/lib/debug/boot/vmlinux-2.6.32-25-generic 
    (gdb) list schedule
    5519    /build/buildd/linux-2.6.32/kernel/sched.c: No such file or directory.
        in /build/buildd/linux-2.6.32/kernel/sched.c
    (gdb) set substitute-path /build/buildd/linux-2.6.32 /home/nelhage/src/linux-2.6.32/
    (gdb) list schedule
    5520    static void put_prev_task(struct rq *rq, struct task_struct *p)
    5521    {
    5522        u64 runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
    5524        update_avg(&p->se.avg_running, runtime);
    5526        if (p->state == TASK_RUNNING) {
    5527            /*
    5528             * In order to avoid avg_overlap growing stale when we are
  5. Debugging macros

    One of the standard reasons almost everyone will tell you to prefer inline functions over macros is that debuggers tend to be better at dealing with inline functions. And in fact, by default, gdb doesn't know anything at all about macros, even when your project was built with debug symbols:

    (gdb) p GFP_ATOMIC
    No symbol "GFP_ATOMIC" in current context.
    (gdb) p task_is_stopped(&init_task)
    No symbol "task_is_stopped" in current context.

    However, if you're willing to tell GCC to generate debug symbols specifically optimized for gdb, using -ggdb3, it can preserve this information:

    $ make KCFLAGS=-ggdb3
    (gdb) break schedule
    (gdb) continue
    (gdb) p/x GFP_ATOMIC
    $1 = 0x20
    (gdb) p task_is_stopped_or_traced(init_task)
    $2 = 0

    You can also use the macro and info macro commands to work with macros from inside your gdb session:

    (gdb) macro expand task_is_stopped_or_traced(init_task)
    expands to: ((init_task->state & (4 | 8)) != 0)
    (gdb) info macro task_is_stopped_or_traced
    Defined at include/linux/sched.h:218
      included at include/linux/nmi.h:7
      included at kernel/sched.c:31
    #define task_is_stopped_or_traced(task) ((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0)

    Note that gdb actually knows which contexts macros are and aren't visible, so when you have the program stopped inside some function, you can only access macros visible at that point. (You can see that the "included at" lines above show you through exactly what path the macro is visible).

  6. gdb variables

    Whenever you print a variable in gdb, it prints this weird $NN = before it in the output:

    (gdb) p 5+5
    $1 = 10

    This is actually a gdb variable, that you can use to reference that same variable any time later in your session:

    (gdb) p $1
    $2 = 10

    You can also assign your own variables for convenience, using set:

    (gdb) set $foo = 4
    (gdb) p $foo
    $3 = 4

    This can be useful to grab a reference to some complex expression or similar that you'll be referencing many times, or, for example, for simplicity in writing a conditional breakpoint (see tip 1).

  7. Register variables

    In addition to the numeric variables, and any variables you define, gdb exposes your machine's registers as pseudo-variables, including some cross-architecture aliases for common ones, like $sp for the the stack pointer, or $pc for the program counter or instruction pointer.

    These are most useful when debugging assembly code or code without debugging symbols. Combined with a knowledge of your machine's calling convention, for example, you can use these to inspect function parameters:

    (gdb) break write if $rsi == 2

    will break on all writes to stderr on amd64, where the $rsi register is used to pass the first parameter.

  8. The x command

    Most people who've used gdb know about the print or p command, because of its obvious name, but I've been surprised how many don't know about the power of the x command.

    x (for "examine") is used to output regions of memory in various formats. It takes two arguments in a slightly unusual syntax:


    ADDRESS, unsurprisingly, is the address to examine; It can be an arbitrary expression, like the argument to print.

    FMT controls how the memory should be dumped, and consists of (up to) three components:

    • A numeric COUNT of how many elements to dump
    • A single-character FORMAT, indicating how to interpret and display each element
    • A single-character SIZE, indicating the size of each element to display.

    x displays COUNT elements of length SIZE each, starting from ADDRESS, formatting them according to the FORMAT.

    There are many valid "format" arguments; help x in gdb will give you the full list, so here's my favorites:

    x/x displays elements in hex, x/d displays them as signed decimals, x/c displays characters, x/i disassembles memory as instructions, and x/s interprets memory as C strings.

    The SIZE argument can be one of: b, h, w, and g, for one-, two-, four-, and eight-byte blocks, respectively.

    If you have debug symbols so that GDB knows the types of everything you might want to inspect, p is usually a better choice, but if not, x is invaluable for taking a look at memory.

    [~]$ grep saved_command /proc/kallsyms
    ffffffff81946000 B saved_command_line
    (gdb) x/s 0xffffffff81946000
    ffffffff81946000 <>:     "root=/dev/sda1 quiet"

    x/i is invaluable as a quick way to disassemble memory:

    (gdb) x/5i schedule
       0xffffffff8154804a <schedule>:   push   %rbp
       0xffffffff8154804b <schedule+1>: mov    $0x11ac0,%rdx
       0xffffffff81548052 <schedule+8>: mov    %gs:0xb588,%rax
       0xffffffff8154805b <schedule+17>:    mov    %rsp,%rbp
       0xffffffff8154805e <schedule+20>:    push   %r15

    If I'm stopped at a segfault in unknown code, one of the first things I try is something like x/20i $ip-40, to get a look at what the code I'm stopped at looks like.

    A quick-and-dirty but surprisingly effective way to debug memory leaks is to let the leak grow until it consumes most of a program's memory, and then attach gdb and just x random pieces of memory. Since the leaked data is using up most of memory, you'll usually hit it pretty quickly, and can try to interpret what it must have come from.


Ksplice is hiring!

Do you love tinkering with, exploring, and debugging Linux systems? Does writing Python clones of your favorite childhood computer games sound like a fun weekend project? Have you ever told a joke whose punch line was a git command?

Join Ksplice and work on technology that most people will tell you is impossible: updating the Linux kernel while it is running.

Help us develop the software and infrastructure to bring rebootless kernel updates to Linux, as well as new operating system kernels and other parts of the software stack. We're hiring backend, frontend, and kernel engineers. Say hello at!


Tired of rebooting to update systems? So are we -- which is why we invented Ksplice, technology that lets you update the Linux kernel without rebooting. It's currently available as part of Oracle Linux Premier Support, Fedora, and Ubuntu desktop. This blog is our place to ramble about technical topics that we (and hopefully you) think are interesting.


« August 2015