Wednesday Mar 16, 2011

disown, zombie children, and the uninterruptible sleep

PID 1 Riding, by Albrecht Dürer

It's the end of the day on Friday. On your laptop, in an ssh session on a work machine, you check on your long-running job, which has been running all day and has another 8 or 9 hours to go. You start to close your laptop.

You freeze for a second and groan.

This was supposed to be running under a screen session. You know that if you kill the ssh connection, that'll also kill the job. What are you going to do? Leave your laptop for the weekend? Kill the job, losing the last 8 hours of work?

You think about what the job does for a minute, and breathe a sigh of relief. The output is written to a file, so you don't care about terminal output. This means you can use disown.

How does this little shell built-in let your jobs finish even when you kill the parent process (the shell inside the ssh connection)?

Dissecting disown

As we'll see, disown synthesizes 3 big UNIX concepts: signals, process states, and job control.

The point of disowning a process is that it will continue to run even when you exit the shell that spawned it. Getting this to work requires a prelude. The steps are:

  1. suspend the process with Ctl-z.
  2. background with bg.
  3. disown the job.

What does each of these steps accomplish?

First, here's a summary of the states that a process can be in, from the ps man page:

       Here are the different values that the s, stat and state output specifiers (header "STAT" or "S")
       will display to describe the state of a process.
       D    Uninterruptible sleep (usually IO)
       R    Running or runnable (on run queue)
       S    Interruptible sleep (waiting for an event to complete)
       T    Stopped, either by a job control signal or because it is being traced.
       W    paging (not valid since the 2.6.xx kernel)
       X    dead (should never be seen)
       Z    Defunct ("zombie") process, terminated but not reaped by its parent.

       For BSD formats and when the stat keyword is used, additional characters may be displayed:
       <    high-priority (not nice to other users)
       N    low-priority (nice to other users)
       L    has pages locked into memory (for real-time and custom IO)
       s    is a session leader
       l    is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
       +    is in the foreground process group

And here is a transcript of the steps to disown our long-running job. Each step is followed by some useful ps output from a second shell, in particular the parent process id (PPID), what process state our long job is in (STAT), and the controlling terminal (TT). The interesting changes are in the second, third, and fourth columns:

Shell 1 runs and disowns the job; shell 2 monitors it with ps.

1. Start program
   Shell 1:
     $ sh
   Shell 2:
     $ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
     26298 26145 S+   pts/0    sh
2. Suspend program with Ctl-z
   Shell 1:
     [1]+  Stopped     sh
   Shell 2:
     $ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
     26298 26145 T    pts/0    sh
3. Resume program in background
   Shell 1:
     $ bg
     [1]+ sh &
   Shell 2:
     $ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
     26298 26145 S    pts/0    sh
4. disown job 1, our program
   Shell 1:
     $ disown %1
   Shell 2:
     $ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
     26298 26145 S    pts/0    sh
5. Exit the shell
   Shell 1:
     $ exit
   Shell 2:
     $ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
     26298     1 S    ?        sh

Putting this information together:

  1. When we run our program from the command line, its parent is the shell (PID 26145 in this example). Even though it looks like it is running as we watch it in the terminal, it mostly isn't; the program is waiting on some resource or event, so it is in process state S for interruptible sleep. It is in fact in the foreground, so it also gets a +.
  2. First, we suspend the program with Ctl-z. By "suspend", we mean send it the SIGTSTP signal, which is like SIGSTOP except that you can install your own signal handler for it or ignore it. We see proof in the state change: it's now in T for stopped.
  3. Next, bg sets our process running again, but in the background, so we get the S for interruptible sleep, but no +.
  4. Finally, we can use disown to remove the process from the jobs list that our shell maintains. Our process has to be active when it is removed from the list or it'll get killed when we exit the parent shell, which is why we needed the bg step.
  5. When we exit the shell, it sends a SIGHUP to all children in the jobs table**. By default, a SIGHUP will terminate a process. Because we removed our job from the jobs table, it doesn't get the SIGHUP and keeps on running (STAT S). However, since its parent the shell died, and the shell was the session leader in charge of the controlling tty, it doesn't have a tty anymore (TT ?). Additionally, our long job needs a new parent, so init, with PID 1, becomes the new parent process.
**This is not always true, as it turns out. In the bash shell, for example, there is a huponexit shell option. If this option is disabled, bash does not send a SIGHUP to its jobs when a login shell exits normally (it still passes along a SIGHUP that it receives itself, for example when the connection drops). This means if you have a backgrounded, active process (you followed steps 1, 2, and 3 above, or you started the process backgrounded with "&") and you exit the shell, you don't have to use disown for the process to keep running. You can check or toggle the huponexit shell option with the shopt shell built-in.

And that is disown in a nutshell.

What else can we learn about process states?

Dissecting disown presents enough interesting tangents about signals, process states, and job control for a small novel. Focusing on process states for this post, here are a few such tangents:

1. There are a lot of process states and modifiers. We saw some interruptible sleeps and suspended processes with disown, but what states are most common?

Using my laptop as a data source and taking advantage of ps format specifiers, we can get counts for the different process states:

jesstess@aja:~$ ps -e h -o stat | sort | uniq -c | sort -rn
     90 S
     31 Sl
     17 Ss
      9 Ss+
      8 Ssl
      4 S<
      3 S+
      2 SNl
      1 S<sl
      1 S<s
      1 SN
      1 SLl
      1 R+

So the vast majority are in an interruptible sleep (S), and a few processes are extra nice (N) and extra mean (<).

We can drill down on process "niceness", or scheduling priority, with the ni format specifier to ps:

jesstess@aja:~$ ps -e h -o ni | sort -n | uniq -c
      1 -11
      2  -5
      1  -4
      2  -2
      4   -
    156   0
      1   1
      1   5
      1  10

The numbers range from 19 (super friendly, low scheduling priority) to -20 (a total bully, high scheduling priority). The 6 processes with negative numbers are the 6 with a < process state modifier in the "ps -e h -o stat" output, and the 3 with positive numbers have the Ns. Most processes don't run under a special scheduling priority.

Why is almost nothing actually running?

In the "ps -e h -o stat" output above, only 1 process was marked as R (running or runnable). This is a multi-processor machine, and there are over 150 other processes, so why isn't something running on the other processor?

The answer is that on an unloaded system, most processes really are waiting on an event or resource, so they can't run. On the laptop where I ran these tests, uptime tells us that we have a load average under 1:

jesstess@aja:~$ uptime
 13:09:10 up 16 days, 14:09,  5 users,  load average: 0.92, 0.87, 0.82

So we'd only expect to see 1 process in the R state at any given time for that load.

If we hop over to a more loaded machine -- a shell machine at MIT -- things are a little more interesting:

dr-wily:~> ps -e -o stat,cmd | awk '{if ($1 ~/R/) print}'
R+   /mit/barnowl/arch/i386_deb50/bin/barnowl.real.zephyr3
R+   ps -e -o stat,cmd
R+   w
dr-wily:~> uptime
 23:23:16 up 22 days, 20:09, 132 users,  load average: 3.01, 3.66, 3.43
dr-wily:~> grep processor /proc/cpuinfo
processor	: 0
processor	: 1
processor	: 2
processor	: 3

The machine has 4 processors. On average, 3 or 4 processors have processes running (in the R state). To get a sense of how the running processes change over time, throw the ps line under watch:

watch -n 1 "ps -e -o stat,cmd | awk '{if (\$1 ~/R/) print}'"

We get something like:

[Image: watching the changing output of ps]

2. What about the zombies?

Noticeably absent in the process state summaries above are zombie processes (STAT Z) and processes in uninterruptible sleep (STAT D).

A process becomes a zombie when it has completed execution but hasn't been reaped by its parent. If a program produces long-lived zombies, this is usually a bug; zombies are undesirable because they take up process IDs, which are a limited resource.

I had to dig around a bit to find real examples of zombies. The winners were old barnowl zephyr clients (zephyr is a popular instant messaging system at MIT):

jesstess@linerva:~$ ps -e h -o stat,cmd | awk '{if ($1 ~/Z/) print}'
Z+   [barnowl] <defunct>
Z+   [barnowl] <defunct>

However, since all it takes to produce a zombie is a child exiting without the parent reaping it, it's easy to construct our own zombies of limited duration:

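jesstess@aja:~$ cat zombie.c 
#include <sys/types.h>
#include <unistd.h>

int main () {
    pid_t child_pid = fork();
    if (child_pid > 0) {
        /* Parent: sleep without calling wait(), so the exited child
           stays a zombie for 60 seconds. */
        sleep(60);
    }
    /* The child falls through and exits immediately. */
    return 0;
}
jesstess@aja:~$ gcc -o zombie zombie.c
jesstess@aja:~$ ./zombie
^Z
[1]+  Stopped                 ./zombie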
jesstess@aja:~$ ps -o stat,cmd $(pgrep -f zombie)
T    ./zombie
Z    [zombie] <defunct>

When you run this program, the parent dies after 60 seconds, init becomes the zombie child's new parent, and init quickly reaps the child by making a wait system call on the child's PID, which removes it from the system process table.

3. What about the uninterruptible sleeps?

A process is put in an uninterruptible sleep (STAT D) when it needs to wait on something (typically I/O) and shouldn't be handling signals while waiting. This means you can't kill it, because all kill does is send it signals. This might happen in the real world if you unplug your NFS server while other machines have open network connections to it.

We can create our own uninterruptible processes of limited duration by taking advantage of the vfork system call. vfork is like fork, except the address space is not copied from the parent into the child, in anticipation of an exec which would just throw out the copied data. Conveniently for us, when you vfork the parent waits uninterruptibly (by way of wait_for_completion) on the child's exec or exit:

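jesstess@aja:~$ cat uninterruptible.c 
#include <unistd.h>

int main() {
    vfork();    /* the parent blocks here until the child execs or exits */
    sleep(60);  /* the child runs this first, keeping the parent blocked */
    return 0;
}
jesstess@aja:~$ gcc -o uninterruptible uninterruptible.c
jesstess@aja:~$ echo $$
13291
jesstess@aja:~$ ./uninterruptible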

and in another shell:

jesstess@aja:~$ ps -o ppid,pid,stat,cmd $(pgrep -f uninterruptible)
13291  1972 D+   ./uninterruptible
 1972  1973 S+   ./uninterruptible

We see the child (PID 1973, PPID 1972) in an interruptible sleep and the parent (PID 1972, PPID 13291 -- the shell) in an uninterruptible sleep while it waits for 60 seconds on the child.

One neat (mischievous?) thing about this program is that processes in an uninterruptible sleep contribute to the load average for a machine. So you could run this program 100 times to temporarily give a machine a load average elevated by 100, as reported by uptime.

It's a family affair

Signals, process states, and job control offer a wealth of opportunities for exploration on a Linux system: we've already disowned children, killed parents, witnessed adoption (by init), crafted zombie children, and more. If this post inspires fun tangents or fond memories, please share in the comments!

*Albrecht had some help from Adam and Photoshop Elements.
Props to Nelson for his boundless supply of sysadmin party tricks, which includes this vfork example.

