Wednesday Mar 16, 2011

disown, zombie children, and the uninterruptible sleep

PID 1 Riding, by Albrecht Dürer

It's the end of the day on Friday. On your laptop, in an ssh session on a work machine, you check on long.sh, which has been running all day and has another 8 or 9 hours to go. You start to close your laptop.

You freeze for a second and groan.

This was supposed to be running under a screen session. You know that if you kill the ssh connection, that'll also kill long.sh. What are you going to do? Leave your laptop for the weekend? Kill the job, losing the last 8 hours of work?

You think about what long.sh does for a minute, and breathe a sigh of relief. The output is written to a file, so you don't care about terminal output. This means you can use disown.

How does this little shell built-in let your jobs finish even when you kill the parent process (the shell inside the ssh connection)?

Dissecting disown

As we'll see, disown synthesizes 3 big UNIX concepts: signals, process states, and job control.

The point of disowning a process is that it will continue to run even when you exit the shell that spawned it. Getting this to work requires a prelude. The steps are:

  1. suspend the process with Ctrl-Z.
  2. background with bg.
  3. disown the job.

What does each of these steps accomplish?

First, here's a summary of the states that a process can be in, from the ps man page:

PROCESS STATE CODES
       Here are the different values that the s, stat and state output specifiers (header "STAT" or "S")
       will display to describe the state of a process.
       D    Uninterruptible sleep (usually IO)
       R    Running or runnable (on run queue)
       S    Interruptible sleep (waiting for an event to complete)
       T    Stopped, either by a job control signal or because it is being traced.
       W    paging (not valid since the 2.6.xx kernel)
       X    dead (should never be seen)
       Z    Defunct ("zombie") process, terminated but not reaped by its parent.

       For BSD formats and when the stat keyword is used, additional characters may be displayed:
       <    high-priority (not nice to other users)
       N    low-priority (nice to other users)
       L    has pages locked into memory (for real-time and custom IO)
       s    is a session leader
       l    is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
       +    is in the foreground process group

And here is a transcript of the steps to disown long.sh. Each step in the first shell is followed by some useful ps output from a second shell: in particular the parent process ID (PPID), the process state of our long job (STAT), and the controlling terminal (TT).

Shell 1 runs the disown steps; Shell 2 monitors with ps.

1. Start program

Shell 1:
$ sh long.sh

Shell 2:
$ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
  PID  PPID STAT TT       CMD
26298 26145 S+   pts/0    sh long.sh

2. Suspend program with Ctrl-Z

Shell 1:
^Z
[1]+  Stopped     sh long.sh

Shell 2:
$ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
  PID  PPID STAT TT       CMD
26298 26145 T    pts/0    sh long.sh

3. Resume program in background

Shell 1:
$ bg
[1]+ sh long.sh &

Shell 2:
$ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
  PID  PPID STAT TT       CMD
26298 26145 S    pts/0    sh long.sh

4. disown job 1, our program

Shell 1:
$ disown %1

Shell 2:
$ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
  PID  PPID STAT TT       CMD
26298 26145 S    pts/0    sh long.sh

5. Exit the shell

Shell 1:
$ exit
logout

Shell 2:
$ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
  PID  PPID STAT TT       CMD
26298     1 S    ?        sh long.sh

Putting this information together:

  1. When we run long.sh from the command line, its parent is the shell (PID 26145 in this example). Even though it looks like it is running as we watch it in the terminal, it mostly isn't; long.sh is waiting on some resource or event, so it is in process state S for interruptible sleep. It is in fact in the foreground, so it also gets a +.
  2. First, we suspend the program with Ctrl-Z. By ``suspend'', we mean send it the SIGTSTP signal, which is like SIGSTOP except that a process can install its own signal handler for it or ignore it. We see proof in the state change: it's now in T for stopped.
  3. Next, bg sets our process running again, but in the background, so we get the S for interruptible sleep, but no +.
  4. Finally, we can use disown to remove the process from the jobs list that our shell maintains. Our process has to be running, not stopped, when we exit the shell: a stopped job left behind in an orphaned process group is sent SIGHUP (and then SIGCONT) by the kernel, which kills it by default. That is why we needed the bg step.
  5. When we exit the shell, we are sending it a SIGHUP, which it propagates to all children in the jobs table**. By default, a SIGHUP will terminate a process. Because we removed our job from the jobs table, it doesn't get the SIGHUP and keeps on running (STAT S). However, since its parent the shell died, and the shell was the session leader in charge of the controlling tty, it doesn't have a tty anymore (TT ?). Additionally, our long job needs a new parent, so init, with PID 1, becomes the new parent process.
**This is not always true, as it turns out. In the bash shell, for example, there is a huponexit shell option. If this option is disabled, a SIGHUP to the shell isn't propagated to the children. This means if you have a backgrounded, active process (you followed steps 1, 2, and 3 above, or you started the process backgrounded with ``&'') and you exit the shell, you don't have to use disown for the process to keep running. You can check or toggle the huponexit shell option with the shopt shell built-in.
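
To watch the state changes from steps 2 and 3 without a terminal, here is a minimal sketch that stops and resumes a background sleep by hand and reads its state straight from /proc. A few assumptions: it needs a Linux /proc filesystem, proc_state is a hypothetical helper (not a standard command), and it sends SIGSTOP rather than the SIGTSTP that Ctrl-Z generates, because SIGTSTP can be caught, ignored, or discarded for an orphaned process group, while SIGSTOP always stops the target.

```shell
# Hypothetical helper: print the one-letter state of a PID from /proc/PID/stat.
# The sed strips "pid (comm) " so the state becomes the first field.
proc_state() { sed 's/.*) //' "/proc/$1/stat" | cut -d' ' -f1; }

sleep 30 &                      # stand-in for long.sh
pid=$!
sleep 1                         # give the child time to exec and go to sleep

before=$(proc_state "$pid")     # typically S (interruptible sleep)
kill -STOP "$pid"               # roughly what Ctrl-Z does, minus the "catchable" part
sleep 1
stopped=$(proc_state "$pid")    # T (stopped)
kill -CONT "$pid"               # what bg does to a stopped job
sleep 1
resumed=$(proc_state "$pid")    # back to sleeping

echo "before=$before stopped=$stopped resumed=$resumed"

kill "$pid"                     # clean up the stand-in job
wait "$pid" 2>/dev/null || true
```

In an interactive shell you would see the same transitions in the STAT column of ps, as in the transcript above.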

And that is disown in a nutshell.
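
The adoption in step 5 can be reproduced without logging out. In this rough sketch, a throwaway sh plays the part of the exiting shell: it starts a background sleep, reports both PIDs, and exits. The names proc_ppid and pidfile are made up for the example, and it assumes a Linux /proc filesystem. The new parent is usually PID 1, but on systems that use a child subreaper (some session managers and containers do) it can be a different process, so the sketch only prints what it finds.

```shell
# Hypothetical helper: print the parent PID of a process from /proc/PID/stat.
proc_ppid() { sed 's/.*) //' "/proc/$1/stat" | cut -d' ' -f2; }

pidfile=$(mktemp)

# The inner sh starts a child, writes "<child pid> <its own pid>", and exits,
# leaving the sleep behind as an orphan.
sh -c 'sleep 5 & echo "$! $$"' > "$pidfile"
read child old_parent < "$pidfile"

new_parent=$(proc_ppid "$child")
echo "sleep $child: parent was $old_parent, is now $new_parent"

kill "$child"                   # clean up the orphan
rm -f "$pidfile"
```

Because the inner sh is non-interactive, it does not send SIGHUP to its background job on exit, which is why no disown is needed here.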

What else can we learn about process states?

Dissecting disown presents enough interesting tangents about signals, process states, and job control for a small novel. Focusing on process states for this post, here are a few such tangents:

1. There are a lot of process states and modifiers. We saw some interruptible sleeps and suspended processes with disown, but what states are most common?

Using my laptop as a data source and taking advantage of ps format specifiers, we can get counts for the different process states:

jesstess@aja:~$ ps -e h -o stat | sort | uniq -c | sort -rn
     90 S
     31 Sl
     17 Ss
      9 Ss+
      8 Ssl
      4 S<
      3 S+
      2 SNl
      1 S<sl
      1 S<s
      1 SN
      1 SLl
      1 R+

So the vast majority are in an interruptible sleep (S), and a few processes are extra nice (N) and extra mean (<).

We can drill down on process ``niceness'', or scheduling priority, with the ni format specifier to ps:

jesstess@aja:~$ ps -e h -o ni | sort -n | uniq -c
      1 -11
      2  -5
      1  -4
      2  -2
      4   -
    156   0
      1   1
      1   5
      1  10

The numbers range from 19 (super friendly, low scheduling priority) to -20 (a total bully, high scheduling priority). The 6 processes with negative numbers are the 6 with a < process state modifier in the ``ps -e h -o stat'' output, and the 3 with positive numbers have the Ns. Most processes don't run under a special scheduling priority.
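
To see where a non-zero niceness comes from, here is a small sketch that starts a process under nice and reads the value back. It assumes a Linux /proc filesystem, and proc_nice is a hypothetical helper that pulls the nice field out of /proc/PID/stat. Raising niceness works for any user; lowering it below the current value generally needs root.

```shell
# Hypothetical helper: print the nice value of a PID from /proc/PID/stat.
# After stripping "pid (comm) ", nice is the 17th remaining field.
proc_nice() { sed 's/.*) //' "/proc/$1/stat" | cut -d' ' -f17; }

base=$(proc_nice $$)            # niceness of this shell, usually 0

nice -n 10 sleep 5 &            # run a child 10 steps "nicer" than the shell
pid=$!
sleep 1                         # let nice adjust the priority and exec sleep

niced=$(proc_nice "$pid")
echo "shell niceness: $base, child niceness: $niced"

kill "$pid"
wait "$pid" 2>/dev/null || true
```

The child would show up with an N in the STAT column of ps, like the SN and SNl entries above.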

Why is almost nothing actually running?

In the ``ps -e h -o stat'' output above, only 1 process was marked as R running or runnable. This is a multi-processor machine, and there are over 150 other processes, so why isn't something running on the other processor?

The answer is that on an unloaded system, most processes really are waiting on an event or resource, so they can't run. On the laptop where I ran these tests, uptime tells us that we have a load average under 1:

jesstess@aja:~$ uptime
 13:09:10 up 16 days, 14:09,  5 users,  load average: 0.92, 0.87, 0.82

So we'd only expect to see about 1 process in the R state at any given time for that load.
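
On Linux, the same numbers can be read straight from /proc, which makes the comparison between load and processor count easy to script. This is a sketch under the assumption that /proc/loadavg and /proc/cpuinfo are available in their usual format; the variable names are arbitrary. The fourth field of /proc/loadavg is the number of currently runnable scheduling entities over the total, and it typically counts the process doing the reading.

```shell
# Fields: 1-, 5-, and 15-minute load averages, runnable/total, last PID used.
read load1 load5 load15 runq lastpid < /proc/loadavg

# One "processor" line per logical CPU on common architectures.
cpus=$(grep -c '^processor' /proc/cpuinfo)

echo "1-minute load average: $load1 on $cpus processor(s)"
echo "runnable/total scheduling entities right now: $runq"
```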

If we hop over to a more loaded machine -- a shell machine at MIT -- things are a little more interesting:

dr-wily:~> ps -e -o stat,cmd | awk '{if ($1 ~/R/) print}'
R+   /mit/barnowl/arch/i386_deb50/bin/barnowl.real.zephyr3
R+   ps -e -o stat,cmd
R+   w
dr-wily:~> uptime
 23:23:16 up 22 days, 20:09, 132 users,  load average: 3.01, 3.66, 3.43
dr-wily:~> grep processor /proc/cpuinfo
processor	: 0
processor	: 1
processor	: 2
processor	: 3

The machine has 4 processors. On average, 3 or 4 processors have processes running (in the R state). To get a sense of how the running processes change over time, throw the ps line under watch:

watch -n 1 "ps -e -o stat,cmd | awk '{if (\$1 ~/R/) print}'"

We get something like:

[Image: watching the changing output of ps]

2. What about the zombies?

Noticeably absent in the process state summaries above are zombie processes (STAT Z) and processes in uninterruptible sleep (STAT D).

A process becomes a zombie when it has completed execution but hasn't been reaped by its parent. If a program produces long-lived zombies, this is usually a bug; zombies are undesirable because they take up process IDs, which are a limited resource.

I had to dig around a bit to find real examples of zombies. The winners were old barnowl zephyr clients (zephyr is a popular instant messaging system at MIT):

jesstess@linerva:~$ ps -e h -o stat,cmd | awk '{if ($1 ~/Z/) print}'
Z+   [barnowl] <defunct>
Z+   [barnowl] <defunct>

However, since all it takes to produce a zombie is a child exiting without the parent reaping it, it's easy to construct our own zombies of limited duration:

jesstess@aja:~$ cat zombie.c 
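#include <sys/types.h>
#include <unistd.h>   /* fork, sleep */

int main () {
    pid_t child_pid = fork();
    if (child_pid > 0) {
        /* Parent: sleep without calling wait, so the exited child stays a zombie. */
        sleep(60);
    }
    /* Child: exits immediately. */
    return 0;
}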
jesstess@aja:~$ gcc -o zombie zombie.c
jesstess@aja:~$ ./zombie
^Z
[1]+  Stopped                 ./zombie
jesstess@aja:~$ ps -o stat,cmd $(pgrep -f zombie)
T    ./zombie
Z    [zombie] <defunct>

When you run this program, the child exits immediately and stays a zombie for as long as the parent sleeps. The parent dies after 60 seconds, init becomes the zombie child's new parent, and init quickly reaps the child by making a wait system call on the child's PID, which removes it from the system process table.
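
The same kind of short-lived zombie can be made from the shell alone. In this sketch, a subshell starts a short-lived child and then replaces itself (via exec) with a longer sleep, which knows nothing about the child and never waits for it. It assumes a Linux /proc filesystem; proc_state and pidfile are names invented for the example.

```shell
# Hypothetical helper: print the one-letter state of a PID from /proc/PID/stat.
proc_state() { sed 's/.*) //' "/proc/$1/stat" | cut -d' ' -f1; }

pidfile=$(mktemp)

# The subshell forks "sleep 1", records its PID, then becomes "sleep 4".
# When "sleep 1" exits, its parent is a sleep that will never reap it.
( sleep 1 & echo $! > "$pidfile"; exec sleep 4 ) &
parent=$!

sleep 2                         # by now the child has exited
child=$(cat "$pidfile")
zombie_state=$(proc_state "$child")
echo "child $child is in state $zombie_state"   # Z while the parent lives

kill "$parent"                  # once the parent dies, the zombie is handed off for reaping
wait "$parent" 2>/dev/null || true
rm -f "$pidfile"
```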

3. What about the uninterruptible sleeps?

A process is put in an uninterruptible sleep (STAT D) when it needs to wait on something (typically I/O) and shouldn't be handling signals while waiting. This means you can't kill it, because all kill does is send it signals. This might happen in the real world if you unplug your NFS server while other machines have open network connections to it.

We can create our own uninterruptible processes of limited duration by taking advantage of the vfork system call. vfork is like fork, except the address space is not copied from the parent into the child, in anticipation of an exec which would just throw out the copied data. Conveniently for us, when you vfork the parent waits uninterruptibly (by way of wait_for_completion) on the child's exec or exit:

jesstess@aja:~$ cat uninterruptible.c 
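#include <unistd.h>   /* vfork, sleep */

int main() {
    /* The parent blocks uninterruptibly until the child execs or exits. */
    vfork();
    sleep(60);
    return 0;
}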
jesstess@aja:~$ gcc -o uninterruptible uninterruptible.c
jesstess@aja:~$ echo $$
13291
jesstess@aja:~$ ./uninterruptible

and in another shell:

jesstess@aja:~$ ps -o ppid,pid,stat,cmd $(pgrep -f uninterruptible)

13291  1972 D+   ./uninterruptible
 1972  1973 S+   ./uninterruptible

We see the child (PID 1973, PPID 1972) in an interruptible sleep and the parent (PID 1972, PPID 13291 -- the shell) in an uninterruptible sleep while it waits for 60 seconds on the child.

One neat (mischievous?) thing about this program is that processes in an uninterruptible sleep contribute to the load average for a machine. So you could run 100 copies of it to temporarily give a machine a load average elevated by 100, as reported by uptime.

It's a family affair

Signals, process states, and job control offer a wealth of opportunities for exploration on a Linux system: we've already disowned children, killed parents, witnessed adoption (by init), crafted zombie children, and more. If this post inspires fun tangents or fond memories, please share in the comments!


*Albrecht had some help from Adam and Photoshop Elements. Larger version here.
Props to Nelson for his boundless supply of sysadmin party tricks, which includes this vfork example.

~jesstess

Tuesday Feb 15, 2011

Happy Birthday Ksplice Uptrack!

One year ago, we announced the general availability of Ksplice Uptrack, a subscription service for rebootless kernel updates on Linux. Since then, a lot has happened!

Adoption

Over 600 companies have deployed Ksplice Uptrack on more than 100,000 production systems, on all 7 continents (Antarctica was the last hold-out). More than 2 million rebootless updates have been installed!

Ksplice at the South Pole

You see the greatest system administration time and cost savings when you use Ksplice Uptrack on all of your machines, and we've designed and priced Uptrack with this large-scale usage in mind. We've been very happy to see that our customers share this view: most of our customers either rolled out Ksplice across the board upon sign-up or have substantially expanded their usage after first purchasing for a particular environment.

We've introduced several features this year to make managing large Uptrack installations easier:

  • autoinstall: the Uptrack client can be configured to install rebootless updates on its own as they become available, making kernel updates fully automatic. Autoinstall is now our most popular configuration.
  • the Uptrack API: programmatically query the state of your machines through our RESTful API.
  • Nagios plugins: easily integrate Uptrack monitoring into your existing Nagios infrastructure with plugins that monitor for out of date, inactive, and unsupported machines.

Supported kernels

We continue to expand the distributions and flavors we support. You can now use Ksplice Uptrack on (deep breath) RHEL 4 & 5, CentOS 4 & 5 (and CentOS Plus), Debian 5 & 6, Ubuntu 8.04, 9.10, 10.04, & 10.10, Virtuozzo 3 & 4, OpenVZ for EL5 and Debian, CloudLinux 5, and Fedora 13 & 14. Within these distributions, we continue to support more flavors and older kernels; our launch may have been last year, but we support kernels back to 2007 and earlier!

You've always been able to use Ksplice in virtualized environments like VMware, Xen, and Virtuozzo. This year, we've given you even more virtualization options by adding support for virtualization kernels like Debian OpenVZ and Debian Xen. Starting with Ubuntu Maverick, we support all Ubuntu kernel flavors, including the Maverick private cloud/EC2 kernel.

Thanks to some changes at Amazon and Rackspace, you can now even use Ksplice Uptrack on stock Linux kernels in Amazon EC2 and Rackspace Cloud.

As always, you can roll out a free trial on any of these distributions, for an unlimited number of systems.

This year, we also added Fedora Desktop to our free version options, so now you can use Ksplice Uptrack on both Ubuntu Desktop and Fedora Desktop completely for free.

Reboots saved

Did you know that between RHEL 4 and 5, Red Hat released 22 new kernels this past year?

chart: reboots on RHEL this past year

Without Ksplice, that's coordination and downtime for 22 reboots, including the hat trick of 3 kernels in 3 weeks for RHEL 5. Gartner estimates that 90% of exploited systems are exploited using known, patched vulnerabilities, so if you're not rebooting and you're not using Ksplice, you are putting your servers (and your customers) at risk.

Looking forward

What do you want to see from Ksplice this year? What features can we add to help you deploy and monitor your Uptrack installations? Tell us what you want, and we'll do our best to deliver!

~jesstess