Solving problems with proc

The Linux kernel exposes a wealth of information through the proc special filesystem. It's not hard to find an encyclopedic reference about proc. In this article I'll take a different approach: we'll see how proc tricks can solve a number of real-world problems. All of these tricks should work on a recent Linux kernel, though some will fail on older systems like RHEL version 4.

Almost all Linux systems will have the proc filesystem mounted at /proc. If you look inside this directory you'll see a ton of stuff:

keegan@lyle$ mount | grep ^proc
proc on /proc type proc (rw,noexec,nosuid,nodev)
keegan@lyle$ ls /proc
1      13     23     29672  462        cmdline      kcore         self
10411  13112  23842  29813  5          cpuinfo      keys          slabinfo
...
12934  15260  26317  4      bus        irq          partitions    zoneinfo
12938  15262  26349  413    cgroups    kallsyms     sched_debug

These directories and files don't exist anywhere on disk. Rather, the kernel generates the contents of /proc as you read it. proc is a great example of the UNIX "everything is a file" philosophy. Since the Linux kernel exposes its internal state as a set of ordinary files, you can build tools using basic shell scripting, or any other programming environment you like. You can also change kernel behavior by writing to certain files in /proc, though we won't discuss this further.
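
For instance, a few lines of Python are enough to pull the uptime and load average straight out of proc; there are no special APIs involved, just ordinary file reads. A minimal sketch:

# Read the system uptime and load average straight from proc.
with open('/proc/uptime') as f:
    uptime_seconds = float(f.read().split()[0])
with open('/proc/loadavg') as f:
    load1, load5, load15 = f.read().split()[:3]
print("up %.0f seconds, load %s %s %s" % (uptime_seconds, load1, load5, load15))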

Each process has a directory in /proc, named by its numerical process identifier (PID). So for example, information about init (PID 1) is stored in /proc/1. There's also a symlink /proc/self, which each process sees as pointing to its own directory:

keegan@lyle$ ls -l /proc/self
lrwxrwxrwx 1 root root 64 Jan 6 13:22 /proc/self -> 13833

Here we see that 13833 was the PID of the ls process. Since ls has exited, the directory /proc/13833 will have already vanished, unless your system reused the PID for another process. The contents of /proc are constantly changing, even in response to your queries!
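
You can check this from inside a program too. Here's a quick Python sketch; reading the /proc/self symlink should hand back your own PID:

import os

# /proc/self resolves to the PID directory of whoever is doing the reading.
print(os.readlink('/proc/self'))   # e.g. '13833'
print(os.getpid())                 # the same number, as an integer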

Back from the dead

It's happened to all of us. You hit the up-arrow one too many times and accidentally wiped out that really important disk image.

keegan@lyle$ rm hda.img

Time to think fast! Luckily you were still computing its checksum in another terminal. And UNIX systems won't actually delete a file on disk while the file is in use. Let's make sure our file stays "in use" by suspending md5sum with control-Z:

keegan@lyle$ md5sum hda.img
^Z
[1]+  Stopped                 md5sum hda.img

The proc filesystem contains links to a process's open files, under the fd subdirectory. We'll get the PID of md5sum and try to recover our file:

keegan@lyle$ jobs -l
[1]+ 14595 Stopped                 md5sum hda.img
keegan@lyle$ ls -l /proc/14595/fd/
total 0
lrwx------ 1 keegan keegan 64 Jan 6 15:05 0 -> /dev/pts/18
lrwx------ 1 keegan keegan 64 Jan 6 15:05 1 -> /dev/pts/18
lrwx------ 1 keegan keegan 64 Jan 6 15:05 2 -> /dev/pts/18
lr-x------ 1 keegan keegan 64 Jan 6 15:05 3 -> /home/keegan/hda.img (deleted)
keegan@lyle$ cp /proc/14595/fd/3 saved.img
keegan@lyle$ du -h saved.img
320G    saved.img

Disaster averted, thanks to proc. There's one big caveat: making a full byte-for-byte copy of the file could require a lot of time and free disk space. In theory this isn't necessary; the file still exists on disk, and we just need to make a new name for it (a hardlink). But the ln command and associated system calls have no way to name a deleted file. On FreeBSD we could use fsdb, but I'm not aware of a similar tool for Linux. Suggestions are welcome!
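
In the meantime, if you're not sure which process still has your deleted file open, you can scan proc for it. Here's a rough Python sketch along those lines; note that you can only read the fd directories of your own processes unless you're root, and that the kernel marks such links by appending " (deleted)" to the target:

import os

# Report open-but-deleted files across all processes we can inspect.
for pid in filter(str.isdigit, os.listdir('/proc')):
    fd_dir = '/proc/%s/fd' % pid
    try:
        fds = os.listdir(fd_dir)
    except OSError:
        continue  # the process exited, or we lack permission
    for fd in fds:
        try:
            target = os.readlink('%s/%s' % (fd_dir, fd))
        except OSError:
            continue  # the descriptor was closed while we were looking
        if target.endswith(' (deleted)'):
            print('%s/%s -> %s' % (fd_dir, fd, target))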

Redirect harder

Most UNIX tools can read from standard input, either by default or with a specified filename of "-". But sometimes we have to use a program which requires an explicitly named file. proc provides an elegant workaround for this flaw.

A UNIX process refers to its open files using integers called file descriptors. When we say "standard input", we really mean "file descriptor 0". So we can use /proc/self/fd/0 as an explicit name for standard input:

keegan@lyle$ cat crap-prog.py 
import sys
print file(sys.argv[1]).read()
keegan@lyle$ echo hello | python crap-prog.py 
IndexError: list index out of range
keegan@lyle$ echo hello | python crap-prog.py -
IOError: [Errno 2] No such file or directory: '-'
keegan@lyle$ echo hello | python crap-prog.py /proc/self/fd/0
hello

This also works for standard output and standard error, on file descriptors 1 and 2 respectively. This trick is useful enough that many distributions provide symlinks at /dev/stdin, etc.

There are a lot of possibilities for where /proc/self/fd/0 might point:

keegan@lyle$ ls -l /proc/self/fd/0
lrwx------ 1 keegan keegan 64 Jan  6 16:00 /proc/self/fd/0 -> /dev/pts/6
keegan@lyle$ ls -l /proc/self/fd/0 < /dev/null
lr-x------ 1 keegan keegan 64 Jan  6 16:00 /proc/self/fd/0 -> /dev/null
keegan@lyle$ echo | ls -l /proc/self/fd/0
lr-x------ 1 keegan keegan 64 Jan  6 16:00 /proc/self/fd/0 -> pipe:[9159930]

In the first case, stdin is the pseudo-terminal created by my screen session. In the second case it's redirected from a different file. In the third case, stdin is an anonymous pipe. The symlink target isn't a real filename, but proc provides the appropriate magic so that we can read the file anyway. The filesystem nodes for anonymous pipes live in the pipefs special filesystem — specialer than proc, because it can't even be mounted.
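
We can make the same distinction from inside a program. Here's a small Python sketch that reads the symlink for its own standard input and reports roughly what kind of thing it points at:

import os
import stat

# Where does our standard input actually come from?
target = os.readlink('/proc/self/fd/0')
mode = os.fstat(0).st_mode

if target.startswith('pipe:'):
    kind = 'an anonymous pipe'
elif stat.S_ISCHR(mode):
    kind = 'a character device (probably a terminal)'
elif stat.S_ISREG(mode):
    kind = 'a regular file'
else:
    kind = 'something else'

print('stdin is %s: %s' % (kind, target))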

The phantom progress bar

Say we have some program which is slowly working its way through an input file. We'd like a progress bar, but we already launched the program, so it's too late for pv.

Alongside /proc/$PID/fd we have /proc/$PID/fdinfo, which will tell us (among other things) a process's current position within an open file. Let's use this to make a little script that will attach a progress bar to an existing process:

keegan@lyle$ cat phantom-progress.bash
#!/bin/bash
fd=/proc/$1/fd/$2
fdinfo=/proc/$1/fdinfo/$2
name=$(readlink $fd)
size=$(wc -c $fd | awk '{print $1}')
while [ -e $fd ]; do
  progress=$(cat $fdinfo | grep ^pos | awk '{print $2}')
  echo $((100*$progress / $size))
  sleep 1
done | dialog --gauge "Progress reading $name" 7 100

We pass the PID and a file descriptor as arguments. Let's test it:

keegan@lyle$ cat slow-reader.py 
import sys
import time
f = file(sys.argv[1], 'r')
while f.read(1024):
  time.sleep(0.01)
keegan@lyle$ python slow-reader.py bigfile &
[1] 18589
keegan@lyle$ ls -l /proc/18589/fd
total 0
lrwx------ 1 keegan keegan 64 Jan  6 16:40 0 -> /dev/pts/16
lrwx------ 1 keegan keegan 64 Jan  6 16:40 1 -> /dev/pts/16
lrwx------ 1 keegan keegan 64 Jan  6 16:40 2 -> /dev/pts/16
lr-x------ 1 keegan keegan 64 Jan  6 16:40 3 -> /home/keegan/bigfile
keegan@lyle$ ./phantom-progress.bash 18589 3

And you should see a nice curses progress bar, courtesy of dialog. Or replace dialog with gdialog and you'll get a GTK+ window.
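
If you only want a single reading rather than a live gauge, the same arithmetic is easy to do directly. A rough Python sketch, given a PID and a file descriptor number on the command line:

import os
import sys

# Print how far (as a percentage) a process has read through an open file.
pid, fd = sys.argv[1], sys.argv[2]

# stat on the fd symlink follows it to the real file, giving us its size.
size = os.stat('/proc/%s/fd/%s' % (pid, fd)).st_size

with open('/proc/%s/fdinfo/%s' % (pid, fd)) as f:
    for line in f:
        if line.startswith('pos:'):
            pos = int(line.split()[1])

print('%d%%' % (100 * pos // size))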

Chasing plugins

A user comes to you with a problem: every so often, their instance of Enterprise FooServer will crash and burn. You read up on Enterprise FooServer and discover that it's a plugin-riddled behemoth, loading dozens of shared libraries at startup. Loading the wrong library could very well cause mysterious crashing.

The exact set of libraries loaded will depend on the user's config files, as well as environment variables like LD_PRELOAD and LD_LIBRARY_PATH. So you ask the user to start fooserver exactly as they normally do. You get the process's PID and dump its memory map:

keegan@lyle$ cat /proc/21637/maps
00400000-00401000 r-xp 00000000 fe:02 475918             /usr/bin/fooserver
00600000-00601000 rw-p 00000000 fe:02 475918             /usr/bin/fooserver
02519000-0253a000 rw-p 00000000 00:00 0                  [heap]
7ffa5d3c5000-7ffa5d3c6000 r-xp 00000000 fe:02 1286241    /usr/lib/foo-1.2/libplugin-bar.so
7ffa5d3c6000-7ffa5d5c5000 ---p 00001000 fe:02 1286241    /usr/lib/foo-1.2/libplugin-bar.so
7ffa5d5c5000-7ffa5d5c6000 rw-p 00000000 fe:02 1286241    /usr/lib/foo-1.2/libplugin-bar.so
7ffa5d5c6000-7ffa5d5c7000 r-xp 00000000 fe:02 1286243    /usr/lib/foo-1.3/libplugin-quux.so
7ffa5d5c7000-7ffa5d7c6000 ---p 00001000 fe:02 1286243    /usr/lib/foo-1.3/libplugin-quux.so
7ffa5d7c6000-7ffa5d7c7000 rw-p 00000000 fe:02 1286243    /usr/lib/foo-1.3/libplugin-quux.so
7ffa5d7c7000-7ffa5d91f000 r-xp 00000000 fe:02 4055115    /lib/libc-2.11.2.so
7ffa5d91f000-7ffa5db1e000 ---p 00158000 fe:02 4055115    /lib/libc-2.11.2.so
7ffa5db1e000-7ffa5db22000 r--p 00157000 fe:02 4055115    /lib/libc-2.11.2.so
7ffa5db22000-7ffa5db23000 rw-p 0015b000 fe:02 4055115    /lib/libc-2.11.2.so
7ffa5db23000-7ffa5db28000 rw-p 00000000 00:00 0 
7ffa5db28000-7ffa5db2a000 r-xp 00000000 fe:02 4055114    /lib/libdl-2.11.2.so
7ffa5db2a000-7ffa5dd2a000 ---p 00002000 fe:02 4055114    /lib/libdl-2.11.2.so
7ffa5dd2a000-7ffa5dd2b000 r--p 00002000 fe:02 4055114    /lib/libdl-2.11.2.so
7ffa5dd2b000-7ffa5dd2c000 rw-p 00003000 fe:02 4055114    /lib/libdl-2.11.2.so
7ffa5dd2c000-7ffa5dd4a000 r-xp 00000000 fe:02 4055128    /lib/ld-2.11.2.so
7ffa5df26000-7ffa5df29000 rw-p 00000000 00:00 0 
7ffa5df46000-7ffa5df49000 rw-p 00000000 00:00 0 
7ffa5df49000-7ffa5df4a000 r--p 0001d000 fe:02 4055128    /lib/ld-2.11.2.so
7ffa5df4a000-7ffa5df4b000 rw-p 0001e000 fe:02 4055128    /lib/ld-2.11.2.so
7ffa5df4b000-7ffa5df4c000 rw-p 00000000 00:00 0 
7fffedc07000-7fffedc1c000 rw-p 00000000 00:00 0          [stack]
7fffedcdd000-7fffedcde000 r-xp 00000000 00:00 0          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0  [vsyscall]

That's a serious red flag: fooserver is loading the bar plugin from FooServer version 1.2 and the quux plugin from FooServer version 1.3. If the versions aren't binary-compatible, that might explain the mysterious crashes. You can now hassle the user for their config files and try to fix the problem.

Just for fun, let's take a closer look at what the memory map means. Right away, we can recognize a memory address range (first column), a filename (last column), and file-like permission bits rwx. So each line indicates that the contents of a particular file are available to the process at a particular range of addresses with a particular set of permissions. For more details, see the proc manpage.
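
Since the format is line-oriented and whitespace-separated, it's also easy to consume from a script. Here's a short Python sketch that lists the distinct files mapped into a process, which is enough to spot the mismatched plugin directories without reading the whole map by eye:

import sys

# List the distinct files mapped into a process's address space.
pid = sys.argv[1]
mapped = set()

with open('/proc/%s/maps' % pid) as f:
    for line in f:
        fields = line.split()
        # Lines with a sixth field name a backing file (or [heap], [stack], ...).
        if len(fields) == 6 and fields[5].startswith('/'):
            mapped.add(fields[5])

for path in sorted(mapped):
    print(path)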

The executable itself is mapped twice: once for executing code, and once for reading and writing data. The same is true of the shared libraries. The flag p indicates a private mapping: changes to this memory area will not be shared with other processes, or saved to disk. We certainly don't want the global variables in a shared library to be shared by every process which loads that library. If you're wondering, as I was, why some library mappings have no access permissions, see this glibc source comment. There are also a number of "anonymous" mappings lacking filenames; these exist in memory only. An allocator like malloc can ask the kernel for such a mapping, then parcel out this storage as the application requests it.
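
You can even watch an anonymous mapping appear in your own map. Here's a little Python experiment that counts the pathless entries in its own map, creates a private anonymous mapping with mmap, and counts again:

import mmap

def anonymous_mappings():
    # Anonymous mappings are the map entries with no pathname field.
    with open('/proc/self/maps') as f:
        return [line for line in f if len(line.split()) == 5]

before = len(anonymous_mappings())
# A 10 MB private anonymous mapping, just like malloc might request.
m = mmap.mmap(-1, 10 * 1024 * 1024,
              flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
after = len(anonymous_mappings())

print('anonymous mappings: %d before, %d after' % (before, after))

The count usually goes up by one, though the kernel is free to merge adjacent anonymous regions with identical flags.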

The last two entries are special creatures which aim to reduce system call overhead. At boot time, the kernel will determine the fastest way to make a system call on your particular CPU model. It builds this instruction sequence into a little shared library in memory, and provides this virtual dynamic shared object (named vdso) for use by userspace code. Even so, the overhead of switching to the kernel context should be avoided when possible. Certain system calls such as gettimeofday are merely reading information maintained by the kernel. The kernel will store this information in the public virtual system call page (named vsyscall), so that these "system calls" can be implemented entirely in userspace.

Counting interruptions

We have a process which is taking a long time to run. How can we tell if it's CPU-bound or IO-bound?

When a process makes a system call, the kernel might let a different process run for a while before servicing the request. This voluntary context switch is especially likely if the system call requires waiting for some resource or event. If a process is only doing pure computation, it's not making any system calls. In that case, the kernel uses a hardware timer interrupt to eventually perform a nonvoluntary context switch.

The file /proc/$PID/status has fields labeled voluntary_ctxt_switches and nonvoluntary_ctxt_switches showing how many of each event have occurred. Let's try our slow reader process from before:

keegan@lyle$ python slow-reader.py bigfile &
[1] 15264
keegan@lyle$ watch -d -n 1 'cat /proc/15264/status | grep ctxt_switches'

You should see mostly voluntary context switches. Our program calls into the kernel in order to read or sleep, and the kernel can decide to let another process run for a while. We could use strace to see the individual calls. Now let's try a tight computational loop:

keegan@lyle$ cat tightloop.c
int main() {
  while (1) {
  }
}
keegan@lyle$ gcc -Wall -o tightloop tightloop.c
keegan@lyle$ ./tightloop &
[1] 30086
keegan@lyle$ watch -d -n 1 'cat /proc/30086/status | grep ctxt_switches'

You'll see exclusively nonvoluntary context switches. This program isn't making system calls; it just spins the CPU until the kernel decides to let someone else have a turn. Don't forget to kill this useless process!
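
This check is also easy to script: sample the two counters, wait a bit, sample again, and compare the deltas. A rough Python sketch:

import sys
import time

def ctxt_switches(pid):
    # Pull both context-switch counters out of /proc/PID/status.
    counts = {}
    with open('/proc/%s/status' % pid) as f:
        for line in f:
            if 'ctxt_switches' in line:
                name, value = line.split(':')
                counts[name.strip()] = int(value)
    return counts

pid = sys.argv[1]
before = ctxt_switches(pid)
time.sleep(5)
after = ctxt_switches(pid)

vol = after['voluntary_ctxt_switches'] - before['voluntary_ctxt_switches']
nonvol = after['nonvoluntary_ctxt_switches'] - before['nonvoluntary_ctxt_switches']

print('voluntary: %d, nonvoluntary: %d' % (vol, nonvol))
print('probably %s-bound' % ('IO' if vol > nonvol else 'CPU'))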

Staying ahead of the OOM killer

The Linux memory subsystem has a nasty habit of making promises it can't keep. A userspace program can successfully allocate as much memory as it likes. The kernel will only look for free space in physical memory once the program actually writes to the addresses it allocated. And if the kernel can't find enough space, a component called the OOM killer will use an ad-hoc heuristic to choose a victim process and unceremoniously kill it.

Needless to say, this feature is controversial. The kernel has no reliable idea of who's actually responsible for consuming the machine's memory. The victim process may be totally innocent. You can disable memory overcommitting on your own machine, but there's inherent risk in breaking assumptions that processes make — even when those assumptions are harmful.

As a less drastic step, let's keep an eye on the OOM killer so we can predict where it might strike next. The victim process will be the process with the highest "OOM score", which we can read from /proc/$PID/oom_score:

keegan@lyle$ cat oom-scores.bash
#!/bin/bash
for procdir in $(find /proc -maxdepth 1 -regex '/proc/[0-9]+'); do
  printf "%10d %6d %s\n" \
    "$(cat $procdir/oom_score)" \
    "$(basename $procdir)" \
    "$(cat $procdir/cmdline | tr '\0' ' ' | head -c 100)"
done 2>/dev/null | sort -nr | head -n 20

For each process we print the OOM score, the PID (obtained from the directory name) and the process's command line. proc provides string arrays in NUL-delimited format, which we convert using tr. It's important to suppress error output using 2>/dev/null because some of the processes found by find (including find itself) will no longer exist within the loop. Let's see the results:

keegan@lyle$ ./oom-scores.bash 
  13647872  15439 /usr/lib/chromium-browser/chromium-browser --type=plugin
   1516288  15430 /usr/lib/chromium-browser/chromium-browser --type=gpu-process
   1006592  13204 /usr/lib/nspluginwrapper/i386/linux/npviewer.bin --plugin
    687581  15264 /usr/lib/chromium-browser/chromium-browser --type=zygote
    445352  14323 /usr/lib/chromium-browser/chromium-browser --type=renderer
    444930  11255 /usr/lib/chromium-browser/chromium-browser --type=renderer
...

Unsurprisingly, my web browser and Flash plugin are prime targets for the OOM killer. But the rankings might change if some runaway process caused an actual out-of-memory condition. Let's (carefully!) run a program that will (slowly!) eat 500 MB of RAM:

keegan@lyle$ cat oomnomnom.c
#include <unistd.h>
#include <string.h>
#include <sys/mman.h>
#define SIZE (10*1024*1024)

int main() {
  int i;
  for (i=0; i<50; i++) {
    void *m = mmap(NULL, SIZE, PROT_WRITE,
      MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    memset(m, 0x80, SIZE);
    sleep(1);
  }
  return 0;
}

On each loop iteration, we ask for 10 megabytes of memory as a private, anonymous (non-file-backed) mapping. We then write to this region, so that the kernel will have to allocate some physical RAM. Now we'll watch OOM scores and free memory as this program runs:

keegan@lyle$ gcc -Wall -o oomnomnom oomnomnom.c
keegan@lyle$ ./oomnomnom &
[1] 19697
keegan@lyle$ watch -d -n 1 './oom-scores.bash; echo; free -m'

You'll see oomnomnom climb to the top of the list.

So we've seen a few ways that proc can help us solve problems. Actually, we've only scratched the surface. Inside each process's directory you'll find information about resource limits, chroots, CPU affinity, page faults, and much more. What are your favorite proc tricks? Let us know in the comments!

~keegan
Comments:

Regarding the undeletion from proc fd entries, the problem on most Unices is that there is no system call to create a hard link from an inode number, and most kernel devs agree that there's no need for one. You can use filesystem debugging tools to do this, like debugfs on ext2/3/4 for Linux, which is similar to fsdb on UFS for BSD. But that method has the following caveats:

The block device is accessed directly, so you must have root privileges.
The filesystem must be clean, with no errors.
The filesystem format used must support raw access/debugging, and tools must be available for it.

I have successfully recreated files that way several times in the past; once I recreated a running db file that way and the DBA was speechless. But I always felt that I was risking too much.

The correct approach would be a tool that uses the Linux VFS kernel calls to manipulate the filesystem via the kernel's VFS drivers. I found a project on GitHub doing that, but I can't seem to locate it now. Anyway, that seems to be disabled in recent kernels; see the following commit:

commit aae8a97d3ec30788790d1720b71d76fd8eb44b73
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date: Sat Jan 29 18:43:27 2011 +0530

fs: Don't allow to create hardlink for deleted file

Add inode->i_nlink == 0 check in VFS. Some of the file systems
do this internally. A followup patch will remove those instance.
This is needed to ensure that with link by handle we don't allow
to create hardlink of an unlinked file. The check also prevent a race
between unlink and link

Posted by guest on May 09, 2012 at 07:21 AM EDT #

Very useful. Thanks! I especially liked the Phantom Progress Bar.

Posted by Roger on August 07, 2012 at 07:39 PM EDT #
