Sometimes as a sysadmin the logfiles just don't cut it, and to solve a problem you need to know what's really going on. That's when I turn to strace -- the system-call tracer.
A system call, or syscall, is where a program crosses the boundary between user code and the kernel. Fortunately for us using strace, that boundary is where almost everything interesting happens in a typical program.
The two basic jobs of a modern operating system are abstraction and multiplexing. Abstraction means, for example, that when your program wants to read and write to disk it doesn't need to speak the SATA protocol, or SCSI, or IDE, or USB Mass Storage, or NFS.
It speaks in a single, common vocabulary of directories and files, and
the operating system translates that abstract vocabulary into whatever
has to be done with the actual underlying hardware you have.
Multiplexing means that your programs and mine each get fair access to
the hardware, and don't have the ability to step on each other -- which
means your program can't be permitted to skip the kernel, and speak raw
SATA or SCSI to the actual hardware, even if it wanted to.
So for almost everything a program wants to do, it needs to talk to the kernel. Want to read or write a file? Make the open() syscall, and then the syscalls read() or write(). Talk on the network? You need the syscalls socket(), connect(), and again read() and write(). Make more processes? First clone() (inside the standard C library function fork()), then you probably want execve() so the new process runs its own program, and you probably want to interact with that process somehow, with one of wait4(), kill(), pipe(), and a host of others. Even looking at the clock requires a system call, clock_gettime(). Every one of those system calls will show up when we apply strace to the program.
In fact, just about the only thing a process can do without making a
telltale system call is pure computation -- using the CPU and RAM and
nothing else. As a former algorithms person, that's what I used to
think was the fun part. Fortunately for us as sysadmins, very few
real-life programs spend very long in that pure realm between having to
deal with a file or the network or some other part of the system, and
then strace picks them up again.
Let's look at a quick example of how strace solves problems.
Use #1: Understand A Complex Program's Actual Behavior
One day, I wanted to know which Git
commands take out a certain lock -- I had a script running a series of
different Git commands, and it was failing sometimes when run
concurrently because two commands tried to hold the lock at the same
time.
Now, I love sourcediving, and I've done some Git hacking, so I spent some time with the source
tree investigating this question. But this code is complex enough that I
was still left with some uncertainty. So I decided to get a plain,
ground-truth answer to the question: if I run "git diff", will it grab this lock?
Strace to the rescue. The lock is on a file called index.lock. Anything trying to touch the file will show up in strace. So we can just trace a command the whole way through and use grep to see if index.lock is mentioned:
$ strace git status 2>&1 >/dev/null | grep index.lock open(".git/index.lock", O_RDWR|O_CREAT|O_EXCL, 0666) = 3 rename(".git/index.lock", ".git/index") = 0 $ strace git diff 2>&1 >/dev/null | grep index.lock $
So git status takes the lock, and git diff doesn't.
Interlude: The Toolbox
To help make it useful for so many purposes, strace takes a variety of options to add or cut out different kinds of detail and help you see exactly what's going on.
In Medias Res, If You Want
Sometimes we don't have the luxury of starting a program over to run it under strace -- it's running, it's misbehaving, and we need to find out what's going on. Fortunately strace handles this case with ease. Instead of specifying a command line for strace to execute and trace, just pass -p PID where PID is the process ID of the process in question -- I find pstree -p invaluable for identifying this -- and strace will attach to that program, while it's running, and start telling you all about it.
Times
When I use strace, I almost always pass the -tt option. This tells me when each syscall happened -- -t prints it to the second, -tt
to the microsecond. For system administration problems, this often
helps a lot in correlating the trace with other logs, or in seeing where
a program is spending too much time.
For performance issues, the -T option comes in handy too -- it tells me how long each individual syscall took from start to finish.
Data
By default strace already prints the strings that the
program passes to and from the system -- filenames, data read and
written, and so on. To keep the output readable, it cuts off the
strings at 32 characters. You can see more with the -s option -- -s 1024 makes strace print up to 1024 characters for each string -- or cut out the strings entirely with -s 0.
Sometimes you want to see the full data flowing in just a few
directions, without cluttering your trace with other flows of data.
Here the options -e read= and -e write= come in handy.
For example, say you have a program talking to a database server,
and you want to see the SQL queries, but not the voluminous data that
comes back. The queries and responses go via write() and read() syscalls on a network socket to the database. First, take a preliminary look at the trace to see those syscalls in action:
$ strace -p 9026 Process 9026 attached - interrupt to quit read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116 poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout) write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52 read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116 poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout) write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52 [...]
Those write() syscalls are the SQL queries -- we can make out the SELECT foo FROM bar, and then it trails off. To see the rest, note the file descriptor the syscalls are happening on -- the first argument of read() or write(), which is 3 here. Pass that file descriptor to -e write=:
$ strace -p 9026 -e write=3 Process 9026 attached - interrupt to quit read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116 poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout) write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52 | 00000 30 00 00 00 03 53 45 4c 45 43 54 20 74 69 6d 65 0....SEL ECT time | | 00010 73 74 61 6d 70 20 46 52 4f 4d 20 61 72 74 69 66 stamp FR OM artif | | 00020 61 63 74 73 20 57 48 45 52 45 20 69 64 20 3d 20 acts WHE RE id = | | 00030 31 34 35 34 1454 |
and we see the whole query. It's both printed and in hex, in case
it's binary. We could also get the whole thing with an option like -s 1024, but then we'd see all the data coming back via read() -- the use of -e write= lets us pick and choose.
Filtering the Output
Sometimes the full syscall trace is too much -- you just want to see
what files the program touches, or when it reads and writes data, or
some other subset. For this the -e trace= option was made. You can select a named suite of system calls like -e trace=file (for syscalls that mention filenames) or -e trace=desc (for read() and write()
and friends, which mention file descriptors), or name individual system
calls by hand. We'll use this option in the next example.
Child Processes
Sometimes the process you trace doesn't do the real work itself, but
delegates it to child processes that it creates. Shell scripts and Make runs are notorious for taking this behavior to the extreme. If that's the case, you may want to pass -f to make strace "follow forks" and trace child processes, too, as soon as they're made.
For example, here's a trace of a simple shell script, without -f:
$ strace -e trace=process,file,desc sh -c \ 'for d in .git/objects/*; do ls $d >/dev/null; done' [...] stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=101992, ...}) = 0 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f4b68af5770) = 11948 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 11948 --- SIGCHLD (Child exited) @ 0 (0) -- wait4(-1, 0x7fffc3473604, WNOHANG, NULL) = -1 ECHILD (No child processes)
Not much to see here -- all the real work was done inside process 11948, the one created by that clone() syscall.
Here's the same script traced with -f (and the trace edited for brevity):
$ strace -f -e trace=process,file,desc sh -c \ 'for d in .git/objects/*; do ls $d >/dev/null; done' [...] stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=101992, ...}) = 0 clone(Process 10738 attached child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5a93f99770) = 10738 [pid 10682] wait4(-1, Process 10682 suspended [pid 10738] open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3 [pid 10738] dup2(3, 1) = 1 [pid 10738] close(3) = 0 [pid 10738] execve("/bin/ls", ["ls", ".git/objects/28"], [/* 25 vars */]) = 0 [... setup of C standard library omitted ...] [pid 10738] stat(".git/objects/28", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 [pid 10738] open(".git/objects/28", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3 [pid 10738] getdents(3, /* 40 entries */, 4096) = 2480 [pid 10738] getdents(3, /* 0 entries */, 4096) = 0 [pid 10738] close(3) = 0 [pid 10738] write(1, "04102fadac20da3550d381f444ccb5676"..., 1482) = 1482 [pid 10738] close(1) = 0 [pid 10738] close(2) = 0 [pid 10738] exit_group(0) = ? Process 10682 resumed Process 10738 detached <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 10738 --- SIGCHLD (Child exited) @ 0 (0) ---
Now this trace could be a miniature education in Unix in itself -- future blog post? The key thing is that you can see ls do its work, with that open() call followed by getdents().
The output gets cluttered quickly when multiple processes are traced at once, so sometimes you want -ff, which makes strace write each process's trace into a separate file.
Use #2: Why/Where Is A Program Stuck?
Sometimes a program doesn't seem to be doing anything. Most often,
that means it's blocked in some system call. Strace to the rescue.
$ strace -p 22067 Process 22067 attached - interrupt to quit flock(3, LOCK_EX
Here it's blocked trying to take out a lock, an exclusive lock (LOCK_EX) on the file it's opened as file descriptor 3. What file is that?
$ readlink /proc/22067/fd/3 /tmp/foobar.lock
Aha, it's the file /tmp/foobar.lock. And what process is holding that lock?
$ lsof | grep /tmp/foobar.lock command 21856 price 3uW REG 253,88 0 34443743 /tmp/foobar.lock command 22067 price 3u REG 253,88 0 34443743 /tmp/foobar.lock
Process 21856 is holding the lock. Now we can go figure out why
21856 has been holding the lock for so long, whether 21856 and 22067
really need to grab the same lock, etc.
Other common ways the program might be stuck, and how you can learn more after discovering them with strace:
Uses #3, #4, ...
A post of this length can only scratch the surface of what strace can do in a sysadmin's toolbox. Some of my other favorites include