Thursday Jun 24, 2010

Attack of the Cosmic Rays!

It's a well-documented fact that RAM in modern computers is susceptible to occasional random bit flips due to various sources of noise, most commonly high-energy cosmic rays. By some estimates, you can even expect error rates as high as one error per 4GB of RAM per day! Many servers these days have ECC RAM, which uses extra bits to store error-correcting codes that let them correct most bit errors, but ECC RAM is still fairly rare in desktops, and unheard-of in laptops.

For me, bitflips due to cosmic rays are one of those problems I always assumed happen to "other people". I also assumed that even if I saw random cosmic-ray bitflips, my computer would probably just crash, and I'd never really be able to tell the difference from some random kernel bug.

A few weeks ago, though, I encountered some bizarre behavior on my desktop, that honestly just didn't make sense. I spent about half an hour digging to discover what had gone wrong, and eventually determined, conclusively, that my problem was a single undetected flipped bit in RAM. I can't prove whether the problem was due to cosmic rays, bad RAM, or something else, but in any case, I hope you find this story interesting and informative.

The problem

The symptom that I observed was that the expr program, used by shell scripts to do basic arithmetic, had started consistently segfaulting. This first manifested when trying to build a software project, since the GNU autotools make heavy use of this program:

[nelhage@psychotique]$ autoreconf -fvi
autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal --force -I m4
autoreconf: configure.ac: tracing
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
…

dmesg revealed that the segfaulting program was expr:

psychotique kernel: [105127.372705] expr[7756]: segfault at 1a70 ip 0000000000001a70 sp 00007fff2ee0cc40 error 4 in expr

And I was easily able to reproduce the problem by hand:

[nelhage@psychotique]$ expr 3 + 3
Segmentation fault

expr definitely hadn't been segfaulting as of a day ago or so, so something had clearly gone suddenly, and strangely, wrong. I had no idea what, but I decided to find out.

Check the dumb things

I run Ubuntu, so the first things I checked were the /var/log/dpkg.log and /var/log/aptitude.log files, to determine whether any suspicious packages had been upgraded recently. Perhaps Ubuntu accidentally let a buggy package slip into the release. I didn't recall doing any significant upgrades, but maybe dependencies had pulled in an upgrade I had missed.

The logs revealed I hadn't upgraded anything of note in the last several days, so that theory was out.

Next up, I checked env | grep ^LD. The dynamic linker takes input from a number of environment variables, all of whose names start with LD_. Was it possible I had somehow ended up setting some variable that was messing up the dynamic linker, causing it to link a broken library or something?

[nelhage@psychotique]$ env | grep ^LD
[nelhage@psychotique]$ 

That, too, turned up nothing.

Start digging

I was fortunate in that, although this failure is strange and sudden, it seemed perfectly reproducible, which means I had the luxury of being able to run as many tests as I wanted to debug it.

The problem is a segfault, so I decided to pull up a debugger and figure out where it's segfaulting. First, though, I'd want debug symbols, so I could make heads or tails of the crashed program. Fortunately, Ubuntu provides debug symbols for every package they ship, in a separate repository. I already had the debug sources enabled, so I used dpkg -S to determine that expr belongs to the coreutils package:

[nelhage@psychotique]$ dpkg -S $(which expr)
coreutils: /usr/bin/expr

And installed the coreutils debug symbols:

[nelhage@psychotique]$ sudo aptitude install coreutils-dbgsym

Now, I could run expr inside gdb, catch the segfault, and get a stack trace:

[nelhage@psychotique]$ gdb --args expr 3 + 3
…
(gdb) run
Starting program: /usr/bin/expr 3 + 3
Program received signal SIGSEGV, Segmentation fault.
0x0000000000001a70 in ?? ()
(gdb) bt
#0  0x0000000000001a70 in ?? ()
#1  0x0000000000402782 in eval5 (evaluate=true) at expr.c:745
#2  0x00000000004027dd in eval4 (evaluate=true) at expr.c:773
#3  0x000000000040291d in eval3 (evaluate=true) at expr.c:812
#4  0x000000000040208d in eval2 (evaluate=true) at expr.c:842
#5  0x0000000000402280 in eval1 (evaluate=<value optimized out>) at expr.c:921
#6  0x0000000000402320 in eval (evaluate=<value optimized out>) at expr.c:952
#7  0x0000000000402da5 in main (argc=2, argv=0x0) at expr.c:329

So, for some reason, the eval5 function has jumped off into an invalid memory address, which of course causes a segfault. Repeating the test a few time confirmed that the crash was totally deterministic, with the same stack trace each time. But what is eval5 trying to do that's causing it to jump off into nowhere? Let's grab the source and find out:

[nelhage@psychotique]$ apt-get source coreutils
[nelhage@psychotique]$ cd coreutils-7.4/src/
[nelhage@psychotique]$ gdb --args expr 3 + 3
# Run gdb, wait for the segfault
(gdb) up
#1  0x0000000000402782 in eval5 (evaluate=true) at expr.c:745
745           if (nextarg (":"))
(gdb) l
740       trace ("eval5");
741     #endif
742       l = eval6 (evaluate);
743       while (1)
744         {
745           if (nextarg (":"))
746             {
747               r = eval6 (evaluate);
748               if (evaluate)
749                 {

I used the apt-get source command to download the source package from Ubuntu, and ran gdb in the source directory, so it could find the files referred to by the debug symbols. I then used the up command in gdb to go up a stack frame, to the frame where eval5 called off into nowhere.

From the source, we see that eval5 is trying to call the nextarg function. `gdb` will happily tell us where that function is supposed to be located:

(gdb) p nextarg
$1 = {_Bool (const char *)} 0x401a70 <nextarg>

Comparing that address with the address in the stack trace above, we see that they differ by a single bit. So it appears that somewhere a single bit has been flipped, causing that call to go off into nowhere.

But why?

So there's a flipped bit. But why, and how did it happen? First off, let's determine where the problem is. Is it in the expr binary itself, or is something more subtle going on?

[nelhage@psychotique]$ debsums coreutils | grep FAILED
/usr/bin/expr                                                             FAILED

The debsums program will compare checksums of files on disk with a manifest contained in the Debian package they came from. In this case, examining the coreutils package, we see that the expr binary has in fact been modified since it was installed. We can verify how it's different by downloading a new version of the package, and comparing the files:

[nelhage@psychotique]$ aptitude download coreutils
[nelhage@psychotique]$ mkdir coreutils
[nelhage@psychotique]$ dpkg -x coreutils_7.4-2ubuntu1_amd64.deb coreutils
[nelhage@psychotique]$ cmp -bl coreutils/usr/bin/expr /usr/bin/expr
 10113 377 M-^? 277 M-?

aptitude download downloads a .deb package, instead of actually doing the installation. I used dpkg -x to just extract the contents of the file, and cmp to compare the packaged expr with the installed one. -b tells cmp to list any bytes that differ, and -l tells it to list all differences, not just the first one. So we can see that two bytes differ, and by a single bit, which agrees with the failure we saw. So somehow the installed expr binary is corrupted.

So how did that happen? We can check the mtime ("modified time") field on the program to determine when the file on disk was modified (assuming, for the moment, that whatever modified it didn't fix up the mtime, which seems unlikely):

[nelhage@psychotique]$ ls -l /usr/bin/expr
-rwxr-xr-x 1 root root 111K 2009-10-06 07:06 /usr/bin/expr*

Curious. The mtime on the binary is from last year, presumably whenever it was built by Ubuntu, and set by the package manager when it installed the system. So unless something really fishy is going on, the binary on disk hasn't been touched.

Memory is a tricky thing.

But hold on. I have 12GB of RAM on my desktop, most of which, at any moment, is being used by the operating system to cache the contents of files on disk. expr is a pretty small program, and frequently used, so there's a good chance it will be entirely in cache, and my OS has basically never touched the disk to load it, since it first did so, probably when I booted my computer. So it's likely that this corruption is entirely in memory. But how can we test that? Simple: by forcing the OS to discard the cached version and re-read it from disk.

On Linux, we can do this by writing to the /proc/sys/vm/drop_caches file, as root. We'll take a checksum of the binary first, drop the caches, and compare the checksum after forcing it to be re-read:

[nelhage@psychotique]$ sha256sum /usr/bin/expr
4b86435899caef4830aaae2bbf713b8dbf7a21466067690a796fa05c363e6089  /usr/bin/expr
[nelhage@psychotique]$ echo 3 | sudo tee /proc/sys/vm/drop_caches
3
[nelhage@psychotique]$ sha256sum /usr/bin/expr
5dbe7ab7660268c578184a11ae43359e67b8bd940f15412c7dc44f4b6408a949  /usr/bin/expr
[nelhage@psychotique]$ sha256sum coreutils/usr/bin/expr
5dbe7ab7660268c578184a11ae43359e67b8bd940f15412c7dc44f4b6408a949  coreutils/usr/bin/expr

And behold, the file changed. The corruption was entirely in memory. And, furthermore, expr no longer segfaults, and matches the version I downloaded earlier.

(The sudo tee idiom is a common one I used to write to a file as root from a normal user shell. sudo echo 3 > /proc/sys/vm/drop_caches of course won't work because the file is still opened for writing by my shell, which doesn't have the required permissions).

Conclusion

As I mentioned earlier, I can't prove this was due to a cosmic ray, or even a hardware error. It could have been some OS bug in my kernel that accidentally did a wild write into my memory in a way that only flipped a single bit. But that'd be a pretty weird bug.

And in fact, since that incident, I've had several other, similar problems. I haven't gotten around to memtesting my machine, but that does suggest I might just have a bad RAM chip on my hands. But even with bad RAM, I'd guess that flipped bits come from noise somewhere -- they're just susceptible to lower levels of noise. So it could just mean I'm more susceptible to the low-energy cosmic rays that are always falling. Regardless of whatever the cause was, though, I hope this post inspires you to think about the dangers of your RAM corrupting your work, and that the tale of my debugging helps you learn some new tools that you might find useful some day.

Now that I've written this post, I'm going to go memtest my machine and check prices on ECC RAM. In the meanwhile, leave your stories in the comments -- have you ever tracked a problem down to memory corruption? What are your practices for coping with the risk of these problems?

~nelhage

Edited to add a note that this could well just be bad RAM, in addition to a one-off cosmic-ray event.

About

Tired of rebooting to update systems? So are we -- which is why we invented Ksplice, technology that lets you update the Linux kernel without rebooting. It's currently available as part of Oracle Linux Premier Support, Fedora, and Ubuntu desktop. This blog is our place to ramble about technical topics that we (and hopefully you) think are interesting.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today