My kingdom for a crash dump

In the Solaris group, we take a great deal of pride in our debugging tools. Every tool we design (truss, DTrace, MDB, KMDB, ptools, libumem) helps reduce time to root cause from days to hours or minutes, in addition to solving previously unsolvable problems. Over the past week or two I've some conversations about the state of Linux debugging, especially kernel debugging. Since I haven't had much Linux experience, I thought I'd do a little research on our competitor. All I can say is that the Linux community doesn't seem to take the same pride in debugging tools as we do. In particular, I was shocked by the lack of post-mortem debugging tools available for the linux kernel. While this may not be immediately useful for you developers out there, I find it interesting, and later I will dig into post-mortem debugging for user processes.

Post mortem debugging begins with a 'dump'. For processes, this is a usually called a 'core' (most UNIX users have probably seen something like 'segmentation fault - core dumped' before). For kernels, it is called a 'crash dump'. When something bad happens the system will 'panic', and desperately try to stop the world and write out the current machine state while relying on as little kernel state as possible. The world of a panicking kernel is a very scary place; I won't go into how exactly this works. Suffice to say that after you reboot the system, you will end up with a file such as /var/crash/machine/{vmcore,unix}.11

These crashdumps are essential to us kernel engineers, and the failure of the linux community to recognize this2 is shocking. The reason that these dumps are so important is that they are often the only evidence we will ever have of the bug. In particular, we must be able to root-cause a problem from a single crash dump. If a production system panics for one of our Fortune 500 customers, we can't exactly go up to them and say "Hmmm, that's too bad. Can you run this instrumented kernel with these flags and make it happen again?" Especially since we've already cost our customer a lot of money3 by failing in the first place. We can try to reproduce it in development, but as Bryan points out, this can be an error-prone and often futile process. With post-mortem debugging, the customer simply sends us a crash dump and we do all our analysis on this file while their system continues to (hopefully) run happily in production.

This seemingly simple idea forms the foundation of Solaris debugging (really part of a trifecta now that DTrace and KMDB have arrived). On the Linux side, there is nothing built into the base kernel. When a Linux kernel 'oops' happens (their name for a panic), all you get to work with is a simple message, the current machine state, and a stack backtrace. As you will see, only the most trivial problems can be solved with this limited set of information. There is a patch to add this functionality, known as Linux Kernel Crash Dumps (LKCD). Surprising as it may seem, this patch has yet to make it into the base kernel. The politics around this are simply astounding. A good overview of why is covered here, with Linus's initial rejection here. The reasons seem largely political, but there is a definite statement that this doesn't belong in the base kernel - it should be a vendor-added "feature". The last patch available seems to be for kernel version 2.5.44, so the project may be falling into disrepair. Without crash dumps, your only recourse is to try and reproduce on an instrumented kernel. Even if you can reproduce it, this usually means inserting printf() statements: the lack of a coherent kernel debugger such as KMDB is a subject for a future post. My favorite quote is from this email, which states:

If it becomes apparent through empirical data that crash dumps are a useful tool, I'm sure that Linus will become far more amenable. Until then, lets let him handle all of his other work which needs to get done.

You need empirical evidence to prove that crash dumps are useful? Try looking at any of the tens of thousands of kernel bugs in our database. At the bottom of this post you'll find a random bug that I analyzed not that long ago. Don't worry if you don't have a clue what it means; it may not be comprehensible to anyone outside of the Solaris kernel group, but it's a nice show of MDB's power on a real-world bug.

Note that the line that glazes over "a little ::kgrep | ::whattype magic" is actually really interesting. ::kgrep will search for references to an address anywhere in the kernel. See Bryan Cantrill's AADEBUG paper for information on the gory details of ::typegraph and ::whattype. With their powers combined, I was able to find the lock address in the stack of the thread shown.

I've read a little bit of the LKCD documentation, and I'm not convinced that the bug below could have been diagnosed with their set of commands, and certainly it would have been a much more painful task. If you've made it this far through my post, you are probably totally confused and/or or screaming "but how can I benefit from this?" Well, MDB works just as well on user processes (and cores) as it does on kernel crash dumps. A single tool can debug kernel crash dumps, live kernels, process cores, and live processes. I look forward to exploring MDB in much more detail in future posts.


Bug excerpt:

A quick cursory glance shows the the following:

> ::walk thread | ::findstack ! munges
88      ##################################  tp: 300072ebb40
        cv_waituntil_sig+0x8c()
        semop+0x548()
        syscall_trap+0x88()

72      ##################################  tp: 2a100017ce0
        taskq_thread+0x110()
        thread_start+4()

54      ##################################  tp: 3000ee13ca0
        rw_enter_sleep+0x144()
        as_pageunlock+0x48()
        default_physio+0x340()
        pread+0x244()
        syscall_trap+0x88()

We have 54 threads waiting for an address space lock.  Not necessarily a 
smoking gun, but a good place to start.  All of these threads are hung up on
the following rwlock:

> 0x3000eda5358::print rwlock_impl_t
{
    rw_wwwh = 0xb
}

So someone has a reader lock and we're all stuck waiting.  Maybe we have a 
recursive rw_enter(RW_READER) somewhere?  This rwlock belongs to the following
address space:

> 3000eda5328::print -a struct as a_lock
{
    3000eda5358 a_lock._opaque = [ 0xb ]
}
> 3000eda5328::as2proc | ::ps
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R   2673      1   2673   2673     30 0x00004000 0000030011f0e030 
oracle_mt_x04_fb

So who owns the current reader lock?  With a little ::kgrep | ::whattype magic
the owner appears to be:

> 30013e05220::findstack   
stack pointer for thread 30013e05220: 2a1011a09a1
[ 000002a1011a09a1 sema_p+0x140() ]
  000002a1011a0a51 biowait+0x60()
  000002a1011a0b01 ufs_getpage_miss+0x2f0()
  000002a1011a0c01 ufs_getpage+0x6a8()
  000002a1011a0d61 fop_getpage+0x48()
  000002a1011a0e31 segvn_fault+0x834()
  000002a1011a0fe1 as_fault+0x4a0()
  000002a1011a10f1 pagefault+0xac()
  000002a1011a11b1 trap+0xc14()
  000002a1011a12f1 utl0+0x4c()
>

In this thread, we've taken the address space lock in order to fault in a page,
but UFS is stuck in biowait().  Poking around, we see there are a bunch of
threads similarly stuck:

> ::walk thread | ::findstack ! grep biowait | wc -l
      28

So now something's amiss in the I/O system, which has had a cascading effect on
the VM system, and all hell breaks loose.  The particular buf that we're
interested in is:

> 30006a5a048::print struct buf b_vp | ::vnode2path
/dev/dsk/../../devices/ssm@0,0/pci@18,700000/pci@1/SUNW,isptwo@4/sd@1,0:b

So we're doing some I/O to one of the disks (that's a shocker). There are several 
more suspicious stacks in the dump with traces like:

> 2a1009d9ce0::findstack
stack pointer for thread 2a1009d9ce0: 2a1009d85c1
  000002a1009d8671 swtch+0x78()
  000002a1009d8721 turnstile_block+0x5f8()
  000002a1009d87d1 mutex_vector_enter+0x3dc()
  000002a1009d8881 aio_done+0x1c4()
  000002a1009d8951 volkiodone+0x364()
  000002a1009d8ad1 volsiodone+0x544()
  000002a1009d8bb1 vol_subdisksio_done+0x68()
  000002a1009d8c71 volkcontext_process+0x29c()
  000002a1009d8d41 voldiskiodone+0x368()
  000002a1009d8e41 ssd_return_command+0x198()
  000002a1009d8ef1 ssdintr+0x224()
  000002a1009d8fa1 qlc_fast_fcp_post+0x188()
  000002a1009d9051 qlc_isr+0x378()
  000002a1009d9151 qlc_intr+0xd0()
  000002a1009d9221 pci_intr_wrapper+0x9c()
  000002a1009d92d1 intr_thread+0x130()
  000002a100969171 disp_getwork+0xe8()

This thread is blocked on the following mutex:

> 000002a1009d8881+7ff::print struct frame fr_arg[4]
fr_arg[4] = 0x3001967c060

Which is the aio_mutex for this aio structure:

> 0x3001967c000::print -a aio_t aio_mutex
{
    3001967c060 aio_mutex._opaque = [ 0x30018878c21 ]
}

The owner of this mutex is here:

> 0x30018878c20::findstack
stack pointer for thread 30018878c20: 2a10175cd51
[ 000002a10175cd51 turnstile_block+0x5f8() ]
  000002a10175ce01 rw_enter_sleep+0x144()
  000002a10175ceb1 as_pageunlock+0x48()
  000002a10175cf71 aphysio_unlock+0x6c()
  000002a10175d021 aio_cleanup_cleanupq+0x2c()
  000002a10175d0d1 aio_cleanup+0xc4()
  000002a10175d181 aio_cleanup_thread+0x9c()
  000002a10175d231 kaioc+0xe0()
  000002a10175d2f1 syscall_trap+0x88()
>

Now it seems like we've finally found the culprit.  aio_cleanup_cleanupq() takes
the aio_mutex lock like so:

        [... source code removed ...]

We then call aphysio_unlock().  This particular aio has AIO_PAGELOCKDONE set, so
we go to unlock the page:

        [... source code removed ...]

For as_pageunlock(), we need the address space lock.

The address space lock is held by thread 30013e05220, which is stuck waiting for
I/O to complete.  Now we're in an ugly deadlock, since the I/O can't complete
while the as lock is held.

1 Essentially, we write out all of kernel memory to a reserved place on disk. Usually this is swap space, but it can be configured through dumpadm(1M). When the system reboots, savecore(1M) extracts this information and puts it in a more permanent location.

2 There are many people in the Linux community, so obviously this blanket statement doesn't apply to everyone. Whatever the local politics, I can't see how this wasn't implemented in Linux 0.1.0. Although it did take us until Solaris 7 to enable savecore(1M) by default.

3 We have customers that lose more than $100,000 per minute that their system is down. There is a reason why Solaris has to be bulletproof.

Comments:

I found a few more interesting Linux tidbits at http://puggy.symonds.net/~hareesh/ibm/project-seminar/project-seminar-1.pdf, which suggests that LKCD is alive and well, and that they are working on tools to allow gdb to examine the resulting crash dumps. There also seems to be a new project in planning as mentioned at http://www.linuxinsider.com/story/33729.html. The LKCD project is regularly integrated into custom branches like -osdl, and certainly makes it into RHEL and many other distros. In my mind that's still not good enough.

Posted by Eric Schrock on June 20, 2004 at 05:22 PM PDT #

What did you really expect? Linus is a hack and linnex is a chunk. It's more about hype than getting things done, remember? That's what makes the internet go round!

Posted by Joseph X on June 21, 2004 at 05:15 PM PDT #

Post a Comment:
Comments are closed for this entry.
About

Musings about Fishworks, Operating Systems, and the software that runs on them.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today