Sunday Oct 21, 2007

Debugging an xVM Panic

Handling an xVM Panic

The first project I took on when joining the xVM team was figuring out how to react when the xVM hypervisor panics. When Solaris panics, we are almost always able to capture a core dump representing the entire kernel state when the panic occurred. When the hypervisor panics, it dumps some cryptic information to the screen and resets the machine.

Obviously, our preference would be to capture as much information on a hypervisor panic as we do on a Solaris panic. There are two significant difficulties with this. First, xVM has no device drivers of its own, so xVM isn't capable of writing any of its own state to disk. Second, in the xVM model the hypervisor and the dom0 OS work closely together to deliver the full virtualization experience, so even capturing the xVM state would only give us half the information we need.

The solution I adopted was to modify the xVM hypervisor to pass control to Solaris in the event of a panic, allowing Solaris to read all of the xVM state, interpret the xVM hypervisor as a standard kernel module, and execute the standard Solaris panic code that we have spent many years hardening. Exactly how this works is pretty cool (for some particularly geeky definition of "cool"), but is much more complicated than I'm prepared to handle in a blog post. For the highlights, see the slides attached to the entry below.

If we just take it as read that a Solaris dom0 can properly capture an xVM core dump, I'd like to walk through one fairly simple example of using a dump to examine the state of both Solaris and xVM to diagnose a bug that one of our QA guys hit a few months back.

Interpreting an xVM core dump

The test machine is in a lab somewhere in China. It panicked one night, giving us the following information:

panic[cpu2]/vcpu=0xffbe1100: Domain 0 crashed: 
         gs:      161  fs:        0  es:     e010  ds:     e010
        edi: f4b63000 esi: ff192f62 ebp: ffbf3f38 esp: ffbf3f1c
        ebx: ffbf3f44 edx:        3 ecx:     1efc eax:        3
        trp:        0 err:        0 eip: ff13b83f  cs:     e008
        efl:      282 usp: ffbf3fe0  ss:     e010
        cr0: 8005003b cr2: c5a08fd4 cr3:   be1d00 cr4:      6f0
Xen panic[cpu2]/vcpu=0xffbe1100: Domain 0 crashed:

ffbf3f38 0xff11b058 (in Xen)
ffbf3f48 0xff1043f0 (in Xen)
ffbf3f68 0xff10435c (in Xen)
ffbf3f88 0xff10438b (in Xen)
ffbf3fa8 0xff17daf1 (in Xen)
dumping to /dev/dsk/c0t0d0s1, offset 860356608, content: kernel
100% done: 52473 pages dumped, compression ratio 3.51, dump succeeded

There's not a whole lot there to go on. If necessary, I'm sure we could have made a start at debugging the problem with only these few lines. Fortunately, that wasn't necessary. As the last two lines show, we were able to capture a core dump for postmortem examination.

Since this was a test machine under active use it was subject to crashing on a regular basis, so trying to examine the core dump on that machine was destined to be an exercise in futility. The QA engineer uploaded the entire dump to a more stable machine sitting in a lab in California, where I was able to debug it at my leisure.

The first step is to load the core dump into mdb, the Solaris modular debugger. Here we can see that the machine on which the debugger is running is called fsh-trout and the machine that generated the dump was called goldcloth. We can also see exactly which Solaris build the machine was running.

fsh-trout# ls -l
total 1316400
-rw-r--r--   1 root     root     1196459 Apr 10 11:29 unix.0
-rw-r--r--   1 root     root     672440320 Apr 10 11:59 vmcore.0
fsh-trout# mdb *0
Loading modules: [ unix genunix specfs dtrace xpv_psm scsi_vhci ufs  ip
hook neti sctp arp usba fctl nca lofs zfs random sppp ptm nfs md logindmux
xpv ]
> $<utsname
{
    sysname = [ "SunOS" ]
    nodename = [ "goldcloth" ]
    release = [ "5.11" ]
    version = [ "matrix-build-2007-04-02" ]
    machine = [ "i86pc" ]
}

For a standard Solaris core dump this is a pretty typical way to start. For an xVM dump however, we need to know more than the Solaris build. We also need to know which version of xVM we were running:

> xenver::print
{
    xv_ver = 0xccad8fc0 "3.0.4-1-sun"
    xv_chgset = 0xccad1f80 "Fri Jan 05 16:11:49 2007 +0000 13187:e2466414acc3"
    xv_compiler = 0xccacffc0 "gcc version 3.4.3 (csl-sol210-3_4-20050802)"
    xv_compile_date = 0xccad6f08 "Mon Apr  2 06:59:37 PDT 2007"
    xv_compile_by = 0xccad8fa0 "xen-discuss"
    xv_compile_domain = 0xccad8f80 "opensolaris.org"
    xv_caps = 0xccad8f60 "xen-3.0-x86_32p"
}

In this case, the QA engineer emailed me the original panic message. Even if he hadn't, I can extract it from the core dump:

> ::msgbuf
Xen panic[cpu2]/vcpu=0xffbe1100: Domain 0 crashed: 
         gs:      161  fs:        0  es:     e010  ds:     e010
        edi: f4b63000 esi: ff192f62 ebp: ffbf3f38 esp: ffbf3f1c
        ebx: ffbf3f44 edx:        3 ecx:     1efc eax:        3
        trp:        0 err:        0 eip: ff13b83f  cs:     e008
        efl:      282 usp: ffbf3fe0  ss:     e010
        cr0: 8005003b cr2: c5a08fd4 cr3:   be1d00 cr4:      6f0
ffbf3f38 0xff11b058 (in Xen)
ffbf3f48 0xff1043f0 (in Xen)
ffbf3f68 0xff10435c (in Xen)
[...]

The last few lines represent the bottom of the panicking thread's execution stack. If you're a glutton for punishment, you can manually translate those addresses into symbol names. I, on the other hand, am lazy and prefer to let the machine do that kind of grunt work. To show the stack in human-readable form:

> $C                                  
ffbf3f38 xpv`panic+0x33()
ffbf3f48 xpv`dom0_shutdown+0x43()
ffbf3f68 xpv`domain_shutdown+0x22()
ffbf3f88 xpv`__domain_crash+0x9d()
ffbf3fa8 xpv`__domain_crash_synchronous+0x25()
ffbf3fc8 0xff17daf1()
c5a0904c page_ctr_sub+0x47(0, 0, f3328700, 1)
c5a090e4 page_get_mnode_freelist+0x3c1(0, 10, 0, 0, 20009)
c5a09164 page_get_freelist+0x16f(f502ed14, e5406000, 0, c5a09224, e5406000,
[...] 
c5a0a154 lofi_map_file+0xfc(2400000, c5a0a3e8, 1, c5a0a800, cd326e78, 80100403)
c5a0a360 lofi_ioctl+0x13e(2400000, 4c4601, c5a0a3e8, 80100403, cd326e78, 
c5a0a38c cdev_ioctl+0x2e(2400000, 4c4601, c5a0a3e8, 80100403, cd326e78, c5a0a800
c5a0a3b4 ldi_ioctl+0xa4(dfb15310, 4c4601, c5a0a3e8, 80100403, cd326e78, c5a0a800
c5a0a804 xdb_setup_node+0xbc(e52e1000, c5a0a84c)
c5a0ac58 xdb_open_device+0xc9(e52e1000)
c5a0ac88 xdb_start_connect+0x59(e52e1000)
c5a0acbc xdb_oe_state_change+0x70(c66c9088, c57e5c40, 0, c5a0ad70)
c5a0acf8 ndi_event_run_callbacks+0x87(c54f0d80, c66c9088, c57e5c40, c5a0ad
[...]

In this stacktrace, we are looking at a mixture of Solaris and xVM hypervisor code. The symbols prefixed by xpv` represent symbols in the xVM portion of the address space. Those of you familiar with mdb will recognize this as the notation for a kernel module. Even though the machine originally booted with xVM living in an address space completely unknown to the kernel, in the core dump we have made it look like just another kernel module. This allows our existing debugging tools to work on xVM dumps without any modification at all.

One unfortunate limitation of mdb at the moment is that it doesn't understand xVM's notation for exception frames. xVM represents these frames on the stack by inverting the frame pointer. Since mdb isn't aware of this convention, it simply prints the first value of the frame out as a hex number. In the example above, that occurs in the frame at 0xffbf3fc8.
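To make the convention concrete, here is a minimal sketch in C of how a stack walker that knows about the convention could recognize such a frame. The function and variable names are hypothetical, not mdb's or xVM's actual code; the idea is simply that a complemented frame pointer doesn't look like a stack address until it is complemented again.

#include <stdint.h>

/*
 * Hypothetical sketch only: xVM marks an exception frame by storing the
 * bitwise complement of the frame pointer.  A walker that knows the
 * convention can detect the inversion and undo it; the names and the
 * bounds check below are mine.
 */
static int
is_exception_frame(uintptr_t saved_fp, uintptr_t stk_lo, uintptr_t stk_hi)
{
	uintptr_t restored = ~saved_fp;

	/* If the complemented value points back into the stack, it was inverted. */
	return (restored >= stk_lo && restored < stk_hi);
}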

Since I suspect the cause of the panic is to be found in the exception frame, I have to decode it by hand:

> ffbf3fc8,4/X                        
0xffbf3fc8:     c5a0901c        0               e0002           f4c76e9d

The third slot in the frame contains the type of exception (0xe, which is a page fault), and the fourth slot in the frame contains the instruction pointer when the exception occurred. How do I know this? By reading the xVM exception handling code. Now you know, and you don't have to read the code yourself. You're welcome.
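Laid out as a structure, the four words dumped above look roughly like this. This is a hypothetical sketch based on the description in the previous paragraph, not xVM's real frame layout or field names:

#include <stdint.h>

/*
 * Hypothetical sketch of the four words at 0xffbf3fc8 -- not xVM's
 * actual structure definition.
 */
struct xvm_exc_frame {
	uint32_t xef_slot0;	/* c5a0901c */
	uint32_t xef_slot1;	/* 0 */
	uint32_t xef_type;	/* e0002: encodes the exception type (0xe, a page fault) */
	uint32_t xef_eip;	/* f4c76e9d: instruction pointer when the exception occurred */
};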

We have the instruction pointer, but what does it correspond to?

> f4c76e9d::dis  
page_ctr_sub_internal:          pushl  %ebp
page_ctr_sub_internal+1:        movl   %esp,%ebp
page_ctr_sub_internal+3:        subl   $0x30,%esp
page_ctr_sub_internal+6:        andl   $0xfffffff0,%esp
page_ctr_sub_internal+9:        pushl  %ebx
page_ctr_sub_internal+0xa:      pushl  %esi
page_ctr_sub_internal+0xb:      pushl  %edi

The very first instruction of the Solaris routine page_ctr_sub_internal() caused xVM to take a page fault. When you see a stack operation trigger a page fault, it almost always means you've blown your stack - that is, you've used all the space allocated for the stack and are now accessing some address beyond the stack limit. A pushl stores to the word just below %esp, so if the stack pointer has already run past the end of the stack, even the first push of a function prologue will fault.

We want to confirm that this is the cause of the panic, and then try to understand why it happened. To do so, we have to go back to the panic message to find the address that caused the fault (which on x86 can be found in the %cr2 register) and the CPU on which it happened.


> ::msgbuf !egrep "(cr2|vcpu)"
     Xen panic[cpu2]/vcpu=0xffbe1100: Domain 0 crashed: 
        cr0: 8005003b cr2: c5a08fd4 cr3:   be1d00 cr4:      6f0

OK, the faulting address is easy enough: 0xc5a08fd4. The CPU is a little bit trickier. Since this was an xVM panic, we know the xVM vcpu that caught the exception, but we really want to know the Solaris CPU. To figure that out, we have to wander through a few levels of indirection.

First we translate from the vcpu pointer to the VCPU id.

> 0xffbe1100::print struct vcpu vcpu_id
vcpu_id = 0

The VCPU id (0) is the same as the Solaris CPU id, so we look it up in the cpuinfo table:

> ::cpuinfo
 ID ADDR     FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD   PROC
  0 f50454b8  1b    0    0  60   no    no t-0    c5a0ade0 sched
  1 c5423080  1b    1    0  59  yes    no t-0    c6178ee0 python2.4
  2 c5a71a80  1b    0    0  59   no    no t-0    c617cac0 xenstored
  3 c5a70a00  1b    1    0  -1   no    no t-0    c5afdde0 (idle)

As a bonus, this table also tells us which thread was running on the CPU when it panicked: 0xc5a0ade0. Since our hypothesis is that this thread blew its stack, let's take a look at the base address of the stack:

> c5a0ade0::print kthread_t t_stkbase
t_stkbase = 0xc5a09000

The bottom of the stack is at 0xc5a09000 and we just tried to access 0xc5a08fd4 - which is 44 bytes below the bottom of the stack. So we were right about the cause of the panic. To see what ate up all of the stack space, let's look at the stack again:

> $C                                  
[...] 
c5a0a154 lofi_map_file+0xfc(2400000, c5a0a3e8, 1, c5a0a800, cd326e78, 80100403)
c5a0a360 lofi_ioctl+0x13e(2400000, 4c4601, c5a0a3e8, 80100403, cd326e78, 
c5a0a38c cdev_ioctl+0x2e(2400000, 4c4601, c5a0a3e8, 80100403, cd326e78, c5a0a800
c5a0a3b4 ldi_ioctl+0xa4(dfb15310, 4c4601, c5a0a3e8, 80100403, cd326e78, c5a0a800
c5a0a804 xdb_setup_node+0xbc(e52e1000, c5a0a84c)
c5a0ac58 xdb_open_device+0xc9(e52e1000)
c5a0ac88 xdb_start_connect+0x59(e52e1000)
c5a0acbc xdb_oe_state_change+0x70(c66c9088, c57e5c40, 0, c5a0ad70)
c5a0acf8 ndi_event_run_callbacks+0x87(c54f0d80, c66c9088, c57e5c40, c5a0ad
[...]

The leftmost column contains the frame pointer for each frame in the stack. Each time that number increases by a large amount, it means the calling routine allocated a lot of memory on the stack. In this example, both xdb_open_device() and xdb_setup_node() are allocating over 1KB on the stack.
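For example, taking the difference between adjacent frame pointers in the trace above shows roughly how much stack each of those two routines consumed:

0xc5a0a804 - 0xc5a0a3b4 = 0x450 (1104 bytes)   consumed by xdb_setup_node()
0xc5a0ac58 - 0xc5a0a804 = 0x454 (1108 bytes)   consumed by xdb_open_device()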

By inspecting these routines we could see that each was allocating MAXPATHLEN-sized arrays on the stack. That's not large for a userspace application, but it's pretty big for the kernel. By replacing these stack allocations with calls to kmem_alloc(), we were able to reduce the stack consumption, which fixed the panic.
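To give a flavor of the change, here is a minimal sketch. The function names, the path being built, and the surrounding code are hypothetical, not the actual xdb source, but kmem_alloc()/kmem_free() and MAXPATHLEN are the real kernel interfaces involved:

#include <sys/param.h>	/* MAXPATHLEN */
#include <sys/kmem.h>
#include <sys/systm.h>	/* snprintf */

/* Before: a MAXPATHLEN (1024-byte) buffer eats kernel stack. */
static void
build_path_on_stack(const char *node)
{
	char path[MAXPATHLEN];

	(void) snprintf(path, sizeof (path), "%s/some-property", node);
	/* ... use path ... */
}

/* After: allocate the buffer from the kernel heap and free it when done. */
static void
build_path_on_heap(const char *node)
{
	char *path = kmem_alloc(MAXPATHLEN, KM_SLEEP);

	(void) snprintf(path, MAXPATHLEN, "%s/some-property", node);
	/* ... use path ... */
	kmem_free(path, MAXPATHLEN);
}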

Without the ability to capture a full dump of both xVM and Solaris, this trivial bug would have been much harder to detect and fix. The only information we would have had was the fact that we took a page fault, and the address on which we faulted. If we were lucky, we might also have gotten a hex stack dump - not a human readable stack like we see above, but a raw block of a few hundred hex numbers.
