Wednesday Jul 08, 2009

Lies, Damned Lies, and Stack Traces

The kernel stack trace is a critical piece of information for diagnosing kernel bugs, but it can be tricky to interpret due to quirks in the processor architecture and in optimized code. Some of these are well known: tail calls and leaf functions obscure frames, function arguments may live in registers that have been modified since entry, and so on. These quirks can cause you to waste time chasing the wrong problem if you are not careful.

Here is a less well known example to be wary of that is specific to SPARC kernel stacks. Use mdb to examine the panic thread in a kernel crash dump:

    > *panic_thread::findstack
    stack pointer for thread 30014adaf60: 2a10c548671
      000002a10c548721 die+0x98()
      000002a10c548801 trap+0x768()
      000002a10c548961 ktl0+0x64()
      000002a10c548ab1 hat_unload_callback+0x358()
      000002a10c548f21 segvn_unmap+0x2a8()
      000002a10c549021 as_free+0xf4()
      000002a10c5490d1 relvm+0x234()
      000002a10c549181 proc_exit+0x490()
      000002a10c549231 exit+8()
      000002a10c5492e1 syscall_trap+0xac()
    

This says that the thread did something bad at hat_unload_callback+0x358, which caused a trap and panic. But what does panicinfo show?

    > ::panicinfo
                 cpu              195
              thread      30014adaf60
             message BAD TRAP: type=30 rp=2a10c549210 addr=0 mmu_fsr=9
                  pc          1031360
    

The pc symbolizes to this:

    > 1031360::dis -n 0
    hat_unload_callback+0x3f8:      ldx       [%l4 + 0x10], %o3
    

Hmm, that is not the same offset that was shown in the call stack: 3f8 versus 358. Which one should you believe?

panicinfo is correct, and the call stack lies. The lie is an artifact of the conventional interpretation of the o7 register in the SPARC architecture, plus a discontinuity caused by the trap. In the standard calling sequence, the call instruction saves its own pc in the o7 register and writes the destination address to the pc, and the destination then executes a save instruction that slides the register window and renames the o registers to i registers. A stack walker interprets the value of i7 in each window as the pc.

However, a SPARC trap uses a different mechanism for saving the pc, and does not modify o7. When the trap handler executes a save instruction, the o7 register contains the pc of the most recent call instruction. This is marginally interesting, but totally unrelated to the pc at which the trap was taken. The stack walker later extracts this value of o7 from the window and shows it as the frame's pc, which is wrong.
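
The mechanism described above can be sketched with a toy model that tracks only o7, i7, and the pc. This is an illustration under stated assumptions, not a description of real hardware or of the Solaris trap handler: the class `ToySparc`, the addresses `CALLER_PC`, `HELPER`, and `TRAP_HANDLER`, and the premise that the call at hat_unload_callback+0x358 went to a function that did its own save and restore are all hypothetical. Only the pc 0x1031360 and the offsets 0x358 and 0x3f8 come from the example.

```python
# Toy model of SPARC register windows, tracking only o7/i7 and the pc.
# Every address except the values derived from the post is hypothetical.

HUC_BASE = 0x1031360 - 0x3f8   # hat_unload_callback, derived from the ::dis output
CALLER_PC = 0x2000             # hypothetical address of the call in segvn_unmap
HELPER = 0x3000                # hypothetical function called at +0x358
TRAP_HANDLER = 0x4000          # hypothetical trap handler entry point


class ToySparc:
    def __init__(self, pc):
        self.pc = pc
        self.o7 = 0            # o7 as seen by the currently running function
        self.i7 = []           # i7 of each saved window, outermost first
        self.trap_pc = None    # pc recorded by the trap mechanism

    def call(self, target):
        self.o7 = self.pc      # call writes its own address to o7
        self.pc = target

    def save(self):
        self.i7.append(self.o7)  # the caller's o7 becomes the new window's i7
        self.o7 = 0

    def ret_restore(self):
        call_pc = self.i7.pop()
        self.pc = call_pc + 8  # return past the call and its delay slot
        self.o7 = call_pc      # the caller's o7 is visible again, unchanged

    def trap(self, handler):
        self.trap_pc = self.pc  # recorded separately; o7 is not modified
        self.pc = handler

    def walk(self):
        return list(reversed(self.i7))  # innermost frame first


def simulate():
    cpu = ToySparc(CALLER_PC)
    cpu.call(HUC_BASE)             # segvn_unmap calls hat_unload_callback
    cpu.save()
    cpu.pc = HUC_BASE + 0x358      # the most recent call before the trap
    cpu.call(HELPER)
    cpu.save()
    cpu.ret_restore()
    cpu.pc = HUC_BASE + 0x3f8      # the load that faults
    cpu.trap(TRAP_HANDLER)
    cpu.save()                     # the handler's save captures the stale o7
    return cpu.walk(), cpu.trap_pc


frames, trap_pc = simulate()
print(hex(frames[0] - HUC_BASE), hex(trap_pc - HUC_BASE))
```

In this model the innermost frame reported by `walk()` sits at offset 0x358, while the pc recorded by the trap sits at offset 0x3f8, which is the same disagreement seen between ::findstack and ::panicinfo.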

This particular stack lie occurs only after a trap, so you can recognize it by the presence of the Solaris trap function ktl0() on the stack. You can find the correct pc in a "struct regs" that the trap handler pushes on the stack at address sp+7ff-a0 (hexadecimal), where sp is the stack pointer for the frame listed immediately after ktl0() in the trace, that is, the frame of the code that trapped. From the example above, use the sp value to the left of hat_unload_callback:

    > 000002a10c548ab1+7ff-a0::print struct regs r_pc
    r_pc = 0x1031360
    

This works for any thread. If you are examining the panic thread, then the ::panicinfo command performs the calculation for you and shows the correct pc.
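
The address arithmetic can also be checked outside the debugger. The following is a minimal sketch that reuses the numbers from the example. The function name `regs_address` and the constant names are made up for illustration. The value 0x7ff matches the 64-bit SPARC V9 stack bias of 2047, and 0xa0 is simply the offset given above.

```python
STACK_BIAS = 0x7ff    # 64-bit SPARC V9 stack bias (2047)
REGS_OFFSET = 0xa0    # offset of the struct regs, as given in the text


def regs_address(sp: int) -> int:
    """Return the address of the struct regs for a biased stack pointer."""
    return sp + STACK_BIAS - REGS_OFFSET


# sp shown to the left of hat_unload_callback in the ::findstack output
sp = 0x2A10C548AB1
print(hex(regs_address(sp)))   # prints 0x2a10c549210
```

The result is the same address that appears as rp=2a10c549210 in the BAD TRAP message, which is a quick sanity check that you picked the right frame.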

About

Steve Sistare
