Learn mdb in 30 minutes

As part of my involvement with the Blastwave community software project, I maintain SPARC and x86 versions of the Subversion packages for Solaris. When I became the maintainer for this software, I inherited with it a strange bug, which now is causing me problems. I currently use CVS for my day to day revision control needs, but I'd like to switch to SVN if only for the cool factor, but also to evaluate how it could eventually be used to replace the ClearCase repository my organization uses.

The issue is with the Apache 2 module mod_dav_svn, which enables SVN repository access via the WebDAV (distributed authoring and versioning) protocol. This is a great feature, as then it is possible to take advantage of modules like mod_auth_ldap to manage access control, or mod_perl to provide even more sophisticated functionality. The bug I was tracing occurs any time a file is committed to a SVN repository via DAV. The commit will fail, and the Apache error logs will contain a line or two like this:

[Thu Sep 30 16:00:10 2004] [notice] child pid 18115 exit signal Illegal instruction (4)

The Apache worker thread dies, and is replaced by another thread, waiting for a new request. (It seems to me that this should probably have a higher severity than notice) On the client side, the svn client outputs this useful message:

svn: Commit failed (details follow):
svn: MKACTIVITY of /svn/!svn/act/7cba529c-56e5-0310-9002-f184ad8e84d2:
Could not read status line: connection was closed by server. (http://server:5957)

So how was I going to track down what was happening with mod_dav_svn if it was being launched by Apache in response to an incoming commit request from my subversion client? I was going to learn how to use mdb, that's how. The first thing I did was learn how to launch a program in the debugger. It's easy: mdb /path/to/program. Apache requires a set of arguments, specifically the -X argument, which forces it to stay in the foreground and such. Finding out how to set arguments was a bit more difficult.

I'm somewhat familiar with the GNU debugger gdb, but this was my first time into mdb or any sort of Solaris debugging of this type (I've done very very low level hardware, firmware and software debugging on the Sun Fire mid-frame server line, but never applications). Using the mdb dcmd ::dcmds lists out the available commands. ::help <dcmd> for any dcmd will give the usage for that dcmd. I picked out the ::run dcmd, which is how arguments are passed into the target executable, and how execution is restarted from the top.

Now that I know this much, I can see what is happening to this module. Once the server is started with the ::run dcmd, I tried to commit a small change to the sample svn repository on my server. As soon as I connect, mdb dumps out with a SIGILL (Illegal Instruction). Using the ::stack dcmd prints out the stack trace:

bash$ mdb -uM /opt/csw/apache2/sbin/httpd 
Loading modules: [ libthread.so.1 libc.so.1 ]
> ::run -X -f /opt/csw/apache2/etc/svn.conf
mdb: stop on SIGILL
mdb: target stopped at:
0x169f08:       unimp     0
mdb: You've got symbols!
Loading modules: [ ld.so.1 ]
> ::stack
0x169f08(fca7bbec, 178010, 1, fff, 176340, ff36d518)
mod_dav_svn.so`dav_svn_get_txn+0x44(177f90, 177eba, 0, 243f0, fedccdd4, fe62949c)
mod_dav_svn.so`dav_svn_prep_activity+0xc(177ee0, 177eba, fe62aaac, fe60a5e4, 177ee0, 0)
mod_dav_svn.so`dav_svn_get_resource+0x5a0(176378, 2f, 0, 0, fca7bda4, 177ee0)
0xfe705564(176378, 800, 0, fca7bda4, 27500, 0)
0xfe70f08c(176378, 1, fe62ac04, 0, 800, fe72c9fc)
ap_invoke_handler+0x178(176378, fe710298, e313c, e2f94, 0, 0)
ap_process_request+0x30(176378, 4, 176378, 0, fe4614a0, 0)
0x2aa60(152ca0, 0, 152cf8, 8ac00, 8ac00, 1000)
ap_process_connection+0xc8(152ca0, 2aa04, e32a8, 0, 152c98, 8d9f0)
0x34790(f81f0, 152bb0, 18, 152ca0, 0, 18)
libthread.so.1`_lwp_start(0, 0, 0, 0, 0, 0)

From the stack trace, we can see that the issue is occuring in some unnamed function called by the mod_dav_svn function dav_svn_get_txn. Setting a breakpoint on this function can be done using the ::bp dcmd using the symbol name. Now we can try the commit again with a new breakpoint and see where it gets us:

> ::bp dav_svn_get_txn
> ::run -X -f /opt/csw/apache2/etc/svn.conf
mdb: stop at dav_svn_get_txn
mdb: target stopped at:
mod_dav_svn.so`dav_svn_get_txn: save      %sp, -0x90, %sp
mdb: You've got symbols!

Excellent. Now we have control over mod_dav_svn before it hits the illegal instruction. Mdb shows in the output that we have just saved off the stack pointer after dav_svn_get_txn is called by dav_svn_prep_activity. Looking in the actual source code for dav_svn_get_txn shows that the first steps in this function are to initialize and open a Berkeley DB (BDB) database, which is used by this installation of subversion to manage transactions in the repository. Mdb will now allow us to single step through the code using the short dcmd :s which steps one instruction at a time. If single stepping gets into a C library function or other function that is not critical to the current debugging task, specify :s out to step out of the function:

> :s
mdb: target stopped at:
libsvn_subr-1.so.0.0.0`svn_path_join+0x28:      call      +0x26d9c    <PLT=libc.so.1`strlen>
> :s
mdb: target stopped at:
libsvn_subr-1.so.0.0.0`svn_path_join+0x2c:      nop
> :s out
mdb: stop at mod_dav_svn.so`dav_svn_get_txn+0x34
mdb: target stopped at:
mod_dav_svn.so`dav_svn_get_txn+0x34:    mov       1, %o2

This shows mdb stepping out of the strlen() function called from dav_svn_get_txn. Quite a bit of single stepping is needed to get to the actual error, as we call into the Apache support libraries eventually. I knew from some previous research that others had found the issue to be inside of the aprutil library, so I knew not to step out of any apr_ functions. One of the biggest pains I found with mdb was single stepping. What I really wanted to do was to break on a function and single step through everything within libaprutil (but only libaprutil), printing each disassembled line. I couldn't find out how to do this in 30 minutes using the Solaris Modular Debugger Guide, and I also couldn't find how to repeat the command I just sent to mdb. What I really needed, and continue to need, as I still haven't found the answer, is a command like gdb '.', which repeats the last command verbatim. With this, I would be able to type :s, then use a single character to step through lines, rather than two (50% efficiency improvement :). Eventually, after carpal tunnel started to set in, I hit the SIGILL:

> :s
mdb: target stopped at:
0xff3515c0:     ld        [%o0 + 0xf0], %l3
> :s
mdb: target stopped at:
0xff3515c4:     jmpl      %l3, %o7
> :s
mdb: target stopped at:
0xff3515c8:     clr       %o1
> :s
mdb: target stopped at:
0x169fc8:       unimp     0
> ::dis -w
0x169fa0:                       unimp     0
0x169fa4:                       unimp     0
0x169fa8:                       unimp     0
0x169fac:                       unimp     0
0x169fb0:                       cb13      -0x440000   <0xffd29fb0>
0x169fb4:                       unimp     0
0x169fb8:                       unimp     0
0x169fbc:                       unimp     0
0x169fc0:                       unimp     0x69
0x169fc4:                       unimp     0
0x169fc8:                       unimp     0      <--- culprit
0x169fcc:                       unimp     0
0x169fd0:                       unimp     0
0x169fd4:                       unimp     2
0x169fd8:                       swapa     [%g4 + -0x18] %asi, %i7
0x169fdc:                       swapa     [%g5 + %l0] 03, %i7
0x169fe0:                       unimp     0x20
0x169fe4:                       unimp     0xa
0x169fe8:                       unimp     0
0x169fec:                       unimp     0
0x169ff0:                       unimp     0

Using the ::dis dcmd disassembles instructions near the current code pointer. The -w argument shows a window on either side of the current line. One more step and the process would throw a SIGILL and the apache request will die. A little more break point experimentation resulted in this call sequence:


It looks like Forte generated a bit of bad code in libaprutil, which could be due to the fact that it is compiled with a high optimization level. This supported the claims of some users on the Subversion user mailing list who indicated that the issues were within libaprutil. The user had recompiled APR, Apache 2 and Subversion with GCC 3.4, and the SIGILL problem went away. The next step for me is to work with the apr and aprutil maintainer to produce some unoptimized versions of the APR libraries.


In s10_49, the following RFE went back:
<tt>4964128 mdb should support single-key accelerators ] and [ for ::step, ::step over</tt>
which pretty much does what you're asking for. (i.e. hitting <tt>]]]</tt> will ::step thrice)

Posted by Jonathan Adams on September 30, 2004 at 11:29 AM PDT #

Well, mdb is excellent at what it does, but dbx is sometimes a bit easier to do certain userland stuff. Single steping is one example: on dbx prompt, try "trace step" and it will auto-step each instruction automatically:
$ dbx /bin/ls
Reading ls
Reading ld.so.1
Reading libc.so.1
Reading libdl.so.1
Reading libc_psr.so.1
(dbx 1) stop in main
dbx: warning: 'main' has no debugger info -- will trigger on first instruction
(2) stop in main
(dbx 2) run
Running: ls 
(process id 29804)
stopped in main at 0x0001114c
0x0001114c: main       :        save     %sp, -96, %sp
(dbx 3) tracei step
(3) tracei step
(dbx 4) cont
0x00011150: main+0x0004:        sethi    %hi(0x13800), %g2
0x00011154: main+0x0008:        mov      %i0, %l2
0x00011158: main+0x000c:        add      %g2, 812, %o1
0x0001115c: main+0x0010:        clr      %l1
0x00011160: main+0x0014:        call     setlocale [PLT]        ! 0x24058
0x00011164: main+0x0018:        mov      6, %o0
0x00024058: setlocale        [PLT]:     sethi    %hi(0x15000), %g1
0x0002405c: setlocale+0x0004 [PLT]:     ba,a     _PROCEDURE_LINKAGE_TABLE_ [PLT]        ! 0x24004
0x00024004: _PROCEDURE_LINKAGE_TABLE_        [PLT]:     save     %sp, -64, %sp
0x00024008: _PROCEDURE_LINKAGE_TABLE_+0x0004 [PLT]:     call     elf_rtbndr     ! 0x7fbb2c00
0x0002400c: _PROCEDURE_LINKAGE_TABLE_+0x0008 [PLT]:     nop      
0x7fbb2c00: elf_rtbndr       :  mov      %i7, %o0
0x7fbb2c04: elf_rtbndr+0x0004:  save     %sp, -96, %sp
0x7fbb2c08: elf_rtbndr+0x0008:  srl      %g1, 10, %o1
0x7fbb2c0c: elf_rtbndr+0x000c:  ld       [%i7 + 8], %o0
0x7fbb2c10: elf_rtbndr+0x0010:  call     elf_bndr       ! 0x7fbc744c
0x7fbb2c14: elf_rtbndr+0x0014:  mov      %i0, %o2
0x7fbc744c: elf_bndr       :    save     %sp, -160, %sp
0x7fbc7450: elf_bndr+0x0004:    call     elf_bndr+0xc   ! 0x7fbc7458
0x7fbc7454: elf_bndr+0x0008:    sethi    %hi(0x20800), %o1
0x7fbc7458: elf_bndr+0x000c:    mov      %i2, %l3
0x7fbc745c: elf_bndr+0x0010:    inc      952, %o1
0x7fbc7460: elf_bndr+0x0014:    mov      %i0, %l2
0x7fbc7464: elf_bndr+0x0018:    add      %o1, %o7, %i3
0x7fbc7468: elf_bndr+0x001c:    mov      %i1, %i5
0x7fbc746c: elf_bndr+0x0020:    call     enter  ! 0x7fbc0ba0
0x7fbc7470: elf_bndr+0x0024:    clr      [%fp - 4]
0x7fbc0ba0: enter       :       save     %sp, -96, %sp
0x7fbc0ba4: enter+0x0004:       call     enter+0xc      ! 0x7fbc0bac
0x7fbc0ba8: enter+0x0008:       sethi    %hi(0x27400), %o1
\^Cdbx: warning: Interrupt ignored but forwarded to child.
signal INT (Interrupt) in enter at 0x7fbc0ba8
0x7fbc0ba8: enter+0x0008:       sethi    %hi(0x27400), %o1
(dbx 5) 
Of course, I can easily imagine adding a similar feature to mdb... BTW, if you think this is a compiler bug, please file a bug.

Posted by Seongbae Park on October 05, 2004 at 02:19 AM PDT #

Post a Comment:
Comments are closed for this entry.



« March 2015