Monday Dec 17, 2007

why is the stacktrace in gdb different to the one in dbx ?

After I posted my last blog, another idea struck me, how about I try live debugging with dbx ?

I downloaded Sun Studio 12 and installed on my lab system running Red Hat Enterprise Linux ES release 4 (Nahant Update 4). When I started apache in dbx, it crashed the first time around but subsequent starts did not crash. Even when it crashed, I couldn't get the symbols working correctly in dbx. What's the point in debugging live if I couldn't read the symbols correctly ?

I decided not to spend time on live debugging and focus instead on crash analysis.  But I was in for another surprise that is -- the stacktrace output in  dbx and gdb looked very different. I just couldn't figure out why the stacktrace output on gdb and dbx looked so different ?

The backtrace output in gdb looked as follows:

(gdb) bt
#0  0x003a9194 in spin () from /root/web_agents/apache22_agent/lib/libamapc22.so
#1  0x024dedc1 in PR_Select () from /usr/lib/libnspr4.so
#2  0x00b96371 in start_thread () from /lib/tls/libpthread.so.0
#3  0x009fdffe in clone () from /lib/tls/libc.so.6 

 
Now, compare that with the stacktrace in dbx ?

(dbx)

t@null (l@1) terminated by signal SEGV (Segmentation fault)
0x00929b7d: __mmap+0x000d: movl %edx,%ebx
Current function is apr_dso_load
139 os_handle = dlopen(path, flags);


(dbx) where
[1] __mmap(0xcbf000, 0x71d0, 0x5, 0x802, 0xa, 0x0), at 0x929b7d
[2] _dl_map_object_from_fd(0xbfe57e80, 0x90b9528, 0x90b9200, 0x2, 0x80000000, 0xbfe57e78), at 0x91d809
[3] _dl_map_object(0x2, 0x0, 0x80000000, 0x0, 0xbfe583b4, 0x44fd66), at 0x91f0bc
[4] openaux(0xbfe583b4, 0x80000000, 0x0, 0xbfe583b4, 0xb7f3d6bc, 0x922d50), at 0x922d98
[5] _dl_catch_error(0xbfe583b4, 0x24, 0xbfe584b0, 0xbfe58188, 0x92dfd4, 0xbfe583e8), at 0x9240fe
[6] _dl_map_object_deps(0x0, 0x80000000, 0x90000102, 0x0, 0xa5cff4, 0x0), at 0x92303a
[7] dl_open_worker(0xbfe585b0, 0x0, 0x0, 0x0, 0xb7f3d6bc, 0xa31fe0), at 0xa32159
[8] _dl_catch_error(0xbfe585b0, 0x90, 0xffffffff, 0x0, 0x970f98, 0xbfe58b40), at 0x9240fe
[9] __GI__dl_open(0xfffffffe, 0x0, 0x0, 0x92dfd4, 0xbfe58750, 0x92dca0), at 0xa32cb8
[10] dlopen_doit(0xbfe58750, 0x0, 0x0, 0xbfe58c44, 0xb7f3d6bc, 0xa88c70), at 0xa88cb8
[11] _dl_catch_error(0xbfe58750, 0xa88f50, 0x48, 0xe3736c, 0x905b0a8, 0x90c973a), at 0x9240fe
[12] _dlerror_run(0x0, 0x0, 0x31, 0x0, 0x90c9740, 0x102), at 0xa892bb
[13] __dlopen_check(0x90c9740, 0x102, 0x90c9739, 0xe23650, 0x905b0a8, 0x1), at 0xa88d11
=>[14] apr_dso_load(res_handle = <value not available>, path = <value not available>, pool = <value not available>), line 139 in "dso.c"
[15] load_module(cmd = <value not available>, dummy = <value not available>, modname = <value not available>, filename = <value not available>), line 238 in "mod_so.c"
[16] invoke_cmd(cmd = (nil), parms = (nil), mconfig = <value not available>, args = <value not available>), line 890 in "config.c"
[17] ap_build_config_sub(p = <value not available>, temp_pool = <value not available>, l = <value not available>, parms = <value not available>, current = <value not available>, curr_parent = <value not available>, conftree = <value not available>), line 1436 in "config.c"
[18] ap_build_config(parms = (nil), p = <value not available>, temp_pool = <value not available>, conftree = (nil)), line 1219 in "config.c"
[19] process_resource_config_nofnmatch(s = <value not available>, fname = (nil), conftree = <value not available>, p = <value not available>, ptemp = <value not available>, depth = <value not available>), line 1629 in "config.c"
[20] ap_process_resource_config(s = <value not available>, fname = <value not available>, conftree = <value not available>, p = <value not available>, ptemp = <value not available>), line 1661 in "config.c"
[21] include_config(cmd = (nil), dummy = <value not available>, name = (nil)), line 2584 in "core.c"
[22] invoke_cmd(cmd = (nil), parms = (nil), mconfig = <value not available>, args = <value not available>), line 890 in "config.c"
[23] ap_build_config_sub(p = <value not available>, temp_pool = <value not available>, l = <value not available>, parms = <value not available>, current = <value not available>, curr_parent = <value not available>, conftree = <value not available>), line 1436 in "config.c"
[24] ap_build_config(parms = (nil), p = <value not available>, temp_pool = <value not available>, conftree = (nil)), line 1219 in "config.c"
[25] process_resource_config_nofnmatch(s = <value not available>, fname = (nil), conftree = <value not available>, p = <value not available>, ptemp = <value not available>, depth = <value not available>), line 1629 in "config.c"
[26] ap_process_resource_config(s = <value not available>, fname = <value not available>, conftree = <value not available>, p = <value not available>, ptemp = <value not available>), line 1661 in "config.c"
[27] ap_read_config(process = <value not available>, ptemp = <value not available>, filename = <value not available>, conftree = <value not available>), line 2021 in "config.c"
[28] main(argc = -1075478912, argv = 0x90b9528), line 694 in "main.c"

There is hardly any resemblance in the stacktrace output produced by the two debug tools.  I posted this question to one of our internal aliases and I received a response from a colleague of mine in Singapore saying,  I was running backtrace against the wrong thread in gdb !!.   

When I ran the command "bt"  in gdb, I did not choose any particular thread,  all I did was open the core file in gdb and run the command "bt".  I always thought that by default when you open a core file in a debugger and run the command to print the stacktrace ( or backtrace),  it prints the trace of the thread that resulted this crash.   I now know that might always be the case. Finally, I managed to print the stacktrace of the thread that resulted in this crash in gdb.

(gdb) thread apply all bt
Thread 2 (process 7160):
#0 0x00929b7d in mmap () from /lib/ld-linux.so.2
#1 0x0091d809 in _dl_map_object_from_fd () from /lib/ld-linux.so.2
#2 0x0091f0bc in _dl_map_object () from /lib/ld-linux.so.2
#3 0x00922d98 in openaux () from /lib/ld-linux.so.2
#4 0x009240fe in _dl_catch_error () from /lib/ld-linux.so.2
#5 0x0092303a in _dl_map_object_deps () from /lib/ld-linux.so.2
#6 0x00a32159 in dl_open_worker () from /lib/tls/libc.so.6
#7 0x009240fe in _dl_catch_error () from /lib/ld-linux.so.2
#8 0x00a32cb8 in _dl_open () from /lib/tls/libc.so.6
#9 0x00a88cb8 in dlopen_doit () from /lib/libdl.so.2
#10 0x009240fe in _dl_catch_error () from /lib/ld-linux.so.2
#11 0x00a892bb in _dlerror_run () from /lib/libdl.so.2
#12 0x00a88d11 in dlopen@@GLIBC_2.1 () from /lib/libdl.so.2
#13 0x00e25222 in apr_dso_load (res_handle=0xbfe587ec,path=0x90c9740 "/root/web_agents/apache22_agent/lib/libamapc22.so",pool=0x905b0a8) at dso/unix/dso.c:139
#14 0x0809631e in load_module (cmd=0xbfe58bd0, dummy=0xbfe58a70,modname=0x90c96f8 "dsame_module", filename=0x90c9708 "/root/web_agents/apache22_agent/lib/libamapc22.so") at mod_so.c:238
#15 0x0807280f in invoke_cmd (cmd=0x80adde0, parms=0xbfe58bd0,mconfig=0xbfe58a70,args=0x90b6248 "") at config.c:890
#16 0x08072d3b in ap_build_config_sub (p=0x905b0a8,temp_pool=<value optimized out>, l=0x90b61f8 "LoadModule dsame_module /root/web_agents/apache22_agent/lib/libamapc22.so", parms=0xbfe58bd0, current=0xbfe58ab4, curr_parent=0xbfe58ab8, conftree=0xbfe58d74) at config.c:1436
#17 0x08073163 in ap_build_config (parms=0xbfe58bd0, p=0x905b0a8, temp_pool=0x90610c0, conftree=0xbfe58d74) at config.c:1219
#18 0x08073678 in process_resource_config_nofnmatch (s=0x905fc80,fname=0x90c8530 "/root/web_agents/apache22_agent/Agent_001/config/dsame.conf", conftree=0xbfe58d74, p=0x905b0a8, ptemp=0x90610c0, depth=0) at config.c:1629
#19 0x08073844 in ap_process_resource_config (s=0x905fc80,fname=0x90c8530 "/root/web_agents/apache22_agent/Agent_001/config/dsame.conf", conftree=0xbfe58d74, p=0x905b0a8, ptemp=0x90610c0) at config.c:1661
#20 0x0806bba1 in include_config (cmd=0xbfe59050, dummy=0xbfe58ef0,name=0x90c8468 "/root/web_agents/apache22_agent/Agent_001/config/dsame.conf") at core.c:2584
#21 0x0807280f in invoke_cmd (cmd=0x80a3ab8, parms=0xbfe59050,mconfig=0xbfe58ef0, args=0x90a2203 "") at config.c:890
#22 0x08072d3b in ap_build_config_sub (p=0x905b0a8,temp_pool=<value optimized out>,l=0x90a21c0 "include /root/web_agents/apache22_agent/Agent_001/config/dsame.conf", parms=0xbfe59050, current=0xbfe58f34, curr_parent=0xbfe58f38,conftree=0x80b43a8) at config.c:1436
#23 0x08073163 in ap_build_config (parms=0xbfe59050, p=0x905b0a8, temp_pool=0x90610c0, conftree=0x80b43a8) at config.c:1
#24 0x08073678 in process_resource_config_nofnmatch (s=0x905fc80,fname=0x90ab290 "/opt/apache226/conf/httpd.conf", conftree=0x80b43a8,p=0x905b0a8, ptemp=0x90610c0, depth=0) at config.c:1629
#25 0x08073844 in ap_process_resource_config (s=0x905fc80,fname=0x90ab290 "/opt/apache226/conf/httpd.conf", conftree=0x80b43a8,p=0x905b0a8, ptemp=0x90610c0) at config.c:1661
#26 0x0807418f in ap_read_config (process=0x9059120, ptemp=0x90610c0 filename=0x80a01e5 "conf/httpd.conf", conftree=0x80b43a8) at config.c:2021
#27 0x08061ee9 in main (argc=3, argv=0xbfe59354) at main.c:694

The backtrace against the correct thread in gdb made much more sense and gdb could resolve the symbols correctly too.  From gdb's backtrace it was clear that the crash was immediately after loading the dsame module, libamapc22.so which is our Policy Agent.

I filed a bug  6637032  against Access Manager Policy Agent and engineering has provided a fix to resolve this.

Monday Nov 26, 2007

program's behaviour changing when attached to debugger or when using monitoring tools

Last week I managed to replicate a problem reported by one of our customers in Japan.  It was about Policy agent 2.2 for Apache 2.2 crashing on 32 bit  Red Hat Enterprise Linux ES release 4 (Nahant Update 4).

I downloaded the apache code from http://httpd.apache.org/download.cgi and built it on the system that I was trying to replicate this problem on.

I downloaded the policy agent from http://www.sun.com/download/products.xml?id=471909fc
and before I installed the agent, I verified that I could start and stop the apache server correctly.

I followed the install steps outlined the install document and installation was successful. http://docs.sun.com/app/docs/doc/820-3288.

I now  tried to start the apache server after installing the policy agent and  it crashed with segmentation fault.  Hiya,  I could reproduce the problem !

#/opt/apache226/bin/apachectl -k start
/opt/apache226/bin/apachectl: line 78: 18504 Segmentation fault     
(core dumped) $HTTPD $ARGV

I opened the corefile in gdb and the backtrace looked as follows:

(gdb) bt
#0  0x003a9194 in spin ()
from /web_agents/apache22_agent/lib/libamapc22.so
#1  0x024dedc1 in PR_Select () from /usr/lib/libnspr4.so
#2  0x00b96371 in start_thread () from /lib/tls/libpthread.so.0
#3  0x009fdffe in clone () from /lib/tls/libc.so.6

Now that I could replicate the problem consistently on my lab system, I thought, I'll do some live debugging with gdb for that extra bit of fun.

I ran the program in gdb setting breakpoint in spin() and to my surprise, the process did not crash.  The program exited normally. I thought, this might be due to the breakpoint that I set.  I re-ran the program without the breakpoint and the program ran normally again in gdb .

(gdb) file /opt/apache22/bin/httpd
(gdb) set args -k start
(gdb) start
Breakpoint 1 at 0x80618bd: file main.c, line 438.
Starting program: /opt/apache226/bin/httpd -k start
[Thread debugging using libthread_db enabled]
[New Thread 0xb7ee06c0 (LWP 28225)]
[Switching to Thread 0xb7ee06c0 (LWP 28225)]
main (argc=3, argv=0xbfeb32a4) at main.c:438
438     {
(gdb) break spin
Function "spin" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 2 (spin) pending.
(gdb) continue
Continuing.
Breakpoint 3 at 0x3a9326
Pending breakpoint "spin" resolved
[New Thread 0x19a5bb0 (LWP 28229)]
[Switching to Thread 0x19a5bb0 (LWP 28229)]
Breakpoint 3, 0x003a9326 in spin ()
   from /web_agents/apache22_agent/lib/libamapc22.so
(gdb) continue
Continuing.
warning: Temporarily disabling breakpoints for unloaded shared library "/hema/opensso/web_agents/apache22_agent/lib/libamapc22.so"
Breakpoint 3 at 0x4b6326
Program exited normally.

 I then ran strace and ltrace to get more information and to my surprise, apache didn't crash if I ran with those monitoring tools either.

I suspect this is due to some sort of timing issue.  How do I debug a program if  it's behavior changes when attached to a debugger or when run with the monitoring tools !!

I gave up on live debugging and finally had to go back to core dump analysis of the core file in gdb.

About

Hema

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today