Wednesday Jan 17, 2007

Symbols for dbx when you have not compiled -g

Sometimes symbolic debuggers are just easier and quicker to use than non-symbolic debuggers. So when a user application dumps core or, as in this case, a gcore was provided because the customer said the application was behaving strangely, it would be nice not to have to work out the offsets of each structure by hand and then use dbx's examine command to dump them out.

Specifically, I wanted to take a peek at the ulwp_t structure that each thread has. More specifically, I wanted to look at the list of zombie threads. However, the specifics are not really the point of this post, since most of what I wanted is printed by the “thread -info” dbx command.

Since neither the application nor the libraries are compiled with the -g flag, this could mean a load of hurt with examine. However, if you have the source, and for OpenSolaris we all do, there is another way. (My example is from SunOS 5.8, but apart from the definition being in a different header file, the example will work on any release.)

The trick is to build a shared library that contains the definitions that you are interested in and then use the “loadobject -load” command to load it. If you then set the scope to be that file you can print the objects symbolically.

: s4u-60b-gmp03.eu TS 145 $; cat tat.c
#include <libthr.h>

ulwp_t *foo = 0;

: s4u-60b-gmp03.eu TS 146 $; cc -g -I /share/bld/ONclones/on28-patch-eu/usr/src/uts/sparc -I /share/bld/ONclones/on28-patch-eu/usr/src/lib/liblwp/common -o tat.so -G tat.c
: s4u-60b-gmp03.eu TS 147 $; dbx binary core
Reading binary
core file header read successfully
Reading ld.so.1
Reading libpthread.so.1
.
.
t@151 (l@151) terminated by signal 0 (UNKNOWN SIGNAL)
0xfeee4d7c: __lwp_park+0x0010:  ta       8
(dbx) loadobject -load tat.so
Reading tat.so
Loaded loadobject: /share/eu/bld/scratch/cg13442/tmp/core/tat.so
(dbx) file tat.c            
(dbx) print *((ulwp_t *)all_zombies)
*((ulwp_t *) all_zombies) = {
    ul_self             = 0xfed23600
    ul_tls              = {
        tls_data = (nil)
        tls_size = 0
    }
    ul_forw             = 0xfe430000
    ul_back             = 0xfbfbc400
    ul_next             = (nil)
    ul_hash             = (nil)
    ul_rval             = (nil)
    ul_stk              = 0xfb000000 "<bad address 0xfb000000>"
    ul_mapsiz           = 0
    ul_guardsize        = 0
.
.
.
.
    ul_td_events_enable = '\0'
    ul_sync_obj_reg     = '\0'
    ul_qtype            = '\0'
    ul_handoff          = '\0'
    ul_usropts          = 128
    ul_startpc          = 0xfbccd568 = &worker_thread()
    ul_startarg         = 0x16e6bb8
    ul_wchan            = (nil)
.
.
.
.
    ul_savedregs        = {
        rs_pc     = 0
        rs_sp     = 0
        rs_o7     = 0
        rs_g1     = 0
        rs_g2     = 0
        rs_g3     = 0
        rs_g4     = 0
        rs_fsr    = 0
        rs_fpu_en = 0
    }
}
(dbx) 
(dbx) print ((ulwp_t *)all_zombies)->ul_startpc
((ulwp_t *) all_zombies)->ul_startpc = 0xfbccd568 = &worker_thread()
(dbx) print ((ulwp_t *)all_zombies)->ul_forw->ul_startpc
((ulwp_t *) all_zombies)->ul_forw->ul_startpc = 0xfbccd568 = &worker_thread()
(dbx) print ((ulwp_t *)all_zombies)->ul_forw->ul_forw->ul_startpc
((ulwp_t *) all_zombies)->ul_forw->ul_forw->ul_startpc = 0xfbccd568 = &worker_thread()
(dbx) 

Clearly you can use this technique with any data structure. I always feel there should be a better way, so I look forward to having it explained to me in the comments to this post.
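For a different structure, the same steps could be wrapped in a small shell function. This is only a sketch: make_shim, shim.c, shim.so and shim_anchor are names made up for illustration, and the function prints the cc command rather than running it. It uses the same -g, -G, -I and -o flags as the tat.c example. The global pointer is there, as in tat.c, so that the type is actually used in the file, which is presumably what gets its debug information emitted.

```sh
# make_shim HEADER TYPE [INCLUDE_DIR...]
# Hypothetical helper: writes shim.c in the current directory and prints
# (does not run) a cc command line modelled on the tat.c example.
make_shim() {
    header=$1
    type=$2
    shift 2

    # One include and one global pointer of the wanted type, as in tat.c.
    printf '#include <%s>\n\n%s *shim_anchor = 0;\n' "$header" "$type" > shim.c

    # Add one -I option per include directory given on the command line.
    cmd="cc -g"
    for dir in "$@"; do
        cmd="$cmd -I $dir"
    done
    echo "$cmd -o shim.so -G shim.c"
}
```

Calling make_shim libthr.h ulwp_t followed by the two source directories would recreate the tat.c case. The resulting shim.so is then loaded with “loadobject -load” and selected with “file shim.c”, exactly as before.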



Tuesday Nov 02, 2004

core files and Friday madness

Last Friday I had one of those escalations from another time zone where all I had was a number of core files from an application, and the question was why it failed.


Now the problem with application core files, prior to Solaris 10, is that they don't contain everything. You have to look at them on a system with the same binary and shared libraries as the one that created them. This is fine in a development environment, as you can just log in to the system where the application was running and you will have the right files. However, once you are outside a development environment this quickly ceases to be true. If the files don't match then, depending on which debugger you use, you get bogus information, errors, or both:


dbx /usr/bin/sparcv9/ptree /var/tmp/core.ptree.9847.ep>
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.3' in your .dbxrc
Reading ptree
core file header read successfully
Reading ld.so.1
Reading libproc.so.1
Reading libc.so.1
Reading librtld_db.so.1
Reading libelf.so.1
Reading libdl.so.1
Reading libc_psr.so.1
WARNING!!
A loadobject was found with an unexpected checksum value.
See `help core mismatch' for details, and run `proc -map'
to see what checksum values were expected and found.
dbx: warning: Some symbolic information might be incorrect.
program terminated by signal SEGV (no mapping at the fault address)
0x0000000100001810: main+0x06cc:        cmp      %o0, 0
(dbx) >

Even the warning here is misleading, as it is not just the symbolic information that is incorrect. A SEGV from “cmp %o0, 0”? I don't think so: a register compare does not touch memory.

Luckily for me, the customer had sent in an explorer from the system, which contained the output from showrev -p. Using this, it was a simple matter to install a lab system that matched the customer's, and away I went. The reason it was simple is that we have some software to build a jumpstart profile from the output of showrev -p for exactly this kind of scenario.

However, the downside of this is the 1 hour 10 minute wait while the system installs and patches itself. In this time I got thinking there must be a better (i.e. faster) way.

Asking the customer for the libraries was one possibility, but the customer was from a different time zone so that would induce a delay.

Since we have all the patches, I wondered if I could automount all the files from all the patches over a loopback mount of the root file system, similar to the chroot we use to create our build systems. It was the first attempt at this that led to last Friday's failure.

However, the principle of loopback mounting just the files that were in the patches might not be insane. The script that mounts them only has to run faster than the 1 hour 10 minutes of an install and this could be a winner.

So, one short script and 1335 loopback mounts later, I have a chroot environment where all is well:

dbx /usr/bin/sparcv9/ptree /var/tmp/core.ptree.9847.eps>
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.3' in your .dbxrc
Reading ptree
core file header read successfully
Reading ld.so.1
Reading libproc.so.1
Reading libc.so.1
Reading librtld_db.so.1
Reading libelf.so.1
Reading libdl.so.1
Reading libc_psr.so.1
program terminated by signal SEGV (no mapping at the fault address)
0x0000000100001810: main+0x06cc:        ld       [%o0], %g2

The script to do all the mounts took 3m41.07s to run so that is a bit quicker than the full install and patch!
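The script itself is not shown here. As a rough sketch of the idea only, assume the patched files have been unpacked under one directory, laid out relative to root, and that the chroot tree already contains a file at each matching path. Both of those layouts are assumptions, as is the function name emit_lofs_mounts. The function below just prints one “mount -F lofs” command per file, so the list can be inspected before anything is run with privilege. It assumes a POSIX shell.

```sh
# emit_lofs_mounts PATCH_ROOT CHROOT_DIR
# Hypothetical helper: for every regular file under PATCH_ROOT, print the
# lofs mount command that would overlay it on the same relative path
# under CHROOT_DIR. Nothing is mounted; the commands are only printed.
emit_lofs_mounts() {
    patch_root=$1
    chroot_dir=$2

    # List files relative to PATCH_ROOT, in a stable sorted order.
    ( cd "$patch_root" && find . -type f -print ) | sort |
    while read -r rel; do
        rel=${rel#./}    # strip the leading "./" that find adds
        echo "mount -F lofs $patch_root/$rel $chroot_dir/$rel"
    done
}
```

Something like emit_lofs_mounts /patches/unpacked /chroot > mounts.sh would give a file that can be reviewed and then run as root. The two directory names in that call are placeholders.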

Now I will have to tidy this up for general consumption and make it so that it will run via RBAC (chroot will always need some sort of privilege). I've still not given up on getting the automounter to do the majority of the work, mainly as it will save me having to work out how to unmount all the loopbacks.

So if you are about to send a core file from an application to Sun to debug, remember to include the output of showrev -p from the system that created the core file, taken when the core was produced.

About

This is the old blog of Chris Gerhard. It has mostly moved to http://chrisgerhard.wordpress.com
