core files and Friday madness
By user12625760 on Nov 02, 2004
Last Friday I had one of those escalations from another time zone where all I had was a number of core files from an application and the question of why it had failed.
Now the problem with application core files, prior to Solaris 10, is that they don't contain everything. You have to look at them on a system with the same binary and shared libraries as the one that created them. This is fine in a development environment, as you can just log in to the system where the application was running and you will have the right files. However, once you are not in a development environment this quickly ceases to be true. If the files don't match then, depending on which debugger you use, you get bogus information, errors, or both:
dbx /usr/bin/sparcv9/ptree /var/tmp/core.ptree.9847.ep>
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.3' in your .dbxrc
Reading ptree
core file header read successfully
Reading ld.so.1
Reading libproc.so.1
Reading libc.so.1
Reading librtld_db.so.1
Reading libelf.so.1
Reading libdl.so.1
Reading libc_psr.so.1
WARNING!!
A loadobject was found with an unexpected checksum value.
See `help core mismatch' for details, and run `proc -map'
to see what checksum values were expected and found.
dbx: warning: Some symbolic information might be incorrect.
program terminated by signal SEGV (no mapping at the fault address)
0x0000000100001810: main+0x06cc: cmp %o0, 0
(dbx) >
Even the warning here is misleading, as it is not just the symbolic information that is incorrect. A SEGV from “cmp %o0, 0”? I don't think so: that instruction compares a register with a constant and never touches memory.
Luckily for me the customer had sent in an explorer from the system, which contained the output from showrev -p. Using this it was a simple matter to install a lab system that matched the customer's, and away I went. The reason it was simple is that we have some software to build a jumpstart profile from the output of showrev -p for exactly this kind of scenario.
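The in-house jumpstart tool is not shown here, but the first step of anything like it is pulling the patch IDs out of the showrev -p output. The following is only a minimal sketch of that step, assuming each patch is reported on a line of the form "Patch: <id> Obsoletes: ... Requires: ... Incompatibles: ... Packages: ...". The function names `patch_ids` and `compare` are hypothetical and are not part of any Sun tool.

```python
import re

# Assumption: every installed patch appears on its own line starting with
# "Patch: NNNNNN-NN". IDs later on the line (Obsoletes, Requires) are ignored.
PATCH_LINE = re.compile(r"^Patch:\s+(\d{6}-\d{2})\b")


def patch_ids(showrev_output):
    """Return the set of patch IDs named at the start of 'Patch:' lines."""
    ids = set()
    for line in showrev_output.splitlines():
        match = PATCH_LINE.match(line.strip())
        if match:
            ids.add(match.group(1))
    return ids


def compare(customer_output, lab_output):
    """Return (patches missing from the lab system, patches only on the lab system)."""
    customer = patch_ids(customer_output)
    lab = patch_ids(lab_output)
    return sorted(customer - lab), sorted(lab - customer)
```

Running `compare` on the customer's showrev -p output and the lab system's would show at a glance whether the lab system is a fair match before any core file is opened on it.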
However, the downside of this is the 1 hour 10 minute wait while the system installs and patches itself. In this time I got to thinking there must be a better (i.e. faster) way.
Asking the customer for the libraries was one possibility, but the customer was from a different time zone so that would induce a delay.
Since we have all the patches, I wondered if I could automount all the files from all the patches over a loopback mount of the root file system, similar to the chroot we use to create our build systems. It was the first attempt at this that led to last Friday's failure.
However, the principle of loopback mounting just the files that were in the patches might not be insane. The script that mounts them only has to run faster than the 1 hour 10 minutes of an install and this could be a winner.
So a short script and 1335 loopback mounts later, I have a chroot environment where all is well:
dbx /usr/bin/sparcv9/ptree /var/tmp/core.ptree.9847.eps>
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.3' in your .dbxrc
Reading ptree
core file header read successfully
Reading ld.so.1
Reading libproc.so.1
Reading libc.so.1
Reading librtld_db.so.1
Reading libelf.so.1
Reading libdl.so.1
Reading libc_psr.so.1
program terminated by signal SEGV (no mapping at the fault address)
0x0000000100001810: main+0x06cc: ld [%o0], %g2
The script to do all the mounts took 3m41.07s to run so that is a bit quicker than the full install and patch!
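The script itself is not reproduced here. As a rough illustration of the idea only, the sketch below walks a directory of unpacked patches and prints the loopback mount commands that would overlay each patched file onto a chroot. It assumes the patches are laid out as <patch-id>/<package>/reloc/<path>, that when two patches deliver the same file the one whose ID sorts last should win (a simplification), and that the chroot lives at a hypothetical /export/chroot. All function names are made up for the example, and nothing is actually mounted.

```python
import os


def plan_lofs_mounts(patch_root, chroot):
    """Map each target path inside the chroot to the patched file to mount over it."""
    plan = {}
    # Assumption: processing patch IDs in sorted order lets later ones override earlier ones.
    for patch_id in sorted(os.listdir(patch_root)):
        patch_dir = os.path.join(patch_root, patch_id)
        if not os.path.isdir(patch_dir):
            continue
        for package in sorted(os.listdir(patch_dir)):
            # Assumption: relocatable files live under <package>/reloc/.
            reloc = os.path.join(patch_dir, package, "reloc")
            if not os.path.isdir(reloc):
                continue
            for dirpath, _dirs, files in os.walk(reloc):
                for name in files:
                    source = os.path.join(dirpath, name)
                    relative = os.path.relpath(source, reloc)
                    plan[os.path.join(chroot, relative)] = source
    return plan


def mount_commands(plan):
    """One 'mount -F lofs <source> <target>' line per patched file."""
    return ["mount -F lofs %s %s" % (source, target)
            for target, source in sorted(plan.items())]


def umount_commands(plan):
    """Matching umount lines, in reverse order as a conservative choice."""
    return ["umount %s" % target for target in sorted(plan, reverse=True)]
```

Printing the commands rather than running them keeps the privileged step separate, and saving the same plan gives a ready-made list of what has to be unmounted afterwards.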
Now I will have to tidy this up for general consumption and make it so that it will run via RBAC (chroot will always need some sort of privilege). I've still not given up on getting the automounter to do the majority of the work, mainly as it will save me having to work out how to unmount all the loopbacks.
So if you are about to send a core file from an application to Sun to debug, remember to include the output of showrev -p from the system that created the core file, taken at the time the core was produced.