Tuesday Nov 02, 2004

core files and Friday madness

Last Friday I had one of those escalations from another time zone where all I had was a number of core files from an application and the question of why it had failed.


Now the problem with application core files, prior to Solaris 10, is that they don't contain everything: you have to look at them on a system with the same binary and shared libraries as the one that created them. This is fine in a development environment, as you can just log in to the system where the application was running and you will have the right files. However, once you are out of a development environment this quickly ceases to be true. If the files don't match then, depending on which debugger you use, you get bogus information, errors, or both:


dbx /usr/bin/sparcv9/ptree /var/tmp/core.ptree.9847.ep
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.3' in your .dbxrc
Reading ptree
core file header read successfully
Reading ld.so.1
Reading libproc.so.1
Reading libc.so.1
Reading librtld_db.so.1
Reading libelf.so.1
Reading libdl.so.1
Reading libc_psr.so.1
WARNING!!
A loadobject was found with an unexpected checksum value.
See `help core mismatch' for details, and run `proc -map'
to see what checksum values were expected and found.
dbx: warning: Some symbolic information might be incorrect.
program terminated by signal SEGV (no mapping at the fault address)
0x0000000100001810: main+0x06cc:        cmp      %o0, 0
(dbx) >

Even the warning here is misleading, as it is not just the symbolic information that is incorrect. A SEGV from “cmp %o0, 0”? I don't think so.

Luckily for me the customer had sent in an explorer from the system, which contained the output of showrev -p. Using this it was a simple matter to install a lab system that matched the customer's, and away I went. The reason it was simple is that we have some software to build a JumpStart profile from the output of showrev -p for exactly this kind of scenario.
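The raw material for that is just the patch list, and pulling it out of the showrev -p output is a one-liner (showrev-p.out here is simply a stand-in name for the file from the explorer):

    # extract the patch IDs (e.g. 108528-29) from the explorer's showrev -p output
    nawk '$1 == "Patch:" { print $2 }' showrev-p.out | sort -u > patchlist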

However the downside of this is the 1 hour 10 minute wait while the system installs and patches itself. In that time I got thinking that there must be a better (i.e. faster) way.

Asking the customer for the libraries was one possibility, but the customer was in a different time zone so that would have introduced a delay.

Since we have all the patches, I wondered if I could automount all the files from all the patches over a loopback mount of the root file system, similar to the chroot we use to create our build systems. It was the first attempt at this that led to last Friday's failure.

However the principle of loopback mounting just the files that were in the patches might not be insane. The script that mounts them only has to run faster than the 1 hour 10 minutes of an install, and this could be a winner.
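In rough outline the idea is something like the sketch below. The paths are made up, the assumption that everything under each package's reloc/ directory maps straight onto paths under / holds for most core Solaris packages but is still an assumption, and the real script has to deal with more corner cases:

    #!/bin/ksh
    # Sketch only: lofs mount every file delivered by the patches over its
    # counterpart in a chroot area built from a matching base root image.
    ROOT=/var/tmp/chroot        # lofs copy of a matching root file system
    PATCHDIR=/patches           # unpacked patches, one directory per patch ID

    for patch in $(cat patchlist); do          # patchlist as extracted above
        for pkgdir in $PATCHDIR/$patch/*/reloc; do
            [ -d "$pkgdir" ] || continue
            # lofs happily mounts a file on top of a file
            find "$pkgdir" -type f -print | while read f; do
                dest="$ROOT/${f#$pkgdir/}"
                [ -f "$dest" ] || continue     # skip files the image lacks
                mount -F lofs -o ro "$f" "$dest"
            done
        done
    done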

So, a short script and 1335 loopback mounts on the system later, I have a chroot environment where all is well:

dbx /usr/bin/sparcv9/ptree /var/tmp/core.ptree.9847.eps
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.3' in your .dbxrc
Reading ptree
core file header read successfully
Reading ld.so.1
Reading libproc.so.1
Reading libc.so.1
Reading librtld_db.so.1
Reading libelf.so.1
Reading libdl.so.1
Reading libc_psr.so.1
program terminated by signal SEGV (no mapping at the fault address)
0x0000000100001810: main+0x06cc:        ld       [%o0], %g2

The script to do all the mounts took 3m41.07s to run, so that is a bit quicker than the full install and patch!

Now I will have to tidy this up for general consumption and make it run via RBAC (chroot will always need some sort of privilege). I've still not given up on getting the automounter to do the majority of the work, mainly as it will save me having to work out how to unmount all the loopbacks.
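In the meantime the teardown can be sketched by reading /etc/mnttab and unmounting the deepest mount points first (again assuming the hypothetical $ROOT from the sketch above):

    # unmount every lofs mount under $ROOT, deepest mount points first
    ROOT=/var/tmp/chroot
    nawk -v root="$ROOT" '$3 == "lofs" && index($2, root) == 1 { print $2 }' \
        /etc/mnttab | sort -r | while read mp; do
        umount "$mp"
    done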

So if you are about to send a core file from an application to Sun to debug, remember to include the output of showrev -p from the system that created the core file, taken at the time the core was produced.

Monday Sep 27, 2004

Solaris automounting to the Nth....

My first introduction to the automounter was when it arrived as part of NSE in the days of SunOS 3.5. It was always kind of neat, but with the advent of autofs it was ready for some real abuse.

From contacts with the development team and comments in bug reports it is now clear that it was never intended that you could do some of the things that you can do with the automounter, and parts of what follows certainly fall into that category.

Sustaining Solaris

One of the problems with sustaining Solaris is that you have to have a build system for each architecture and release you are sustaining. Needless to say, this means either installing a build system each time, having some form of multi-boot, or having lots of systems.

We used to opt for lots of systems, but that meant they were relatively small and so builds would take a long time. Plus they each sat idle for long periods of time.

Reinventing the wheel

To solve this we built two systems, one SPARC and one x86, onto which we restored all the root images of our build servers into subdirectories, such that we could chroot into those directories and build any release. Neat, and for the most part it worked very well. Only recently did I discover from a colleague that this was the same as one of the original uses for chroot.

Dr. Marshall Kirk McKusick, private communication: ``According to the SCCS logs, the chroot call was added by Bill Joy on March 18, 1982 approximately 1.5 years before 4.2BSD was released. That was well before we had ftp servers of any sort (ftp did not show up in the source tree until January 1983). My best guess as to its purpose was to allow Bill to chroot into the /4.2BSD build directory and build a system using only the files, include files, etc contained in that tree. That was the only use of chroot that I remember from the early days.''

This all relies on the interface between the kernel and libc not changing for the system calls used in the build, which fortunately is the case from SunOS 5.6 to 5.9. For 5.10 we have had to build new systems, but hey, we still only have 4 build systems instead of 10. This allows those systems to be really powerful, and hence our builds take less time.

Automounting everything

Once all this was done we realised that it would be nice to have our home directories and other mount points available under the chroots, so we loopback mounted the /home autofs mount point from the real root. Then there were other mount points, so we started, but did not get far with, building an /etc/vfstab file that would do this. The revelation was using the automounter to mount the chroot areas themselves: we use it to mount all the files and directories from the real root file system to get things working.

Let's look at the automounter map entry for 5.9:

on81
    / -suid,ro cpr-bld.uk:/export/d10/roots/&/$CPU
    /devices            -fstype=lofs,rw /devices
    /dev                -fstype=lofs,rw /dev
    /etc/passwd         -fstype=lofs,ro /etc/passwd
    /etc/shadow         -fstype=lofs,ro /etc/shadow
    /share              -fstype=lofs    /share
    /home               -fstype=lofs    /home
    /vol                -fstype=lofs    /vol
    /usr/dist           -fstype=lofs    /usr/dist
    /var/tmp            -fstype=lofs    /var/tmp
    /var/nis            -fstype=lofs,ro /var/nis
    /var/run            -fstype=lofs    /var/run
    /var/adm/utmpx      -fstype=lofs,ro /var/adm/utmpx
    /var/spool/mqueue   -fstype=lofs    /var/spool/mqueue
    /tmp                -fstype=lofs    /tmp
    /export             -fstype=lofs    /export
    /local              -fstype=lofs,ro /local
    /local/root         -fstype=lofs,ro /
    /opt/cprbld         -ro             cpr-bld.uk:/export/d10/roots/cprhome/$CPU
    /ws                 -fstype=lofs    /ws
    /net                -fstype=lofs    /net
    /usr/local          -fstype=lofs    /usr/local
    /proc               -fstype=lofs    /proc
    /opt/SUNWspro       -fstype=lofs,ro /share/on81-patch-tools/SUNWspro/SC6.1
    /opt/teamware       -fstype=lofs,ro /share/on81-patch-tools/teamware
    /opt/onbld          -fstype=lofs,ro /share/on81-patch-tools/onbld
    /etc/mnttab         -fstype=lofs,ro /etc/mnttab
    /var/spool/clientmqueue -fstype=lofs,rw /var/spool/clientmqueue
    /share/SUNWspro_latest -fstype=lofs /share/eu/lang/solaris/$CPU/links_perOS/latest_5.9/SUNWspro
    /share/SUNWspro_prefcs -fstype=lofs /share/eu/lang/solaris/$CPU/links_perOS/prefcs_5.9/SUNWspro

Now you can see we are automounting all sorts of things that you might not expect, in particular /etc/passwd and /etc/shadow, so that we get the same password entries as the host system. In our world /home and /share are automount points, but since the automounter runs in the real root, automount maps that contain $OSROOT to select a particular OS-specific mount point get the wrong entry when in the chroot. Hence the two SUNWspro_latest and SUNWspro_prefcs entries at the end of the map.

The one thing that does not work is /etc/mnttab: unlike the mnttab used in zones, it has no knowledge of the chroot and so gives bogus information.

Does it work? Yes, well enough for our old build systems to be consigned back to the lab for general use, and for us to be allowed to have some fast ones with lots of CPUs as our real build systems.
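For illustration, and assuming the map above is an indirect map mounted at /roots (a stand-in name; the real mount point isn't shown here), kicking off work on 5.9 is essentially:

    cd /roots/on81                     # the automounter mounts the 5.9 root image
                                       # and the lofs entries from the map on demand
    chroot /roots/on81 /usr/bin/ksh    # inside, /usr, /opt/SUNWspro, /opt/onbld
                                       # and friends are all the 5.9 versions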

For those Sun Employees who wish to know more see http://pod6.uk

About

This is the old blog of Chris Gerhard. It has mostly moved to http://chrisgerhard.wordpress.com
