By User9166071-Oracle on Jan 16, 2009
So, to help understand where in our life cycle we're stuck, we'll need to collect some data. This might also help us define a better problem statement, which is vitally important in our rational troubleshooting process. See my previous post about the importance of defining a good problem statement.
Enabling the trigger environment
A full crash dump is much preferred over a live core, because things change on the fly while a live dump is being taken, and that can corrupt some of the structures in the dump and give us strange results. In this example I'm going to assume that we don't really want to take down the entire box, as that would affect every service on it and not just Oracle. Obviously, try the live method first and if that doesn't yield results then try a full dump. So, here are the steps for enabling a triggered live savecore on an "aiowait timed out" error message in an Oracle alert log.
1/ You'll need to set up a dedicated dump device to collect a live savecore. You'll need either a spare raw partition or a large file created with mkfile; see the dumpadm(1M) man page for details on how to do this.
2/ Download livecore_aio (or livecore_aio.x86) and the guds script.
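As a rough sketch of the file-backed option in step 1: the path /export/dumpfile and the 4g size below are placeholders of mine rather than values from this procedure, so size the file for your own box and check dumpadm(1M) before running anything. The run wrapper is also just an illustration; it prints each command and only executes it when RUN=1 is set, so you can review what would happen first.

```shell
#!/bin/sh
# Sketch only: DUMPFILE and SIZE are hypothetical placeholders, adjust for your system.
DUMPFILE=${DUMPFILE:-/export/dumpfile}
SIZE=${SIZE:-4g}

# Print each command; execute it only when RUN=1 (needs root on Solaris).
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

setup_dump_device() {
    run mkfile "$SIZE" "$DUMPFILE"   # create the backing file
    run dumpadm -d "$DUMPFILE"       # make it the dedicated dump device
    run dumpadm                      # show the resulting dump configuration
}

setup_dump_device
```

Once the printed commands look right, rerun it as root with RUN=1. If you have a spare raw partition instead, pass its device path as DUMPFILE and skip the mkfile line.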
3/ Create collect_data.sh:
root@hippy-1 # more collect_data.sh
/var/tmp/guds_2.8.4 -q -c30 -i1 -n5 -w0 -s<case ref> -X2 -D /var/tmp &
echo "aiowait coredump event - please send guds output and crash dump to Sun" | mailx -s "AIOWAIT TIMED OUT EVENT" root
In my case I'm going to fire off collecting guds to gather some performance stats on the box, but you could add anything else you want to run here including a "reboot -d" to take a full crash dump before rebooting the box.
4/ Change perms on binary and script before copying to /var/tmp
chmod +x livecore_aio collect_data.sh
cp collect_data.sh /var/tmp
note: livecore_aio expects collect_data.sh to be in /var/tmp for it to work correctly
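Since a misplaced or non-executable script is an easy way for the trigger to fail silently, a small preflight check can help. This is only a sketch: the preflight function name is mine, and it checks nothing beyond the file being present and executable in the given directory.

```shell
#!/bin/sh
# Sketch: preflight DIR FILE... reports whether each FILE is executable in DIR.
# Returns non-zero if any file is missing or not executable.
preflight() {
    dir=$1
    shift
    rc=0
    for f in "$@"; do
        if [ -x "$dir/$f" ]; then
            echo "ok: $dir/$f"
        else
            echo "missing or not executable: $dir/$f"
            rc=1
        fi
    done
    return $rc
}
```

For the setup above you would call it as: preflight /var/tmp collect_data.sh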
Test the program and script (as root):
# touch dummy_file
# ./livecore_aio dummy_file &
# echo "WARNING: aiowait timed out" >> dummy_file
This should produce a live savecore dump and kick off guds.
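The internals of livecore_aio aren't covered in this post, but the behaviour the test exercises (watch a file, fire an action when a new matching line is appended) can be sketched in plain shell. The function names count_matches and watch_log are mine, and this polling loop is an illustrative stand-in, not how the real binary is implemented.

```shell
#!/bin/sh
# Illustrative stand-in for the watch behaviour; not the real livecore_aio.

# Print how many lines of file $1 match pattern $2 (0 if none or unreadable).
count_matches() {
    n=`grep -c "$2" "$1" 2>/dev/null` || n=0
    echo "${n:-0}"
}

# Poll log $1 every $4 seconds (default 5). Run command string $3 once,
# the first time a new line matching $2 is appended, then return.
watch_log() {
    seen=`count_matches "$1" "$2"`   # ignore matches already in the log
    while :; do
        now=`count_matches "$1" "$2"`
        if [ "$now" -gt "$seen" ]; then
            sh -c "$3"
            return 0
        fi
        sleep "${4:-5}"
    done
}
```

To mimic the test above you could run watch_log dummy_file "aiowait timed out" /var/tmp/collect_data.sh & and then append the warning line to dummy_file. Counting matches rather than tailing the file means old entries in the log don't trigger the action.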
Execute the livecore_aio binary against the real alert log:
# ./livecore_aio <full path of alert_log> &
When the issue happens, upload the live core dump and guds data to Sun for analysis.