I was having trouble isolating a problem with "reduce" tasks running on Hadoop slave servers.
After poking around on the Hadoop slave, I found an interesting lead in
$ cat /tmp/hadoop-hadoop/mapred/local/userlogs/job_201302111641_0059/attempt_201302111641_0059_r_000001_1/stdout
# A fatal error has been detected by the Java Runtime Environment:
# SIGSEGV (0xb) at pc=0xfe67cb31, pid=25828, tid=2
# JRE version: 6.0_35-b10
# Java VM: Java HotSpot(TM) Server VM (20.10-b01 mixed mode solaris-x86 )
# Problematic frame:
# C [libc.so.1+0xbcb31] pthread_mutex_trylock+0x29
# An error report file with more information is saved as:
# If you would like to submit a bug report, please visit:
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
The HotSpot crash log (hs_err_pid25828.log in my case) will be very interesting because it contains information obtained at the time of the fatal error, including the following information, where possible:
Great, but hs_err_pid25654.log had been cleaned up before I could get to
it. In fact, I found that the hs_err_pid.log files were available for
less than a minute and they were always gone before I could capture one.
To try to retain the Java error log file, my first incorrect guess was:
My next approach was to add "-XX:ErrorFile=/tmp/hs_err_pid%p.log" to the Java command line for the reduce task.
When I tried adding the Java option to HADOOP_OPTS in
/usr/local/hadoop/conf/hadoop-env.sh, I realized that this setting isn't
applied to the Map and Reduce Task JVMs.
Finally, I found that adding the Java option to the mapred.child.java.opts property in mapred-site.xml WORKED!!
$ cat /usr/local/hadoop/conf/mapred-site.xml
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
Now I can view the Java error logs on my Hadoop slaves:
$ ls -l /tmp/*err*
-rw-r--r-- 1 hadoop hadoop 15626 May 16 15:42 /tmp/hs_err_pid10028.log
-rw-r--r-- 1 hadoop hadoop 15795 May 16 15:43 /tmp/hs_err_pid10232.log