Hadoop Java Error Logs

I was having trouble isolating a problem with "reduce" tasks running on Hadoop slave servers. 

After poking around on the Hadoop slave, I found an interesting lead in the stdout of one of the failed reduce task attempts:

$ cat /tmp/hadoop-hadoop/mapred/local/userlogs/job_201302111641_0059/attempt_201302111641_0059_r_000001_1/stdout

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0xfe67cb31, pid=25828, tid=2
#
# JRE version: 6.0_35-b10
# Java VM: Java HotSpot(TM) Server VM (20.10-b01 mixed mode solaris-x86 )
# Problematic frame:
# C  [libc.so.1+0xbcb31]  pthread_mutex_trylock+0x29
#
# An error report file with more information is saved as:
# /tmp/hadoop-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201302111641_0059/attempt_201302111641_0059_r_000001_1/work/hs_err_pid25828.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
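Other task attempts that died the same way will have the same banner in their stdout. A quick way to sweep for them is to grep every attempt's stdout under the userlogs directory (a sketch; adjust the path to match your own mapred.local.dir or hadoop.log.dir):

$ find /tmp/hadoop-hadoop/mapred/local/userlogs -name stdout \
    -exec grep -l "A fatal error has been detected" {} \;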

The HotSpot crash log (hs_err_pid25828.log in my case) will be very interesting because it contains information captured at the time of the fatal error, including, where possible:

  • The operating exception or signal that provoked the fatal error
  • Version and configuration information
  • Details on the thread that provoked the fatal error and that thread's stack trace
  • The list of running threads and their state
  • Summary information about the heap
  • The list of native libraries loaded
  • Command line arguments
  • Environment variables
  • Details about the operating system and CPU

Great, but hs_err_pid25828.log had been cleaned up before I could get to it. In fact, I found that the hs_err_pid*.log files were available for less than a minute and were always gone before I could capture one.
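As a stopgap, a polling loop can copy any crash log to a safe place before the cleanup gets to it. A rough sketch (the paths follow the job cache location shown above; /var/tmp survives a reboot, unlike a tmpfs /tmp):

$ while true; do
    find /tmp/hadoop-hadoop/mapred/local -name 'hs_err_pid*.log' \
        -exec cp {} /var/tmp/ \; 2>/dev/null
    sleep 1
  done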

For a permanent fix, my first guess, which turned out to be incorrect, was:

 <property>
   <name>keep.failed.task.files</name>
   <value>true</value>
 </property>


My next approach was to add "-XX:ErrorFile=/tmp/hs_err_pid%p.log" to the Java command line for the reduce task.

When I tried adding the Java option to HADOOP_OPTS in /usr/local/hadoop/conf/hadoop-env.sh, I realized that this setting applies to the Hadoop daemons and client, not to the child JVMs that the TaskTracker spawns for map and reduce tasks.
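To see which options a child task JVM was actually launched with, you can inspect a live task process. On Solaris (where this cluster runs), pargs prints a process's argument vector; the pgrep pattern below is only illustrative:

$ pgrep -f attempt_                # find a running child task JVM
$ pargs <pid> | grep ErrorFile     # check the arguments it received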

Finally, I found that adding the Java option to the mapred.child.java.opts property in mapred-site.xml WORKED!!

$ cat /usr/local/hadoop/conf/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

     <property>
         <name>mapred.job.tracker</name>
         <value>p3231-name-node:8021</value>
     </property>

     <property>
         <name>mapred.child.java.opts</name>
         <value>-XX:ErrorFile=/tmp/hs_err_pid%p.log</value>
     </property>


</configuration>
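One caveat: mapred.child.java.opts is a single string of JVM arguments, and setting it replaces the default value (-Xmx200m). If your tasks also need a heap setting, put everything in one value; the -Xmx512m below is only an example:

     <property>
         <name>mapred.child.java.opts</name>
         <value>-Xmx512m -XX:ErrorFile=/tmp/hs_err_pid%p.log</value>
     </property>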


Now I can view the Java error logs on my Hadoop slaves:

$ ls -l /tmp/*err*
-rw-r--r--   1 hadoop   hadoop     15626 May 16 15:42 /tmp/hs_err_pid10028.log
-rw-r--r--   1 hadoop   hadoop     15795 May 16 15:43 /tmp/hs_err_pid10232.log

Comments:

I've never seen a JDK clean up an hs_err log file before. The hs_err file name has the pid embedded in it, and because pids move forward as processes are created, even if the next JVM dies the next hs_err file shouldn't have the same name. There is also a gotcha on some newer Linux distros: /tmp is really a tmpfs (in-RAM) file system, so rebooting the machine loses all the files in /tmp, including your hs_err logs.

Posted by guest on June 01, 2013 at 08:52 PM EDT #

Thanks, guest. It isn't the JDK that is cleaning up the hs_err_pid*.log files. It is the Hadoop TaskTracker process that is cleaning up its work directory.

Here is a DTrace script that can be used to determine which process is removing the error log files:

# dtrace -n 'syscall::unlinkat:entry/strstr(zonename,"node4")!=0 && strstr(copyinstr(arg1),"hs_err")!=0/{printf(" PID=%d\n CMD=%s\n FILE=%s\n\n", pid, curpsinfo->pr_psargs, copyinstr(arg1));}'
dtrace: description 'syscall::unlinkat:entry' matched 1 probe

CPU     ID                    FUNCTION:NAME
  2   8069                 unlinkat:entry  PID=18609
 CMD=/usr/java/bin/java -Dproc_tasktracker -Xmx1000m -Dhadoop.log.dir=/var/log/h
 FILE=/tmp/hadoop-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201306031359_0006/attempt_201306031359_0006_r_000000_0/work/hs_err_pid23199.log

This shows that removing (aka unlinking) the file is not intrinsic to the JDK, but an action that has been coded into the Hadoop framework.

Posted by Jeff on June 03, 2013 at 05:52 PM EDT #
