Beta Testing the Sun Grid Engine Hadoop Integration
By templedf on Nov 30, 2009
In case you haven't heard yet, the upcoming release of Sun Grid Engine will include an integration with Apache Hadoop that will allow Map/Reduce jobs to be submitted to a Sun Grid Engine cluster while minding HDFS data locality. The 6.2u5 release will be out by the end of the year, but it's currently in the beta testing phase. And that's where you come in.
I'm looking for some volunteers to test the integration. To that end, this blog post will provide instructions for how to get the beta code checked out and built. The Hadoop integration is actually only loosely dependent on the Sun Grid Engine software itself. While it's planned to be part of u5, the integration should be usable with a cluster as old as 6.2u2, although I would really recommend at least 6.2u4.
In a nutshell, the integration consists of two components. The first is the hadoop parallel environment that allows Map/Reduce jobs to be started as parallel jobs in a Sun Grid Engine cluster. The second is the integration with HDFS, called Herd, that makes the Sun Grid Engine scheduler aware of the locations of the HDFS data blocks. Herd has two parts. One part is a load sensor that runs on every execution machine and reports the HDFS blocks on that machine. The other part is a JSV that translates HDFS data paths included in the job submission into a list of HDFS blocks needed by the job.
How to check out the source code
- Make sure you have a functional CVS client.
- cvs -d :pserver:email@example.com:/cvs login
- cvs -d :pserver:firstname.lastname@example.org:/cvs checkout gridengine/source
Technically, the above will only check out the source directory, but for the Hadoop integration, that's all you need. The Hadoop integration lives in three places. First, the scripts live in source/dist/hadoop. Second, the Herd code lives at source/libs/herd. Third, the JSV Java language binding upon which the Herd code depends lives at source/libs/jjsv.
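For orientation, here is where those pieces sit after the checkout (the scripts listed for dist/hadoop are the ones referenced later in this post):

    gridengine/source/dist/hadoop    # integration scripts: setup.pl, jsv.sh, env.sh, loadsensor.sh
    gridengine/source/libs/herd      # Herd load sensor and JSV code
    gridengine/source/libs/jjsv      # JSV Java language binding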
How to build the source code
- Make sure you're using at least Ant 1.6.3 and the Java Standard Edition 6 platform.
- Copy the source/build.properties file to build_private.properties.
- Edit the build_private.properties file to include the correct paths for the Java Standard Edition 6 platform and JUnit 3.8.
- Change to the gridengine/source directory.
- ant jjsv
- ant herd
After the above steps, you will find herd.jar at source/CLASSES/herd/herd.jar and JSV.jar at source/CLASSES/jjsv/JSV.jar.
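Putting the build steps together, a session looks roughly like this (the properties to set in build_private.properties are described by the comments in that file):

    cd gridengine/source
    cp build.properties build_private.properties
    # edit build_private.properties: point it at your Java SE 6 and JUnit 3.8 installs
    ant jjsv
    ant herd
    ls CLASSES/jjsv/JSV.jar CLASSES/herd/herd.jar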
How to install the integration
- Copy herd.jar and JSV.jar to the $SGE_ROOT/lib directory.
- Copy the source/dist/hadoop directory to somewhere accessible by all the execution nodes.
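For example, with the jars from the build above and a shared directory /opt/sge-hadoop that every execution node can see (the shared path is just an example; anything visible cluster-wide will do):

    cp source/CLASSES/herd/herd.jar source/CLASSES/jjsv/JSV.jar $SGE_ROOT/lib/
    cp -r source/dist/hadoop /opt/sge-hadoop

That /opt/sge-hadoop directory is what the rest of this post refers to as <hadoop>.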
How to configure the integration
- Get HDFS up and running on your cluster. The most useful configuration will be to have every execution host be a data node, and to only have execution hosts as data nodes. Also, because of the way Hadoop does authentication and authorization, you'll need to make sure that either HDFS has security disabled or that root and the SGE admin user are in the HDFS super user group.
- Copy your Hadoop configuration directory to <hadoop>/conf, where <hadoop> is the directory that you copied in step 2 of How to install the integration.
- Delete the <hadoop>/conf/mapred-site.xml, <hadoop>/conf/masters, and <hadoop>/conf/slaves files.
- Edit the <hadoop>/env.sh file to contain the paths to the Java platform, the Hadoop install directory, and the Hadoop configuration directory you just created (<hadoop>/conf).
- Change into the <hadoop> directory.
- ./setup.pl -i
- Add the hadoop parallel environment to one or more of your queues.
The setup.pl script will install the hadoop parallel environment and the complexes needed by Herd. It will also start the Herd load sensor on all the execution hosts. At this point, you should be ready to go. Wait for a couple of minutes to give all of the execution hosts a chance to start running the load sensor and reporting values. You can run qhost -F hdfs_primary_rack to check that the load sensor is functioning correctly. Every execution host should report an hdfs_primary_rack value. If one or more machines have not reported a value within about five minutes, see the troubleshooting section below.
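To attach the hadoop parallel environment to a queue (the last step in the configuration list), qconf does the job; all.q here is just an example queue name:

    # add the hadoop PE to the queue's pe_list
    qconf -aattr queue pe_list hadoop all.q
    # confirm it took
    qconf -sq all.q | grep pe_list
    # after the load sensors have had a few minutes, every host should report a value
    qhost -F hdfs_primary_rack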
Using the integration
To submit a job that uses the hadoop parallel environment, use -pe hadoop <n>, where <n> is the number of nodes. The hadoop parallel environment uses an allocation rule that guarantees that no more than one task tracker per job will run on a single host. To tell the scheduler what data the job needs, request the hdfs_input resource with a value of the HDFS path to the job's data. The data path must be an absolute path.
Here's an example. Say I want to use the grep example to find occurrences of the word 'Sun' in a series of documents. First, I'd copy those documents into HDFS to /user/dant/sungrep (e.g. bin/hadoop fs -copyFromLocal ~/Documents/\* /user/dant/sungrep). I would then submit the job with:

    echo `pwd`/bin/hadoop --config \$TMPDIR/conf jar `pwd`/hadoop-0.20.1-examples.jar grep sungrep output Sun | qsub -pe hadoop 16 -l hdfs_input=/user/dant/sungrep -jsv <hadoop>/jsv.sh
Let's look at that in a little more detail. First, we're echoing the Hadoop command and piping it to qsub. Why? Well, when the integration runs, it creates a conf directory in the job's temp directory that is properly set up for the assigned hosts. Until the job runs, though, we don't know where the temp directory is. We get its path from the $TMPDIR variable once the job starts. We therefore need to wrap the Hadoop command in a script. We could either write a script that contains the command, or we could let qsub write one for us by piping the command to qsub's stdin. Note that we used --config \$TMPDIR/conf in the command. The backslash is important because it prevents the shell on the submission host from interpreting the $TMPDIR variable.
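If you'd rather not pipe the command through qsub's stdin, the equivalent wrapper script is tiny. A sketch, assuming you submit from the Hadoop 0.20.1 install directory as in the example above:

    #!/bin/sh
    # sungrep.sh -- a wrapper equivalent to the piped command above. Submit it
    # from the Hadoop install directory with:
    #   qsub -cwd -pe hadoop 16 -l hdfs_input=/user/dant/sungrep -jsv <hadoop>/jsv.sh sungrep.sh
    # (-cwd makes the job run here, so the relative paths below resolve)
    # $TMPDIR is set by SGE once the job starts, so it needs no escaping inside the script
    bin/hadoop --config $TMPDIR/conf jar hadoop-0.20.1-examples.jar grep sungrep output Sun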
Next, the qsub command uses -pe hadoop 16 to request 16 nodes. When this job is run, a job tracker will be started on the "master" host, and a task tracker will be started on each of the 16 assigned nodes. The master host is the host where the parallel job's master task is started. After the job tracker and task trackers are running, the grep job itself will be started, launched from the master host. The hadoop PE is a tight integration with an allocation rule of "1". In order to run a Hadoop job on top of SGE, you must use the PE, even if it's only a single-node job.
The qsub command also uses -l hdfs_input=/user/dant/sungrep -jsv <hadoop>/jsv.sh. The -l resource request tells SGE what data will be used by the job. It must be specified as an absolute path. The -jsv switch points at the JSV script that translates the hdfs_input resource request into requests for specific racks and blocks. Without the JSV, the job would never run because no node offers the hdfs_input resource. (No node offers it because it doesn't really exist. It's just a placeholder for the JSV to replace with rack and block requests. In programming terms, it's a reference injection point.) The resource request and JSV can be left out of the qsub command. If they're left out, the scheduler will not take HDFS data locality into consideration when scheduling the job.
You can also use the Hadoop integration to set up the job tracker and task trackers and then submit jobs to them directly. Instead of echoing the Hadoop command to qsub, echo sleep 300000 instead. That will cause the job tracker and task trackers to be set up, but instead of running a job, it will just sleep for a long time. You can then run qstat -j <jobid> | grep context to show the job's context. One of the context variables will be the URL for the job tracker. Using that URL, you can set up a Hadoop configuration to talk to the job tracker so that you can submit jobs to it from the command line.
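As a concrete sketch of that approach:

    # start a job tracker plus 16 task trackers and keep them alive for a long time
    echo sleep 300000 | qsub -pe hadoop 16
    # once the job is running, find the job tracker URL in the job context
    qstat -j <jobid> | grep context

The job tracker and task trackers are torn down when the sleep ends or when you qdel the job.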
I also highly recommend coupling the Hadoop integration with exclusive host access. The Hadoop task trackers all assume that they have exclusive access to their nodes. If you don't use exclusive host access with the Hadoop integration, you'll end up oversubscribing the nodes.
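If your cluster is running 6.2u3 or later, the built-in exclusive host access facility is the easiest way to get that. A rough sketch, using node01 as an example host name:

    # mark an execution host as schedulable exclusively (repeat for each node)
    qconf -aattr exechost complex_values exclusive=true node01
    # then ask for exclusive access at submission time by adding -l exclusive=true, e.g.
    #   ... | qsub -pe hadoop 16 -l exclusive=true -l hdfs_input=/user/dant/sungrep -jsv <hadoop>/jsv.sh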
Hopefully everything will work perfectly the first time. If for some reason it doesn't, here are some tips to help diagnose the problem:
- The execds aren't reporting any hdfs resources, i.e. qhost -F | grep hdfs shows nothing.
- Sometimes it takes several minutes for the nodes to start reporting the hdfs resources. If after several minutes there's still nothing, pick an execution host and check whether the load sensor is running with jps -l. Look for com.sun.grid.herd.HerdJsv. Note that it might be running as root or as the SGE admin user. Also note that jps may only show you your own processes. If the load sensor isn't running, look for log files in /tmp. They will be called sge_hadoop_loadsensor.out and sge_hadoop_<n>.log. The .out file is the output from starting the load sensor. The .log files are the logging output from the load sensor. One will be the log file from the load sensor framework, and the other will be the log file from the Herd load sensor. (You can control the logging verbosity from the logging.properties file in the <hadoop> directory.) The most common problem is that the load sensor is started as root on most platforms (for a reason I don't yet understand), but HDFS usually is not. With HDFS, the user who started it is the super user, and only the super user can query the kind of information that the load sensor needs. As stated in the configuration section, you must either disable HDFS security or set a super user group that contains root (and probably the SGE admin user). The next most common problems are that the path to Hadoop or the Java platform is not correct in env.sh or that the conf directory contains bad configuration information. You can test the load sensor manually by changing into the <hadoop> directory and running loadsensor.sh. If it works, it will "hang". Press enter, and it should spit out the hdfs resource values for that host. Type QUIT and press enter to exit the load sensor.
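The manual test looks like this:

    cd <hadoop>
    ./loadsensor.sh
    # the script will appear to hang; press enter and it should print the hdfs
    # resource values for this host, then type QUIT and press enter to exit

You can also confirm that the load sensor JVM is up with something like jps -l | grep herd, run as root or the SGE admin user if jps only shows you your own processes.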
- The job tracker and/or task trackers aren't starting.
- The first place to look is the PE output and error files. The output from starting the job tracker should be found there. The next place to look is the log files. The log files are written where the Hadoop configuration says to put them. Make sure that wherever that is, all the users have access to it from all the nodes. Inability to write the log files is a common reason why the job tracker and/or task trackers won't start. In addition to the usual Hadoop log files, the integration also writes a hadoop-<adminuser>-sge-<hostname>.log file. That file contains the output from starting the task trackers from the master host. Another common reason for the job tracker and/or task trackers not to start is that the path to the Java platform isn't correctly configured in the hadoop-env.sh file.
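For a job submitted from stdin as in the examples above, SGE names the job STDIN by default, so the PE start-up output lands in files roughly like these (in the directory the job ran from, usually your home directory unless you submitted with -cwd):

    # parallel environment start-up output and errors for job <jobid>
    cat STDIN.po<jobid> STDIN.pe<jobid>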