Thursday Jan 14, 2010

Leading the Herd

Wow! There's been a surprising amount of noise lately about the Apache Hadoop integration with Sun Grid Engine 6.2u5. Since folks seem to be interested, I figure it's a good place to start on my feature deep-dive posts that I promised.

I'm going to assume that you already understand the many virtues of Hadoop. (And if you don't, Cloudera will be happy to tell you all about it.) Instead, to set the stage, let's talk about what Hadoop doesn't do so well. I currently see two important deficiencies in Hadoop: it doesn't play well with others, and it has no real accounting framework. Pretty much every customer I've seen running Hadoop does it on a dedicated cluster. Why? Because the tasktrackers assume they own the machines on which they run. If there's anything on the cluster other than Hadoop, it's in direct competition with Hadoop. That wouldn't be such a big deal if Hadoop clusters didn't tend to be so huge. Folks are dedicating hundreds, thousands, or even tens of thousands of machines to their Hadoop applications. That's a lot of hardware to be walled off for a single purpose. Are those machines really being used? You may not be able to tell. You can monitor state in the moment, and you can grep through log files to find out about past usage (Gah!), but there's no historical accounting capability there.

Coincidentally, these two issues are things that most workload managers (like Sun Grid Engine) do really well. And I'm not the first to notice that. The Hadoop on Demand project, which is included in the Hadoop distribution, was an attempt to integrate Hadoop first with Condor and then with Torque, probably for those same reasons. It's easy enough to have the Hadoop framework started on demand by a workload manager. The problem is that most workload managers know nothing about HDFS data block locality. When a typical workload manager assigns a set of nodes to a Hadoop application, it's picking the nodes it thinks are best, generally the ones with the least load, not the ones with the data. The result is that most of your data is going to have to be shipped to the machines where the tasks are executing. Since the great innovation of Map/Reduce is that we move the execution to the data instead of vice versa, bringing a workload manager into the picture shoots Hadoop in the foot.

Enter Sun Grid Engine 6.2 update 5. One of the main strengths of the Sun Grid Engine software is its ability to model just about anything as a resource (called a "complex" in SGE terms) and then use those resources to make scheduling decisions. Using that capability, we've modeled HDFS rack location and the locally present HDFS data blocks as resources. We then taught SGE how to translate an HDFS path into a set of racks and blocks. Finally, we taught SGE how to start up a set of jobtrackers and tasktrackers, and voila! Ze Hadoop integration iz born. More detail? Glad you asked.

The Gory Details

The Hadoop integration consists of two halves. The first half is a parallel environment that can start up a jobtracker and tasktrackers on the nodes assigned by the scheduler. (HDFS is assumed to already be running on all the nodes in the cluster.) In the end, it's really no different than an MPI integration (especially MPICH2). The second half is called Herd. It's the part that talks to HDFS about blocks and racks, and it has two parts. The first part is the load sensor that reports on the block and rack data for each execution host (hosts that are running the SGE execution daemon). The second part is a Job Submission Verifier that translates requests for HDSF data paths into requests for racks and blocks.

The process for running a Hadoop job as an SGE job looks something like this:

  1. At regular intervals, all of the execution hosts report their load. This includes the rack and block information collected by the Herd load sensors.
  2. User submits a job, e.g. echo $HADOOP_HOME/hadoop --config \\$TMP/conf jar $HADOOP_HOME/hadoop-\*-examples.jar grep input output SGE | qsub -jsv -l hdfs_input=/user/dant/input -pe hadoop 128
  3. The script starts the Herd JSV that talks to the HDFS namenode. It removes the hard hdfs_input request and replaces it with a soft request for hdfs_primary_rack, hdfs_secondary_rack, and a set of hdfs_blk<n><n> resources. (A hard request is one that is required for a job to run. A soft request is one that is desired but not required.) The primary rack resource lists the racks where most of the data live. The secondary rack resource lists all of the racks where any of the data lives. Because the HDFS data block id space is so large, we aggregate blocks into 256 chunks by their first two hex digits.
  4. The scheduler will do its best to satisfy the job's soft resource requests when assigning it nodes, 128 in this example. It's probably not going to be able to assign the perfect set of nodes for the desired data, but it should get pretty close.
  5. After the scheduler assigns hosts, the qmaster will send the job (everything between the "echo" and the "|") to one of the assigned execution hosts.
  6. Before the job is started on that host, the Hadoop parallel environment kicks in. It starts a tasktracker remotely on every node assigned to the job (all 128 in this example) and a single jobtracker locally. An important point here is that the tasktrackers are all started through SGE rather than through ssh. Because SGE starts the tasktrackers, it is able to track their resource usage and clean up after them later.
  7. After the Hadoop PE has done its thing, the job itself will run. Notice in the example that I told it to look in $TMP/conf for its configuration. The Hadoop PE sets up a conf directory for the job that points to the jobtracker it set up. That conf directory gets put in the job's temp directory, which is exported into the job's environment as $TMP.
  8. After the job completes, the Hadoop parallel environment takes down the jobtracker and tasktrackers.
  9. Information about the job, how it ran, and how it completed is logged by SGE into the accounting files.

Since everyone loves pretty pictures, here's the diagram of the process:

The Hadoop integration will attempt to start a jobtracker (and corresponding tasktrackers) per job. For most uses, that should be perfectly fine. If, however, you wan to use the HoD allocate/deallocate model, you can do that, too. Instead of giving SGE a Hadoop job to run, give it something that blocks (like "sleep 10000000"). When the job is started, the address of its jobtracker is added to its job context. Just query the job context, grab the address, and build your own conf directory to talk to the jobtracker. You can then submit multiple Hadoop jobs within the same SGE job.

Hopefully this gives you a clear picture of how the Hadoop integration works. You can find more information in the docs. I think it's a testament to the flexibility of Sun Grid Engine that the integration did not require and changes to the product. All I did was add in some components through the hooks that SGE already provides. One more thing I should also point out. This integration is in the Sun Grid Engine product, but not in the Grid Engine courtesy binaries that we just announced.




« April 2014