Leading the Herd

Wow! There's been a surprising amount of noise lately about the Apache Hadoop integration with Sun Grid Engine 6.2u5. Since folks seem to be interested, I figure it's a good place to start on my feature deep-dive posts that I promised.

I'm going to assume that you already understand the many virtues of Hadoop. (And if you don't, Cloudera will be happy to tell you all about it.) Instead, to set the stage, let's talk about what Hadoop doesn't do so well. I currently see two important deficiencies in Hadoop: it doesn't play well with others, and it has no real accounting framework. Pretty much every customer I've seen running Hadoop does it on a dedicated cluster. Why? Because the tasktrackers assume they own the machines on which they run. If there's anything on the cluster other than Hadoop, it's in direct competition with Hadoop. That wouldn't be such a big deal if Hadoop clusters didn't tend to be so huge. Folks are dedicating hundreds, thousands, or even tens of thousands of machines to their Hadoop applications. That's a lot of hardware to be walled off for a single purpose. Are those machines really being used? You may not be able to tell. You can monitor state in the moment, and you can grep through log files to find out about past usage (Gah!), but there's no historical accounting capability there.

Coincidentally, these two issues are things that most workload managers (like Sun Grid Engine) do really well. And I'm not the first to notice that. The Hadoop on Demand project, which is included in the Hadoop distribution, was an attempt to integrate Hadoop first with Condor and then with Torque, probably for those same reasons. It's easy enough to have the Hadoop framework started on demand by a workload manager. The problem is that most workload managers know nothing about HDFS data block locality. When a typical workload manager assigns a set of nodes to a Hadoop application, it's picking the nodes it thinks are best, generally the ones with the least load, not the ones with the data. The result is that most of your data is going to have to be shipped to the machines where the tasks are executing. Since the great innovation of Map/Reduce is that we move the execution to the data instead of vice versa, bringing a workload manager into the picture shoots Hadoop in the foot.

Enter Sun Grid Engine 6.2 update 5. One of the main strengths of the Sun Grid Engine software is its ability to model just about anything as a resource (called a "complex" in SGE terms) and then use those resources to make scheduling decisions. Using that capability, we've modeled HDFS rack location and the locally present HDFS data blocks as resources. We then taught SGE how to translate an HDFS path into a set of racks and blocks. Finally, we taught SGE how to start up a set of jobtrackers and tasktrackers, and voila! Ze Hadoop integration iz born. More detail? Glad you asked.
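To make that modeling concrete, here is a rough illustration of what such resources look like in SGE's complex configuration (the column layout is the standard `qconf -sc` format; the shortcuts, types, and defaults shown are assumptions for illustration and may differ from what the shipped integration actually defines):

```
#name                shortcut  type      relop requestable consumable default urgency
hdfs_primary_rack    hpr       RESTRING  ==    YES         NO         NONE    0
hdfs_secondary_rack  hsr       RESTRING  ==    YES         NO         NONE    0
hdfs_blk00           hb00      BOOL      ==    YES         NO         FALSE   0
```

Once a resource is defined this way, jobs can request it with `-l`, and the scheduler can weigh it when placing jobs, which is all the Hadoop integration needs.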

The Gory Details

The Hadoop integration consists of two halves. The first half is a parallel environment that can start up a jobtracker and tasktrackers on the nodes assigned by the scheduler. (HDFS is assumed to already be running on all the nodes in the cluster.) In the end, it's really no different than an MPI integration (especially MPICH2). The second half is called Herd. It's the part that talks to HDFS about blocks and racks, and it has two parts. The first part is the load sensor that reports on the block and rack data for each execution host (hosts that are running the SGE execution daemon). The second part is a Job Submission Verifier that translates requests for HDFS data paths into requests for racks and blocks.
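As a minimal sketch, a load sensor is just a script that speaks SGE's load-sensor protocol: sge_execd writes a line to the sensor's stdin when it wants a report (or "quit" to stop it), and the sensor answers with a begin/end block of host:resource:value lines. The rack and block values below are hard-coded stand-ins; the real Herd sensor queries the local datanode for its rack and block information.

```shell
# Minimal load-sensor loop in the protocol sge_execd expects.
# The hdfs_* values reported here are assumed placeholders.
report_load() {
  host=$(hostname)
  while read -r line; do
    [ "$line" = "quit" ] && return 0
    echo "begin"
    echo "$host:hdfs_primary_rack:/rack1"   # assumed rack name
    echo "$host:hdfs_blk00:1"               # a block chunk present locally
    echo "end"
  done
}
```

Each begin/end report updates the host's load values in the qmaster, so the scheduler always has a reasonably fresh view of where the data lives.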

The process for running a Hadoop job as an SGE job looks something like this:

  1. At regular intervals, all of the execution hosts report their load. This includes the rack and block information collected by the Herd load sensors.
  2. User submits a job, e.g. echo $HADOOP_HOME/hadoop --config \$TMP/conf jar $HADOOP_HOME/hadoop-*-examples.jar grep input output SGE | qsub -jsv jsv.sh -l hdfs_input=/user/dant/input -pe hadoop 128
  3. The jsv.sh script starts the Herd JSV that talks to the HDFS namenode. It removes the hard hdfs_input request and replaces it with a soft request for hdfs_primary_rack, hdfs_secondary_rack, and a set of hdfs_blk<n><n> resources. (A hard request is one that is required for a job to run. A soft request is one that is desired but not required.) The primary rack resource lists the racks where most of the data live. The secondary rack resource lists all of the racks where any of the data lives. Because the HDFS data block id space is so large, we aggregate blocks into 256 chunks by their first two hex digits.
  4. The scheduler will do its best to satisfy the job's soft resource requests when assigning it nodes, 128 in this example. It's probably not going to be able to assign the perfect set of nodes for the desired data, but it should get pretty close.
  5. After the scheduler assigns hosts, the qmaster will send the job (everything between the "echo" and the "|") to one of the assigned execution hosts.
  6. Before the job is started on that host, the Hadoop parallel environment kicks in. It starts a tasktracker remotely on every node assigned to the job (all 128 in this example) and a single jobtracker locally. An important point here is that the tasktrackers are all started through SGE rather than through ssh. Because SGE starts the tasktrackers, it is able to track their resource usage and clean up after them later.
  7. After the Hadoop PE has done its thing, the job itself will run. Notice in the example that I told it to look in $TMP/conf for its configuration. The Hadoop PE sets up a conf directory for the job that points to the jobtracker it set up. That conf directory gets put in the job's temp directory, which is exported into the job's environment as $TMP.
  8. After the job completes, the Hadoop parallel environment takes down the jobtracker and tasktrackers.
  9. Information about the job, how it ran, and how it completed is logged by SGE into the accounting files.
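The block aggregation in step 3 can be sketched as a tiny helper. Treating the block id as a hex string and keying on its first two digits follows the description above; blk_to_resource is a hypothetical name, not part of the shipped integration.

```shell
# Map an HDFS block id to one of the 256 hdfs_blk<n><n> resources
# by its first two hex digits (per the aggregation in step 3).
blk_to_resource() {
  id=$1                                    # block id, e.g. "a3f07c19"
  prefix=$(printf '%s' "$id" | cut -c1-2)  # first two hex digits
  echo "hdfs_blk$prefix"
}
```

So every block whose id starts with "a3" contributes to the same hdfs_blka3 resource, which keeps the number of resources the scheduler has to reason about fixed at 256 regardless of how many blocks the cluster holds.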

Since everyone loves pretty pictures, here's the diagram of the process:

The Hadoop integration will attempt to start a jobtracker (and corresponding tasktrackers) per job. For most uses, that should be perfectly fine. If, however, you want to use the HoD allocate/deallocate model, you can do that, too. Instead of giving SGE a Hadoop job to run, give it something that blocks (like "sleep 10000000"). When the job is started, the address of its jobtracker is added to its job context. Just query the job context, grab the address, and build your own conf directory to talk to the jobtracker. You can then submit multiple Hadoop jobs within the same SGE job.
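That allocate/deallocate pattern might look something like the sketch below. The job context entry name and format ("jobtracker=host:port") are assumptions for illustration; check the actual qstat -j output on your cluster before relying on them.

```shell
# Pull a jobtracker address out of qstat -j context output.
# The "jobtracker=host:port" context format is an assumed example.
extract_jobtracker() {
  sed -n 's/.*jobtracker=\([^,]*\).*/\1/p'
}

# On a live cluster (requires SGE; shown for illustration only):
#   jobid=$(qsub -terse -pe hadoop 32 -b y sleep 10000000)
#   qstat -j "$jobid" | extract_jobtracker   # address to point your conf dir at
```

Once you have the address, point a hand-built conf directory at it and submit as many Hadoop jobs as you like for the lifetime of the blocking SGE job; deallocation is just qdel.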

Hopefully this gives you a clear picture of how the Hadoop integration works. You can find more information in the docs. I think it's a testament to the flexibility of Sun Grid Engine that the integration did not require any changes to the product. All I did was add some components through the hooks that SGE already provides. One more thing I should point out: this integration is in the Sun Grid Engine product, but not in the Grid Engine courtesy binaries that we just announced.


Of course, one reason for having thousands of machines in a Hadoop cluster is to have petabytes of storage. It's not that the cluster is necessarily underused; it's that there may be some spare cycles even when disk usage is high. Remember also that HDFS datanode checksumming and load-balancing are always ongoing.

Posted by Steve Loughran on January 15, 2010 at 12:32 AM PST #

I have no issue with having thousands of machines to support HDFS. It just seems wasteful not being able to use those machines for anything else, even when they're not running Map/Reduce. I'm not saying that dedicated Hadoop clusters are bad, just that in many cases a consolidated cluster would be more cost effective.

Posted by Daniel Templeton on January 15, 2010 at 03:30 AM PST #

hey, if you have spare CPU time and running extra work doesn't impact your network, go for it.

There is also the opportunity to plug in a new scheduler for Hadoop, one that makes different decisions about what to run next, and where.

Posted by Steve on January 15, 2010 at 06:31 AM PST #

But no matter how smart the Hadoop scheduler is*, it's still only scheduling Hadoop workload. That doesn't help if you're trying to consolidate your Hadoop applications into the cluster that runs your Fluent or BLAST or Algorithmic applications.

* It would technically be possible to extend the Hadoop scheduler to be able to schedule non-Hadoop applications, but that seems like a major reinvention of the wheel to me.

Posted by Daniel Templeton on January 15, 2010 at 06:38 AM PST #

a) does sge prevent multiple task trackers from starting on the same node? how does sge prevent cpu slots from being wasted in the case of a single reduce?

b) what happens when hadoop security gets integrated in 0.22?

Posted by a on January 28, 2010 at 04:19 AM PST #

Good questions. SGE has a facility for exclusive host access. The Hadoop integration does not require the use of exclusive host access, but it's a really really good idea.

The problem of wasting resources during the reduce phase is something we're looking at for an upcoming release. Hadoop jobs run as parallel jobs, and today the size of parallel jobs in SGE is fixed at scheduling time.

I honestly have not had a chance to look into the new security model. Today we only support 0.20.1. We will look into how to support newer builds for the upcoming releases. I can speculate, though, that it can only make things better. The Hadoop integration stretches the limits of the current Hadoop security.

Posted by Daniel Templeton on January 28, 2010 at 04:29 AM PST #

That is true, that one of the main strengths of the Sun Grid Engine software is its ability to model just about anything as a resource and then use those resources to make scheduling decisions. It's so useful.

Posted by grid on February 26, 2010 at 11:29 AM PST #

Well, if one needs to run something else using the spare cycles of a Hadoop cluster's nodes, then I think it would be totally reasonable to have a different resource manager (scheduler) in Hadoop rather than going to great lengths to marry it with SGE.

Overcomplexity leads to problems.

Posted by Cos on November 11, 2010 at 04:43 PM PST #

I actually disagree completely.

Building out the Hadoop scheduler to be able to support non-Hadoop workloads is going to great lengths and is likely to result in dangerous complexity.

Running Hadoop on top of SGE (now OGE) is a very natural thing and not complex at all. We've been doing it for decades with things like MPI, and from the workload manager's perspective, MapReduce really isn't much different.

Posted by Daniel Templeton on November 12, 2010 at 12:42 AM PST #

> We've been doing it for decades...
That's especially tasty :) as, according to my recollection, GE was acquired by Sun around 2000 or so.

Posted by Cos on November 12, 2010 at 02:50 AM PST #

Gridware (formerly Genias), the company Sun acquired to get Grid Engine, had been around since 1990, and it started by adopting an already existing source base. So, yes, *decades*. :)


Posted by Daniel Templeton on November 12, 2010 at 03:00 AM PST #
