Grid Engine: The World's First Cloud-aware Distributed Resource Manager

Sun has just released a new version of Grid Engine. Grid Engine is a market-leading product in the Distributed Resource Management space, but this new release really takes the product to the next level. Specifically, it brings it up into the cloud!

So, what's so exciting about this release? There are a number of things, but I'll focus on two. First, dynamic resource reallocation, including the ability to use on-demand resources from Amazon EC2. Second, deep integration with Apache Hadoop -- one of the most popular workloads in the cloud today.

A new feature in Grid Engine allows you to manage resources across logical clusters (or even clouds). These could be two collections of systems inside a corporation, or they could include non-local cloud resources (such as EC2). Why would you want to do this? Let's look at a scenario.

Many auto companies use Grid Engine to coordinate the resources on the Grid/Cluster/Cloud they use for mechanical design and simulation. Users across the company submit jobs (e.g., a crash simulation), and Grid Engine queues them and dispatches them based on priority and policy. However, what happens when your submissions start to outpace the ability of your systems to keep up? In the traditional model, you'd have to buy new hardware and add it to your Grid/Cluster/Cloud. With Grid Engine you can now configure rules that allow you to "cloud burst" these workloads out to another cloud. With Amazon EC2 specifically, you pre-configure a set of AMIs on EC2 that have your application software and register them with Grid Engine. You also give Grid Engine the credentials to manage your EC2 account. Then, based on your policy, Grid Engine will (see the sketch after this list):

  • Fire up new EC2 instances on demand (using your supplied AMIs)
  • Automatically set up a secure VPN tunnel between your network and your EC2 instances
  • Join them to the Grid Engine cluster
  • Dispatch work to them
  • Take them back down once demand has subsided

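To make that flow concrete, here's a minimal sketch of what such a burst/shrink policy loop looks like. This is not Grid Engine's actual implementation -- boto3, the AMI ID, the instance type, and the pending_jobs()/join_cluster() helpers are all illustrative assumptions:

    # Hypothetical sketch of a cloud-bursting policy loop.  Grid Engine does the
    # equivalent internally; only the shape of the logic matters here.
    import time

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    AMI_ID = "ami-12345678"   # hypothetical pre-configured image with your app software
    BURST_THRESHOLD = 100     # pending jobs at which we burst out to EC2
    MAX_BURST = 4             # cap on on-demand instances

    burst_instances = []      # IDs of instances we started and still own

    def pending_jobs():
        # Hypothetical helper: in a real cluster, parse `qstat` for queued jobs.
        return 0

    def join_cluster(instance_ids):
        # Hypothetical helper: bring up the VPN tunnel and register the new
        # instances as execution hosts in the cluster.
        pass

    while True:
        pending = pending_jobs()
        if pending > BURST_THRESHOLD and not burst_instances:
            resp = ec2.run_instances(ImageId=AMI_ID, InstanceType="m1.large",
                                     MinCount=1, MaxCount=MAX_BURST)
            burst_instances = [i["InstanceId"] for i in resp["Instances"]]
            join_cluster(burst_instances)
        elif pending == 0 and burst_instances:
            # Demand has subsided: release the on-demand capacity.
            ec2.terminate_instances(InstanceIds=burst_instances)
            burst_instances = []
        time.sleep(60)
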
It's a great example of on-demand resource management, and it has the potential to save customers real money by avoiding over-provisioning their internal clouds.

The next thing that's really exciting is Grid Engine's new integration with Hadoop. Hadoop is a popular open-source implementation of Map-Reduce. Map-Reduce is the fundamental building block that powers the internal clouds at Yahoo and Google, and it's commonly used to build applications that process huge collections of data.
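
If you haven't seen the model before, a toy word count makes the two phases concrete. This is a plain-Python illustration of Map-Reduce semantics, not Hadoop code:

    from collections import defaultdict

    # Toy illustration of the Map-Reduce model: map emits (key, value) pairs,
    # the framework groups pairs by key, and reduce folds each group to a result.

    def map_words(document):
        for word in document.split():
            yield (word.lower(), 1)

    def shuffle(pairs):
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_count(key, values):
        return (key, sum(values))

    documents = ["the quick brown fox", "the lazy dog", "the fox"]
    pairs = [pair for doc in documents for pair in map_words(doc)]
    counts = dict(reduce_count(k, v) for k, v in shuffle(pairs).items())
    print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}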

While Hadoop has seen a large amount of deployment in the web space (at companies like Facebook and others), it's only starting to see adoption in the enterprise. This new Grid Engine release can help change that. Grid Engine is now a key ingredient in making Hadoop enterprise-ready. At a technical level, Hadoop applications can now be submitted to Grid Engine, just like any other kind of parallel computation job. This means you can now more easily share a single set of physical resources between Hadoop and other traditional applications (financial risk modeling, crash simulations, weather prediction, batch processing -- you name it). That means reduced cost to the customer. Beyond that, Grid Engine now has a deep understanding of Hadoop's distributed file system (HDFS), which means that Grid Engine can send work to the right part of the cluster (where the data lives locally) to make it ultra-efficient -- even when sharing. And lastly, Grid Engine has a mature usage accounting and billing feature (ARCo) built in. That means you can now track and (internally) charge back for Hadoop jobs -- giving IT a real way to interact with the business.
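
To illustrate the data-locality point: a scheduler that knows which hosts hold a job's HDFS blocks can prefer those hosts at dispatch time. The sketch below is hypothetical -- hdfs_block_hosts() stands in for whatever interface the integration uses to query the HDFS NameNode:

    from collections import Counter

    def hdfs_block_hosts(path):
        # Hypothetical stand-in for a NameNode query returning, for each block
        # of the input, the hosts that hold a replica of that block.
        return [["node03", "node07"], ["node03", "node12"], ["node07", "node12"]]

    def rank_hosts_by_locality(input_path, candidate_hosts):
        # Count how many of the input's blocks each candidate holds locally,
        # then order the candidates so data-local hosts come first.
        local_blocks = Counter()
        for replicas in hdfs_block_hosts(input_path):
            for host in replicas:
                if host in candidate_hosts:
                    local_blocks[host] += 1
        return sorted(candidate_hosts, key=lambda h: -local_blocks[h])

    print(rank_hosts_by_locality("/data/crash-sims", ["node03", "node07", "node42"]))
    # ['node03', 'node07', 'node42'] -- node42 holds no replicas, so it ranks last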

There's a lot more to this release, and you can read all about it over at Dan Templeton's blog, so I won't try to go into all the details. Suffice it to say that I'm really excited about this release. Grid Engine is positioned to be an increasingly important part of the infrastructure for cloud computing going forward.

Comments:

Hadoop+GridEngine is really cool. It's really what we have wanted to see.

Posted by Bonghwan Kim on January 13, 2010 at 11:32 PM PST #

These are really innovative ideas -- the fusion of traditional resources and cloud resources. Hadoop and Grid Engine provide dynamic provisioning... that's cool.

Posted by Bhaskar on January 14, 2010 at 02:55 AM PST #

EC2 and Hadoop integration = Major geek points with us programmers. Nice

Posted by Eric Wendelin on January 14, 2010 at 06:17 AM PST #

As a computing specialist who works in the Condor, Hadoop, and SGE communities, I wanted to post a follow-up to the above post, as it is factually incorrect.

At Cycle Computing (http://www.cyclecomputing.com), I started working with Condor users five years ago, and over the past two years I have also used Hadoop and SGE with clients in life sciences, insurance, finance, energy, and chip design. Condor is very flexible and powerful, but all schedulers have use cases they're great at.

We’re fans of computation management in general, but we were alarmed by Wilson’s post about Grid Engine being the “World’s First Cloud-Aware Distributed Resource Manager” because it supports machines running in EC2 through a VPN and can schedule Hadoop clusters. Simply put, it isn’t first.

Perhaps Wilson was misinformed about SGE being first with these features and will correct or update his post upon review of the following information.

The Condor Scheduler has had Hadoop Cluster scheduling since 2006, originally by Yahoo! using Condor in its Hadoop on Demand project. Condor has had Amazon EC2 scheduling since 2008. In 2007, CycleCloud offered Condor clusters as a service into the Amazon Cloud. Condor can even be used to do advanced, cost-based scheduling in Amazon EC2, as we discussed here (http://bit.ly/4BvKwJ). These dates are well published.

Condor is a freely available resource manager from the University of Wisconsin-Madison, with production users over the past 20 years, including many Fortune 500s and large companies like JP Morgan, Yahoo!, Fair Isaac, Altera, and Hartford Insurance, who have also talked at Condor conferences (http://bit.ly/73vazd).

For reference, the timeline looks like this:
• 2006: Yahoo!, the major contributor behind Hadoop, used Condor to schedule Hadoop clusters on generic hardware; see Sameer Paranjpye's presentation (bit.ly/73vazd) on Hadoop on Demand using Condor
• 2007: CycleCloud Release 1 (http://cyclecloud.com) implements Condor Clusters in the Amazon Cloud
• 2008: Condor adds the ability (http://bit.ly/6dc90k) to schedule machines in Amazon EC2
• 2009: Condor adds support for jobs not written against Hadoop to read data directly out of HDFS via hdfs:// URLs
• 2009: Cycle works with Fortune 100 companies to deploy Hadoop clusters on a Condor pool of resources
• 2009: Sun supports some cloud functionality (http://bit.ly/4BKJd8) in the commercial-only binary release; non-paying customers should build the source themselves
• 2010: Wilson announces (http://bit.ly/5Bm8i8) Grid Engine does Hadoop cluster scheduling and Amazon EC2 utilization as the “World’s First Cloud-Aware Distributed Resource Manager”

Clearly, Condor was years ahead of SGE in having these features, and has been doing it in production environments for far longer.

That’s not to say that SGE, including these features, isn’t cool. In fact, it is! It just isn’t the first to do so. The Condor Team deserves attribution (and a correction by Wilson) for enabling cloud scheduling using Amazon EC2, and Hadoop on Demand at Yahoo! itself, long before SGE.

Posted by Jason Stowe on January 15, 2010 at 04:35 AM PST #

The debate with Jason is ongoing on his blog, but I just wanted to bring a few of the points back here. As far as I can tell, the Condor integration with Hadoop is to use HDFS for file staging. While that's a nifty feature (that I've considered for SGE as well), it's not quite the same thing as being able to route Hadoop MapReduce jobs to the nodes that contain their data.

Posted by Daniel Templeton on January 21, 2010 at 11:06 PM PST #

About

Thoughts on cloud computing, virtualization and data center management from Steve Wilson, Oracle engineering VP.
