Welcome Sun Grid Engine 6.2 update 5
By templedf on Jan 06, 2010
The Sun Grid Engine 6.2 update 5 release is now available. Don't let the unassuming version number fool you; there are quite a few interesting features packed into this release. Let's talk about them, shall we?
Integration with Apache Hadoop
SGE 6.2u5 gets to claim the title of first workload manager with direct support for Apache Hadoop applications. What does that mean? First, it means that you can submit Hadoop applications to an SGE cluster just like you would any other parallel job. The cluster will take care of setting up the Hadoop jobtracker and tasktrackers for you. Second, it means that the SGE scheduler knows about the HDFS data locality such that it can route Hadoop jobs to nodes where the jobs' data already lives. The net result is that you can now realistically consolidate your Hadoop cluster into your SGE cluster, saving you time, money, and lots of headaches. See the docs for more info. [Also see my next post.]
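To make the data locality idea concrete, here is a minimal Python sketch. It is only a toy model, not how the SGE scheduler is implemented: the block map, the node names, and the `rank_nodes_by_locality` helper are all hypothetical. It simply ranks nodes by how many of a job's input blocks they already hold.

```python
from collections import Counter


def rank_nodes_by_locality(block_locations, input_blocks):
    """Return node names sorted by how many of the job's input blocks they hold."""
    counts = Counter()
    for block in input_blocks:
        for node in block_locations.get(block, ()):
            counts[node] += 1
    # Most local blocks first; ties are broken by node name for determinism.
    return sorted(counts, key=lambda node: (-counts[node], node))


# Hypothetical block map: block id -> nodes holding a replica of that block.
block_locations = {
    "blk_1": ["node1", "node2"],
    "blk_2": ["node2", "node3"],
    "blk_3": ["node2", "node4"],
}

ranking = rank_nodes_by_locality(block_locations, ["blk_1", "blk_2", "blk_3"])
print(ranking)
```

In this made-up example node2 holds a replica of every input block, so a locality-aware scheduler would prefer it over the other three nodes.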
Topology-Aware Scheduling
Many applications benefit greatly from being tied to specific CPU sockets and/or cores. For example, some cache-hungry applications will execute in half the time if run on four cores on different sockets versus running on four cores in the same socket. With SGE 6.2u5, we've added the ability to specify these topology preferences when submitting your jobs. Whenever possible, the scheduler will honor the topology preferences when assigning jobs to nodes. For topology-sensitive applications and clusters with lots of Nehalem boxes, SGE 6.2u5 can speed up application execution considerably. See the docs for more info. [Also see my follow-up post.]
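The difference between "four cores in the same socket" and "four cores on different sockets" can be illustrated with a small Python sketch. This is a toy model under my own assumptions: the topology is just a list of core counts per socket, and the `pack` and `spread` helpers are hypothetical names that do not reproduce the exact semantics of SGE's binding strategies.

```python
def pack(topology, amount):
    """Pick `amount` cores by filling one socket before moving to the next."""
    cores = [(socket, core)
             for socket, n_cores in enumerate(topology)
             for core in range(n_cores)]
    return cores[:amount]


def spread(topology, amount):
    """Pick `amount` cores round-robin across sockets."""
    picked = []
    core = 0
    while len(picked) < amount and core < max(topology, default=0):
        for socket, n_cores in enumerate(topology):
            if core < n_cores and len(picked) < amount:
                picked.append((socket, core))
        core += 1
    return picked


# Hypothetical host: four sockets with four cores each.
topology = [4, 4, 4, 4]

same_socket = pack(topology, 4)      # all four cores on socket 0
across_sockets = spread(topology, 4)  # one core on each socket
print(same_socket)
print(across_sockets)
```

Each pair is (socket, core). A cache-hungry job placed with `spread` gets a whole socket's cache per process, which is the situation the example in the text describes.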
Slotwise Subordination
The SGE preemption model is what I call "after-market preemption," meaning that it's not an inherent aspect of every cluster. You have to take preemption (AKA subordination) into account when designing your cluster layout. Prior to SGE 6.2u5, the preemption model was rather coarse-grained. SGE could only suspend an entire queue instance at a time, meaning that one high-priority job might be suspending two or four or sixteen or more lower-priority jobs. With SGE 6.2u5, we're introducing finer-grained preemption. Now, rather than declaring that just Queue A is subordinated to Queue B, you can say that between Queues A and B there shouldn't be more than 4 jobs running, and given a conflict, Queue B wins. This new finer-grained preemption model means that you can now use subordination without paying for it with utilization. See the docs for more info. [Also see my follow-up post.]
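The arithmetic behind the two preemption models can be sketched in a few lines of Python. This is a simplified illustration, not SGE's implementation: the function names are hypothetical, and the coarse-grained model is reduced to "suspend everything in Queue A as soon as Queue B has a job."

```python
def coarse_suspend(running_a, running_b):
    """Simplified old model: any job in Queue B suspends all of Queue A."""
    return running_a if running_b > 0 else 0


def fine_suspend(running_a, running_b, threshold):
    """Finer-grained model: suspend only enough Queue A jobs so that
    the jobs running in A and B together stay within `threshold`."""
    excess = running_a + running_b - threshold
    # Only Queue A jobs can be suspended, so never exceed running_a.
    return max(0, min(excess, running_a))


# Four low-priority jobs in Queue A, one high-priority job arrives in Queue B,
# and at most 4 jobs may run across both queues.
print(coarse_suspend(4, 1))   # 4 jobs suspended
print(fine_suspend(4, 1, 4))  # 1 job suspended
```

With the coarse model one high-priority job idles four low-priority jobs; with the threshold only one job has to yield, which is where the utilization gain comes from.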
User-controlled Array Task Throttling
One of the unique things about Sun Grid Engine is that it handles array jobs extremely efficiently. In many cases users will consolidate individual batch jobs together into array jobs to take advantage of that fact. The downside is that all tasks within an array job are considered equal with regard to scheduling policies. If an array job is the highest priority job in the system, all of its tasks are also higher priority than any other jobs. If that array job has ten thousand tasks (something not uncommon or really even all that stressful for SGE), then all ten thousand tasks will be run before any other jobs (unless another job later becomes higher priority), at least by default. An administrator can configure a global limit to the number of tasks from a single array job that are allowed to execute at a time. Better than nothing, but global policies always leave something to be desired.
With SGE 6.2u5, we've introduced the ability for a user to apply self-imposed limits to his individual array jobs. Why would a user voluntarily set limits? In most cases it turns out that users want to do the right thing and will gladly do so given the chance. Self-imposed limits help the cluster run more smoothly, meaning that everyone gets what they want faster, and no one gets bonked on the head by the administrator. Additionally, if a user has more than one large array job pending, setting self-imposed limits allows them all to make progress instead of completing them serially. For more than one customer I know about, this feature alone will be reason enough to upgrade. [See my follow-up post for more info.]
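Here is a rough Python sketch of why a self-imposed limit helps. It is a toy scheduler under my own assumptions, not SGE's: every task takes one time step, jobs are listed highest priority first, and the `simulate` function and job dictionaries are hypothetical.

```python
def simulate(jobs, slots, steps):
    """Return, per time step, how many tasks of each job were started.

    jobs  -- list of dicts with 'name', 'tasks' and 'limit'
             (None means unlimited), highest priority first
    slots -- number of tasks the cluster can run at once
    """
    remaining = {job["name"]: job["tasks"] for job in jobs}
    history = []
    for _ in range(steps):
        free = slots
        running = {}
        for job in jobs:
            cap = remaining[job["name"]]
            if job["limit"] is not None:
                cap = min(cap, job["limit"])  # per-job concurrency limit
            started = min(cap, free)
            running[job["name"]] = started
            free -= started
        for name, started in running.items():
            remaining[name] -= started
        history.append(running)
    return history


small = {"name": "small", "tasks": 5, "limit": None}

unthrottled = simulate(
    [{"name": "big", "tasks": 100, "limit": None}, small], slots=10, steps=1)
throttled = simulate(
    [{"name": "big", "tasks": 100, "limit": 6}, small], slots=10, steps=1)

print(unthrottled)  # [{'big': 10, 'small': 0}]
print(throttled)    # [{'big': 6, 'small': 4}]
```

Without a limit the big array job takes every slot and the small job waits; with a limit of six tasks the small job starts right away, while the big job still makes steady progress.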
Extended SGE Inspect
SGE Inspect, the new UI introduced in SGE 6.2u3, was previously only a monitoring tool. With SGE 6.2u5, we've added the ability to manage parallel environments. Going forward we will continue adding management functionality. See the docs for more info.
Improved Cloud Connectivity
With SGE 6.2u3, we added the ability through the Service Domain Manager component to automatically provision additional cluster nodes from Amazon EC2 during peak periods. With SGE 6.2u5, we've expanded that functionality a bit and made it easier to use. See the docs for more info.
Improved Power Management
Same story as the cloud connectivity, really. We introduced the ability to automatically power down idle or underused nodes with SGE 6.2u3 through the Service Domain Manager component. With SGE 6.2u5, we've fleshed it out a bit more and made it easier to use.
Over the next couple of weeks I'll try to write some posts about these features individually. If you're already Grid Engine savvy, go grab a copy and get started. If you need more info, try starting with the beginner's guide.