Sun Grid Engine 6.2 Update 2 Is Out!
By templedf on Mar 05, 2009
Sun Grid Engine 6.2u2 is now available. If you're not excited, you should be. First off, don't let the name fool you. 6.2u2 is not just bug fixes. It's a full feature release, and contains some great features. What features? Glad you asked.
First and foremost, job submission verifiers (JSVs). It's a feature we added specifically for TACC, but it's one that will be useful for almost everyone. In fact, I suspect that we'll discover it's the answer to some of the classic Sun Grid Engine problems. What is it? Before 6.2u2, there was no way to prevent a job from being submitted. It was (and still is) possible to choose not to schedule a job after it's been submitted, but before 6.2u2, that's all you could do. With 6.2u2 and JSV, you now have the option to insert a step between submission and acceptance. With that step, you can choose to accept or reject the job submission, but you can also choose to modify the job before accepting it, and that's where the magic comes in.
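To make the accept, reject, or modify idea concrete, here's a minimal sketch of the decision a verifier makes. It's plain Python and deliberately does not use the real JSV protocol or helper scripts. The Job and Verdict types, the MAX_SLOTS limit, and the "default" project name are all hypothetical, invented only for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Job:
    """Hypothetical, simplified view of a job at submission time."""
    owner: str
    slots: int = 1
    project: Optional[str] = None


@dataclass(frozen=True)
class Verdict:
    """Outcome of verification: 'accept', 'modify', or 'reject'."""
    action: str
    job: Optional[Job] = None
    reason: str = ""


MAX_SLOTS = 64  # hypothetical site policy, not an SGE default


def verify(job: Job) -> Verdict:
    """Decide what happens to a job between submission and acceptance."""
    # Reject: the job never enters the system at all.
    if job.slots > MAX_SLOTS:
        return Verdict("reject", reason=f"at most {MAX_SLOTS} slots per job")
    # Modify: fix up the job instead of bouncing it back to the user.
    if job.project is None:
        fixed = replace(job, project="default")
        return Verdict("modify", job=fixed, reason="assigned default project")
    # Accept: the job goes through unchanged.
    return Verdict("accept", job=job)
```

The modify branch is the interesting one: instead of telling the user to resubmit with a project, the verifier quietly fills it in.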
The verification step is handled through scripts or binaries. There's a new submission option, -jsv, that adds a JSV to the submission. That means you can pick up JSVs from anywhere that you can stash a submission option: most notably the global sge_request file, your user sge_request file, and the directory's sge_request file, but also DRMAA native specification, DRMAA job category, the enigmatic -@ switch, and, of course, the command line itself. The -jsv switch is cumulative, so if you have one in several of those places, several JSVs will be run for your submission. It's worth noting that all of the above listed JSV sources are controlled by the user, except the global sge_request file, and even that can be overridden with the
So far, we've only talked about the client side. JSVs can also come in on the server side. In the global host configuration, an administrator can configure a single JSV. Unlike on the client side, where every JSV is started from scratch with every job submission, on the server side the JSV is started once and queried repeatedly. The reason is that on the client side, performance isn't a big issue, but on the server side, the cost of forking and execing the JSV for every job submission can have a huge impact. By keeping the JSV running, we save that cost. The big advantage of the server-side JSV is that users can't circumvent it. If you really need to enforce a policy with a JSV, the server side is the place to do it.
Now, if you're thinking fast, you might question the point of the server-side JSV when users can change everything about the job using qalter after it's submitted. Well, so did we. When you configure a server-side JSV, users are no longer allowed to modify jobs after submission unless you specifically grant the ability to do so, and even then it's limited to the job attributes that you allow them to modify.
JSV is a huge topic, and I could probably go on for days about it. Instead I'll save it for a white paper and move on.
The next big feature in 6.2u2 is the new installer. You now have the option of using the old interactive text-based installer or a new graphical installer. The graphical installer has several important advantages. First, it lets you install an entire cluster at once. It actually sits on top of the auto-installer and reuses that same functionality to install remote nodes. The graphical installer, however, will first verify that all the nodes are reachable before the installation starts, so the installation won't quietly hang on an unreachable node. It also accepts wildcarded host name and IP address ranges, which makes installing a huge cluster much simpler.
The third major feature is that we've added support for Microsoft Windows Vista (Ultimate and Enterprise) and Server 2003R2 and 2008. Both 32-bit and 64-bit versions are available. Harald (who you should encourage to start blogging!) worked really hard on ironing out the issues with the changes in the OS. We still rely on SFU for the Windows execution daemons, except that it's now called SUA.
The fourth big feature is job-level parallel job resource requests. Before 6.2u2, whenever a parallel job requested a resource, SGE would implicitly multiply that resource request by the number of assigned slaves (because each slave requests the resource on the host where it runs). That makes sense with, say, memory, where requesting 4GB really means that every slave should have 4GB. It doesn't make any sense for other things, like some software licenses. Now with 6.2u2, the administrator can flag a resource as job level, meaning that it is not multiplied by the number of assigned slaves when requested by a parallel job. In most cases, a resource that shouldn't be multiplied for one job shouldn't be multiplied for any job. There may be exceptions to the rule, but I doubt there will be many. I'd love to hear your feedback, though.
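The arithmetic is easy to see in a few lines. This is only a sketch of the multiplication rule described above. The function name and the job_level flag are made up for the example and are not SGE configuration syntax.

```python
def total_consumption(requested: int, slaves: int, job_level: bool) -> int:
    """Total amount of a resource a parallel job consumes.

    requested: the amount named in the job's resource request
    slaves:    number of slaves assigned to the parallel job
    job_level: True if the administrator flagged the resource as job level
    """
    if job_level:
        # Job-level resources are counted once for the whole job.
        return requested
    # Otherwise each slave consumes the requested amount.
    return requested * slaves


# An 8-way parallel job asking for 4 GB of memory really needs 32 GB in total...
memory_gb = total_consumption(4, slaves=8, job_level=False)
# ...but the same job asking for 1 job-level license needs just 1.
licenses = total_consumption(1, slaves=8, job_level=True)
```

Without the job-level flag, that same license request would have tied up 8 licenses.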
The last two new features aren't so much features as improvements. Starting with 6.2u2, the 64-bit Linux binaries use the jemalloc library instead of the default Linux malloc. The performance and memory footprint impact is significant, in some cases as much as a 20% improvement. Also, starting with 6.2u2, the Linux binaries use poll() instead of select() in the commlib. For some flavors of Linux, the use of select() made it difficult to scale past a couple thousand hosts. With the commlib now using poll(), I've seen SGE scale to well over 6000 Linux nodes.
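For anyone who hasn't used poll(), here's a rough illustration of the pattern using Python's select.poll rather than the commlib's C code. The general background is that select() works on a fixed-size descriptor set (FD_SETSIZE, commonly 1024 on Linux), while poll() takes a list of descriptors with no such fixed cap. I'm assuming that's the ceiling in play here. The wait_readable helper is hypothetical, and the example only runs on platforms that provide poll().

```python
import select
import socket


def wait_readable(socks, timeout_ms=1000):
    """Return the sockets that have data waiting, using poll()."""
    poller = select.poll()
    by_fd = {}
    for s in socks:
        # Register each socket; there is no fixed-size descriptor set to overflow.
        poller.register(s, select.POLLIN)
        by_fd[s.fileno()] = s
    # poll() returns (fd, event mask) pairs for descriptors with activity.
    return [by_fd[fd] for fd, event in poller.poll(timeout_ms)
            if event & select.POLLIN]


# Two connected socket pairs; only the first one gets any data.
a, b = socket.socketpair()
c, d = socket.socketpair()
b.sendall(b"ping")

ready = wait_readable([a, c])  # only 'a' has data to read
```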
And on top of all that, there is the usual pile of bug fixes. A handful of qmaster and scheduler issues cropped up recently in 6.2 and 6.2u1, but with 6.2u2 those should all now be resolved.
I highly recommend giving 6.2u2 a try, if for no reason other than JSV. Let me know what you think!