Sunday May 25, 2008

One More Down, Two To Go

This news is a little old now, but it's no less worthy of announcing. Thanks to our friends at FedStage, the same folks who brought us the Platform LSF DRMAA implementation, there is now an implementation of DRMAA for Altair's PBS Pro! With the addition of PBS Pro to the DRMAA family, that leaves just two major DRMs without DRMAA support: DataSynapse GridServer and Microsoft CCS. Sun Grid Engine, Platform LSF, and Altair PBS Pro all have DRMAA implementations, as do Condor, Torque, EGEE, GridWay, and several others.

For the uninitiated, DRMAA is an API for submitting, monitoring, and controlling jobs in a DRM system. The API is intended to be simple and clean as well as cross-platform, cross-DRM, and cross-language. Sun Grid Engine, for example, ships with DRMAA implementations in the C and Java™ languages, and Perl, Python, and Ruby implementations are available from the open source community.

By the way, I should also give a shout-out to FedStage's other big DRMAA project, OpenDSP. It is exactly what its acronym proclaims it to be. It's a service for doing job submission, monitoring, and control remotely via DRMAA connections to the DRM systems. If you're looking for a framework for secure remote grid operations, definitely check it out!

Friday May 23, 2008

Exclusive Host Access With Grid Engine

I just got the following request in email:

It just happens that I'm using PBSpro ... at the moment...

You can have this resource request...

#PBS -l nodes=101:ppn=8#excl

We can implement the nodes/ppn with PEs in SGE.

But #excl means exclusive access to a node (only applies to batch).

That is what I want from SGE.

Since this is a request I've heard before, I thought it might be useful to share my answer.

Imagine you have a grid of n machines, and each machine has the same number of cores, say 4. Imagine also that you have two queues in your grid, long.q and short.q, that span all of the hosts. In order to implement exclusive node use, I need to do three things:

  1. Create a new queue called exclusive.q that spans all hosts and has a single slot per host. Also, set the subordinate_list to long.q=1,short.q=1.

  2. Create a new forced static boolean resource called exclusive, and set the complex_values of exclusive.q to exclusive=TRUE.

  3. Set the subordinate_list for long.q and short.q to exclusive.q=1.

I can now submit a job with:

qsub -l exclusive /path/to/job

and it will be guaranteed to run as the only job on the machine. It should be pretty easy to take this simple example and extend it to work in your actual environment.

Let's talk about why it works. First, the exclusive queue is protected by a forced resource. Only jobs that request the resource can run in the queue. That prevents random jobs from accidentally wandering into that queue. Second, it is subordinated to the long and short queues. That means that if there are jobs running in either the long or short queue, the exclusive queue will be suspended, preventing jobs from being scheduled there. Lastly, the long and short queues are subordinated to the exclusive queue, meaning that if a job is running in the exclusive queue, the long and short queues are suspended, preventing jobs from being scheduled there. Because of the circular subordination scheme, we can guarantee that when one of the queues is suspended, it will have no jobs running in it, so our exclusive jobs won't accidentally suspend some other hapless job. (If there were another job in another queue, then the exclusive queue would already be suspended, so the exclusive job couldn't be scheduled there.)
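To make the circular subordination argument concrete, here is a toy model of a single host in Python. It is only a sketch of the suspension rule described above, not Grid Engine code: the function names `suspended` and `try_schedule` are made up for illustration, and the model assumes that a queue is suspended as soon as a queue listing it in its subordinate_list has reached the given slot threshold on that host.

```python
# Toy model of one 4-core host. Not Grid Engine code, only the rules above.
SLOTS = {"long.q": 4, "short.q": 4, "exclusive.q": 1}

# subordinate_list settings from the three steps: queue -> {subordinate: threshold}
SUBORDINATES = {
    "exclusive.q": {"long.q": 1, "short.q": 1},
    "long.q": {"exclusive.q": 1},
    "short.q": {"exclusive.q": 1},
}


def suspended(queue, running):
    """True if any queue that lists `queue` as a subordinate has at least
    the configured threshold of slots in use on this host."""
    return any(
        queue in subs and running[owner] >= subs[queue]
        for owner, subs in SUBORDINATES.items()
    )


def try_schedule(queue, running, requests_exclusive=False):
    """Start one job in `queue` if the rules allow it; return True on success."""
    # Forced resource: exclusive.q accepts only jobs that request `exclusive`,
    # and a job that requests it can only run where exclusive=TRUE is offered.
    if (queue == "exclusive.q") != requests_exclusive:
        return False
    # A suspended or full queue instance cannot take new jobs.
    if suspended(queue, running) or running[queue] >= SLOTS[queue]:
        return False
    running[queue] += 1
    return True
```

Starting from an empty host, one job in long.q is enough to make `try_schedule("exclusive.q", ...)` fail, and one job in exclusive.q makes every later attempt on long.q or short.q fail. In this model, whichever side gets to the host first keeps the other side out, which is why no running job is ever suspended.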

While this configuration isn't a built-in feature of Grid Engine as it is in PBS Pro, what we offer is considerably more flexible. The administrator can be very specific about which machines can be exclusive and under which circumstances, and all of it works just like a regular queue, which makes administration easier. From the end user's side, there's no appreciable difference.



