Exclusive Host Access With Grid Engine

I just got the following request in email:

It just happens that I'm using PBSpro ... at the moment...

You can have this resource request...

#PBS -l nodes=101:ppn=8#excl

We can implement the nodes/ppn with PEs in SGE.

But #excl means exclusive access to a node (only applies to batch).

That is what I want from SGE.

Since this is a request I've heard before, I thought it might be useful to share my answer.

Imagine you have a grid of n machines, and each machine has the same number of cores, say 4. Imagine also that you have two queues in your grid, long.q and short.q, that span all of the hosts. In order to implement exclusive node use, I need to do three things:

  1. Create a new queue called exclusive.q that spans all hosts and has a single slot per host. Also, set the subordinate_list to long.q=1,short.q=1.

  2. Create a new forced boolean resource called exclusive (a static, non-consumable complex whose requestable attribute is FORCED), and set complex_values to exclusive=TRUE on exclusive.q.

  3. Set the subordinate_list for long.q and short.q to exclusive.q=1.
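As a rough sketch, the three steps might translate into qconf commands along these lines. The script only prints the commands for review; it does not run them. It assumes long.q and short.q already exist, that long.q is a reasonable template to clone, and that the file names complexes.txt and exclusive.q.conf and the shortcut excl are free to use. None of those details come from the original request, so check each command against your own installation first.

```shell
#!/bin/sh
# Sketch only: prints the qconf commands instead of executing them.
# Assumptions (hypothetical): long.q is used as a template for exclusive.q,
# and complexes.txt / exclusive.q.conf are scratch file names.
exclusive_setup_commands() {
cat <<'EOF'
# Step 2 (done first so the queue can reference it): add a forced boolean complex.
# Columns: name shortcut type relop requestable consumable default urgency
qconf -sc > complexes.txt
echo 'exclusive excl BOOL == FORCED NO 0 0' >> complexes.txt
qconf -Mc complexes.txt
# Step 1: clone long.q under a new name, then override the relevant attributes.
qconf -sq long.q | sed 's/^qname .*/qname exclusive.q/' > exclusive.q.conf
qconf -Aq exclusive.q.conf
qconf -rattr queue slots 1 exclusive.q
qconf -rattr queue complex_values exclusive=TRUE exclusive.q
qconf -rattr queue subordinate_list long.q=1,short.q=1 exclusive.q
# Step 3: make exclusive.q a subordinate of the two regular queues.
qconf -rattr queue subordinate_list exclusive.q=1 long.q
qconf -rattr queue subordinate_list exclusive.q=1 short.q
EOF
}

exclusive_setup_commands
```

Note that -rattr replaces the whole attribute value, so if long.q or short.q already have subordinates, those would need to be included in the new list.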

I can now submit a job with:

qsub -l exclusive /path/to/job

and it will be guaranteed to run as the only job on the machine. It should be pretty easy to take this simple example and extend it to work in your actual environment.

Let's talk about why it works. First, the exclusive queue is protected by a forced resource: only jobs that explicitly request exclusive can run there, so ordinary jobs never wander into it by accident. Second, the exclusive queue is subordinated to the long and short queues. If a job is running in either of them on a host, the exclusive queue on that host is suspended, and nothing will be scheduled into it. Lastly, the long and short queues are subordinated to the exclusive queue, so while an exclusive job is running on a host, the long and short queues there are suspended and no jobs will be scheduled into them.

Because the subordination is circular, a queue can only become suspended on a host while it has no jobs running there, so our exclusive jobs will never suspend some other hapless job. (If another job were already running in another queue, the exclusive queue would already be suspended, and the exclusive job couldn't have been scheduled there.)
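The argument can be checked with a toy model of a single host. This is not Grid Engine code, just a few shell functions that mimic the suspend rules described above. The names excl_jobs, other_jobs and start_job are made up for the illustration, and the model ignores the slot limits on long.q and short.q.

```shell
#!/bin/sh
# Toy model of one host; purely illustrative, not Grid Engine code.
excl_jobs=0     # jobs running in exclusive.q (it has a single slot)
other_jobs=0    # jobs running in long.q and short.q combined

# A queue is suspended while a queue that subordinates it holds >= 1 job.
excl_suspended()  { [ "$other_jobs" -ge 1 ]; }
other_suspended() { [ "$excl_jobs" -ge 1 ]; }

# start_job exclusive|other
# Returns 0 and updates the counters if the job can be scheduled, 1 otherwise.
start_job() {
    case $1 in
        exclusive)
            if excl_suspended || [ "$excl_jobs" -ge 1 ]; then return 1; fi
            excl_jobs=1 ;;
        other)
            if other_suspended; then return 1; fi
            other_jobs=$((other_jobs + 1)) ;;
        *)
            return 1 ;;
    esac
}

# A regular job arrives first, so the exclusive job has to wait.
start_job other     && echo "regular job started"
start_job exclusive || echo "exclusive job refused: exclusive.q is suspended"
```

In this model, whichever kind of job arrives first shuts the other kind out, so there is never a running job in a queue at the moment that queue becomes suspended.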

While this configuration isn't a built-in feature of Grid Engine like it is with PBS Pro, what we offer is considerably more flexible. The administrator has the ability to be very specific about which machines can be exclusive and under which circumstances, and all of it works just like a regular queue, which makes administration easier. From the end user side, there's no appreciable difference.

About

templedf
