Limiting Certain Concurrent Jobs on SGE Cluster

A customer asked how to limit the total number of certain jobs on an SGE cluster.  The customer wants to restrict these jobs so that only one of them runs on each execution host and no more than a given number of them run concurrently across the cluster.  The cap on concurrent jobs is meant to keep the jobs from contending for resources.

The customer is running an old version of SGE that lacks the resource quota set feature, which has been available since the SGE 6.1 release.  The same limit can be achieved with a pre-6.1 release, but it takes a lot more work than it does with the SGE resource quota set.

The following demonstrates how easy it is to set up such a customization with the SGE resource quota set feature.

The first thing to do is to create a resource counter that tracks how many of these jobs are running.  It can be defined as an SGE complex parameter:

# qconf -sc

#name             shortcut   type       relop requestable consumable default  urgency
#------------------------------------------------------------------------------------
concurjob         ccj        INT        <=    FORCED      YES        0        0

...
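
To add the row without the interactive editor, one possible approach is to dump the complex list to a file, append the line, and load the file back with qconf -Mc. The sketch below is only an illustration: the helper name add_concurjob_complex and the /tmp path are assumptions, and the qconf calls are left as comments because they need admin rights on a live cluster.

```shell
# Hypothetical helper: append the concurjob row to a saved complex listing,
# unless a row with that name is already present.
add_concurjob_complex() {
    # $1: file holding the output of `qconf -sc`
    if ! grep -q '^concurjob[[:space:]]' "$1"; then
        printf '%s\n' \
            'concurjob         ccj        INT        <=    FORCED      YES        0        0' >> "$1"
    fi
}

# On a live cluster (not executed here):
#   qconf -sc > /tmp/complexes.txt
#   add_concurjob_complex /tmp/complexes.txt
#   qconf -Mc /tmp/complexes.txt
```

Checking for an existing row first keeps the helper safe to run more than once.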

All of these special jobs should run in a dedicated queue called "archive".  The archive queue is configured with the resource counter, so every special job must request the counter when it is submitted.

# qconf -sq archive

qname                 archive
...
complex_values        concurjob=1
...

With concurjob=1 in complex_values, only one such job will be scheduled to the archive queue instance on each machine.
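
The same queue setting can probably be applied without opening an editor. The sketch below only prints the command rather than running it. The helper name archive_capacity_cmd is an assumption, and the qconf -mattr form should be checked against the qconf man page of the installed release.

```shell
# Hypothetical helper: print the qconf command that sets the per-queue-instance
# capacity of the concurjob consumable on the archive queue.
archive_capacity_cmd() {
    # $1: number of special jobs allowed per queue instance
    case "$1" in
        ''|*[!0-9]*)
            echo "capacity must be a non-negative integer" >&2
            return 1
            ;;
    esac
    echo "qconf -mattr queue complex_values concurjob=$1 archive"
}

archive_capacity_cmd 1
# prints: qconf -mattr queue complex_values concurjob=1 archive
```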

Now it's time to control the total number of such jobs globally. This can be done very easily with a resource quota set (RQS). The following command creates such an RQS rule.

# qconf -arqs
{
   name         limit_concur_jobs
   description  NONE
   enabled      TRUE
   limit        to concurjob=10
}


Only a few entries of the template are modified here: the rule set name, the enabled flag, and the limit.  This completes the customization needed to limit the total number of special jobs running concurrently on the entire SGE cluster.
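
For scripted setups, the rule set can also be written to a file and loaded with qconf -Arqs instead of being typed into the editor. This is a minimal sketch, and the helper name make_concur_rqs and the /tmp path are assumptions made for illustration.

```shell
# Hypothetical helper: print a resource quota set that caps concurjob
# across the whole cluster.
make_concur_rqs() {
    # $1: maximum number of special jobs allowed cluster-wide
    cat <<EOF
{
   name         limit_concur_jobs
   description  NONE
   enabled      TRUE
   limit        to concurjob=$1
}
EOF
}

make_concur_rqs 10

# On a live cluster (not executed here):
#   make_concur_rqs 10 > /tmp/limit_concur_jobs.rqs
#   qconf -Arqs /tmp/limit_concur_jobs.rqs
```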

When you submit a special job to the archive queue, you must include the "-l concurjob=1" resource request (or its shortcut, "-l ccj=1"), which SGE uses to track how many of these special jobs are running.
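
Because the request is easy to forget, it may help to wrap qsub in a small function that always adds it. The sketch below only prints the resulting command line. The wrapper name archive_qsub_cmd is an assumption, not part of SGE.

```shell
# Hypothetical wrapper: build a qsub command line that always carries the
# concurjob request, so the consumable is never left out.
archive_qsub_cmd() {
    # "$@": the usual qsub options followed by the command to run
    echo "qsub -l concurjob=1 $*"
}

archive_qsub_cmd -b y -o /dev/null -j y sleep 3600
# prints: qsub -l concurjob=1 -b y -o /dev/null -j y sleep 3600
```

To actually submit, the echo could be replaced by a direct call to qsub with the same arguments.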

The following shows an example. For demonstration purposes, the archive queue is modified to accommodate two jobs per queue instance, and the total number of allowed concurrent jobs is set to 1.

s4u-80a-bur02# qconf -sq archive |egrep 'host|archive|concur'
qname                 archive
hostlist              @allhosts
complex_values        concurjob=2

s4u-80a-bur02# qconf -srqs
{
   name         limit_concur_jobs
   description  NONE
   enabled      TRUE
   limit        to concurjob=1
}

s4u-80a-bur02# qsub -b y -o /dev/null -j y -l ccj=1 sleep 3600
Your job 53 ("sleep") has been submitted
s4u-80a-bur02# qsub -b y -o /dev/null -j y -l ccj=1 sleep 3600
Your job 54 ("sleep") has been submitted

s4u-80a-bur02# qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
archive@s4u-80a-bur02          BIP   0/2/10         0.02     sol-sparc64   
     53 0.55500 sleep      root         r     10/24/2008 15:05:59     1        

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
     54 0.00000 sleep      root         qw    10/24/2008 15:05:57     1      

s4u-80a-bur02# qstat -j 54
...
scheduling info:            cannot run because it exceeds limit "/////" in rule "limit_concur_jobs/1"

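As observed here, job 54 stays pending even though the queue instance could hold two such jobs, because the RQS rule allows only one at a time.  It will be scheduled once the resource becomes available.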
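
To check the effect of the limit at a glance, the running and pending jobs in a qstat listing can be counted with a short awk filter. This is only a sketch: the helper name count_job_states is an assumption, and it counts only the plain "r" and "qw" states seen in the listing above.

```shell
# Hypothetical helper: count running (r) and pending (qw) jobs in qstat output
# read from standard input. Job lines are those whose first field is a job ID.
count_job_states() {
    awk '$1 ~ /^[0-9]+$/ {
             if ($5 == "r") running++
             else if ($5 == "qw") pending++
         }
         END { printf "running=%d pending=%d\n", running, pending }'
}

# Sample lines copied from the qstat -f listing above.
count_job_states <<'EOF'
archive@s4u-80a-bur02          BIP   0/2/10         0.02     sol-sparc64
     53 0.55500 sleep      root         r     10/24/2008 15:05:59     1
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
     54 0.00000 sleep      root         qw    10/24/2008 15:05:57     1
EOF
# prints: running=1 pending=1
```

On a live cluster the same function could be fed directly, for example with qstat -f piped into count_job_states.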


Comments:

With the current release of SGE, a "consumable JOB" might also be used, as it won't get multiplied by the slot count and will this way limit the number of "jobs" instead of "slots". This is of course only relevant in the case of parallel jobs.

Posted by Reuti on March 08, 2011 at 02:43 AM EST #

About

Chansup Byun
