Getting Grid Engine Scheduler to schedule jobs despite high load

During the preparation for an SDM demo that uses the Cloud Adapter to manage OpenSolaris zones, my colleague ran into a problem: the Grid Engine scheduler did not schedule jobs onto the zones because the overall load on the host was too high.

He installed the system on an OpenSolaris image running inside VirtualBox. He submitted 1000 sleeper jobs to the Grid Engine cluster, and the MaxPendingJobsSLO started to produce resource requests. The zones service (Cloud Adapter) started up zones, and they were assigned to the Grid Engine service. A small number of sleeper jobs were scheduled onto the zones. Then, suddenly, no more jobs were scheduled and the zones were unassigned from the Grid Engine service.

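The problem was that his setup led to a high load on the host (5.6). By default, Grid Engine defines a load threshold of 1.75 in a queue (np_load_avg=1.75, where np_load_avg is the load average normalized by the number of processors). If the load on a host is higher than this threshold, the corresponding queue instance goes into alarm state, shown as "a" in the states column: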

# qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@z2                       BIP   0/0/1          4.16     sol-x86       ad
---------------------------------------------------------------------------------
all.q@z3                       BIP   0/0/1          4.11     sol-x86       ad
---------------------------------------------------------------------------------
all.q@z4                       BIP   0/0/1          4.09     sol-x86       ad
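When a cluster has many queue instances, it can help to pull out only the ones in alarm state. The following is a minimal sketch, assuming the `qstat -f` layout shown above, where the states column is the sixth field of a queue line and is missing when no state is set. The function name `alarmed_queue_instances` is made up for this illustration.

```shell
# Hypothetical helper: reads "qstat -f" output on stdin and prints the
# names of queue instances whose states column contains "a" (load alarm).
# Queue lines are recognized by the "@" in the first field.
alarmed_queue_instances() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /a/ { print $1 }'
}
```

A typical call would be `qstat -f | alarmed_queue_instances`. Running `qstat -f -explain a` should additionally print the reason for each alarm state.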

No more jobs could be scheduled on the host. The MaxPendingJobsSLO saw that the scheduler was no longer getting jobs onto the hosts, so the resources no longer received usage from the MaxPendingJobsSLO, and SDM decided to move the zones away from the Grid Engine service.
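The alarm condition itself is a simple comparison, which can be worked through by hand. The sketch below assumes that np_load_avg is the load average divided by the number of processors, and that each zone sees a single processor. The processor count is an assumption for illustration only; it is not stated anywhere in this setup. The function name `load_state` is hypothetical.

```shell
# Hypothetical helper: prints "alarm" if load_avg / processors exceeds
# the given threshold, otherwise "ok".
# Arguments: load_avg, number of processors (assumed), threshold.
load_state() {
    awk -v load="$1" -v procs="$2" -v limit="$3" \
        'BEGIN { if (load / procs > limit) print "alarm"; else print "ok" }'
}

load_state 4.16 1 1.75   # load of all.q@z2 against the default threshold
load_state 4.16 1 10     # the same load against a threshold of 10
```

Under the single-processor assumption, a load of 4.16 exceeds the default threshold of 1.75 but stays well below a threshold of 10.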

To fix this problem, he increased the load threshold in the all.q queue configuration:

# qconf -mq all.q
qname                 all.q
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=10
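`qconf -mq` opens the queue configuration in an editor. For a scripted demo setup, the same change can probably be made non-interactively with `qconf -mattr`. This is a sketch, not what my colleague actually ran; the wrapper name `set_load_threshold` and its arguments are made up for this example.

```shell
# Hypothetical wrapper around "qconf -mattr": sets the np_load_avg entry
# of the load_thresholds attribute for the given cluster queue.
# Arguments: queue name, new np_load_avg threshold.
set_load_threshold() {
    queue="$1"
    value="$2"
    qconf -mattr queue load_thresholds "np_load_avg=${value}" "$queue"
}
```

Calling `set_load_threshold all.q 10` would correspond to the edit shown above. The result can be checked with `qconf -sq all.q`, which prints the queue configuration.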

Afterwards the alarm state of the queue instances was cleared. Some jobs had also gone into error state, and he had to clear those as well:

# qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@z2                       BIP   0/1/1          4.04     sol-x86       
      5 0.55500 sleep      rh           r     07/17/2009 13:55:50     1 4
############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
      4 0.55500 sleep      rh           Eqw   07/17/2009 13:39:20     1        
      5 0.55500 sleep      rh           Eqw   07/17/2009 13:47:50     1 1-3:1
      5 0.00000 sleep      rh           qw    07/17/2009 13:47:50     1 5-1000:1
# qmod -cj "\*"
rh cleared error state of job 4
rh cleared error state of job-array task 5.1
rh cleared error state of job-array task 5.2
rh cleared error state of job-array task 5.3
Job-array task 5.4 is not in error state
# qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@z2                       BIP   0/1/1          4.04     sol-x86       
      5 0.55500 sleep      rh           r     07/17/2009 13:55:50     1 4
############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
      4 0.55500 sleep      rh           qw    07/17/2009 13:39:20     1        
      5 0.55500 sleep      rh           qw    07/17/2009 13:47:50     1 1-3:1
      5 0.55500 sleep      rh           qw    07/17/2009 13:47:50     1 5-1000:1
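Before clearing every error state with `qmod -cj "*"`, it can be worth looking at which jobs are affected. The following is a minimal sketch, assuming the job line layout of the `qstat -f` output above, where the job ID is the first field and the job state is the fifth. The function name `errored_job_ids` is hypothetical.

```shell
# Hypothetical helper: reads "qstat -f" output on stdin and prints each
# job ID once if any of its lines shows a state containing "E" (error).
# Job lines are recognized by a purely numeric first field.
errored_job_ids() {
    awk '$1 ~ /^[0-9]+$/ && $5 ~ /E/ && !seen[$1]++ { print $1 }'
}
```

With the first `qstat -f` listing above as input, this should report jobs 4 and 5. `qstat -j` with one of those job IDs should then show the error reason for that job.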





About

rhierlmeier
