By templedf on Jun 04, 2008
A common thing to want to do with Grid Engine is to let users request that their jobs be run as the only thing on the host(s). The naïve approach would be for the user to request a number of slots equal to the number of slots offered by the hosts, but for a plethora of reasons, that doesn't work. (Among the reasons are that we might not have the same number of slots per host, and more importantly, unless we're using a parallel environment that is configured for fill-up allocation, a job can't request all the slots on a host.) Let's talk through an approach that does work.
[Update: exclusive host access will now be a built-in feature of Sun Grid Engine 6.2u3.]
Let's think through this problem. A natural approach for a Grid Engine administrator would be to create a special queue on each host to which all other queues are subordinated. When jobs are running in that queue, then all other jobs on the system are suspended. That approach solves the problem (mostly), but it's a bit heavy-handed. Whenever an exclusive job gets put on a host, other jobs on that host get suspended until it is finished. If there is a steady stream of exclusive jobs, non-exclusive jobs could starve.
To fix that problem, you could set up circular subordination: make the other queues subordinate to the exclusive queue and the exclusive queue subordinate to the other queues. The effect of this circular subordination is that there can never be jobs in both the exclusive queue and any other queue, preventing the starvation issue. (If a job is running in a non-exclusive queue, the exclusive queue is unavailable (suspended), and vice versa.)
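As a rough sketch, the circular subordination might be configured like this. The queue names all.q and exclusive.q are placeholders, not names from the setup described here, and the =1 threshold assumes you want the other queue suspended as soon as a single slot is in use.

```sh
# Suspend all.q on a host as soon as one slot of exclusive.q is busy there.
# (subordinate_list on exclusive.q names the queues that exclusive.q suspends.)
qconf -aattr queue subordinate_list all.q=1 exclusive.q

# Suspend exclusive.q on a host as soon as one slot of all.q is busy there.
qconf -aattr queue subordinate_list exclusive.q=1 all.q
```

If you have more than one non-exclusive queue, each of them would presumably need the same pair of entries.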
Another problem that crops up is keeping non-exclusive jobs from accidentally ending up in the exclusive queue. That problem is easily solved with a forced resource assigned to the exclusive queue. With a forced resource, only jobs that either request the resource or explicitly request the exclusive queue can run in the exclusive queue.
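A minimal sketch of the forced resource, assuming a boolean complex named exclusive (shortcut excl), a queue named exclusive.q, and a job script myjob.sh; all three names are made up for illustration.

```sh
# 1. Add this line in the editor opened by `qconf -mc`. The columns are:
#    name, shortcut, type, relop, requestable, consumable, default, urgency.
#
#      exclusive  excl  BOOL  ==  FORCED  NO  0  0

# 2. Attach the resource to the exclusive queue.
qconf -aattr queue complex_values exclusive=TRUE exclusive.q

# 3. A job has to ask for the resource to be eligible for that queue.
qsub -l exclusive=true myjob.sh
```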
There's another problem. How do you keep multiple exclusive jobs from all running in the exclusive queue on the same host? One answer would be to only give the exclusive queue one slot. That works for non-parallel jobs and parallel jobs that are only allowed to run one slave per host. It does not work for parallel or parametric jobs where more than one task could (or should) run on a single host.
One solution would be to change the forced resource to a forced integer consumable with a value equal to the number of slots. A job could then theoretically request as much of that resource as each host has, making sure that there isn't any left over for other jobs. Unfortunately, that won't work. First, we still have the problem that our hosts might not all have the same number of slots. We could try to solve that problem by setting the exclusive queue's consumable's value to 1. That guarantees that only one job can get the resource. The problem there is that a parallel job consumes one set of resources for each slave, so a parallel job with two slaves on a host will need 2 of our consumable. We could try requesting 1/<num_slaves_per_host> of the consumable for such a parallel job, so that after multiplying by the number of slaves on the host, we end up with a request for 1. That only works, however, if every host will be running the same number of slaves per host, and if we know how many that is ahead of time.
"But, wait!" you say. "The consumable is an integer, so even if we request less than 1, we should still consume the entire resource!" You'd think so, but you'd be wrong. It turns out that if one job requests half of our resource, another job can still be assigned the other half, defeating our strategy.
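The arithmetic behind that failure is easy to check by hand. The following is only an illustration of the bookkeeping described above (per-slave request multiplied by the number of slaves on the host), not a model of the scheduler itself; the helper name consumed is invented for this example.

```sh
#!/bin/sh
# consumed REQUEST SLAVES
# Prints how much of the consumable one job uses on a single host:
# the per-slave request multiplied by the number of slaves placed there.
consumed() {
    LC_ALL=C awk -v r="$1" -v n="$2" 'BEGIN { printf "%.2f\n", r * n }'
}

# The queue offers 1 unit per host.
consumed 1 2     # two slaves requesting 1 each need 2 units: too many
consumed 0.5 2   # two slaves requesting 1/2 each use exactly 1 unit
consumed 0.5 1   # one slave requesting 1/2 uses only half a unit
```

The last case is the one that bites: half a unit is left over, so a second job making the same request still fits on the host.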
In order to solve the problem, we need to fundamentally prevent the scheduler from looking at hosts that are running exclusive jobs. Well, one way to do that would be to add the host to a special host group, say @exclusive, and use a resource quota set rule to prevent jobs from being scheduled to machines in that hostgroup. We can do that from a prolog on the exclusive queue.
    qconf -aattr hostgroup hostlist $HOST @exclusive

(Note that you don't need to remove the host from its current set of queues or host groups. The resource quota set rule obviates that need.)
Now, the circular subordination makes sure that jobs can run either in the exclusive queue or the other queues (but not both), our forced complex makes sure that only jobs that request exclusivity get it, and our prolog and resource quota set rule make sure that the scheduler cannot put multiple exclusive jobs on the same host. But, you guessed it, there's still a problem.
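Before moving on, here is one way the host group, the resource quota set rule, and the prolog might be wired together. Treat it as a sketch under assumptions: the rule name exclusive_hosts, the queue name exclusive.q, and the paths under /tmp and /opt/sge/scripts are all invented, and the prolog relies on $HOST being set in its environment, as in the command above.

```sh
# One-time setup, run on an admin host.

# Create the (initially empty) host group; save it from the editor that opens.
qconf -ahgrp @exclusive

# Resource quota set: no slots are available on any host in @exclusive.
cat > /tmp/exclusive_rqs <<'EOF'
{
   name         exclusive_hosts
   description  "No new jobs on hosts running an exclusive job"
   enabled      TRUE
   limit        hosts @exclusive to slots=0
}
EOF
qconf -Arqs /tmp/exclusive_rqs

# Prolog script: adds the execution host to @exclusive when the job starts.
cat > /opt/sge/scripts/exclusive_prolog.sh <<'EOF'
#!/bin/sh
qconf -aattr hostgroup hostlist "$HOST" @exclusive
EOF
chmod +x /opt/sge/scripts/exclusive_prolog.sh

# Attach the prolog to the exclusive queue.
qconf -mattr queue prolog /opt/sge/scripts/exclusive_prolog.sh exclusive.q
```

For the qconf call inside the prolog to succeed, the execution host probably has to be an admin host and the prolog has to run as a user with manager rights. A matching epilog that removes the host again (qconf -dattr hostgroup hostlist "$HOST" @exclusive) seems like the natural counterpart, though the approach as described here does not spell that out.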
Once a job starts running in the exclusive queue, everything works as intended. The problem is that the scheduler may put more than one exclusive job on the same host in the same scheduling interval. Because the host isn't added to the @exclusive host group until an exclusive job starts, we need to keep the scheduler from scheduling multiple exclusive jobs at the same time. That's where load adjustments come in. We can create a new resource, say exclusive_load, and set a load threshold for the exclusive queue based on that resource, say exclusive_load=1. By adding something like exclusive_load=50 to the job_load_adjustments attribute in the scheduler config (and probably also setting the load_adjustment_decay_time to something small, like 0:0:30), we force the scheduler to consider a host's exclusive queue to be full (for the current scheduler interval) whenever a job is put there. After the decay interval, the host becomes available to the scheduler again, but by that time the prolog should have added it to the @exclusive host group.
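Pulled together, the load adjustment settings might look roughly like this. The shortcut excl_load and the queue name exclusive.q are assumptions made for the example, and depending on your setup the hosts may also need a baseline value for exclusive_load before the threshold is evaluated.

```sh
# 1. Define the resource in the editor opened by `qconf -mc`. The columns are:
#    name, shortcut, type, relop, requestable, consumable, default, urgency.
#
#      exclusive_load  excl_load  INT  >=  NO  NO  0  0

# 2. Treat the exclusive queue as overloaded once exclusive_load reaches 1.
qconf -aattr queue load_thresholds exclusive_load=1 exclusive.q

# 3. In the editor opened by `qconf -msconf`, set these two attributes so that
#    every job the scheduler dispatches adds 50 to exclusive_load on its host,
#    and the adjustment decays away within 30 seconds:
#
#      job_load_adjustments        exclusive_load=50
#      load_adjustment_decay_time  0:0:30
```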
By the way, credit for the host group/load adjustment idea goes to Roland Dittel. Unfortunately, Roland doesn't have a blog, so I can't link to it. If you run into Roland, be sure to tell him how much you'd love to see him start blogging.