Better Preemption

Continuing with the new feature theme, this week we're talking about slotwise subordination (AKA slotwise preemption). Preemption is the notion that a higher priority job can bump a lower priority job out of the way so it can execute. Pretty simple notion. Some workload managers have an implicit concept of preemption. Sun Grid Engine does not. We have what I like to call "after-market preemption". The net result is the same.

In a workload manager with "built-in" preemption, like Platform LSF, it works by temporarily relaxing the slot count limit on a node and then resolving the oversubscription by bumping the lowest job on the totem pole to get the number of jobs back under the slot count limit. In Sun Grid Engine, the same thing happens, except that instead of the scheduler temporarily relaxing the slot count limits, you as the administrator configure the host with more slots than you need, plus a set of rules that create an artificial lower limit on the job count. That limit is enforced by bumping the lowest priority jobs. With Sun Grid Engine you have a little more control over the process, but you pay for it with some added complexity.

That set of rules that defines the artificial limit is called subordination. By subordinating one queue to another, you tell the master that jobs running in the subordinated queue are lower priority and should be preempted when necessary. Specifically, all jobs in the subordinated queue are suspended when a threshold number of jobs (usually 1) are scheduled into the queue to which it is subordinated.

Queue subordination in Sun Grid Engine was implemented long ago, when single-socket, single-core machines still roamed the Earth. Back in those days, there was generally only one job running per host, so the queuewise subordination scheme worked out just fine. Now that we're in the era of multi-core machines, suspending the entire subordinate queue tends to be a bad idea. Enter slotwise preemption. In a nutshell, slotwise preemption lets you set a specific limit on the number of jobs allowed to be running on a host, regardless of how many queues and slots there are. If too many jobs land on the host, jobs in the lowest ranking queue(s) will be suspended until the number of running jobs is under the limit.

(Note that slotwise subordination deals only with the running job count. If you want to limit the active job count (running + suspended), you can do that by making the slots complex a host-level resource and setting it to the desired limit.)

Let's look at some examples from the queue_conf(5) man page:

Assume we have a cluster of dual-core machines and two queues that span all the machines, A.q and B.q, each with two slots:

% qconf -sq A.q | grep subordinate_list
subordinate_list      slots=2(B.q:0:sr)
% qconf -sq B.q | grep subordinate_list
subordinate_list      NONE

This configuration says that there are four slots available on each host (2 in each queue), but that only 2 jobs may be running on any host at any given time. If more than 2 jobs end up on a node, the excess jobs will be suspended. Because B.q is subordinated to A.q, the excess jobs will always come from B.q.
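To make the pieces of that subordinate_list value explicit, here is a minimal Python sketch that splits a slotwise entry into its limit and its (queue, sequence number, action) triples. The function name `parse_slotwise` is hypothetical and is not part of Sun Grid Engine, and the defaults used when the sequence number or action is left out (0 and "sr") are my assumption, so check queue_conf(5) before relying on them.

```python
import re


def parse_slotwise(value):
    """Parse a slotwise subordinate_list value such as 'slots=2(B.q:0:sr)'.

    Returns None for 'NONE', otherwise (limit, [(queue, seq_no, action), ...]).
    Hypothetical helper for illustration only.
    """
    value = value.strip()
    if value == "NONE":
        return None
    match = re.fullmatch(r"slots=(\d+)\((.*)\)", value)
    if match is None:
        raise ValueError(f"not a slotwise subordinate_list: {value!r}")
    limit = int(match.group(1))
    entries = []
    for item in match.group(2).split(","):
        parts = item.strip().split(":")
        queue = parts[0]
        # Assumed defaults when the optional fields are omitted.
        seq_no = int(parts[1]) if len(parts) > 1 else 0
        action = parts[2] if len(parts) > 2 else "sr"
        entries.append((queue, seq_no, action))
    return limit, entries


if __name__ == "__main__":
    print(parse_slotwise("slots=2(B.q:0:sr)"))
```

For A.q above, the result reads as "at most 2 running jobs on the host, and the excess is taken from B.q".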

Let's talk about the difference between queue-wise and slot-wise suspension for this example. With queue-wise suspension, you'd have two choices: either a single job in A.q would suspend all jobs in B.q, or two jobs in A.q would suspend all jobs in B.q. The choice is either undersubscribing (with one running job in A.q and two suspended jobs in B.q) or oversubscribing (with one running job in A.q and two running jobs in B.q). With slot-wise suspension, a job running in A.q will only suspend a job running in B.q if there are more than two running jobs on the host. We will therefore never oversubscribe, and we'll never undersubscribe as long as there's a job available to run.
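The arithmetic behind that comparison can be sketched in a few lines of Python. This is a toy model of the two-queue example only, not how the scheduler is implemented, and both function names are made up for the illustration.

```python
def running_queuewise(a_jobs, b_jobs, threshold):
    """Queue-wise: all of B.q is suspended once A.q holds `threshold` jobs."""
    b_running = 0 if a_jobs >= threshold else b_jobs
    return a_jobs + b_running


def running_slotwise(a_jobs, b_jobs, limit):
    """Slot-wise: only as many B.q jobs are suspended as exceed the host limit."""
    total = a_jobs + b_jobs
    excess = max(0, total - limit)
    b_suspended = min(b_jobs, excess)
    return total - b_suspended


if __name__ == "__main__":
    # One job in A.q and two jobs in B.q on a dual-core host.
    print(running_queuewise(1, 2, threshold=1))  # 1 running: undersubscribed
    print(running_queuewise(1, 2, threshold=2))  # 3 running: oversubscribed
    print(running_slotwise(1, 2, limit=2))       # 2 running: matches the cores
```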

Let's look at a more complex example:

% qconf -sq A.q | grep subordinate_list
subordinate_list      slots=2(B.q:1:sr,C.q:2:lr)
% qconf -sq B.q | grep subordinate_list
subordinate_list      NONE
% qconf -sq C.q | grep subordinate_list
subordinate_list      NONE

We've added a third queue, and we now have a very simple tree. Both B.q and C.q are subordinated to A.q, but there are still only 2 slots available for running jobs. If a host is scheduled with more than two running jobs, jobs will be suspended until we get down to two, just like before. What's different is that there's now a pecking order for the subordinated queues. Because B.q has a lower sequence number (1) than C.q (2), it is higher priority. That means we'll suspend jobs from C.q first, before suspending jobs from B.q. What's also different is how we pick the job to suspend. In B.q in both examples, the action is listed as "sr", which means to suspend the shortest running job. In C.q in this example, the action is "lr", which means to suspend the longest running job.
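Here is a rough Python sketch of that pecking order, under the rules as described above: the subordinated queue with the highest sequence number gives up jobs first, and within a queue the action decides whether the shortest ("sr") or longest ("lr") running job goes first. The function name, the job tuples, and the runtimes are invented for illustration, and tie-breaking between queues with equal sequence numbers is not modeled.

```python
def pick_victims(running, limit, subordinates):
    """Return the job ids to suspend so that at most `limit` jobs keep running.

    running:      list of (job_id, queue, runtime_seconds) for jobs on one host
    subordinates: {queue: (seq_no, action)} taken from the subordinate_list
    """
    excess = len(running) - limit
    victims = []
    # Highest sequence number is the lowest priority, so it is drained first.
    by_priority = sorted(subordinates, key=lambda q: subordinates[q][0], reverse=True)
    for queue in by_priority:
        action = subordinates[queue][1]
        jobs = [job for job in running if job[1] == queue]
        # "sr": shortest running first; "lr": longest running first.
        jobs.sort(key=lambda job: job[2], reverse=(action == "lr"))
        for job in jobs:
            if excess <= 0:
                return victims
            victims.append(job[0])
            excess -= 1
    return victims


if __name__ == "__main__":
    subs = {"B.q": (1, "sr"), "C.q": (2, "lr")}
    jobs = [("a1", "A.q", 500), ("b1", "B.q", 300), ("b2", "B.q", 100),
            ("c1", "C.q", 50), ("c2", "C.q", 900)]
    print(pick_victims(jobs, limit=2, subordinates=subs))
```

With five running jobs and a limit of two, the sketch picks both C.q jobs (longest running first) and then the shortest running B.q job.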

One more example:

% qconf -sq A.q | grep subordinate_list
subordinate_list      slots=3(B.q:0:sr)
% qconf -sq B.q | grep subordinate_list
subordinate_list      slots=2(C.q:0:sr)
% qconf -sq C.q | grep subordinate_list
subordinate_list      NONE

Now we have a tree with more than two levels: C.q is subordinated to B.q, which is subordinated to A.q. Between B.q and C.q up to two jobs are allowed to be running, with B.q's jobs taking priority. Among A.q, B.q, and C.q, up to three jobs are allowed to be running, with A.q's jobs taking priority over B.q's jobs, and B.q's jobs taking priority over C.q's jobs. Now look carefully. Where did I specify that C.q should be subordinated to A.q? I didn't. It's implicit. Whenever you have a multi-level subordination tree, a node has its entire subtree subordinated to it, whether it's explicitly specified or not, with priority between levels handled according to depth in the tree and priority within a level handled according to sequence numbers. Because of this implicit subordination, it never makes sense to have a higher slot limit lower down in the tree. The lower slot limit higher up in the tree will always take precedence.
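The implicit part is easy to see if you write the tree down as data. The following Python sketch is only an illustration of the rule described above: the dictionary layout and both function names are my own, and nothing here talks to a real qmaster. It lists every queue covered by a node's limit and flags any limit that is higher than one above it in the tree.

```python
# Hypothetical encoding of the example: queue -> (slot limit, subordinated queues).
CONFIG = {
    "A.q": (3, ["B.q"]),
    "B.q": (2, ["C.q"]),
    "C.q": None,  # subordinate_list NONE
}


def subtree(config, queue):
    """Return `queue` plus every queue subordinated to it, explicitly or implicitly."""
    members = [queue]
    entry = config.get(queue)
    if entry is not None:
        for child in entry[1]:
            members.extend(subtree(config, child))
    return members


def ineffective_limits(config, queue, inherited=None):
    """List queues whose slot limit is higher than a limit above them in the tree."""
    entry = config.get(queue)
    if entry is None:
        return []
    limit, children = entry
    flagged = [queue] if inherited is not None and limit > inherited else []
    tightest = limit if inherited is None else min(limit, inherited)
    for child in children:
        flagged.extend(ineffective_limits(config, child, tightest))
    return flagged


if __name__ == "__main__":
    print(subtree(CONFIG, "A.q"))             # C.q shows up under A.q
    print(ineffective_limits(CONFIG, "A.q"))  # nothing to flag in this example
```

Swapping the two limits (slots=2 on A.q and slots=3 on B.q) would make the second function flag B.q, since the limit of 2 higher up already caps the whole subtree.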

Hopefully slotwise subordination now makes sense, and you can see why it's important. Basically it brings Sun Grid Engine's preemption capabilities up to date with modern hardware, making it more efficient and more useful.

There is, however, one notable caveat I have to point out. With queue-wise suspension, when a subordinated queue has its jobs suspended, the queue itself is also suspended, preventing any other jobs from landing in that queue. That's not the case with slotwise subordination. It's possible for the scheduler to place a job into a subordinated queue where that job will immediately be suspended. Imagine in our first example above that A.q has two running jobs in it while B.q is empty. B.q remains a valid scheduling target, and any job that lands there will immediately be suspended because it violates the slotwise limit. The workaround is to use job load adjustments to make sure that hosts with running jobs are appropriately unattractive scheduling targets. Not a show-stopper, but definitely important to be aware of. We will address the issue in the next couple of releases.
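A toy Python model of that caveat, using the numbers from the first example (two slots in B.q, a slotwise limit of two). It assumes, as described above, that placement only looks at whether B.q has a free slot on the host. The function name and return values are made up for the illustration.

```python
def schedule_into_b(a_running, b_running, b_slots=2, limit=2):
    """Return what happens to a new job placed in B.q on one host.

    'rejected'  - B.q has no free slot, so the job cannot land here
    'suspended' - the job lands, but pushes the host over the slotwise limit
    'running'   - the job lands and keeps running
    """
    if b_running >= b_slots:
        return "rejected"
    if a_running + b_running + 1 > limit:
        return "suspended"
    return "running"


if __name__ == "__main__":
    # A.q is full and B.q is empty: the new B.q job is accepted, then suspended.
    print(schedule_into_b(a_running=2, b_running=0))
```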


Sun Grid Engine is not my favorite anyway.

Posted by Abercrombie And Fitch on January 28, 2010 at 04:43 PM PST #

Very interesting, but suspended jobs take up resources. What if an administrator would like the lower priority jobs to be completely "evicted" from the execute node? That is to say, the lower priority job should be rescheduled for execution on another machine... Should there be a cron job running to detect and evict jobs that are in the suspended state?

Posted by Victor on February 11, 2010 at 09:38 AM PST #

Excellent points. Today the only option for getting that behavior is to use a checkpointing setup. If SGE thinks your lower priority jobs are checkpointable, it will relocate them rather than suspending them. Note, they don't actually have to be checkpointable. You just have to tell SGE that they are.

If you go read the spec for the slotwise suspend on subordinate feature, it includes a way to specify that the lower priority jobs should be requeued instead of suspended. The functionality simply didn't make it into the u5 release. I can't comment on when it will make it into a release, but know that it's on our minds.

Posted by Daniel Templeton on February 11, 2010 at 01:12 PM PST #

Slotwise subordination appears to be only useful for single slot jobs. A multi-slot parallel job like an OpenMPI job may use many slots but still only be counted as one job for the purposes of subordination tally. Would it not have been better to count total slots? "slotwise" seems like a misnomer here.

Posted by Gary Smith on April 15, 2010 at 05:11 AM PDT #

> If you go read the spec for the slotwise suspend on subordinate feature, it includes a way to specify that the lower priority jobs should be requeued instead of suspended.

The remaining question is: do the low priority jobs get requeued BEFORE or AFTER the high priority job actually starts executing on the node?

Posted by Piavlo on May 08, 2010 at 06:33 PM PDT #
