By templedf on Jan 28, 2010
Continuing with the new feature theme, this week we're talking about slotwise subordination (AKA slotwise preemption). Preemption is the notion that a higher-priority job can bump a lower-priority job out of the way so it can execute. Pretty simple notion. Some workload managers have an implicit concept of preemption. Sun Grid Engine does not. We have what I like to call "after-market preemption".
In a workload manager with "built-in" preemption, like Platform LSF, it works by temporarily relaxing the slot count limit on a node and then resolving the oversubscription by bumping the lowest job on the totem pole to get the number of jobs back under the slot count limit. In Sun Grid Engine, the same thing happens, except that instead of the scheduler temporarily relaxing the slot count limits, you as the administrator configure the host with more slots than you need, plus a set of rules that create an artificial lower limit on the job count. That limit is enforced by bumping the lowest-priority jobs. It nets out to the same thing. With Sun Grid Engine you have a little more control over the process, but you pay for it with some added complexity.
That set of rules that defines the artificial limit is called subordination. By subordinating one queue to another, you tell the master that jobs running in the subordinated queue are lower priority and should be preempted when necessary. Specifically, all jobs in the subordinated queue are suspended when a threshold number of jobs (usually 1) are scheduled into the queue to which it is subordinated.
Queue subordination in Sun Grid Engine was implemented long ago, when single-socket, single-core machines still roamed the Earth. Back in those days, there was generally only one job running per host, so the queuewise subordination scheme worked out just fine. Now that we're in the era of multi-core machines, suspending the entire subordinate queue tends to be a bad idea. Enter slotwise preemption. In a nutshell, slotwise preemption lets you set a specific limit on the number of jobs allowed to be running on a host, regardless of how many queues and slots there are. If too many jobs land on the host, jobs in the lowest-ranking queue(s) will be suspended until the number of running jobs is back within the limit.
(Note that slotwise subordination deals only with the running job count. If you want to limit the active job count (running + suspended), you can do that by making the slots complex a host-level resource and setting it to the desired limit.)
Let's look at some examples from the queue_conf(5) man page:
Assume we have a cluster of dual-core machines and two queues that span all the machines, A.q and B.q, each with two slots:
% qconf -sq A.q | grep subordinate_list
subordinate_list slots=2(B.q:0:sr)
% qconf -sq B.q | grep subordinate_list
subordinate_list NONE
This configuration says that there are four slots available on each host (2 in each queue), but that only 2 jobs may be running on any host at any given time. If more than 2 jobs end up on a node, the excess jobs are suspended. Because B.q is subordinated to A.q, the excess jobs will always come from B.q.
Let's talk about the difference between queue-wise and slot-wise suspension for this example. With queue-wise suspension, you'd have two choices: either a single job in A.q would suspend all jobs in B.q, or two jobs in A.q would suspend all jobs in B.q. The choice is either undersubscribing (with one running job in A.q and two suspended jobs in B.q) or oversubscribing (with one running job in A.q and two running jobs in B.q). With slot-wise suspension, a job running in A.q will only suspend a job running in B.q if there are more than two running jobs on the host. We will therefore never oversubscribe, and we'll never undersubscribe as long as there's a job available to run.
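To make the comparison concrete, here is a minimal Python sketch of the arithmetic for this two-queue example. It is only a toy model of the behavior described above, not Grid Engine code; the function names and the `threshold` and `limit` parameters are made up for illustration.

```python
def queuewise_running(a_jobs, b_jobs, threshold):
    """Queue-wise: ALL of B.q is suspended once A.q holds `threshold` jobs."""
    b_running = 0 if a_jobs >= threshold else b_jobs
    return a_jobs + b_running


def slotwise_running(a_jobs, b_jobs, limit):
    """Slot-wise: suspend only as many B.q jobs as exceed the host limit."""
    excess = max(0, a_jobs + b_jobs - limit)
    b_suspended = min(excess, b_jobs)
    return a_jobs + b_jobs - b_suspended


# One job in A.q and two jobs in B.q on a dual-core host.
print(queuewise_running(1, 2, threshold=1))  # 1: undersubscribed
print(queuewise_running(1, 2, threshold=2))  # 3: oversubscribed
print(slotwise_running(1, 2, limit=2))       # 2: exactly full
```

Only the slot-wise rule keeps both cores busy without running a third job.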
Let's look at a more complex example:
% qconf -sq A.q | grep subordinate_list
subordinate_list slots=2(B.q:1:sr,C.q:2:lr)
% qconf -sq B.q | grep subordinate_list
subordinate_list NONE
% qconf -sq C.q | grep subordinate_list
subordinate_list NONE
We've added a third queue, and we now have a very simple tree. Both B.q and C.q are subordinated to A.q, but there are still only 2 slots available for running jobs. If a host is scheduled with more than two running jobs, jobs will be suspended until we get down to two, just like before. What's different is that there's now a pecking order for the subordinated queues. Because B.q has a lower sequence number (1) than C.q (2), it is higher priority. That means we'll suspend jobs from C.q first, before suspending jobs from B.q. What's also different is how we pick the job to suspend. In B.q in both examples, the action is listed as "sr", which means to suspend the shortest running job. In C.q in this example, the action is "lr", which means to suspend the longest running job.
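The selection rules just described can be sketched in a few lines of Python. This is a hedged illustration of the ordering (highest sequence number first, then "sr" or "lr" within a queue), not how the scheduler is actually implemented; `pick_victims`, the job tuples, and the start times are all hypothetical.

```python
def pick_victims(jobs, subordinates, limit):
    """Return names of jobs to suspend so at most `limit` keep running.

    jobs: list of (name, queue, start_time) for running jobs on one host.
    subordinates: {queue: (seq_no, action)}, where action is "sr" or "lr".
    """
    excess = len(jobs) - limit
    victims = []
    # Highest sequence number means lowest priority, so drain it first.
    by_priority = sorted(subordinates.items(),
                         key=lambda item: item[1][0], reverse=True)
    for queue, (seq_no, action) in by_priority:
        in_queue = [job for job in jobs if job[1] == queue]
        # "sr": latest start time first; "lr": earliest start time first.
        in_queue.sort(key=lambda job: job[2], reverse=(action == "sr"))
        for job in in_queue:
            if len(victims) >= excess:
                return victims
            victims.append(job[0])
    return victims


subordinates = {"B.q": (1, "sr"), "C.q": (2, "lr")}
jobs = [("a1", "A.q", 100), ("b1", "B.q", 200),
        ("b2", "B.q", 400), ("c1", "C.q", 50)]
print(pick_victims(jobs, subordinates, limit=2))  # ['c1', 'b2']
```

With four running jobs and a limit of two, the C.q job goes first, and then B.q gives up its most recently started job.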
One more example:
% qconf -sq A.q | grep subordinate_list
subordinate_list slots=3(B.q:0:sr)
% qconf -sq B.q | grep subordinate_list
subordinate_list slots=2(C.q:0:sr)
% qconf -sq C.q | grep subordinate_list
subordinate_list NONE
Now we have a tree with more than two levels: C.q is subordinated to B.q, which is subordinated to A.q. Between B.q and C.q up to two jobs are allowed to be running, with B.q's jobs taking priority. Among A.q, B.q, and C.q, up to three jobs are allowed to be running, with A.q's jobs taking priority over B.q's jobs, and B.q's jobs taking priority over C.q's jobs. Now look carefully. Where did I specify that C.q should be subordinated to A.q? I didn't. It's implicit. Whenever you have a multi-level subordination tree, a node has its entire subtree subordinated to it, whether it's explicitly specified or not, with priority between nodes handled according to depth in the tree and priority within a level handled according to sequence numbers. Because of this implicit subordination, it never makes sense to have a higher slot limit lower down in the tree: the lower slot limit higher up in the tree will always take precedence.
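One way to picture the implicit subordination is to model the tree and check each limit against the whole subtree below it. The following Python sketch assumes a simple dictionary representation that I made up for illustration (`TREE`, `subtree`, and `excess_jobs` are not Grid Engine names), and it only counts running jobs rather than choosing which ones to suspend.

```python
# Hypothetical model of the three-queue configuration above.
TREE = {
    "A.q": {"limit": 3, "subs": ["B.q"]},
    "B.q": {"limit": 2, "subs": ["C.q"]},
    "C.q": {"limit": None, "subs": []},
}


def subtree(tree, queue):
    """Return `queue` plus every queue explicitly or implicitly below it."""
    result = [queue]
    for child in tree[queue]["subs"]:
        result.extend(subtree(tree, child))
    return result


def excess_jobs(tree, running):
    """Map each limited queue to how far its subtree exceeds its limit."""
    over = {}
    for queue, conf in tree.items():
        if conf["limit"] is None:
            continue
        total = sum(running.get(q, 0) for q in subtree(tree, queue))
        if total > conf["limit"]:
            over[queue] = total - conf["limit"]
    return over


print(subtree(TREE, "A.q"))  # ['A.q', 'B.q', 'C.q']
print(excess_jobs(TREE, {"A.q": 1, "B.q": 1, "C.q": 2}))  # {'A.q': 1, 'B.q': 1}
```

In the sample run, both limits are exceeded by one job, and suspending a single C.q job satisfies both at once, which matches C.q sitting at the bottom of the pecking order.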
Hopefully slotwise subordination now makes sense, and you can see why it's important. Basically it brings Sun Grid Engine's preemption capabilities up to date with modern hardware, making it more efficient and more useful.
There is, however, one notable caveat I have to point out. With queue-wise suspension, when a subordinated queue has its jobs suspended, the queue itself is also suspended, preventing any other jobs from landing in that queue. That's not the case with slotwise subordination. It's possible for the scheduler to place a job into a subordinated queue where that job will immediately be suspended. Imagine in our first example above that A.q has two running jobs in it while B.q is empty. B.q remains a valid scheduling target, and any job that lands there will immediately be suspended because it violates the slotwise limit. The workaround is to use job load adjustments to make sure that hosts with running jobs are appropriately unattractive scheduling targets. Not a show-stopper, but definitely important to be aware of. We will address the issue in the next couple of releases.