Wednesday Jun 04, 2008

Exclusive Host Usage In Grid Engine

A common thing to want to do with Grid Engine is to let users request that their jobs be run as the only thing on the host(s). The naïve approach would be for the user to request a number of slots equal to the number of slots offered by the hosts, but for a plethora of reasons, that doesn't work. (Among the reasons are that we might not have the same number of slots per host, and more importantly, unless we're using a parallel environment that is configured for fill-up allocation, a job can't request all the slots on a host.) Let's talk through an approach that does work.

[Update: exclusive host access will now be a built-in feature of Sun Grid Engine 6.2u3.]

Let's think through this problem. A natural approach for a Grid Engine administrator would be to create a special queue on each host to which all other queues are subordinated. When jobs are running in that queue, then all other jobs on the system are suspended. That approach solves the problem (mostly), but it's a bit heavy-handed. Whenever an exclusive job gets put on a host, other jobs on that host get suspended until it is finished. If there is a steady stream of exclusive jobs, non-exclusive jobs could starve.

To fix that problem, you could set up circular subordination: make the other queues subordinate to the exclusive queue and the exclusive queue subordinate to the other queues. The effect of this circular subordination is that there can never be jobs in both the exclusive queue and any other queue, preventing the starvation issue. (If a job is running in a non-exclusive queue, the exclusive queue is unavailable (suspended), and vice versa.)

Another problem that crops up is keeping non-exclusive jobs from accidentally ending up in the exclusive queue. That problem is easily solved with a forced resource assigned to the exclusive queue. With a forced resource, only jobs that either request the resource or explicitly request the exclusive queue can run in the exclusive queue.

There's another problem. How do you keep multiple exclusive jobs from all running in the exclusive queue on the same host? One answer would be to only give the exclusive queue one slot. That works for non-parallel jobs and parallel jobs that are only allowed to run one slave per host. It does not work for parallel or parametric jobs where more than one task could (or should) run on a single host. One solution would be to change the forced resource to a forced integer consumable with a value equal to the number of slots. A job could then theoretically request as much of that resource as each host has, making sure that there isn't any left over for other jobs. Unfortunately, that won't work. First, we still have the problem that our hosts might not all have the same number of slots. We could try to solve that problem by setting the exclusive queue's consumable's value to 1. That guarantees that only one job can get the resource. The problem there is that a parallel job consumes one set of resources for each slave, so a parallel job with two slaves on a host will need 2 of our consumable. We could try requesting 1/<num_slaves_per_host> of the consumable for such a parallel job, so that after multiplying by the number of slaves on the host, we end up with a request for 1. That only works, however, if every host will be running the same number of slaves per host, and if we know how many that is ahead of time. "But, wait!" you say. "The consumable is an integer, so even if we request less than 1, we should still consume the entire resource!" You'd think so, but you'd be wrong. It turns out that if one job requests half of our resource, another job can still be assigned the other half, defeating our strategy.

In order to solve the problem, we need to fundamentally prevent the scheduler from looking at hosts that are running exclusive jobs. Well, one way to do that would be to add the host to a special host group, say @exclusive, and use a resource quota set rule to prevent jobs from being scheduled to machines in that hostgroup. We can do that from a prolog on the exclusive queue. qconf -aattr hostgroup hostlist $HOST @exclusive (Note, that you don't need to remove the host from its current set of queues or host groups. The resource quota set rule obviates that need.) Now, the circular subordination makes sure that jobs can run either in the exclusive queue or the other queues (but not both), our forced complex makes sure that only jobs that request exclusivity get it, and our prolog and resource quota set rule make sure that the scheduler cannot put multiple exclusive jobs on the same host. But, you guessed it, there's still a problem.

Once a job starts running in the exclusive queue, everything works as intended. The problem is that the scheduler may put more than one exclusive job on the same host at the same time. Because the host isn't removed from the host group until an exclusive jobs starts, we need to keep the scheduler from scheduling multiple exclusive jobs at the same time. That's where load adjustments come in. We can create a new resource, say exclusive_load, and set a load threshold for the exclusive queue based on that resource, say exclusive_load=1. By adding something like exclusive_load=50 to the job_load_adjustments attribute in the scheduler config (and probably also setting the load_adjustment_decay_time to something small, like 0:0:30), we force the scheduler to consider a host's exclusive queue to be full (for the current scheduler interval) whenever a job is put there. After the decay interval, the host becomes available to the scheduler again, but by that time the prolog should have removed it from the host group.

QED (Whew!)

By the way, credit for the host group/load adjustment idea goes to Roland Dittel. Unfortunately, Roland doesn't have a blog, so I can't link to it. If you run into Roland, be sure to tell him how much you'd love to see him start blogging.

Defining the Process Owner For Prologs & Epilogs

I've been working in the Grid Engine team for over five years ago, and I'm still learning about features of the product that I never knew about. One more was just brought to my attention.

When configuring a queue in Grid Engine, you can configure a prolog and epilog. The prolog is a script or binary that is run by the shepherd before running a job. The epilog is the same, except that it comes after a job finishes. When you set the prolog and epilog for a queue, all jobs that run in that queue inherit that prolog and epilog. (A job cannot specify its own prolog and epilog, but look for that to change in a future release. (Actually, if you configure your queue's prolog and epilog to read a custom environment variable in the job's environment and exec the path it contains, you can effectively allow a job to specify its own prolog and epilog by setting them in the environment variables.))

The epilog and prolog are well known tools. What I never noticed, though, is that not only can you specify a path, but you can also specify the user as whom the prolog or epilog should run. For example, if you set the queue's prolog to root@/path/to/my/prolog, the shepherd will execute the prolog as root, no matter who submitted the job. This is really helpful if your prolog and/or epilog needs to do something that has restricted access, such as mounting a directory or modifying the grid configuration. Because only the administrator can change the queue configuration, this feature is not a big security risk. (Actually, this feature is a compelling reason for restricting who has manager rights on your grid. Anyone who is recognized as a grid manager could change a queue to run a malicious prolog/epilog as root, submit a job to that queue, and compromise the system.)

Tuesday May 27, 2008

Making Grid Engine HA with Open High Availability Cluster and OpenSolaris

At the Open Source Grid & Cluster Conference a couple of weeks ago, Ashu from the Solaris Cluster team gave a 30-minute presentation about building a highly available Grid Engine cluster using the Open HA Cluster project. (Open HA Cluster is the open-sourced Solaris Cluster.) If you've got a spare 30 minutes, it's worth a look.

Intro to Grid Engine Queues

I just posted this information as answer to a question on the Grid Engine users mailing list, but I thought it was useful enough to post here, too. If you're new to Grid Engine and trying to understand what a queue is, hopefully this explanation will help.

Let's take it from the top. A queue is where a job runs, not where it waits to run. When a job is in the qw (queued and waiting) state, it has not yet been assigned to a queue. A job that has been assigned to a queue is in the r (running) state (or transferring or suspended). In the pre-6.0 days, a queue could only exist on a single host. With 6.0, we introduced the idea of cluster queues. A cluster queue is a queue that can span multiple hosts. Under the covers, it's essentially a group of pre-6.0 queues, all with the same name, and each on a different host. With one caveat. A pre-6.0 queue is composed of a long list of required attributes, like slots, pe_list, user_list, etc. Starting with 6.0, that long list of attributes is only required for the cluster queue. All of the queue instances that belong to that cluster queue inherit the attribute values from it. The queue instances are allowed, however, to override those attribute values with local settings. A common example of that is the slots attribute. When you install an execution daemon using the install_execd script, it will add a slots setting for the queue instance of all.q on that host (noted as all.q@host). And if it wasn't already clear, pre-6.0 "queue" == post-6.0 "queue instance". Post-6.0 "queue" == "cluster queue".

So, aside from governing the number of free slots on a host, what does a queue do? It controls the execution context of jobs that run in it. It determines what parallel environments are available, what file, memory, and CPU time limits should be applied, how the job should be started, stopped, suspended, and resumed, what the job's process' nice value is, etc.

Queues also have a concept of subordination. A queue that is subordinated to another queue will be suspended (along with all the jobs running in it) when jobs are running in that other queue. By default, the subordinated queue will be suspended when the other queue is full, but you can set the number of jobs required to suspend the subordinated queue. 1 is a common value, meaning that the subordinated queue should be suspended if any jobs are running in the other queue. Subordination trees can be arbitrarily complex. Circular subordination schemes are permitted, producing a sort of mutual exclusion effect.

One other oddity to point out is that the slot count for a queue is not really a queue attribute. It's actually a queue-level resource (aka complex). To allow multiple queues on the same host to share that host's CPUs without oversubscribing, you can set the slots resource at the host level. Doing so sets a host-wide slot limit, and all queues on that host must then share the given number of slots, regardless of how many slots each queue (or queue instance) may try to offer.

Since we're talking about resources, let's talk about one of the common queue/resource configuration patterns. By default, there's nothing (other than access lists) to prevent a stray job from wandering into a queue. That's bad for queues that govern expensive resources or that represent special access, like a priority queue. To solve this problem, the most common approach is to create a resource that is forced. A forced resource (one that has FORCED in the requestable column) has the property that any queue or host that offers that resource can only be used by jobs requesting that resource (or that queue or host, in which case, the resource request is implicit). By assigning such queues forced resources, you can guarantee that stray jobs can't end up in the queue. A nice side effect is that you can also assign an urgency to that resource, meaning that jobs requesting that resource (or the queue to which it's assigned) gain (or lose) priority when being scheduled.

For more information on the above topics, I recommend looking at the man pages for queue_conf(5), complex(5), and sge_priority(5).




« July 2016