Job Scheduling With Grid Engine
By templedf on Nov 30, 2007
Grid Engine is a Distributed Resource Manager (DRM). Its job is to match workload to resources in the best way possible, otherwise known as scheduling jobs. What Grid Engine doesn't do is provide an automated means to run certain jobs at certain times of the day. For example, I have an application that I want started at 8:00 and stopped at 17:00, and I want that to happen every weekday of the year. You could certainly use Grid Engine to find a home for the application instance when you start it and to stop the job without needing to know where it ended up, but you can't give Grid Engine an application path and a schedule and have it do the starting and stopping for you automatically. At least not out of the box.
For that kind of automated job management, one would normally use some kind of job management tool, like cron or Autosys. In some cases, however, you can use Grid Engine to solve the problem. In this post, I'll talk about how you'd configure Grid Engine to do automated job management, and the administrative implications of doing so.
Grid Engine does have a feature that can be controlled by a schedule, and that's the enabling/disabling/suspending/resuming of queues. By configuring a calendar, you can tell Grid Engine to enable, disable, suspend or resume a queue according to a specified periodic schedule. (The difference between disabling and suspending is that a disabled queue stops accepting jobs, whereas a suspended queue stops accepting jobs and suspends all currently executing jobs.)
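To make the target schedule concrete, here is a small Python model of the weekday 8:00 to 17:00 window from the example above. It is only a sketch of the intended behavior, not Grid Engine's calendar implementation or its configuration syntax, and the function name `queue_state` is made up for illustration.

```python
from datetime import datetime


def queue_state(now: datetime, start_hour: int = 8, end_hour: int = 17) -> str:
    """Return 'enabled' inside the weekday window, 'suspended' otherwise.

    Hypothetical helper: models the schedule we want a calendar to enforce.
    """
    is_weekday = now.weekday() < 5                 # Monday=0 ... Friday=4
    in_window = start_hour <= now.hour < end_hour  # 8:00 inclusive, 17:00 exclusive
    return "enabled" if is_weekday and in_window else "suspended"
```

A calendar attached to the queue would need to produce the same answers: enabled on weekdays during working hours, suspended the rest of the time.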
Grid Engine also has support for checkpointing environments. While we do not provide any native checkpointing facilities, we allow you to configure the system to recognize and use whatever checkpointing environments your jobs may be using. One of the advantages given to a job by using a checkpointing environment is that Grid Engine is able to migrate the job under certain conditions. For example, when a job is suspended, if that job is using a checkpointing environment, the job can instead be migrated to another machine. A migration essentially consists of executing a checkpoint, stopping the executing job and resubmitting it. When the job is scheduled to a new machine, it can read its previous state from the last checkpoint and continue where it left off. In order for Grid Engine to know how to initiate a checkpoint, you tell it in the checkpointing environment configuration what commands to execute for what actions.
Why the tangent on checkpointing? Remember that what we want to do is have a job started and stopped at scheduled times. Using calendars, we could configure a queue that is suspended during the times when our jobs aren't supposed to be running. If we submit our jobs to that queue, when the queue is suspended, the jobs will also be suspended, but that's not quite what we're looking for. That scenario doesn't provide any way for the job to do cleanup after the day's operations, and if it's caught in mid-operation, that operation gets suspended until the queue is resumed, possibly leaving the data in an inconsistent state. We can, however, configure a checkpointing environment that does nothing. If, when we submit our jobs to our queue, we say that they use this "null" checkpointing environment, Grid Engine will know that instead of suspending them, it can migrate them. Because the checkpointing environment has no action commands configured, the migration becomes just terminating the job and resubmitting it. Because our jobs request to be run in our special queue, and because the queue is at that point suspended, our jobs will remain pending until the next time the queue is resumed. Nifty, huh? You could also get the same effect by setting the suspend method for the queue to use qmod -rj to reschedule the job instead of sending it the usual SIGSTOP.
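As a rough illustration of the submission step, the sketch below builds a qsub command line that pins a job to the calendar-controlled queue and names the "null" checkpointing environment. The queue name `daytime.q`, the environment name `null_ckpt`, the script path, and the helper `build_qsub_command` are all hypothetical; only the `-q` and `-ckpt` options come from qsub itself.

```python
from typing import List


def build_qsub_command(script: str, queue: str, ckpt_env: str) -> List[str]:
    """Build (but do not run) a qsub command for a calendar-managed job.

    -q     requests the queue whose calendar controls the job
    -ckpt  names the checkpointing environment the job claims to use
    """
    return ["qsub", "-q", queue, "-ckpt", ckpt_env, script]


# Hypothetical names, for illustration only.
command = build_qsub_command("/path/to/app.sh", "daytime.q", "null_ckpt")
```

On a submit host, the resulting list could be handed to `subprocess.run(command, check=True)`.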
Let's look at what that means administratively. Chiefly, it means that the administrator is managing the schedule of the queues instead of the schedule of the jobs. The jobs will run on indefinitely, being started, stopped, and restarted based on the suspension and resumption of the queues. This scheme has three implications. First, it means that jobs must not be allowed to end, either through failure or termination or "natural causes." To that end, it may be useful to configure the epilog script for all of the queues to exit with exit code 99. Exit code 99 tells Grid Engine that the job should be rescheduled. Having the epilog scripts always return 99 guarantees that if a job ever ends for any reason, it will automatically be restarted. Second, it means that the administrator has to think in terms of indirect effects instead of direct actions. For example, to change the schedule for a job, instead of changing anything about the job, the administrator has to change the schedule for the queue in which the job will run. Third, every job schedule requires a separate queue, as each queue is only able to adhere to a single schedule. If you have an environment where there are a large number of jobs with differing schedules, you can end up with a large number of queues, which can make maintenance a little complicated.
Another administrative consideration for this solution is the lack of a means to see a single-source, comprehensive schedule of what jobs will run when. To build such a schedule, you'd first have to look at all the configured calendars and their associated queues and place them on a time line. Then you'd have to go through the list of all running and pending jobs and find the jobs that are bound to those queues, placing them in the appropriate slots on the time line. Grid Engine provides no such tool. Given, however, that all of the Grid Engine command-line tools are 100% scriptable, it would certainly be possible to write such a tool, and it probably wouldn't be that difficult. (And it would probably be written in Perl.)
Such a tool would use:
qconf -scall to see the list of calendars;
qconf -scal name to see a calendar's configuration; parse the year and week fields to find out when it's active;
qconf -sql to see the list of queues;
qconf -sq name to see a queue's configuration; parse the calendar field to find out what calendar is controlling it;
qstat -xml to see all job data in XML format;
qstat -j jobid -xml to see XML data about a job; parse the hard_queue_list field to see what queue it requested;
and finally put all the data together into a single chart.
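Part of that tool can be sketched briefly. The Python below (Python rather than Perl, purely for illustration) assumes the qconf output is plain "attribute value" lines with optional backslash continuations, and that each job's requested queue has already been pulled out of the qstat XML. The function names and sample data are hypothetical, and the XML parsing step is deliberately left out.

```python
from collections import defaultdict
from typing import Dict, List


def parse_fields(text: str) -> Dict[str, str]:
    """Parse 'attribute value' lines, joining backslash-continued lines."""
    joined = text.replace("\\\n", " ")
    fields = {}
    for line in joined.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            # Collapse runs of whitespace inside the value.
            fields[parts[0]] = " ".join(parts[1].split())
    return fields


def jobs_by_calendar(queue_texts: Dict[str, str],
                     job_queues: Dict[str, str]) -> Dict[str, List[str]]:
    """Group job ids under the calendar that controls their requested queue.

    queue_texts: queue name -> text captured from 'qconf -sq name'
    job_queues:  job id -> queue the job requested (assumed already extracted)
    """
    calendar_of = {queue: parse_fields(text).get("calendar", "NONE")
                   for queue, text in queue_texts.items()}
    chart = defaultdict(list)
    for job_id, queue in sorted(job_queues.items()):
        calendar = calendar_of.get(queue, "NONE")
        if calendar != "NONE":  # skip queues with no calendar attached
            chart[calendar].append(job_id)
    return dict(chart)


# Made-up sample data standing in for real command output.
SAMPLE_QUEUE = "qname      daytime.q\nhostlist   @allhosts\ncalendar   workday\n"
chart = jobs_by_calendar({"daytime.q": SAMPLE_QUEUE},
                         {"101": "daytime.q", "102": "other.q"})
```

Placing each calendar's year and week fields on a time line is the remaining piece, and it depends on parsing the calendar syntax itself.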
Given that it's not a straightforward way to do job management, why would you want to use Grid Engine for that purpose? You'd do it so that your managed applications can share the same resources as the rest of the work that is being done on the grid, increasing efficiency, and because if Grid Engine is handling the job management instead of bringing in an additional tool, then there's one less thing to manage, fail over, back up, etc. Is it a perfect solution? No. But if it bothers you, I invite you to make it better. That's the glory of Grid Engine being an open source project! You, the user, have the opportunity to make changes and contribute them back, so that we can include them in the next rev of the product.