The Grid Engine Scheduler Quantified
By templedf on Nov 08, 2007
Earlier this week I helped out with an RFP that included a section for Sun Grid Engine. After I sent in my contributions, it occurred to me that some of the information might be useful to others. Below is my description of the Grid Engine scheduler. It gives a pretty thorough idea of what the scheduler is capable of. In case you were wondering.
Sun Grid Engine supports mixed workloads of batch, interactive, parallel and parametric jobs.
The Sun Grid Engine scheduler is a highly configurable workload scheduler, providing a variety of options for matching workload (jobs) to available resources. The work of the scheduler is done in two distinct steps. The first step is the selection of jobs to be scheduled. The selection step is carried out by applying the various scheduler policies to arrive at a final order of importance for pending jobs. Jobs that are granted the same priority by the scheduler policies will be placed in order of submission.
After the jobs have been sorted in priority order according to the scheduler policies in place, the second step of workload scheduling takes place. In the second step, the scheduler matches the resources requested by the jobs to the resources offered by execution machines, in the order determined by the selection step. The scheduler considers four things to decide where a job will be executed. First, the scheduler filters the job's potential execution host list by the job's “hard resource requests.” A job can only run on an execution host which provides the resources the job has declared as necessary. Second, the scheduler filters the list of potential execution hosts by the job's “soft resource requests.” If a job wants, but does not need, a particular resource, the job will be run on an execution host offering that resource if possible. If no such execution host is available, that resource request is ignored. Third, the scheduler selects an execution host from the list of potential execution hosts according to host load, the host's 5-minute average load divided by the number of processors. The least loaded execution host from the list of potential execution hosts will be selected as the destination for the job. Lastly, the scheduler selects a queue on the selected execution host according to the available queues' sequence numbers. The queue with the lowest sequence number will be selected as the destination for the job.
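The four-part matching step can be sketched in code. This is purely illustrative (not Grid Engine source); the Host record, its fields and the job's request sets are hypothetical stand-ins for the real data structures.

```python
# Illustrative sketch of the resource-matching step described above.
# The Host record and its fields are hypothetical, not Grid Engine's own.
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    resources: set            # resources this host offers
    load_avg_5min: float      # 5-minute load average
    num_procs: int
    queue_seq_nos: dict = field(default_factory=dict)  # queue name -> sequence number

    @property
    def load(self):
        # the default notion of "load": normalized 5-minute load average
        return self.load_avg_5min / self.num_procs

def place_job(hard, soft, hosts):
    """Return a (host, queue) destination for a job, or None if it must stay pending."""
    # 1. Hard requests: a host must offer every hard-requested resource.
    candidates = [h for h in hosts if hard <= h.resources]
    if not candidates:
        return None
    # 2. Soft requests: prefer hosts that also offer them; ignore if none do.
    preferred = [h for h in candidates if soft <= h.resources]
    if preferred:
        candidates = preferred
    # 3. Pick the least loaded execution host.
    host = min(candidates, key=lambda h: h.load)
    # 4. Pick the queue with the lowest sequence number on that host.
    queue = min(host.queue_seq_nos, key=host.queue_seq_nos.get)
    return host, queue
```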
The above paragraph describes the default configuration; in actuality, the scheduler's behavior is highly configurable. In the third step, the execution host is selected by host “load.” The scheduler's concept of load can be configured to reflect the priorities of the organization. By default it is the normalized 5-minute load average, but it can be based on any resource in the system. Free memory, CPU speed and free disk space are examples of other common metrics. Steps three and four from the above paragraph can also be swapped. In that case, after filtering the execution host list by hard and soft resource requests, the scheduler first selects a destination queue for the job based on queue sequence numbers, and then selects a destination execution host from the list of execution hosts on which that queue is available.
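For reference, both of these knobs live in the scheduler configuration, editable with `qconf -msconf`. The fragment below is a sketch from memory of the relevant attribute names; verify the exact syntax against your installation's man pages before relying on it.

```
# Scheduler configuration attributes (edit with: qconf -msconf)
load_formula         np_load_avg   # the "load" metric; any configured resource may be used
queue_sort_method    load          # change to "seqno" to swap steps three and four
```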
The scheduler supports three classes of scheduler policies: entitlement, urgency and custom.
The entitlement policy consists of three components: share tree policy, functional policy and override ticket policy.
The share tree policy is a fair-share policy that attempts to ensure that target resource shares are achieved over a configurable period of time. Entitlement shares are configured in a directed acyclic graph, called a share tree, that describes how resource usage is to be shared among users. The shares of users who have no jobs waiting to be run are shared among the other users according to the share distribution defined by the share tree. Users who have received more than their target share of the resource may later be penalized to ensure that resource usage over a given period matches the target resource shares.
The functional policy describes the relative target shares of resources for users, departments, projects, and individual jobs. Each job gets a total share of resources based on the sum of the shares it receives from its submitting user, the submitting user's department, the project to which the job belongs and the share assigned directly to the job. The shares of users who have no jobs waiting to be run are distributed among the other users according to the configured functional shares. The functional policy is non-historical, meaning that users who gain extra shares of resources will not later be penalized for the overage.
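The summing described above is simple enough to sketch. This is an illustration only; the share values and the way relative entitlement is derived here are hypothetical, not Grid Engine's actual arithmetic.

```python
# Illustrative sketch (not Grid Engine source): a job's functional share as
# the sum of the four contributions named in the text. All values are hypothetical.
def functional_shares(user_shares, dept_shares, project_shares, job_shares):
    """Total functional shares a job receives."""
    return user_shares + dept_shares + project_shares + job_shares

# Two pending jobs: each job's relative entitlement is its share of the total.
job_a = functional_shares(user_shares=100, dept_shares=50, project_shares=0, job_shares=10)
job_b = functional_shares(user_shares=20, dept_shares=50, project_shares=10, job_shares=0)
entitlement_a = job_a / (job_a + job_b)   # job_a's fraction of pending shares
```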
The override ticket policy provides a means for the administrator to give (and later remove) extra priority to a user, department, project or individual job with respect to the share tree and functional policies. The extra priority provided by the override ticket policy ultimately results in increased resource shares for the target user, department, project or job.
The urgency policy consists of three components: deadline time urgency, wait time urgency and resource urgency.
Deadline time urgency is a priority associated with a job that increases as the job approaches its start time deadline. As the current time approaches a job's start time deadline, the job's urgency approaches the maximum deadline urgency asymptotically. Once the current time reaches a job's deadline time, that job is assigned the maximum deadline urgency. To prevent abuse of the system, only a select set of users is allowed to submit jobs with start time deadlines.
Wait time urgency is a priority that a job accumulates while it remains in a pending state, waiting to be scheduled. Every second spent waiting accrues more priority for the job.
Resource urgency is a priority associated with resources that is inherited by jobs requesting those resources. Assigning such an urgency to a resource ensures that jobs requesting it are given priority in the scheduler's job selection process, in turn ensuring that the resource is kept in use as much as possible.
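The three urgency contributions can be sketched as functions. The exact Grid Engine formulas are not reproduced here; the shapes (an asymptotic curve for deadlines, linear growth for wait time, per-resource lookup for resource urgency) and all weights are assumptions chosen only to match the prose above.

```python
# Illustrative sketches of the three urgency components described above.
# Formula shapes and weights are assumptions, not Grid Engine's actual formulas.
def deadline_urgency(now, deadline, max_urgency):
    """Grows asymptotically toward max_urgency as `now` approaches `deadline`."""
    remaining = deadline - now
    if remaining <= 0:
        return max_urgency                      # deadline reached: maximum urgency
    return min(max_urgency, max_urgency / remaining)

def wait_urgency(seconds_waiting, weight_per_second):
    """Linear: every second spent pending accrues more priority."""
    return seconds_waiting * weight_per_second

def resource_urgency(requested, urgency_by_resource):
    """A job inherits the urgency assigned to each resource it requests."""
    return sum(urgency_by_resource.get(r, 0) for r in requested)
```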
The custom policy is a number between -1023 and 1024 that represents the priority of a job, with higher numbers representing higher priorities, as described by POSIX 1003.1b. It's called the custom policy because it can be used by external processes to apply custom scheduling orders.
The job priorities derived from each of the three scheduling policies are normalized to a value between 0 and 1 within their respective policies, then weighted and summed to arrive at a final priority value. The weights used for each policy are configurable. By default the custom policy is weighted most heavily and the urgency policy least heavily, with the entitlement policy in the middle. The default weights differ by an order of magnitude, ensuring a clear hierarchy among the policy contributions to the overall job priority.
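The normalize-weight-sum combination above can be sketched as follows. The weights shown are placeholders that merely differ by an order of magnitude in the order the text describes; they are not the actual default values, and the normalization bounds are likewise illustrative.

```python
# Illustrative sketch of the final priority combination: each policy value
# is normalized to [0, 1], then weighted and summed. Weights are placeholders.
def normalize(value, lo, hi):
    """Map a raw policy value into [0, 1] given its configured bounds."""
    return (value - lo) / (hi - lo) if hi > lo else 0.0

def job_priority(custom, entitlement, urgency,
                 w_custom=1.0, w_entitlement=0.1, w_urgency=0.01):
    """Combine already-normalized policy values into one priority."""
    return (w_custom * custom
            + w_entitlement * entitlement
            + w_urgency * urgency)
```

With these placeholder weights, a job's custom (POSIX) priority dominates, entitlement breaks ties among equal custom priorities, and urgency breaks ties among those.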
In addition to the scheduler policies, the scheduler's decision-making process can be influenced by queue configuration, resource configuration, calendar configuration and resource quota sets. Through these mechanisms, an organization's business rules and grid topology can be modeled within the grid, directing the scheduler to make decisions within the boundaries of the grid environment. In particular, these mechanisms provide the means to place fine-grained limits on the execution of jobs, such as how many can be simultaneously active, when they are allowed to be active, on which hosts they are allowed to be executed and by whom.