Self Control

Good day, and welcome to week four of my continuing attempt to cover all the features added in the latest release (6.2u5) of Sun Grid Engine. This week we'll talk about array task throttling.

Sun Grid Engine supports four classes of jobs. Interactive jobs are the equivalent of doing an rsh/rlogin/ssh to a node in the cluster, except that the connection is managed by Sun Grid Engine. Batch jobs are your traditional "go run this somewhere" type of job. They represent a single instance of an executable. Parallel jobs consist of multiple processes working in collaboration. All of the processes need to be scheduled and running at the same time in order for the job to run. Parametric or array jobs are like what you see in Apache Hadoop, where multiple copies of the same executable are run across multiple nodes against different parts of the data set. The important characteristic that distinguishes array jobs from parallel jobs is that the tasks of an array job are completely independent from each other and hence do not need to all be running together.

The way that Sun Grid Engine processes array jobs is particularly efficient. In fact, a common trick to improve cluster throughput is to bundle many batch jobs together to be submitted as a single array job. Because array jobs are so efficient, users use lots of them, sometimes with huge task counts. There is no explicit limit on the number of tasks that an array job can contain. Hundreds of thousands of tasks in a single array job are not uncommon.

There is a problem, however. From the Sun Grid Engine scheduler's perspective, all of the tasks of an array job are equal. That means that if the highest priority job waiting to execute is an array job, then all of that job's tasks are higher priority than any other job (or task) waiting to run. If that job has a million tasks, then the cluster is going to have to process all million of those tasks before anything else will be executed. Now, the policies do come into play here, and if a higher priority job is submitted or if the array job loses priority through some policy (like the fair share policy), then it and its remaining tasks will fall back in the execution order. Nonetheless, this approach makes it possible for a user to unintentionally execute a denial of service attach on the cluster.

For quite some time there has been an option that an administrator can configure to set a limit on the maximum number of tasks that can be simultaneously executed from a single array job (max_aj_instances in sge_conf(5)). That solves the problem, but only in a very general and somewhat suboptimal way. As with any such global setting, the administrator has to make a trade-off between having a limit that works well for the majority and having a limit that doesn't unduly restrict certain users. (The default is 2000 tasks per array job.) Well, it turns out that given the opportunity, most users will willing set such a limit themselves, both to avoid being bonked on the head by the administrator for abusing the cluster, and for reasons of self interest, such as by allowing multiple of their array jobs to share cluster time rather than being processed sequentially. So, with 6.2u5, we've given users exactly that ability.

Let's look at an example:

% qsub -t 1-100000

will submit an array job that will run the script one hundred thousand times. Each time it runs, an environment variable ($SGE_TASK_ID) will be set to tell that instance which task number it is. The script must be able to translate that task ID into a pointer to its portion of the data set. In a cluster with default settings, up to 2000 of the tasks of this job will be allowed to be running at a time. If the cluster only has 2000 slots, that could be a bad thing.

% qsub -t 1-100000 -tc 20

submits the same job, except that it places a limit of 20 on the number of tasks allowed to be running simultaneously. In our fictitious 2000-slot cluster, that's a quite neighborly thing to do. If you try to set the limit above the global limit set by the administrator, the global limit prevails.

While this feature is pretty simple, it can mean a large difference in job throughput for some clusters. I know one customer in particular that went way out of their way to implement this feature themselves using clever configuration tricks. The massive headache of hacking together a solution was worth it to them to be able to set per-job task limits.


Post a Comment:
  • HTML Syntax: NOT allowed



« April 2014