Why Upgrade to 6.2?
By templedf on Jul 31, 2008
In a previous post I gave a high-level overview of what features each new release of Grid Engine has brought to the table, including what's coming in 6.2. Since 6.2 is now just around the corner, I wanted to go into a bit more detail on why you want to be the first kid on your block to upgrade.
Let's just go through the features in detail, one by one:
The reason for advance reservation is that sometimes it's important to coordinate the availability of compute resources with external factors, such as people, facility, and/or equipment availability. If, for example, you're trying to process data from some celestial event during the event to help further focus the data gathering, you want the compute resources available while the event is occurring. That is exactly what advance reservation enables.
With 6.2, we introduce three new commands: qrsub, qrdel, and qrstat. qrsub lets users create new advance reservations. A reservation must have a duration or an end time. If a reservation does not request a certain start time, the start time is assumed to be now. When a user runs qrsub, the scheduler will attempt to insert the reservation into its resource schedule. If there's room, the reservation will be granted and assigned an id. If the resources are not available at that time, the reservation will be denied.
Once a user has been granted a reservation, there are several things he can do with it. qsub now has an option that allows users to submit a job to a given reservation. If the reservation is not yet active, i.e. it's for a future time, the job will remain pending until the reservation's start time. A job submitted to a reservation can only run on the resources that were assigned to the reservation. If a job submitted to a reservation is still running when that reservation ends, it will automatically be terminated. When the reservation is first requested, the requesting user can include a list of users and groups who are also allowed to user the reservation. Any user in that list is allowed to submit jobs to a reservation. An advance reservation could alternatively be used to block off a set of machines for some out-of-band purpose, such as taking them down for maintenance or logging into them directly to do some work.
Once a reservation is no longer needed, the creating user can delete it using the qrdel command. Once a reservation is deleted, it's gone. If a user needs to recreate the reservation, she will have to effectively create a new reservation requesting the same (or similar) resources.
In order to see the scheduler's master reservation plan, users can run the qrstat command. qrstat shows what resources are reserved when.
In the time between when a reservation is created and when the reservation becomes active, the scheduler will attempt to backfill the resources with jobs with durations that fit into the available time window. By default, the scheduler will not backfill with jobs that do not specify a wallclock time limit.
There are a couple of limits on users' ability to create reservations. First, a new scheduler parameter controls the maximum number of allowed reservations. Second, reservations can only be made on resources that the scheduler can determine will be available at the desired time of the reservation. The scheduler knows that the resource will be available either 1) because the resource is currently unused, or 2) because the job currently running on the resource has a wallclock time limit that says the job will end before the reservation is supposed to begin.
Multi-clustering with Service Domain Manager (Project Hedeby)
Service Domain Manager (SDM) or Project Hedeby is a framework for managing resource sharing among services. It enables an administrator to define service level objectives (SLOs) that govern the distribution of resources. As workloads change, resources are automatically migrated from one service to another, in order to continue satisfying the SLOs. A service in this context is any application that can scale across multiple nodes.
With Grid Engine 6.2 we're including a feature-limited version of SDM to enable a form of multi-clustering. Using SDM, several Grid Engine clusters can share their resources. The clusters' users continue to use the individual clusters as before. Some just get larger, while others get smaller, as workloads change.
The multi-clustering capability of 6.2 has multiple applications. Any time that you need to have multiple masters for any reason, 6.2's multi-clustering will enable you to combine the individual clusters into a larger "meta-cluster," which will help you keep your resource utilization up.
Scalability to 63,000 cores
A tremendous amount of work has gone into scalability improvements for 6.2. Let's talk about them one at a time.
Scheduler as a thread
Perhaps the biggest change with 6.2 is that the scheduler is no longer its own process. Instead, it's another thread in the qmaster. By bringing the scheduler into the qmaster, we've laid the groundwork for significant scalability improvements. Instead of having to communicate all of the necessary data over the wire between the qmaster and scheduler, the scheduler is able to simply share the qmaster's internal data structures. For now, the performance impact is very modest, but as we're able to refine the data locking, we should be able to squeeze out some significant performance gains.
Improved interactive job support
Prior to 6.2, interactive jobs required external binaries to run. By default, qrsh used rlogin/rsh and rlogind/rshd to run an interactive job. For example, the command form of qrsh would submit rshd as a job and then fork off an rsh to connect to that rshd. The actual running of the command is handled by rsh/rshd rather than Grid Engine. That has several disadvantages. First, even if Grid Engine is installed securely the rsh/rshd connection isn't secure. Second, rsh has a limit of 512 ports, meaning that a single machine cannot start more than 512 interactive jobs. Because Grid Engine handles tight integration of parallel jobs via the interactive job framework, that means rsh limits the size of parallel jobs to 512 slave tasks.
We do, however, let you configure which interactive job utilities to use. For example, you can use ssh/sshd to overcome the two problems mentioned above, but that creates new problems. First, because ssh is secure, it's slower. All communications have to encrypted and then decrypted, meaning more time is spent just processing the traffic. Also, in order for Grid Engine to keep accurate accounting logs, the sshd binary has to be patched for Grid Engine. (Grid Engine actually uses its own patched rshd by default.)
With 6.2, we offer a new option for interactive job support. By default with 6.2, interactive jobs are handled through a built-in process. Instead of submitting an rshd and forking off an rsh to connect to it, all of the communications are handled internally by Grid Engine. qrsh talks to the Grid Engine daemon on the execution node, which forwards the traffic to/from the job shell. No external binaries, no external communications. All of the above problems go away. As an added bonus, interactive jobs now get a PTY, which will make a lot of people's lives easier. The only downside to the new interactive job support is that X11 forwarding is not yet supported. (I should point out that X11 forwarding is different from xhosting. xhosting is supported.) Using the new interactive job support, 10k+ task parallel jobs should be no problem.
When you're trying to support a cluster with thousands or tens of thousands of nodes, even the most innocuous network chatter came become a big problem. With 6.2 we're done our best to reduce that chatter to a minimum. One thing that has been done is a review of the qmaster/execd communications to eliminate any unnecessary messages. Another big change is that the execution daemons now only report resource state diffs rather than reporting the entire state of all resources, even the ones that never change, every load report interval. In small clusters, you may not see the difference, but in huge clusters, the difference is noteworthy.
Other "large cluster" improvements
A variety of other scalability enhancements have been done, mostly with regards to reducing memory consumption, reducing qmaster startup time, and eliminating unnecessary overhead. Again, the effects on small clusters will be small, but large clusters will benefit tremendously.
Array Task Dependency
Since before I joined the team, Grid Engine has been able to manage job dependencies. A user can submit a job and specify that the job cannot be started until a set of other jobs have exited. This works for batch jobs, array jobs, parallel jobs, and even interactive jobs. In the case of array jobs, a job dependent on an array job must wait for all the array job's tasks to exit, and an array job that is dependent on another job cannot start any tasks until that other job has exited. If an array job depends on another array job, no task of the second array job can start until every task of the first array job has exited. For most purposes, that behavior is sufficient.
Imagine for a moment that you work for a visual effects company that uses Grid Engine to render video effects. (If you're imagination is vivid, imagine you work for an Australian visual effects company that has done work for several blockbuster films.) In your day-to-day rendering, you have two choices for how to approach the task given the way Grid Engine works (before 6.2). One option is to have an array job per rendering step, with each job task representing a frame. You could then use job dependencies to make sure that step 2 doesn't start until step 1 finishes. That works, but if one frame takes a lot longer than the others to render, all the other frames are stuck in the current step when they could have moved on to the next step. Another option would be to have a batch job for each frame. That way, as soon as a frame finishes a step, it can move on to the next step, regardless of what step the other frames are on. That's less wasteful, but it's also considerably more difficult to manage (millions of jobs instead of tens), and it makes it hard to take advantage of special resources for individual steps. Yet another option would be to do the rendering as an array job of array jobs. That solves all the technical issues, but is practically impossible to manage.
What you'd really want if you were that visual effects company is that ability to have a task in one array job depend on a task in another array job. That way, you could submit each step as an array job where each task represents a frame, and each task could depend on the corresponding task in the previous step. That feature is exactly what 6.2 provides. (Actually, the feature was implemented and contributed by that not-so-imaginary Australian visual effects company.)
With 6.2 a user can declare that an array job's tasks are dependent on the tasks of another array job. Each task of the second job will then each depend on the task of the first job with the same task number, i.e. job 2 task 1 will depend on job 1 task 1. In addition, array task dependencies support "chunking." Chunking means grouping tasks together for efficiency. For example, step one might be really light weight, making it more efficient to have each task process three frames instead of just one. The way chunking is representing in an array task dependency is by the array job's step size. By default, array job tasks are numbered in increments of 1, i.e. 1, 2, 3, 4, 5, etc. It is possible, however, to declare a step size for the task numbers other than 1. A step size of 3 would result in tasks numbered 1, 4, 7, etc. In an array task dependency, if the corresponding task number in the previous job doesn't exist because of chunking, the dependency falls to the chunked task that contains the corresponding task number. For example, tasks 1, 2, and 3 from an array job with a step size of 1 might all depend on task 1 of the previous array job with a step size of 3. It works the other way around as well. Task 1 of an array job with a step size of 3 might depend on tasks 1, 2, and 3 of the previous array job with a step size of 1. It even works for uneven combinations, such as task 1 of an array job with a step size of 3 depending on tasks 1 and 3 of the previous array job with a step size of 2.
One of the major areas of focus for 6.2 was improving the Accounting and Reporting Console (ARCo), In previous releases, the ARCo infrastructure was a little pokey, and it was not very difficult to produce a stream of accounting data fast enough to completely swamp the DBWriter component. (The DBWriter's job is to transfer data from the accounting logs into the ARCo database.) With 6.2 that has been fixed, along with a number of other performance-related issues. ARCo is now fast, and it will continue to get better. Another important change for ARCo is that you can now have more than one cluster write into the same database without conflict. ARCo will even let you run queries against the data from all the clusters. That is important, of course, because of the new multi-clustering support that was also adding in 6.2 (as described above).
Every release we add a few more features to take advantage of what the Solaris 10 operating environment has to offer. In 6.1 we added a DTrace script and declared support for Solaris Zones and ZFS. With 6.2 we're adding support for Service Tags and the Service Management Framework.
Service Tags are a way for you as an administrator to keep track of everything in your network. When a machine has service tags enabled, it responds to broadcast requests for information from the service tag client. When you install 6.2, you have the option of allowing Grid Engine to register a service tag on the master machine to indicate that you have Grid Engine running in your network. You can then see that information from the service tags client. You can also upload that information to Sun's service management repository, and we'll keep track of it for you.
The Service Management Framework (SMF) is a replacement for the traditional UNIX init scripts. Instead of startup and shutdown scripts, services get an entry in a services database that lists how to start and stop the service, among other things. When 6.2 is installed on a Solaris host that supports SMF, if you choose to have Grid Engine start when the machine boots, the installer will create an SMF entry instead of an init script. If you need to make changes to the way Grid Engine is started, you can edit the $SGE_ROOT/$SGE_CELL/common/sgemaster file just like you would have with the old init scripts. Perhaps the most useful part of SMF is that you get an automatic watchdog for your services. If one of your Grid Engine daemons dies or is killed (not using qconf or the sgemaster and sgeexecd scripts), the watchdog process will restart the service automatically.
Are you totally stoked now, or what?
Pretty impressive feature list, eh? And that list didn't even include the myriad major and minor bug fixes that are delivered with 6.2. If you can't wait to try it out, you have two options. First, the beta2 courtesy binaries are still available on the open source site. Second, you can grab the V62_TAG tag from the CVS repository and build it yourself. Have fun, and let us know how it turns out!