Wednesday Oct 01, 2008


I recently rediscovered a hidden qconf option. I remember talking with the engineer when he implemented the option years ago, but because it was never documented, I forgot that it existed. A recent customer eval reminded me that it's there, and I think it's one worth sharing.

The hidden option is qconf -bonsai. It is a human-readable equivalent of qconf -sstree, whose output, as you'll know if you've ever looked at it, isn't even remotely human-readable. It prints the current share tree configuration using spacing to represent hierarchy.
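To make the "spacing to represent hierarchy" idea concrete, here is a minimal Python sketch. It is only an illustration: the nested tuple structure, the node names, the share values, and the three-space indent are all assumptions, and the real qconf -bonsai layout may differ.

```python
# Hypothetical in-memory share tree: (name, shares, children).
# This is NOT the qconf -sstree file format, just an illustration.
TREE = ("Root", 1, [
    ("projectA", 60, [("alice", 50, []), ("bob", 50, [])]),
    ("default", 40, []),
])


def render(node, depth=0, indent=3):
    """Return one text line per node, indented according to its depth."""
    name, shares, children = node
    lines = [" " * (depth * indent) + f"{name}={shares}"]
    for child in children:
        lines.extend(render(child, depth + 1, indent))
    return lines


print("\n".join(render(TREE)))
```

Each level of nesting simply adds one indent step, so a node's parent is the nearest line above it with less indentation.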

Let's look at an example. This is the output from qconf -sstree for my home test cluster:

# qconf -sstree

This is the output from qconf -bonsai for the same cluster:

# qconf -bonsai

Now, as for why it's an undocumented feature, I suspect it's historical. It was originally added on a whim by one of the engineers and was just never fully embraced. I remember there being talk about changing the name of the switch and making it a documented feature, but I suspect that plan just got lost in the shuffle.

Thursday Sep 18, 2008

Cheap at Twice the Price

We've just added pricing to the Sun Store for Sun Grid Engine 6.2. Just go to the Get It tab, scroll down to the media kit, perpetual licenses, or subscription licenses section, and click the Get It button. On the Sun Store page you'll find the complete pricing information for that option.

Friday Aug 22, 2008

Announcing Grid Engine 6.1 Update 5

Grid Engine 6.1 Update 5 is now ready and courtesy binaries are available for download. SGE 6.1u5 is a maintenance release that fixes bugs in the software, the installation procedure, and the man pages. For more information, see the announcement.

Thursday Aug 14, 2008

Feature Poll

I just posted a poll over on about what potential new Sun Grid Engine features are most important. Pop on over and share your opinion!

Monday Aug 11, 2008

Sun Grid Engine 6.2 Information

Since there's actually quite a lot of information out there about the Sun Grid Engine 6.2 release, I thought it might be useful to provide a single source for where to find it. (Actually, the completely revamped Sun Grid Engine is already a single source for this information, but you have to browse a bit to find it all.) Here ya go:

There are still a couple more things in the coming soon category. As they go live, I will update the above list.

Friday Aug 08, 2008

It's a Bouncing Baby 6.2!

Congratulations to the Sun Grid Engine team! The new 6.2 release is finally out. (Actually, it's been out for a couple of days. I'm just a tardy blogger.) To find out more about what's in it, check out my previous post, see the new, improved Sun Grid Engine product page, or listen to the podcast that Miha and I recorded. To download a copy of the software, pop on over to the Sun Download Center. The open source courtesy binaries will be made available shortly.

As if that were not exciting enough, there's also a chance to win a free t-shirt! Andy, our non-blogging engineering manager (Encourage him to start blogging next time you see him!), has put a bounty on 6.2 production clusters. For more information, see Chris' post on or Andy's original email. Act now! Supplies are limited!

Thursday Jul 31, 2008

Why Upgrade to 6.2?

In a previous post I gave a high-level overview of what features each new release of Grid Engine has brought to the table, including what's coming in 6.2. Since 6.2 is now just around the corner, I wanted to go into a bit more detail on why you want to be the first kid on your block to upgrade.

Let's just go through the features in detail, one by one:

Advance Reservation

The reason for advance reservation is that sometimes it's important to coordinate the availability of compute resources with external factors, such as people, facility, and/or equipment availability. If, for example, you're trying to process data from some celestial event during the event to help further focus the data gathering, you want the compute resources available while the event is occurring. That is exactly what advance reservation enables.

With 6.2, we introduce three new commands: qrsub, qrdel, and qrstat. qrsub lets users create new advance reservations. A reservation must have a duration or an end time. If a reservation does not request a certain start time, the start time is assumed to be now. When a user runs qrsub, the scheduler will attempt to insert the reservation into its resource schedule. If there's room, the reservation will be granted and assigned an id. If the resources are not available at that time, the reservation will be denied.

Once a user has been granted a reservation, there are several things he can do with it. qsub now has an option that allows users to submit a job to a given reservation. If the reservation is not yet active, i.e. it's for a future time, the job will remain pending until the reservation's start time. A job submitted to a reservation can only run on the resources that were assigned to the reservation. If a job submitted to a reservation is still running when that reservation ends, it will automatically be terminated. When the reservation is first requested, the requesting user can include a list of users and groups who are also allowed to use the reservation. Any user in that list is allowed to submit jobs to a reservation. An advance reservation could alternatively be used to block off a set of machines for some out-of-band purpose, such as taking them down for maintenance or logging into them directly to do some work.

Once a reservation is no longer needed, the creating user can delete it using the qrdel command. Once a reservation is deleted, it's gone. If a user needs to recreate the reservation, she will have to effectively create a new reservation requesting the same (or similar) resources.

In order to see the scheduler's master reservation plan, users can run the qrstat command. qrstat shows what resources are reserved when.

In the time between when a reservation is created and when the reservation becomes active, the scheduler will attempt to backfill the resources with jobs with durations that fit into the available time window. By default, the scheduler will not backfill with jobs that do not specify a wallclock time limit.

There are a couple of limits on users' ability to create reservations. First, a new scheduler parameter controls the maximum number of allowed reservations. Second, reservations can only be made on resources that the scheduler can determine will be available at the desired time of the reservation. The scheduler knows that the resource will be available either 1) because the resource is currently unused, or 2) because the job currently running on the resource has a wallclock time limit that says the job will end before the reservation is supposed to begin.
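The availability rule and the backfill rule described above can be modeled in a few lines of Python. This is a rough sketch of the reasoning only, not the scheduler's actual code: the RunningJob class, the function names, and the use of plain integer seconds for times are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunningJob:
    """Hypothetical record of the job currently occupying a resource."""
    start: int                       # start time, in seconds
    wallclock_limit: Optional[int]   # seconds, or None if no limit was given


def resource_reservable(job: Optional[RunningJob], reservation_start: int) -> bool:
    """A resource can be reserved if it is unused, or if the running job's
    wallclock limit guarantees it ends before the reservation begins."""
    if job is None:
        return True
    if job.wallclock_limit is None:
        return False  # no limit means no guarantee about when the job ends
    return job.start + job.wallclock_limit <= reservation_start


def can_backfill(now: int, reservation_start: int,
                 wallclock_limit: Optional[int]) -> bool:
    """A pending job may fill the gap before a reservation only if it
    declares a wallclock limit that fits into the remaining window."""
    if wallclock_limit is None:
        return False  # default behavior described above
    return now + wallclock_limit <= reservation_start
```

For example, with a reservation starting at t=3600, a job that started at t=0 with a 1800-second limit leaves the resource reservable, while a job with no limit blocks the reservation.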

Multi-clustering with Service Domain Manager (Project Hedeby)

Service Domain Manager (SDM) or Project Hedeby is a framework for managing resource sharing among services. It enables an administrator to define service level objectives (SLOs) that govern the distribution of resources. As workloads change, resources are automatically migrated from one service to another, in order to continue satisfying the SLOs. A service in this context is any application that can scale across multiple nodes.

With Grid Engine 6.2 we're including a feature-limited version of SDM to enable a form of multi-clustering. Using SDM, several Grid Engine clusters can share their resources. The clusters' users continue to use the individual clusters as before. Some just get larger, while others get smaller, as workloads change.

The multi-clustering capability of 6.2 has multiple applications. Any time that you need to have multiple masters for any reason, 6.2's multi-clustering will enable you to combine the individual clusters into a larger "meta-cluster," which will help you keep your resource utilization up.

Scalability to 63,000 cores

A tremendous amount of work has gone into scalability improvements for 6.2. Let's talk about them one at a time.

Scheduler as a thread

Perhaps the biggest change with 6.2 is that the scheduler is no longer its own process. Instead, it's another thread in the qmaster. By bringing the scheduler into the qmaster, we've laid the groundwork for significant scalability improvements. Instead of having to communicate all of the necessary data over the wire between the qmaster and scheduler, the scheduler is able to simply share the qmaster's internal data structures. For now, the performance impact is very modest, but as we're able to refine the data locking, we should be able to squeeze out some significant performance gains.

Improved interactive job support

Prior to 6.2, interactive jobs required external binaries to run. By default, qrsh used rlogin/rsh and rlogind/rshd to run an interactive job. For example, the command form of qrsh would submit rshd as a job and then fork off an rsh to connect to that rshd. The actual running of the command is handled by rsh/rshd rather than Grid Engine. That has several disadvantages. First, even if Grid Engine is installed securely the rsh/rshd connection isn't secure. Second, rsh has a limit of 512 ports, meaning that a single machine cannot start more than 512 interactive jobs. Because Grid Engine handles tight integration of parallel jobs via the interactive job framework, that means rsh limits the size of parallel jobs to 512 slave tasks.

We do, however, let you configure which interactive job utilities to use. For example, you can use ssh/sshd to overcome the two problems mentioned above, but that creates new problems. First, because ssh is secure, it's slower. All communications have to be encrypted and then decrypted, meaning more time is spent just processing the traffic. Also, in order for Grid Engine to keep accurate accounting logs, the sshd binary has to be patched for Grid Engine. (Grid Engine actually uses its own patched rshd by default.)

With 6.2, we offer a new option for interactive job support. By default with 6.2, interactive jobs are handled through a built-in process. Instead of submitting an rshd and forking off an rsh to connect to it, all of the communications are handled internally by Grid Engine. qrsh talks to the Grid Engine daemon on the execution node, which forwards the traffic to/from the job shell. No external binaries, no external communications. All of the above problems go away. As an added bonus, interactive jobs now get a PTY, which will make a lot of people's lives easier. The only downside to the new interactive job support is that X11 forwarding is not yet supported. (I should point out that X11 forwarding is different from xhosting. xhosting is supported.) Using the new interactive job support, 10k+ task parallel jobs should be no problem.

Streamlined communications

When you're trying to support a cluster with thousands or tens of thousands of nodes, even the most innocuous network chatter can become a big problem. With 6.2 we've done our best to reduce that chatter to a minimum. One thing that has been done is a review of the qmaster/execd communications to eliminate any unnecessary messages. Another big change is that the execution daemons now only report resource state diffs rather than reporting the entire state of all resources, even the ones that never change, every load report interval. In small clusters, you may not see the difference, but in huge clusters, the difference is noteworthy.
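The idea of reporting diffs instead of full state is easy to sketch in Python. This is only a model of the concept, not the actual execd/qmaster protocol: the function names and the dictionary representation of a load report are assumptions for illustration.

```python
_MISSING = object()  # sentinel so that a legitimate None value still compares correctly


def load_diff(previous: dict, current: dict) -> dict:
    """Return only the entries that are new or changed since the last report."""
    return {
        name: value
        for name, value in current.items()
        if previous.get(name, _MISSING) != value
    }


def apply_diff(known_state: dict, diff: dict) -> dict:
    """Receiver side: merge a diff into the last known full state."""
    merged = dict(known_state)
    merged.update(diff)
    return merged


# Illustrative values only.
last_report = {"arch": "sol-amd64", "num_proc": 4, "load_avg": 0.25}
this_report = {"arch": "sol-amd64", "num_proc": 4, "load_avg": 0.75}

diff = load_diff(last_report, this_report)
print(diff)
```

Static values such as the architecture and processor count never show up in the diff, which is where the savings come from when the same report is multiplied across thousands of hosts.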

Other "large cluster" improvements

A variety of other scalability enhancements have been done, mostly with regards to reducing memory consumption, reducing qmaster startup time, and eliminating unnecessary overhead. Again, the effects on small clusters will be small, but large clusters will benefit tremendously.

Array Task Dependency

Since before I joined the team, Grid Engine has been able to manage job dependencies. A user can submit a job and specify that the job cannot be started until a set of other jobs have exited. This works for batch jobs, array jobs, parallel jobs, and even interactive jobs. In the case of array jobs, a job dependent on an array job must wait for all the array job's tasks to exit, and an array job that is dependent on another job cannot start any tasks until that other job has exited. If an array job depends on another array job, no task of the second array job can start until every task of the first array job has exited. For most purposes, that behavior is sufficient.

Imagine for a moment that you work for a visual effects company that uses Grid Engine to render video effects. (If your imagination is vivid, imagine you work for an Australian visual effects company that has done work for several blockbuster films.) In your day-to-day rendering, you have a few choices for how to approach the task given the way Grid Engine works (before 6.2). One option is to have an array job per rendering step, with each job task representing a frame. You could then use job dependencies to make sure that step 2 doesn't start until step 1 finishes. That works, but if one frame takes a lot longer than the others to render, all the other frames are stuck in the current step when they could have moved on to the next step. Another option would be to have a batch job for each frame. That way, as soon as a frame finishes a step, it can move on to the next step, regardless of what step the other frames are on. That's less wasteful, but it's also considerably more difficult to manage (millions of jobs instead of tens), and it makes it hard to take advantage of special resources for individual steps. Yet another option would be to do the rendering as an array job of array jobs. That solves all the technical issues, but is practically impossible to manage.

What you'd really want if you were that visual effects company is the ability to have a task in one array job depend on a task in another array job. That way, you could submit each step as an array job where each task represents a frame, and each task could depend on the corresponding task in the previous step. That feature is exactly what 6.2 provides. (Actually, the feature was implemented and contributed by that not-so-imaginary Australian visual effects company.)

With 6.2 a user can declare that an array job's tasks are dependent on the tasks of another array job. Each task of the second job will then depend on the task of the first job with the same task number, i.e. job 2 task 1 will depend on job 1 task 1. In addition, array task dependencies support "chunking." Chunking means grouping tasks together for efficiency. For example, step one might be really lightweight, making it more efficient to have each task process three frames instead of just one. The way chunking is represented in an array task dependency is by the array job's step size. By default, array job tasks are numbered in increments of 1, i.e. 1, 2, 3, 4, 5, etc. It is possible, however, to declare a step size for the task numbers other than 1. A step size of 3 would result in tasks numbered 1, 4, 7, etc. In an array task dependency, if the corresponding task number in the previous job doesn't exist because of chunking, the dependency falls to the chunked task that contains the corresponding task number. For example, tasks 1, 2, and 3 from an array job with a step size of 1 might all depend on task 1 of the previous array job with a step size of 3. It works the other way around as well. Task 1 of an array job with a step size of 3 might depend on tasks 1, 2, and 3 of the previous array job with a step size of 1. It even works for uneven combinations, such as task 1 of an array job with a step size of 3 depending on tasks 1 and 3 of the previous array job with a step size of 2.
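The chunking rule is easier to see as arithmetic. Below is a small Python model of it, assuming that a task with number t and step size s covers the task numbers t through t+s-1, and that both array jobs start numbering at the same first task. The function name and parameters are hypothetical; this is a sketch of the rule as described here, not Grid Engine's implementation.

```python
def predecessor_tasks(task_id: int, step: int, prev_step: int, first: int = 1) -> list:
    """Return the task numbers in the previous array job that the given
    task depends on, based on which chunks overlap its range."""
    low = task_id                 # first task number covered by this chunk
    high = task_id + step - 1     # last task number covered by this chunk

    # Map each end of the range to the start of the predecessor chunk
    # that contains it.
    first_chunk = first + ((low - first) // prev_step) * prev_step
    last_chunk = first + ((high - first) // prev_step) * prev_step

    return list(range(first_chunk, last_chunk + 1, prev_step))


# The three cases from the text:
print(predecessor_tasks(2, step=1, prev_step=3))   # fine task, chunked predecessor
print(predecessor_tasks(1, step=3, prev_step=1))   # chunked task, fine predecessor
print(predecessor_tasks(1, step=3, prev_step=2))   # uneven combination
```

With these assumptions, tasks 1, 2, and 3 of a step-1 job all map to task 1 of a step-3 predecessor, and task 4 maps to task 4, which is the next chunk.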

ARCo enhancements

One of the major areas of focus for 6.2 was improving the Accounting and Reporting Console (ARCo). In previous releases, the ARCo infrastructure was a little pokey, and it was not very difficult to produce a stream of accounting data fast enough to completely swamp the DBWriter component. (The DBWriter's job is to transfer data from the accounting logs into the ARCo database.) With 6.2 that has been fixed, along with a number of other performance-related issues. ARCo is now fast, and it will continue to get better. Another important change for ARCo is that you can now have more than one cluster write into the same database without conflict. ARCo will even let you run queries against the data from all the clusters. That is important, of course, because of the new multi-clustering support that was also added in 6.2 (as described above).

Solaris Enhancements

Every release we add a few more features to take advantage of what the Solaris 10 operating environment has to offer. In 6.1 we added a DTrace script and declared support for Solaris Zones and ZFS. With 6.2 we're adding support for Service Tags and the Service Management Framework.

Service Tags are a way for you as an administrator to keep track of everything in your network. When a machine has service tags enabled, it responds to broadcast requests for information from the service tag client. When you install 6.2, you have the option of allowing Grid Engine to register a service tag on the master machine to indicate that you have Grid Engine running in your network. You can then see that information from the service tags client. You can also upload that information to Sun's service management repository, and we'll keep track of it for you.

The Service Management Framework (SMF) is a replacement for the traditional UNIX init scripts. Instead of startup and shutdown scripts, services get an entry in a services database that lists how to start and stop the service, among other things. When 6.2 is installed on a Solaris host that supports SMF, if you choose to have Grid Engine start when the machine boots, the installer will create an SMF entry instead of an init script. If you need to make changes to the way Grid Engine is started, you can edit the $SGE_ROOT/$SGE_CELL/common/sgemaster file just like you would have with the old init scripts. Perhaps the most useful part of SMF is that you get an automatic watchdog for your services. If one of your Grid Engine daemons dies or is killed (not using qconf or the sgemaster and sgeexecd scripts), the watchdog process will restart the service automatically.

Are you totally stoked now, or what?

Pretty impressive feature list, eh? And that list didn't even include the myriad major and minor bug fixes that are delivered with 6.2. If you can't wait to try it out, you have two options. First, the beta2 courtesy binaries are still available on the open source site. Second, you can grab the V62_TAG tag from the CVS repository and build it yourself. Have fun, and let us know how it turns out!

Wednesday Jul 16, 2008

Why Upgrade?

One of the questions that comes up often in Grid Engine land is, "Why should I upgrade?" Now that 6.2 is almost ready, I thought now would be a good time to provide a clear and concise answer to the question.

Why upgrade to Grid Engine 6.2?

The watchword for 6.2 is scalability. If you're running a large (multi-thousand host) cluster, you really want to be running 6.2. A lot has been done to address scalability in large clusters. Advance reservation is another headliner. 6.2 offers you the ability to reserve a set of resources at a specific time. The other big-ticket item for 6.2 is multi-clustering. Using a feature-limited release of Project Hedeby (AKA Haithabu, Service Domain Manager (SDM)), Grid Engine 6.2 offers you the ability to set up several independent Grid Engine 6.2 clusters that are also able to share resources. As one cluster gets overloaded while other clusters are idle, resources will automatically be migrated from the underused clusters to the overloaded cluster.

Here's the complete feature list:

  • Scalability to 63,000 cores
    • Streamlined communications between qmaster and execution daemons
    • The scheduler is no longer a separate process and is now a thread in the qmaster
    • More efficient resource matching process in the scheduler
    • Reduced qmaster startup time
    • Reduced qmaster memory requirements for large clusters
    • ARCo scalability improvements — faster DBWriter and faster queries
  • Advance reservation — reserve resources for a given period of time. qsub now lets you submit jobs into a pre-existing reservation
  • New interactive job support — with 6.2, you can now configure interactive jobs (and hence parallel slave tasks) to communicate with the client through the existing Grid Engine communications channels, instead of having to fork off an rsh/rshd (or ssh/sshd, telnet/telnetd, etc.) pair
  • Administration improvements
    • ARCo installation documentation is much better
    • Support for Solaris SMF (in addition to traditional rc scripts)
    • Support for Sun Service Tags on Solaris and Linux
  • JMX interface for the qmaster — the qmaster now offers a JMX management interface that enables the complete set of Grid Engine management operations. The API is, however, unstable and will change, probably significantly
  • Multi-clustering
    • Project Hedeby will enable the automatic migration of resources from underloaded clusters to overloaded clusters. Service Level Objectives configured for each cluster determine the boundaries of overloaded and underloaded, and policies govern the relative importance of the clusters.
    • ARCo now supports multiple clusters in the same database using the same web interface

What was introduced with Grid Engine 6.1?

The two big wins for 6.1 are resource quota sets and boolean expressions. Both go a long way towards simplifying the administrator's life and present a compelling reason to upgrade from earlier releases all by themselves. The rest of the lesser 6.1 features are also largely targeted at improving the administration experience.

Here's the complete feature list:

  • Resource quota sets (RQS) — allows the administrator to define fine-grained limits over which users, projects, and/or groups can use what resources on what hosts, queues, and/or PEs. Much of what RQS provides you was previously only possible with large numbers of special-purpose queues
  • Boolean expressions — prior to 6.1, a resource request could use logical OR, and multiple requests were treated as a logical AND. 6.1 understands full boolean expressions, including logical OR, AND, NOT, and grouping. For example, "-l arch=sol-\*&!(\*-sparc\*|\*64)". What's even better is that the boolean expressions are understood by any command that handles complex strings, such as qhost and qstat, e.g. "qstat -f -q '(prod-\*|test-\*)&!\*-ny'"
  • Shared library path is "fixed" — with 6.1, the shared library path is no longer set by the settings file for Solaris and Linux hosts. Previously, sourcing the settings file would prepend the Grid Engine library directory to the shared library path, which could cause conflicts with applications that use local BDB or OpenSSL libraries. Unfortunately, that fix means that users of DRMAA applications must now explicitly add the Grid Engine library path to their shared library paths in order for DRMAA to work. (The Grid Engine binaries now use the compiled-in run path to find the Grid Engine libraries, so they don't need the shared library path. External DRMAA applications, on the other hand, are rarely able to use the same trick.)
  • -wd for qsub, qrsh, qsh, qalter, and qmon — allows you to specify the working directory. -cwd is effectively aliased to "-wd $CWD". (That means that if you include both in the same command, the later one overrides the former, as if they were both the same kind of switch.)
  • -xml for qhost — prints output in XML instead of formatted text
  • Source-level\* SSH tight integration
  • MySQL support for ARCo
  • OS Support
    • Support for MacOS X on Intel, Linux on IA64, FreeBSD (source-level\* only), and native 64-bit HP-UX 11
    • Solaris DTrace script — allows you to see potential bottlenecks in the master and scheduler using Solaris DTrace
    • Online job usage information for MacOS X, AIX, and HP-UX
    • Built-in resource data collection on AIX — previously required an extra load sensor script to be configured
  • DRMAA 1.0 for C and Java languages
  • JGDI early access — Java language API for Grid Engine management operations. Very unstable. This API becomes the JMX interface in 6.2
  • ARCo correctly accounts daily usage of long-running jobs — before 6.1u3, a long running job did not update the accounting database until it was done, meaning that a job that takes 3 months to complete would have zero resource usage in the accounting database until it completed, which could cause accounting errors in daily, weekly, or even monthly reports. With 6.1u3, the accounting database will be updated with resource usage information for long-running jobs on a daily basis.

\*Source-level support — some features are included only if you build the binaries yourself. Those features are considered "source-level".

What changed between Grid Engine 5.x and Grid Engine 6.0?

Grid Engine 6.0 was a huge step forward technologically from 5.3. 6.0 introduced cluster queues, ARCo, the Windows port, the multi-threaded qmaster, BDB, XML output, DRMAA, and much more. The gap between 5.3 and 6.0 is so large, that there really isn't a question of whether to upgrade. There is almost no use case that wouldn't benefit significantly from upgrading from 5.x to 6.x.

Below is the feature list, but it may be incomplete. I'm reconstructing this one from memory. As I find errors and omissions, I will correct them. (Let me know if you find any!)

  • Cluster queues — prior to 6.0, a queue could only be on a single host. 6.0 made it possible for a single queue to span multiple hosts, greatly reducing administrator burden
  • Accounting and Reporting Console — web-based front-end for an accounting database derived from the Grid Engine accounting file (also new with 6.0). ARCo makes it possible for an administrator to create canned queries for generating usage reports. ARCo was originally only available in the N1 Grid Engine product, but was released into open source with 6.0u8
  • Windows port — a port of the execution daemon and shepherd to Microsoft SFU (now known as SUA). Originally released only in the N1 Grid Engine 6.0u4 product, the Windows port still hasn't made it into the open source, but it will soon
  • Multi-threaded qmaster daemon — prior to 6.0 the qmaster was a single-threaded loop, meaning that a large influx of jobs could cause the qmaster to think its execution daemons had died. With 6.0, the qmaster is multi-threaded, freeing it from the constraints of a single giant control loop, and laying the foundation for significant scalability improvements
  • -xml for qstat — qstat prints output in XML instead of formatted text. Introduced in 6.0u2
  • DRMAA 0.97 C language binding — updated to 1.0 in 6.0u8
  • DRMAA 0.5 Java language binding — introduced in 6.0u4. Updated to 1.0 in 6.0u8
  • qsub -sync — qsub behaves synchronously for ease of scripting
  • Berkeley Database — 6.0 added both local and remote Berkeley database servers as spooling options instead of just flat files
  • New communications library — before 6.0, communications were handled by a separate single-threaded daemon called the commd. With 6.0, every daemon has its own built-in multi-threaded communications channel. The commd is retired
  • Automated installer — 6.0 adds a -auto switch to inst_sge that reads a config file and installs a cluster in a non-interactive mode. If remote access is properly configured, the auto installer can also install execution daemons on remote machines
  • Backslash line continuation — with 6.0 configuration files can use a backslash to continue an entry on the next line. The SGE_SINGLE_LINE environment variable disables this behavior to ease scripting
  • Resource reservation — 6.0u4 added resource reservation to prevent large jobs from being starved by smaller jobs. With resource reservation, a large job is able to collect resources until it has enough to run. While waiting for all needed resources to become available, idle resources may be backfilled with short jobs
  • qping — on the surface, it's a utility to tell if your Grid Engine daemons are still alive, but if you dig a little deeper, you'll discover that it can also be used to profile threads in the qmaster and debug communications traffic
  • qsub -shell — allows you to control whether Grid Engine will start a shell to start your job. The default is "yes". The alternative is to have Grid Engine execute your job directly, which has implications on environment variable interpretation and error conditions
  • backup/restore — with 6.0, the inst_sge script can be used to backup your cluster's configuration and state data and restore it later
  • target-specific qmake resource requests — with 6.0 it's possible to specify the resources to be requested by qmake jobs on a per-target basis

Thursday Jun 26, 2008

Xen and the Art of Cluster Scheduling

I keep finding myself talking about this paper, and I keep having to search for it. To save everyone the trouble in the future, here it is.

Where Not to Run

Reuti just reminded me of a nice application of one of the new features we added in Grid Engine 6.1. Before 6.1, resource requests were limited to simple boolean AND and OR expressions. For example, when submitting a job, a user might request "-l a=sol-x\*|sol-amd64 -l mem_free=4G -l exclusive=TRUE", meaning that the job must run on a Solaris i386 or AMD64 machine, and the machine must have at least 4GB of memory free, and the job wants exclusive access to the host. (AND is represented by multiple -l switches.) There was no way, however, to request, for example, Solaris on anything but x86.

Enter 6.1. With 6.1 we introduced full boolean expressions for resource requests. A user can now make requests like, "-l a=sol-\*&!sol-sparc\*". (The job must run on Solaris, but not on SPARC or SPARC64.) Even better, you can create complex boolean statements, like "-l a=(sol-\*&!\*-x86)|(lx2[46]-\*&!(\*-x86|\*-ia64))". (The job must run on either Solaris on anything but x86 or Linux on anything except x86 or Itanium.)

Now, to the title problem. In the email that prompted this post, Reuti responded to a question about how to submit a job to any host, except for one. With 6.1, the answer is simple. Grid Engine has a built-in complex called hostname, or h for short. Using the new boolean expressions, it's very simple to request "-l h=!badhostname", which allows the job to run on any machine except the one named badhostname.
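If you want to reason about which hosts such an expression would select, a toy evaluator can help. The Python sketch below is my own approximation, not Grid Engine's matcher: it assumes NOT binds tighter than AND, AND binds tighter than OR, and that wildcards behave like shell globs as implemented by Python's fnmatch module. The matches function and the host and architecture strings are made up for illustration.

```python
from fnmatch import fnmatchcase

OPERATORS = set("|&!()")


def matches(expr: str, value: str) -> bool:
    """Evaluate a boolean wildcard expression against a single string."""
    pos = 0

    def peek() -> str:
        return expr[pos] if pos < len(expr) else ""

    def parse_or() -> bool:
        nonlocal pos
        result = parse_and()
        while peek() == "|":
            pos += 1
            rhs = parse_and()        # always consume the right-hand side
            result = result or rhs
        return result

    def parse_and() -> bool:
        nonlocal pos
        result = parse_not()
        while peek() == "&":
            pos += 1
            rhs = parse_not()
            result = result and rhs
        return result

    def parse_not() -> bool:
        nonlocal pos
        if peek() == "!":
            pos += 1
            return not parse_not()
        if peek() == "(":
            pos += 1
            result = parse_or()
            if peek() != ")":
                raise ValueError(f"missing ')' at position {pos}")
            pos += 1
            return result
        start = pos                  # plain wildcard pattern
        while pos < len(expr) and expr[pos] not in OPERATORS:
            pos += 1
        return fnmatchcase(value, expr[start:pos])

    result = parse_or()
    if pos != len(expr):
        raise ValueError(f"unexpected {expr[pos]!r} at position {pos}")
    return result


hosts = ["node01", "badhostname", "node02"]
print([h for h in hosts if matches("!badhostname", h)])
```

The same function can be pointed at architecture strings, for example matches("sol-*&!sol-sparc*", "sol-amd64"), to check an expression before using it in a real request.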

Monday Jun 23, 2008

Announcing Grid Engine 6.2 Beta 2 Binaries

I'm a little slow on the draw, but in case you haven't noticed already, Grid Engine 6.2 Beta 2 is now ready for download! Go pull it down and give it a whirl!

You should also have a look at my slide deck from SuperComputing '07 talking about what's new in 6.2. You can find it on the OpenSolaris HPC Community's presentations page.

Wednesday Jun 04, 2008

Exclusive Host Usage In Grid Engine

A common thing to want to do with Grid Engine is to let users request that their jobs be run as the only thing on the host(s). The naïve approach would be for the user to request a number of slots equal to the number of slots offered by the hosts, but for a plethora of reasons, that doesn't work. (Among the reasons are that we might not have the same number of slots per host, and more importantly, unless we're using a parallel environment that is configured for fill-up allocation, a job can't request all the slots on a host.) Let's talk through an approach that does work.

[Update: exclusive host access will now be a built-in feature of Sun Grid Engine 6.2u3.]

Let's think through this problem. A natural approach for a Grid Engine administrator would be to create a special queue on each host to which all other queues are subordinated. When jobs are running in that queue, then all other jobs on the system are suspended. That approach solves the problem (mostly), but it's a bit heavy-handed. Whenever an exclusive job gets put on a host, other jobs on that host get suspended until it is finished. If there is a steady stream of exclusive jobs, non-exclusive jobs could starve.

To fix that problem, you could set up circular subordination: make the other queues subordinate to the exclusive queue and the exclusive queue subordinate to the other queues. The effect of this circular subordination is that there can never be jobs in both the exclusive queue and any other queue, preventing the starvation issue. (If a job is running in a non-exclusive queue, the exclusive queue is unavailable (suspended), and vice versa.)
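The mutual exclusion that falls out of circular subordination can be seen in a toy model. This is not Grid Engine code, just a sketch of the rule described above, assuming a suspend threshold of 1 and a hypothetical exclusive queue named exclusive.q next to all.q.

```python
# Toy model: each queue lists the queues it suspends once it has a running job.
SUBORDINATES = {
    "exclusive.q": ["all.q"],    # all.q is subordinate to exclusive.q
    "all.q": ["exclusive.q"],    # ...and exclusive.q is subordinate to all.q
}

def suspended_queues(running):
    """Return the queues suspended on a host, given running job counts per queue."""
    suspended = set()
    for queue, jobs in running.items():
        if jobs >= 1:  # threshold of 1: any running job suspends the subordinates
            suspended.update(SUBORDINATES.get(queue, []))
    return suspended

def can_start(queue, running):
    """A job may start in a queue only if that queue is not suspended."""
    return queue not in suspended_queues(running)

print(can_start("exclusive.q", {"all.q": 2, "exclusive.q": 0}))  # False
print(can_start("all.q", {"all.q": 2, "exclusive.q": 0}))        # True
print(can_start("all.q", {"all.q": 0, "exclusive.q": 1}))        # False
```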

Another problem that crops up is keeping non-exclusive jobs from accidentally ending up in the exclusive queue. That problem is easily solved with a forced resource assigned to the exclusive queue. With a forced resource, only jobs that either request the resource or explicitly request the exclusive queue can run in the exclusive queue.

There's another problem. How do you keep multiple exclusive jobs from all running in the exclusive queue on the same host? One answer would be to only give the exclusive queue one slot. That works for non-parallel jobs and parallel jobs that are only allowed to run one slave per host. It does not work for parallel or parametric jobs where more than one task could (or should) run on a single host.

One solution would be to change the forced resource to a forced integer consumable with a value equal to the number of slots. A job could then theoretically request as much of that resource as each host has, making sure that there isn't any left over for other jobs. Unfortunately, that won't work. First, we still have the problem that our hosts might not all have the same number of slots. We could try to solve that problem by setting the exclusive queue's consumable's value to 1. That guarantees that only one job can get the resource. The problem there is that a parallel job consumes one set of resources for each slave, so a parallel job with two slaves on a host will need 2 of our consumable. We could try requesting 1/<num_slaves_per_host> of the consumable for such a parallel job, so that after multiplying by the number of slaves on the host, we end up with a request for 1. That only works, however, if every host will be running the same number of slaves per host, and if we know how many that is ahead of time.

"But, wait!" you say. "The consumable is an integer, so even if we request less than 1, we should still consume the entire resource!" You'd think so, but you'd be wrong. It turns out that if one job requests half of our resource, another job can still be assigned the other half, defeating our strategy.
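The arithmetic behind the fractional request is easy to check by hand. The sketch below is a simplified model of per-slot consumption, not the scheduler's actual bookkeeping; it assumes a consumable with a capacity of 1 per host and a job that guesses two slaves per host.

```python
from fractions import Fraction

def fits(capacity, used, request_per_slot, slots_on_host):
    """True if a job's total consumption still fits in the host's consumable."""
    return used + request_per_slot * slots_on_host <= capacity

capacity = Fraction(1)   # the exclusive consumable is set to 1 per host
half = Fraction(1, 2)    # request 1/<num_slaves_per_host>, guessing 2 slaves per host

# Guess was right: 2 slaves land on the host and consume the whole resource.
used = half * 2
print(fits(capacity, used, half, 1))  # False: nothing left for a second job

# Guess was wrong: only 1 slave lands here, so half of the resource is left over.
used = half * 1
print(fits(capacity, used, half, 1))  # True: a second job sneaks onto the host
```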

In order to solve the problem, we need to fundamentally prevent the scheduler from looking at hosts that are running exclusive jobs. Well, one way to do that would be to add the host to a special host group, say @exclusive, and use a resource quota set rule to prevent jobs from being scheduled to machines in that host group. We can do that from a prolog on the exclusive queue:

qconf -aattr hostgroup hostlist $HOST @exclusive

(Note that you don't need to remove the host from its current set of queues or host groups. The resource quota set rule obviates that need.) Now, the circular subordination makes sure that jobs can run either in the exclusive queue or the other queues (but not both), our forced complex makes sure that only jobs that request exclusivity get it, and our prolog and resource quota set rule make sure that the scheduler cannot put multiple exclusive jobs on the same host. But, you guessed it, there's still a problem.
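If the prolog is written in something other than shell, the qconf call above can be wrapped like this. It's a minimal sketch: the function names and the node17 host name are made up, and the real prolog would take the host name from its own environment. The only Grid Engine command used is the qconf -aattr call from the text.

```python
import subprocess

HOSTGROUP = "@exclusive"  # host group name taken from the text above

def mark_host_command(host, hostgroup=HOSTGROUP):
    """Build the qconf call that adds a host to the exclusive host group."""
    return ["qconf", "-aattr", "hostgroup", "hostlist", host, hostgroup]

def mark_host_exclusive(host):
    """Run qconf; the caller must be allowed to change the grid configuration."""
    subprocess.run(mark_host_command(host), check=True)

print(mark_host_command("node17"))  # node17 is a made-up host name
```

Keeping the command construction separate from the call that runs it makes the prolog easy to dry-run on a machine without a qmaster.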

Once a job starts running in the exclusive queue, everything works as intended. The problem is that the scheduler may put more than one exclusive job on the same host at the same time. Because the host isn't added to the host group until an exclusive job starts, we need to keep the scheduler from scheduling multiple exclusive jobs at the same time. That's where load adjustments come in. We can create a new resource, say exclusive_load, and set a load threshold for the exclusive queue based on that resource, say exclusive_load=1. By adding something like exclusive_load=50 to the job_load_adjustments attribute in the scheduler config (and probably also setting the load_adjustment_decay_time to something small, like 0:0:30), we force the scheduler to consider a host's exclusive queue to be full (for the current scheduler interval) whenever a job is put there. After the decay interval, the host becomes available to the scheduler again, but by that time the prolog should have added it to the host group.
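To get a feel for the numbers, here is a toy model of the load adjustment. It assumes the artificial load decays linearly from its full value to zero over load_adjustment_decay_time; the scheduler's real computation is more involved, so treat this purely as an illustration of why exclusive_load=50 against a threshold of 1 keeps the queue instance blocked for nearly the whole decay window.

```python
def adjusted_load(adjustment, decay_seconds, seconds_since_dispatch):
    """Artificial load left from a recently dispatched job (linear decay assumed)."""
    remaining = max(0.0, 1.0 - seconds_since_dispatch / decay_seconds)
    return adjustment * remaining

def queue_blocked(load, threshold):
    """Treat the queue instance as overloaded once the load reaches the threshold."""
    return load >= threshold

ADJUSTMENT = 50  # exclusive_load=50 in job_load_adjustments
THRESHOLD = 1    # exclusive_load=1 as the exclusive queue's load threshold
DECAY = 30       # load_adjustment_decay_time of 0:0:30

for t in (0, 15, 29, 30):
    load = adjusted_load(ADJUSTMENT, DECAY, t)
    print(t, round(load, 2), queue_blocked(load, THRESHOLD))
```

With these values the queue instance stays blocked at 0, 15, and 29 seconds and opens up again at 30 seconds.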

QED (Whew!)

By the way, credit for the host group/load adjustment idea goes to Roland Dittel. Unfortunately, Roland doesn't have a blog, so I can't link to it. If you run into Roland, be sure to tell him how much you'd love to see him start blogging.

Defining the Process Owner For Prologs & Epilogs

I've been working in the Grid Engine team for over five years, and I'm still learning about features of the product that I never knew about. One more was just brought to my attention.

When configuring a queue in Grid Engine, you can configure a prolog and epilog. The prolog is a script or binary that is run by the shepherd before running a job. The epilog is the same, except that it comes after a job finishes. When you set the prolog and epilog for a queue, all jobs that run in that queue inherit that prolog and epilog. (A job cannot specify its own prolog and epilog, but look for that to change in a future release. (Actually, if you configure your queue's prolog and epilog to read a custom environment variable in the job's environment and exec the path it contains, you can effectively allow a job to specify its own prolog and epilog by setting them in the environment variables.))
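The environment variable trick could look something like the wrapper below, installed as the queue's prolog. It's a sketch under assumptions: the variable name JOB_PROLOG is made up, and the job is expected to put the variable into its environment itself (for example with qsub -v).

```python
import os

ENV_VAR = "JOB_PROLOG"  # hypothetical variable name; pick your own convention

def job_prolog_path(environ):
    """Return the job-specific prolog path from the environment, or None."""
    path = environ.get(ENV_VAR, "").strip()
    return path or None

def main():
    path = job_prolog_path(os.environ)
    if path is None:
        return  # the job did not ask for its own prolog
    os.execv(path, [path])  # replace this wrapper with the job's prolog

if __name__ == "__main__":
    main()
```

A job submitted with JOB_PROLOG=/home/me/setup.sh in its environment would get that script run in place of the wrapper; a job without the variable gets no prolog at all. A wrapper like this only makes sense when the prolog runs as the job owner, since exec'ing a user-supplied path from a privileged prolog would hand that privilege to the job.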

The epilog and prolog are well known tools. What I never noticed, though, is that not only can you specify a path, but you can also specify the user as whom the prolog or epilog should run. For example, if you set the queue's prolog to root@/path/to/my/prolog, the shepherd will execute the prolog as root, no matter who submitted the job. This is really helpful if your prolog and/or epilog needs to do something that has restricted access, such as mounting a directory or modifying the grid configuration. Because only the administrator can change the queue configuration, this feature is not a big security risk. (Actually, this feature is a compelling reason for restricting who has manager rights on your grid. Anyone who is recognized as a grid manager could change a queue to run a malicious prolog/epilog as root, submit a job to that queue, and compromise the system.)

Tuesday May 27, 2008

Making Grid Engine HA with Open High Availability Cluster and OpenSolaris

At the Open Source Grid & Cluster Conference a couple of weeks ago, Ashu from the Solaris Cluster team gave a 30-minute presentation about building a highly available Grid Engine cluster using the Open HA Cluster project. (Open HA Cluster is the open-sourced Solaris Cluster.) If you've got a spare 30 minutes, it's worth a look.

Intro to Grid Engine Queues

I just posted this information as an answer to a question on the Grid Engine users mailing list, but I thought it was useful enough to post here, too. If you're new to Grid Engine and trying to understand what a queue is, hopefully this explanation will help.

Let's take it from the top. A queue is where a job runs, not where it waits to run. When a job is in the qw (queued and waiting) state, it has not yet been assigned to a queue. A job that has been assigned to a queue is in the r (running) state (or transferring or suspended). In the pre-6.0 days, a queue could only exist on a single host. With 6.0, we introduced the idea of cluster queues. A cluster queue is a queue that can span multiple hosts. Under the covers, it's essentially a group of pre-6.0 queues, all with the same name, and each on a different host. With one caveat. A pre-6.0 queue is composed of a long list of required attributes, like slots, pe_list, user_lists, etc. Starting with 6.0, that long list of attributes is only required for the cluster queue. All of the queue instances that belong to that cluster queue inherit the attribute values from it. The queue instances are allowed, however, to override those attribute values with local settings. A common example of that is the slots attribute. When you install an execution daemon using the install_execd script, it will add a slots setting for the queue instance of all.q on that host (noted as all.q@host). And if it wasn't already clear, pre-6.0 "queue" == post-6.0 "queue instance". Post-6.0 "queue" == "cluster queue".
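The inherit-and-override relationship between a cluster queue and its queue instances can be pictured as a dictionary merge. This is only an analogy in Python, not how Grid Engine stores its configuration; the host names and attribute values are invented.

```python
CLUSTER_QUEUE = {   # attribute values defined once on the cluster queue
    "slots": 1,
    "pe_list": "make",
    "user_lists": "NONE",
}
OVERRIDES = {       # per-host settings, such as the slots value written by install_execd
    "host1": {"slots": 4},
    "host2": {"slots": 8},
}

def queue_instance(host):
    """Effective attributes of the queue instance all.q@host."""
    return {**CLUSTER_QUEUE, **OVERRIDES.get(host, {})}

print(queue_instance("host1")["slots"])    # 4, overridden for this host
print(queue_instance("host3")["slots"])    # 1, inherited from the cluster queue
print(queue_instance("host1")["pe_list"])  # make, inherited
```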

So, aside from governing the number of free slots on a host, what does a queue do? It controls the execution context of jobs that run in it. It determines what parallel environments are available, what file, memory, and CPU time limits should be applied, how the job should be started, stopped, suspended, and resumed, what the nice value of the job's processes is, etc.

Queues also have a concept of subordination. A queue that is subordinated to another queue will be suspended (along with all the jobs running in it) when jobs are running in that other queue. By default, the subordinated queue will be suspended when the other queue is full, but you can set the number of jobs required to suspend the subordinated queue. 1 is a common value, meaning that the subordinated queue should be suspended if any jobs are running in the other queue. Subordination trees can be arbitrarily complex. Circular subordination schemes are permitted, producing a sort of mutual exclusion effect.
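The suspend threshold boils down to a single comparison. The function below is a simplified sketch of the rule as described above (the names are made up), with the default of "suspend when the other queue is full" modeled by falling back to that queue's slot count.

```python
def is_suspended(jobs_in_superior, superior_slots, threshold=None):
    """Whether a subordinated queue is suspended by its superior queue on a host."""
    if threshold is None:
        threshold = superior_slots  # default: suspend only when the superior queue is full
    return jobs_in_superior >= threshold

print(is_suspended(2, superior_slots=4))               # False: superior queue not full
print(is_suspended(4, superior_slots=4))               # True: superior queue is full
print(is_suspended(1, superior_slots=4, threshold=1))  # True: any job suspends
```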

One other oddity to point out is that the slot count for a queue is not really a queue attribute. It's actually a queue-level resource (aka complex). To allow multiple queues on the same host to share that host's CPUs without oversubscribing, you can set the slots resource at the host level. Doing so sets a host-wide slot limit, and all queues on that host must then share the given number of slots, regardless of how many slots each queue (or queue instance) may try to offer.
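A quick way to see the effect of a host-level slots setting is to compute what a queue instance can still offer with and without it. This is a toy calculation, not scheduler code; priority.q is a hypothetical second queue on the same host.

```python
def free_slots(queue, queue_slots, used, host_slots=None):
    """Slots a queue instance can still offer on one host."""
    free_in_queue = queue_slots[queue] - used.get(queue, 0)
    if host_slots is None:
        return free_in_queue  # no host-wide limit configured
    free_on_host = host_slots - sum(used.values())
    return max(0, min(free_in_queue, free_on_host))

queue_slots = {"all.q": 4, "priority.q": 4}  # each queue instance offers 4 slots
used = {"all.q": 3}                          # three jobs already running in all.q

print(free_slots("priority.q", queue_slots, used))                # 4: host can be oversubscribed
print(free_slots("priority.q", queue_slots, used, host_slots=4))  # 1: host-wide cap applies
```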

Since we're talking about resources, let's talk about one of the common queue/resource configuration patterns. By default, there's nothing (other than access lists) to prevent a stray job from wandering into a queue. That's bad for queues that govern expensive resources or that represent special access, like a priority queue. To solve this problem, the most common approach is to create a resource that is forced. A forced resource (one that has FORCED in the requestable column) has the property that any queue or host that offers that resource can only be used by jobs requesting that resource (or that queue or host, in which case, the resource request is implicit). By assigning such queues forced resources, you can guarantee that stray jobs can't end up in the queue. A nice side effect is that you can also assign an urgency to that resource, meaning that jobs requesting that resource (or the queue to which it's assigned) gain (or lose) priority when being scheduled.
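The forced-resource rule can be written down as a small eligibility check. Again, this is a model of the behavior described above, not Grid Engine's matching code, and the queue and resource names (priority.q, high_priority) are invented for the example.

```python
def eligible(queue, job_requests, requested_queues=()):
    """Can a job run in this queue, considering only forced resources?"""
    if queue["name"] in requested_queues:
        return True  # asking for the queue by name implies the resource request
    return all(res in job_requests for res in queue["forced"])

priority_q = {"name": "priority.q", "forced": {"high_priority"}}
all_q = {"name": "all.q", "forced": set()}

print(eligible(priority_q, job_requests=set()))                      # False: stray job kept out
print(eligible(priority_q, job_requests={"high_priority"}))          # True
print(eligible(priority_q, set(), requested_queues={"priority.q"}))  # True
print(eligible(all_q, job_requests=set()))                           # True
```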

For more information on the above topics, I recommend looking at the man pages for queue_conf(5), complex(5), and sge_priority(5).



