Wednesday Oct 01, 2008


I recently rediscovered a hidden qconf option. I remember talking with the engineer when he implemented the option years ago, but because it was never documented, I forgot that it existed. A recent customer eval reminded me that it's there, and I think it's one worth sharing.

The hidden option is qconf -bonsai. It is a human-readable equivalent of qconf -sstree, whose output, as you'll know if you've ever looked at it, isn't even remotely human-readable. It prints the current share tree configuration using spacing to represent hierarchy.

Let's look at an example. This is the output from qconf -sstree for my home test cluster:

# qconf -sstree

This is the output from qconf -bonsai for the same cluster:

# qconf -bonsai

Now, as for why it's an undocumented feature, I suspect it's historical. It was originally added on a whim by one of the engineers and was just never fully embraced. I remember there being talk about changing the name of the switch and making it a documented feature, but I suspect that plan just got lost in the shuffle.

HPC Publications from the APSTC

Just wanted to point out the great papers available from the APSTC. Have a look at their list of publications, reports, and briefs.

Friday Sep 19, 2008

Now That's a Lot of Data

Now that the LHC is coming online, research organizations around the world are itching to get their hands on the data. Sharing some experimental data might not sound very hard, but there's actually quite a lot of complexity in getting that much data out of the collider and distributed to all the first and second tier sites. This short film explains it pretty well.

By the way, if you haven't seen the Large Hadron Rap yet, you should check it out:

Thursday Sep 18, 2008

Cheap at Twice the Price

We've just added pricing to the Sun Store for Sun Grid Engine 6.2. Just go to the Get It tab, scroll down to the media kit, perpetual licenses, or subscription licenses section, and click the Get It button. On the Sun Store page you'll find the complete pricing information for that option.

Wednesday Sep 03, 2008

Big Just Got Huge

I'm sure you're all familiar with the TACC Ranger system by now. 62,000 cores, 125TB RAM, a couple of petabytes of storage, #4 on the Top500 list, all running one gigantic Sun Grid Engine cluster under a single qmaster. As if that weren't exciting enough, I have some new amazingness to share.

I heard a couple of weeks ago from one of the Sun engineers onsite at TACC that they have successfully run a 60,000-core parallel job on Ranger. For those of you who are familiar with MPI, I'll give you a moment to recover. For those of you who aren't, a parallel job is a distributed application with multiple cooperating tasks running across multiple machines. In this case, it's a single application instance composed of cooperating tasks spread across several thousand servers. Yes, really.

Even more unbelievable, this feat was accomplished with a special branch of the Sun Grid Engine 6.1 release using the old SSH-based parallel job support. (The 6.2 Grid Engine release includes a more scalable "built-in" method for starting parallel jobs that blows the doors off the old RSH- or SSH-based model.) When TACC has completed the upgrade to 6.2, the scalability numbers will be outrageous!

This 60k-core job is part of a facial recognition application being developed by PNNL. The application is able to recognize faces in images in faster-than-real-time using the Ranger system at TACC. The reason the job didn't use all 62k+ cores in the system is administrative: there isn't a single queue that spans every host yet. That will be remedied soon, I'm told.

Sun HPC ClusterTools 8.0 Now Available

The 8.0 release of Sun HPC ClusterTools is finally out. ClusterTools is our Open MPI implementation. Aside from boosting the scalability and adding support for Linux, 8.0 includes:

  • Based on Open MPI 1.3
  • Plug-ins for Sun Grid Engine (SGE) and Portable Batch System (PBS)
  • Support for Linux (RHEL 4&5, SLES 9&10), Solaris 10, OpenSolaris
  • Support for Sun Studio compilers and tools and GNU/gcc toolchains on both Solaris and Linux OSes
  • MPI profiling support with Sun Studio Analyzer, plus support for VampirTrace and MPI PERUSE
  • InfiniBand multi-rail support
  • Mellanox ConnectX InfiniBand support
  • DTrace provider support on Solaris and OpenSolaris
  • Enhanced performance and scalability, including processor affinity support
  • Support for InfiniBand, GbE, 10GbE, and Myrinet interconnects
  • Full MPI-2 standard compliance, including MPI I/O and one sided communication

Go check it out!

Friday Aug 22, 2008

Announcing Grid Engine 6.1 Update 5

Grid Engine 6.1 Update 5 is now ready and courtesy binaries are available for download. SGE 6.1u5 is a maintenance release that fixes bugs in the software, installation procedure, and man pages. For more information, see the announcement.

Thursday Aug 14, 2008

Feature Poll

I just posted a poll over on about what potential new Sun Grid Engine features are most important. Pop on over and share your opinion!

Monday Aug 11, 2008

Sun Grid Engine 6.2 Information

Since there's actually quite a lot of information out there about the Sun Grid Engine 6.2 release, I thought it might be useful to provide a single source for where to find it. (Actually, the completely revamped Sun Grid Engine is already a single source for this information, but you have to browse a bit to find it all.) Here ya go:

There are still a couple more things in the coming soon category. As they go live, I will update the above list.

Friday Aug 08, 2008

It's a Bouncing Baby 6.2!

Congratulations to the Sun Grid Engine team! The new 6.2 release is finally out. (Actually, it's been out for a couple of days. I'm just a tardy blogger.) To find out more about what's in it, check out my previous post, see the new, improved Sun Grid Engine product page, or listen to the podcast that Miha and I recorded. To download a copy of the software, pop on over to the Sun Download Center. The open source courtesy binaries will be made available shortly.

As if that were not exciting enough, there's also a chance to win a free t-shirt! Andy, our non-blogging engineering manager (Encourage him to start blogging next time you see him!), has put a bounty on 6.2 production clusters. For more information, see Chris' post on or Andy's original email. Act now! Supplies are limited!

Thursday Jul 31, 2008

Why Upgrade to 6.2?

In a previous post I gave a high-level overview of what features each new release of Grid Engine has brought to the table, including what's coming in 6.2. Since 6.2 is now just around the corner, I wanted to go into a bit more detail on why you want to be the first kid on your block to upgrade.

Let's just go through the features in detail, one by one:

Advance Reservation

The reason for advance reservation is that sometimes it's important to coordinate the availability of compute resources with external factors, such as people, facility, and/or equipment availability. If, for example, you're trying to process data from some celestial event during the event to help further focus the data gathering, you want the compute resources available while the event is occurring. That is exactly what advance reservation enables.

With 6.2, we introduce three new commands: qrsub, qrdel, and qrstat. qrsub lets users create new advance reservations. A reservation must have a duration or an end time. If a reservation does not request a certain start time, the start time is assumed to be now. When a user runs qrsub, the scheduler will attempt to insert the reservation into its resource schedule. If there's room, the reservation will be granted and assigned an id. If the resources are not available at that time, the reservation will be denied.

Once a user has been granted a reservation, there are several things he can do with it. qsub now has an option that allows users to submit a job to a given reservation. If the reservation is not yet active, i.e. it's for a future time, the job will remain pending until the reservation's start time. A job submitted to a reservation can only run on the resources that were assigned to the reservation. If a job submitted to a reservation is still running when that reservation ends, it will automatically be terminated. When the reservation is first requested, the requesting user can include a list of users and groups who are also allowed to use the reservation. Any user in that list is allowed to submit jobs to a reservation. An advance reservation could alternatively be used to block off a set of machines for some out-of-band purpose, such as taking them down for maintenance or logging into them directly to do some work.
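To make the workflow above concrete, here is a small Python sketch that only builds the command lines and runs nothing. The option letters (-a, -d, and -u for qrsub, -ar for qsub) are quoted from memory of the 6.2 man pages, so treat them as assumptions and check qrsub(1) and qsub(1) on your own installation. The helper names, user names, script name, and reservation id are made up for the example.

```python
from datetime import datetime


def qrsub_args(start, duration, users=()):
    """Build an argument list for qrsub. Illustration only: nothing is executed."""
    args = ["qrsub"]
    if start is not None:
        # Assumed format for the start time: [[CC]YY]MMDDhhmm
        args += ["-a", start.strftime("%Y%m%d%H%M")]
    # A reservation needs a duration or an end time; this sketch uses a duration.
    args += ["-d", duration]
    if users:
        # Other users who may submit jobs into the reservation.
        args += ["-u", ",".join(users)]
    return args


def qsub_into_reservation_args(ar_id, script):
    """Build an argument list that submits a job into an existing reservation."""
    return ["qsub", "-ar", str(ar_id), script]


print(" ".join(qrsub_args(datetime(2008, 10, 1, 9, 30), "02:00:00", ["alice", "bob"])))
# qrsub -a 200810010930 -d 02:00:00 -u alice,bob
print(" ".join(qsub_into_reservation_args(42, "render.sh")))
# qsub -ar 42 render.sh
```

Leaving out the start time (passing None here) corresponds to the "start now" case described above.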

Once a reservation is no longer needed, the creating user can delete it using the qrdel command. Once a reservation is deleted, it's gone. If a user needs to recreate the reservation, she will have to effectively create a new reservation requesting the same (or similar) resources.

In order to see the scheduler's master reservation plan, users can run the qrstat command. qrstat shows what resources are reserved when.

In the time between when a reservation is created and when the reservation becomes active, the scheduler will attempt to backfill the resources with jobs with durations that fit into the available time window. By default, the scheduler will not backfill with jobs that do not specify a wallclock time limit.

There are a couple of limits on users' ability to create reservations. First, a new scheduler parameter controls the maximum number of allowed reservations. Second, reservations can only be made on resources that the scheduler can determine will be available at the desired time of the reservation. The scheduler knows that the resource will be available either 1) because the resource is currently unused, or 2) because the job currently running on the resource has a wallclock time limit that says the job will end before the reservation is supposed to begin.
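The backfill rule and the two reservation limits boil down to a few comparisons. The Python sketch below is a toy model of that reasoning, not the scheduler's code: the Slot class, the function names, and the use of plain integer seconds for times are all assumptions made for the illustration, and a job that ends exactly when the reservation starts is treated as fitting.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Slot:
    """One schedulable slot (hypothetical model, not a Grid Engine structure)."""
    busy: bool = False
    # Time at which the running job must end, or None when the job has no
    # wallclock limit and its end time is therefore unknown.
    job_deadline: Optional[int] = None


def slot_reservable(slot: Slot, reservation_start: int) -> bool:
    """A slot can be reserved if it is idle or provably free by the start time."""
    if not slot.busy:
        return True
    return slot.job_deadline is not None and slot.job_deadline <= reservation_start


def can_grant(slots: Sequence[Slot], needed: int, reservation_start: int,
              active_reservations: int, max_reservations: int) -> bool:
    """Apply both limits: the reservation cap and provable availability."""
    if active_reservations >= max_reservations:
        return False
    usable = sum(1 for s in slots if slot_reservable(s, reservation_start))
    return usable >= needed


def can_backfill(now: int, runtime_limit: Optional[int], reservation_start: int) -> bool:
    """A job may use reserved resources early only if it is sure to finish in time."""
    if runtime_limit is None:
        return False  # no wallclock limit: not backfilled by default
    return now + runtime_limit <= reservation_start
```

With one idle slot, one slot whose job must end at t=500, and one slot running a job with no limit, a two-slot reservation starting at t=600 can be granted, but the same request for t=400 cannot.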

Multi-clustering with Service Domain Manager (Project Hedeby)

Service Domain Manager (SDM), or Project Hedeby, is a framework for managing resource sharing among services. It enables an administrator to define service level objectives (SLOs) that govern the distribution of resources. As workloads change, resources are automatically migrated from one service to another, in order to continue satisfying the SLOs. A service in this context is any application that can scale across multiple nodes.

With Grid Engine 6.2 we're including a feature-limited version of SDM to enable a form of multi-clustering. Using SDM, several Grid Engine clusters can share their resources. The clusters' users continue to use the individual clusters as before. Some just get larger, while others get smaller, as workloads change.

The multi-clustering capability of 6.2 has multiple applications. Any time that you need to have multiple masters for any reason, 6.2's multi-clustering will enable you to combine the individual clusters into a larger "meta-cluster," which will help you keep your resource utilization up.

Scalability to 63,000 cores

A tremendous amount of work has gone into scalability improvements for 6.2. Let's talk about them one at a time.

Scheduler as a thread

Perhaps the biggest change with 6.2 is that the scheduler is no longer its own process. Instead, it's another thread in the qmaster. By bringing the scheduler into the qmaster, we've laid the groundwork for significant scalability improvements. Instead of having to communicate all of the necessary data over the wire between the qmaster and scheduler, the scheduler is able to simply share the qmaster's internal data structures. For now, the performance impact is very modest, but as we're able to refine the data locking, we should be able to squeeze out some significant performance gains.

Improved interactive job support

Prior to 6.2, interactive jobs required external binaries to run. By default, qrsh used rlogin/rsh and rlogind/rshd to run an interactive job. For example, the command form of qrsh would submit rshd as a job and then fork off an rsh to connect to that rshd. The actual running of the command is handled by rsh/rshd rather than Grid Engine. That has several disadvantages. First, even if Grid Engine is installed securely, the rsh/rshd connection isn't secure. Second, rsh has a limit of 512 ports, meaning that a single machine cannot start more than 512 interactive jobs. Because Grid Engine handles tight integration of parallel jobs via the interactive job framework, that means rsh limits the size of parallel jobs to 512 slave tasks.

We do, however, let you configure which interactive job utilities to use. For example, you can use ssh/sshd to overcome the two problems mentioned above, but that creates new problems. First, because ssh is secure, it's slower. All communications have to be encrypted and then decrypted, meaning more time is spent just processing the traffic. Also, in order for Grid Engine to keep accurate accounting logs, the sshd binary has to be patched for Grid Engine. (Grid Engine actually uses its own patched rshd by default.)

With 6.2, we offer a new option for interactive job support. By default with 6.2, interactive jobs are handled through a built-in process. Instead of submitting an rshd and forking off an rsh to connect to it, all of the communications are handled internally by Grid Engine. qrsh talks to the Grid Engine daemon on the execution node, which forwards the traffic to/from the job shell. No external binaries, no external communications. All of the above problems go away. As an added bonus, interactive jobs now get a PTY, which will make a lot of people's lives easier. The only downside to the new interactive job support is that X11 forwarding is not yet supported. (I should point out that X11 forwarding is different from xhosting. xhosting is supported.) Using the new interactive job support, 10k+ task parallel jobs should be no problem.

Streamlined communications

When you're trying to support a cluster with thousands or tens of thousands of nodes, even the most innocuous network chatter can become a big problem. With 6.2 we've done our best to reduce that chatter to a minimum. One thing that has been done is a review of the qmaster/execd communications to eliminate any unnecessary messages. Another big change is that the execution daemons now only report resource state diffs rather than reporting the entire state of all resources, even the ones that never change, every load report interval. In small clusters, you may not see the difference, but in huge clusters, the difference is noteworthy.
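The idea behind diff-based load reports fits in a few lines of Python. This is only a sketch of the concept, not the execd's wire format: the function names are invented, and the load values in the example are illustrative.

```python
def load_report_diff(previous, current):
    """Return only the values that are new or changed since the last report."""
    return {
        name: value
        for name, value in current.items()
        if name not in previous or previous[name] != value
    }


def apply_diff(state, diff):
    """Receiver side: merge a diff into the last known state of a host."""
    merged = dict(state)
    merged.update(diff)
    return merged


last_sent = {"arch": "sol-amd64", "num_proc": 4, "load_avg": 0.5}
now = {"arch": "sol-amd64", "num_proc": 4, "load_avg": 0.75}

diff = load_report_diff(last_sent, now)
print(diff)  # {'load_avg': 0.75}
```

Static values such as the architecture and processor count are sent once and then never again, which is where the savings come from on a cluster with tens of thousands of hosts.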

Other "large cluster" improvements

A variety of other scalability enhancements have been done, mostly with regard to reducing memory consumption, reducing qmaster startup time, and eliminating unnecessary overhead. Again, the effects on small clusters will be small, but large clusters will benefit tremendously.

Array Task Dependency

Since before I joined the team, Grid Engine has been able to manage job dependencies. A user can submit a job and specify that the job cannot be started until a set of other jobs have exited. This works for batch jobs, array jobs, parallel jobs, and even interactive jobs. In the case of array jobs, a job dependent on an array job must wait for all the array job's tasks to exit, and an array job that is dependent on another job cannot start any tasks until that other job has exited. If an array job depends on another array job, no task of the second array job can start until every task of the first array job has exited. For most purposes, that behavior is sufficient.

Imagine for a moment that you work for a visual effects company that uses Grid Engine to render video effects. (If your imagination is vivid, imagine you work for an Australian visual effects company that has done work for several blockbuster films.) In your day-to-day rendering, you have two choices for how to approach the task given the way Grid Engine works (before 6.2). One option is to have an array job per rendering step, with each job task representing a frame. You could then use job dependencies to make sure that step 2 doesn't start until step 1 finishes. That works, but if one frame takes a lot longer than the others to render, all the other frames are stuck in the current step when they could have moved on to the next step. Another option would be to have a batch job for each frame. That way, as soon as a frame finishes a step, it can move on to the next step, regardless of what step the other frames are on. That's less wasteful, but it's also considerably more difficult to manage (millions of jobs instead of tens), and it makes it hard to take advantage of special resources for individual steps. Yet another option would be to do the rendering as an array job of array jobs. That solves all the technical issues, but is practically impossible to manage.

What you'd really want if you were that visual effects company is the ability to have a task in one array job depend on a task in another array job. That way, you could submit each step as an array job where each task represents a frame, and each task could depend on the corresponding task in the previous step. That feature is exactly what 6.2 provides. (Actually, the feature was implemented and contributed by that not-so-imaginary Australian visual effects company.)

With 6.2 a user can declare that an array job's tasks are dependent on the tasks of another array job. Each task of the second job will then depend on the task of the first job with the same task number, i.e. job 2 task 1 will depend on job 1 task 1. In addition, array task dependencies support "chunking." Chunking means grouping tasks together for efficiency. For example, step one might be really lightweight, making it more efficient to have each task process three frames instead of just one. The way chunking is represented in an array task dependency is by the array job's step size. By default, array job tasks are numbered in increments of 1, i.e. 1, 2, 3, 4, 5, etc. It is possible, however, to declare a step size for the task numbers other than 1. A step size of 3 would result in tasks numbered 1, 4, 7, etc. In an array task dependency, if the corresponding task number in the previous job doesn't exist because of chunking, the dependency falls to the chunked task that contains the corresponding task number. For example, tasks 1, 2, and 3 from an array job with a step size of 1 might all depend on task 1 of the previous array job with a step size of 3. It works the other way around as well. Task 1 of an array job with a step size of 3 might depend on tasks 1, 2, and 3 of the previous array job with a step size of 1. It even works for uneven combinations, such as task 1 of an array job with a step size of 3 depending on tasks 1 and 3 of the previous array job with a step size of 2.
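The chunking rule is easier to follow with numbers in hand. The Python sketch below is one reading of the description above, not the scheduler's actual code: it assumes both array jobs cover the same task range starting at the same first task number, and that a task with step size s stands for the s consecutive task numbers beginning at its own. The function name and parameters are invented for the illustration.

```python
def predecessor_tasks(task_id, step, prev_step, first=1, last=None):
    """Task ids in the previous array job that `task_id` has to wait for.

    task_id   -- a task of the dependent job, which uses step size `step`
    prev_step -- step size of the previous array job
    first     -- first task number of both jobs (assumed identical)
    last      -- last task number covered by both jobs, or None for no bound
    """
    # Range of task numbers that this chunked task stands for.
    low = task_id
    high = task_id + step - 1
    if last is not None:
        high = min(high, last)

    # First task of the previous job whose chunk contains `low`.
    prev = first + ((low - first) // prev_step) * prev_step

    deps = []
    while prev <= high:
        deps.append(prev)
        prev += prev_step
    return deps


# The three cases from the text:
print(predecessor_tasks(2, step=1, prev_step=3))  # [1]
print(predecessor_tasks(1, step=3, prev_step=1))  # [1, 2, 3]
print(predecessor_tasks(1, step=3, prev_step=2))  # [1, 3]
```

On the command line, this kind of dependency is requested with qsub's -hold_jid_ad option in 6.2, and the step size is the part after the colon in -t (for example, -t 1-9:3). Check qsub(1) on your installation for the details.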

ARCo enhancements

One of the major areas of focus for 6.2 was improving the Accounting and Reporting Console (ARCo). In previous releases, the ARCo infrastructure was a little pokey, and it was not very difficult to produce a stream of accounting data fast enough to completely swamp the DBWriter component. (The DBWriter's job is to transfer data from the accounting logs into the ARCo database.) With 6.2 that has been fixed, along with a number of other performance-related issues. ARCo is now fast, and it will continue to get better. Another important change for ARCo is that you can now have more than one cluster write into the same database without conflict. ARCo will even let you run queries against the data from all the clusters. That is important, of course, because of the new multi-clustering support that was also added in 6.2 (as described above).

Solaris Enhancements

Every release we add a few more features to take advantage of what the Solaris 10 operating environment has to offer. In 6.1 we added a DTrace script and declared support for Solaris Zones and ZFS. With 6.2 we're adding support for Service Tags and the Service Management Framework.

Service Tags are a way for you as an administrator to keep track of everything in your network. When a machine has service tags enabled, it responds to broadcast requests for information from the service tag client. When you install 6.2, you have the option of allowing Grid Engine to register a service tag on the master machine to indicate that you have Grid Engine running in your network. You can then see that information from the service tags client. You can also upload that information to Sun's service management repository, and we'll keep track of it for you.

The Service Management Framework (SMF) is a replacement for the traditional UNIX init scripts. Instead of startup and shutdown scripts, services get an entry in a services database that lists how to start and stop the service, among other things. When 6.2 is installed on a Solaris host that supports SMF, if you choose to have Grid Engine start when the machine boots, the installer will create an SMF entry instead of an init script. If you need to make changes to the way Grid Engine is started, you can edit the $SGE_ROOT/$SGE_CELL/common/sgemaster file just like you would have with the old init scripts. Perhaps the most useful part of SMF is that you get an automatic watchdog for your services. If one of your Grid Engine daemons dies or is killed (not using qconf or the sgemaster and sgeexecd scripts), the watchdog process will restart the service automatically.

Are you totally stoked now, or what?

Pretty impressive feature list, eh? And that list didn't even include the myriad major and minor bug fixes that are delivered with 6.2. If you can't wait to try it out, you have two options. First, the beta2 courtesy binaries are still available on the open source site. Second, you can grab the V62_TAG tag from the CVS repository and build it yourself. Have fun, and let us know how it turns out!

Wednesday Jul 16, 2008

Why Upgrade?

One of the questions that comes up often in Grid Engine land is, "Why should I upgrade?" Now that 6.2 is almost ready, I thought now would be a good time to provide a clear and concise answer to the question.

Why upgrade to Grid Engine 6.2?

The watchword for 6.2 is scalability. If you're running a large (multi-thousand host) cluster, you really want to be running 6.2. A lot has been done to address scalability in large clusters. Advance reservation is another headliner. 6.2 offers you the ability to reserve a set of resources at a specific time. The other big-ticket item for 6.2 is multi-clustering. Using a feature-limited release of Project Hedeby (AKA Haithabu, Service Domain Manager (SDM)), Grid Engine 6.2 offers you the ability to set up several independent Grid Engine 6.2 clusters that are also able to share resources. As one cluster gets overloaded while other clusters are idle, resources will automatically be migrated from the underused clusters to the overloaded cluster.

Here's the complete feature list:

  • Scalability to 63,000 cores
    • Streamlined communications between qmaster and execution daemons
    • The scheduler is no longer a separate process and is now a thread in the qmaster
    • More efficient resource matching process in the scheduler
    • Reduced qmaster startup time
    • Reduced qmaster memory requirements for large clusters
    • ARCo scalability improvements — faster DBWriter and faster queries
  • Advance reservation — reserve resources for a given period of time. qsub now lets you submit jobs into a pre-existing reservation
  • New interactive job support — with 6.2, you can now configure interactive jobs (and hence parallel slave tasks) to communicate with the client through the existing Grid Engine communications channels, instead of having to fork off an rsh/rshd (or ssh/sshd, telnet/telnetd, etc.) pair
  • Administration improvements
    • ARCo installation documentation is much better
    • Support for Solaris SMF (in addition to traditional rc scripts)
    • Support for Sun Service Tags on Solaris and Linux
  • JMX interface for the qmaster — the qmaster now offers a JMX management interface that enables the complete set of Grid Engine management operations. The API is, however, unstable and will change, probably significantly
  • Multi-clustering
    • Project Hedeby will enable the automatic migration of resources from underloaded clusters to overloaded clusters. Service Level Objectives configured for each cluster determine the boundaries of overloaded and underloaded, and policies govern the relative importance of the clusters.
    • ARCo now supports multiple clusters in the same database using the same web interface

What was introduced with Grid Engine 6.1?

The two big wins for 6.1 are resource quota sets and boolean expressions. Both go a long way towards simplifying the administrator's life and present a compelling reason to upgrade from earlier releases all by themselves. The rest of the lesser 6.1 features are also largely targeted at improving the administration experience.

Here's the complete feature list:

  • Resource quota sets (RQS) — allows the administrator to define fine-grained limits over which users, projects, and/or groups can use what resources on what hosts, queues, and/or PEs. Much of what RQS provides you was previously only possible with large numbers of special-purpose queues
  • Boolean expressions — prior to 6.1, a resource request could use logical OR, and multiple requests were treated as a logical AND. 6.1 understands full boolean expressions, including logical OR, AND, NOT, and grouping. For example, "-l arch=sol-\*&!(\*-sparc\*|\*64)". What's even better is that the boolean expressions are understood by any command that handles complex strings, such as qhost and qstat, e.g. "qstat -f -q '(prod-\*|test-\*)&!\*-ny'"
  • Shared library path is "fixed" — with 6.1, the shared library path is no longer set by the settings file for Solaris and Linux hosts. Previously, sourcing the settings file would prepend the Grid Engine library directory to the shared library path, which could cause conflicts with applications that use local BDB or OpenSSL libraries. Unfortunately, that fix means that users of DRMAA applications must now explicitly add the Grid Engine library path to their shared library paths in order for DRMAA to work. (The Grid Engine binaries now use the compiled-in run path to find the Grid Engine libraries, so they don't need the shared library path. External DRMAA applications, on the other hand, are rarely able to use the same trick.)
  • -wd for qsub, qrsh, qsh, qalter, and qmon — allows you to specify the working directory. -cwd is effectively aliased to "-wd $CWD". (That means that if you include both in the same command, the later one overrides the former, as if they were both the same kind of switch.)
  • -xml for qhost — prints output in XML instead of formatted text
  • Source-level\* SSH tight integration
  • MySQL support for ARCo
  • OS Support
    • Support for MacOS X on Intel, Linux on IA64, FreeBSD (source-level\* only), and native 64-bit HP-UX 11
    • Solaris DTrace script — allows you to see potential bottlenecks in the master and scheduler using Solaris DTrace
    • Online job usage information for MacOS X, AIX, and HP-UX
    • Built-in resource data collection on AIX — previously required an extra load sensor script to be configured
  • DRMAA 1.0 for C and Java languages
  • JGDI early access — Java language API for Grid Engine management operations. Very unstable. This API becomes the JMX interface in 6.2
  • ARCo correctly accounts daily usage of long-running jobs — before 6.1u3, a long running job did not update the accounting database until it was done, meaning that a job that takes 3 months to complete would have zero resource usage in the accounting database until it completed, which could cause accounting errors in daily, weekly, or even monthly reports. With 6.1u3, the accounting database will be updated with resource usage information for long-running jobs on a daily basis.

\*Source-level support — some features are included only if you build the binaries yourself. Those features are considered "source-level".
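The "later one overrides the former" behavior of -wd and -cwd in the list above can be modeled in a few lines of Python. This is a toy option scan written for illustration, not how the Grid Engine clients actually parse their arguments, and the function name is made up.

```python
def effective_workdir(args, cwd):
    """Return the working directory a job would get, or None if none was requested."""
    workdir = None
    i = 0
    while i < len(args):
        if args[i] == "-cwd":
            # -cwd behaves like -wd with the submitter's current directory.
            workdir = cwd
        elif args[i] == "-wd" and i + 1 < len(args):
            workdir = args[i + 1]
            i += 1  # skip the directory argument
        i += 1
    return workdir


print(effective_workdir(["-cwd", "-wd", "/scratch/run1"], "/home/me"))  # /scratch/run1
print(effective_workdir(["-wd", "/scratch/run1", "-cwd"], "/home/me"))  # /home/me
```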

What changed between Grid Engine 5.x and Grid Engine 6.0?

Grid Engine 6.0 was a huge step forward technologically from 5.3. 6.0 introduced cluster queues, ARCo, the Windows port, the multi-threaded qmaster, BDB, XML output, DRMAA, and much more. The gap between 5.3 and 6.0 is so large that there really isn't a question of whether to upgrade. There is almost no use case that wouldn't benefit significantly from upgrading from 5.x to 6.x.

Below is the feature list, but it may be incomplete. I'm reconstructing this one from memory. As I find errors and omissions, I will correct them. (Let me know if you find any!)

  • Cluster queues — prior to 6.0, a queue could only be on a single host. 6.0 made it possible for a single queue to span multiple hosts, greatly reducing administrator burden
  • Accounting and Reporting Console — web-based front-end for an accounting database derived from the Grid Engine accounting file (also new with 6.0). ARCo makes it possible for an administrator to create canned queries for generating usage reports. ARCo was originally only available in the N1 Grid Engine product, but was released into open source with 6.0u8
  • Windows port — a port of the execution daemon and shepherd to Microsoft SFU (now known as SUA). Originally released only in the N1 Grid Engine 6.0u4 product, the Windows port still hasn't made it into the open source, but it will soon
  • Multi-threaded qmaster daemon — prior to 6.0 the qmaster was a single-threaded loop, meaning that a large influx of jobs could cause the qmaster to think its execution daemons had died. With 6.0, the qmaster is multi-threaded, freeing it from the constraints of a single giant control loop, and laying the foundation for significant scalability improvements
  • -xml for qstat — qstat prints output in XML instead of formatted text. Introduced in 6.0u2
  • DRMAA 0.97 C language binding — updated to 1.0 in 6.0u8
  • DRMAA 0.5 Java language binding — introduced in 6.0u4. Updated to 1.0 in 6.0u8
  • qsub -sync — qsub behaves synchronously for ease of scripting
  • Berkeley Database — 6.0 added both local and remote Berkeley database servers as spooling options instead of just flat files
  • New communications library — before 6.0, communications were handled by a separate single-threaded daemon called the commd. With 6.0, every daemon has its own built-in multi-threaded communications channel. The commd is retired
  • Automated installer — 6.0 adds a -auto switch to inst_sge that reads a config file and installs a cluster in a non-interactive mode. If remote access is properly configured, the auto installer can also install execution daemons on remote machines
  • Backslash line continuation — with 6.0 configuration files can use a backslash to continue an entry on the next line. The SGE_SINGLE_LINE environment variable disables this behavior to ease scripting
  • Resource reservation — 6.0u4 added resource reservation to prevent large jobs from being starved by smaller jobs. With resource reservation, a large job is able to collect resources until it has enough to run. While waiting for all needed resources to become available, idle resources may be backfilled with short jobs
  • qping — on the surface, it's a utility to tell if your Grid Engine daemons are still alive, but if you dig a little deeper, you'll discover that it can also be used to profile threads in the qmaster and debug communications traffic
  • qsub -shell — allows you to control whether Grid Engine will start a shell to start your job. The default is "yes". The alternative is to have Grid Engine execute your job directly, which has implications on environment variable interpretation and error conditions
  • backup/restore — with 6.0, the inst_sge script can be used to backup your cluster's configuration and state data and restore it later
  • target-specific qmake resource requests — with 6.0 it's possible to specify the resources to be requested by qmake jobs on a per-target basis

Sunday Jun 29, 2008

I Love German Metal

There is something I just love about German heavy metal. American heavy metal tends to be a little much for me. It just comes off overwrought and corny. Metallica's Black Album is my favorite of the lot. But there's just something hopelessly charming about German metal. Rammstein is a good example. You may remember them, as they caught some airtime in the US around 2000 with Du Hast. (It was the only song on alternative radio in German, so it was kinda hard to miss.) There's just something about the tone of the music that makes it so much more palatable to me than the American equivalent. It probably also helps that I speak German.

Last time I was in Regensburg, my colleagues took me out to a metal bar, and I found a new favorite song: Wir Werden Alle Sterben by Knorkator. If you love heavy metal, it's definitely worth the $0.99 to download. The gist of the lyrics is that the singer had a conversation with his manager, in which his manager suggested that he needed to write a song to uplift the spirits of his fans. This song is the result of that conversation. The title and first line of the chorus translate to "We're all going to die." Quaint, eh? The juxtaposition of the song's super heavy metal riffs with an upbeat, bouncy chorus singing that we're all going to die is just too much to resist. Go check it out!

Thursday Jun 26, 2008

Xen and the Art of Cluster Scheduling

I keep finding myself talking about this paper, and I keep having to search for it. To save everyone the trouble in the future, here it is.

Where Not to Run

Reuti just reminded me of a nice application of one of the new features we added in Grid Engine 6.1. Before 6.1, resource requests were limited to simple boolean AND and OR expressions. For example, when submitting a job, a user might request "-l a=sol-x\*|sol-amd64 -l mem_free=4G -l exclusive=TRUE", meaning that the job must run on a Solaris i386 or AMD64 machine, and the machine must have at least 4GB of memory free, and the job wants exclusive access to the host. (AND is represented by multiple -l switches.) There was no way, however, to request, for example, Solaris on anything but x86.

Enter 6.1. With 6.1 we introduced full boolean expressions for resource requests. A user can now make requests like "-l a=sol-\*&!sol-sparc\*". (The job must run on Solaris, but not on SPARC or SPARC64.) Even better, you can create complex boolean statements, like "-l a=(sol-\*&!\*-x86)|(lx2[46]-\*&!(\*-x86|\*-ia64))". (The job must run on either Solaris on anything but x86 or Linux on anything except x86 or Itanium.)

Now, to the title problem. In the email that prompted this post, Reuti responded to a question about how to submit a job to any host, except for one. With 6.1, the answer is simple. Grid Engine has a built-in complex called hostname, or h for short. Using the new boolean expressions, it's very simple to request "-l h=!badhostname", which allows the job to run on any machine except the one named badhostname.
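To check what an expression like the ones in this post would select, it can help to evaluate it by hand against a few architecture strings or host names. The Python sketch below is a small stand-in evaluator, not Grid Engine's matcher: it assumes that ! binds tighter than &, which binds tighter than |, it uses Python's fnmatch for the wildcards, and it does not handle a ! inside square brackets. The function name is invented.

```python
import re
from fnmatch import fnmatchcase

# Operators are single characters; everything else is a wildcard pattern.
_TOKEN = re.compile(r"[|&!()]|[^|&!()]+")


def matches(expression, value):
    """Evaluate a boolean wildcard expression against one string value."""
    tokens = _TOKEN.findall(expression)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        return token

    def parse_or():
        result = parse_and()
        while peek() == "|":
            take()
            right = parse_and()  # always parse, then combine
            result = result or right
        return result

    def parse_and():
        result = parse_not()
        while peek() == "&":
            take()
            right = parse_not()
            result = result and right
        return result

    def parse_not():
        token = take()
        if token == "!":
            return not parse_not()
        if token == "(":
            result = parse_or()
            if take() != ")":
                raise ValueError("expected ')'")
            return result
        return fnmatchcase(value, token)

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing input")
    return result


hosts = ["node01", "badhostname", "node02"]
print([h for h in hosts if matches("!badhostname", h)])  # ['node01', 'node02']
print(matches("sol-*&!sol-sparc*", "sol-amd64"))         # True
print(matches("sol-*&!sol-sparc*", "sol-sparc64"))       # False
```

The expression is passed without the leading resource name, so "-l h=!badhostname" becomes matches("!badhostname", hostname).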



