Tuesday Jan 18, 2011

Full Speed Ahead

Last week I had the opportunity to do a webcast with Moe Fardoost, our marketing director, on the future direction for the Oracle Grid Engine product. If you're curious about where Grid Engine is headed, take a look. For the very lazy among you, the summary is that we're focused on three major themes: core infrastructure and feature improvements, tighter integrations with other Oracle products, and a richer cloud feature set.

Thursday Dec 23, 2010

Oracle Grid Engine: Changes for a Bright Future at Oracle

For the past decade, Oracle Grid Engine has been helping thousands of customers marshal the enterprise technical computing processes at the heart of bringing their products to market. Many customers have achieved outstanding results with it via higher data center utilization and improved performance. The latest release of the product provides best-in-class capabilities for resource management including: Hadoop integration, topology-aware scheduling, and on-demand connectivity to the cloud.

Oracle Grid Engine has a rich history, from helping BMW Oracle Racing prepare for the America’s Cup to helping isolate and identify the genes associated with obesity; from analyzing and predicting the world's financial markets to producing the digital effects for the popular Harry Potter series of films. Since 2001, the Grid Engine open source project has made Oracle Grid Engine functionality available for free to open source users. The Grid Engine open source community has grown from a handful of users in 2001 into the strong, self-sustaining community that it is now.

Today, we are entering a new chapter in Oracle Grid Engine’s life. Oracle has been working with key members of the open source community to pass the torch for maintaining the open source code base to the Open Grid Scheduler project hosted on SourceForge. This transition will allow the Oracle Grid Engine engineering team to focus their efforts more directly on enhancing the product. In the coming days, we will take definitive steps to roll out this transition. To ensure on-going communication with the open source community, we will provide the following services:

  • Upon the decommissioning of the current open source site on December 31st, 2010, we will begin to transition the information on the open source project to Oracle Technology Network’s home page for Oracle Grid Engine. This site will ultimately contain the resources currently available on the open source site, as well as a wealth of additional product resources.
  • The Oracle Grid Engine engineering team will be available to answer questions and provide guidance regarding the open source project and Oracle Grid Engine via the online product forum.
  • The Open Grid Scheduler project will carry on the tradition of the Grid Engine open source project. While the Open Grid Scheduler project will remain independent of the Oracle Grid Engine product, it will have the support of the Oracle team, including making available artifacts from the original Grid Engine open source project.

Oracle is committed to enhancing Oracle Grid Engine as a commercial product and has an exciting road map planned. In addition to developing new features and functionality to continue to improve the customer experience, we also plan to release game-changing integrations with several other Oracle products, including Oracle Enterprise Manager and Oracle Coherence. Also, as Oracle's cloud strategy unfolds, we expect that the Oracle Grid Engine product's role in the overall strategy will continue to grow. To discuss our general plans for the product, we would like to invite you to join us for a live webcast on Oracle Grid Engine’s new road map. Click here to register.

Next Steps:

Thank you to everyone in the community for their support over the last decade and their continued support going forward!

Tuesday Nov 30, 2010

JARYBA Achieves Oracle Validated Integration and Announces Support for Oracle Grid Engine With SmartSuspend v2.0

I am very pleased to announce that we've signed up our first partner to the Oracle Validated Integration program for Oracle Grid Engine. Jaryba's SmartSuspend product is a clever way to allow jobs suspended by Grid Engine to release all of the resources they're holding, even memory and FLEXlm licenses. And it works without requiring any changes to the applications. You don't even have to recompile.

If you've ever run into the issue of running out of swap space because of preempted jobs holding onto their memory, SmartSuspend might be the answer you're looking for. It works by inserting itself between the application and the OS so that it can track the memory and license usage. When a job is suspended, SmartSuspend first uses its knowledge of the resources requested by the application to let all of those resources go. When the job is resumed, SmartSuspend first attempts to recapture those resources before allowing the application to run. From the application's perspective, nothing changes. From the administrator's perspective, the difference is huge.

Friday Nov 12, 2010

NOLA Bound

After much internal... discussion, Oracle has decided to have a booth at SC10 after all, and as usual, I will be there waving the Grid Engine banner. If you're at the show, please come by and say hi. I believe they've scheduled some office hours of sorts for me on Tuesday afternoon, but I should be hanging around the Oracle booth for most of the show. (Except Thursday, so don't wait until the last minute!) I think I'll also be making an appearance at the Univa UD booth on Tuesday morning at 11:00.

I also want to mention the RCE Podcast that Brock Palen and Jeff Squyres were kind enough to invite me to record. If you're interested in an intro to OGE or a high-level status check, go have a listen.

I guess since I have your attention, I should also point out that the presentation I did at Oracle OpenWorld '10 about using Grid Engine for large-scale data-oriented computing (e.g. Hadoop) with Tom White from Cloudera is now available on the Grid Engine OTN page.

Wednesday Oct 06, 2010

SWWM Seeks SWISV

I've said it before: being adopted into the Oracle family has been a great thing for the Oracle Grid Engine product. One of the many reasons is that we get to take advantage of the amazing partner program that Oracle has, the Oracle Partner Network.

Over the years, a number of companies have built products that include, build on, or use either the Grid Engine product or the Grid Engine open source project. While we were Sun, there really was little that we could offer these companies in terms of useful partnership opportunities. Now that we're Oracle, there are actually several very active, very interesting programs available for partners. If your company is working with Grid Engine, and you'd like to investigate a closer relationship with Oracle, there's never been a better time!

Here's just a quick overview of some of the programs Oracle has to offer:

  • Oracle Validated Integration -- I love this program. It's a way to have Oracle certify and swear to the fact that your product is validated on Grid Engine and that the combination works as designed. It gives your customers an extra boost of confidence in your product, and it gets your product listed on the OVI partner solutions page. (Note that the program information says it's only for a limited set of Oracle products. Since Grid Engine is now under the Oracle Enterprise Manager product family, we do indeed qualify.)
  • Application-Specific Full Use & Embedded licensing -- We now have the ability to negotiate OEM contracts to include or embed Grid Engine in your product. It was possible before, but now it's actually a normal thing to do. There's even a standard program and process for it, including some very nice discounts. You can find out more about the program on page 54 of the Software Investment Guide.
  • Oracle Partner Network -- The OPN is your one-stop shop for hitching your wagon to the Oracle engine. With multiple levels and a huge number of benefits, the OPN is a great way to develop a closer relationship with Oracle.
  • OPN Specialization for Cloud computing and SaaS -- OPN has this concept of partner specializations. It's a way for you to distinguish yourself by demonstrating your deeper knowledge in specific areas. There's now a specialization for the cloud and SaaS.

If any of these programs sound interesting, you know where to find me. You can also send a Tweet or DM to my partner partner, Susan Wu, susanwu88 on Twitter.

(Don't worry. I'll get back to blogging geeky things again soon.)

Tuesday Sep 21, 2010

A Quick Update From the Experts at Oracle OpenWorld

Just wanted to point out this interview that came out yesterday. The summary is: really, honestly, really, Grid Engine is alive and well and has a bright future in front of it. The rumors of Grid Engine's death have been greatly exaggerated.

Wednesday Sep 15, 2010

Grid Engine at Oracle Open World

In case any of you will be visiting Oracle Open World next week, be sure to come check out my sessions. I have two OpenWorld sessions and one JavaOne hands-on lab. (The lab isn't actually directly related to Grid Engine, but there's a tie-in via our Hadoop support.)

S316977: Scalable Enterprise Data Processing for the Cloud with Oracle Grid Engine
Dan Templeton (Oracle), Tom White (Cloudera)
Thursday 23-Sep-10 12:00-13:00 Moscone South Rm 310
S317230: Who's Using Your Grid? What's on Your Grid? How to Get More
Dan Templeton, Dave Teszler, Zeynep Koch
Tuesday 21-Sep-10 17:00-18:00 Moscone South Rm 305
S314413: Extracting Real Value from Your Data with Apache Hadoop
Dan Templeton (Oracle), Sarah Sproehnle (Cloudera), Michal Bachorik (Oracle)
Wednesday 22-Sep-10 12:30-14:30 Hilton San Francisco Plaza B

Melissa McDade's talk will also have some Grid Engine content:

S318115: High-Performance Computing for the Oil and Gas Industry
Dan Hough, Melinda McDade
Wednesday, 22-Sep-10 10:00-11:00 InterContinental San Francisco Telegraph Hill

Thursday Aug 05, 2010

Not Dead Yet!

Just noticed this article go flitting by in a tweet from a Grid Engine community member. Since the article lacks any useful details whatsoever about who was cut and where, I thought I should pop my head up to declare that all is well in Grid Engine land.

First, I have to apologize for my long lack of blog updates. Now that I've taken over the Grid Engine product management role, I've been up to my elbows non-stop. Maybe this post will get me back into the habit of blogging regularly. I still have one more post to write about what's new in 6.2u5.

Second, the Grid Engine team is still here, as is the Oracle Grid Engine product. In fact, in my almost a decade working on this product, we've never been in a better position. One thing Oracle does very well is to be clear about their intentions. Either your product has a road map, or it doesn't. We have a road map. We have a rather exciting road map, in fact, and I'm looking forward to using our new home in Oracle as a launching pad for the next generation of the Grid Engine technology.

Lastly, just to add a little credence to the above statement, let me share a little about where we have landed in Oracle. The Oracle Grid Engine team now sits in the Oracle Enterprise Manager organization, directly under the Ops Center team. Enterprise Manager is Oracle's product for managing the data center from top to bottom, the entire software stack, down through the OS, all the way to the hardware and storage. Software. Hardware. Complete. Interestingly, the Enterprise Manager group would seem to be one of the key components in Oracle's cloud strategy. Hmmm... Cloud... Grid Engine... One could imagine there being some kind of fit there. Odd that we should land in the same group...

The technology that Grid Engine brings to the Oracle product family is unique. Not only does it not compete with any existing Oracle product, but there are also several Oracle products with which Grid Engine has a very natural synergy. I have very high hopes for the role Grid Engine will play at Oracle going forward. Without getting into any details, look for good things coming from our direction in the future.

Oracle policy prevents me from saying anything concrete or specific about our plans or positioning or anything else, really, but I hope I've been able to give you 1) confidence that we're alive and doing quite well, thank you, and 2) a sense that we have a long and exciting road ahead of us.

Wednesday Jan 20, 2010

Topology-Aware Scheduling

Continuing in my feature deep dives, let's talk about topology-aware scheduling. Some applications have serious resource needs. Not only do they need raw CPU cores, but they also beat the snot out of the local cache or burn up the I/O channels. These sorts of applications don't play well with others. In fact, they often don't play well with themselves. For these applications, how the threads/processes are distributed across the CPUs makes a huge difference. If, for example, all the threads/processes have their own core but are all sharing a socket, they might end up fighting over cache space or I/O bandwidth. Depending on the CPU architecture, the conflicts may be more subtle, such as only the processes on specific groups of cores colliding. The price for making a bad choice of how to assign these applications to cores is poor performance, in some cases doubling the time to completion.

It's not just the powerhouse apps that care about CPU topology, though. Most operating systems will schedule processes and threads to execute on available cores rather willy-nilly, with no sense of core affinity. Because an average OS does context switches at a rather high frequency, an application may find itself executing on a different CPU and core every time it gets the chance to run. If that application makes any use of the CPU cache, for example, its performance will suffer for it. The performance might not suffer much, but the difference is usually measurable.

For these reasons, we've added topology-aware scheduling to Sun Grid Engine 6.2 update 5. With topology-aware scheduling, the user who submits the job can specify how that job should be laid out across a machine's CPUs. Users are allowed to specify three different flavors of distribution strategy: linear, striding, or explicit. In linear distribution, the execution daemon will place the job's threads/processes on consecutive cores if possible. If it can't fit the entire job on a single socket, it will span the job across sockets. The striding strategy tells the execution daemon to place the job on every nth core, e.g. every 4th core or every other core. The explicit strategy lets the user decide exactly which cores will be assigned to the job. Note that the core binding is a request, not a requirement. If for some reason the execution daemon can't fulfill the request, the job will still be executed; it just won't be bound.

In addition to the three binding strategies, there are also three possible binding mechanisms. You can either allow Sun Grid Engine to do the binding automatically as part of the job execution, or you can have Sun Grid Engine add the binding parameters to the machines file for OpenMPI jobs, or you can have Sun Grid Engine just describe the intended binding in an environment variable with the expectation that the job will bind itself based on that information. When the job is bound by Sun Grid Engine during execution, the job will be tied to specific CPU cores using an OS-specific system call. On Linux, the bound processors may be shared with other processes. On Solaris, the bound processors are used exclusively for the job. In either case, the job will only be allowed to execute on the bound processors.
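
Roughly, here's what those three mechanisms look like at submission time (a sketch based on my reading of the 6.2u5 qsub docs; the PE name is just an example, and I'm assuming the env mechanism exposes the chosen cores through an SGE_BINDING environment variable):

% qsub -binding set linear:2 job.sh             # default mechanism: the execution daemon binds the job itself
% qsub -binding pe linear:2 -pe orte 8 job.sh   # binding info is written into the PE hostfile for OpenMPI to use
% qsub -binding env linear:2 job.sh             # the job reads SGE_BINDING and is expected to bind itself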

In order to allow users to tell what kinds of topologies are provided by the machines in the cluster, some new default complexes have been added that describe the socket/core/thread layouts of the machines. These new complexes can be used during job submission to request specific topologies, or they can be used with qhost to report what's available.
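
Checking what a host offers might look something like this (a sketch; m_core and m_socket show up in the examples below, and I'm assuming m_topology as the name of the complex that carries the layout string):

% qhost -F m_socket,m_core,m_topology    # report each host's socket and core counts plus its topology string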

Let's look at a couple of examples (taken from the docs).

% qsub -binding linear:4 -l m_core=8 -l m_socket=2 -l arch=lx26-amd64 job.sh

This example will look for a machine with 8 cores and 2 sockets (i.e. dual-socket, quad-core) and try to bind to four consecutive cores. The execution daemon will try to put all four cores on the same socket, but if that's not possible, it will spread the job out over as many sockets as required (but as few as possible).

% qsub -binding striding:2:4 -l m_core=8 -l m_socket=2 -l arch=lx26-amd64 job.sh

This example will again look for a dual-socket, quad-core machine, but this time the job will occupy the third core on both sockets. (The first core is number 0.) If the third core on either socket is occupied, the job will not be bound.

% qsub -binding explicit:0,0:0,3:1,0:1,3 -l m_core=8 -l m_socket=2 -l arch=lx26-amd64 job.sh

This last example will yet again look for a dual-socket, quad-core machine. This time the job will be bound to the first and fourth cores on both sockets. Again, if any of those cores are already bound to another job, the job will not be bound.

It's clear that jobs that benefit from specific process placement with respect to CPU cores will perform much better in a 6.2u5 cluster, thanks to this new feature. Even for regular old run-of-the-mill jobs, though, submitting with -binding linear:1 should provide a small performance bump because it will keep them from being jostled around between context switches. In fact, I won't be surprised if 12 months from now I include adding that switch to the sge_request file in my top 10 list of best practices.
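
If you want to try that today, here's a minimal sketch (assuming the default cell name, so the cluster-wide default request file lives at $SGE_ROOT/default/common/sge_request):

% echo "-binding linear:1" >> $SGE_ROOT/default/common/sge_request    # every job now requests a single-core binding by default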

Wednesday Jan 06, 2010

Welcome Sun Grid Engine 6.2 update 5

The Sun Grid Engine 6.2 update 5 release is now available. Don't let the unassuming version number fool you; there are quite a few interesting features packed into this release. Let's talk about them, shall we?

Integration with Apache Hadoop

SGE 6.2u5 gets to claim the title of first workload manager with direct support for Apache Hadoop applications. What does that mean? First, it means that you can submit Hadoop applications to an SGE cluster just like you would any other parallel job. The cluster will take care of setting up the Hadoop jobtracker and tasktrackers for you. Second, it means that the SGE scheduler knows about the HDFS data locality such that it can route Hadoop jobs to nodes where the jobs' data already lives. The net result is that you can now realistically consolidate your Hadoop cluster into your SGE cluster, saving you time, money, and lots of headaches. See the docs for more info. [Also see my next post.]

Topology-aware Scheduling

Many applications benefit greatly by being tied to specific CPU sockets and/or cores. For example, some cache-hungry applications will execute in half the time if run on four cores on different sockets versus running on four cores in the same socket. With SGE 6.2u5, we've added the ability to specify these topology preferences when submitting your jobs. Whenever possible, the scheduler will honor the topology preferences when assigning jobs to nodes. For topology-sensitive applications and clusters with lots of Nehalem boxes, SGE 6.2u5 can speed up application execution considerably. See the docs for more info. [Also see my follow-up post.]

Slotwise Subordination

The SGE preemption model is what I call "after-market preemption," meaning that it's not an inherent aspect of every cluster. You have to take preemption (AKA subordination) into account when designing your cluster layout. Prior to SGE 6.2u5, the preemption model was rather coarse-grained. SGE could only suspend an entire queue instance at a time, meaning that one high-priority job might be suspending two or four or sixteen or more lower-priority jobs. With SGE 6.2u5, we're introducing finer-grained preemption. Now, rather than declaring that just Queue A is subordinated to Queue B, you can say that between Queues A and B there shouldn't be more than 4 jobs running, and given a conflict, Queue B wins. This new finer-grained preemption model means that you can now use subordination without paying for it with utilization. See the docs for more info. [Also see my follow-up post.]
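
As a sketch of what that configuration might look like (the queue names A.q and B.q are illustrative, and the slotwise subordinate_list syntax here is my reading of the 6.2u5 queue_conf(5) man page, so double-check it against the docs):

# On the winning queue (B.q), cap the combined per-host job count for B.q and A.q at 4;
# beyond that threshold, jobs in the subordinated queue (A.q) get suspended.
% qconf -mattr queue subordinate_list "slots=4(A.q)" B.q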

User-controlled Array Task Throttling

One of the unique things about Sun Grid Engine is that it handles array jobs extremely efficiently. In many cases users will consolidate individual batch jobs together into array jobs to take advantage of that fact. The downside is that all tasks within an array job are considered equal with regard to scheduling policies. If an array job is the highest priority job in the system, all of its tasks are also higher priority than any other jobs. If that array job has ten thousand tasks (something not uncommon or really even all that stressful for SGE), then all ten thousand tasks will be run before any other jobs (unless another job later becomes higher priority), at least by default. An administrator can configure a global limit to the number of tasks from a single array job that are allowed to execute at a time. Better than nothing, but global policies always leave something to be desired.

With SGE 6.2u5, we've introduced the ability for a user to apply self-imposed limits to his individual array jobs. Why would a user voluntarily set limits? In most cases it turns out that users want to do the right thing and will gladly do so given the chance. Self-imposed limits help the cluster run more smoothly, meaning that everyone gets what they want faster, and no one gets bonked on the head by the administrator. Additionally, if a user has more than one large array job pending, setting self-imposed limits allows them all to make progress instead of completing them serially. For more than one customer I know about, this feature alone will be reason enough to upgrade. [See my follow-up post for more info.]
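
On the submission side it's a one-switch sketch (assuming the new -tc option; the script name is made up):

% qsub -t 1-10000 -tc 50 render_frame.sh    # a 10,000-task array job, voluntarily capped at 50 concurrent tasks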

Extended SGE Inspect

SGE Inspect, the new UI introduced in SGE 6.2u3, was previously only a monitoring tool. With SGE 6.2u5, we've added the ability to manage parallel environments. Going forward we will continue adding management functionality. See the docs for more info.

Improved Cloud Connectivity

With SGE 6.2u3, we added the ability through the Service Domain Manager component to automatically provision additional cluster nodes from Amazon EC2 during peak periods. With SGE 6.2u5, we've expanded that functionality a bit and made it easier to use. See the docs for more info.

Improved Power Management

Same story as the cloud connectivity, really. We introduced the ability to automatically power down idle or underused nodes with SGE 6.2u3 through the Service Domain Manager component. With SGE 6.2u5, we've fleshed it out a bit more and made it easier to use.

Over the next couple of weeks I'll try to write some posts about these features individually. If you're already Grid Engine savvy, go grab a copy and get started. If you need more info, try starting with the beginner's guide.

Monday Nov 30, 2009

Sun Grid Engine for Dummies

I've recently been asked for a really introductory doc on Sun Grid Engine, and I was dismayed to realize that there really isn't anything like that out there. Even the Beginner's Guide I wrote has some fairly high expectations of the reader's experience level. So, this post will be my attempt at a truly introductory introduction to Sun Grid Engine.

Let's Begin at the Beginning

Servers tend to be used for one of two purposes: running services or processing workloads. Services tend to be long-running and don't tend to move around much. Workloads, however, such as running calculations, are usually done in a more "on demand" fashion. When a user needs something, he tells the server, and the server does it. When it's done, it's done. For the most part it doesn't matter on which particular machine the calculations are run. All that matters is that the user can get the results. This kind of work is often called batch, offline, or non-interactive work. Sometimes batch work is called a job. Typical jobs include processing of accounting files, rendering images or movies, running simulations, processing input data, modeling chemical or mechanical interactions, and data mining. Many organizations have hundreds, thousands, or even tens of thousands of machines devoted to running jobs.

Now, the interesting thing about jobs is that (for the most part) if you can run one job on one machine, you can run 10 jobs on 10 machines or 100 jobs on 100 machines. In fact, with today's multi-core chips, it's often the case that you can run 4, 8, or even 16 jobs on a single machine. Obviously, the more jobs you can run in parallel, the faster you can get your work done. If one job takes 10 minutes on one machine, 100 jobs still only take 10 minutes when run on 100 machines. That's much better than 1000 minutes to run those 100 jobs on a single machine. But there's a problem. It's easy for one person to run one job on one machine. It's still pretty easy to run 10 jobs on 10 machines. Running 1600 jobs on 100 machines is a tremendous amount of work. Now imagine that you have 1000 machines and 100 users all trying to run 1600 jobs each. Chaos and unhappiness would ensue.

To solve the problem of organizing a large number of jobs on a set of machines, distributed resource managers (DRMs) were created. (A DRM is also sometimes called a workload manager. I will stick with the term DRM.) The role of a DRM is to take a list of jobs to be executed and distribute them across the available machines. The DRM makes life easier for the users because they don't have to track all their jobs themselves, and it makes life easier for the administrators because they don't have to manage users' use of the machines directly. It's also better for the organization in general because a DRM will usually do a much better job of keeping the machines busy than users would on their own, resulting in much higher utilization of the machines. Higher utilization effectively means more compute power from the same set of machines, which makes everyone happy.

Here's a bit more terminology, just to make sure we're all on the same page. A cluster is a group of machines cooperating to do some work. A DRM and the machines it manages compose a cluster. A cluster is also often called a grid. There has historically been some debate about what exactly a grid is, but for most purposes grid can be used interchangeably with cluster. Cloud computing is a hot topic that builds on concepts from grid/cluster computing. One of the defining characteristics of a cloud is the ability to "pay as you go." Sun Grid Engine offers an accounting module that can track and report on fine grained usage of the system. Beyond that, Sun Grid Engine now offers deep integration to other technologies commonly being used in the cloud, such as Apache Hadoop.

How Does It Work?

A Sun Grid Engine cluster is composed of execution machines, a master machine, and zero or more shadow master machines. The execution machines all run copies of the Sun Grid Engine execution daemon. The master machine runs the Sun Grid Engine qmaster daemon. The shadow master machines run the Sun Grid Engine shadow daemon. In the event that the master machine fails, the shadow daemon on one of the shadow master machines will take over, making that machine the new master. The qmaster daemon is the heart of the cluster, and without it no jobs can be submitted or scheduled. The execution daemons are the work horses of the cluster. Whenever a job is run, it's run by one of the execution daemons.

To submit a job to the cluster, a user uses one of the submission commands, such as qsub. Jobs can also be submitted from the graphical user interface, qmon, but the command-line tools are by far more commonly used. In the job submission command, the user includes all of the important information about the job, like what it should actually run, what kind of execution machine it needs, how much memory it will consume, how long it will run, etc. All of that information is then used by the qmaster to schedule and manage the job as it goes from pending to running to finished. For example, a qsub submission might look like: qsub -wd /home/dant/blast -i /home/dant/seq.tbl -l mem_free=4G cross-blast.pl ddbdb. This job searches for DNA sequences from the input file /home/dant/seq.tbl in the ddbdb sequence database. It requests that it be run in the /home/dant/blast directory, that the /home/dant/seq.tbl file be piped to the job's standard input, and that it run on a machine that has at least 4GB of free memory.

Once a job has been submitted, it enters the pending state. On the next scheduling run, the qmaster will rank the job in importance versus the other pending jobs. The relative importance of a job is largely determined by the configured scheduling policies. Once the jobs have been ranked by importance, the most important jobs will be scheduled to available job slots. A slot is the capacity to run a job. Generally, the number of slots on an execution machine is set to equal the number of CPU cores the machine has; each core can run one job and hence represents one slot. Every available slot is filled with a pending job, if one is available. If a job requires a resource or a slot on a certain type of machine that isn't currently available, that job will be skipped over during that scheduling run.

Once the job has been scheduled to an execution machine, it is sent to the execution daemon on that machine to be run. The execution daemon executes the command specified by the job, and the job enters the running state. Once the job is running, it is allowed to continue running until it completes, fails, is terminated, or is requeued (in which case we start over again). Along the way the job may be suspended, resumed, and/or checkpointed any number of times. (Sun Grid Engine does not handle checkpointing itself. Instead, Sun Grid Engine will trigger whatever checkpointing mechanism is available to a job, if any is available.)

After a job has completed or failed, the execution daemon cleans up after it and notifies the qmaster. The qmaster records the job's information in the accounting logs and drops the job from its list of active jobs. If the submission client was synchronous, the qmaster will notify the client that the job ended. Information about completed jobs is available through the qacct command-line tool or the Accounting and Reporting Console's web console.
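
For example (the job ID is made up):

% qacct -j 1234    # print the accounting record for finished job 1234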

In addition to traditional-style batch jobs, as in the BLAST example above, Sun Grid Engine can also manage interactive jobs, parallel jobs, and array jobs. An interactive job is like logging into a remote machine, except that Sun Grid Engine decides to which machine to connect the user. While the user is logged in, Sun Grid Engine is monitoring what the user is doing for the accounting logs. A parallel job is a distributed job that runs across multiple nodes. Typically a parallel job relies on a parallel environment, like MPI, to manage its inter-process communication. An array job is similar to a parallel job except that its processes don't communicate; they're all independent. Rendering an image is a classic array job example. The main difference between a parallel job and an array job is that a parallel job needs to have all of its processes running at the same time, whereas an array job doesn't; it could be run serially and would still work just fine.

What's So Special About Sun Grid Engine?

If any old DRM (and there are quite a few out there) solves the problem, why should you be particularly interested in Sun Grid Engine? Well, there are a few reasons. My top reasons (in no particular order) why Sun Grid Engine is so great are:

  • Scalability — Sun Grid Engine is a highly scalable DRM system. We have customers running clusters with thousands of machines, tens of thousands of CPU cores, and/or processing tens of millions of jobs per month.
  • Flexibility — Sun Grid Engine makes it possible to customize the system to exactly fit your needs.
  • Advanced scheduler — Sun Grid Engine does more than just spread jobs evenly around a group of machines. The Sun Grid Engine qmaster supports a variety of policies to fine-tune how jobs are distributed to the machines. Using the scheduling policies, you can configure Sun Grid Engine to make its scheduling decisions match your organization's business rules.
  • Reliability — Something that I hear regularly from customers is that Sun Grid Engine just works and that it keeps working. After the initial configuration, Sun Grid Engine takes very little effort to maintain.

The Sun Grid Engine software has a long list of features that make it a powerful, flexible, scalable, and ultimately useful DRM system. With both open source and supported product options, Sun Grid Engine offers a very low barrier to entry and enterprise class functionality and support.

Typical Use Cases

One of the easiest ways to understand Sun Grid Engine is to see it in action. To that end, let's look at some typical use cases.

  • Mentor Graphics, a leading EDA software vendor, uses the Sun Grid Engine software to manage its regression tests. To test their software, they submit the tests as thousands of jobs to be run on the cluster. Sun Grid Engine makes sure that every machine is busy running tests. When a machine completes a test run, Sun Grid Engine assigns it another, until all of the tests are completed.

    In addition to using Sun Grid Engine to manage the physical machines, they also use Sun Grid Engine to manage their software licenses. When a test needs a software license to run, that need is reflected in the job submission. Sun Grid Engine makes sure that no more licenses are used than are available.

    This customer has a diverse set of machines, including Solaris, Linux, and Windows. In a single cluster they process over 25 million jobs per month. That's roughly 10 jobs per second, 24/7. (In reality, their workload is bursty. At some times they may see more than 100 jobs per second, and at other times they may see less than 1.)

  • Complete Genomics is using Grid Engine to manage the computations needed to do sequencing of the human genome. Their sequencing instruments are like self-contained robotic laboratories and require a tremendous amount of computing power and storage. Using Grid Engine as the driver for their computations, this customer intends to transform the way disease is studied, diagnosed and treated by enabling cost-effective comparisons of genomes from thousands of individuals. They currently have a moderate sized cluster, with a couple hundred machines, but they intend to grow that cluster by more than an order of magnitude.

  • Rising Sun Pictures uses Grid Engine to orchestrate its video rendering process to create digital effects for blockbuster films. Each step in the rendering process is a job with a task for every frame. Sun Grid Engine's workflow management abilities make sure that the rendering steps are performed in order for every frame as efficiently as possible.

  • A leading mobile phone manufacturer runs a Sun Grid Engine cluster to manage their product simulations. For example, they run drop test simulations with new phone designs using the Sun Grid Engine cluster to improve the reliability of their phones. They also run simulations of new electronics designs through the Sun Grid Engine cluster.

  • D.E. Shaw is using Sun Grid Engine to manage their financial calculations, including risk determination and market prediction. This company's core business runs through their Sun Grid Engine cluster, so it has to just work. The IT team managing the cluster offers their users a 99% availability SLA.

    Also, this company uses many custom-developed financial applications. The configurability of the Sun Grid Engine software has allowed them to integrate their applications into the cluster with little or no modifications.

  • Another Wall Street financial firm is using a Sun Grid Engine cluster to replace their home-grown workload manager. Their workload manager is written in Perl and was sufficient for a time. They have, however, now outgrown it and need a more scalable and robust solution. Unfortunately, all of their in-house applications are written to use their home-grown workload manager. Fortunately, Sun Grid Engine offers a standardized API called DRMAA that is available in Perl (as well as C, Python, Ruby, and the Java™ platform). Through the Perl binding of DRMAA, this customer was able to slide the Sun Grid Engine software underneath their home-grown workload manager. The net result is that the applications did not need to be modified to let the Sun Grid Engine cluster take over managing their jobs.

  • The Texas Advanced Computing Center at the University of Texas is #9 on the November 2009 Top500 list and uses Sun Grid Engine to manage their 63,000-core cluster. With a single master managing roughly 4000 machines and over 3000 users working on over 1000 projects spread throughout 48 of the 50 US states, the TACC cluster weighs in as the largest (known) Sun Grid Engine cluster in production. Even though the cluster offers a tremendous amount of compute power to the users of the TeraGrid research network (579 TeraFLOPS to be exact), the users and Sun Grid Engine master manage to keep the machines in the cluster at 99% utilization.

    The TACC cluster is used by researchers around the country to run simulations and calculations for a variety of fields of study. One noteworthy group of users has run a 60,000-core parallel job on the Sun Grid Engine cluster to do real-time face recognition in streaming video feeds.

Atypical Use Cases

One of the best ways to show Sun Grid Engine's flexibility is to take a look at some unusual use cases. These are by no means exhaustive, but they should serve to give you an idea of what can be done with the Sun Grid Engine software.

  • A large automotive manufacturer uses their Sun Grid Engine cluster in an interesting way. In addition to using it to process traditional batch jobs, they also use it to manage services. Service instances are submitted to the cluster as jobs. When additional service instances are needed, more jobs are submitted. When too many are running for the current workload, some of the service instances are stopped. The Sun Grid Engine cluster makes sure that the service instances are assigned to the most appropriate machines at the time.

  • One of the more interesting configuration techniques for Sun Grid Engine is called a transfer queue. A transfer queue is a queue that, instead of processing jobs itself, actually forwards the jobs on to another service, such as another Sun Grid Engine cluster or some other service. Because the Sun Grid Engine software allows you to configure how every aspect of a job's life cycle is managed, the behavior around starting, stopping, suspending, and resuming a job can be altered arbitrarily, such as by sending jobs off to another service to process. More information about transfer queues can be found on the open source web site.

  • A Sun Grid Engine cluster is great for traditional batch and parallel applications, but how can one use it with an application server cluster? There are actually two answers, and both have been prototyped as proofs of concept.

    The first approach is to submit the application server instances as jobs to the Sun Grid Engine cluster. The Sun Grid Engine cluster can be configured to handle updating the load balancer automatically as part of the process of starting the application server instance. The Sun Grid Engine cluster can also be configured to monitor the application server cluster for key performance indicators (KPIs), and it can even respond to changes in the KPIs by starting additional or stopping extra application server instances.

    The second approach is to use the Sun Grid Engine cluster to do work on behalf of the application server cluster. If the applications being hosted by the application servers need to execute longer-running calculations, those calculations can be sent to the Sun Grid Engine cluster, reducing the load on the application servers. Because of the overhead associated with submitting, scheduling, and launching a job, this technique is best applied to workloads that take at least several seconds to run. This technique is also applicable beyond just application servers, such as with SunRay Virtual Desktop Infrastructure.

  • A research group at a Canadian university uses Sun Grid Engine in conjunction with Cobbler to do automated machine profile management. Cobbler allows a machine to be rapidly reprovisioned to a pre-configured profile. By integrating Cobbler into their Sun Grid Engine cluster, they are able to have Sun Grid Engine reprovision machines on demand to meet the needs of pending jobs. If a pending job needs a machine profile that isn't currently available, Sun Grid Engine will pick one of the available machines and use Cobbler to reprovision it into the desired profile.

    A similar effect can be achieved through virtual machines. Because Sun Grid Engine allows jobs' life cycles to be flexibly managed, a queue could be configured that starts all jobs in virtual machines. Aside from always having the right OS profile available, jobs started in virtual machines are easy to checkpoint and migrate.

  • With the 6.2 update 5 release of the Sun Grid Engine software, Sun Grid Engine can manage Apache Hadoop workloads. In order to do that effectively, the qmaster must be aware of data locality in the Hadoop HDFS. The same principle can be applied to other data repository types such that the Sun Grid Engine cluster can direct jobs (or even data disguised as a job) to the machine that is closest (in network terms) to the appropriate repository.

  • One of the strong points of the Sun Grid Engine software is the flexible resource model. In a typical cluster, jobs are scheduled against things like CPU availability, memory availability, system load, license availability, etc. Because the Sun Grid Engine resource model is so flexible, however, any number of custom scheduling and resource management schemes are possible. For example, network bandwidth could be modeled as a resource. When a job requests a given bandwidth, it would only be scheduled on machines that can provide that bandwidth. The cluster could even be configured such that if a job lands on a resource that provides higher bandwidth than the job requires, the bandwidth could be limited to the requested value (such as through the Solaris Resource Manager).
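
    Here's a rough sketch of how that bandwidth resource might be modeled (the complex name, values, and host name are all illustrative; the field layout is the one shown by qconf -sc):

    # 1. Add a requestable, consumable complex (values in Mbit/s):
    #    name       shortcut  type  relop  requestable  consumable  default  urgency
    #    bandwidth  bw        INT   <=     YES          YES         0        0
    % qconf -sc > complexes.txt    # dump the current complex list, append the line above, then reload it:
    % qconf -Mc complexes.txt

    # 2. Give a host a capacity and request the resource at submission time:
    % qconf -mattr exechost complex_values bandwidth=1000 node01
    % qsub -l bandwidth=100 job.sh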

Further Reading

For more information about Sun Grid Engine, here are some useful links:

Beta Testing the Sun Grid Engine Hadoop Integration

In case you haven't heard yet, the upcoming release of Sun Grid Engine will include an integration with Apache Hadoop that will allow Map/Reduce jobs to be submitted to a Sun Grid Engine cluster while minding HDFS data locality. The 6.2u5 release will be out by the end of the year, but it's currently in the beta testing phase. And that's where you come in.

I'm looking for some volunteers to test the integration. To that end, this blog post will provide instructions for how to get the beta code checked out and built. The Hadoop integration is actually only loosely dependent on the Sun Grid Engine software itself. While it's planned to be part of u5, the integration should be usable with a cluster as old as 6.2u2, although I would really recommend at least 6.2u4.

In a nutshell, the integration consists of two components. The first is the hadoop parallel environment that allows Map/Reduce jobs to be started as parallel jobs in a Sun Grid Engine cluster. The second is the integration with HDFS, called Herd, that makes the Sun Grid Engine scheduler aware of the locations of the HDFS data blocks. Herd has two parts. One part is a load sensor that runs on every execution machine and reports the HDFS blocks on that machine. The other part is a JSV that translates HDFS data paths included in the job submission into a list of HDFS blocks needed by the job.

How to check out the source code

  1. Make sure you have a functional CVS client.
  2. cvs -d :pserver:guest@cvs.sunsource.net:/cvs login
  3. cvs -d :pserver:guest@cvs.sunsource.net:/cvs checkout gridengine/source

Technically, the above will only check out the source directory, but for the Hadoop integration, that's all you need. The Hadoop integration lives in three places. First, the scripts live in source/dist/hadoop. Second, the Herd code lives at source/libs/herd. Third, the JSV Java language binding upon which the Herd code depends lives at source/libs/jjsv.

How to build the source code

  1. Make sure you're using at least Ant 1.6.3 and the Java Standard Edition 6 platform.
  2. Copy the source/build.properties file to build_private.properties.
  3. Edit the build_private.properties file to include the correct paths for the Java Standard Edition 6 platform and JUnit 3.8.
  4. Change to the gridengine/source directory.
  5. ant jjsv
  6. ant herd

After the above steps, you will find herd.jar at source/CLASSES/herd/herd.jar and JSV.jar at source/CLASSES/jjsv/JSV.jar.

How to install the integration

  1. Copy herd.jar and JSV.jar to the $SGE_ROOT/lib directory.
  2. Copy the source/dist/hadoop directory to somewhere accessible by all the execution nodes.

How to configure the integration

  1. Get HDFS up and running on your cluster. The most useful configuration will be to have every execution host be a data node, and to only have execution hosts as data nodes. Also, because of the way Hadoop does authentication and authorization, you'll need to make sure that either HDFS has security disabled or that root and the SGE admin user are in the HDFS super user group.
  2. Copy your Hadoop configuration directory to <hadoop>/conf, where <hadoop> is the directory that you copied in step 2 of How to install the integration.
  3. Delete the <hadoop>/conf/mapred-site.xml, <hadoop>/conf/masters, and <hadoop>/conf/slaves files.
  4. Edit the <hadoop>/env.sh file to contain the paths to the Java platform, the Hadoop install directory, and the Hadoop configuration directory you just created (<hadoop>/conf).
  5. Change into the <hadoop> directory.
  6. ./setup.pl -i
  7. Add the hadoop parallel environment to one or more of your queues.
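
For that last step, a minimal sketch (assuming a queue named all.q; qconf -aattr appends a value to a list attribute):

% qconf -aattr queue pe_list hadoop all.q    # make the hadoop PE available in all.q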

The setup.pl script will install the hadoop parallel environment and the complexes needed by Herd. It will also start the Herd load sensor on all the execution hosts. At this point, you should be ready to go. Wait for a couple of minutes to give all of the execution hosts a chance to start running the load sensor and reporting values. You can run qhost -F hdfs_primary_rack to check that the load sensor is functioning correctly. Every execution host should report an hdfs_primary_rack value. If one or more machines have not reported a value within about five minutes, see the troubleshooting section below.

Using the integration

To submit a job that uses the hadoop parallel environment, use -pe hadoop <n>, where <n> is the number of nodes. The hadoop parallel environment uses an allocation rule that guarantees that no more than one task tracker per job will run on a single host. To tell the scheduler what data the job needs, request the hdfs_input resource with a value of the HDFS path to the job's data. The data path must be an absolute path.

Here's an example. Say I want to use the grep example to find occurrences of the word 'Sun' in a series of documents. First, I'd copy those documents into HDFS under /user/dant/sungrep:

% bin/hadoop fs -copyFromLocal ~/Documents/\* /user/dant/sungrep

I would then submit the job:

% echo `pwd`/bin/hadoop --config \$TMPDIR/conf jar `pwd`/hadoop-0.20.1-examples.jar grep sungrep output Sun | qsub -pe hadoop 16 -l hdfs_input=/user/dant/sungrep -jsv <hadoop>/jsv.sh

Let's look at that in a little more detail. First, we're echoing the Hadoop command and piping it to qsub. Why? Well, when the integration runs, it creates a conf directory in the job's temp directory that is properly set up for the assigned hosts. Until the job runs, though, we don't know where the temp directory is. We get its path from the $TMPDIR variable once the job starts. We therefore need to wrap the Hadoop command in a script. We could either write a script that contains the command, or we could let qsub write one for us by piping the command to qsub's stdin. Note that we used --config \$TMPDIR/conf in the command. The backslash is important because it prevents the shell on the submission host from interpreting the $TMPDIR variable.

Next, the qsub command uses -pe hadoop 16 to request 16 nodes. When this job is run, a job tracker will be started on the "master" host, and a task tracker will be started on each of the 16 assigned nodes. The master host is the host where the parallel job's master task is started. After the job tracker and task trackers are running, the grep job itself will be started, launched from the master host. The hadoop PE is a tight integration with an allocation rule of "1". In order to run a Hadoop job on top of SGE, you must use the PE, even if it's only a single-node job.

The qsub command also uses -l hdfs_input=/user/dant/sungrep -jsv <hadoop>/jsv.sh. The -l resource request tells SGE what data will be used by the job. It must be specified as an absolute path. The -jsv switch actually translates the resource request for hdfs_input into requests for specific racks and blocks. Without the -jsv switch, the job would never run because no node offers the hdfs_input resource. (No node offers it because it doesn't really exist. It's just a placeholder for the JSV to replace with rack and block requests. In programming terms, it's a reference injection point.) The resource request and JSV can be left out of the qsub command. If they're left out, the scheduler will not take the HDFS data locality into consideration when scheduling the job.

You can also use the Hadoop integration to set up the job tracker and task trackers and then submit jobs to them directly. Instead of echoing the Hadoop command to qsub, echo sleep 300000 instead. That will cause the job tracker and task trackers to be set up, but instead of running a job, it will just sleep for a long time. You can then run qstat -j <jobid> | grep context to show the job's context. One of the context variables will be the URL for the job tracker. Using that URL, you can set up a Hadoop configuration to talk to the job tracker so that you can submit jobs to it from the command line.

It is also highly recommended that the use of the Hadoop integration be coupled with exclusive host access. The Hadoop task trackers all assume that they have exclusive access to their nodes. If you don't use exclusive host access with the Hadoop integration, you'll end up oversubscribing the nodes.

Troubleshooting

Hopefully everything will work perfectly the first time. If for some reason it doesn't, here are some tips to help diagnose the problem:

The execds aren't reporting any hdfs resources, i.e. qhost -F | grep hdfs shows nothing.
Sometimes it takes several minutes for the nodes to start reporting the hdfs resources. If after several minutes there's still nothing, pick an execution host and check if the load sensor is running: jps -l. Look for com.sun.grid.herd.HerdJsv. Note that it might be running as root or as the SGE admin user. Also note that jps may only show you your own processes. If the load sensor isn't running, look for log files in /tmp. They will be called sge_hadoop_loadsensor.out and sge_hadoop_<n>.log. The .out file is the output from starting the load sensor. The .log files are the logging output from the load sensor. One will be the log file from the load sensor framework, and the other will be the log file from the Herd load sensor. (You can control the logging verbosity from the logging.properties file in the <hadoop> directory.)

The most common problem is that the load sensor is started as the user root on most platforms (for a reason I don't yet understand), but that HDFS usually is not. With HDFS, the user who started it is the super user, and only the super user can query the kind of information that the load sensor needs. As stated in the configuration section, you must either disable HDFS security or set a super user group that contains root (and probably the SGE admin user). The next most common problems are that the path to Hadoop or the Java platform is not correct in env.sh or that the conf directory contains bad configuration information.

You can test the load sensor manually by changing into the <hadoop> directory and running loadsensor.sh. If it works, it will "hang". Press enter, and it should spit out the hdfs resource values for that host. Type QUIT and press enter to exit the load sensor.
The job tracker and/or task trackers aren't starting.
The first place to look is the PE output and error files. The output from starting the job tracker should be found there. The next place to look is the log files. The log files are written where the Hadoop configuration says to put them. Make sure that wherever that is, all the users have access to it from all the nodes. Inability to write the log file is a common reason why the job tracker and/or task trackers won't start. In addition to the usual Hadoop log files, the integration also writes a hadoop-<adminuser>-sge-<hostname>.log file. That file contains the output from starting the task trackers from the master host. Another common reason for the job tracker and/or task trackers not to start is that the path to the Java platform isn't correctly configured in the hadoop-env.sh file.

Thursday Jul 30, 2009

Sun HPC Software Workshop '09 -- Early Bird's Almost Over!

Just wanted to remind everyone that the early bird registration for the Sun HPC Software Workshop '09, Sept 7-10 in Regensburg, Germany, ends tomorrow (31 July 2009). It's your last chance to sign up at the discounted rate. After tomorrow, you will still be able to register, but the cost of registration will be higher.

In a nutshell, the Sun HPC Software Workshop '09 is a combination of our annual Grid Engine Workshop, a European edition of the popular Lustre Users Group meeting, and a conference on developing applications and services for HPC and cloud environments. The Workshop lasts three days, with a presentation track representing each of these topics. On the day before the main Workshop starts, we're also holding deeper technology seminars: a Lustre Deep Dive, a Grid Engine admin training, and a class on parallel application development taught by Ruud van der Pas. The Workshop and the preceding seminars are an excellent opportunity to learn more about these technologies and connect with the product engineers, partners, and other community members.

There is an open Call for Presentations for the Workshop, but it also closes tomorrow. If you're interested in proposing a talk for the Workshop (and getting a discounted registration fee if it's accepted), send a title, duration, and brief summary to the email address listed on the Agenda page. But, hurry. We'll be making our final decisions and notifying the speakers soon.

I look forward to seeing you there!

Tuesday Jul 21, 2009

Lies, Damned Lies, & DRMs

Some of our competitors seem to be very fond of spreading the rumor that the Sun Grid Engine product team has been laid off and/or that the product has been discontinued. It would appear that since they can't claim to have a better, more scalable, or more cost-effective product, they're willing to go with lying through their teeth to make the sale. Since I keep getting asked this question, I figured it would be worthwhile to post an official response.

To plagiarize Mark Twain, the rumors of our death have been greatly exaggerated. We're still here and going strong. The team is now roughly four times the size it was when I joined six years ago. It spans six offices in five countries on three continents. The product has a road map that reaches out past 2012 (which is as far as we're willing to speculate). We have a massive (if not leading) share in both the open source and licensed DRM system markets, and we're not planning to go away any time soon.

Of course, with the deal with Larry pending, nothing is certain. The only comment I can make there is "no comment." That said, for now at least, it's business as usual. We're still writing code, preparing releases, doing trainings, holding our annual Workshop, etc. Look for the next update this quarter. Look for the next release next year. And look for a whole lot more good stuff coming from our team over the next several updates and releases. With the features that have been added in the 6.2, 6.2u2 and 6.2u3 releases, Sun Grid Engine is in a great position. With what's coming up, I'd resort to lying too, if I worked for one of our competitors.

Monday Jul 20, 2009

European Students: Want a Free Laptop?

Are you a student in Europe*? Do you want a new Toshiba laptop? Willing to write some code to get it? Good. Read on.

The OpenSolaris HPC team is currently running a programming contest for European students that was launched at ISC in Hamburg last month. The contest is to write the most performant and scalable implementation of a distributed hash table. Submissions can come from teams of up to three people. The top prize is a new Toshiba laptop for each member of the winning team.

For more information, check out the contest site. Better hurry, though, because the contest deadline is coming up quick!

* Contest participation is limited to legal residents of a specific list of European countries. See the contest site for details.


OFFICIAL RULES
NO PURCHASE NECESSARY

1. DESCRIPTION OF THE CONTEST: The Sun HPC Software Student Programming Challenge ISC 2009 ("Contest") is designed to promote the use of the Sun HPC Software, Developer Edition 1.0 for OpenSolaris among students by having them compete to design and implement the most scalable and best-performing implementation of a common parallel algorithm. Prizes will be awarded to those who submit the best entries as determined by the judges in accordance with these Official Rules.

2. ELIGIBILITY: This contest is open only to teams of 1 to 3 currently-enrolled, full- or part-time, undergraduate or graduate, university or college students, who are the legal age of majority in their country, province or state of legal residence and residents of Denmark, France, Germany, Italy, Poland, Russia, Spain, Sweden, Switzerland, and the United Kingdom. Void in Puerto Rico, Quebec and where prohibited by law. Persons in any of the following categories are not eligible to participate or win the prize(s) offered: (a) Employees or agents of Sun Microsystems, their parent companies, affiliates and subsidiaries, participating advertising and promotion agencies, application development partner companies, and prize suppliers; (b) immediate family members (defined as parents, children, siblings and spouse, regardless of where they reside) and/or those living in the same household as any person in (a) above; and (c) employees of any government entity. You must also have access to the Internet and a valid email address in order to enter or win.

3. HOW TO ENTER: This contest begins at 12:01 P.M. Pacific Time (PT) Zone in the United States (e.g. San Francisco time) which is 5:01 A.M. Greenwich Mean Time (GMT) on the 29th of June 2009 and ends at 11:59 P.M. (PT) which is 4:59 A.M. (GMT) on 10th of August 2009 ("Contest Period"). IMPORTANT NOTICE TO ENTRANTS: ENTRANTS ARE RESPONSIBLE FOR DETERMINING THE CORRESPONDING TIME ZONE IN THEIR RESPECTIVE JURISDICTIONS.

4. THE SUBMISSION: Create an implementation of a fault-tolerant distributed hash table as described at http://wikis.sun.com/display/HPCContest/Sun+HPC+Software+Student+Programming+Challenge+ISC+2009. The implementation must be written in C for the OpenSolaris 2009.06 operating environment using the Sun HPC ClusterTools 8.1 OpenMPI implementation and must be submitted as a Sun Studio 12 project. All Entries must include a valid and complete Sun Studio 12 project that builds without errors on an unmodified instance of the Sun HPC Software, Developer Edition 1.0 for OpenSolaris. Entries may be submitted either electronically or via mail. All Entries must be comprised of original work of the submitter(s). No participant may submit an Entry as a member of more than one team.

Electronic Entries must include a 1-3 page written summary of the implementation approach and the name(s) of the submitter(s). The electronic file must be a gzipped tar file that includes the Sun Studio 12 project directory, including all required files, and must be no larger than 5MB in size. If the electronic file is larger than 5MB in size, it must be submitted by mail in accordance with the instructions below. The electronic entry must be sent via email to hpccontest@sun.com and received no later than 11:59 PM (PDT) on August 10th, 2009 in the United States.

Mailed Entries must include a 1-3 page written summary of the implementation approach and the name(s) of the submitter(s), and a CD or DVD containing the project code as described above. All mailed Entries must be sent to Sun HPC Software Programming Challenge, c/o Sun Microsystems, Inc., 17 Network Circle, Menlo Park, CA 94025, MS-MPK17-207, and must be received no later than 11:59 PM (PDT) on August 10th, 2009 in the United States.

All Entries must be in English. Registration or Entries that are in any other language will not be considered. Entries that are lewd, obscene, pornographic, disparaging of the Sponsor or otherwise contain objectionable material may be disqualified in the Sponsor's sole and unfettered discretion.

5. JUDGING: All Entries will be judged by a panel of experts based on the following equally weighted judging criteria: data retrieval throughput for requests coming from a single node, data retrieval throughput for parallel requests coming from multiple nodes, ability to withstand processing node failure, and scalability with respect to number of processing nodes and number of data items. In the event of a tie, the person or team among the tied Entries with the highest score in scalability with respect to number of processing nodes and number of data items will be declared the winner. In the event that no entries are received, no prize will be awarded. Decisions of judges are final and binding. Winner will be notified by email.

6. PRIZES AND APPROXIMATE RETAIL VALUE: First prize: Toshiba OpenSolaris laptop valued at approximately $2,000. Second and third prizes: Apple iPod valued at approximately $150. Up to three Toshiba laptops and six Apple iPods may be awarded. Prize includes round-trip coach air transportation for one person from major airport nearest winner's residence and hotel accommodations for one person for four nights. Hotel accommodations at Sponsor's discretion. Certain blackout dates apply. In the event the Sun HPC Software Workshop is cancelled or postponed for any reason, Sponsor reserves the right to award the remainder of the prize with no further obligation to the winner. All other expenses not specified herein are the responsibility of the winner. ALL TAXES AND ANY APPLICABLE WITHHOLDING AND REPORTING REQUIREMENTS ARE THE SOLE RESPONSIBILITY OF THE WINNER. Cash prizes will be awarded in US Dollars. All costs associated with currency exchange are the sole responsibility of the winner.

7. CONDITIONS OF PARTICIPATION. Sponsor reserves the right to substitute a prize for an item of equal or greater value in the event all or part of a prize becomes unavailable. Prizes are awarded without warranty of any kind from Sponsor, express or implied, without limitation, except where this would be contrary to federal, state, provincial, or local laws or regulations. All federal, state, provincial and local laws and regulations apply. Submission of entry into this Contest deems that entrants agree to be bound by the terms of these Official Rules and by the decisions of Sponsor, which are final and binding on all matters pertaining to this Contest. Return of any prize/prize notification may result in disqualification and selection of an alternate winner. Any potential winner who cannot be contacted within 15 days of attempted first notification will forfeit his/her prize. Potential prize winner(s) may be required to sign and return an Affidavit or Declaration of Eligibility/Liability & Publicity Release within 30 days following the date of first attempted notification. Failure to comply within this time period may result in disqualification and selection of an alternate winner. Travel companion of winner must also execute an Affidavit of Eligibility/Liability & Publicity Release prior to ticketing and must possess required travel documents (e.g. valid photo I.D.) prior to departure. Once the travel schedule has been arranged, it cannot be altered and failure of winner to follow such schedule shall not obligate Sponsor in any way to provide the winner with alternate arrangements. The intellectual and industrial property rights to the contest submission, if any, will remain with the participants, except that these terms do not supersede any other assignment or grant of rights according to any other separate agreements between participants and other parties. As a condition of entry, participants agree that Sun shall have the right to use, copy, modify and make available the application or code in connection with the operation, conduct, administration, and advertising and promotion of the Contest via communication to the public, including, but not limited to the right to make screenshots, animations and video clips available to the public for promotional and publicity purposes. Notwithstanding the foregoing, ownership of and all intellectual and industrial property rights in and to the application and code shall remain with the participant. Acceptance of the prize constitutes permission for, and winners consent to, Sponsor and its agencies to use a winner's name and/or likeness and entry for advertising and promotional purposes without additional compensation, unless prohibited by law. To the extent permitted by law, entrants, agree to hold Sponsor, its parent, subsidiaries, agents, directors, officers, employees, representatives and assigns harmless from any injury or damage caused or claimed to be caused by participation in the Contest and/or use or acceptance of any prize won, except to the extent that any death or personal injury is caused by the negligence of the Sponsor. Sponsor is not responsible for any typographical or other error in the printing of the offer, administration of the Contest or in the announcement of the prize. 
A participant may be prohibited from participating in this Contest if, in the Sponsor's sole discretion, it reasonably believes that the participant has attempted to undermine the legitimate operation of this Contest by cheating, deception, or other unfair playing practices or annoys, abuses, threatens or harasses any other participants, the Sponsor or associated agencies. In the event a winner/potential winner's employer has a policy that prohibits the awarding of a prize to an employee, the prize will be forfeited and an alternate winner will be selected.

8. NO RECOURSE TO JUDICIAL OR OTHER PROCEDURES: To the extent permitted by law, the rights to litigate, to seek injunctive relief or to make any other recourse to judicial or any other procedure in case of disputes or claims resulting from or in connection with this contest are hereby excluded, and any participant expressly waives any and all such rights.

Participants agree that these Official Rules are governed by the laws of California, USA.

9. DATA PRIVACY: Participants agree that personal data, especially name and address, may be processed, stored and otherwise used for the purposes and within the context of the contest and any other purposes outlined in these Official Rules. The data may also be used by the Sponsor in order to check participants' identity, their postal address and telephone number, or to otherwise verify their eligibility to participate in the Contest and to receive any prize. Participants have a right to access, review, rectify or cancel any personal data held by the Sponsor by writing to Sponsor (Attention: Daniel Templeton) at the address listed below. If a participant's data is not provided or is canceled, that participant's Entries will be ineligible.

10. WARRANTY AND INDEMNITY: Entrants certify that their entry is original and that they are the sole and exclusive owner and right holder of the submitted entry and that they have the right to submit the Entry in the Contest. Each participant agrees not to submit any Entry that (1) infringes any 3rd party proprietary, intellectual property, industrial property, personal rights or other rights, including without limitation, copyright, trademark, patent, trade secret or confidentiality obligation; or (2) otherwise violates applicable law in any countries in the world. To the maximum extent permitted by law, each participant indemnifies and agrees to keep indemnified the Sponsor its parent, subsidiaries, agents, directors, officers, employees, representatives and assigns harmless at all times from and against any liability, claims, demands, losses, damages, costs and expenses resulting from any act, default or omission of the participant and/or a breach of any warranty set forth herein. To the maximum extent permitted by law, each participant indemnifies and agrees to keep indemnified the Sponsor, its parent, subsidiaries, agents, directors, officers, employees, representatives and assigns harmless at all times from and against any liability, actions, claims, demands, losses, damages, costs and expenses for or in respect of which the Sponsor will or may become liable by reason of or related or incidental to any act, default or omission by a participant under these Official Rules including without limitation resulting from or in relation to any breach, non-observance, act or omission whether negligent or otherwise, pursuant to these official rules by a participant.

11. ELIMINATION: Any false information provided within the context of the Contest by any participant concerning identity, postal address, telephone number, ownership of right or non-compliance with these rules or the like may result in the immediate elimination of the participant from the Contest. Sponsor further reserves the right to disqualify any Entry that it believes in its sole and unfettered discretion infringes upon or violates the rights of any third party or otherwise does not comply with these official rules.

12. INTERNET: Sponsor is not responsible for electronic transmission errors resulting in omission, interruption, deletion, defect, delay in operations or transmission. Sponsor is not responsible for theft or destruction or unauthorized access to or alterations of entry materials, or for technical, network, telephone equipment, electronic, computer, hardware or software malfunctions or limitations of any kind. Sponsor is not responsible for inaccurate transmissions of or failure to receive entry information by Sponsor on account of technical problems or traffic congestion on the Internet or at any Web site or any combination thereof, except to the extent that any death or personal injury is caused by the negligence of the Sponsor. If for any reason the Internet portion of the program is not capable of running as planned, including infection by computer virus, bugs, tampering, unauthorized intervention, fraud, technical failures, or any other causes which corrupt or affect the administration, security, fairness, integrity, or proper conduct of this Contest, Sponsor reserves the right at its sole discretion to cancel, terminate, modify or suspend the Contest. Sponsor reserves the right to select winners from eligible entries received as of the termination date. Sponsor further reserves the right to disqualify any individual who tampers with the entry process. Caution: Any attempt by a contestant to deliberately damage any Web site or undermine the legitimate operation of the game is a violation of criminal and civil laws and should such an attempt be made, Sponsor reserves the right to seek damages from any such contestant to the fullest extent of the law.

13. If any provision(s) of these Official Rules are held to be invalid or unenforceable, all remaining provisions hereof will remain in full force and effect.

14. WINNER'S LIST: For winner's name, log onto http://wikis.sun.com/display/HPCContest on or about August 14th, available for a period of up to 60 days.

15. SPONSOR: The Sponsor of this Contest is Sun Microsystems, Inc., 4220 Network Circle, Santa Clara, CA 95054.
