Monday Nov 30, 2009

Beta Testing the Sun Grid Engine Hadoop Integration

In case you haven't heard yet, the upcoming release of Sun Grid Engine will include an integration with Apache Hadoop that will allow Map/Reduce jobs to be submitted to a Sun Grid Engine cluster while minding HDFS data locality. The 6.2u5 release will be out by the end of the year, but it's currently in the beta testing phase. And that's where you come in.

I'm looking for some volunteers to test the integration. To that end, this blog post will provide instructions for how to get the beta code checked out and built. The Hadoop integration is actually only loosely dependent on the Sun Grid Engine software itself. While it's planned to be part of u5, the integration should be usable with a cluster as old to 6.2u2, although I would really recommend at least 6.2u4.

In a nutshell, the integration consists of two components. The first is the hadoop parallel environment that allows Map/Reduce jobs to be started as parallel jobs in a Sun Grid Engine cluster. The second is the integration with HDFS, called Herd, that makes the Sun Grid Engine scheduler aware of the locations of the HDFS data blocks. Herd has two parts. One part is a load sensor that runs on every execution machine and reports the HDFS blocks on that machine. The other part is a JSV that translates HDFS data paths included in the job submission into a list of HDFS blocks needed by the job.

How to check out the source code

  1. Make sure you have a functional CVS client.
  2. cvs -d :pserver:guest@cvs.sunsource.net:/cvs login
  3. cvs -d :pserver:guest@cvs.sunsource.net:/cvs checkout gridengine/source

Technically, the above will only check out the source directory, but for the Hadoop integration, that's all you need. The Hadoop integration lives in three places. First, the scripts live in source/dist/hadoop. Second, the Herd code lives at source/libs/herd. Third, the JSV Java language binding upon which the Herd code depends lives at source/libs/jjsv.

How to build the source code

  1. Make sure you're using at least Ant 1.6.3 and the Java Standard Edition 6 platform.
  2. Copy the source/build.properties file to build_private.properties.
  3. Edit the build_private.properties file to include the corrects paths for the Java Standard Edition 6 platform and junit 3.8.
  4. Change to the gridengine/source directory.
  5. ant jjsv
  6. ant herd

After the above steps, you will find herd.jar at source/CLASSES/herd/herd.jar and JSV.jar at source/CLASSES/jjsv/JSV.jar.

How to install the integration

  1. Copy herd.jar and JSV.jar to the $SGE_ROOT/lib directory.
  2. Copy the source/dist/hadoop directory to somewhere accessible by all the execution nodes.

How to configure the integration

  1. Get HDFS up and running on your cluster. The most useful configuration will be to have every execution host be a data node, and to only have execution hosts as data nodes. Also, because of the way Hadoop does authentication and authorization, you'll need to make sure that either HDFS has security disabled or that root and the SGE admin user are in the HDFS super user group.
  2. Copy your Hadoop configuration directory to <hadoop>/conf, where <hadoop> is the directory that you copied in step 2 of How to install the integration.
  3. Delete the <hadoop>/conf/mapred.xml, <hadoop>/conf/masters, and <hadoop>/conf/slaves files.
  4. Edit the <hadoop>/env.sh file to contain the paths to the Java platform, the Hadoop install directory, and the Hadoop configuration directory you just created (<hadoop>/conf).
  5. Change into the <hadoop> directory.
  6. ./setup.pl -i
  7. Add the hadoop parallel environment to one or more of your queues

The setup.pl script will install the hadoop parallel environment and the complexes needed by Herd. It will also start the Herd load sensor on all the execution hosts. At this point, you should be ready to go. Wait for a couple of minutes to give all of the execution hosts a chance to start running the load sensor and reporting values. You can run qhost -F hdfs_primary_rack to check that the load sensor is functioning correctly. Every execution host should report an hdfs_primary_rack value. If one or more machines have not reported a value within about five minutes, see the troubleshooting section below.

Using the integration

To submit a job that uses the hadoop parallel environment, use -pe hadoop <n>, where <n> is the number of nodes. The hadoop parallel environment uses an allocation rule that guarantees that no more than one task tracker per job will run on a single host. To tell the scheduler what data the job needs, request the hfds_input resource with a value of the HDFS path to the job's data. The data path must be an absolute path.

Here's an example. Say I want to use the grep example to find occurrences of the word 'Sun' in a series of documents. First, I'd copy those document into HDFS to /user/dant/sungrep (e.g. bin/hadoop fs -copyFromLocal ~/Documents/\* /user/dant/sungrep). I would then submit the job with echo `pwd`/bin/hadoop --config \\$TMPDIR/conf jar `pwd`/hadoop-0.20.1-examples.jar grep sungrep output Sun | qsub -pe hadoop 16 -l hdfs_input=/user/dant/sungrep -jsv <hadoop>/jsv.sh.

Let's look at that in a little more detail. First, we're echoing the Hadoop command and piping it to qsub. Why? Well, when the integration runs, it creates a conf directory in the job's temp directory that is properly set up for the assigned hosts. Until the job runs, though, we don't know where the temp directory is. We get it's path from the $TMPDIR variable once the job starts. We therefore need to wrap the Hadoop command in a script. We could either write a script that contains the command, or we could let qsub write one for us by piping the command to qsub's stdin. Note that we used --config \\$TMPDIR/conf in the command. The backslash is important because it prevents the shell on the submission host from interpreting the $TMPDIR variable.

Next, the qsub command uses -pe hadoop 16 to request 16 nodes. When this job is run, a job tracker will be started on the "master" host, and a task tracker will be started on each of the 16 assigned nodes. The master host is the host where the parallel job's master task is started. After the job tracker and task trackers are running, the grep job itself will be started, launched from the master host. The hadoop PE is a tight integration with an allocation rule of "1". In order to run a Hadoop job on top of SGE, you must use the PE, even if it's only a single-node job.

The qsub command also uses -l hdfs_input=/user/dant/sungrep -jsv <hadoop>/jsv.sh. The -l resource request tells SGE what data will be used by the job. It must be specified as an absolute path. The -jsv switch actually translates the resource request for hdfs_input into requests for specific racks and blocks. Without the -jsv switch, the job would never run because no node offers the hdfs_input resource. (No node offers it because it doesn't really exist. It's just a placeholder for the JSV to replace with rack and block requests. In programming terms, it's a reference injection point.) The resource request and JSV can be left out of the qsub command. If they're left out, the scheduler will not take the HDFS data locality into consideration when scheduling the job.

You can also use the Hadoop integration to set up the job tracker and task trackers and then submit jobs to them directly. Instead of echoing the Hadoop command to qsub, echo sleep 300000 instead. That will cause the job tracker and task trackers to be set up, but instead of running a job, it will just sleep for a long time. You can then run qstat -j <jobid> | grep context to show the job's context. One of the context variables will be the URL for the job tracker. Using that URL, you can set up a Hadoop configuration to talk to the job tracker so that you can submit jobs to it from the command line.

It is also highly recommended that the use of the Hadoop integration be coupled with exclusive host access. The Hadoop task trackers all assume that they have exclusive access to their nodes. If you don't use exclusive host access with the Hadoop integration, you'll end up oversubscribing the nodes.

Troubleshooting

Hopefully everything will work perfectly the first time. If for some reason it doesn't, here are some tips to help diagnose the problem:

The execds aren't reporting any hdfs resources, i.e. qhost -F | grep hdfs shows nothing.
Sometimes it takes several minutes for the nodes to start reporting the hdfs resources. If after several minutes there's still nothing, pick an execution host and check if the load sensor is running: jps -l. Look for com.sun.grid.herd.HerdJsv. Note that it might be running as root or as the SGE admin user. Also note that jps may only show you your own processes. If the load sensor isn't running, look for log files in /tmp. They will be called sge_hadoop_loadsensor.out and sge_hadoop_<n>.log. The .out file is the output from starting the load sensor. The .log files are the logging output from the load sensor. One will be the log file from the load sensor framework, and the other will be the log file from the Herd load sensor. (You can control the logging verbosity from the logging.properties file in the <hadoop> directory.) The most common problem is that the load sensor is started as the user root on most platforms (for a reason I don't yet understand), but that HDFS usually is not. With HDFS, the user who started it is the super user, and only the super user can query the kind of information that the load sensor needs. As stated in the configuration section, you must either disable HDFS security or set a super user group that contains root (and probably the SGE admin user). The next most common problems are that the path to Hadoop or the Java platform is not correct in env.sh or that the conf directory contains bad configuration information. You can test the load sensor manually by changing into the <hadoop> directory and running loadsensor.sh. If it works, it will "hang". Press enter, and it should spit out the hdfs resource values for that host. Type QUIT and press enter to exit the load sensor.
The job tracker and/or task trackers aren't starting.
The first place to look is the PE output and error files. The output from starting the job tracker should be found there. The next place to look is the log files. The log files are written where the Hadoop configuration says to put them. Make sure that wherever that is, all the users have access to it from all the nodes. Inability to write the log file is a common reason why the job tracker and/or task trackers won't start. In addition to the usual Hadoop log files, the integration also write a hadoop-<adminuser>-sge-<hostname>.log file. That file contains the output from starting the task trackers from the master host. Another common reason for the job tracker and/or task trackers not to start is that the path to the Java platform isn't correctly configured in the hadoop-env.sh file.

Monday Aug 10, 2009

Another Undocumented Feature

In reading the comments for Issue 409, I came across another undocumented feature I hadn't seen before. Apparently, if you pass a variable to your job through qsub or qrsh with the -v switch, and if that variable starts with SGE_COMPLEX_, the SGE_COMPLEX_ part will be stripped off, and the remainder will be treated as a resource request whose value will be placed in the job's environment.

An example will make it easier to explain. If your job is able to run on multiple architectures, but you always select on which architecture you're running it when you submit, you could add "-v SGE_COMPLEX_arch" to the qsub submission parameters, and the job's environment would then contain the value of arch that was requested as the -l arch=... resource request. In action, it would look like:

% qrsh -l arch=sol-amd64 -v SGE_COMPLEX_arch echo \\$SGE_COMPLEX_arch
sol-amd64

Nice, but why is it useful? Well, maybe your script is capable of operating in multiple environments, but it needs to know about how it was submitted. For example, maybe the script changes your application's startup parameters based on the memory limits. The script could use this feature to get the memory limits from the submission parameters and act accordingly. Of course, it could also get the memory limits from ulimit(1), so maybe not the best example. Licenses may be a better example. The OS is blissfully unaware of license assignments. The only way for your script to find out about how many licenses were requested for it would be to use this feature (or do some clever digging with qstat).

You might have noticed by now that you could get the same effect by just passing in the requested complex value as an environment variable, e.g. "qsub -l arch=sol-amd64 -v arch=sol-amd64 ..." The difference between using the SGE_COMPLEX_ feature and using an environment variable explicitly is that with the SGE_COMPLEX_ feature you don't have to know what the requested value was, i.e. you can add it to an sge_request(5) file or write it into your script. And now we come to the real value. If you have a job that needs to know about its submission parameters, you can embed submission directives to add the needed complexes' values to the environment. Pretty handy. Whenever you can, it's a good idea to make your jobs and scripts as self-contained as possible.

Update: It would appear that this feature no longer works in 6.2. The last version I was able to verify it in was 6.1u2. Not such a big deal, though, because with 6.2 we introduced JSVs, which let you do the same thing and a tremendous amount more.

Thursday Jul 30, 2009

Sun HPC Software Workshop '09 -- Early Bird's Almost Over!

Just wanted to remind everyone that the early bird registration for the Sun HPC Software Workshop '09, Sept 7-10 in Regensburg, Germany, ends tomorrow (31 July 2009). It's your last chance to sign up at the discounted rate. After tomorrow, you will still be able to register, but the cost of registration will be higher.

In a nutshell, the Sun HPC Software Workshop '09 is a combination of our annual Grid Engine Workshop, a European edition of the popular Lustre Users Group meeting, and a conference on developing applications and services for HPC and cloud environments. The Workshop lasts three days, with a presentation track representing each of these topics. One the day before the main Workshop starts, we're also holding deeper technology seminars: a Lustre Deep Dive, a Grid Engine admin training, and a class on parallel application development taught by Ruud van der Pas. The Workshop and the preceding seminars are an excellent opportunity to learn more about these technologies and connect with the product engineers, partners, and other community members.

There is an open Call for Presentations for the Workshop, but it also closes tomorrow. If you're interested in proposing a talk for the Workshop (and getting a discounted registration fee if it's accepted), send a title, duration, and brief summary to the email address listed on the Agenda page. But, hurry. We'll be making our final decisions and notifying the speakers soon.

I look forward to seeing you there!

Tuesday Jul 21, 2009

Lies, Damned Lies, & DRMs

Some of our competitors seem to be very fond of spreading the rumor that the Sun Grid Engine product team has been laid off and/or that the product has been discontinued. It would appear that since they can't claim to have a better, more scalable, or more cost-effective product, they're willing to go with lying through their teeth to make the sale. Since I keep getting asked this question, I figured it would be worthwhile to post an official response.

To plagiarize Mark Twain, the rumors of our death have been greatly exaggerated. We're still here and going strong. The team is now roughly four times the size it was when I joined six years ago. It spans six offices in five countries on three continents. The product has a road map that reaches out past 2012 (which is as far as we're willing to speculate). We have a massive (if not leading) share in both the open source and licensed DRM system markets, and we're not planning to go away any time soon.

Of course, with the deal with Larry pending, nothing is certain. The only comment I can make there is "no comment." That said, for now at least, it's business as usual. We're still writing code, preparing releases, doing trainings, holding our annual Workshop, etc. Look for the next update this quarter. Look for the next release next year. And look for a whole lot more good stuff coming from our team over the next several updates and releases. With the features that have been added in the 6.2, 6.2u2 and 6.2u3 releases, Sun Grid Engine is in a great position. With what's coming up, I'd resort to lying too, if I worked for one of our competitors.

Monday Jul 20, 2009

European Students: Want a Free Laptop?

Are you a student in Europe\*? Do you want a new Toshiba laptop? Willing to write some code to get it? Good. Read on.

The OpenSolaris HPC team is currently running a programming contest for European students that was launched at ISC in Hamburg last month. The contest is to write the most performant and scalable implementation of a distributed hash table. Submission can be from teams of up to three people. The top prize is a new Toshiba laptop for each member of the winning team.

For more information, check out the contest site. Better hurry, though, because the contest deadline is coming up quick!

\* Contest participation is limited to legal residents of a specific list of European countries. See the contest site for details.


OFFICIAL RULES
NO PURCHASE NECESSARY

1. DESCRIPTION OF THE CONTEST: The Sun HPC Software Student Programming Challenge ISC 2009 ("Contest") is designed to promote the use of the Sun HPC Software, Developer Edition 1.0 for OpenSolaris among students by having them compete to design and implement the most scalable and best-performing implementation of a common parallel algorithm. Prizes will be awarded to those who submit the best entries as determined by the judges in accordance with these Official Rules.

2. ELIGIBILITY: This contest is open only to teams of 1 to 3 currently-enrolled, full- or part-time, undergraduate or graduate, university or college students, who are the legal age of majority in their country, province or state of legal residence and residents of Denmark, France, Germany, Italy, Poland, Russia, Spain, Sweden, Switzerland, and the United Kingdom. Void in Puerto Rico, Quebec and where prohibited by law. Persons in any of the following categories are not eligible to participate or win the prize(s) offered: (a) Employees or agents of Sun Microsystems, their parent companies, affiliates and subsidiaries, participating advertising and promotion agencies, application development partner companies, and prize suppliers; (b) immediate family members (defined as parents, children, siblings and spouse, regardless of where they reside) and/or those living in the same household as any person in (a) above; and (c) employees of any government entity. You must also have access to the Internet and a valid email address in order to enter or win.

3. HOW TO ENTER: This contest begins at 12:01 P.M. Pacific Time (PT) Zone in the United States (e.g. San Francisco time) which is 5:01 A.M. Greenwich Mean Time (GMT) on the 29th of June 2009 and ends at 11:59 P.M. (PT) which is 4:59 A.M. (GMT) on 10th of August 2009 ("Contest Period"). IMPORTANT NOTICE TO ENTRANTS: ENTRANTS ARE RESPONSIBLE FOR DETERMINING THE CORRESPONDING TIME ZONE IN THEIR RESPECTIVE JURISDICTIONS.

4. THE SUBMISSION: Create an implementation of a fault-tolerant distributed hash table as described at http://wikis.sun.com/display/HPCContest/Sun+HPC+Software+Student+Programming+Challenge+ISC+2009 The implementation must be written in C for the OpenSolaris 2009.06 operating environment using the Sun HPC ClusterTools 8.1 OpenMPI implementation and must be submitted as a Sun Studio 12 project. All Entries must include a valid and complete Sun Studio 12 project that builds without errors on an unmodified instance of the Sun HPC Software, Developer Edition 1.0 for OpenSolaris. Entries may be submitted either electronically or via mail. All Entries must be comprised of original work of the submitter(s). No participant may submit an Entry as a member of more than one team.

Electronic Entries must include a 1-3 page written summary of the implementation approach and the name(s) of the submitter(s). The electronic file must be a gzipped tar file that includes the Sun Studio 12 project directory, including all required files, and must be no larger than 5MB in size. If the electronic file is larger than 5MB in size, it must be submitted by mail in accordance with the instructions below. The electronic entry must be sent via email to hpccontest@sun.com and received no later than 11:59 PM (PDT) on August 10th, 2009 in the United States.

Mailed Entries must include a 1-3 page written summary of the implementation approach and the name(s) of the submitter(s), and a CD or DVD containing the project code as described above. All mailed Entries must be sent to Sun HPC Software Programming Challenge, c/o Sun Microsystems, Inc., 17 Network Circle, Menlo Park, CA 94025, MS-MPK17-207, and must be received no later than 11:59 PM (PDT) on August 10th, 2009 in the United States.

All Entries must be in English. Registration or Entries that are in any other language will not be considered. Entries that are lewd, obscene, pornographic, disparaging of the Sponsor or otherwise contain objectionable material may be disqualified in the Sponsor's sole and unfettered discretion.

5. JUDGING: All Entries will be judged by a panel of experts based on the following equally weighted judging criteria: data retrieval throughput for requests coming from a single node, data retrieval throughput for parallel requests coming from multiple nodes, ability to withstand processing node failure, and scalability with respect to number of processing nodes and number of data items. In the event of a tie, the person or team among the tied Entries with the highest score in scalability with respect to number of processing nodes and number of data items will be declared the winner. In the event that no entries are received, no prize will be awarded. Decisions of judges are final and binding. Winner will be notified by email.

6. PRIZES AND APPROXIMATE RETAIL VALUE: First prize: Toshiba OpenSolaris laptop valued at approximately $2,000. Second and third prizes: Apple iPod valued at approximately $150. Up to three Toshiba laptops and six Apple iPods may be awarded. Prize includes round-trip coach air transportation for one person from major airport nearest winner's residence and hotel accommodations for one person for four nights. Hotel accommodations at Sponsor's discretion. Certain black out dates apply. In the event the Sun HPC Software Workshop is cancelled or postponed for any reason, Sponsor reserves the right to award the remainder of the prize with no further obligation to the winner. All other expenses not specified herein are the responsibility of the winner. ALL TAXES AND ANY APPLICABLE WITHOLDING AND REPORTING REQUIREMENTS ARE THE SOLE RESPONSIBILITY OF THE WINNER. Cash prizes will be awarded in US Dollars. All costs associated with currency exchange are the sole responsibility of the winner.

7. CONDITIONS OF PARTICIPATION. Sponsor reserves the right to substitute a prize for an item of equal or greater value in the event all or part of a prize becomes unavailable. Prizes are awarded without warranty of any kind from Sponsor, express or implied, without limitation, except where this would be contrary to federal, state, provincial, or local laws or regulations. All federal, state, provincial and local laws and regulations apply. Submission of entry into this Contest deems that entrants agree to be bound by the terms of these Official Rules and by the decisions of Sponsor, which are final and binding on all matters pertaining to this Contest. Return of any prize/prize notification may result in disqualification and selection of an alternate winner. Any potential winner who cannot be contacted within 15 days of attempted first notification will forfeit his/her prize. Potential prize winner(s) may be required to sign and return an Affidavit or Declaration of Eligibility/Liability & Publicity Release within 30 days following the date of first attempted notification. Failure to comply within this time period may result in disqualification and selection of an alternate winner. Travel companion of winner must also execute an Affidavit of Eligibility/Liability & Publicity Release prior to ticketing and must possess required travel documents (e.g. valid photo I.D.) prior to departure. Once the travel schedule has been arranged, it cannot be altered and failure of winner to follow such schedule shall not obligate Sponsor in any way to provide the winner with alternate arrangements. The intellectual and industrial property rights to the contest submission, if any, will remain with the participants, except that these terms do not supersede any other assignment or grant of rights according to any other separate agreements between participants and other parties. As a condition of entry, participants agree that Sun shall have the right to use, copy, modify and make available the application or code in connection with the operation, conduct, administration, and advertising and promotion of the Contest via communication to the public, including, but not limited to the right to make screenshots, animations and video clips available to the public for promotional and publicity purposes. Notwithstanding the foregoing, ownership of and all intellectual and industrial property rights in and to the application and code shall remain with the participant. Acceptance of the prize constitutes permission for, and winners consent to, Sponsor and its agencies to use a winner's name and/or likeness and entry for advertising and promotional purposes without additional compensation, unless prohibited by law. To the extent permitted by law, entrants, agree to hold Sponsor, its parent, subsidiaries, agents, directors, officers, employees, representatives and assigns harmless from any injury or damage caused or claimed to be caused by participation in the Contest and/or use or acceptance of any prize won, except to the extent that any death or personal injury is caused by the negligence of the Sponsor. Sponsor is not responsible for any typographical or other error in the printing of the offer, administration of the Contest or in the announcement of the prize. A participant may be prohibited from participating in this Contest if, in the Sponsor's sole discretion, it reasonably believes that the participant has attempted to undermine the legitimate operation of this Contest by cheating, deception, or other unfair playing practices or annoys, abuses, threatens or harasses any other participants, the Sponsor or associated agencies. In the event a winner/potential winner's employer has a policy, which prohibits the awarding of a prize to an employee, the prize will be forfeited and an alternate winner will be selected.

8. NO RECOURSE TO JUDICIAL OR OTHER PROCEDURES: To the extent permitted by law, the rights to litigate, to seek injunctive relief or to make any other recourse to judicial or any other procedure in case of disputes or claims resulting from or in connection with this contest are hereby excluded, and any participant expressly waives any and all such rights.

Participants agree that these Official Rules are governed by the laws of California, USA.

9. DATA PRIVACY: Participants agree that personal data, especially name and address, may be processed, stored and otherwise used for the purposes and within the context of the contest and any other purposes outlined in these Official Rules. The data may also be used by the Sponsor in order to check participants' identity, their postal address and telephone number, or to otherwise verify their eligibility to participate in the Contest and to receive any prize. Participants have a right to access, review, rectify or cancel any personal data held by the Sponsor by writing to Sponsor (Attention: Daniel Templeton) at the address listed below. If participant's data is not provided or is canceled participants' Entries will be ineligible.

10. WARRANTY AND INDEMNITY: Entrants certify that their entry is original and that they are the sole and exclusive owner and right holder of the submitted entry and that they have the right to submit the Entry in the Contest. Each participant agrees not to submit any Entry that (1) infringes any 3rd party proprietary, intellectual property, industrial property, personal rights or other rights, including without limitation, copyright, trademark, patent, trade secret or confidentiality obligation; or (2) otherwise violates applicable law in any countries in the world. To the maximum extent permitted by law, each participant indemnifies and agrees to keep indemnified the Sponsor its parent, subsidiaries, agents, directors, officers, employees, representatives and assigns harmless at all times from and against any liability, claims, demands, losses, damages, costs and expenses resulting from any act, default or omission of the participant and/or a breach of any warranty set forth herein. To the maximum extent permitted by law, each participant indemnifies and agrees to keep indemnified the Sponsor, its parent, subsidiaries, agents, directors, officers, employees, representatives and assigns harmless at all times from and against any liability, actions, claims, demands, losses, damages, costs and expenses for or in respect of which the Sponsor will or may become liable by reason of or related or incidental to any act, default or omission by a participant under these Official Rules including without limitation resulting from or in relation to any breach, non-observance, act or omission whether negligent or otherwise, pursuant to these official rules by a participant.

11. ELIMINATION: Any false information provided within the context of the Contest by any participant concerning identity, postal address, telephone number, ownership of right or non-compliance with these rules or the like may result in the immediate elimination of the participant from the Contest. Sponsor further reserves the right to disqualify any Entry that it believes in its sole and unfettered discretion infringes upon or violates the rights of any third party or otherwise does not comply with these official rules.

12. INTERNET: Sponsor is not responsible for electronic transmission errors resulting in omission, interruption, deletion, defect, delay in operations or transmission. Sponsor is not responsible for theft or destruction or unauthorized access to or alterations of entry materials, or for technical, network, telephone equipment, electronic, computer, hardware or software malfunctions or limitations of any kind. Sponsor is not responsible for inaccurate transmissions of or failure to receive entry information by Sponsor on account of technical problems or traffic congestion on the Internet or at any Web site or any combination thereof, except to the extent that any death or personal injury is caused by the negligence of the Sponsor. If for any reason the Internet portion of the program is not capable of running as planned, including infection by computer virus, bugs, tampering, unauthorized intervention, fraud, technical failures, or any other causes which corrupt or affect the administration, security, fairness, integrity, or proper conduct of this Contest, Sponsor reserves the right at its sole discretion to cancel, terminate, modify or suspend the Contest. Sponsor reserves the right to select winners from eligible entries received as of the termination date. Sponsor further reserves the right to disqualify any individual who tampers with the entry process. Caution: Any attempt by a contestant to deliberately damage any Web site or undermine the legitimate operation of the game is a violation of criminal and civil laws and should such an attempt be made, Sponsor reserves the right to seek damages from any such contestant to the fullest extent of the law.

13. If any provision(s) of these Official Rules are held to be invalid or unenforceable, all remaining provisions hereof will remain in full force and effect.

14. WINNER'S LIST: For winner's name, log onto http://wikis.sun.com/display/HPCContest on or about August 14th, available for a period of up to 60 days.

15. SPONSOR: The Sponsor of this Contest is Sun Microsystems, Inc., 4220 Network Circle, Santa Clara, CA 95054.

Thursday Mar 19, 2009

Podcast: New Installer in Sun Grid Engine 6.2 Update 2

I just posted a new podcast on the new installer in Sun Grid Engine 6.2u2. Check it out.

Monday Mar 16, 2009

New Installer in Sun Grid Engine 6.2 Update 2

In my previous post, I talked about the new installer that is included with Sun Grid Engine 6.2u2. Lubos, one of our core team (as opposed to Service Domain Manager or QA) engineers in Prague, has just posted a couple of videos of the new installer. The first one shows how to make sure the new installer can be used with the machines you're planning to use for your cluster. Because the new installer can install an entire cluster at once, it has to be able to contact all the machines destined for the cluster, and that's where the setup comes in. The second one actually shows off the new installer. Lubos also has some screenshots of the new installer posted.

Thursday Mar 05, 2009

Sun Grid Engine 6.2 Update 2 Is Out!

Sun Grid Engine 6.2u2 is now available. If you're not excited, you should be. First off, don't let the name fool you. 6.2u2 is not just bug fixes. It's a full feature release, and contains some great features. What features? Glad you asked.

First and foremost, job submission verifiers (JSVs). It's a feature we added specifically for TACC, but it's one that will be useful for almost everyone. In fact, I suspect that we'll discover it's the answer to some of the classic Sun Grid Engine problems. What is it? Before 6.2u2, there was no way to prevent a job from being submitted. It was (and still is) possible to choose not to schedule a job after it's been submitted, but before 6.2u2, that's all you could do. With 6.2u2 and JSV, you now have the option to insert a step between submission and acceptance. With that step, you can choose to accept or reject the job submission, but you can also choose to modify the job before accepting it, and that's where the magic comes in.

The verification step is handled through scripts or binaries. There's a new submission option, -jsv, that adds a JSV to the submission. That means you can pick up JSVs from anywhere that you can stash a submission option: most notably the global sge_request file, your user sge_request file, and the directory's sge_request file, but also DRMAA native specification, DRMAA job category, the enigmatic -@ switch, and, of course, the command line itself. The -jsv switch is cumulative, so if you have one in several of those places, several JSVs will be run for your submission. It's worth noting that all of the above listed JSV sources are controlled by the user, except the global sge_request file, and even that can be overridden with the -clear switch.

So far, we've only talked about the client side. JSVs can also come in on the server side. In the global host configuration an administrator can configure a single JSV. Unlike on the client side where every JSV is started from scratch with every job submission, on the server side the JSV is started once and queried repeatedly. The reason is that on the client side, performance isn't a big issue, but on the server side, the cost of forking and execing the JSV for every job submission can have a huge impact. By keeping the JSV running, we save that cost. The big advantage of the server-side JSV is that users can't circumvent it. If you really need to enforce a policy with a JSV, the server side is that place to do it.

Now, if you're thinking fast, you might question the point of the server-side JSV when users can change everything about the job using qalter after it's submitted. Well, so did we. When you configure a server-side JSV, users are no longer allowed to modify jobs after submission unless you specifically grant the ability to do so, and even then it's limited to the job attributes that you allow them to modify.

JSV is a huge topic, and I could probably go on for days about it. Instead I'll save it for a white paper and move on.

The next big feature in 6.2u2 is the new installer. You now have the option of using the old interactive text-based installer or a new graphical installer. The graphical installer has several important advantages. First, it lets you install an entire cluster at once. It actually sits on top of the auto-installer and reuses that same functionality to install remote nodes. The graphical installer, however, will first verify that all the nodes are reachable before the installation starts, so the installation won't quietly hang on an unreachable node. It also accepts wildcarded host name and IP address ranges, which makes installing a huge cluster much simpler.

The third major feature is that we've added support for Microsoft Windows Vista (Ultimate and Enterprise) and Server 2003R2 and 2008. Both 32-bit and 64-bit version are available. Harald (who you should encourage to start blogging!) worked really hard on ironing out the issues with the changes in the OS. We still rely on SFU for the Windows execution daemons, except that it's now called SUA.

The fourth big feature is job-level parallel job resource requests. Before 6.2u2, whenever a parallel job requested a resource, SGE would implicitly multiply that resource request by the number of assigned slaves (because each slave requests the resource on the host where it runs). That makes sense with, say, memory, where requesting 4GB really means that every slave should have 4GB. It doesn't make any sense for other things, like some software licenses. Now with 6.2u2, the administrator can flag a resource as job level, meaning that it is not multiplied by the number of assigned slaves when requested by a parallel job. In most cases, a resource that shouldn't be multiplied in for one job, shouldn't be multiplied for any job. There may be exceptions to the rule, but I doubt there will be many. I'd love to hear your feedback, though.

The last two new features aren't so much features as improvements. Starting with 6.2u2, the 64-bit Linux binaries use the jemalloc library instead of the default Linux malloc. The performance and memory footprint impact is significant, in some cases as much as 20% improvement. Also, starting with 6.2u2, the Linux binaries use poll() instead of select() in the commlib. For some flavors of Linux, the use of select() made it difficult to scale past a couple thousand hosts. With the commlib now using poll(), I've seen SGE scale well over 6000 Linux nodes.

And on top of all that, there is the usual pile of bug fixes. A handful of qmaster and scheduler issues cropped up recently in 6.2 and 6.2u1, but with 6.2u2 those should all now be resolved.

I highly recommend giving 6.2u2 a try, if for no reason other than JSV. Let me know what you think!

Thursday Feb 05, 2009

Performance Considerations For JSV Scripts

One of the new features coming in Grid Engine 6.2u2 is job submission verification (JSV). The basic idea is that on both the client side and the server side, you have the ability to add scripts that can read through all the job submission options and accept, reject, or modify the job. JSV will open up a whole new world of possibilities that didn't exist before, and it will largely end the need for qsub wrapper scripts.

Because the server-side JSV scripts are executed by the qmaster for every job, there are performance considerations that must be taken into account. In order to limit the performance impact, the qmaster will manage the JSV scripts the same way load sensor scripts are handled, i.e. they are started once and kept alive as a separate process instead of starting them once per job. Nonetheless, what happens inside the scripts can still have a big impact on qmaster performance.

In a test, Roland (who still isn't blogging!) set up some DRMAA submission clients to hammer the master with job submissions. With no server-side JSV scripts, the clients were able to do 900 job submissions per second\*. With a simple server-side Perl JSV script to change the job name, the clients were only able to submit 700 jobs per second. A similar JSV script written in Tcl yielded the same results. With a similar JSV script written in Borne shell, however, the clients were only able to submit 3 jobs per second. No, that isn't a typo. While languages like Perl and Tcl are able to process numbers and strings natively, Borne shell has to rely on forking off other commands. Those forks are expensive and, even in a simple JSV script, yield major performance penalties. For these reasons, I actually recommend the Java™ language for writing server-side JSV scripts. Not only do you get access to the all the great built-in and external libraries, but you also get access to JGDI, letting you talk to the qmaster without forking an SGE command-line tool. (Thanks to Jython, JRuby, Rhino, et al, you can get the same benefits from languages other than just the Java language.)

Let me repeat that point to make sure it comes through loud and clear. If you use a shell script as a server-side JSV script, you will trash your cluster's job submission rate. That's not just for DRMAA jobs or for certain users. That's for the entire cluster.

On the client side, the story is a bit simpler but still similar. For every job submission, each client-side JSV will be started. (An array job counts as a single job in this regard.) That makes sense because qsub is started once for every job submission, and the JSV scripts can't outlive the qsub that launched them.

For DRMAA the implications are a little different. The JSV scripts are still started for every job submission, even though the DRMAA client remains running between submissions. (A DRMAA bulk job is an array job and hence still counts as a single job in this respect.) Roland used DRMAA clients in his test because they're very fast at job submissions. Using client-side JSV scripts affects that in much the same way as on the server side. And as with the server-side scripts, shell scripts have more of an effect than scripts written in a higher-level language. If you figure there's about 200ms overhead for every fork & exec, you could easily add several seconds to each job submission. A DRMAA client without client-side JSV scripts can easily submit over 100 jobs per second. With even a single client-side JSV script that runs no further commands, your submission rate drops to less than 5 jobs per second. Use with caution!

The JSV feature in 6.2u2 is extremely powerful, but as I've explained, you have to use it with care. When used with appropriate caution, however, JSV provides a fairly easy answer to some of the traditionally thorny issues for SGE administrators.

\* Roland achieved that submission rate with a Sun x4100M2 with two dual core AMD 2.8GHz processors running Solaris 10 as the master.

Wednesday Jan 07, 2009

Grid Infrastructure Infrastructure

Owen Taylor (formerly) of GigaSpaces has put together an excellent proof of concept using GigaSpaces XAP and Sun Grid Engine. Using Sun Grid Engine, the PoC is able to grow and shrink the size of the GigaSpaces cluster dynamically according to changing load conditions. The PoC monitors GigaSpaces via JMX and then uses DRMAA to submit new instances to SGE or stop existing ones. Read more about it.

Tuesday Jan 06, 2009

Connecting All the Dots

The last couple of weeks before the holidays I worked on an interesting project. It involved assembling pretty much everything Sun offers for HPC into a single coherent demo and throwing in Amazon EC2 to boot. This post will explain what I did and how I did it. Let's start at the beginning.

One of the new offerings from Sun is the Sun HPC Software. Beneath the excessively generic name is a complete, integrated stack of HPC software components. Currently there are two editions: the Sun HPC Software, Linux Edition (aka Project Giraffe) and the Sun HPC Software, Solaris Developer Edition. (A Sun HPC Software, Solaris Edition and Sun HPC Software, OpenSolaris Edition will be following shortly.) The Linux edition is exactly what the name implies. It's a full stack of open source HPC tools bundled into a Centos image, ready to push out to your cluster. The Solaris developer edition is a slightly different animal. It is targeted at developers interested in writing HPC applications for Solaris. The Solaris developer edition is a virtual machine image (available for VMware and Virtual Box) that includes Solaris 10 and a pre-installed suite of Sun's HPC products, including Sun Grid Engine, Sun HPC ClusterTools, Sun Studio, and Sun Visualization, all integrated together.

For this demo, I used the Solaris developer edition. The end goal was to produce a version of the virtual machine image that was capable of automatically borrowing resources from a local pool or from the cloud in order to test or deploy developed HPC applications. Inside the developer edition virtual machine, there are already two Zones that act as virtual execution nodes for testing applications. That's a nice start, but what about testing on real machines or a larger number of machines? That's where the resource borrowing comes in. In the end, I had a VM image that was capable of automatically borrowing and releasing resources first from a local pool and later from the cloud, on demand.

The first step was to get the developer edition running as-is. Sounded simple enough. The first wrinkle was that I was doing this demo on a Mac. The regular VMware Player is not available for Mac, so I had to download an eval copy of VMware Fusion. Once I had Fusion installed, I was able to bring up the developer edition VM without a hitch.

Step 2 was to get the VM networked. The network configuration for the developer edition beta 1 is such that the global and non-global Zones can see each other, but nobody can get into or out of the VM. Getting the networking working was probably the hardest part of the demo, and honestly, I can't tell you how I finally did it. Per the suggestion of the pop-up dialogs from VMware, I installed the VMware Tools in the VM's Solaris instance. That changed the name of the primary interface from pcn0 to vmxnet0, but didn't actually help. Solaris was still unable to plumb the interface. After twiddling the VM's network settings several times and doing several reconfiguration boots, I eventually ended up with a working vmxnet1 interface (and a dead pcn0 and vmxnet0). As usual in such adventures, I'd swear that the last thing I did before it started working should not have had any appreciable effect. Oh, well. It worked, and I wasn't interested in understanding why.

Now that I had a functional network interface, the next step was to reinstall the Sun Grid Engine product. The VM comes with a preinstalled instance, but this demo requires features not enabled in a default installation, like what the VM provides. I left the original cell (default) intact and installed a new cell (hpc) with the -jmx and -csp options. -jmx enables the Java thread in the qmaster that serves up the JGDI API over JMX. I needed JGDI so that the demo GUI that I was building could receive event updates from the qmaster about job and host changes. With Sun Grid Engine 6.2, I was unable to successfully connect to the JMX server unless I installed the qmaster with certificate-based security, hence the -csp option. After the installation was complete, I then had to do the usual CSP certificate juggling, plus a new wrinkle. In order to connect to the JMX server, I also had to create a keystore for the connecting user with $SGE_ROOT/util/sgeCA/sge_ca -ks <user>. There's a quirk to the sge_ca -ks command, though. By default, it fails, explaining that it can't find the certificates. The reason is that the path to the certificates is hard-coded in the sge_ca script to a ridiculous default value. To change it to the correct value, I had to use the -calocaltop switch. After the certificates were squared away, I installed execution daemons in both Zones. At least that part was easy.

The next thing I did was to create some more Zones. Yes, I know this demo was supposed to be using real machines from a local pool and the cloud. Because it's a demo on a laptop, the "local machines" had to be equally portable. Because of firewall issues, I also wanted to have a backup for the cloud. In an effort to be clever, I moved the file systems for the two existing Zones onto their own ZFS volumes. I wanted to create the new Zones as cloned snapshots of the old Zones. Unfortunately, it turns out that even though the man page for zfs(1M) says that it's possible, the version of Solaris installed in the VM is the last version on which it isn't possible. After chasing my tail a bit, I decided to just do it the old fashioned way instead of trying to force the new fangled way to work.

Now that I had six non-global Zones running, the next step was to get Service Domain Manager installed. It is neither installed nor included in the developer edition VM, so I had to scp it over from my desktop. Technically, I could probably have managed to download it directly from the VM, but I had already downloaded it to my desktop before I started. For the Service Domain Manager installation, I followed Chansup's blog rather than the documentation. Chansup's blog posts detail exactly what steps to follow without the distraction of all the other possibilities that the docs explain. Following the steps in the blog, I was able to get the Service Domain Manager master and agents installed with little difficulty. The hardest part is that the sdmadm command has extremely complicated syntax, and it took a while before I could execute a command without having the docs or blog in front of me as a reference. To prove that the installation worked, I manually forced Service Domain Manager to add one of the new Zones to the existing Sun Grid Engine cluster, and much to my shock and wonderment, it worked.

The last step of VM (re)configuration was to configure the Service Domain Manager with a local spare pool and a cloud spare pool and a set of policies to govern when resources should be moved around. This step proved about as tricky as I expected. As one of the original architects and developers of the product, I had a good idea of what I wanted to do and how to make it happen, but the syntax and the details were still problematic. The syntax was the first hurdle. The docs have issues with both understandability and accuracy, and Chansup's blog was too narrowly focused for my purposes. After I poked around a bit, I figured out how to do what I wanted, but actually doing it was the next challenge. What I wanted to do was create two MaxPendingJobsSLO's...

We interrupt your regularly scheduled blog post to bring you a public service announcement. Please, for your own well being and the well being of others who might use your software, test all of your code contributions thoroughly on all supported platforms, and have them reviewed by an experienced member of the development team before committing, especially if you're working on the Firefox source base. This point in the blog post is the last time I saved my text before completing the post. Before I could save it, Firefox segfaulted causing me to loose a significant amount of work. What follows is a downtrodden, half-hearted attempt to complete the post again. We now return you to your regularly scheduled blog post.

What I wanted to do was create two MaxPendingJobsSLO's for the Sun Grid Engine instance. The first would post a moderate need (50) when the pending job list was more than 6 jobs long. The second would post a high need (99) when the pending job list was more than 12 jobs long. I also wanted to have a local spare pool with a low (20) PermanentRequestSLO and a low FixedUsageSLO, and a cloud spare pool with a moderate (60) PermanentRequestSLO and a moderate FixedUsageSLO. The idea was that when the Sun Grid Engine cluster was idle, all the resources would stay where they were. When the pending job list was longer than 6 jobs, resources would be taken from the local spare pool. When the pending job list was longer than 12 jobs, additional resource would be taken from the cloud spare pool. When the pending job list grew shorter, the resources would be returned to their spare pools. In theory. (The philosophy of setting up Service Domain Manager SLOs is a full topic unto itself and will have to wait for another blog post.)

The first problem I ran into was that Service Domain Manager does not allow a spare pool to have a FixedUsageSLO. An issue has been filed for the problem, but that didn't help me set up the demo. The result was that I had no way to force Service Domain Manager to take the local spare pool resources before the cloud spare pool resources. The best I could do was set the averageSlotsPerHost value for the SLO for the MaxPendingJobsSLO's to a high number so that Service Domain Manager only would take hosts one at a time, rather than one from each spare pool simultaneously.

The nest problem was quite unexpected. With the SLOs in place, I submitted an array job with 100 tasks. I waited. Nothing happened. I waited some more. Still nothing happened. I turns out that the MaxPendingJobsSLO only counts whole jobs, not job tasks like DRMAA would. The work-around was easy. I just had to be sure the demo submitted enough individual jobs instead of relying on array tasks.

The last problem was one that I had been expecting. After a long pending job list had caused Service Domain Manager to assign all the available resources to the cluster, when the pending job list went to zero, the borrowed resources didn't always end up where they started. Service Domain Manager does not track the origin of resources. Fortunately, the issue is resolved by an easy idiom. I created a source property for every resource, and I set the value of the property to either "cloud", "spare", or "sge". I then set up the spare pools' PermanentRequestSLO's to only request resources with appropriate source settings. I also added a MinResourceSLO for the cluster that wants at least 2 resources that didn't come from a spare pool, just to be complete.

With the SLOs in place, the configuration actually did what it was supposed to. When the cluster had enough pending jobs, hosts were borrowed first from the local spare pool and then from the cloud. When the pending jobs were processed, the resources went back to the appropriate spare pools. To make the configuration more demo-friendly, I changed the sloUpdateInterval for the Sun Grid Engine instance to a few seconds (from the default of a few minutes). I also changed the quantity for the spare pools' PermanentRequestSLO's to 1 so that they would only reclaim their resources one at a time, rather than all at once. With those last changes made, I was ready to move on to the UI.

The idea of the demo was to present a clear graphical representation of what was going on with Sun Grid Engine and Service Domain Manager. From past experience building a similar demo for SuperComputing, I knew that JavaFX™ Script was the best tool for the job. (OK. It's not the best tool for the job in a general sense, but I'm a long-time Java™ geek, I don't know Flash, and I didn't have any budget to buy tools. Under those constraints, it was the best I could do.) Before I could get to building the UI, though, I first needed a JGDI shim to talk to the qmaster. Richard kindly provided me with some JGDI sample code, and from there it was pretty easy. The hardest part was figuring out what the events actually meant. In the end, my shim registered for job add events (to recognize job submissions), task modified events (to recognize job tasks being scheduled), and job deleted events (to recognize job completions). It also registered for host added and deleted events to recognize when Service Domain Manager reassigned a host.

With the shim working smoothly, I turned to the actual UI. Given the complexity of the animations that I wanted to do, it was shockingly simple to achieve with JavaFX Script, especially considering that there was not yet a graphical tool equivalent to Matisse for Swing. Every bit of it was hand-coded, but it still was fast, easy, and came out looking great. In the end, the whole UI, counting the shim, was about 1500 lines of code, and about 500 lines of that was the shim. (JGDI is rather verbose, especially when establishing a connection to the qmaster.)

And with that, I ran out of time. The next step would have been to actually populate the cloud spare pool with machines provisioned from the cloud. Torsten graciously provided me a Solaris AMI that included Sun Grid Engine and Service Domain Manager. The plan was to pre-provision two hosts to populate the pool and then create a script that would provision an additional host each time the cloud pool dropped below two hosts and release a host every time it grew larger than two hosts. Now that the demo has been presented, the pressure is off, and other things are higher priority. I do plan, however, to eventually come back and put the last piece of the puzzle in place.

Below is a video of the demo, showing how jobs can be submitted from the Sun Studio IDE, and how Sun Grid Engine and Service Domain Manager work together with the local spare pool and the cloud to handle the workload. The job that is being submitted is a short script that submits eight sleeper jobs. Because the MaxPendingJobsSLO ignores array tasks, I needed to submit a bunch of individual jobs, but I didn't want to have to click the submit button multiple times in the demo.

Filming the video turned out to be an interesting challenge unto itself. I did the screencap using Snapz Pro on the Mac. It has no problem with JavaFX Script or with VMware VMs, but it apparently can't film JavaFX Script running inside a VMware VM. I ended up having to twiddle the UI a bit so that I could run it directly on the Mac. That's why in the demo, when I switch from Sun Studio to the UI, I swap Mac desktops instead of Solaris workspaces. The voice over and zooming effects are courtesy of Final Cut, by the way.

Friday Dec 19, 2008

Announcing Grid Engine 6.2 Update 1

Grid Engine 6.2 Update 1 is now ready for download.

Thursday Dec 18, 2008

The Perfect Holiday Gift

Wondering what to get for that special someone who has everything? How about a sneak peek at soon-to-be-released Sun Grid Engine 6.2 update 2? That's right! Nothing says 'I love you,' like the SGE 6.2u2 Beta, and it's available just in time for the holidays. It makes a great stocking stuffer, and it's fun for the whole family. Download the SGE6.2u2 Beta today!

Monday Nov 03, 2008

SGE Blog Planet

There's a new Grid Engine blog aggregator on planets.sun.com. The idea is to capture all of the relevant Grid Engine blogs in a single place for easy access. It's similar to the aggregator on the OpenSolaris HPC Community site, except that the HPC one also contains general HPC blogs and blogs on other Sun HPC products as well. If you have suggestions for a blog that should be included in either, let me know.

Wednesday Oct 01, 2008

Bonsai!

I recently rediscovered a hidden qconf option. I remember talking with the engineer when he implemented the option years ago, but because it was never documented, I forgot that it existed. A recent customer eval reminded me that it's there, and I think it's one worth sharing.

The hidden option is qconf -bonsai. It is a human-readable equivalent of qconf -sstree, which if you've looked at you'll know isn't even remotely human-readable. It prints the current share tree configuration using spacing to represent hierarchy.

Let's look at an example. This is the output from qconf -sstree for my home test cluster:

# qconf -sstree
id=0
name=Root
type=0
shares=1
childnodes=1
id=1
name=sge
type=0
shares=1000
childnodes=2,3
id=2
name=root
type=0
shares=400
childnodes=NONE
id=3
name=default
type=0
shares=200
childnodes=NONE

This is the output from qconf -bonsai for the same cluster:

# qconf -bonsai
Root=1
   sge=1000
      root=400
      default=200

Now, as for why it's an undocumented feature, I suspect it's historical. It was originally added on a whim by one of the engineers and was just never fully embraced. I remember there being talk about changing the name of the switch and making it a documented feature, but I suspect that plan just got lost in the shuffle.

About

templedf

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today