Wednesday Nov 28, 2007

DRMAA For Platform's LSF

In case you didn't notice, the Distributed Resource Management Application API (DRMAA) has become one of the first two official recommendations from the Open Grid Forum. Along with this long-awaited official recommendation status came an exciting surprise: Platform now has a DRMAA implementation for their LSF product! The implementation comes courtesy of the FedStage Developer Network. FedStage has only produced a C binding for LSF so far (and I'm told that's all that's planned), but since the Grid Engine Java™ language binding is built on top of the C binding, it's a small step to get the Java language binding working for LSF.

To make the good news even better, not only does Platform officially endorse the implementation (I'm told they funded it), but FedStage has released it under an Apache 2.0 open source license!

With the sudden rise in customer and ISV interest and the addition of LSF to the family, I have a feeling that DRMAA has finally reached critical mass. To find out more, check out the DRMAA 1.0 IDL binding specification and the Grid Engine C and Java language binding tutorials.

Monday Oct 01, 2007

And There Was Much Rejoicing!

Whoohoo! The Open Grid Forum has finally accepted the Distributed Resource Management Application API (otherwise known as DRMAA, pronounced like "drama") as a formal recommendation! Along with the GridRPC standard, we are the first recommendations to come from the OGF.

In case you weren't aware, Grid Engine comes with bindings for DRMAA in both C and the Java language. In addition, you can download bindings for Perl, Python, and Ruby. If you want to know more about how DRMAA works, dig around in the Grid category of my blog. I have a bunch of DRMAA articles.

Friday Aug 24, 2007

Specifying a Username With DRMAA

DRMAA is a standard API for submitting, monitoring, and controlling jobs with a DRM (Distributed Resource Manager). Grid Engine includes DRMAA bindings for C and for the Java™ platform. One of the first ideas most people come up with when looking at DRMAA is to build a daemon that will extend the reach of DRMAA, such as a portal or a web services interface. Traditionally, creating such a daemon in DRMAA had two problems. The first was that jobs are bound to the DRMAA session during which they were submitted. If you lose the session, such as by crashing and/or restarting, you lose contact with your jobs. The second problem was that DRMAA submits jobs as the user running the application, which in the case of a daemon was usually root or some neutral 3rd party, like sgeadmin.

The first problem has since been addressed. While teaching the advanced Grid Engine admin class, a solution to the second problem occurred to me, and it doesn't require modifications to Grid Engine. Here's how it works...

First, configure a queue where your daemon will submit all jobs. (It could be more than one queue, but I'm going to continue in the singular.) The important thing is that only jobs from the daemon are allowed to run in the queue. The easiest way to do that is to create a new forced boolean complex, and assign it to the queue. Then have your daemon add a request for that resource to all job submissions. (There are a variety of ways to do this, including adding the resource request to the daemon user's sge_request file.)
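As a rough sketch of that setup, the fragment below shows what the complex entry, the queue attribute, and the per-job request might look like. The names drmaa_only, dro, and daemon.q are placeholders I made up for illustration, not anything Grid Engine defines.

```shell
# 1. Add a forced boolean complex with `qconf -mc`, which opens the complex
#    list in an editor. The new line would look something like this
#    (columns: name shortcut type relop requestable consumable default urgency):
#
#    drmaa_only   dro   BOOL   ==   FORCED   NO   0   0

# 2. Attach the complex to the daemon's queue (hypothetical queue name).
qconf -aattr queue complex_values drmaa_only=TRUE daemon.q

# 3. Have the daemon request the resource on every submission, for example
#    by adding this line to the daemon user's sge_request file:
#
#    -l drmaa_only=TRUE
```

Because the complex is forced, jobs that do not request it should never be scheduled into the queue.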

The reason why you want to isolate the daemon's queue is that you're going to change its starter method. Create a script or program that reads a user name from an environment variable, such as SGE_DRMAA_USERNAME, changes users to that user, and then executes the job script (passed as the first argument to the starter script). An example script might look like:


if [ "$SGE_DRMAA_USERNAME" = "" ]; then
   exit 100
fi
exec su "$SGE_DRMAA_USERNAME" -c "$*"   # run the job script as that user

An example C program for Solaris might look like:

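#include <pwd.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

int main(int argc, char **argv) {
   char *username = getenv("SGE_DRMAA_USERNAME");

   if (username == NULL || argc < 2) {
      return 100;
   } else {
      struct passwd *pw = getpwnam(username);

      if (pw == NULL) {
         return 100;
      } else {
         /* Drop the group first, then the user, then run the job script. */
         if (setgid(pw->pw_gid) != 0 || setuid(pw->pw_uid) != 0) {
            return 100;
         }

         execv(argv[1], &argv[1]);
      }
   }

   return 100; /* Only reached if execv() fails. */
}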

Next, configure the queue's starter method to be your script/program. See the queue_conf(5) man page for details about the queue configuration. Because Grid Engine will run the starter script as the user who submitted the job, if your daemon is running as root, the starter script will be run as root, giving it permission to change the user id. If running the daemon as root is a problem in your environment, you can get around it using Solaris role-based access control (RBAC) or similar mechanisms on other operating systems to assign setuid permission to the user as whom the daemon is running.
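For illustration, setting the starter method from the command line might look something like this. The script path and the queue name daemon.q are placeholders, so substitute your own.

```shell
# Point the queue at the starter script (hypothetical path and queue name).
qconf -mattr queue starter_method /opt/sge/local/drmaa_starter.sh daemon.q

# Confirm that the queue picked up the change.
qconf -sq daemon.q | grep starter_method
```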

Now, whenever your daemon accepts a job submission, it should attach the environment variable, with the name of the user as the value, to the job's environment. See the job environment DRMAA attribute (drmaa_v_env in the drmaa_attributes(3) man page for C and the jobEnvironment property of the JobTemplate class for the Java binding) for details.

This arrangement does represent a security hole. The starter script blindly believes whatever the environment variable says. A malicious user could set the environment variable to root for his job and then submit it to your queue. Bad news. To prevent this security hole, create a public/private key pair for your daemon. Instead of putting the cleartext username in the environment variable, encrypt it first with the private key. The starter script must then use the daemon's public key to decrypt the username.
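Here is a minimal sketch of that exchange using the openssl command-line tool, assuming an RSA key pair. The file names and the username alice are placeholders. Signing with pkeyutl -sign and recovering with -verifyrecover is one way to get the "encrypt with the private key, decrypt with the public key" behavior; the base64 step is only there so the result fits in an environment variable.

```shell
#!/bin/sh
# Scratch directory standing in for wherever the keys would really live.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# One-time setup: generate the daemon's key pair (placeholder file names).
openssl genrsa -out "$workdir/daemon_key.pem" 2048 2>/dev/null
openssl rsa -in "$workdir/daemon_key.pem" -pubout \
   -out "$workdir/daemon_pub.pem" 2>/dev/null

# Daemon side: sign the username with the private key and base64-encode it.
# The result is what would go into SGE_DRMAA_USERNAME.
username=alice
token=$(printf '%s' "$username" |
   openssl pkeyutl -sign -inkey "$workdir/daemon_key.pem" |
   openssl base64 -A)

# Starter side: decode the token and recover the username with the public key.
recovered=$(printf '%s' "$token" |
   openssl base64 -d -A |
   openssl pkeyutl -verifyrecover -pubin -inkey "$workdir/daemon_pub.pem")

echo "recovered username: $recovered"
```

In a real deployment the private key would be readable only by the daemon, and the starter method would refuse to run the job if the recovery step failed.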

But there's still a security hole. A malicious user could snoop the job submission, lift the encrypted username and reuse that encrypted username for his own jobs. Eliminating this security hole is a little trickier. One solution might be to also include an encrypted sequence number that gets incremented with every job, forcing the starter method to globally track which sequence numbers have already been used (because jobs may ultimately be scheduled in any order). To close the hole completely, you'd have to verify that the sequence number belongs to the job being run. With that in mind, the best approach might be to have the starter method contact the daemon and report the decrypted sequence number. The daemon would then respond with the associated job number. If the job numbers don't match, the job is a fake. To be completely secure, that communication should happen over SSL.
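The sequence-number tracking could be sketched roughly as follows. This covers only the "has this number been used before" check, not the callback to the daemon. The function name claim_sequence and the state directory are my own inventions, and in practice the directory would have to sit on storage shared by every execution host.

```shell
#!/bin/sh
# Hypothetical state directory; a temp dir here so the sketch runs anywhere.
seen_dir=$(mktemp -d)

# claim_sequence N: succeeds the first time N is presented, fails on reuse.
# mkdir is atomic, so two jobs presenting the same number cannot both win.
claim_sequence() {
   case "$1" in
      ''|*[!0-9]*) return 1 ;;   # reject anything that is not a plain number
   esac
   mkdir "$seen_dir/$1" 2>/dev/null
}

claim_sequence 42 && echo "sequence 42 accepted"
claim_sequence 42 || echo "sequence 42 rejected as a replay"
```

Using one directory entry per number means the order in which jobs are scheduled doesn't matter.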

I haven't actually tried this approach yet, but it's on my list of things to do. If anyone out there gives it a go, I'd be very interested to hear how it went.

Wednesday Aug 22, 2007

Distributed Resource Management Application API Specification 1.0 in Open Public Review

Good news! The Distributed Resource Management Application API Specification 1.0 is now in the public review phase. It shouldn't be too much longer now before the standard is officially blessed by the Open Grid Forum.

Friday Jul 20, 2007

DRMAA and the shared.library.path

I noticed that multiple folks have found my blog in the last month because they were trying to figure out why the Grid Engine implementation of the DRMAA Java language binding is complaining about libdrmaa not being in the shared library path. I've probably answered this question indirectly in a previous post, but for the benefit of all those searchers, here it is in all its glory.

The Grid Engine DRMAA Java language binding implementation is written as a wrapper around the DRMAA C binding implementation. When the classloader loads the com.sun.grid.drmaa.SessionImpl class (which will happen when you call org.ggf.drmaa.SessionFactory.getSession() for the first time), the SessionImpl class will attempt to load the DRMAA C binding's shared library, otherwise known as libdrmaa. In order to find libdrmaa, the Java virtual machine must know that it should look in the $SGE_ROOT/lib/$ARC directory. ($SGE_ROOT is where Grid Engine is installed. $ARC is the name of your host's architecture, which can be determined by running $SGE_ROOT/util/arch.)

There are two ways for the Java virtual machine to know to look in the Grid Engine lib directory. The first is by the lib directory being included in the parent shell's shared library path environment variable. On Solaris, that's $LD_LIBRARY_PATH. (Or $LD_LIBRARY_PATH_64.) On some other platforms, it's $LIBPATH or $SHLIB_PATH. With Grid Engine 6.0, when you source the settings file ($SGE_ROOT/$SGE_CELL/common/settings.[c]sh), your shared library path is automatically modified to include the Grid Engine lib directory. With 6.1 on platforms other than Solaris and Linux, that's also true. With 6.1 on Solaris and Linux, the settings file no longer sets the shared library path. Instead, the Grid Engine binaries are compiled in such a manner that they can determine from their own paths what the path to the lib directory is. If you're using 6.1 on Solaris or Linux, in addition to sourcing the settings file, you will also have to set the shared library path to include the Grid Engine lib directory or use the second method I talk about below. Note that this 6.1 shared library path change also affects DRMAA applications written in C. Unless a DRMAA application written in C expects to be installed in the Grid Engine root directory (and hence was compiled to know how to find the lib directory), it will require that the user explicitly set the shared library path, just like with DRMAA applications written for the Java platform. (The same thing also applies to the Perl, Python, and Ruby bindings.)
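For 6.1 on Solaris or Linux, setting the shared library path by hand might look something like the sketch below. The helper name prepend_lib_dir is made up for illustration, and the snippet assumes the settings file has already been sourced so that $SGE_ROOT is set. On platforms that use $LIBPATH or $SHLIB_PATH, the variable name would change accordingly.

```shell
# prepend_lib_dir DIR: put DIR at the front of LD_LIBRARY_PATH,
# keeping whatever entries were already there.
prepend_lib_dir() {
   LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
   export LD_LIBRARY_PATH
}

# Only touch the path when Grid Engine's arch script is actually available.
if [ -n "${SGE_ROOT:-}" ] && [ -x "$SGE_ROOT/util/arch" ]; then
   ARC=$("$SGE_ROOT/util/arch")
   prepend_lib_dir "$SGE_ROOT/lib/$ARC"
fi
```

Prepending rather than overwriting keeps other applications' library directories in place.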

The other way to tell the Java virtual machine how to find the Grid Engine lib directory is to pass in the information via the java.library.path system property. To use this method, add the following to the options you pass to the Java virtual machine: -Djava.library.path=$SGE_ROOT/lib/$ARC, where $SGE_ROOT and $ARC are as defined above. This method is probably the simpler and less invasive, but it must be applied every time the Java virtual machine is launched. If you use the shared library path method, you set it once, and it applies to all Java virtual machines launched from that shell. The downside, of course, is that setting the shared library path for DRMAA may adversely affect other applications with their own expectations for what should be in the shared library path.

While we're talking about issues caused by libdrmaa in DRMAA applications written for the Java platform, we should also talk about 32-bit versus 64-bit. The Java virtual machine has a restriction that it can only load libraries that are compiled for the same architecture as it was. If you're using a 32-bit Java virtual machine, it can only load 32-bit libraries. A 64-bit virtual machine can only load 64-bit libraries. The problem is that by default, the Grid Engine binaries that folks download for Solaris are 64-bit, while the Java virtual machine that runs by default is 32-bit. Again, there are two solutions to this problem. The better solution is to download and install the 32-bit Solaris binaries for Grid Engine. Again, this works with Grid Engine 6.0. It also works for 6.1 on AMD64; just download the x86 binaries. If you're using 6.1 on SPARC, though, you're saved from the trouble because the 32-bit libdrmaa is included with the 64-bit binaries.

The other option is to download and install the 64-bit Java virtual machine and run your app with the -d64 switch. That works in all cases, but it means that your application will be running in 64-bit mode, which is slower and has a bigger memory footprint than running in 32-bit mode.

These native problems are rather annoying. The better option would be to have a DRMAA Java language binding written in pure Java. We're talking about it, but don't expect one any time soon.

Monday Apr 02, 2007

Comment Comments

We're trying hard to finally get the DRMAA 1.0 specification accepted as an Open Grid Forum recommendation. The acceptance process requires that we produce experience documents detailing implementations of the DRMAA 1.0 specification. We have written three such documents, and the steering group has finally posted them for public comment. If you're interested in DRMAA and/or the Grid Engine, Condor, and/or GridWay implementations, please read through the documents and provide comments. The public comment period will end on April 20th.

Tuesday Jan 23, 2007

Last Chance For Comments

I just posted what will hopefully be the last release candidate for the DRMAA Java Language Binding Specification 1.0. If you want to review it and make comments, time is running out. I've already finished and checked in the Grid Engine implementation of the new spec for the Grid Engine 6.1 release.



