Performance Considerations For JSV Scripts
By templedf on Feb 05, 2009
One of the new features coming in Grid Engine 6.2u2 is job submission verification (JSV). The basic idea is that on both the client side and the server side, you have the ability to add scripts that can read through all the job submission options and accept, reject, or modify the job. JSV will open up a whole new world of possibilities that didn't exist before, and it will largely end the need for qsub wrapper scripts.
Because the server-side JSV scripts are executed by the qmaster for every job, there are performance considerations that must be taken into account. In order to limit the performance impact, the qmaster will manage the JSV scripts the same way load sensor scripts are handled, i.e. they are started once and kept alive as a separate process instead of starting them once per job. Nonetheless, what happens inside the scripts can still have a big impact on qmaster performance.
In a test, Roland (who still isn't blogging!) set up some DRMAA submission clients to hammer the master with job submissions. With no server-side JSV scripts, the clients were able to do 900 job submissions per second\*. With a simple server-side Perl JSV script to change the job name, the clients were only able to submit 700 jobs per second. A similar JSV script written in Tcl yielded the same results. With a similar JSV script written in Borne shell, however, the clients were only able to submit 3 jobs per second. No, that isn't a typo. While languages like Perl and Tcl are able to process numbers and strings natively, Borne shell has to rely on forking off other commands. Those forks are expensive and, even in a simple JSV script, yield major performance penalties. For these reasons, I actually recommend the Java™ language for writing server-side JSV scripts. Not only do you get access to the all the great built-in and external libraries, but you also get access to JGDI, letting you talk to the qmaster without forking an SGE command-line tool. (Thanks to Jython, JRuby, Rhino, et al, you can get the same benefits from languages other than just the Java language.)
Let me repeat that point to make sure it comes through loud and clear. If you use a shell script as a server-side JSV script, you will trash your cluster's job submission rate. That's not just for DRMAA jobs or for certain users. That's for the entire cluster.
On the client side, the story is a bit simpler but still similar. For every job submission, each client-side JSV will be started. (An array job counts as a single job in this regard.) That makes sense because qsub is started once for every job submission, and the JSV scripts can't outlive the qsub that launched them.
For DRMAA the implications are a little different. The JSV scripts are still started for every job submission, even though the DRMAA client remains running between submissions. (A DRMAA bulk job is an array job and hence still counts as a single job in this respect.) Roland used DRMAA clients in his test because they're very fast at job submissions. Using client-side JSV scripts affects that in much the same way as on the server side. And as with the server-side scripts, shell scripts have more of an effect than scripts written in a higher-level language. If you figure there's about 200ms overhead for every fork & exec, you could easily add several seconds to each job submission. A DRMAA client without client-side JSV scripts can easily submit over 100 jobs per second. With even a single client-side JSV script that runs no further commands, your submission rate drops to less than 5 jobs per second. Use with caution!
The JSV feature in 6.2u2 is extremely powerful, but as I've explained, you have to use it with care. When used with appropriate caution, however, JSV provides a fairly easy answer to some of the traditionally thorny issues for SGE administrators.
\* Roland achieved that submission rate with a Sun x4100M2 with two dual core AMD 2.8GHz processors running Solaris 10 as the master.