PE Tight Integration
By templedf on Apr 04, 2007
While the topic of parallel environment integration with Grid Engine is still fresh, there's one more question I'd like to cover: what is a tight integration, and how is it different from a loose integration?
Let's start with how a parallel job is started.
Step 1, the scheduler sends the qmaster a set of orders, saying where to put the master task and where to put the slave tasks. The master task is the one that runs the job script. (I say script because in the vast majority of cases, a parallel job will be a script. It is, however, theoretically possible for it to be a binary.)
Step 2, the qmaster sends the master task to its destination execution daemon, just like with a non-parallel job, but it also reserves the job slots on the destination execution daemons for the slave tasks. Notice that I said "reserves slots," not "starts." The qmaster does not actually start any of the slave tasks. See steps 3.3 and 3.4.
Step 3, the execution daemon starts the parallel job on the master node.
Step 3.1, the execution daemon on the master node runs the parallel environment startup script. This script prepares the parallel environment for running the master task. Among other things, this script creates a file that lists the job slots to be used for the slave tasks.
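To make step 3.1 concrete, here's a minimal sketch of the slot-file part of a PE startup script, modeled loosely on the startmpi.sh that ships with Grid Engine's MPI template. In a real startup script, PE_HOSTFILE and TMPDIR are set by the execution daemon; here they're given demonstration values (and a fake hostfile) so the sketch runs on its own:

```shell
#!/bin/sh
# Sketch of the slot-file step in a PE startup script (start_proc_args).
# Grid Engine points PE_HOSTFILE at a file with one line per host:
#   "hostname slots queue processor-range"
# For demonstration we fake that file; the execd normally provides it.
PE_HOSTFILE=/tmp/pe_hostfile.demo
TMPDIR=${TMPDIR:-/tmp}
cat > "$PE_HOSTFILE" <<'EOF'
nodeA 2 all.q@nodeA UNDEFINED
nodeB 1 all.q@nodeB UNDEFINED
EOF

# Expand the hostfile into an MPICH-style machines file: one line per slot.
machines="$TMPDIR/machines.demo"
while read host slots queue rest; do
    i=0
    while [ "$i" -lt "$slots" ]; do
        echo "$host"
        i=$((i + 1))
    done
done < "$PE_HOSTFILE" > "$machines"

cat "$machines"
```

The resulting machines file lists nodeA twice and nodeB once, one line per reserved slot, which is the format an MPICH-style mpirun expects.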
Step 3.2, the execution daemon runs the job script as the master task.
Step 3.3, the master task starts the parallel environment for the job. This step is different from step 3.1. Step 3.1 prepares the parallel environment, but it doesn't necessarily start any processes. Step 3.3 is where the parallel environment is actually run, such as running mpirun for an MPI integration.
Step 3.4, the parallel environment connects to the slave nodes and starts the slave tasks.
Step 4, after the job finishes, the execution daemon on the master node runs the parallel environment shutdown script.
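Seen from the job's side, steps 3.2 through 3.4 often amount to nothing more than a short script. This is an illustrative template, not runnable outside a cluster; the PE name mpi, the slot count, the machines-file path, and the application name are all placeholders:

```shell
#!/bin/sh
# Submitted with: qsub -pe mpi 4 myjob.sh
# Step 3.2: Grid Engine runs this script as the master task.
# Step 3.3: the master task starts the parallel environment. $NSLOTS and
# $TMPDIR are set by the execution daemon; the machines file was written
# by the PE startup script in step 3.1.
mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./my_parallel_app
# Step 3.4: mpirun, in turn, contacts the slave nodes and starts the
# slave tasks.
```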
The above process applies to both loosely and tightly integrated parallel environments. The difference between loose and tight integration is how the slave tasks get started. In a loose integration, the parallel environment uses some out-of-band method to connect to the slave nodes and start the slave tasks. This method gives the parallel environment a great deal of freedom in how it starts the slave tasks, but it means that the slave tasks run outside of the scope of Grid Engine. Because of that, the qmaster has no way to track the resource usage of slave tasks in loosely integrated parallel environments. Only the resource usage of the master task can be tracked.
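You can tell which mode a parallel environment uses by looking at its configuration, e.g. with qconf -sp. In a sketch like the following (the PE name and script paths are illustrative), control_slaves FALSE is what makes the integration loose: Grid Engine starts only the master task, and the parallel environment is on its own for the slaves.

```
pe_name           mpi
slots             4
user_lists        NONE
xuser_lists       NONE
start_proc_args   /sge/mpi/startmpi.sh $pe_hostfile
stop_proc_args    /sge/mpi/stopmpi.sh
allocation_rule   $round_robin
control_slaves    FALSE
job_is_first_task TRUE
```

Setting control_slaves to TRUE is what tells Grid Engine to expect the slave tasks to come back in through qrsh -inherit, i.e., a tight integration.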
In a tightly integrated parallel environment, the slave tasks are started through qrsh -inherit. The -inherit switch is a special qrsh switch that is used only with slave tasks in tightly integrated parallel environments. A job submitted this way actually bypasses the scheduler completely and is sent directly to the target execution daemon. As a security precaution, execution daemons deny such job submissions by default. In step 2, when the qmaster reserves the slave nodes for a parallel job in a tightly integrated parallel environment, it tells the execution daemons to expect the qrsh -inherit jobs and not to deny them. Because the slave tasks are run through Grid Engine, the qmaster is able to track the tasks' resource usage, the same as with any other kind of job.

A common trick to make the implementation of the integration easier is to provide an rsh wrapper that translates rsh calls into qrsh calls. That way, as long as the parallel environment naturally uses rsh to contact the slave nodes, the tight integration will work automatically.
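Here's a simplified sketch of that rsh-wrapper trick. The PE startup script puts the wrapper first in the job's PATH, so when the parallel environment runs "rsh <host> <cmd>" it really runs qrsh -inherit. The wrapper below is much cruder than the one Grid Engine ships with its MPI template, and a stand-in qrsh is created alongside it so the sketch can run outside a cluster:

```shell
#!/bin/sh
demo=$(mktemp -d)

# The wrapper itself: drop rsh flags that have no meaning under qrsh
# (a production wrapper handles more options), then hand off to qrsh.
cat > "$demo/rsh" <<'EOF'
#!/bin/sh
while [ $# -gt 0 ]; do
    case "$1" in
        -n) shift ;;      # rsh's "no stdin" flag
        -l) shift 2 ;;    # drop "-l user"; tasks run as the job owner anyway
        *)  break ;;
    esac
done
host="$1"; shift
# -inherit skips the scheduler and claims a slot already reserved on $host.
exec qrsh -inherit "$host" "$@"
EOF
chmod +x "$demo/rsh"

# Stand-in qrsh so the sketch runs without Grid Engine: just echo its args.
cat > "$demo/qrsh" <<'EOF'
#!/bin/sh
echo "qrsh $*"
EOF
chmod +x "$demo/qrsh"

# What the parallel environment thinks is a plain rsh call...
PATH="$demo:$PATH" rsh -n nodeB uname -a   # prints: qrsh -inherit nodeB uname -a
```

In a real tight integration the startup script arranges the PATH trick for you (the stock startmpi.sh does this when given its -catch_rsh option), so the parallel environment needs no changes at all.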