Running Jobs on Sun Grid that require “Service Containers”

Sun Grid's resource management semantics dictate that jobs be self-contained and terminate all of their processes in order to exit. In a grid context, terminating processes is not as simple as doing a PID trap on a single host; instead, you need to use the qsub, qstat and qdel commands to manage your distributed jobs.

The example pattern that I'd like to elaborate on is that of a “server/framework” which needs to be running in order to support a client. Whether it is a simple RMID or a more complex instance of a web server, app server or JavaSpace, the pattern is very similar. The developer wants to:

  1. Start up one or more servers (in our case 2, the httpd and the GigaSpaces Enterprise Server)
  2. Make sure that the servers are running
  3. Submit the client and wait for the client to complete
  4. Shut down the servers so that the Sun Grid job can terminate and stop the meter
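
The four steps above can be sketched as a small shell skeleton. This is only an outline under stated assumptions: the function names and the job IDs 101 and 102 are hypothetical placeholders, and the echo lines stand in for the real qsub, qstat and qdel calls. The one design point it shows is that step 4 runs whether or not the client succeeded, while the client's exit status is still passed back to the caller.

```shell
#!/bin/bash
# Skeleton of the four-step pattern. All function names and the job IDs
# are hypothetical; echo stands in for the real grid commands.

start_servers()    { echo "step 1: submit server jobs"; SERVER_JOBS="101 102"; }
wait_for_servers() { echo "step 2: poll until jobs $SERVER_JOBS are running"; }
run_client()       { echo "step 3: run client with args: $*"; }
stop_servers()     { echo "step 4: delete jobs $SERVER_JOBS"; }

run_with_servers() {
    start_servers
    wait_for_servers
    run_client "$@"
    local rc=$?      # remember how the client exited
    stop_servers     # always runs, so the meter stops even if the client failed
    return $rc
}

run_with_servers 42
```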

First some basic syntax:

  • #$ = new directives for SGE which do things like populate environment variables (-V)
  • qsub = submit this task to the grid for scheduling; we use a couple of options
  • “-sync n” fire and forget: qsub returns as soon as the job is submitted instead of waiting for it to complete
  • “-N <jobname>” not required, but could be used for parsing qstat; unfortunately qdel requires a job ID instead of a job name (to keep you from shutting down similarly named jobs)
  • “-t 1” or “-t 1-4:1” submit a job to one or multiple nodes with a minimum
  • qstat = get the status of the SGE queue, which in the case of Sun Grid will only return the jobs that you own, for privacy purposes
  • “-s r” only return the “running” jobs; jobs that are waiting (status “qw”) are excluded
  • qdel = delete / stop the specified jobs
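
Because qdel wants a job ID, the first thing a script has to do after qsub is pull that ID out of the submission message. The sketch below assumes the usual Grid Engine wording of that message (shown in the comments) and uses a hypothetical helper name, extract_job_id. It runs against sample strings, so it can be tried without access to a grid.

```shell
#!/bin/bash
# Pull the numeric job ID out of the line qsub prints on submission.
# Assumed message formats:
#   Your job 1234 ("gsee-gsm") has been submitted
#   Your job-array 1235.1-4:1 ("gsee-gsc") has been submitted
extract_job_id() {   # hypothetical helper name
    printf '%s\n' "$1" |
        sed -n 's/^Your job\(-array\)\{0,1\} \([0-9][0-9]*\).*/\2/p'
}

extract_job_id 'Your job 1234 ("gsee-gsm") has been submitted'
extract_job_id 'Your job-array 1235.1-4:1 ("gsee-gsc") has been submitted'
```

Anchoring the pattern on the literal text "Your job" means an unexpected message, such as a submission error, yields an empty result that the caller can test for. Later Grid Engine releases also have a "qsub -terse" option that prints only the ID; whether it is available on a given Sun Grid installation is something to check rather than assume.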

Now on to the listing:

#!/bin/bash
#$ -V

# if we are running against an older version of SGE, the "#$ -V" directive
# will not exist, so be sure that we set SGETOOLS ourselves (or at least try to)
if [[ ${SGETOOLS:-"unset"} = "unset" ]]
then
    echo setting SGETOOLS
    SGETOOLS=/home/sgeadmin/N1GE/bin/sol-amd64
    export SGETOOLS
    PATH=$SGETOOLS:$PATH
fi

echo "Starting the GigaSpaces Servers"
GSEE_HOME=GigaSpacesEE5.0
GRID_HOME=$GSEE_HOME/ServiceGrid
GSC=`qsub -sync n -N gsee-gsc -v GSEE_HOME=$GSEE_HOME -v GRID_HOME=$GRID_HOME -t 1-4:1 $GRID_HOME/bin/gsc`
GSM=`qsub -sync n -N gsee-gsm -v GSEE_HOME=$GSEE_HOME -v GRID_HOME=$GRID_HOME -t 1 $GRID_HOME/bin/gsm $GRID_HOME/config/overrides/gsm-override.xml`
echo "${GSC}"
echo "${GSM}"

# SGE job return syntax is XXXX.X-X:X, i.e. $JobID.$min-$max:$step,
# so trim out just the first XXXX, which is a regex match from the 3rd field
MATCH="\(.*\) \(.*\) \([0-9]*\)\.\([0-9]*\)-\([0-9]*\):\([0-9]*\)" # simple match for multi-node job
MATCH2="\(.*\) \(.*\) \([0-9]*\) \(.*\)" # simple match for simple 1 node job

# the array assignment splits the sed output into words, so that a plain
# $GSCparsed (element 0) holds just the job ID
GSCparsed=( `echo $GSC | sed -n -e "s/${MATCH}/\3/p"` )
if [[ ${GSCparsed:-"unset"} = "unset" ]]; then
    GSCparsed=( `echo $GSC | sed -n -e "s/${MATCH2}/\3/p"` )
fi

GSMparsed=( `echo $GSM | sed -n -e "s/${MATCH}/\3/p"` )
if [[ ${GSMparsed:-"unset"} = "unset" ]]; then
    GSMparsed=( `echo $GSM | sed -n -e "s/${MATCH2}/\3/p"` )
fi
echo "Jobs $GSCparsed and $GSMparsed submitted"

# wait for these jobs to show up in qstat
GSMstatus=0
GSCstatus=0
until [[ "$GSMstatus" -gt 0 && "$GSCstatus" -gt 0 ]]
do
    # evaluate the qstat -s r response (running jobs) to make sure that the
    # requisite jobs are running; "var1+0" prints 0 instead of an empty line
    # when nothing matched
    GSCstatus=$(qstat -s r | nawk '/'${GSCparsed}'/{var1+=1} END {print var1+0}')
    GSMstatus=$(qstat -s r | nawk '/'${GSMparsed}'/{var1+=1} END {print var1+0}')
    echo "GSCstatus = $GSCstatus"
    echo "GSMstatus = $GSMstatus"
    echo Server status is $(qstat -s r)
    sleep 10
done

# run our application - in this case, use multiple nodes to help us calculate prime factors
echo "crunching"
~/prime-crunch.sh "$1"
echo "done"
# clean up: delete the server jobs using the job IDs parsed out of GSM and GSC
echo $(qdel $GSMparsed $GSCparsed)
# go ahead and print out the queue status on the way out to verify cleanup (optional)
sleep 10
echo "Leaving..."
echo $(qstat)
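
One thing to watch in the listing is that its polling loop waits forever if a server job never reaches the running state, and its awk pattern matches the job ID anywhere on a qstat line. A minimal sketch of a bounded variant follows. It assumes that the job ID is the first column of qstat output, and the names count_job_lines, wait_for_job, QSTAT_CMD and POLL_INTERVAL are hypothetical additions for this sketch, not part of SGE. On Solaris, substitute nawk for awk, as the listing does.

```shell
#!/bin/bash
# Count qstat-style lines (read from stdin) whose first column is exactly
# the given job ID. Prints 0 when nothing matches.
count_job_lines() {
    awk -v id="$1" '$1 == id { n++ } END { print n + 0 }'
}

# Poll until the job appears among the running jobs, or give up after
# max_tries polls. QSTAT_CMD and POLL_INTERVAL are hypothetical knobs
# that default to "qstat -s r" and 10 seconds.
wait_for_job() {
    local id=$1 max_tries=${2:-30} try
    for (( try = 1; try <= max_tries; try++ )); do
        if (( $(${QSTAT_CMD:-qstat -s r} | count_job_lines "$id") > 0 )); then
            return 0
        fi
        sleep "${POLL_INTERVAL:-10}"
    done
    return 1
}

# Try the counter on a made-up qstat line (no grid needed).
printf '%s\n' '1234 0.55 gsee-gsm user r' | count_job_lines 1234
```

In the listing, the until loop could then become something like "wait_for_job $GSMparsed && wait_for_job $GSCparsed", with a qdel of both jobs on failure so that the meter still stops.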

Hopefully, this example sheds some light on the mechanisms that a developer might use to launch more complex, server-dependent applications against the Sun Grid. Please let me know if I need to elaborate further. I want to take this opportunity to recognize GigaSpaces, and specifically Dennis Reedy, for his help in putting together a grid job which could flex a couple of nodes against their GigaSpaces Enterprise Server 5.0 environment. I'd also like to thank Bill Meine and Fay Salwen for their scripting assistance.

About

dhushon
