Grid Engine @ FOSS.in

After I started on Grid Engine, it was really tempting to show someone what I knew. Then came FOSS.in, the Free and Open Source Software conference. We tried to get some really good apps that had been gridized. Though we had a tough time getting things together for the event, the outcome was great.
The crowd was great, ranging from novices to people who had implemented internal grids, and we had the chance to explain Grid Engine in detail.

Grid Engine at the Sun Stall
We had set up three V20z's and a Metropolis as a four-node mini grid, totalling 7 CPUs (2+1+2+2). But at the stall one of the V20z's (2 CPUs) wouldn't run, and we also had to share a 2-CPU V20z with another demo. We set up a zone on one of the remaining V20z's and still managed to get 4 nodes (5 CPUs).
We shared the main cell directory and the binaries over NFS across the machines, and also set up an MPI (mpi-1.2 LINK) parallel environment (PE). It took us some time to get everything configured, but soon we had it up and running.
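For reference, the PE definition looked roughly like the sketch below. This is a minimal sketch, assuming the stock MPI start/stop scripts shipped under the install directory ($SGE_ROOT here stands in for that path); the actual values we used at the stall may have differed.

    $ qconf -sp mpi
    pe_name            mpi
    slots              6
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    $SGE_ROOT/mpi/startmpi.sh $pe_hostfile
    stop_proc_args     $SGE_ROOT/mpi/stopmpi.sh
    allocation_rule    $round_robin
    control_slaves     FALSE
    job_is_first_task  TRUE

    $ qconf -aattr queue pe_list mpi all.q    # attach the PE to the cluster queue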
The apps were:
  • a biotech app, fastDNA, needing 6 slots to run
  • a movie-rendering app (using POV-Ray), also requiring 6 slots
  • some batch and array jobs from the shipped examples (a rough submission sketch follows this list).
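Submitting these looked something like the following; the wrapper script names are placeholders rather than the actual demo scripts, while sleeper.sh is the stock example job shipped with Grid Engine.

    $ qsub -pe mpi 6 -N fastDNA run_fastdna.sh         # biotech app, 6 slots in the mpi PE
    $ qsub -pe mpi 6 -N render  render_povray.sh       # movie rendering, also 6 slots
    $ qsub $SGE_ROOT/examples/jobs/sleeper.sh 60       # plain batch job from the examples
    $ qsub -t 1-10 $SGE_ROOT/examples/jobs/sleeper.sh  # a 10-task array job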
We had created a web-based interface with four graphs showing utilization (built with rrdtool) in one frame, two more frames showing jobs in the pending and running states, one frame with the cluster queue configuration, and a frame with the scheduler messages. Most of the stats came from qstat; the cluster queue thresholds and usage stats came from qconf and qhost. The best part was being able to dump the output with the -xml switch, so we could use XSL to generate the HTML pages. The qstat output was fed to a Perl script that wrote to an RRD database, and rrdtool drew the graphs from that database. The graph plotted the load_short, load_long and np_load_avg values, showing the figures Grid Engine itself uses to arrive at the load average.
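The whole monitoring pipeline amounted to something like the loop below. This is a rough sketch, not our actual scripts: the stylesheet name, file paths, host name and the qhost parsing are assumptions, and we used Perl rather than awk for the RRD feed.

    # one RRD holding the three load values per sample
    rrdtool create load.rrd --step 60 \
        DS:load_short:GAUGE:120:0:U \
        DS:load_long:GAUGE:120:0:U \
        DS:np_load_avg:GAUGE:120:0:U \
        RRA:AVERAGE:0.5:1:1440

    while true; do
        # queue and job state as XML, rendered to the HTML frames with XSL
        qstat -f -xml | xsltproc queues.xsl - > /var/www/grid/queues.html

        # pull the load values for one host and push them into the RRD
        VALS=$(qhost -h node1 -F load_short,load_long,np_load_avg |
               awk -F= '/load_short/{s=$2} /load_long/{l=$2} /np_load_avg/{n=$2} END{printf "%s:%s:%s", s, l, n}')
        rrdtool update load.rrd "N:$VALS"

        # redraw the utilization graph from the database
        rrdtool graph /var/www/grid/load.png \
            DEF:np=load.rrd:np_load_avg:AVERAGE LINE2:np#ff0000:np_load_avg \
            DEF:ls=load.rrd:load_short:AVERAGE  LINE1:ls#0000ff:load_short
        sleep 60
    done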

Demo
We started by running fastDNA. Since the non-global zone and the global zone live on a single host, it was easy to see that the utilization figures were the same for those two nodes. As the fastDNA jobs ran, one could watch how they got distributed across the queue instances. The load levels started shooting up, and one could see the scheduler messages
  • for dropping a queue because it was full, and
  • for when load_avg exceeded the threshold.
(Using the scheduler's job-status reporting to track why jobs were stuck would have been more helpful.)
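One way to surface that information, assuming a stock SGE 6.x setup, is the scheduler's schedd_job_info parameter: with it turned on, qstat -j reports why a pending job isn't being dispatched. A quick sketch (the job id is a placeholder):

    $ qconf -ssconf | grep schedd_job_info    # check the current setting
    $ qconf -msconf                           # set "schedd_job_info true" in the editor
    $ qstat -j 42                             # the "scheduling info:" lines then list
                                              # full queues, exceeded load thresholds, etc.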
The job finally finished after about an hour and a half on this setup. We then scheduled Helloworld (the movie-rendering app), which finished in less than 40 minutes. The takeaways from the demo were:
  • The range of scheduling policies in Grid Engine
  • Granular resource control (we showed subordinate queues, load sensors and host complexes; see the sketch after this list)
  • Job suspension using the suspension threshold (job migration was something someone asked about; unfortunately the apps didn't have checkpoint support)
  • The range of operating systems it runs on (the Windows host surprised them, and we also ran a few example jobs on BrandZ!)
  • The meta-scheduler* feature for coordinating two grid engines, and the ability to interact with Globus
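For the resource-control and suspension points above, the relevant knobs are plain cluster queue attributes. A hedged example follows; the values and the subordinate queue name low.q are illustrative, not our actual demo configuration.

    $ qconf -sq all.q | egrep 'load_thresholds|suspend_thresholds|subordinate_list'
    load_thresholds       np_load_avg=1.75
    suspend_thresholds    np_load_avg=2.50
    subordinate_list      low.q

load_thresholds stops new jobs from being dispatched to an overloaded queue instance, suspend_thresholds suspends running jobs once the load climbs past it, and a queue named in subordinate_list gets suspended whenever the superior queue fills up.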
Ravee
*Correct me if this does not qualify as a meta-scheduler feature