Sunday Aug 07, 2005

sge6u4 and mpich integration for solaris 10 x64

There is renewed interest among HPTC/grid users in running Solaris 10 x64 on Opteron-based systems. Besides the hardware and Solaris 10 x64, other tools that can help in setting up a grid environment:
  • Sun Studio 10 http://www.sun.com/software/products/studio/index.xml
  • N1GE6u4 or opensource sge6u4 http://gridengine.sunsource.net/
  • Sun JET(Jumpstart Enterprise Toolkit) http://www.sun.com/bigadmin/content/jet/
  • MPICH or LAM, via the Solaris 10 migration resources at http://apstc.sun.com.sg/
  • Ganglia http://ganglia.sourceforge.net/
I set up a POC system at a customer site for a small cluster.
After the SGE 6u4 installation, I set up the mpi and mpich PEs using the templates under the $SGE_ROOT/mpi directory.
Final key step: modify all.q and add mpi and mpich to its "pe_list". A minimal sketch of these steps follows.
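This sketch assumes the PE definitions follow the README in $SGE_ROOT/mpi (the exact template and script names there vary by release):

# define the two PEs (qconf -ap opens an editor; point start_proc_args
# at $SGE_ROOT/mpi/startmpi.sh and stop_proc_args at stopmpi.sh)
qconf -ap mpi
qconf -ap mpich
# append both PEs to all.q's existing pe_list
qconf -aattr queue pe_list "mpi mpich" all.q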

lustre and N1GE6 experience

We summarize our recent experience with the Lustre FS and N1GE6u4 on SLES9 SP1.
The main feature of the Lustre FS is that it is a multi-reader, multi-writer filesystem; it uses a metadata server and Object Storage Targets to serve the FS.
All Lustre client systems have a common view of the filesystem at /mnt/lustre.
We set SGE_ROOT to /mnt/lustre/sge, owned by sgeadmin (the default).
Since the Lustre FS is just an extension of the Linux FS, we used the standard installation commands:
  • on the master: inst_sge -m
  • on each execd host: inst_sge -x
  • on the shadow master: inst_sge -sm
The Lustre FS behaves like NFS here, so we get built-in HA of the master and shadow master on top of it. The sequence is sketched below.
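A sketch of that sequence with SGE_ROOT on the shared Lustre mount (run each inst_sge on the host named in the comment):

export SGE_ROOT=/mnt/lustre/sge
cd $SGE_ROOT
./inst_sge -m      # on the master host
./inst_sge -x      # on each execution host
./inst_sge -sm     # on the shadow master host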

Monday Jun 13, 2005

rocks and v20z

I just installed the frontend of the Rocks cluster software version 4 on a V20z, and it seems to work.

Since I only have one node, I could not try out the compute-node install.

This version still contains SGE 5.3.

It supports any RHEL4 rebuild, e.g. CentOS.

Since this initial posting, Rocks 4 has added an SGE 6u4 roll and a new, improved roll from Scalable Systems.

Rocks 4 also works with RHEL4u1, which has dual-core Opteron support.

To access the SP of a V40z, one might need the IPMItool roll, which gives in-band and out-of-band IPMI control; a hedged example follows.
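For instance, something like this queries the SP out of band over the LAN interface (the SP address, user, and password here are placeholders):

ipmitool -I lan -H sp-v40z -U admin -P admin chassis status
ipmitool -I lan -H sp-v40z -U admin -P admin sel list    # read the event log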

Saturday Jun 04, 2005

ARCo for SLES9SP1 and N1GE6u6

We jotted down some observations from installing ARCo with SLES9SP1 and N1GE6u6.

There are U6 patches, one for AMD64 and one common.

The main point is that a full installation of SLES9SP1 already includes postgresql-7.4.6-02:

  • All the pgsql commands are in /usr/bin
  • The lib is in /usr/lib/postgresql
  • The conf template is in /usr/share/postgresql
  • The username postgres is already created and the home directory is in /var/lib/pgsql
  • The remaining steps follow the installation guide and my ARCo-for-SPARC blog entries, but with one modification (a quick check of the preinstalled layout is sketched after this list):
    1. I installed SWC first; this also installs an updated version of the JDK.
    2. I then installed the dbwriter.
    3. I installed the reporting module last.
    4. Install 64-bit Java 1.5, otherwise /etc/init.d/sgedbwrite start will fail.
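Before starting, one can confirm the preinstalled PostgreSQL layout with something like this (package and path names per the list above):

rpm -q postgresql                 # expect postgresql-7.4.6-02
ls /usr/bin/psql /usr/bin/initdb  # commands live in /usr/bin
ls -d /usr/lib/postgresql /usr/share/postgresql /var/lib/pgsql
id postgres                       # user is already created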

Saturday Apr 09, 2005

integration of schrodinger with SGE

We present a POC of integrating Schrodinger with SGE.

I received good advice from help@schrodinger.com.

Schrodinger can use the MPI environment to run on multiple CPUs across multiple systems.

In a shared grid environment, one would like to submit jobs through a queueing system.

Schrodinger supports the NQS, PBS, and LSF queueing systems; we need to create a similar environment so users can use SGE.

Under the Schrodinger root directory there is a queues directory, which contains NQS, PBS, and LSF subdirectories.

We used cp -a NQS SGE to copy the NQS environment.

There are five files:

cancel, config, status.pl, submit and templates.sh

First, one needs to update the variables in config:

QPATH=/opt/gridengine/bin/lx24-amd64 <-- changed
QDEL=qdel <-- same as NQS
QSUB=qsub <-- same as NQS
QSTAT=qstat <-- same as NQS

templates.sh needed the most changes:

#!/bin/sh
#$ -N %NAME%
#$ -o %LOGDIR%/%JOBID%.qlog
#$ -j y
#$ -pe mpich %NPROC%

QPATH=/opt/gridengine/bin/lx24-amd64
curdir=`echo $0 | sed -e 's#/[^/]*$##'`   # strip the last path component
if [ -f "$curdir/config" ]; then
    . $curdir/config
fi
PATH=$QPATH:$PATH

SCHRODINGER_BATCHID=$JOB_ID             # from SGE
export SCHRODINGER_BATCHID
SCHRODINGER_NODEFILE=$TMPDIR/machines   # from SGE's mpich PE
export SCHRODINGER_NODEFILE

%ENVIRONMENTS%
%COMMAND%

The schrodinger_hosts file needs these updated entries (a usage sketch follows them):

name: localhost
schrodinger: /opt/schrodinger35
env: SCHRODINGER_RSH=ssh
env: SCHRODINGER_RCP=scp

name: testcluster
host: testcluster.local
hostname: testcluster.local
processors: 16
tmpdir: /state/partition1

name: sge
host: testcluster.local
hostname: testcluster.local
Queue: SGE
Qargs: ""
processors: 16
tmpdir: /state/partition1
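With the sge entry in place, a user can push a job through SGE with something like the following (the jaguar product and the -HOST host:nproc syntax are assumptions here; check your Schrodinger release documentation):

$SCHRODINGER/jaguar run -HOST sge:16 myjob.in
qstat -u $USER    # the job should appear in SGE under the %NAME% from templates.sh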

Thursday Apr 07, 2005

Integration of gaussian 03 with SGE

Recently, in a POC at a customer site, we used the Rocks cluster software.

One of the requirements for the POC was running Gaussian 03 under SGE.

In a Linux environment, g03 can use the Linda parallel environment to run on multiple processors across multiple nodes.

For simplicity we just used SGE's mpi PE.

The hardest part of this exercise was understanding how to run g03 and how g03 works with a queueing system.

After working with the customer to create an input file, learning how to run g03l, reading through web documentation and examples covering g03 with PBS and g03 with SGE, and getting some example g03 scripts (without SGE) from Barbara Perz, I created two scripts.

One is for the customer to change the input file and the number of nodes for the g03 job.

The other is the driver for the g03 input file.

Our s.csh is very simple.

It takes two arguments:

$1 = number of %NProcL processes + 1

$2 = input file

and it submits the driver along the lines of qsub -pe mpi $1 sge-g03l.csh $2 (a sketch of the wrapper follows).
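A minimal sketch of that wrapper, written here in Bourne shell (the script names and the slot-count convention are taken from the description above):

#!/bin/sh
# $1 = %NProcL + 1 (slots to request from the mpi PE), $2 = g03 input file
nslots=$1
input=$2
qsub -pe mpi $nslots sge-g03l.csh $input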

Under g03l one can specify the number of processes used by Linda by adding %NProcL to the input file; since Linda itself also uses one processor, we allocate %NProcL + 1 slots from SGE's mpi PE.

We use the user's input to rewrite the input file and then submit the job to g03l.

#!/bin/csh -f
#$ -cwd
#$ -j y
setenv g03root            # set to the g03 installation root
source $g03root/g03/bsd/g03.login

set scratch=$TMPDIR                   # SGE's per-job tmpdir
setenv GAUSS_SCRDIR $scratch
set nodefile=$TMPDIR/machines         # SGE's nodefile, assigned by -pe mpi
set ncpus=`expr $NSLOTS - 1`          # NSLOTS comes from -pe mpi
set input=$1                          # input file
set tmpinp=tmp$$                      # new input file

echo "%NProcL=$ncpus" > $tmpinp       # prepend the Linda worker count
cat $input >> $tmpinp
cp $tmpinp $tmpinp.com                # rename: g03 wants a .com suffix
setenv GAUSS_LFLAGS "-vv -mp 2 -nodefile $nodefile"   # assign 2 processors per node
g03l $tmpinp.com $input.$ncpus.log

Wednesday Jan 26, 2005

certificate renewal in SGE 5.3

A question came up about certificate renewal in SGE 5.3.
After some investigation, here are some observations:

  • One can use inst_sge -m -csp to install the master server so that it uses OpenSSL certificates.
  • By default the certificate is valid for 365 days.
  • The SGE installation guide does not document how to renew the certificate after one year.
  • After examining the inst_sge script, one can use the following method (also shown as a script after this list):
    1. make sure SGE_ROOT and SGE_CELL are set
    2. cd $SGE_ROOT/$SGE_CELL/common/sgeCA/certs
    3. $SGE_ROOT/util/sge_ca -extend cert.pem -days 730 -outdir ./
    4. this generates extended_cert.pem in the current directory
    5. shut down the master: $SGE_ROOT/$SGE_CELL/common/rcsge stop
    6. cp extended_cert.pem cert.pem
    7. restart the master: $SGE_ROOT/$SGE_CELL/common/rcsge start
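The same steps as a small script (a sketch; it assumes a default CSP install and the paths above):

#!/bin/sh
# Extend the SGE CSP certificate for another two years.
: ${SGE_ROOT:?must be set} ${SGE_CELL:?must be set}
cd $SGE_ROOT/$SGE_CELL/common/sgeCA/certs
$SGE_ROOT/util/sge_ca -extend cert.pem -days 730 -outdir ./
$SGE_ROOT/$SGE_CELL/common/rcsge stop
cp extended_cert.pem cert.pem
$SGE_ROOT/$SGE_CELL/common/rcsge start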

Monday Dec 06, 2004

n1ge6 installation in Fedora Core 2

I needed to run a workshop at a customer site that uses FC2, so I had to install N1GE6 on FC2.

The following is my experience.

I installed FC2 in the desktop configuration.

Since FC2 uses kernel 2.6, inst_sge -m fails and also complains that "strings" is not found.

I installed binutils-2.15.90.0.3-5.i386.rpm to get strings.

I created lx26-x86 links pointing to lx24-x86 under the bin, utilbin, and lib directories (see the sketch below).
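A sketch of those links (run from a standard $SGE_ROOT layout):

cd $SGE_ROOT
for d in bin utilbin lib; do
    ln -s lx24-x86 $d/lx26-x86
done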

Now the installation works, and it is actually running the lx24-x86 binaries.

For qmon to work I needed to install openmotif21-2.1.30-9.i386.rpm.

I also installed ypserv-2.12.1-2.i386.rpm on the master host, which will be the NIS master.

For the ARCo installation, I basically followed my SPARC installation blogs, with some differences:

  • I used the j2sdk-1_4_1_03-linux-i586.rpm that comes with the arco/swc packages; it becomes j2sdk1.4.1_03-fcs and is installed under /usr/java/j2sdk1.4.1_03
  • postgresql is part of FC2; it is installed under /usr/bin, with the home directory under /var/lib/pgsql and the postgres user already created
  • one may want to create a group named other for SWC
  • one may want to create a user, e.g. test1, for ARCo

Saturday Sep 18, 2004

Install N1Grid Engine 6 and compile postgresql on Sparc

To use the ARCo feature of N1GE6, one needs a database to store the data. At this point in time N1GE6 supports PostgreSQL and Oracle.

To use PostgreSQL on a SPARC system, one needs to download the source code and compile it.

Since most GNU-licensed software prefers gcc, one needs to install the Solaris 9 companion CD first.

N1GE6 has an update, u1, in the form of patches for SPARC:

  • 118082-01 provides the 64-bit tar.gz
  • 118092-01 provides the common tar.gz
  • 118093-01 provides the ARCo tar.gz
You need to install matching versions, e.g. all 6.0 or all 6.0u1; mixing is not supported.

The latest patch requires postgresql-7.4.2. Since I do not know much about PostgreSQL, we downloaded postgresql-7.4.2.tar.gz from http://www.postgresql.org/

compile postgresql

I ran gunzip -c postgresql-7.4.2.tar.gz | tar xvf -

cd postgresql-7.4.2

./configure

It fails.

Examining config.log shows complaints about libreadline.so.4 and about the version of bison.

Even though I set LD_LIBRARY_PATH to include /opt/sfw/lib, and libreadline.so.4 is there under /opt/sfw/lib, ld still could not find the library (which means I do not know much about how gcc works), so I copied libreadline.so.4 to /usr/lib.
(patrick@zill.net suggested I run crle -u -l /opt/sfw/lib so the runtime linker searches /opt/sfw/lib; actually, after I rebooted the machine all was well, and I did not need to copy the library to /usr/lib :-) )

It works: ./configure finishes, and I ran make.

To be safe, I also downloaded bison-1.875d.tar.gz from http://sunfreeware.com/, compiled it, and installed the new bison under /usr/local/bin.

I re-ran ./configure, and it no longer complained about the version of bison.

I ran make clean, make, and make install.

I now have postgresql-7.4.2 installed under /usr/local/pgsql; the whole build is recapped below.
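The condensed build, assuming gcc from the companion CD and the crle fix described above:

crle -u -l /opt/sfw/lib                        # let the runtime linker find libreadline.so.4
gunzip -c postgresql-7.4.2.tar.gz | tar xvf -
cd postgresql-7.4.2
./configure
make && make install                           # installs under /usr/local/pgsql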

The following is my experience setting up ARCo.

It follows chapter 8 of the installation guide very closely, with minor modifications.

setup the postgresql SW (page 89)

  1. create home directory for postgres user
    • groupadd postgres
    • mkdir -p /export/postgres/data
    • useradd -d /export/postgres -g postgres postgres
    • chown postgres:postgres /export/postgres/
    • su - postgres
  2. setup a database
    • /usr/local/pgsql/bin/initdb -D ./data
  3. modify data/pg_hba.conf
    • host arco arco_write ip 255.255.255.0 md5
    • host arco arco_read ip 255.255.255.0 md5

  4. I changed the password authentication method to md5
  5. modify data/postgresql.conf
    • tcpip_socket =true
  6. start database
    • /usr/local/pgsql/bin/postmaster -i -D ./data
    • run it in a separate terminal so you can watch the console output

setup a postgresql database (page 92)

  1. su - postgres
  2. Create the owner of the database
    /usr/local/pgsql/bin/createuser -P arco_write
    • Enter password for new user:
    • Enter it again:
    • Shall the new user be allowed to create databases? y
    • Shall the new user be allowed to create more new users? n
  3. Creating the database for ARCO
    • /usr/local/pgsql/bin/createdb -O arco_write arco
  4. (Use the new way documented in the README of patch 118093-01, pointed out by Richard Hierimeier.)
  5. Create a database user for reading the database
    createuser -P arco_read
    • Enter password for new user:
    • Enter it again:
    • Shall the new user be allowed to create databases? n
    • Shall the new user be allowed to create more new users? n
The postmaster's console terminal will show the database creating many tables and views (this happens during the dbwriter installation below).

Install the ARCO SW (page 95)

I followed page 95 of the installation guide, except at step 9:
I used postgresql-7.4.2.jar instead of pg73jdbc2.jar (one needs to delete the old jar).

At step 10:

The ARCo web application connects to the database with a user which has restricted access. The name of this database user is needed to grant him access to the sge tables.

Please enter the name of this database user [arco_read] >>
Upgrade to database model version 1 ...
Install version 6.0 (id=0) -------
Create table sge_job
Create index sge_job_idx0
Create index sge_job_idx1
create table sge_job_usage
Create table sge_job_log
Create table sge_job_request
Create table sge_queue
Create index sge_queue_idx0
Create table sge_queue_values
Create index sge_queue_values_idx0
Create table sge_host
Create index sge_host_idx0
Create table sge_host_values
Create index sge_host_values_idx0
Create table sge_department
Create index sge_department_idx0
Create table sge_department_values
Create index sge_department_values_idx0
Create table sge_project
Create index sge_project_idx0
Create table sge_project_values
Create index sge_project_values_idx0
Create table sge_user
Create table sge_user_values
Create index sge_user_values_idx0
Create table sge_group
Create index sge_group_idx0
Create table sge_group_values
Create index sge_group_values_idx0
Create table sge_share_log
Create view view_accounting
Create view view_job_times
Create view view_jobs_completed
Create view view_job_log
Create view view_department_values
Create view view_group_values
Create view view_host_values
Create view view_project_values
Create view view_queue_values
Create view view_user_values
revoke privileges from sge_department
revoke privileges from sge_department_values
revoke privileges from sge_group
revoke privileges from sge_group_values
revoke privileges from sge_host
revoke privileges from sge_host_values
revoke privileges from sge_job
revoke privileges from sge_job_log
revoke privileges from sge_job_request
revoke privileges from sge_job_usage
revoke privileges from sge_project
revoke privileges from sge_project_values
revoke privileges from sge_queue
revoke privileges from sge_queue_values
revoke privileges from sge_share_log
revoke privileges from sge_user
revoke privileges from sge_user_values
grant privileges to view_accounting
grant privileges to view_department_values
grant privileges on sge_department to arco_read
grant privileges on sge_department_values to arco_read
grant privileges on sge_group to arco_read
grant privileges on sge_group_values to arco_read
grant privileges on sge_host to arco_read
grant privileges on sge_host_values to arco_read
grant privileges on sge_job to arco_read
grant privileges on sge_job_log to arco_read
grant privileges on sge_job_request to arco_read
grant privileges on sge_job_usage to arco_read
grant privileges on sge_project to arco_read
grant privileges on sge_project_values to arco_read
grant privileges on sge_queue to arco_read
grant privileges on sge_queue_values to arco_read
grant privileges on sge_share_log to arco_read
grant privileges on sge_user to arco_read
grant privileges on sge_user_values to arco_read
grant privileges on view_job_log to arco_read
grant privileges on view_job_times to arco_read
grant privileges on view_jobs_completed to arco_read
grant privileges on view_project_values to arco_read
grant privileges on view_queue_values to arco_read
grant privileges on view_user_values to arco_read
commiting changes
version 6.0 (id=0) successfully installed
Install version 6.0u1 (id=1) -------
Create table sge_version
Update view view_job_times
Update version table
commiting changes
version 6.0u1 (id=1) successfully installed
OK
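As a quick sanity check after the schema install, one can connect as both database users (a sketch; host and passwords as configured above, psql options per PostgreSQL 7.4):

/usr/local/pgsql/bin/psql -h my-host -U arco_write arco -c '\dt'
/usr/local/pgsql/bin/psql -h my-host -U arco_read arco -c 'SELECT count(*) FROM sge_job;'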

Install Sun Web Console (page 99)

I used the swc_sparc_2.0.3.tar.gz file. After extraction, you will have:
  • setup
  • .pkgrc
  • SUNWbzip.pkg: bzip compression utility
  • SUNWj3dev.pkg: J2SDK 1.4 dev tools
  • SUNWj3dmo.pkg: J2SDK 1.4 demo programs
  • SUNWj3man.pkg: J2SDK 1.4 man pages
  • SUNWj3rt.pkg: J2SDK 1.4 runtime environment
  • SUNWjato.pkg: Sun ONE Application Framework runtime
  • SUNWmcon.pkg: Sun Web Console 2.0.3 (Core)
  • SUNWctag.pkg: Sun Web Console 2.0.3 (Tags & Components)
  • SUNWcatu.pkg: Tomcat servlet/JSP container
This includes the Java SDK version 1.4.1.
Run the Sun Web Console setup script.

At this point the console will not start, because nothing has been registered yet.

Install the ARCO Console (page 100)

  1. cd /opt/n1ge/reporting
  2. ./inst_reporting
  3. Please enter the path to your java 1.4 installation [] >> /usr/java
  4. Please enter the path to the spool directory [/var/spool/arco] >>
  5. specify the parameters for the database connection
    • Enter your database type (o=oracle, p=postgresql) [p] >>
    • Please enter the name of your postgres db host []>> my-host
    • Please enter the port of your postgres db [5432] >>
    • Please enter the name of your postgres database [arco] >>
  6. Specify an accounting and reporting database user
    • Please enter the name of the database user [arco_read] >>
    • Please enter the password of the database user >>
    • please retype the password >>
    • Please enter the name of the database schema [public] >>
      Search for the jdbc driver org.postgresql.Driver
      in directory /opt/n1ge/reporting/WEB-INF/lib ...
      found in /opt/n1ge/reporting/WEB-INF/lib/postgresql-7.4.2.jar
      Should the connection to the database be tested? (y/n) [y] >> y
      test db connection to 'jdbc:postgresql://my-host.my-domain:5432/arco' ... OK
  7. Enter the login names of users who are allowed to store the queries and results
    • Enter a login name of a user (Press enter to finish) >> test
      Users:test
    • Enter a login name of a user (Press enter to finish) >>
    Verify the information
  8. Create the query directory
    Starting Sun web console version 2.0.3 .....
  9. connect to the SWC
    https://hostname:6789/
  10. Login with UNIX account

Tuesday Jul 13, 2004

HA-GRID: N1Grid Engine 6 Edition with Berkeley DB spooling

N1GE6 introduces spooling with Berkeley DB. There are new issues to consider when we want to provide an HA N1GE6 service with the Java ES Cluster service.

  • Tightly coupled case: an HA sge_qmaster agent. In this case we install Berkeley DB on the same host as sge_qmaster.

    if we assume that SGE_ROOT=/opt/n1ge and cell=default

    the start script: $SGE_ROOT/$cell/common/sgemaster start

    the stop script: $SGE_ROOT/$cell/common/sgemaster stop

    Using the SunPlex agent builder, one can easily create an sge_qmaster agent and a SUNWmsge package.

    We also need another agent to modify the host_aliases file:

    1. on node1 it should read: sge_master node1
    2. on node2 it should read: sge_master node2
    One can also run the HA-NFS agent to serve the SGE_ROOT directory to all exec hosts.
  • Loosely coupled case: an HA-BDB agent plus sge_qmaster and sge_shadow_master nodes.

    In this case we install Berkeley DB on a separate node pair. This node pair runs the HA-NFS and HA-BDB agents. HA-NFS comes with SunCluster, and it serves the SGE_ROOT directory to the sge_shadow_master and possibly to all exec hosts.

    HA-BDB serves Berkeley DB failover. One installs BDB with the inst_sge -db command; we then use the SunPlex agent builder to build an HA agent that controls the start and stop of BDB.

Let's assume that SGE_ROOT=/opt/n1ge and cell=default.

The start script: $SGE_ROOT/$cell/common/sgebdb start

The stop script: $SGE_ROOT/$cell/common/sgebdb stop

The BDB spool directory: $SGE_ROOT/$cell/spooldb

From a previous blog posting, one can create a SUNWbdb agent using this information; a sketch of its start method is below.
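A minimal sketch of the Start method (hypothetical; a real SunPlex agent-builder method also handles validation and process monitoring):

#!/bin/sh
# Start method for the SUNWbdb agent: delegate to the sgebdb rc script above.
SGE_ROOT=/opt/n1ge
cell=default
exec $SGE_ROOT/$cell/common/sgebdb start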

Tuesday Jun 08, 2004

HA-GRID HowTo

Here we describe several ways to make SGE 5.3px or N1GE6 highly available; in other words, how to integrate N1GE6/SGE 5.3px with the SunCluster software.

  • loosely coupled: just use HA-NFS as the filesystem to mount the SGE_ROOT directory; one can configure one qmaster and one or more shadow master hosts
  • tightly coupled: in this case one uses the GFS (Global FS) feature of SunCluster; one cluster node can be the qmaster and another the shadow master, and we also use HA-NFS to share the SGE_ROOT directory
  • tighter coupling: here one actually needs to create two custom agents:
    1. a qmaster agent, to fail over the qmaster
    2. a host-alias agent, to maintain a dynamic host_aliases file that associates the physical node name with the logical hostname of the qmaster

In a future post we will describe how to use the SunCluster agent builder to create custom agents.
