An Example of Managing a Grid Engine Cluster with SDM

Once you have added the Grid Engine (GE) service to the SDM (Service Domain Manager), you can use the SDM to manage the Grid Engine cluster, adding or removing execution hosts as the Grid Engine workload changes.

Currently, the following SLOs (service level objectives) are available for the two services, the spare_pool service and the GE adapter service:

 o MinResourceSLO and FixedUsageSLO (can be used with any service)
 o PermanentRequestSLO (only for the spare_pool service)
 o MaxPendingJobsSLO (only for the GE service)

You can find more details about SLOs in the SDM wiki.
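For reference, an SLO is attached to a service by adding a <common:slo> entry to the service's <common:slos> element, as in the configurations shown later in this article. A minimal sketch of what a MinResourceSLO entry might look like follows; the exact attribute names (in particular "min") are an assumption based on the pattern of the other SLO entries here, so check the SDM documentation before using it:

    <common:slo xsi:type="common:MinResourceSLOConfig"
                min="2"
                urgency="60"
                name="min_host_slo">
        <common:request>type = "host"</common:request>
    </common:slo>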

In order to provide more execution hosts as the number of pending jobs increases, you need to add the MaxPendingJobsSLO to the GE service configuration as shown below. Note that the urgency is set to "99", the highest value, so that this SLO has the highest priority. Also note that the execd configuration has been extended so that execution hosts are automatically provisioned by the SDM.

node1# sdmadm mc -c gesvc  [modify GE adapter configuration]
...

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<common:componentConfig xsi:type="ge_adapter:GEServiceConfig"
                        mapping="default"
                        xmlns:executor="http://hedeby.sunsource.net/hedeby-executor"
                        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                        xmlns:reporter="http://hedeby.sunsource.net/hedeby-reporter"
                        xmlns:security="http://hedeby.sunsource.net/hedeby-security"
                  xmlns:resource_provider="http://hedeby.sunsource.net/hedeby-resource-provider"
                        xmlns:common="http://hedeby.sunsource.net/hedeby-common"
                        xmlns:ge_adapter="http://hedeby.sunsource.net/hedeby-gridengine-adapter">
    <common:slos>
        <common:slo xsi:type="common:FixedUsageSLOConfig"
                    urgency="50"
                    name="fixed_usage"/>
        <common:slo name="maxPendingJobs"
             xsi:type="ge_adapter:MaxPendingJobsSLOConfig"
             urgency="99"
             max="10">
        </common:slo>

    </common:slos>
    <ge_adapter:connection keystore="/var/sgeCA/port6236/default/userkeys/sdmadmin/keystore"
                           password=""
                           username="sdmadmin"
                           jmxPort="6238"
                           execdPort="6237"
                           masterPort="6236"
                           cell="default"
                           root="/var/opt/sge/6.2beta"
                           clusterName="p6236"/>
    <ge_adapter:sloUpdateInterval unit="minutes"
                                  value="5"/>
    <ge_adapter:execd adminUsername="root"
                      defaultDomain=""
                      ignoreFQDN="true"
                      rcScript="false"
                      adminHost="true"
                      submitHost="false"
                      cleanupDefault="true">         <ge_adapter:localSpoolDir>/var/spool/sge/execd</ge_adapter:localSpoolDir>
        <ge_adapter:installTemplate executeOn="exec_host">             <ge_adapter:script>/opt/sdm/6.2beta/util/templates/ge-adapter/install_execd.sh</ge_adapter:script>
            <ge_adapter:conf>/opt/sdm/6.2beta/util/templates/ge-adapter/install_execd.conf</ge_adapter:conf>
        </ge_adapter:installTemplate>
        <ge_adapter:uninstallTemplate executeOn="exec_host">             <ge_adapter:script>/opt/sdm/6.2beta/util/templates/ge-adapter/uninstall_execd.sh</ge_adapter:script>
            <ge_adapter:conf>/opt/sdm/6.2beta/util/templates/ge-adapter/uninstall_execd.conf</ge_adapter:conf>
        </ge_adapter:uninstallTemplate>
    </ge_adapter:execd>
</common:componentConfig>

After saving the configuration changes, update the configuration to make the changes effective:

node1# sdmadm uc [-c gesvc]  
comp              host  message
----------------------------------------
ca                node0 reload triggered
executor          node0 reload triggered
                  node1 reload triggered
                  node2 reload triggered
                  node3 reload triggered
gesvc             node1 reload triggered
reporter          node0 reload triggered
resource_provider node0 reload triggered

In the following example, the SDM is going to add more execution hosts to the GE cluster by moving resources from the spare_pool service when the number of pending jobs exceeds the limit defined in the MaxPendingJobsSLO.  For this purpose, we are going to provide some available resources (node1, node2 and node3) to the spare_pool service as shown below:

node2# sdmadm add_resource -r node1 -t host -s spare_pool
resource message
-----------------------------------------------------------
node1    Resource was added to the system.

node2# sdmadm add_resource -r node2 -t host -s spare_pool
resource message
-----------------------------------------------------------
node2    Resource was added to the system.

node2# sdmadm add_resource -r node3 -t host -s spare_pool
resource message
-----------------------------------------------------------
node3    Resource was added to the system.

node2# sdmadm show_resource
service    id    state    type flags usage annotation
-----------------------------------------------------
spare_pool node1 ASSIGNED host       1
           node2 ASSIGNED host       1
           node3 ASSIGNED host       1

node2# sdmadm show_resource -s gesvc
No resources has been found.

One way to add an execution host is to run the SDM command manually, as shown below. This command moves a particular resource to the designated service.

node3# sdmadm mvr -r node1 -s gesvc

Then, the SDM will automatically invoke the GE execution host installation procedure. One thing to note here is that, since node1 is the qmaster host, it will be flagged as a static resource, so node1 will not be removed from the GE service in the future.
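To verify the move, you can reuse the show_resource command from earlier; once the installation finishes, node1 should appear under the gesvc service with the S (static) flag in the flags column, as in the listings further below:

node3# sdmadm show_resource -s gesvc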

As we submit more and more jobs to the GE cluster, the SDM will add more execution hosts to the existing GE cluster as shown below:
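The jobs in this walkthrough are simple sleep jobs. A minimal sketch of how such a backlog could be generated, assuming a plain Bourne shell and with illustrative job counts and durations (the exact submission commands are not part of the original run), is:

node1# i=0
node1# while [ $i -lt 30 ]; do
>          qsub -b y -o /dev/null -j y /bin/sleep 300
>          i=`expr $i + 1`
>      done

With more than 10 jobs pending, the limit defined in the MaxPendingJobsSLO (max="10") is exceeded and the SDM starts requesting additional hosts.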

node1# qstat -f | more
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@node1                    BIP   0/4/4          0.56     sol-sparc64
     40 0.55500 sleep      root         r     06/24/2008 09:31:05     1
     41 0.55500 sleep      root         t     06/24/2008 09:31:05     1
     42 0.55500 sleep      root         t     06/24/2008 09:31:05     1
     43 0.55500 sleep      root         t     06/24/2008 09:31:05     1
---------------------------------------------------------------------------------
all.q@node2                    BIP   0/4/4          0.49     sol-sparc64
     36 0.55500 sleep      root         r     06/24/2008 09:31:05     1
     37 0.55500 sleep      root         t     06/24/2008 09:31:05     1
     38 0.55500 sleep      root         t     06/24/2008 09:31:05     1
     39 0.55500 sleep      root         t     06/24/2008 09:31:05     1
---------------------------------------------------------------------------------
all.q@node3                    BIP   0/4/4          0.57     sol-sparc64
     32 0.55500 sleep      root         r     06/24/2008 09:25:26     1
     33 0.55500 sleep      root         r     06/24/2008 09:25:29     1
     34 0.55500 sleep      root         r     06/24/2008 09:25:29     1
     35 0.55500 sleep      root         r     06/24/2008 09:25:29     1

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
     44 0.00000 sleep      root         qw    06/24/2008 09:22:47     1
     45 0.00000 sleep      root         qw    06/24/2008 09:22:51     1
     46 0.00000 sleep      root         qw    06/24/2008 09:22:51     1
     47 0.00000 sleep      root         qw    06/24/2008 09:22:51     1
     48 0.00000 sleep      root         qw    06/24/2008 09:22:51     1
     49 0.00000 sleep      root         qw    06/24/2008 09:22:52     1
     50 0.00000 sleep      root         qw    06/24/2008 09:22:52     1
     51 0.00000 sleep      root         qw    06/24/2008 09:22:52     1
     52 0.00000 sleep      root         qw    06/24/2008 09:22:52     1
     53 0.00000 sleep      root         qw    06/24/2008 09:22:53     1
     54 0.00000 sleep      root         qw    06/24/2008 09:22:53     1
     55 0.00000 sleep      root         qw    06/24/2008 09:22:53     1
     56 0.00000 sleep      root         qw    06/24/2008 09:22:53     1
     57 0.00000 sleep      root         qw    06/24/2008 09:22:53     1
     58 0.00000 sleep      root         qw    06/24/2008 09:22:54     1
     59 0.00000 sleep      root         qw    06/24/2008 09:22:54     1
     60 0.00000 sleep      root         qw    06/24/2008 09:22:54     1
     61 0.00000 sleep      root         qw    06/24/2008 09:22:55     1
     62 0.00000 sleep      root         qw    06/24/2008 09:22:55     1
     63 0.00000 sleep      root         qw    06/24/2008 09:22:55     1
     64 0.00000 sleep      root         qw    06/24/2008 09:22:55     1
     65 0.00000 sleep      root         qw    06/24/2008 09:22:56     1
     66 0.00000 sleep      root         qw    06/24/2008 09:22:56     1
     67 0.00000 sleep      root         qw    06/24/2008 09:22:56     1
     68 0.00000 sleep      root         qw    06/24/2008 09:22:56     1
     69 0.00000 sleep      root         qw    06/24/2008 09:22:56     1


node1# sdmadm sr
service id    state    type flags usage annotation
--------------------------------------------------------------
gesvc   node1 ASSIGNED host S     inf   Got execd update event
        node2 ASSIGNED host       inf   Got execd update event
        node3 ASSIGNED host       99    Got execd update event

node3# sdmadm sr
service id    state    type flags usage annotation
--------------------------------------------------------------
gesvc   node1 ASSIGNED host S     50    Got execd update event
        node2 ASSIGNED host       50    Got execd update event
        node3 ASSIGNED host       50    Got execd update event

However, in the current configuration, even if all the jobs were completed, the execution hosts would remain in the GE service. They would not be put back into the spare_pool service from which they were originally taken.

In order to put these resources back into the spare_pool service, you need to modify the urgency value (currently 1) defined in the PermanentRequestSLO used by the spare_pool service so that it is greater than the urgency value (50) defined in the FixedUsageSLO of the gesvc service. In this example, the urgency value has been changed to 51.

node1# sdmadm mc -c spare_pool
...

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<common:componentConfig xsi:type="spare_pool:SparePoolServiceConfig"
                        xmlns:executor="http://hedeby.sunsource.net/hedeby-executor"
                        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                        xmlns:spare_pool="http://hedeby.sunsource.net/hedeby-sparepool"
                        xmlns:reporter="http://hedeby.sunsource.net/hedeby-reporter"
                        xmlns:security="http://hedeby.sunsource.net/hedeby-security"
                        xmlns:resource_provider="http://hedeby.sunsource.net/hedeby-resource-provider"
                        xmlns:common="http://hedeby.sunsource.net/hedeby-common"
                        xmlns:ge_adapter="http://hedeby.sunsource.net/hedeby-gridengine-adapter">
    <common:slos>
        <common:slo xsi:type="common:PermanentRequestSLOConfig"
                    quantity="10"
                    urgency="51"
                    name="PermanentRequestSLO">
            <common:request>type = "host"</common:request>
        </common:slo>
    </common:slos>
</common:componentConfig>
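After saving the spare_pool configuration, propagate the change just as before; assuming the same update step applies to the spare_pool component, it looks like this:

node1# sdmadm uc -c spare_pool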

Once these changes are effective, the execution hosts that were added to the GE cluster will be removed and put back into the spare_pool service. The SLO update interval is defined in the configuration; it is currently set to 5 minutes in the GE service configuration.

    <ge_adapter:sloUpdateInterval unit="minutes"
                                  value="5"/>

As I mentioned before, node1 will remain because it is flagged as a static resource.

node1# qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@node1                    BIP   0/0/4          0.26     sol-sparc64   

node1# sdmadm sr
service    id    state    type flags usage annotation           
-----------------------------------------------------------------
gesvc      node1 ASSIGNED host S     50    Got execd update event
spare_pool node2 ASSIGNED host       51                         
           node3 ASSIGNED host       51        


Another burst of jobs makes the SDM add more execution hosts to the GE cluster:

node1# qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@node1                    BIP   0/4/4          0.22     sol-sparc64   
    213 0.55500 sleep      root         r     07/09/2008 14:34:57     1        
    218 0.55500 sleep      root         r     07/09/2008 14:35:00     1        
    219 0.55500 sleep      root         r     07/09/2008 14:35:00     1        
    220 0.55500 sleep      root         r     07/09/2008 14:35:00     1        
---------------------------------------------------------------------------------
all.q@node2                    BIP   0/4/4          0.21     sol-sparc64   
    214 0.55500 sleep      root         r     07/09/2008 14:34:57     1        
    215 0.55500 sleep      root         r     07/09/2008 14:34:57     1        
    216 0.55500 sleep      root         r     07/09/2008 14:34:57     1        
    217 0.55500 sleep      root         r     07/09/2008 14:34:57     1        
---------------------------------------------------------------------------------
all.q@node3                    BIP   0/4/4          0.21     sol-sparc64   
    221 0.55500 sleep      root         r     07/09/2008 14:39:47     1        
    222 0.55500 sleep      root         r     07/09/2008 14:39:47     1        
    223 0.55500 sleep      root         r     07/09/2008 14:39:47     1        
    224 0.55500 sleep      root         r     07/09/2008 14:39:47     1        

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    225 0.00000 sleep      root         qw    07/09/2008 14:25:08     1        
    226 0.00000 sleep      root         qw    07/09/2008 14:29:41     1        
    227 0.00000 sleep      root         qw    07/09/2008 14:29:52     1        
    228 0.00000 sleep      root         qw    07/09/2008 14:29:52     1        
    229 0.00000 sleep      root         qw    07/09/2008 14:29:53     1      

node1# sdmadm sr
service id    state    type flags usage annotation           
--------------------------------------------------------------
gesvc   node1 ASSIGNED host S     99    Got execd update event
        node2 ASSIGNED host       99    Got execd update event
        node3 ASSIGNED host       99    Got execd update event


When all the jobs are cleared, the execution hosts without the static flag are put back into the spare_pool service.

node1# sdmadm sr
service    id    state    type flags usage annotation            
-----------------------------------------------------------------
gesvc      node1 ASSIGNED host S     50    Got execd update event
spare_pool node2 ASSIGNED host       51                          
           node3 ASSIGNED host       51                       

