Tips & Tricks for SDM Cloud Adapter with Zones

The zones scripts for the SDM Cloud Adapter are not really feature complete. A reboot of the master host can lead to the following problems:

  1. The SDM system does not start up automatically. You have to start it with
    # sdmadm startup_jvm -force
    The -force option is needed because the SDM system does not clean up the pid file at shutdown.
  2. The sge service might go into ERROR state after startup.
    # sdmadm ss
    host       service    cstate  sstate 
    -------------------------------------
    lappy      sge        STOPPED  ERROR  
               spare_pool STARTED RUNNING
               zones      STARTED RUNNING
    The reason can be that the qmaster has not been started automatically. Start the qmaster and start up the sge service again:
    # /gridware/sge/default/common/sgemaster
    # sdmadm suc -c sge -h localhost
    comp host       message          
    ---------------------------------
    sge  lappy      startup triggered
    # sdmadm ss
    host       service    cstate  sstate 
    -------------------------------------
    lappy      sge        STARTED RUNNING
               spare_pool STARTED RUNNING
               zones      STARTED RUNNING

  3. The zones service goes into error recovery mode. You will find the following error messages in the log file:
    # tail /var/sdm/sdm1/log/cs_vm-0.log
    07/01/2009 11:27:18|21|...|I|Service zones: Started up 1 cloud hosts: [[hostname: z1, instanceId: i-z1, launchTime: 2009-07-01T11:27:18.000Z] ].
    07/01/2009 11:27:20|21|...|W|Service zones:The registered set of cloud host does not match the reported set! Registered mismatches [[hostname: z1, instanceId: i-z1, launchTime: 2009-07-01T11:27:18.000Z] ]. Reported mismatches []
    07/01/2009 11:27:20|21|...|W|Service zones:Problem: VPN server z1 is corrupted! The cloud host is not reported anymore!
    The problem is that the zones (and the SDM components on the zones) are not started automatically. Normally the zones script should detect this problem and recover from it. However, the recovery is not implemented.
    To solve the problem, shut down the SDM system:
    # sdmadm sdj -all -h localhost
    Then clean up the spool directory on the master host:
    # rm `find /var/spool/sdm/sdm1/spool/*.srf`
    # rm /var/spool/sdm/sdm1/spool/cloud_hosts.spool
    Finally you can restart the system:
    # sdmadm suj
    To avoid this problem, you have to remove all zone resources from the SDM system before rebooting the master host. This can easily be done by setting the min and max attributes of the resource amount optimizer to 0. The zones service will then shut down all zones:
    # sdmadm mc -c zones
    <common:componentConfig xsi:type="cloud_adapter:CloudAdapterConfig"
        ...
        <cloud_adapter:optimizer xsi:type="cloud_adapter:MinMaxResourceAmountOptimizerConfig"
                                 max="0"
                                 min="0">
        ...
    </common:componentConfig>
    # sdmadm uc -c zones

    After a short time all resources from the zones service will disappear. You can then shut down the host.
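
The status check from problem 2 can be scripted after a reboot. The following is only a minimal sketch: it assumes that `sdmadm ss` keeps the four-column layout shown in this post (host, service, cstate, sstate, with the host column left blank on continuation rows). The function name `services_in_error` is hypothetical and not part of SDM.

```shell
#!/bin/sh
# Hypothetical helper, not part of SDM: print the names of services whose
# sstate column is ERROR. Reads a table in the `sdmadm ss` layout on stdin.
services_in_error() {
    awk '
        $NF == "ERROR" {
            if (NF == 4)      print $2   # host service cstate sstate
            else if (NF == 3) print $1   # continuation row, host column blank
        }
    '
}

# Example (assumes sdmadm is on the PATH):
#   sdmadm ss | services_in_error
```

For the first status table in this post the function would print `sge`. Header and separator rows need no special handling because their last field is never `ERROR`.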

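Problem 3 can be spotted from the log before anyone notices missing resources. The sketch below assumes the warning texts stay exactly as in the log excerpt quoted in this post. The function name `zones_needs_recovery` is hypothetical, and the check only reports the condition; it does not perform the spool cleanup.

```shell
#!/bin/sh
# Hypothetical helper, not part of SDM: exit 0 if the log text on stdin
# contains one of the zones warnings quoted in this post, non-zero otherwise.
zones_needs_recovery() {
    grep -E -q '\|W\|Service zones:.*(does not match the reported set|is not reported anymore)'
}

# Example (log path as used earlier in this post):
#   tail -n 200 /var/sdm/sdm1/log/cs_vm-0.log | zones_needs_recovery \
#       && echo "zones service needs the spool cleanup"
```

Matching on the `|W|` severity marker keeps informational lines such as "Started up 1 cloud hosts" from triggering the check.
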
About

rhierlmeier
