Support of Sun Grid Engine 6.1 with Sun Cluster 3.1 and 3.2
By hhnguyen on Sep 13, 2007
You might be aware that Sun Grid Engine can already be configured with a Shadow Master Host, which detects a failure of the master daemon and takes over its role as master host.
Here are some reasons why you might want to deploy Sun Grid Engine on Sun Cluster instead of using Shadow Master Hosts, if the availability of the master host and the job scheduling engine, together with the complete stack, is important:
- No additional HA-NFS filer needed:
Usually the $SGE_ROOT directory, or at least the part relevant for the jobs, needs to be shared over NFS with all execution hosts in the grid. Without Sun Cluster, the NFS server would have to be made highly available by other means, e.g. by using an additional HA NAS filer.
With Sun Cluster, the Sun Cluster Data Service for NFS can be used in the same Resource Group to provide an HA NFS service and take care of data integrity without requiring additional systems.
- Sometimes even the Grid Engine binaries for the execution hosts are shared via NFS. The same considerations as in the previous point apply.
- Reliable and robust failover mechanisms:
The shadow daemon determines the availability of the master host by querying the sge_qmaster daemon over the network (a TCP connection and a heartbeat file shared over NFS). If only the network between the host running the shadow daemon and the host running sge_qmaster is unavailable, manual intervention is necessary, as described in the Sun N1 Grid Engine 6.1 Administration Guide, which says: "In order to start a shadow sge_qmaster, the system must be sure either that the old sge_qmaster has terminated, or that it will terminate without performing actions that interfere with the newly-started shadow sge_qmaster." The shadow daemon starts up automatically only if a lock file for sge_qmaster does not exist.
Sun Cluster implements protection against split-brain situations and solid failure fencing. The Sun Cluster framework guarantees that the Grid Engine service is available on only one node at a time, without any need for manual intervention. In addition, it probes the Grid Engine daemons locally through the data service.
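To make the shadow daemon's dilemma concrete, here is a minimal Python sketch of the takeover decision described above. It is not the actual sge_shadowd logic: the function name, the use of the heartbeat file's modification time, and the max_age_seconds threshold are all assumptions made for illustration.

```python
import os
import time


def shadow_may_take_over(heartbeat_path, lock_path, max_age_seconds, now=None):
    """Decide whether a shadow master could start automatically.

    Hypothetical simplification: the heartbeat is judged by the file's
    modification time, and both paths are supplied by the caller.
    """
    now = time.time() if now is None else now

    # A lock file for sge_qmaster blocks automatic startup.
    if os.path.exists(lock_path):
        return False

    try:
        age = now - os.path.getmtime(heartbeat_path)
    except OSError:
        # Heartbeat not readable (for example, NFS is down): the sketch
        # conservatively refuses to take over (an assumption).
        return False

    # A stale heartbeat looks the same whether the master died or only
    # the network path to it failed; the shadow host cannot tell which.
    return age > max_age_seconds
```

The last comment is the important part: nothing in this check can distinguish a dead master from an unreachable one, which is the gap that cluster-level fencing closes.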
- Unique and reliable service IP address for execution hosts:
The execution hosts need to determine the currently active host running the sge_qmaster daemon by looking up its name in the $SGE_ROOT/cell/common/act_qmaster file.
With Sun Cluster, they can rely on a highly available network resource implemented through the Resource Type SUNW.LogicalHostname. This enables a topology in which no NFS service has to be shared with the execution hosts. Execution hosts can then use local storage and always use the same highly available IP address to contact the sge_qmaster daemon.
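As a small illustration of the lookup mentioned above, the following Python sketch reads the master host name from the act_qmaster file. The function name is hypothetical, the cell name "default" is an assumption, and the file is assumed to hold the host name on its first non-empty line.

```python
from pathlib import Path


def read_act_qmaster(sge_root, cell="default"):
    """Return the host name stored in <sge_root>/<cell>/common/act_qmaster.

    Hypothetical helper: the "default" cell name and the one-name-per-file
    layout are assumptions for this sketch.
    """
    path = Path(sge_root) / cell / "common" / "act_qmaster"
    for line in path.read_text().splitlines():
        name = line.strip()
        if name:
            return name
    raise ValueError(f"{path} contains no host name")
```

With a logical hostname in place, the value returned here would stay the same across failovers, so execution hosts would not need to re-read a shared file to find the master.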
- Mechanism to reflect necessary service dependencies:
The shadow daemon must have read/write root access to the same file systems as the host running the sge_qmaster daemon in order to provide the same service upon failure of the sge_qmaster host, which again mandates that the NFS file system comes from a different system to achieve HA.
With Sun Cluster, this can be achieved by using shared storage. Proper resource dependencies can be placed between the HA NFS, sge_qmaster and sge_schedd resources to guarantee a clean startup upon failover.
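The effect of such dependencies on startup order can be sketched in a few lines of Python. This is only a model, not Sun Cluster configuration: the resource names and the direction of the dependency chain (sge_schedd on sge_qmaster, sge_qmaster on HA NFS) are assumptions chosen for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical resource names; each key depends on the resources in its set.
DEPENDS_ON = {
    "sge_qmaster": {"ha_nfs"},
    "sge_schedd": {"sge_qmaster"},
}


def startup_order(depends_on):
    """Return resources ordered so that every dependency starts first."""
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on):
    """Stop in the reverse order, dependents before their dependencies."""
    return list(reversed(startup_order(depends_on)))
```

For the chain above, startup_order(DEPENDS_ON) yields ha_nfs, then sge_qmaster, then sge_schedd, and shutdown_order(DEPENDS_ON) yields the reverse.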
- Faster and finer grained reaction upon individual service failures:
By default, the shadow daemon checks the sge_qmaster every 60 seconds (and the heartbeat file every 30 seconds).
With Sun Cluster, the Process Monitor Facility (PMF) makes sure that the Grid Engine processes are restarted immediately if they are lost. In addition, the local probe makes sure that a failure is detected only if the daemons do not react as expected. This probe does not depend on any network in between to determine that. The cluster agent monitors sge_qmaster and sge_schedd with a dedicated resource each, with the proper resource dependency in place. A failure of sge_schedd triggers a restart of sge_schedd only, not of sge_qmaster.
Solaris Cluster Engineering