Notes on Sailfin Cluster Failure Management and GMS
By 143562 on Feb 17, 2009
Here are some short notes on how a SailFin cluster deals with instance failures. They are useful for troubleshooting and for debugging failure scenarios. But first, if you are unfamiliar with SailFin clustering, please read Quick Start with Sailfin Clustering.
SailFin relies on the Group Management Service (GMS) for its failure management. This includes detecting an instance's failure and sending the appropriate notification. Below is a list of some types of instance failure that GMS helps detect:
1. Process failures
   a. Node agent and instance process dying
   b. Instance process alone dying [Transient Failure]
2. Machine or network failures
   a. Network failure [cable snap at the machine's end or at the router's end]
   b. Power failure [of the machine hosting a SailFin instance]
Notes on how GMS works:
Each instance in a SailFin cluster runs its own GMS service, which starts when the instance starts. The GMS services running on all the instances of a cluster together form a logical GMS group.
Through the GMS service, each member of the group can send and receive signals. Using a heartbeat mechanism, the GMS services detect state changes such as the addition, failure, or recovery of a group member.
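To make the heartbeat idea concrete, here is a minimal sketch of heartbeat-based failure detection. It is not the GMS implementation: the class name HeartbeatMonitor, the state names, and the two thresholds are assumptions chosen for illustration, and the real GMS settings and algorithm may differ.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds for illustration only; these are not GMS defaults.
SUSPECT_AFTER = 3  # missed heartbeat intervals before a member is suspected
FAIL_AFTER = 6     # missed heartbeat intervals before failure is confirmed


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat time of each group member."""
    interval: float                      # expected seconds between heartbeats
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, member, now):
        """Record that a heartbeat from `member` arrived at time `now`."""
        self.last_seen[member] = now

    def state(self, member, now):
        """Classify a member by how many heartbeat intervals it has missed."""
        missed = (now - self.last_seen[member]) / self.interval
        if missed >= FAIL_AFTER:
            return "FAILED"
        if missed >= SUSPECT_AFTER:
            return "IN_DOUBT"
        return "ALIVE"
```

The two-step classification mirrors the idea of suspecting a member first and confirming its failure only after a longer silence.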
State changes such as these are registered as events and logged in each instance's server.log file under <sailfin-installation>/nodeagents/<agent-name>/<instance-name>/logs/server.log. For example, if an instance is shut down using the "asadmin stop-instance <instance-name>" command, all other instances in the group detect the shutdown, and a PEER_STOP_EVENT appears in their server log files. The events you are most likely to see are:
- PEER_STOP_EVENT: Indicates a planned shutdown of an instance (using the asadmin stop-instance command).
- ADD_EVENT: Indicates that an instance has been started (using the asadmin start-instance command) and that its GMS service has joined the logical GMS group.
- JOINED_AND_READY_EVENT: Indicates that startup of an instance is complete.
- IN_DOUBT_EVENT: Indicates that GMS suspects that an instance has failed. (To see this, kill an instance's process and its associated node agent's process, then check the messages in the logs of the other instances.)
- FAILURE_EVENT: Indicates confirmation of failure of an instance by GMS.
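When troubleshooting, it can help to pull just the GMS event lines out of a server.log. The following is a small sketch that assumes only that the event names listed above appear verbatim somewhere in a log line; the function name find_gms_events is hypothetical, and no particular log line format is assumed.

```python
import re

# Event names as listed in these notes.
GMS_EVENTS = (
    "PEER_STOP_EVENT",
    "ADD_EVENT",
    "JOINED_AND_READY_EVENT",
    "IN_DOUBT_EVENT",
    "FAILURE_EVENT",
)
EVENT_RE = re.compile(r"\b(" + "|".join(GMS_EVENTS) + r")\b")


def find_gms_events(lines):
    """Return (line_number, event_name) for each line that mentions a GMS event."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = EVENT_RE.search(line)
        if match:
            hits.append((number, match.group(1)))
    return hits

# Typical use: pass an open server.log file object, which iterates line by line:
#     with open(path_to_server_log) as log:
#         for number, event in find_gms_events(log):
#             print(number, event)
```

Running it against the logs of the surviving instances after a test failure shows the order in which each instance observed the events.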
Node Agent as a Watchdog:
One other failure detection mechanism does not involve GMS. The node agent acts as a watchdog for the instance: it detects failure of the instance process and attempts to restart the instance. This is the transient failure listed above as item 1 (b).
CLB as a Listener of GMS events:
The CLB (Converged Load Balancer) is a listener of GMS events and adjusts its behavior according to each event's significance. For example, the CLB considers an instance to be available to serve requests until it receives a FAILURE_EVENT for that instance from GMS. On receiving a FAILURE_EVENT, the CLB stops forwarding requests to the failed instance. The instance is added back to the CLB's list of available instances only after the CLB receives a JOINED_AND_READY_EVENT for that instance.
GMS failure detection and notification times can vary depending on the hardware used and on certain configurable GMS and SailFin settings. For information on this and other functionality provided by GMS, please read the documentation available at shoal.dev.java.net and swik.net/GlassFish+Shoal.
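The availability rule described here can be sketched as a small state tracker. This is an illustration of the rule only, not the CLB's code: the class name AvailabilityTracker and its method are hypothetical, and the sketch models just the two events mentioned above, leaving every other event without effect.

```python
class AvailabilityTracker:
    """Keeps the set of instances considered available to serve requests."""

    def __init__(self, instances):
        # Every known instance starts out as available.
        self.available = set(instances)

    def on_event(self, event, instance):
        """Update availability in response to a GMS event name."""
        if event == "FAILURE_EVENT":
            # Confirmed failure: stop forwarding requests to this instance.
            self.available.discard(instance)
        elif event == "JOINED_AND_READY_EVENT":
            # Startup complete: the instance may serve requests again.
            self.available.add(instance)
        # Assumption of this sketch: all other events, including
        # IN_DOUBT_EVENT and ADD_EVENT, leave the set unchanged.
```

One consequence of this rule is that an instance that is only suspected of failure (IN_DOUBT_EVENT) still receives requests until the failure is confirmed.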