Tuesday Feb 17, 2009

Notes on Sailfin Cluster Failure Management and GMS


Here are some short notes on how a SailFin cluster deals with instance failures. They are useful for troubleshooting and for debugging failure scenarios. But first, if you are unfamiliar with SailFin clustering, please read Quick Start with Sailfin Clustering.

SailFin relies on the Group Management Service (GMS) for its failure management. This includes detecting an instance's failure and sending the appropriate notifications. Below are some types of instance failure that GMS helps detect:
  1. Software Failure:
    a. Node Agent and instance process dying
    b. Instance process alone dying [Transient Failure]

  2. Hardware Failure:
    a. Network Failure [cable snap at the machine's end or at the router's end]
    b. Power Failure [of the machine hosting a sailfin instance]


Notes on how GMS works:
  1. Each instance in a SailFin cluster runs a GMS service, which starts when the instance starts. The GMS services running on all the instances of a cluster together form a logical GMS group.

  2. Through the GMS service, each member of the group can send and receive signals. Using a heartbeat mechanism, the GMS services detect state changes such as the addition, failure or recovery of a group member.

  3. These state changes are registered as events and are logged in the instance's server.log file under <sailfin-installation>/nodeagents/<agent-name>/<instance-name>/logs/server.log. For example, if an instance is shut down using the "asadmin stop-instance <instance-name>" command, all other instances in the group detect this shutdown, and you will see a PEER_STOP_EVENT registered in their server.log files. (A small log-scanning sketch follows this list.)
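When troubleshooting, it helps to pull just the GMS event lines out of a server.log. Below is a minimal sketch in Java; the default path and the assumption that the event name appears verbatim in the log line are mine, so adjust both to match your installation.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Minimal sketch: print server.log lines that mention one of the GMS events
// discussed in this post. The default path below is hypothetical -- replace the
// install, agent and instance names with your own.
public class GmsLogScan {
    private static final String[] EVENTS = {
        "PEER_STOP_EVENT", "ADD_EVENT", "JOINED_AND_READY_EVENT",
        "IN_DOUBT_EVENT", "FAILURE_EVENT"
    };

    public static void main(String[] args) throws IOException {
        String log = args.length > 0 ? args[0]
            : "/opt/sailfin/nodeagents/agent1/instance1/logs/server.log";
        for (String line : Files.readAllLines(Paths.get(log))) {
            for (String event : EVENTS) {
                if (line.contains(event)) {
                    System.out.println(line);
                    break;
                }
            }
        }
    }
}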

Below is a list of some important GMS events along with their significance:
  1. PEER_STOP_EVENT: Indicates a planned shutdown of an instance (using the asadmin stop-instance command).
  2. ADD_EVENT: Indicates that an instance has been started (using the asadmin start-instance command) and that its GMS service has joined the logical GMS group.
  3. JOINED_AND_READY_EVENT: Indicates that startup of an instance is complete.
  4. IN_DOUBT_EVENT: Indicates that GMS suspects an instance has failed. (Try this by killing an instance along with its associated node-agent process and watching the messages in the logs of the other instances.)
  5. FAILURE_EVENT: Indicates that GMS has confirmed the failure of an instance.
Each of these log messages also names the instance associated with the event, which is quite handy when debugging failure scenarios.
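These events correspond to notification signals in Shoal, the library that implements GMS, and code co-located with an instance can register for them programmatically. The sketch below is only an outline based on my recollection of the Shoal 1.x client API (the factory and signal class names may differ between versions); it is not SailFin's own wiring.

import java.util.Properties;

import com.sun.enterprise.ee.cms.core.CallBack;
import com.sun.enterprise.ee.cms.core.GMSException;
import com.sun.enterprise.ee.cms.core.GMSFactory;
import com.sun.enterprise.ee.cms.core.GroupManagementService;
import com.sun.enterprise.ee.cms.core.Signal;
import com.sun.enterprise.ee.cms.impl.client.FailureNotificationActionFactoryImpl;
import com.sun.enterprise.ee.cms.impl.client.FailureSuspectedActionFactoryImpl;
import com.sun.enterprise.ee.cms.impl.client.JoinedAndReadyNotificationActionFactoryImpl;
import com.sun.enterprise.ee.cms.impl.client.PlannedShutdownActionFactoryImpl;

// Sketch of a Shoal GMS client that listens for the events discussed above.
// Class and method names follow the Shoal 1.x API as I remember it and may
// differ in your version.
public class GmsEventListener implements CallBack {

    public void processNotification(Signal signal) {
        // Each signal type maps to one of the events above, e.g.
        // FailureSuspectedSignal ~ IN_DOUBT_EVENT, FailureNotificationSignal ~ FAILURE_EVENT.
        System.out.println(signal.getClass().getSimpleName()
                + " received for member " + signal.getMemberToken());
    }

    public static void main(String[] args) throws GMSException {
        GmsEventListener callback = new GmsEventListener();
        GroupManagementService gms = (GroupManagementService) GMSFactory.startGMSModule(
                "instance1",                                  // this member's token (hypothetical name)
                "mycluster",                                  // GMS group, i.e. the cluster name
                GroupManagementService.MemberType.CORE,
                new Properties());

        gms.addActionFactory(new PlannedShutdownActionFactoryImpl(callback));            // PEER_STOP_EVENT
        gms.addActionFactory(new JoinedAndReadyNotificationActionFactoryImpl(callback)); // JOINED_AND_READY_EVENT
        gms.addActionFactory(new FailureSuspectedActionFactoryImpl(callback));           // IN_DOUBT_EVENT
        gms.addActionFactory(new FailureNotificationActionFactoryImpl(callback));        // FAILURE_EVENT

        gms.join();   // join the group and start receiving heartbeats and notifications
    }
}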

Node-Agent as a Watchdog:
One other failure detection mechanism does not involve GMS: the node-agent acts as a watchdog for its instances. It detects the failure of an instance process and attempts to restart it. This covers the transient failure listed as item 1(b) above.
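The actual node-agent logic is internal to SailFin, but the watchdog idea itself is simple. Here is a minimal, self-contained sketch; the launch command and back-off interval are placeholders for illustration, not how a node-agent really starts an instance.

import java.io.IOException;

// Toy watchdog loop: start a child process and restart it whenever it exits.
public class InstanceWatchdog {
    public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder launcher = new ProcessBuilder("java", "-jar", "instance.jar")
                .inheritIO();
        while (true) {
            Process instance = launcher.start();
            int exitCode = instance.waitFor();       // block until the process dies
            System.err.println("Instance exited with code " + exitCode + ", restarting...");
            Thread.sleep(2000);                      // brief back-off before restarting
        }
    }
}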

CLB as a Listener of GMS events:
The CLB (Converged Load Balancer) listens for GMS events and adjusts its behavior according to each event's significance. For example, the CLB considers an instance available to serve requests until it receives a FAILURE_EVENT for that instance from GMS. On receiving a FAILURE_EVENT, the CLB stops forwarding requests to the failed instance. The instance is added back to the CLB's list of available instances only after the CLB receives a JOINED_AND_READY_EVENT for that instance.
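To make that behavior concrete, here is a small, purely illustrative sketch of the bookkeeping just described; it is not CLB code, only a set of available instances driven by the two event types.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative bookkeeping only -- not actual CLB code. An instance stays routable
// until a FAILURE_EVENT removes it, and becomes routable again on JOINED_AND_READY_EVENT.
public class AvailableInstances {
    private final Set<String> available = ConcurrentHashMap.newKeySet();

    public void onJoinedAndReady(String instanceName) {
        available.add(instanceName);      // instance finished startup, start routing to it
    }

    public void onFailure(String instanceName) {
        available.remove(instanceName);   // confirmed failure, stop routing to it
    }

    public boolean canRoute(String instanceName) {
        return available.contains(instanceName);
    }
}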

GMS failure detection and notification times can vary depending on the hardware used and on configurable GMS and SailFin settings. For information on this and on other functionality provided by GMS, please read the documentation available at shoal.dev.java.net and swik.net/GlassFish+Shoal.


