More information about running MySQL on Open HA Cluster / Solaris Cluster
By Lenz Grimmer on Feb 03, 2009
We received a number of follow-up questions from our readers, requesting more technical background information. For example, Mark Callaghan was wondering about the following:
- How is failure detection done?
- How is promotion of a slave to the master done after failure detection?
- How are other slaves failed to the new master?
I asked Detlef to elaborate some more on the technical details of this solution. Here's his very exhaustive reply. Thank you very much, Detlef!
I would also like to point out that he'll be speaking about Solutions for High Availability and Disaster Recovery with MySQL at this year's MySQL Conference & Expo in Santa Clara, which will take place on April 20-23, 2009.
But now without further ado, here are Detlef's answers:
First, we have to differentiate between the various offerings in our portfolio:
- Solaris Cluster 3.2 is the supported product running as a binary distribution on Solaris 9 and Solaris 10. It is meant for production environments.
- Open HA Cluster is the open source variant of the Solaris Cluster development source tree.
- Solaris Cluster Express is a binary distribution, built from the open and closed source of the Solaris Cluster development tree, running on a specific Solaris Express Community Edition build. It is meant for developers and early adopters of the newest features.
- Colorado is going to be the binary distribution built from the Open HA Cluster source tree, running on a specific OpenSolaris binary distribution. It is also meant for developers and early adopters of the newest features.
To answer the question about failure detection and automation, we have to differentiate between two scenarios. The first one is a normal cluster on one site, running MySQL databases amongst other applications. The second scenario consists of two clusters as described in scenario one, linked together with Sun Cluster Geographic Edition or its open source counterpart. The two clusters are dispersed between two sites. The main purpose of this scenario is disaster recovery, and the content of the MySQL database is replicated between the clusters using classical MySQL replication.
OpenHA Cluster / Sun Cluster configuration
The easiest case is a two-node configuration with various MySQL resources; for simplicity of the description, one master and one slave. Failure detection is done with various mechanisms on top of Solaris mechanisms such as IPMP and SMF. The levels are framework process monitoring, heartbeat and failure fencing, quorum, application process monitoring, and in-depth application probing. I ignore other applications, assuming that their in-depth probing and process monitoring are done in a comparable way.
Failure detection and automation.
We have to differentiate between hardware failure detection, framework failure detection, node failure detection and application failure detection. For hardware failure detection we leverage Solaris features such as fmd, SMF, and IPMP. On top of the hardware failure detection we have the framework failure detection (process monitoring), the node failure detection and the application failure detection.
The framework failure detection relies on kernel integration and features like SMF.
Framework process monitoring.
There are two levels. The lighter one is SMF: if an SMF-monitored process dies, it gets restarted. If the main cluster process dies (which is highly unlikely), the node panics and starts again.
Node failure detection
Heartbeat and failure fencing
The cluster forms a membership of all the nodes, and the nodes detect the presence of other nodes via a private interconnect (between 1 and 6 network connections). Offline nodes are fenced off from the cluster by SCSI reservations. Knowing that not every storage system supports this feature, we made the use of SCSI reservations configurable.
To protect the cluster from amnesia and split brain situations, we implemented a quorum mechanism. Every node has 1 vote in the quorum algorithm, and to guarantee a valid partition we use quorum devices. A quorum device can be a SCSI LUN or a Solaris server. This Solaris server cannot be a part of the cluster. If the cluster partitions for whatever reason, there is a race for quorum. The partition which gets the quorum survives; the other partition dies immediately and cannot form a cluster until it sees the surviving partition and gets permission to join. This is a very brief explanation; the implementation is robust enough to prevent split brains and amnesia effectively. More information can be found in the Sun Cluster concepts guide.
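The race for quorum can be illustrated with a small Python sketch. It is not the Sun Cluster implementation. It assumes that a partition survives only when it holds a strict majority of all configured votes, and the names `has_quorum`, `surviving_partition`, `device_votes` and `race_winner` are hypothetical. Only the "one vote per node" rule is taken from the text above.

```python
from typing import Dict, List, Optional

NODE_VOTES = 1  # from the text: every node has one vote


def has_quorum(votes_held: int, total_votes: int) -> bool:
    """Assumption: a partition needs a strict majority of all configured votes."""
    return votes_held > total_votes // 2


def surviving_partition(
    partitions: Dict[str, List[str]],
    device_votes: int,
    race_winner: str,
) -> Optional[str]:
    """Return the name of the partition that keeps running, or None.

    partitions   maps a partition name to the nodes it contains
    device_votes votes contributed by the quorum device (hypothetical parameter)
    race_winner  partition that reserved the quorum device first
    """
    node_count = sum(len(nodes) for nodes in partitions.values())
    total_votes = node_count * NODE_VOTES + device_votes
    for name, nodes in partitions.items():
        held = len(nodes) * NODE_VOTES
        if name == race_winner:
            held += device_votes  # the winner of the race also holds the device votes
        if has_quorum(held, total_votes):
            return name
    return None  # nobody holds a majority, so no partition may continue


# Two-node cluster split into two partitions, one quorum device with one vote.
split = {"site_a": ["node1"], "site_b": ["node2"]}
print(surviving_partition(split, device_votes=1, race_winner="site_b"))
```

Without a quorum device (`device_votes=0`) neither half of a split two-node cluster reaches a majority in this model, which is why the quorum device is needed as a tie-breaker.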
For application failure detection we again have two levels: first process monitoring, second in-depth application probing.
Application process monitoring
The MySQL server is only one process, and this process is registered with the process monitoring facility (PMF). If the data server process fails, PMF detects it in real time and triggers a restart. Each restart counts against the retry_limit. A restart sequence that happens too often within a certain amount of time is interrupted, because it is considered unsuccessful. Intervals and retry limits are configurable. If this interruption occurs, a failover is executed unless prevented by
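The restart-versus-failover decision described here can be sketched in Python as a counter over a sliding time window. This is only an illustration of the idea, not PMF code. The class `RestartPolicy`, the method `on_process_exit` and the parameter `interval_seconds` are hypothetical names, and only `retry_limit` comes from the text.

```python
from collections import deque
from typing import Deque


class RestartPolicy:
    """Allow at most retry_limit restarts within interval_seconds."""

    def __init__(self, retry_limit: int, interval_seconds: float) -> None:
        self.retry_limit = retry_limit
        self.interval_seconds = interval_seconds
        self._restarts: Deque[float] = deque()  # timestamps of recent restarts

    def on_process_exit(self, now: float) -> str:
        """Decide what to do when the monitored process has died at time now."""
        # Forget restarts that fall outside the configured interval.
        while self._restarts and now - self._restarts[0] > self.interval_seconds:
            self._restarts.popleft()
        if len(self._restarts) >= self.retry_limit:
            # Too many restarts in the window: the sequence counts as unsuccessful.
            return "failover"
        self._restarts.append(now)
        return "restart"


policy = RestartPolicy(retry_limit=2, interval_seconds=300)
for t in (0, 10, 20):
    print(t, policy.on_process_exit(now=t))
```

With these example values the first two failures lead to local restarts, and a third failure inside the same 300 seconds leads to a failover.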
Application in depth probing of MySQL
Every minute there is a mysqladmin ping, and every five minutes we perform a table manipulation in a special monitoring database which is not replicated. In case of a failure we first try a configurable number of local restarts, and if these restarts fail within a configurable interval, we execute a failover. If the data server is running as a slave, we query the slave status command for a consistent thread situation, IO_State and error numbers. If something is wrong with the replication we issue a warning. Currently we do not have an automated failover in this scenario, but it is planned for future versions. Unfortunately I am unable to share a timeline. So the question "How are other slaves failed to the new master?", as far as automation is concerned, will be answered then. Currently it is a manual process.
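The slave check can be pictured with a short Python sketch. It is not the agent's actual probe. It assumes the output of SHOW SLAVE STATUS has already been read into a dictionary keyed by the MySQL column names Slave_IO_Running, Slave_SQL_Running, Slave_IO_State and Last_Errno. The function name `replication_warnings` and the exact rules are assumptions made for illustration.

```python
from typing import Dict, List


def replication_warnings(status: Dict[str, str]) -> List[str]:
    """Return warnings for one row of SHOW SLAVE STATUS given as a dict."""
    warnings: List[str] = []

    io_running = status.get("Slave_IO_Running", "No")
    sql_running = status.get("Slave_SQL_Running", "No")
    # Consistent thread situation: both threads should be in the same state.
    if io_running != sql_running:
        warnings.append(
            f"inconsistent threads: IO={io_running}, SQL={sql_running}"
        )

    # Assumption: a running IO thread should report a non-empty state.
    if io_running == "Yes" and not status.get("Slave_IO_State", ""):
        warnings.append("IO thread is running but reports no state")

    errno = str(status.get("Last_Errno", "0"))
    if errno != "0":
        warnings.append(f"replication error number {errno}")

    return warnings


broken = {
    "Slave_IO_Running": "Yes",
    "Slave_SQL_Running": "No",
    "Slave_IO_State": "Waiting for master to send event",
    "Last_Errno": "1062",
}
for message in replication_warnings(broken):
    print("WARNING:", message)
```

In line with the description above, the sketch only reports warnings and does not trigger a failover.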
How is promotion of a slave to the master done after failure detection?
This is not implemented in scenario 1, but it is in scenario 2. An automated failover between master and slave is on our radar, and the implementation will be performed in the same way as described in scenario 2, except that it will be fully automated here.
Scenario 2 (here we have an automated failover)
OpenHA Cluster / Sun Cluster configuration geographic edition (SC GEO)
SC GEO is a layer on top of Sun Cluster and provides disaster recovery. Here we have two clusters, one on each site; think of one in New York and one in San Francisco. Between the two clusters we need an IP communication link for monitoring. We also need a replication protocol configured between the two sites. Currently, we offer StorageTek Availability Suite, TrueCopy and SRDF. In the near future we will have MySQL master/slave replication as well, and at the same time we will offer a framework to integrate new replication protocols. In fact, MySQL replication is the reference replication protocol for the new framework. This new framework is called script-based plugin (SBP).
Failure detection
If the remote cluster is down (detected via ping), we raise an alert and ask the administrator to initiate the takeover. The administrator has to push a button, and SC GEO will perform the takeover. For maintenance reasons, or to implement a follow-the-sun policy, we support planned failover as well; a planned failover initiates role conversions. The administrator does not have to care about the detailed steps to take. We perform the takeover in a semi-automated fashion, because resynchronization can be a costly process and we want an informed decision here. Full automation would be possible but is not advisable. If the replication framework fails, we raise an error status which should be caught by system monitoring.
How is promotion of a slave to the master done after failure detection?
Here we have two cases, failover and takeover. The takeover is simple, because the old master is dead. So we just promote the slave to the master.
The failover is more sophisticated. If the slave is in sync with the master, the roles are swapped and the old slave is reset and re-prepared so that it can be swapped again. If the slave is behind the master, we stop the IP address where clients connect to the slave to prevent a further backlog and wait until the slave is in sync. Once the slave is in sync, we swap the roles as described above. After that, we start the IP address again.
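The order of the failover steps can be summarized in a small Python simulation. It is not SC GEO code. The names `ReplicationPair`, `backlog`, `wait_for_sync` and `planned_failover` are hypothetical, and waiting for the slave is modeled simply by counting the backlog down to zero.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReplicationPair:
    master: str
    slave: str
    backlog: int  # hypothetical measure: events the slave still has to apply
    log: List[str] = field(default_factory=list)

    def wait_for_sync(self) -> None:
        # Stand-in for waiting until the slave has applied all pending events.
        while self.backlog > 0:
            self.backlog -= 1


def planned_failover(pair: ReplicationPair) -> ReplicationPair:
    """Swap master and slave roles following the steps described in the text."""
    ip_stopped = False
    if pair.backlog > 0:
        # Slave is behind: stop client traffic so the backlog cannot grow.
        pair.log.append("stop client IP address")
        ip_stopped = True
        pair.log.append("wait until slave is in sync")
        pair.wait_for_sync()
    # Slave is in sync now: swap the roles.
    pair.master, pair.slave = pair.slave, pair.master
    pair.log.append("swap roles")
    if ip_stopped:
        pair.log.append("start client IP address")
    return pair


pair = planned_failover(ReplicationPair("new-york", "san-francisco", backlog=3))
print(pair.master, pair.slave)
print(pair.log)
```

When the slave is already in sync (`backlog=0`), the sketch performs only the role swap and leaves the IP address untouched.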
In the first release of the MySQL SBP implementation there is a limitation: we allow only one master/slave pair in the DR case, so the clusters on each site do not contain an internal master/slave replication. We will remove this limitation in the future.
Additional information about Solaris Cluster can be found at: http://docs.sun.com/app/docs/prod/sun.cluster32?l=en&a=view and http://docs.sun.com/app/docs/prod/sun.cluster.geo32?l=en&a=view
The descriptions above are just a brief summary of our protection levels. For further information I would encourage you to study http://docs.sun.com/app/docs/doc/820-2554?l=en and http://docs.sun.com/app/docs/doc/819-3059?l=en
For Geographic Edition you can find information at: http://docs.sun.com/app/docs/coll/1191.4?l=en
Additional information for Open HA Cluster is located here: http://opensolaris.org/os/community/ha-clusters/