MySQL Replication: Self-Healing Recovery with GTIDs and MySQL Utilities
By Mat Keep on Apr 10, 2013
MySQL 5.6 includes a host of enhancements to replication, enabling DevOps teams to reliably scale-out their MySQL infrastructure across commodity hardware, on-premise or in the cloud.
One of the most significant enhancements is the introduction of Global Transaction Identifiers (GTIDs) where the primary development motivation was:
- enabling seamless failover or switchover from a replication master to slave
- promoting that slave to the new master
- without manual intervention and with minimal service disruption.
You can download the new MySQL Replication High Availability Guide to learn more. The following sections provide an overview of how GTIDs and new MySQL utilities work together to enable self-healing replication clusters.
To understand the implementation and capabilities of GTIDs, we will use an example illustrated below:
- Server “A” crashes
- We need to failover to one of the slaves, promoting it to master
- The remaining server becomes a slave of that new master
Figure 1: MySQL Master Failover and Slave Promotion
As MySQL replication is asynchronous by default, servers B and C may not have both replicated and executed the same number of transactions, i.e. one may be ahead of the other. Consider the following scenarios:
- Server B is ahead of C and B is chosen as the new master;
- [Then] Server C needs to start to replicate from the first transaction in server B that it is yet to receive;
- Server C has executed transactions that have so far not been received by Server B, but the administrator designates Server B as the new master (for example, it is configured with higher performing hardware than Server C).
- Server B therefore needs to execute later transactions from Server C, before assuming the role of the master, otherwise lost transactions and conflicts can ensue.
GTIDs apply a unique identifier to each transaction, so it is becomes easy to track when it is executed on each slave. When the master commits a transaction, a GTID is generated which is written to the binary log prior to the transaction. The GTID and the transaction are replicated to the slave.
If the slave is configured to write changes to its own binary log, the slave ensures that the GTID and transaction are preserved and written after the transaction has been committed.
The set of GTIDs executed on each slave is exposed to the user in a new, read-only, global server variable, gtid_executed. The variable can be used in conjunction with GTID_SUBTRACT() to determine if a slave is up to date with a master, and if not, which transactions are missing.
A new replication protocol was created to make this process automatic. When a slave connects to the master, the new protocol ensures it sends the range of GTIDs that the slave has executed and committed and requests any missing transactions. The master then sends all other transactions, i.e. those that the slave has not yet executed. This is illustrated in the following example (note that the GTID numbering and binlog format is simplified for clarity):
Figure 2: Automatically Synchronizing a New Master with it's Slaves
In this example, Server B has executed all transactions before Server A crashed. Using the MySQL replication protocol, Server C will send “id1” to Server B, and then B will send “id 2” and “id3” to Server C, before then replicating new transactions as they are committed. If the roles were reversed and Server C is ahead of Server B, this same process also ensures that Server B receives any transactions it has so far not executed, before being promoted to the role of the new master.
We therefore have a foundation for reliable slave promotion, ensuring that any transactions executed on a slave are not lost in the event of an outage to the master.
MySQL Utilities for GTIDs
Implemented as Python-based command-line scripts, the Replication utilities are available as a standalone download as well as packaged with MySQL Workbench. The source code is available from the MySQL Utilities Launchpad page.
The utilities supporting failover and recovery are components of a broader suite of MySQL utilities that simplify the maintenance and administration of MySQL servers, including the provisioning and verification of replication, comparing and cloning databases, diagnostics, etc. The utilities are available under the GPLv2 license, and are extendable using a supplied library. They are designed to work with Python 2.6 and above.
Replication Utility: mysqlfailover
While providing continuous monitoring of the replication topology, mysqlfailover enables automatic or manual failover to a slave in the event of an outage to the master. Its default behavior is to promote the most viable slave, as defined by the following slave election criteria.
• The slave is running and reachable;
• GTIDs are enabled;
• The slaves replication filters do not conflict;
• The replication user exists;
• Binary logging is enabled.
Once a viable slave is elected (called the candidate), the process to retrieve all transactions active in the replication cluster is initiated. This is done by connecting the candidate slave to all of the remaining slaves thereby gathering and executing all transactions in the cluster on the candidate slave. This ensures no replicated transactions are lost, even if the candidate is not the most current slave when failover is initiated.
The election process is configurable. The administrator can use a list to nominate a specific set of candidate slaves to become the new master (e.g. because they have better performing hardware).
At pre-configured intervals, the utility will check to see if the server is alive via a ping operation, followed by a check of the connector to detect if the server is still reachable. If the master is found to be offline or unreachable, the utility will execute one of the following actions based on the value of the failover mode option, which enables the user to define failover policies:
• The auto mode tells the utility to failover to the list of specified candidates first and if none are viable, search the list of remaining slaves for a candidate.
• The elect mode limits election to the candidate slave list and if none are viable, automatic failover is aborted.
• The fail mode tells the utility to not perform failover and instead stop execution, awaiting further manual recovery actions from the DevOps team.
Via a series of Extension Points users can also bind in their own scripts to trigger failover actions beyond the database (such as Virtual IP failover). Review the mysqlfailover documentation for more detail on configuration and options of this utility.
Replication Utility: mysqlrpladmin
If a user needs to take a master offline for scheduled maintenance, mysqlrpladmin can perform a switchover to a specific slave (called the new master). When performing a switchover, the original master is locked and all slaves are allowed to catch up. Once the slaves have read all events from the original master, the original master is shutdown and control is switched to the new master.
There are many Operations teams that prefer to take failover decisions themselves, and so mysqlrpladmin provides a mechanism for manual failover after an outage to the master has been identified, either by using the health reporting provided by the utilities, or by alerting provided by a tool such as the MySQL Enterprise Monitor.
It is also possible to easily re-introduce recovered masters back to the replication topology by using the demote-master command and initiating a switchover to the recovered master.
Review the mysqlrpladmin documentation for more detail on configuration and options of the utility.
Resources to Get Started
In addition to the documentation, two additional resources are useful to enable you to learn more:
The combination of options to control failover or switchover and the ability to inform applications of the failover event are powerful features that enable self-healing for critical applications.