By LinuxJedi on Jan 28, 2010
One of the most common errors we come across whilst supporting MySQL Cluster is the one commonly referred to as a 'GCP stop'. It occurs most frequently in clusters which have high activity and, more often than not, use disk data. So let's look into what a GCP stop is, why it happens and how to prevent it.
What is a GCP Stop?
All data that needs to be written to MySQL Cluster is first written to the REDO log. This is so that when a node starts, the log can be played back from the position of the last good LCP (Local CheckPoint, a point at which all the cluster data memory is written to disk). The REDO data needs to be consistent between all data nodes, and that is where the GCP (or Global CheckPoint) comes in. It synchronously flushes the REDO data across all data nodes to disk every 2 seconds (by default). A GCP stop happens when a new GCP is trying to commit the REDO to disk and the previous one has not finished. MySQL Cluster is a real-time database, so this is a critical problem and the node in question is shut down to preserve data integrity.
Why does a GCP Stop happen?
A GCP stop usually happens for one of two related reasons: either there is too much data to commit between GCPs for it all to be written to disk in time, or the disks are too slow.
You should now be able to see why this is more prominent on clusters using disk data: both the disk data and the GCP are written to disk at the same time (as well as things like the LCP), lowering the disk bandwidth available for the GCP.
This is also more common on multi-threaded data nodes (ndbmtd) in MySQL Cluster 7.0 because these can handle more data simultaneously and therefore can be in a situation where they need to write more to the REDO log.
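To make the two reasons above concrete, here is a rough back-of-the-envelope sketch in Python. It is a toy model, not how the data nodes themselves detect a GCP stop: it assumes a steady REDO write rate and a fixed disk bandwidth, and all function and parameter names are made up for illustration.

```python
def gcp_flush_time(redo_bytes_per_sec, gcp_interval_sec,
                   disk_bytes_per_sec, other_io_bytes_per_sec=0.0):
    """Estimate the seconds needed to flush one GCP's worth of REDO.

    Toy model (assumption): REDO accumulates at a constant rate during
    the interval, and the disk bandwidth left after other writers
    (disk data, LCP) is all that the GCP gets.
    """
    available = disk_bytes_per_sec - other_io_bytes_per_sec
    if available <= 0:
        return float("inf")  # no bandwidth left for the GCP at all
    return (redo_bytes_per_sec * gcp_interval_sec) / available


def gcp_at_risk(redo_bytes_per_sec, gcp_interval_sec,
                disk_bytes_per_sec, other_io_bytes_per_sec=0.0):
    """True if the flush would still be running when the next GCP starts."""
    flush = gcp_flush_time(redo_bytes_per_sec, gcp_interval_sec,
                           disk_bytes_per_sec, other_io_bytes_per_sec)
    return flush > gcp_interval_sec


MB = 1_000_000
# 30 MB/s of REDO, 2 second GCP interval, disk that sustains 100 MB/s.
print(gcp_at_risk(30 * MB, 2.0, 100 * MB))           # dedicated disk
print(gcp_at_risk(30 * MB, 2.0, 100 * MB, 80 * MB))  # shared with disk data and LCP
```

With these made-up numbers the dedicated disk needs about 0.6 seconds per GCP, while the shared disk needs 3 seconds, which is longer than the 2 second interval. The same REDO load is fine in one setup and at risk in the other purely because of competing disk I/O.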
How to prevent a GCP Stop
There are several effective ways to prevent a GCP stop:
1. Buy faster disks - this may not be an option, but if the data is written faster this can prevent a GCP stop
2. Spread the different parts of the data node onto different disks - the REDO, LCP and disk data can all be separated onto different disks, giving a much better disk I/O bandwidth to each
3. Commit more often - if you have a really long transaction with lots of data this could create a commit which is too large for one GCP
4. Configuration - there are some configuration settings you can tweak to improve things, but these will only give small improvements over the above three points. Decreasing TimeBetweenGlobalCheckpoints causes the data nodes to GCP more often, which means there is less to write to disk per checkpoint (but checkpointing more often also means less time to complete each checkpoint, so this is not always a good option). There are also settings affecting disk factors outside of GCP, such as DiskPageBufferMemory: increasing this will buffer more disk data (much like innodb_buffer_pool_size for InnoDB), decreasing the disk bandwidth that disk data uses so that the GCP can use more of it.
There are other settings that can be tweaked as a last resort depending on what kind of GCP Stop occurs (yes, there are a couple of different types) but the first three points should be a primary concern before thinking about doing this.
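As an illustration of point 3 (commit more often), here is a minimal Python sketch that splits one large insert into smaller transactions. It is written against the generic Python DB-API, and the helper name, batch size and table are hypothetical. The demo uses the standard library's sqlite3 module purely as a stand-in so the example runs anywhere; against MySQL Cluster you would pass a connection from your MySQL driver and use its placeholder style (usually %s instead of ?).

```python
import sqlite3
from itertools import islice


def insert_in_batches(conn, sql, rows, batch_size=1000):
    """Insert rows, committing after every batch_size rows.

    Returns the number of commits issued. Smaller commits mean each
    one carries less data, at the cost of the whole load no longer
    being a single atomic transaction.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(rows)
    cur = conn.cursor()
    commits = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        cur.executemany(sql, batch)
        conn.commit()  # end the transaction here instead of at the very end
        commits += 1
    return commits


def demo_with_sqlite():
    """Stand-in demo: 2500 rows in batches of 1000."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
    rows = ((i, "x" * 10) for i in range(2500))
    commits = insert_in_batches(conn, "INSERT INTO t VALUES (?, ?)", rows, 1000)
    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()
    return commits, count


print(demo_with_sqlite())
```

The right batch size depends on row size and write rate, so treat 1000 as a starting point to tune rather than a recommendation. Bear in mind that if a later batch fails, the earlier batches are already committed, so the application needs to be able to resume or clean up.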