Sunday Aug 22, 2010

Moving on from Oracle

This week will be my final week at Oracle.  I am very much going to miss the great team I work with, but for a number of other reasons I have decided to move on.

Where am I going?

I'm sure several people already know; if you don't, all will be revealed soon.  It will still be related to MySQL, but not so much to MySQL Cluster.  I fully intend to still write patches for MySQL and MySQL Cluster in my spare time, and I'll probably still be lurking on the MySQL Cluster mailing list and forums.

What does this mean for this blog?

In all likelihood it will become stale very quickly (worst case it will be removed).  I will no longer be able to edit it.

Will there be a new blog?

Yes, at some point soon.  Watch out on Planet MySQL.

Monday Jul 12, 2010

MySQL Cluster and NUMA

One problem we are starting to see quite often with MySQL Cluster relates to the current generation of Xeon processors.  This post outlines the problem and how to avoid it.

The Nehalem-based Intel Xeons (and some older AMD CPUs) use a technology called NUMA (Non-Uniform Memory Access).  This basically gives each CPU its own local bank of memory instead of all CPUs sharing access to all the memory.  For many applications this means much faster memory access. You should be able to see whether NUMA is on by looking for it in dmesg.
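
For example, you should be able to check with something like the following (numactl is a separate package on most distributions):

dmesg | grep -i numa
numactl --hardware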

So why is this a bad thing?

MySQL Cluster data nodes typically require a large portion of the memory, which very often means that one CPU will need to access memory attached to another CPU.  This is in general quite slow; on a busy cluster we have seen such access take 100ms - 500ms!  MySQL Cluster is real-time and is not a happy bunny when something stops it being real-time, so watchdog timeouts are very common on NUMA-based systems.

So, how can this be improved?

For starters, NUMA is easy to turn off: simply add the kernel boot option numa=off.  We have also observed that later Linux kernels (around 2.6.30) have an improved scheduler for NUMA and appear to be friendlier to MySQL Cluster, but I would personally recommend turning it off even with newer kernels.
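
As a sketch, on a GRUB-based system this means appending the option to the kernel line in the boot configuration (the path and kernel version here are purely illustrative):

kernel /vmlinuz-2.6.18-194.el5 ro root=/dev/sda1 numa=off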

What else can cause problems?

We do not yet have as much data on this, but it is also believed that dynamic CPU clocking can cause similar issues.  If the data node is not busy the CPU is clocked down, which then causes timing issues for the cluster.  I would recommend setting the CPU to full performance settings where possible.
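
As a rough example, on Linux the cpufreq governor can be forced to full speed like this (the sysfs paths can vary by distribution, and your kernel needs cpufreq support):

for governor in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > $governor
done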

Edit: In the comments, Mat Keep has confirmed that dynamic CPU clocking certainly causes performance issues.

Hyper-threading can also be a killer.  If you have a 4 core CPU with hyper-threading it shows as 8 cores, but since these are not full cores, setting MaxNoOfExecutionThreads=8 can cause a lot of contention.  In most cases you do not need to turn hyper-threading off, but do not try to give the CPUs more workload than they can handle.
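
As an illustration, for that 4 core hyper-threaded CPU I would stick to the physical core count in config.ini rather than the 8 logical cores the OS reports:

[ndbd default]
MaxNoOfExecutionThreads=4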

Tuesday Feb 23, 2010

MySQL Conference Cluster Tutorial

If you have enjoyed my blog postings about MySQL Cluster please come to the MySQL Cluster tutorial at the 2010 O'Reilly MySQL Conference & Expo.  I shall be giving this tutorial along with my colleagues Geert Vanderkelen and Andrew Morgan.

The outline of this tutorial is as follows:

"This tutorial aims to guide normal MySQL users and DBAs into the world of MySQL Cluster. From installing and configuring to creating your first clustered table and finally node failure handling. At the end of the session you’ll will be the proud owner of a full blown, tiny MySQL Cluster which you can show off at work. The data nodes will gently keep you lap warm when the air conditioning is set too low, and the cluster log will help you through sleepless nights."

So it is going to be a fun day and hopefully you will learn a few things along the way :)

You can register using the discount code mys10fsp which will give you 25% off on top of any early bird registration discounts.

If you are coming to the conference but cannot make the tutorial I will be happy to talk to anyone about MySQL Cluster (or anything else I might know).  I look forward to seeing you all there!

Sunday Feb 21, 2010

MySQL Cluster Disk Data Tables

MySQL Cluster is well known as an in-memory database, which can be restrictive (a server typically has a RAM limit), so it may surprise some to learn that you can also store MySQL Cluster data on disk.  Disk Data Tables have been in MySQL Cluster for quite a while; the first GA release to include them was 6.2.  They do have caveats, which I will describe here; many of these are already planned to be improved in future versions.

How Disk Data Tables work

A set of UNDO logs and data files is created on every data node in the cluster.  During a node restart the UNDO logs bring the disk data back to a consistent point, from which the REDO logs can then be replayed.

Not all data in a disk table is actually stored on disk.  The two main things that are not are indexed columns (the entire column is in main memory, not just the index) and the first 256 bytes of a BLOB/TEXT column.  Also, columns on disk are fixed length, so your VARCHAR(255) will be the equivalent of a CHAR(255).  There are plans to change this in future releases.

How to create Disk Data Tables

To create a Disk Data Table you first need two things: a Logfile Group containing UNDO files and a Tablespace containing data files.  Each cluster can only have one Logfile Group but can have multiple Tablespaces.  There are in fact two ways to create Logfile Groups and Tablespaces.  The first is using config.ini; with this method the files are created automatically upon an initial start of a data node.  To do this you need to add the following to the [ndbd default] section of the file:

InitialLogFileGroup = name=lg_1; undo_buffer_size=64M; undo1.log=150M; undo2.log=200M
InitialTablespace = name=ts_1; extent_size=1M; data1.dat=1G; data2.dat=2G

These examples show two files in each grouping but you can have one or more in each group.  The extent_size is optional and defaults to 1M if omitted.

The second way of creating these is to use the MySQL client, the following example will create the same as the config.ini settings above:

mysql> CREATE LOGFILE GROUP LG1
    ADD UNDOFILE 'undo1.log'
    INITIAL_SIZE 150M
    UNDO_BUFFER_SIZE 64M
    ENGINE NDBCLUSTER;

mysql> ALTER LOGFILE GROUP LG1
    ADD UNDOFILE 'undo2.log'
    INITIAL_SIZE 200M
    ENGINE NDBCLUSTER;

mysql> CREATE TABLESPACE ts_1
    ADD DATAFILE 'data1.dat'
    USE LOGFILE GROUP LG1 
    EXTENT_SIZE 1M
    INITIAL_SIZE 1G
    ENGINE NDBCLUSTER;

mysql> ALTER TABLESPACE ts_1
    ADD DATAFILE 'data2.dat'
    INITIAL_SIZE 2G
    ENGINE NDBCLUSTER;

The ENGINE NDBCLUSTER is required because it may be possible in the future to have Logfile Groups and Tablespaces for other engines.  It is worth noting here that the undo_buffer_size cannot be increased once created; the only way of changing it is to first drop all the Tablespaces and the Logfile Group (and therefore all your disk data).  This is something else we are working on improving.

Finally to create a disk table you need to do:

CREATE TABLE table_1 (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    a VARCHAR(50) NOT NULL,
    b DATE NOT NULL
    )
    TABLESPACE ts_1 STORAGE DISK
    ENGINE NDBCLUSTER;

Or if the table already exists it can be moved into a tablespace using:

ALTER TABLE table_1 TABLESPACE ts_1 STORAGE DISK ENGINE NDBCLUSTER;

Usage statistics

There is a very easy way to see usage statistics for the data and UNDO files: this information is all in the INFORMATION_SCHEMA.FILES table.  It lists the details for each group and each file on each node, so you may want to use an SQL query to limit this information.  Some great examples can be seen in the manual page.
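
As a rough sketch, something like this shows how full each data file is (the EXTRA column identifies which node each row refers to):

mysql> SELECT FILE_NAME, EXTRA,
    FREE_EXTENTS * EXTENT_SIZE / 1024 / 1024 AS free_mb,
    TOTAL_EXTENTS * EXTENT_SIZE / 1024 / 1024 AS total_mb
    FROM INFORMATION_SCHEMA.FILES
    WHERE FILE_TYPE = 'DATAFILE';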

Things to watch for

The data node has the config.ini option DiskPageBufferMemory, which acts very much like InnoDB's innodb_buffer_pool_size setting, so it is best to set this as large as possible for optimal performance.  If it is too low and the disk data is on the same file system as the REDO logs, this can lead to GCP Stop.

So, to prevent GCP Stop in this case you can put the disk data on a different file system to the REDO logs; this can be done with the FileSystemPathDD, FileSystemPathDataFiles and FileSystemPathUndoFiles configuration options.  Alternatively a path can be provided when you add the files.
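
A config.ini sketch of the first approach (the paths are purely illustrative):

[ndbd default]
FileSystemPath=/data/ndb
FileSystemPathDD=/diskdata/ndb
# Or split the data files and UNDO files individually:
# FileSystemPathDataFiles=/diskdata/ndb/data
# FileSystemPathUndoFiles=/diskdata/ndb/undo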

The undo_buffer_size is taken out of SharedGlobalMemory, so that setting will typically need to be increased to accommodate the UNDO buffer.  If there is not enough, creating the Logfile Group in the MySQL client will fail with the following error:

ERROR 1528 (HY000): Failed to create LOGFILE GROUP

With SHOW WARNINGS you will then see:

Got error 1504 'Out of logbuffer memory' from NDB

This is a good general point for many errors that occur in MySQL Cluster: the SHOW WARNINGS command will often give more details on an error, which can help figure out the cause.
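
The fix is to raise the setting in config.ini; as a sketch (the value is illustrative and needs to cover your undo_buffer_size plus the other users of this memory):

[ndbd default]
SharedGlobalMemory=384M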

Friday Feb 05, 2010

MySQL Cluster Uses

MySQL Cluster can be used as a general purpose transactional storage engine, but if you convert all your InnoDB tables to it and connect your application straight to it you may not see the performance you were hoping for.  This is because MySQL Cluster was originally designed for real-time telecommunications applications (such as RADIUS servers).  It has slowly been modified to become more general purpose, and improvements are being made every day, but there are still some performance considerations which go with this.  In some cases tweaking your schema and/or queries can help performance dramatically, so I shall try and outline some of the things to watch for here.

Indexes

The fastest type of lookup you can do in Cluster is a primary key equality lookup (i.e. SELECT * FROM table WHERE pkey = 2).  This is because the primary key is stored as a hash index as well as an optional ordered index.  This hash index is used to partition the data between the data nodes, and MySQL is smart enough to process the hash and go directly to the data node holding the data.

When running a query which uses an ordered index (or unique hash index) the query is sent to the Transaction Coordinator in one node which then asks the Local Query Handlers in one or more nodes to return rows that match this query.  The Transaction Coordinator is aware of the indexes so knows which nodes to ask to process the query.  This in general is slightly slower and can actually perform worse for ordered indexes as more data nodes are added because more Local Query Handlers need to be contacted.

Finally if the query is a table scan the Transaction Coordinator must ask all the Local Query Handlers to search for the data.  This is much slower.
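
As a quick illustrative check (the table and column names here are hypothetical), EXPLAIN will show which access method a query uses:

mysql> EXPLAIN SELECT * FROM subscribers WHERE id = 2;                -- primary key lookup
mysql> EXPLAIN SELECT * FROM subscribers WHERE last_name = 'Smith';  -- full table scan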

Joins

Joins in MySQL Cluster do not currently perform well.  Internally, the second table must be queried to match every row returned by the first table, which can mean a lot of network traffic and slow things down.  There is work in progress to improve this by pushing the join condition down into the data nodes, which will give a massive performance increase when using joins.  More information about this (called SPJ) can be seen in Jonas' blog.

BLOBs and TEXT

As I have already mentioned in this blog, BLOB (and TEXT) columns require a separate table to hold most of the BLOB data.  This can cause performance problems and locking issues, so if possible VARCHAR or VARBINARY should be used instead.

Transactions

MySQL Cluster currently only supports READ COMMITTED transaction isolation level so if you are used to REPEATABLE READ that you typically get from engines such as InnoDB you may want to check that this will not cause problems with your application.

In general we recommend transactions should be short, limited to just a few queries.  MySQL Cluster can give temporary errors for a number of reasons; these do not happen often, but they mean that the entire transaction needs to be started again.  For many applications it will be very hard to replay a transaction which has been running for a whole hour.  Also, committing very large transactions can lead to GCP Stop.

Networking

Data nodes must be on the same private network, period.  Preferably on the same switch.  The protocol the nodes use is insecure, so it must not be on a public network (with the exception of the mysqld servers).  Latency can also be a problem for data nodes: if there is a short period of high latency, or latency is increased in general because the data nodes are in physically different locations, you will have performance problems and, worse, regular node failures are to be expected.  Finally, if the network has other traffic running on it this can also cause performance problems.

Sunday Jan 31, 2010

MySQL Cluster Start Phases

When MySQL Cluster data nodes start they need to go through a process of determining roles in the cluster, copying the data back into RAM and synchronising everything up.  This can take longer than expected and the process is not always very verbose.  So in this post I will outline a simplified version of what MySQL Cluster is doing in each start phase so that you can see why it can take time.

You can see the last completed start phase by following the management logs, or you can see the current start phase by issuing 'x status' in ndb_mgm, where 'x' is the node ID of the data node you want the information for.
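
For example, to get the status of data node 2 (or all data nodes) non-interactively:

ndb_mgm -e "2 STATUS"
ndb_mgm -e "ALL STATUS"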

Start Phase 0

This is part of the basic initialisation, and the configuration is received.  If this is an initial start then the node's file system is created.

Start Phase 1

All the blocks that are not already running are initialised in this phase.  Connections to other data nodes are established and the nodes determine which one is to be the president node (coordinating things during the start).  If a node stalls in this phase it usually means that the nodes can talk to the management node but not to each other.

Start Phase 2

The state of all other nodes is checked in this phase, and at this point a new master may be elected; this is typically the president or the node with the newest complete Global Checkpoint.  The master node coordinates things such as Global Checkpoints.

Start Phase 3

A lot of decisions are made in this phase as to what kind of start is required.  Some block communications are also set up, such as between DBLQH and DBTC.

Start Phase 4

The real workload starts here!  The schema and LCP (Local CheckPoint) data is loaded back into memory, which can take a little time if you have a lot of in-memory data. In an initial start the REDO log files are created.

Start Phase 5

Until very recently this used to typically be the slowest phase; in some cases it could even take hours!  The REDO log is processed in this phase, and if there were a lot of long transactions in the log the backwards searching required could take a long time.  This is now heavily optimised in 6.3.28 and 7.0.9; in my own limited testing I have seen a 60x start speed improvement thanks to these changes, but your mileage may vary.

Other things are also started in this phase, such as LCPs, as well as copying data from other nodes if our node is missing some (or all) of its data.

Start Phase 6

This phase is mostly just to set up the node groups.

Start Phase 7

Primarily, the president selects a node as an Arbitrator.  Minor things also happen, such as switching from DiskCheckpointSpeedInRestart to the normal DiskCheckpointSpeed.

Start Phase 8

Indexes are rebuilt in this stage.  This is now multi-threaded in 6.3.30 and 7.0.10 with the new BuildIndexThreads option, which can give much better performance on very large tables.
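
A config.ini sketch of the option (the value is illustrative; it should not exceed the CPU cores you can spare):

[ndbd default]
BuildIndexThreads=4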

Start Phase 9

Some final initialisation of variables.

Start Phase 101

The node is now responsible for the primary replica of its data.

Start Complete!

This node can now accept transactions!

As you can see there have been some changes recently which can greatly improve the performance of starting a node (in some cases it is now down to mere seconds).  This should also help you figure out roughly where the problem is happening if a node is taking excessively long to start or fails during start.

Friday Jan 29, 2010

Blobs in MySQL Cluster

If there is one thing that confuses people about tables in MySQL Cluster (including me at times) it is BLOB/TEXT columns.  When NDB was originally created it was not designed to handle BLOB data, so BLOB handling was difficult to implement and sometimes does not behave exactly as users expect.

How MySQL Cluster BLOBs work

When you create a table in MySQL Cluster which has a BLOB column, the first 256 bytes of the BLOB are stored in the main table (and kept in memory, even when using disk data tables); the rest of the data is stored in a hidden table, typically split into 2KB rows.  This means there is an extra table for every BLOB or TEXT column in your main table (and extra resource usage).

BLOB locking in MySQL Cluster

These extra tables can cause some problems, firstly with performance (retrieving BLOB data is not very fast) and more importantly with locking.  MySQL Cluster works at the READ COMMITTED transaction isolation level, but keeping the main table and the hidden BLOB table in sync and consistent makes this difficult.  So when selecting a row which has a BLOB, MySQL Cluster needs to take a shared lock on that row, and when updating that row an exclusive lock is needed.

This can be a problem if, for example, you update a row in one transaction and select it in another at the same time.  The select will wait for the update to complete because it cannot obtain a shared lock until the exclusive lock is released.  This can then lead to the temporary error 'Time-out in NDB, probably caused by deadlock'.

Finally, there are certain settings that may need to be increased to handle the large amount of data in a big BLOB, most notably SendBufferMemory and, in the case of ndbmtd, LongMessageBuffer.
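
A config.ini sketch of increasing both (the values are illustrative, not recommendations):

[tcp default]
SendBufferMemory=4M

[ndbd default]
LongMessageBuffer=16M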

The moral of this story?

1. If you can, use VARCHAR/VARBINARY instead; this will avoid these problems (see the sketch after this list)
2. Be very careful about writing your applications with BLOBs; if they are large it may well be better to store them separately on a SAN and have your application retrieve them
3. Keep your transactions short so locking time is kept to a minimum
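
For the first point, a minimal sketch of such a conversion (the table and column names are hypothetical; make sure the VARCHAR is large enough for your data and still fits within NDB's row size limit):

mysql> ALTER TABLE messages MODIFY body VARCHAR(2000) NOT NULL;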

Thursday Jan 28, 2010

MySQL Cluster GCP Stop

One of the most common errors we come across whilst supporting MySQL Cluster is an error commonly referred to as 'GCP stop'.  These errors occur most frequently in cluster setups which have high activity, and more often than not use disk data.  So let's look into what they are, why they happen and how to prevent them.

What is a GCP Stop?

All data that needs to be written to MySQL Cluster is first written to the REDO log; this is so that when a node starts, the log can be played back from the position of the last good LCP (Local CheckPoint, a point at which all of the cluster's in-memory data is written to disk).  The REDO data needs to be consistent between all data nodes, and that is where the GCP (Global CheckPoint) comes in: it synchronously flushes the REDO data to disk across all data nodes every 2 seconds (by default).  A GCP stop happens when a new GCP is trying to commit the REDO to disk and the previous one has not finished.  MySQL Cluster is a real-time database, so this is a critical problem and the node in question is shut down to preserve data integrity.

Why does a GCP Stop happen?

A GCP stop usually happens for one of two related reasons: either there is too much data committed between GCPs for it all to be written to disk in time, or the disks are simply too slow.

You should now be able to see why this is more prominent on clusters using disk data: both the disk data and the GCP are written to disk at the same time (as well as things like the LCP), lowering the disk bandwidth available for the GCP.

This is also more common on multi-threaded data nodes (ndbmtd) in MySQL Cluster 7.0 because these can handle more data simultaneously and therefore can be in a situation where they need to write more to the REDO log.

How to prevent a GCP Stop

There are several effective ways to prevent a GCP stop:

1. Buy faster disks - may not be an option but if the data is written faster this can prevent a GCP Stop
2. Spread the different parts of the data node onto different disks - the REDO, LCP and disk data can all be separated onto different disks, giving a much better disk I/O bandwidth to each
3. Commit more often - if you have a really long transaction with lots of data this could create a commit which is too large for one GCP
4. Configuration - there are some settings you can tweak to improve things, but these will only give small improvements over the above three points (see the sketch after this list).  Decreasing TimeBetweenGlobalCheckpoints makes the data node GCP more often, so there is less to write to disk per checkpoint (but checkpointing more often also leaves less time per checkpoint, so this is not always a good option).  There are also settings affecting disk usage outside of the GCP, such as DiskPageBufferMemory: increasing this buffers more disk data (much like innodb_buffer_pool_size for InnoDB), reducing the disk bandwidth the disk data uses so that the GCP can use more.
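
As a sketch of the configuration route in config.ini (the values are illustrative only, not recommendations):

[ndbd default]
TimeBetweenGlobalCheckpoints=1000
DiskPageBufferMemory=4G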

There are other settings that can be tweaked as a last resort depending on what kind of GCP Stop occurs (yes, there are a couple of different types) but the first three points should be a primary concern before thinking about doing this.

If you have any problems with GCP Stop I highly recommend asking on the MySQL Cluster forum or the MySQL Cluster mailing list.

About

LinuxJedi is an ex-MySQL Senior Technical Support Engineer who previously worked at Oracle and specialised in MySQL Cluster as well as C/C++ APIs.
