By Henrik Ingo on Aug 05, 2008
While we were mostly excited to become part of Sun, there was at least one drawback too. MySQL had a worldwide policy of "Scandinavian style" 5 week vacations per year (4 Summer, 1 Winter week). Sun doesn't have such a worldwide policy, so most of us lost some vacation days there. But there is one group of employees that still enjoy this benefit... those of us that actually live in Scandinavia!
Those 4 weeks are however now gone for my part :-( So I thought I'll make a soft start to working by a) yesterday doing almost nothing and b) today blogging. A long long time ago before I left for my vacation Mark Callaghan requested more blog posts about MySQL Cluster, so I thought why not start with one right now. As previously noted there are many good MySQL Cluster bloggers out there, but admittedly some of their blog postings can be very deeply technical. With this one however, I just want to highlight some basic aspect of how MySQL Cluster works and in fact how it's performance differs from InnoDB and MyISAM based databases. (But it must be added, if you want "getting started" level material, then Johan Andersson's blog with posts like A great config.ini and The RTFM is a must read.)
Some time in June - this was before my vacation, a long time ago I almost cannot remember - I was getting a MySQL Cluster setup up to speed with a customer. Before getting to the real data, the customer had decided to "benchmark" his setup with simply inserting 100 000 integers in a batch:
CREATE TABLE t (i INT) engine=ndbcluster;
INSERT INTO t (1);
... (100000 times)
This test came out with MySQL Cluster looking very bad, more than 2 minutes. Same on InnoDB took 3.7s. (The InnoDB result also looked unrealistic first, on a standard PC you wouldn't get anywhere near that fast, but since those RAID controllers have a DDR RAM chip as write cache, it basically made this an in-memory operation of size 400kB or so.)
So what to do?
MySQL Cluster lesson 1: Batching, batching!
The major cause for bad performance in MySQL Cluster, is often the fact that cluster nodes are connected to each other over a network. This introduces latency and overhead in every single transaction you do - typically at least 3 nodes need to communicate with each other, send the data, agree that it can be committed, etc.... The way to minimize this problem is batching: If you send more data in one batch, the effect of tcp/ip overhead is smaller.
In this particular example of course the effect of tcp/ip overhead was at worst: We were sending only one INT at a time! So we turned on transaction_allow_batching and turned off ndb_force_send. Also remember to turn off AUTOCOMMIT, so there actually is a transaction to be batched!
MySQL Cluster lesson 2: Parallelize, parallelize!
The great thing with MySQL Cluster is it's write scalability, which (for "suitable applications") is practically infinite. With normal MySQL you can easily get read scalability with things like replication, but write scalability is a tougher nut to crack. In MySQL Cluster, you add more data nodes and you get more write throughput. The partitioning is internal to the cluster, so you don't have to mess with shards or anything in your application, one mysql client connection is all you need. Live installations of MySQL Cluster (such as HLR, the heart of a mobile phone network) will do over 100k writes per second.
But to actually get there, you need to also parallelize your operations. Inserting 100k INTEGERs in one batch should be fast, but separating it into 100 batches should be lightning fast, since they can all happen at the same time and MySQL Cluster can absorb that load.
So with these changes... We got from 2 minutes to 50 seconds. Still not 1 second like our marketing material claims!
MySQL Cluster lesson 3: Why PRIMARY KEY is important
With all of my own tricks used, I actually had to turn to my more senior collagues for advice what to do next. One reply was the jackpot:
The thing is, this simple example didn't have a primary key defined. In real life situations you of course always have the primary key, and that is also the reason I had forgotten this important piece of advice: With MySQL Cluster you must always always always remember to use it even for silly examples like this. The reason is that MySQL Cluster (by default) uses the primary key for its internal partitioning between data nodes. So if there is no primary key, it will automatically add a hidden AUTOINCREMENT column to act as one. This leads to 2 problems:
1) Since you are not using an explicit primary key, the mysqld node will end up randomly connecting to whatever ndbd node it happens to choose, from which the transaction is re-routed to the right data nodes. This unnecessary adds (doubles!) network latency.
2) When inserting, fetching the unique autoincrement value acts like a global lock for all inserts. You can parallelize all you want, but it doesn't get any faster. (There is ndb_autoincrement_prefetch_sz to prefetch a number of keys in a batch, but still, always always use primary key.)
In this example, parallellizing actually made performance much worse from the 50 seconds we were at, since the multiple threads were now all competing for the autoincrement lock.
Adding an explicit primary key to the table and the inserts, solved the problem and got the time down to 10 seconds and with some further tuning to 5 seconds. This is still not lightning fast, but at this point the customer was happy and chose to focus on their real application instead. (After the tuning leading to 5 seconds, next significant improvements would probably have been gained from omitting SQL and using the native C++ or Java NDB API and switching from Ethernet to the proprietary Dolphin SCI interconnects. Both of these further remove latency.)
So to summarize: Batch, parallelize and always always define the primary key.
While this has been a trivial example of something not relevant for the real world, I hope it has helped highlight some behavioural charasteristics of the MySQL NDB Cluster.
PS. In June there was still a lot of confusion around the new version numbering with MySQL NDB Cluster. Are you all clear on that now, or do we need to go through it once more in another blog post?