Saturday Mar 14, 2009

Script to start lots of MySQL Cluster nodes on Sparc CMT

I recently had the chance to play with the new T5140 servers. Built on the Sparc CMT architecture, these servers present an amazing 128 CPUs for you to use (as a combination of cores and compute threads; there are only 2 sockets).

We are doing some trials with eager Sun customers who want to utilize these babies. The good news is that MySQL Cluster 7.0 (aka 6.4) will support a multi-threaded data node option. The bad news is that one ndbd process still only uses about 8 CPU cores, so there is some way to go before utilizing all 128: we still have to launch many ndbd processes to get the full power out of these boxes. But with 7.0 there is at least a point in trying at all.

I developed a simple script which lets me easily start a varying number of ndbd and mysqld processes on one host (and then copy the script to start the same number of processes on another host). If you have been using Johan's excellent severalnines scripts, I should explain that here I'm trying to do exactly the opposite of what those do. The benefit of the severalnines scripts is that you can comfortably start and manage the whole cluster from one single command line; they will ssh into your other servers for you and execute the needed commands. This script does not do that; the point is simply to make it easy to start 6 ndbd and 6 mysqld processes on the same server I'm logged in to.


# This is a simple script to start many ndbd and mysqld processes on one host.
# It is useful for the newest Sparc CMT architectures where you may want to
# start many processes per physical server.
# Copyright 2000-2008 MySQL AB, 2008 Sun Microsystems, Inc.
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; version 2 of the License.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program; see the file COPYING. If not, write to the
# Free Software Foundation, Inc., 51 Franklin St, Fifth Floor, Boston
# MA 02110-1301 USA.


# The node-ids of the data nodes to start: odd on one host, even on the other
NUM_NDBD='3 5 7 9 11 13'
#NUM_NDBD='4 6 8 10 12 14'

# I also had some code to circulate ndbd's to bind to 3 different
# NIC's, but this is omitted here for simplicity.

# How many mysqld's to start on each host. In this case we don't need
# cluster node-ids; instead these numbers are used in the datadir
# pathname and in the port and socket each mysqld listens on.
NUM_MYSQL='01 02 03 04 05 06'

# Whether to start the data nodes with --initial or not
# ($INITIAL is used below; the assignment was lost, so it is reconstructed here)
#INITIAL=--initial
INITIAL=
# Large memory pages Solaris optimization
#export LD_PRELOAD
#export MPSSHEAP

# MySQL Cluster 6.4 management daemon tries to store configuration state,
# which is annoying when you fiddle with config.ini a lot, so delete the
# stored configs.
rm mysql-cluster/ndb_1_config.bin.1
libexec/ndb_mgmd --ndb-nodeid=1 --config-file=/etc/mysql/config-kludge.ini

sleep 4

for i in $NUM_NDBD
do
  libexec/ndbmtd $INITIAL --ndb-nodeid=$i --ndb-connectstring=$HOSTNAME:1186 &
done

sleep 10

for i in $NUM_MYSQL
do
  mkdir var/var$i
  bin/mysql_install_db --datadir=var/var$i >/dev/null
  chown -R mysql var/var$i
  bin/mysqld_safe --datadir=${INSTALLDIR}var/var$i/ --port=33$i --socket=/tmp/mysql.$i.sock &
  # Not needed, running with skip-grant-tables instead
  #bin/mysql --socket=/tmp/mysql.$i.sock -u root -e "GRANT ALL ON *.* TO 'root'@'%'"
done

Oh, you want to hear results from my tests? Sorry, maybe later, gotta go now...

Monday Dec 15, 2008

How much DataMemory+IndexMemory do you need for disk data?

One thing we were guessing at with Massimo yesterday was: if you store
large blobs as disk data, how much DataMemory and IndexMemory will they
consume (the primary key, the hidden primary key of each blob "chunk",
the first 256 bytes of each blob stored in memory...)?

My empirical test showed that RAM equal to about 2% of the total size of the
blobs is needed (25% of that being IndexMemory).

IMHO the percentage is close to negligible, but in many situations the
absolute amount is not negligible at all (with close to a TB of disk data,
you need some 20 GB of RAM just for the disk data).

Also note that this is a minimum figure. If you actually have anything
other than the blob (like other indexes), you will of course use much more RAM.

The test was:
CREATE TABLE `jpgtest` (
  `id` int(11) NOT NULL,
  `jpg` blob,
  PRIMARY KEY (`id`)
) TABLESPACE ts_1 STORAGE DISK ENGINE=ndbcluster;
-- (the tail of the statement was truncated; the PRIMARY KEY and
-- TABLESPACE/STORAGE DISK clauses are reconstructed)

and inserting 100k blobs into that table (7+ GB in total).

Details below.

PS: Note that Johan just posted several excellent posts on using MySQL Cluster disk based data:

****** Loading 100k files as blobs into an NDB disk data table. ******
(Simple test, one datafile, one insert thread, etc...)

GRANT ALL ON *.* TO 'root'@'';

-- tablespace and undo log (the logfile group and tablespace names were lost;
-- lg_1 and ts_1 are reconstructions, sizes match the ls output further down)
CREATE LOGFILE GROUP lg_1
  ADD UNDOFILE 'undo_1.dat'
  INITIAL_SIZE 128M
  ENGINE ndbcluster;

CREATE TABLESPACE ts_1
  ADD DATAFILE 'data_1.dat'
  USE LOGFILE GROUP lg_1
  INITIAL_SIZE 10G
  ENGINE ndbcluster;
use test;

CREATE TABLE `jpgtest` (
  `id` int(11) NOT NULL,
  `jpg` blob,
  PRIMARY KEY (`id`)
) TABLESPACE ts_1 STORAGE DISK ENGINE=ndbcluster;
-- (the tail of the statement was truncated; the PRIMARY KEY and
-- TABLESPACE/STORAGE DISK clauses are reconstructed)

-bash-3.2$ cat

use DBI qw(:sql_types);   # exports the SQL_INTEGER / SQL_BLOB constants

$hostname = "ndb05";
$database = "test";
$tablename = "jpgtest";
$user = "root";
$pw = "";

$dsn = "DBI:mysql:database=$database;host=$hostname;port=3306";

$dbh = DBI->connect($dsn, $user, $pw);
$drh = DBI->install_driver("mysql");

$n = 100000;

# Slurp the whole jpg file into $jpg
open FH, 'fakepic.jpg';
binmode FH;
undef $/;
$jpg = <FH>;
close FH;

# (The insert loop was truncated in the original listing; reconstructed here)
$sth = $dbh->prepare("INSERT INTO jpgtest VALUES (?, ?);");
for ($i = 1; $i <= $n; $i++) {
    $sth->bind_param(1, $i, {TYPE => SQL_INTEGER});
    $sth->bind_param(2, $jpg, {TYPE => SQL_BLOB});
    $sth->execute();
    print "$i\n";
}

-bash-3.2$ ls -lFh
total 108K
-rw-r--r-- 1 hingo hingo 100K 2008-12-09 16:26 fakepic.jpg
-rw-r--r-- 1 hingo hingo 634 2008-12-09 20:48

Load speed:
real 24m36.396s
user 1m24.002s
sys 0m13.382s

mysql> select count(*) from jpgtest;

77831 records
1476,4 seconds
52,72 rows/sec <<<<<<<<<<<
102404 bytes/row
5398420,02 bytes/sec
5,15 MB/sec <<<<<<<<<<

Data usage:
7970205724 bytes total
7,42 GB total (data inserted) <<<<<<<<<<<

[root@ndb05 mysql]# ls -laiFh /data1/mysqlcluster/ndb_2_fs
total 11G
21200937 -rw-r--r-- 1 root sroot 11G 2008-12-10 12:09 data_1.dat
21200936 -rw-r--r-- 1 root sroot 128M 2008-12-10 12:09 undo_1.dat
(The above means nothing, the sizes are as specified when created. However, it is interesting to note that 10GB in MySQL becomes 11GB in the filesystem...)

(This query from Johan's blog)
mysql> select free_extents, total_extents from information_schema.files where file_type='datafile';
+--------------+---------------+
| free_extents | total_extents |
+--------------+---------------+
|         5284 |         10240 |
|         5284 |         10240 |
+--------------+---------------+
(Interesting... Why are more than half of my extents still free,
even if I inserted 7 GB into a 10 GB file? Something in this is not right...)

-bash-3.2$ tail ndb_1_cluster.log
2008-12-10 12:11:40 [MgmSrvr] INFO -- Node 3: Data usage is 11%(3704 32K pages of total 32768)
2008-12-10 12:11:40 [MgmSrvr] INFO -- Node 3: Index usage is 28%(4725 8K pages of total 16416)

3704 pages
32 KB/page
118528 KB
115,75 MB

4725 pages
8 KB/page
37800 KB
36,91 MB

RAM vs datafile ratio:
7,42 GB of largish blobs will use about
152,66 MB of RAM (indexes, hidden "chunk" indexes, beginning of each blob is in RAM...)
Conclusion: Allocate
2,01 %
of the size of your disk data blobs for RAM!

24,18 % of that is IndexMemory
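The arithmetic above can be double-checked with a few lines; this is just a sketch, with the page counts taken from the management log lines quoted earlier:

```python
# DataMemory and IndexMemory pages as reported in ndb_1_cluster.log
data_pages, data_page_kb = 3704, 32    # 32K pages
index_pages, index_page_kb = 4725, 8   # 8K pages
blob_bytes = 7970205724                # total blob data inserted

data_mb = data_pages * data_page_kb / 1024.0
index_mb = index_pages * index_page_kb / 1024.0
ram_mb = data_mb + index_mb

print(round(ram_mb, 2))                                   # 152.66 MB of RAM
print(round(ram_mb * 1024 * 1024 / blob_bytes * 100, 2))  # 2.01 (% of blob size)
print(round(index_mb / ram_mb * 100, 2))                  # 24.18 (% of that is IndexMemory)
```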

Friday Nov 28, 2008

How to use JDBC (Connector/J) with MySQL Cluster

Last week I helped a customer set up a JBoss application against MySQL Cluster. It turns out it is not immediately obvious how you should set up our JDBC connector to do load balancing and failover. For instance, setting the connector up for a Master-Slave setup (with MySQL Enterprise) is well documented, but doing the same with MySQL Cluster is not.

It's not really properly documented in the manual itself, but I found in the changelogs, and confirmed on IRC, that to do load balancing across the SQL nodes in MySQL Cluster, you use a different JDBC connection string with the "loadbalance" keyword added...
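With two SQL nodes it looks something like this (the host names and database name here are placeholders):

```
jdbc:mysql:loadbalance://sqlnode1:3306,sqlnode2:3306/test
```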


That does indeed load balance; however, it doesn't properly handle failover. When an SQL node is killed, the driver still tries to round-robin queries to all specified hosts, resulting in exceptions 50% of the time (with 2 nodes, that is).

After further digging (in fact, my colleague got this from Mark Matthews himself) I finally found out that the correct string to use is:
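Something along these lines (the hosts, database name, and the 5000 ms blacklist timeout are example values):

```
jdbc:mysql:loadbalance://sqlnode1:3306,sqlnode2:3306/test?loadBalanceBlacklistTimeout=5000
```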


The "loadBalanceBlacklistTimeout" adds the needed feature that failed connections in a connection pool are put aside for the specified time, and only working connections are utilized.

That's all that is needed. It is simple and beautiful once you get it to work!

Update: I should add that transactions that are running while a node crashes will still roll back and return an exception. This is by design, and it is then up to the application to decide whether to give up or retry. If you retry the transaction, the JDBC driver will pick a working SQL node for the second try.

Tuesday Aug 05, 2008

MySQL Cluster basics: Why using a primary key is even more important

While we were mostly excited to become part of Sun, there was at least one drawback too. MySQL had a worldwide policy of "Scandinavian style" 5 week vacations per year (4 Summer, 1 Winter week). Sun doesn't have such a worldwide policy, so most of us lost some vacation days there. But there is one group of employees that still enjoy this benefit... those of us that actually live in Scandinavia!

Those 4 weeks are however now gone for my part :-( So I thought I'd make a soft start to working by a) yesterday doing almost nothing and b) today blogging. A long long time ago before I left for my vacation Mark Callaghan requested more blog posts about MySQL Cluster, so I thought why not start with one right now. As previously noted there are many good MySQL Cluster bloggers out there, but admittedly some of their blog postings can be very deeply technical. With this one however, I just want to highlight some basic aspects of how MySQL Cluster works and in fact how its performance differs from InnoDB and MyISAM based databases. (But it must be added, if you want "getting started" level material, then Johan Andersson's blog with posts like A great config.ini and The RTFM is a must read.)

Some time in June - this was before my vacation, so long ago I can almost not remember - I was getting a MySQL Cluster setup up to speed with a customer. Before getting to the real data, the customer had decided to "benchmark" his setup by simply inserting 100 000 integers in a batch:

CREATE TABLE t (i INT) engine=ndbcluster;
INSERT INTO t VALUES (...);   -- ...repeated 100000 times, one row per statement

This test came out with MySQL Cluster looking very bad: more than 2 minutes. The same on InnoDB took 3.7 s. (The InnoDB result also looked unrealistic at first, since on a standard PC you wouldn't get anywhere near that speed, but those RAID controllers have a DDR RAM chip as write cache, which basically made this an in-memory operation of some 400 kB.)

So what to do?

MySQL Cluster lesson 1: Batching, batching!

The major cause of bad performance in MySQL Cluster is often the fact that cluster nodes are connected to each other over a network. This introduces latency and overhead into every single transaction you do: typically at least 3 nodes need to communicate with each other, send the data, agree that it can be committed, etc. The way to minimize this problem is batching: if you send more data in one batch, the effect of TCP/IP overhead is smaller.

In this particular example of course the effect of tcp/ip overhead was at worst: We were sending only one INT at a time! So we turned on transaction_allow_batching and turned off ndb_force_send. Also remember to turn off AUTOCOMMIT, so there actually is a transaction to be batched!
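With those settings the insert pattern becomes, roughly, the following; this is a sketch, and the row values and batch boundaries are arbitrary:

```sql
SET autocommit = 0;
SET ndb_force_send = 0;              -- don't flush to the data nodes after every operation
SET transaction_allow_batching = 1;  -- allow batching of operations inside a transaction

BEGIN;
INSERT INTO t VALUES (1),(2),(3) /* ...many rows per statement... */;
COMMIT;  -- the batched operations travel to the data nodes together
```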

MySQL Cluster lesson 2: Parallelize, parallelize!

The great thing with MySQL Cluster is its write scalability, which (for "suitable applications") is practically infinite. With normal MySQL you can easily get read scalability with things like replication, but write scalability is a tougher nut to crack. In MySQL Cluster, you add more data nodes and you get more write throughput. The partitioning is internal to the cluster, so you don't have to mess with shards or anything in your application; one mysql client connection is all you need. Live installations of MySQL Cluster (such as an HLR, the heart of a mobile phone network) will do over 100k writes per second.

But to actually get there, you need to also parallelize your operations. Inserting 100k INTEGERs in one batch should be fast, but separating it into 100 batches should be lightning fast, since they can all happen at the same time and MySQL Cluster can absorb that load.
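The splitting pattern can be sketched in Python. The database call is stubbed out here (a real insert_batch would open its own connection and send one multi-row INSERT), so this only illustrates how the work is divided among parallel workers:

```python
from concurrent.futures import ThreadPoolExecutor

N, BATCHES = 100000, 100
BATCH = N // BATCHES          # 1000 rows per batch

def insert_batch(start):
    # Stand-in for one connection sending a single multi-row INSERT;
    # a real version would run: INSERT INTO t VALUES (start),(start+1),...
    rows = list(range(start, start + BATCH))
    return len(rows)          # pretend these rows were inserted

# All 100 batches run concurrently, just like parallel SQL connections would
with ThreadPoolExecutor(max_workers=8) as pool:
    inserted = list(pool.map(insert_batch, range(0, N, BATCH)))

total = sum(inserted)
print(total)  # 100000
```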

So with these changes... We got from 2 minutes to 50 seconds. Still not 1 second like our marketing material claims!

MySQL Cluster lesson 3: Why PRIMARY KEY is important

With all of my own tricks used, I actually had to turn to my more senior colleagues for advice on what to do next. One reply was the jackpot:

The thing is, this simple example didn't have a primary key defined. In real life situations you of course always have the primary key, and that is also the reason I had forgotten this important piece of advice: With MySQL Cluster you must always always always remember to use it even for silly examples like this. The reason is that MySQL Cluster (by default) uses the primary key for its internal partitioning between data nodes. So if there is no primary key, it will automatically add a hidden AUTOINCREMENT column to act as one. This leads to 2 problems:

1) Since you are not using an explicit primary key, the mysqld node will end up randomly connecting to whatever ndbd node it happens to choose, from which the transaction is re-routed to the right data nodes. This unnecessarily adds to (in fact doubles) the network latency.

2) When inserting, fetching the unique autoincrement value acts like a global lock for all inserts. You can parallelize all you want, but it doesn't get any faster. (There is ndb_autoincrement_prefetch_sz to prefetch a number of keys in a batch, but still, always always use primary key.)

In this example, parallelizing actually made performance much worse than the 50 seconds we were at, since the multiple threads were now all competing for the autoincrement lock.

Adding an explicit primary key to the table and the inserts solved the problem and got the time down to 10 seconds, and with some further tuning to 5 seconds. This is still not lightning fast, but at this point the customer was happy and chose to focus on their real application instead. (After the tuning that led to 5 seconds, the next significant improvements would probably have come from omitting SQL and using the native C++ or Java NDB API, and from switching from Ethernet to the proprietary Dolphin SCI interconnects. Both further remove latency.)
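In SQL terms the fix is simply this (a sketch of the idea, not the customer's actual schema):

```sql
CREATE TABLE t (
  i INT NOT NULL,
  PRIMARY KEY (i)   -- NDB partitions rows across data nodes by this key
) ENGINE=ndbcluster;

-- each insert now carries an explicit key: no hidden autoincrement column,
-- and no global lock on fetching autoincrement values
INSERT INTO t VALUES (1),(2),(3);
```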

So to summarize: Batch, parallelize and always always define the primary key.

While this has been a trivial example not really relevant to the real world, I hope it has helped highlight some behavioural characteristics of the MySQL NDB Cluster.

PS. In June there was still a lot of confusion around the new version numbering with MySQL NDB Cluster. Are you all clear on that now, or do we need to go through it once more in another blog post?


The people of the MySQL Telecom team write about developments around MySQL and MySQL Cluster and how these products are used by our Communications industry customers. (Image jasmic@Flickr)

