MySQL Cluster and NUMA

One problem we are starting to see quite often with MySQL Cluster relates to the current generation of Xeon processors.  This post outlines the problem and how to avoid it.

Nehalem-based Intel Xeons (and some older AMD CPUs) use an architecture called NUMA (Non-Uniform Memory Access).  This gives each CPU its own local bank of memory; memory attached to another CPU can still be reached, but more slowly.  For many applications this means much faster memory access. You should be able to see if NUMA is on by looking for it in dmesg.
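
As an alternative to reading dmesg, here is a minimal Python sketch that counts the memory nodes the kernel exposes.  It assumes a Linux kernel that publishes NUMA nodes under /sys/devices/system/node; the function name numa_node_ids is my own, not part of any tool.

```python
import re
from pathlib import Path


def numa_node_ids(sysfs_root="/sys/devices/system/node"):
    """Return the sorted NUMA node IDs found under sysfs_root, or [] if none."""
    root = Path(sysfs_root)
    if not root.is_dir():
        # Directory is absent on non-Linux systems or kernels without NUMA support.
        return []
    ids = []
    for entry in root.iterdir():
        # Node directories are named node0, node1, ...; skip other entries.
        match = re.fullmatch(r"node(\d+)", entry.name)
        if match and entry.is_dir():
            ids.append(int(match.group(1)))
    return sorted(ids)


if __name__ == "__main__":
    nodes = numa_node_ids()
    if len(nodes) > 1:
        print(f"NUMA is active: {len(nodes)} memory nodes {nodes}")
    else:
        print("Kernel sees at most one memory node (NUMA off or not present)")
```

More than one node in the result suggests the kernel is treating the machine as NUMA.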

So why is this a bad thing?

MySQL Cluster data nodes typically require a large portion of the system memory, which very often means that one CPU will need to access memory attached to another CPU.  This is generally quite slow; on a busy cluster we have seen this access take 100ms - 500ms!  MySQL Cluster is a real-time system and is not a happy bunny when things stop it from running in real time.  As a result, watchdog timeouts are typically very frequent on NUMA-based systems.

So, how can this be improved?

For starters, NUMA is easy to turn off: simply add the kernel boot option numa=off.  We have also observed that later Linux kernels (around 2.6.30) have an improved scheduler for NUMA and appear to be friendlier to MySQL Cluster.  Even so, I would personally recommend turning it off with newer kernels too.
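
After rebooting, it is worth confirming that the option actually reached the kernel.  A small sketch, assuming Linux and its /proc/cmdline file (the helper name numa_off_requested is hypothetical):

```python
from pathlib import Path


def numa_off_requested(cmdline):
    """True if the kernel command line contains the numa=off option."""
    return "numa=off" in cmdline.split()


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")  # Linux-only; skipped elsewhere
    if proc_cmdline.is_file():
        state = "yes" if numa_off_requested(proc_cmdline.read_text()) else "no"
        print(f"numa=off on kernel command line: {state}")
```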

What else can cause problems?

We do not yet have as much data on this, but it is believed that dynamic CPU clocking can cause similar issues.  If the data node is not busy, the CPU is clocked down, which then causes timing issues for the cluster.  I would recommend setting the CPU to full performance settings where possible.

Edit: Mat Keep has confirmed in the comments that dynamic CPU clocking certainly causes performance issues.
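
To see whether any CPU is allowed to clock down, you can inspect the frequency governor of each core.  This is a rough sketch, assuming a Linux system with the cpufreq subsystem exposed under /sys/devices/system/cpu; the function name is my own, and some machines control power saving in the BIOS instead, which this will not detect.

```python
from pathlib import Path


def non_performance_cpus(cpu_root="/sys/devices/system/cpu"):
    """Map CPU name -> governor for every CPU not using 'performance'."""
    result = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(cpu_root).glob(pattern)):
        governor = gov_file.read_text().strip()
        if governor != "performance":
            # gov_file is .../cpuN/cpufreq/scaling_governor, so go up two levels.
            result[gov_file.parent.parent.name] = governor
    return result


if __name__ == "__main__":
    offenders = non_performance_cpus()
    if offenders:
        for cpu, governor in offenders.items():
            print(f"{cpu}: governor is '{governor}', may clock down when idle")
    else:
        print("No CPU reports a governor other than 'performance'")
```

An empty result can also mean that cpufreq is not available at all, so treat it as a hint rather than proof.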

Hyper-threading can also be a killer.  If you have a 4-core CPU with hyper-threading it shows as 8 cores, but since these are not full cores, setting MaxNoOfExecutionThreads=8 can cause a lot of contention.  In most cases you do not need to turn hyper-threading off, but do not try to give the CPUs more workload than they can handle.
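
To find out how many full cores a machine really has, you can compare logical and physical core counts.  The sketch below assumes the x86 Linux /proc/cpuinfo layout (blank-line separated blocks with "physical id" and "core id" fields); count_cores is a hypothetical helper, and using the physical count as a ceiling for MaxNoOfExecutionThreads is my reading of the advice above rather than an official rule.

```python
from pathlib import Path


def count_cores(cpuinfo_text):
    """Return (logical_cpus, physical_cores) parsed from /proc/cpuinfo text."""
    logical = 0
    physical = set()
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue
        logical += 1
        # Hyper-threaded siblings share the same (socket, core) pair.
        if "physical id" in fields and "core id" in fields:
            physical.add((fields["physical id"], fields["core id"]))
    # Fall back to the logical count if the topology fields are missing.
    return logical, len(physical) or logical


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # Linux-only; skipped elsewhere
    if cpuinfo.is_file():
        logical, cores = count_cores(cpuinfo.read_text())
        print(f"{logical} logical CPUs on {cores} physical cores")
        if logical > cores:
            print(f"Hyper-threading appears to be on; consider {cores} as the ceiling")
```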


Great post Andrew

Re the dynamic CPU clocking: Mikael Ronstrom observed that power-save modes heavily impacted performance on MySQL Cluster while running recent Sysbench READ/WRITE tests against the MEMORY engine, with 2.7GHz CPUs downclocked to 800MHz. Once he turned power saving off and made a couple of other minor mods to the config.ini file, he observed 30x higher TPS at 1/3rd the latency for Cluster vs MEMORY, both on a single node.

Posted by Mat Keep on July 12, 2010 at 04:16 AM BST #


Thanks for your comment. I thought this would be the case but couldn't find any evidence at the time.

Posted by LinuxJedi on July 12, 2010 at 04:34 AM BST #

When you say "on a busy cluster we have seen this access take 100ms - 500ms" are you saying that the memory latency can be up to 1/2 of a second? Or should that really be us (microseconds) instead of ms (milliseconds)? Or are you measuring much more than a single memory access here?

How does the performance of Solaris in a similar test situation compare?

Posted by Mike Gerdts on July 12, 2010 at 05:45 AM BST #

Hi Mike,

Yes, memory latency of up to 1/2 a second in worst-case situations when there is a lot of activity (although these are most likely multiple accesses to copy data to/from the send buffer, not just a single block).

I don't know about Solaris; I have not been able to play with Solaris on Nehalem CPUs.

Posted by LinuxJedi on July 12, 2010 at 06:20 AM BST #

Interesting post.

On the Hyper-threading point I remember an article several years ago on a reputable site (PC Magazine maybe?) regarding use of HT on servers, and encouraging server makers to disable it by default.
I'm a little rusty on the particulars, but essentially both the real core and the HT core share the same on-die cache. When the HT thread found the data it wanted wasn't in the cache that could cause a purge and a refill of the CPU cache, wasting precious clock cycles. Through repeated tests the article demonstrated that it rather depended on what you were doing with the hardware as to whether or not you'd see a benefit from having HT on or not, but generally in a server environment it was advised to leave it off.

There is every chance that Intel has fixed this in HT, I honestly don't track the tech closely enough to say.

Posted by Twirrim on July 12, 2010 at 04:33 PM BST #

Note that you cannot really turn NUMA off, since it is a property of the hardware. If you turn off NUMA support in the kernel, what you get is a striped distribution of memory banks over all CPUs (called SUMA). This splits the workload more evenly and therefore the average wait time becomes more uniform. It is still slow.

Ideally your application is either NUMA aware or you install it in such a way that a process only runs inside a single NUMA zone. Turning interleaving on will only be a quick workaround. Luckily the NUMA misses in Nehalem are not so severe. On an Itanium-based machine, for example, I would not go for a cross-domain app with no special NUMA intelligence. I don't know much about MySQL Cluster, but I guess you could try to run multiple instances on a NUMA box and lock them to different NUMA zones.

Lots of information on that topic can be found in Kevin Glosson's blog, although it talks about Oracle Database.


Posted by Bernd Eckenfels on August 09, 2010 at 10:05 AM BST #

Hi Bernd,

Thanks for the additional information. Since MySQL Cluster is typically used to allocate almost the entire system RAM in a single process it does make for a technical challenge.

Making MySQL Cluster NUMA aware would probably be the best solution to this problem, but I'm not sure how trivial it is in practice.

Posted by LinuxJedi on August 09, 2010 at 10:17 AM BST #


LinuxJedi is an ex-MySQL Senior Technical Support Engineer who previously worked at Oracle and specialised in MySQL Cluster as well as C/C++ APIs.

