Thursday Aug 21, 2008

Unlocking MySQL: What's hot and what's not

One of the approaches we are using to look at MySQL scalability is to identify and if possible eliminate hot locks inside MySQL. For locks that cannot be eliminated, we are looking at ways to reduce the mutex hold times, or replace the critical sections with lock-free algorithms. It won't happen overnight, but it needs to be done if we are to make MySQL more scalable.

Along the way we're planning to blog about how we identified and eliminated performance bottlenecks, and to provide you with all the gory details about the process and techniques. This will hopefully entice you all to contribute to fixing performance issues :-) Together we can reach our goal of a highly scalable MySQL much faster! If you have any other ideas about how we can scale the MySQL scalability effort (pun intended), or you would like to bust some locks, please feel free to let us know.

Fixing locks can be very tricky: eliminating a really hot lock sometimes gives only a modest improvement in performance, or no improvement at all, because fixing one hot lock may simply move the bottleneck to the next lock. We have to address all the scalability issues before we actually see a dramatic improvement. A recent example illustrates this point.

LOCK_plugin: Not so hot after all

If you have been following MySQL scalability issues, you must be well aware of the LOCK_plugin lock. LOCK_plugin protects the list of plugins used by MySQL. These plugins can be storage engine plugins, UDF plugins, full text parser plugins, etc. LOCK_plugin got extremely hot in several benchmarks. Analysis of the contention showed that LOCK_plugin was being acquired and released at the end of every query execute as part of sending results to the clients. A closer look at the usage reveals that this lock is held while providing a hint to InnoDB to release any shared locks and release concurrency tickets. The technique used by the MySQL server was to iterate over the plugin list and call ha_release_temporary_latches() for all storage engine plugins. The method used to do this, plugin_foreach_with_mask(), grabs LOCK_plugin while iterating over the plugins list. For more details, check out Bug#34409 "LOCK_plugin contention via ha_release_temporary_latches/plugin_foreach".

Sergei Golubchik recently fixed it in the most elegant way possible -- do not acquire a lock over the plugins list while releasing temporary latches. Looking at Sergei's fix and searching through the source, we see ha_release_temporary_latches() is called by the MySQL server in multiple places without holding the LOCK_plugin lock! We were unnecessarily holding LOCK_plugin! Since holding LOCK_plugin is a side effect of iterating the plugins list, the fix is to not use the plugins list to iterate over storage engines; just iterate over the storage engines that are used in this transaction. What I really liked about the fix is that Sergei identified that there is no need for a lock and got rid of it, rather than trying to break up the lock or apply other half measures. Kudos to Sergei for fixing this in such an efficient manner. If you are using 5.1.28 or 6.0.7-alpha, you already have this fix.
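A toy Ruby model of the idea (not MySQL source; all names here are invented for illustration) contrasts the two approaches -- a globally locked walk of the whole plugin registry versus a lock-free walk of a thread-private, per-transaction engine list:

```ruby
# Toy model of the fix, NOT the actual MySQL code.
require "thread"

PLUGIN_LOCK = Mutex.new
ALL_PLUGINS = [:innodb, :myisam, :udf_metaphone, :ft_parser]  # invented names
ENGINES     = [:innodb, :myisam]

# Old path: take the global registry mutex on every query just to find
# the storage engines among all plugins.
def release_latches_old
  released = []
  PLUGIN_LOCK.synchronize do
    ALL_PLUGINS.each { |p| released << p if ENGINES.include?(p) }
  end
  released
end

# New path: the per-transaction engine list is thread-private, so no
# shared state is touched and no lock is needed.
def release_latches_new(txn_engines)
  txn_engines.dup
end
```

Under load, every query paying for `PLUGIN_LOCK.synchronize` in the old path is what made LOCK_plugin hot; the new path simply has nothing to contend on.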

With this bugfix, we see around a 5% improvement in sysbench read-only tests. The reason we see only 5% and not more is that eliminating this lock made other locks more contended, and they are now the bottleneck. We are one step closer to a more scalable MySQL. Back to lock hunting for now.

Wednesday Jul 23, 2008

Peeling the MySQL Scalability Onion

In this blog I will talk about how we (the Sun/MySQL Performance Team) eliminated the need for a lock to get better scalability with MySQL 5.1.24+.

While comparing sysbench runs using two different versions of MySQL 5.1, I noticed a big difference in system utilization. One version had much more idle time than the other. This difference was much more apparent at high thread counts on machines with many CPUs. A look at the system calls showed a large number of lwp_park system calls; that is, threads were being put to sleep.

# dtrace -qn 'syscall:::entry{@c[probefunc]=count()}' -n 'tick-5s{trunc(@c, 10); exit(0)}'

  p_online                                                       1535
  ioctl                                                          2255
  fcntl                                                          6134
  priocntlsys                                                    6462
  write                                                          6492
  read                                                          12775
  yield                                                         19065
  gtime                                                        313927
  pollsys                                                      321074
  lwp_park                                                     951749
Investigating why threads were being parked showed a very interesting stack trace:
              mysqld`int ha_innobase::info(unsigned)+0x164
              mysqld`bool make_join_statistics(JOIN*,TABLE.. +0x180
              mysqld`int JOIN::optimize()+0x670
              mysqld`bool mysql_select(THD*,Item***,TABLE_.. +0x224
              mysqld`bool handle_select(THD*,st_lex*,selec.. +0xc4
              mysqld`bool execute_sqlcom_select(THD*,TABLE.. +0x1c8
              mysqld`int mysql_execute_command(THD*)+0x380
              mysqld`bool Prepared_statement::execute(Stri.. +0x29c
              mysqld`bool Prepared_statement::execute_loop.. +0xbc
Looking at make_join_statistics() we see that it calls ha_innobase::info() with HA_STATUS_VARIABLE | HA_STATUS_NO_LOCK. A quick search for the definition of HA_STATUS_NO_LOCK shows:
  assuming the table keeps shared actual copy of the 'info' and
  local, possibly outdated copy, the following flag means that
  it should not try to get the actual data (locking the shared structure)
  slightly outdated version will suffice
#define HA_STATUS_NO_LOCK        2

So the MySQL server is requesting that ha_innobase::info() not hold a lock, and the request is being ignored by ha_innobase::info()!
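In other words, the flag should gate the locking. A Ruby sketch of the intended behavior (HA_STATUS_NO_LOCK is 2, as quoted above; the HA_STATUS_VARIABLE value here is illustrative, not taken from the MySQL headers):

```ruby
HA_STATUS_NO_LOCK  = 2   # value quoted in the header comment above
HA_STATUS_VARIABLE = 16  # illustrative value for this sketch

# Sketch of how info() should honor the hint: only lock the shared
# stats when the caller has not said a slightly stale copy will do.
def info(flags)
  if (flags & HA_STATUS_NO_LOCK) != 0
    :return_possibly_stale_stats    # no lock; caller accepts outdated data
  else
    :lock_shared_stats_and_refresh  # lock only when exact data is required
  end
end
```

The bug was that the InnoDB side behaved like the `else` branch regardless of the flag.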

I compared against MySQL 5.0 and saw that this particular lock was not held when ha_innobase::info() was called. Searching through the commit logs I found that this was introduced by Bug#32440.

Quickly hacking the code to revert to the old behaviour gave a big boost in performance, so I filed Bug #38185 "ha_innobase::info can hold locks even when called with HA_STATUS_NO_LOCK". Luckily it's a very small change and a fix is already in progress.

Moral of the story? A simple contended lock can bring down your performance by quite a lot.

Monday Jul 07, 2008

Storage engine or MySQL server? Where has the time gone?

I want to answer a simple question: if a query takes X milliseconds, how much of it is spent in the storage engine, and how much in the MySQL server? Why do I think this is important? Well, since I am working on MySQL performance, I want to be able to place my bets on where to spend my time optimizing MySQL. Let's take an example. If a query takes 50ms, and I am able to figure out that 40ms is in the MySQL server and the remaining 10ms is in the storage engine, the MySQL server is the first place I would look at optimizing. This also tells me that it does not matter what the storage engine is; I am being limited by the server. Similarly, if I find the storage engine taking up most of the time, I can explore alternate storage engines. I know all you MySQL experts will tell me that much of the attribution (of time) has to do with the capability of the storage engine, what features it supports, etc., and that you cannot really do the extrapolation. However, assuming Falcon had all the features of InnoDB, this becomes a valid question to ask: where is my query spending its time?

I can think of two easy ways to answer this question. If you know of any additional methods, please feel free to let me know.

Using DTrace

I could write a DTrace script to intercept all storage engine API calls (both the handler and handlerton) and calculate the total time spent in them. I can then subtract this from the total time for the query and get the time attribution. There are a few difficulties with this method. I am able to write such a script; however, the cost of measurement is quite high. Many of the storage engine API calls take very little time, and hence the cost of intercepting them and gathering statistics distorts things quite a bit. The other problem is that there is no way for one script to work for all cases: since the storage engine API symbols are mangled, the script will not work across different C++ compilers. The solution is to write a script that generates the DTrace script (which is what I have done); however, this does not really answer the question, as I cannot trust the values reported because of the probe effect.

Using Sun Studio Performance Analyzer

Sun Studio Performance Analyzer has an option to look at the time spent inside a specific shared object. Hence, I could build the storage engine as a plugin (for example), run my experiment, and then use the analyzer to figure out the amount of time spent in it, answering my question. I need to try this out. (Thanks to Richard Smith for pointing this out.)

If you know of any other methods please let me know!

Wednesday Jun 25, 2008

Improving filesort performance in MySQL

I recently filed Bug #37359 "filesort can be more efficient" based on some performance work with an industry standard benchmark. Read on if the internals of how MySQL implements filesort interest you.

Filesort, as the name implies, is used to sort records when there is an ORDER BY clause in the query. It has the prefix "file" because temporary files may be used to store intermediate results. filesort() uses a per-thread sort buffer of size sort_buffer_size to sort the table. filesort() is implemented in sql/

filesort may not know how many records need to be sorted, so it asks the storage engine for an estimate of the number of records in the table via estimate_rows_upper_bound(). If the number of records that fit in the sort buffer is less than the estimate, filesort will use temporary files to store intermediate results. The flow is as follows (or at least the important steps for understanding the bug mentioned above):

  1. records = Min(estimate records in table, records that can fit in sort buffer)
  2. Initialize the sort buffer via make_char_array
  3. Do the actual sort.
A pictorial way of looking at this is shown below. The width of each box is proportional to the time taken to execute the function, and the height is the stack depth. Move the mouse over the image to see more details.

Initializing the sort buffer can be very expensive when the estimate for the number of rows is off. For example, in my benchmark, the table has 210 million rows. InnoDB returns 210 million for estimate_rows_upper_bound(). The actual number of rows that fulfill the WHERE clause of my query is 3! For a default sort buffer size of 2MB, the number of rows that fit in the sort buffer is 70,000. So filesort() unnecessarily initializes space for 70,000 rows, and does the sort. You might think initializing space for 70,000 rows is not a big deal, but when it constitutes 10% of the query execution time, it IS a big deal.
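The arithmetic behind those ~70,000 wasted slots can be sketched as follows; the ~30 bytes per sort element (sort key plus row pointer) is an assumption for illustration, not a number from the MySQL source:

```ruby
# Back-of-the-envelope sizing for the filesort buffer initialization.
sort_buffer_size   = 2 * 1024 * 1024   # default 2MB
sort_element_bytes = 30                # assumed: sort key + row pointer
estimated_rows     = 210_000_000       # what estimate_rows_upper_bound() says
matching_rows      = 3                 # what the WHERE clause actually returns

rows_that_fit = sort_buffer_size / sort_element_bytes   # ~70,000
initialized   = [estimated_rows, rows_that_fit].min     # slots set up and sorted
```

Because the estimate dwarfs the buffer capacity, `initialized` lands at the full ~70,000 even though only 3 rows ever show up -- exactly the waste the bug report is about.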

So how can we improve this?

  • Do not use filesort! Rewrite your query to make use of an index, or add an index that helps the optimizer avoid the filesort.
  • For cases where the estimate of the number of rows is wrong, reduce the sort buffer size to a low value (mysql> set global sort_buffer_size=32*1024;)
  • Enhance estimate_rows_upper_bound() to pass down the WHERE clause so that the storage engine can give a better estimate.

Tuesday May 06, 2008

uperf - A network benchmark tool

Heard of filebench? Want something similar for networking? Look no further! Today we opensourced uperf, a tool to benchmark networking performance. uperf, just like its cousin filebench, is a framework that takes a description of a workload/application (called a profile) and generates load to match the profile. uperf is quite heavily used by the performance groups at Sun to study networking performance.

How does uperf help customers?

Often we hear customers complain about not getting good results with their application even though the results with micro-benchmarks were quite good. This is mainly because well known micro-benchmarks do very specific things like measuring bulk throughput or latency and do not really model the customer's application. Using micro-benchmarks does help standardize performance results, but the customer cannot really draw conclusions, based on these standard benchmarks, about how their application will perform. The safest alternative is to run their application to determine the performance. This may be very time consuming and may require a lot of setup. uperf allows the customer to specify a description (profile) of what their application does and stress the system using this description. Using a profile has additional benefits: for example, you could modify the profile slightly to determine the effects of using a different protocol.

How does uperf help developers?

Customers rarely want to share their application, or it is just too time consuming to get their application and measure its performance in our labs. We are often left with a description of what the application is doing. uperf can take [a modified version of] this description and run whatever the real application does. In addition to running the customer's workload, uperf provides additional features:

  • Use threads instead of processes or vice versa
  • Use CPU performance counters to measure certain operations
  • Ability to change the protocol used and/or operation mix to predict performance metrics for a modified application
  • and many other features

So please feel free to check out uperf and contribute code, ideas, suggestions, etc.

Wednesday Apr 02, 2008

Finding optimization opportunities in MySQL by looking at callstacks

Using callstacks to look at code is very useful. If you are not familiar with callstacks, I suggest you read my earlier blog about them. I was trying to understand the MySQL code path using sysbench as the test and found something interesting. An image of the callstack is shown below. The SVG version, with much more information, is also available. The width of each block is proportional to the time it took for the function, and the height is the level (or depth) of the stack.

MySQL uses mysql_execute_command() to execute queries. Looking at the callstack you can see very clearly that mysql_execute_command() calls open_and_lock_tables(), which then tries to open tables via open_table(). The code path gets interesting here. As you can see in the image above (and in the SVG callstack), there is a tight for loop. Looking closely (moving your mouse over the area in the SVG file), you will see that the code is basically traversing a hash list! Not the best thing to do with a hashtable. Looking at the code, it originates at line number 2781 in sql/ Can it be improved? Most probably. Will it help if we improve it? Definitely for some cases, maybe not for others.
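A toy illustration of why this hurts (not the MySQL code): when lookups end up walking one bucket's chain element by element, the hash table degenerates into a plain list scan:

```ruby
# One overloaded hash bucket: every entry has landed in the same chain,
# so finding the last entry costs a probe per element -- O(n), not O(1).
chain = (0...100).map { |i| "table_#{i}" }

probes = 0
chain.each do |t|
  probes += 1
  break if t == "table_99"   # worst case: the entry we want is last
end
```

That linear walk inside a supposedly constant-time structure is exactly the pattern visible in the tight loop in the callstack.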

I found a few more interesting places in the code, but I will save them for my future blogs :-)

Monday Feb 25, 2008

Maximizing Sysbench OLTP performance for MySQL

Sysbench is a popular open source benchmark used to measure the performance of various parts of the operating system and of one application: the database. Since we are talking about MySQL, I will concentrate on the database part of the test.

The oltp test in the sysbench benchmark creates a single table and runs a set of queries against it. Each row in the table is around 250 bytes, and by default it creates 1000 rows. For our experiment we used 10 million rows. Allan has blogged about the details of the experiments; I will present an alternate view of them.

The trick to getting good numbers with Sysbench and MySQL is very simple:

  1. Maximize CPU utilization
  2. Reduce delays due to IO

Maximize CPU Utilization

It is very important to utilize all the resources in the system to get maximum performance. If you have a multi-core machine (many processors already have multiple cores), this means you will have to execute several threads in parallel to fully utilize all the cores in the system. For Sysbench, this is achieved using the --num-threads argument. We got the best performance when the number of threads was close or equal to the number of cores in the system. You also need to increase the number of threads allowed to execute concurrently inside InnoDB. The meaning of innodb_thread_concurrency has changed since MySQL 5.0.19, so be sure to set it to the right value. We found the best number to be 0 (unlimited threads).

Reduce delays due to IO

One way to eliminate IO delays is to not do any IO at all, i.e., cache everything. Each row in the Sysbench test table is around 250 bytes. For a 10 million row table, that is about 2.5GB of data. Most systems come with more memory than that, and hence you should be able to cache the table in memory.
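The sizing math, for reference:

```ruby
# Total table size = row size * row count.
row_bytes   = 250
row_count   = 10_000_000
table_bytes = row_bytes * row_count          # 2,500,000,000 bytes
table_gb    = table_bytes / 1_000_000_000.0  # 2.5 GB
```

Anything comfortably above this figure left over for the buffer pool means the whole working set stays in memory.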

MySQL uses the innodb_buffer_pool_size variable to determine the size of the cache InnoDB uses. If you have sufficient memory (more than 2.5GB in our case), you can cache the whole table. An alternative strategy (especially if you are using 32-bit MySQL) is to choose a smaller value for the buffer pool and let the table be cached in the filesystem buffer cache; but I suspect it is more efficient to cache the data at the InnoDB level. This should eliminate or considerably reduce all the reads.

To eliminate writes, you can either choose to do a read-only test, or use cache-enabled disks for the writes. The cache can either be NVRAM or the write cache on the disk. Note: using a cache that is not battery backed is very risky. Do it at your own risk.

If you are still seeing delays due to IO (which in turn translates to idle time on the system), you can try adding more threads to soak up the available CPU. You have to be very careful to find the right balance.

To recap, we got the best numbers using:

  1. Use multiple sysbench threads via the --num-threads parameter
  2. Set innodb_thread_concurrency = 0 in my.cnf
  3. Set innodb_buffer_pool_size equal to or greater than the table size

That's it! Here is the my.cnf that gives me the best numbers.


innodb_data_home_dir = /mysqldata
innodb_data_file_path = ibdata1:2000M;ibdata2:10M:autoextend
innodb_log_group_home_dir = /mysqldata
innodb_buffer_pool_size = 4096M
innodb_additional_mem_pool_size = 20M
innodb_log_file_size = 400M
innodb_log_buffer_size = 64M
innodb_thread_concurrency = 0

Friday Jan 18, 2008

Averaging performance data

When you are optimizing benchmarks, the typical process involves running the same benchmark N times and picking one arbitrary run from these N as the representative run. Another option is to average the N runs (creating a new run N') and pick that as the representative run. In Fenxi, we have discussed automatically averaging a bunch of runs. Performance data can be of two types:
  • Numerical Data (Throughput, Response time, etc)
  • Textual Data (OS Patch level, syslog messages, etc.)
Averaging numerical data is very easy. Averaging textual data is neither possible nor desired. However, since we are creating a new run N', we need to select textual data to be part of this new run. Which run do we pick it from? We are trying to solve this via the Fenxi project. If you have any thoughts or suggestions, please feel free to contact us.
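A sketch of what such an averaging policy could look like (field names invented), with the open textual-data question surfacing as the arbitrary "keep run 1's value" pick:

```ruby
# Two runs of the same benchmark: numeric fields get averaged,
# textual fields are carried over from one run (here, the first).
runs = [
  { "throughput" => 100.0, "resp_ms" => 6.0, "os_patch" => "127111-01" },
  { "throughput" => 120.0, "resp_ms" => 4.0, "os_patch" => "127111-02" },
]

averaged = runs.first.keys.map { |k|
  vals = { |r| r[k] }
  numeric = vals.all? { |v| v.is_a?(Numeric) }
  [k, numeric ? vals.sum / vals.size : vals.first]

Choosing `vals.first` for textual data is exactly the arbitrary decision the post is asking about -- any run's text is equally (in)valid for the synthetic run N'.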

Monday Jan 07, 2008

Fenxi - Performance analysis made easy

We just opensourced a nice performance analysis tool called Fenxi. Fenxi is a pluggable, Java-based post-processing performance analysis tool that parses and loads the data from a variety of tools into a database, and then allows you to query and compare different sets of performance data. Fenxi can also be used to graph data from performance tools. Fenxi (Mandarin for "analyze") is the successor to the Sun-internal tool called Xanadu. It is integrated with the Faban benchmark harness.

If you have ever worked with performance data, you will pretty soon realize that
Performance Data can get huge.
Consider a benchmark running for 30 minutes on a 64-core system with hundreds of disks attached and multiple network interfaces. If you collect mpstat at 10-second intervals for the whole run, you end up with more than 11,000 lines of data! (That is 400 CTRL-Fs if you are using vi in a regular sized terminal.) If you collect data from more tools like vmstat, iostat, trapstat, busstat, cpustat, etc., you will end up with much more! Going through each of them line by line is not a scalable approach.
Performance Data is interrelated.
The tool outputs are just different views of the system behavior. We want to look at the system as a whole, rather than at its individual views. If your incoming network packet rate peaks, the interrupt count in your mpstat output most likely peaks too. We may want to see if throughput was impacted as a result of a burst of writes to our disks, and so on.
Some performance data makes sense visually.
For large data, a visual view gives a quick summary of the data. As Tim Cook puts it, "the human brain is a powerful pattern-recognition machine - graphs allow you to spot things you would never see in numbers (like waves of CPU migrations moving across different cores)". Look at the bottom of the blog for more details.
Performance Data should be queryable
We want to be able to query, or ask questions of, the performance data. For example, you might want to know "What are my hot disks?". Traditionally, people have answered such questions by writing custom scripts using sed/awk/perl. This can get tedious very fast. We need a better way of asking questions. In Fenxi, we store the data in a database, and questions are formulated in SQL.
Performance Data should be comparable, averageable, etc.
Since I work in the performance group at Sun, we run a lot of benchmarks. Since the goal of [most] benchmarks is to maximize the performance of a system, we are constantly trying out changes to the system. Typically, we change a parameter, repeat the benchmark, and see if performance has improved.
Performance Data should be sharable.
We rarely work in isolation. We should be able to share data with our peers and collaborate on finding performance fixes.

Fenxi tries to solve all of the above problems.

Sample Graph

Sample Text

Fenxi text view

You can see a sample database run processed by Fenxi. I urge you to check it out!

Thursday Nov 01, 2007

Visualizing callgraphs via dtrace and ruby

The allfrom.d DTrace script can be used to display all function entries/exits caused by a function call. When call graphs are deep or long, a visual representation is very helpful for understanding the flow as well as how much time each function consumed. The output displayed is inspired by Roch's CallStackAnalyzer, which was in turn inspired by the work on vftrace by Jan Boerhout.

I wrote a simple Ruby script to post-process the allfrom.d output and generate an SVG image. The advantage of using SVG is that you can use JavaScript to provide added functionality. For example, you can hover your mouse over any block to see the name of the function and its elapsed time. Similarly, you could add support for zoom and pan.

Unfortunately, I am having problems serving SVG files with the right MIME type, so I have included a PNG image below. You can save the SVG files somewhere on your computer and view them using Firefox.

A sample for connect(3SOCKET) is shown below. The input file used to generate it is available here. The width of each box indicates how long the function took; the Y axis indicates call depth.

If you are interested, you can also check out

To use the scripts to generate your own call stacks, download the two [1, 2] Ruby source files. For example, to generate the above callgraph, you can use:

dtrace -s allfromkernel.d accept > accept.log
ruby function_call_graph.rb accept.log > accept.svg

Note that the script does not work reliably for asynchronous function calls, or for cases where the thread changes CPUs. If there is sufficient interest, I might be tempted to add those in :-)

Thursday Oct 18, 2007

Marshalling trouble with JRuby/Rails?

If you are getting exceptions like the one below with JRuby and Rails, read on for the solution.
Rendering /tmp/neel/jruby-1.0.1/lib/ruby/gems/1.8/gems/actionpack-1.13.5/lib/
   action_controller/templates/rescues/layout.rhtml (500 Internal Error)
no marshal_dump is defined for class Java::JavaObject
/tmp/neel/jruby-1.0.1/lib/ruby/1.8/pstore.rb:349:in `dump'
/tmp/neel/jruby-1.0.1/lib/ruby/1.8/pstore.rb:327:in `transaction'
/tmp/neel/jruby-1.0.1/lib/ruby/1.8/cgi/session/pstore.rb:81:in `update'
/tmp/neel/jruby-1.0.1/lib/ruby/1.8/cgi/session/pstore.rb:88:in `close'
/tmp/neel/jruby-1.0.1/lib/ruby/1.8/cgi/session.rb:324:in `close'
The problem is that Rails is trying to store a Java object in the session store. The default session store is the filesystem, and it fails since it cannot marshal a Java object. Setting the session store to use memory instead of the filesystem solves the problem.

In config/environment.rb set

config.action_controller.session_store = :memory_store

Sunday Oct 14, 2007

Ruby lovers rejoice!

My friend Prashant is working on delivering Ruby into Solaris! I am sure all Ruby lovers really appreciate your efforts, Prashant! I no longer have to download and compile Ruby (or alternatively, install CoolStack) on every machine I use. If you are interested, check out and contribute to the ARC case.

Tuesday Sep 18, 2007

ZFS ARC Statistics

ARC statistics are exported via kstat(1M) (CR 6510807) in Solaris 10 U4 and Solaris Nevada (build 57+). A simple way to see them is to use kstat -m zfs.

I wrote a simple Perl script to print out the ARC statistics. Sample output is shown below. You can either print the set of default fields, or specify which fields you want printed (see arcstat -v for details).
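To give a flavor of how a column like miss% is derived: the kstat counters are cumulative, so each interval's figures are deltas between two snapshots. A sketch in Ruby (the hits/misses counter names match the ARC kstat, but the values here are made up):

```ruby
# Two snapshots of cumulative ARC counters, one interval apart.
prev = { "hits" => 1_000_000, "misses" => 50_000 }
curr = { "hits" => 1_090_000, "misses" => 60_000 }

hits     = curr["hits"]   - prev["hits"]    # per-interval deltas
misses   = curr["misses"] - prev["misses"]
reads    = hits + misses                    # total ARC accesses this interval
miss_pct = 100 * misses / reads             # the miss% column
```

The same delta-then-ratio pattern yields the dmis/pmis/mmis breakdowns from their respective counter pairs.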


In later blogs, I will describe how to make sense of these numbers. Sample output below

Cached random read

thumper1[bash]$ ~/
    Time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
18:15:21  985M  100M     10   47M   14   53M    8    1M    3    11G   11G
18:15:22  137K     0      0     0    0     0    0     0    0    11G   11G
18:15:23  138K     0      0     0    0     0    0     0    0    11G   11G
18:15:24  138K     0      0     0    0     0    0     0    0    11G   11G
18:15:25  138K     0      0     0    0     0    0     0    0    11G   11G
18:15:26  138K     0      0     0    0     0    0     0    0    11G   11G
18:15:27  139K     0      0     0    0     0    0     0    0    11G   11G
18:15:28  140K     0      0     0    0     0    0     0    0    11G   11G
18:15:29  139K     0      0     0    0     0    0     0    0    11G   11G
18:15:30  140K     0      0     0    0     0    0     0    0    11G   11G
18:15:31  139K     0      0     0    0     0    0     0    0    11G   11G
18:15:32   33K     0      0     0    0     0    0     0    0    11G   11G

Uncached sequential write

    Time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
18:17:48     0     0      0     0    0     0    0     0    0    11G   11G
18:17:49     0     0      0     0    0     0    0     0    0    11G   11G
18:17:50   664   372     56     7    2   365  100     5    2    11G   11G
18:17:51    1K   534     36     5    0   529  100     5    0    11G   11G
18:17:52    2K   774     33     6    0   768  100     6    0    10G   10G
18:17:53    2K   645     31     5    0   640  100     5    0    10G   10G
18:17:54    1K   645     35     5    0   640  100     5    0    10G   10G
18:17:55    2K   645     31     5    0   640  100     5    0    10G   10G
18:17:56    1K   646     35     6    0   640  100     6    0    10G   10G
18:17:57    2K   645     31     5    0   640  100     5    0    10G   10G
18:17:58    2K   774     33     6    0   768  100     6    0    10G   10G

Monday Jun 25, 2007

ZFS Intent Log (ZIL) on a separate device

Neil Perrin, the lumberjack, has quietly added support for using separate devices for the ZFS intent log! This allows ZFS to use NVRAM when it is available, enabling it to perform even better for certain kinds of applications. His putback fixes CR 6339640 "Make ZIL use NVRAM when available".

Thursday Mar 22, 2007

JRuby and Derby - The easy way

Are you interested in learning to use Derby from JRuby? While there are more comprehensive tutorials/classes on the web, here is a simple way to access Derby from JRuby (tested on 0.9.8). I hope you find it useful. Make sure derby.jar is in your CLASSPATH.

require "java"

conn = java.sql.DriverManager.getConnection("jdbc:derby:test;create=true")

stmt = conn.createStatement
rs = stmt.executeQuery("select TABLEID,TABLENAME from sys.systables")

while do
  printf("%20s %20s\n", rs.getString(1), rs.getString(2))

