Monday Jul 07, 2008

Storage engine or MySQL server? Where has the time gone?

I want to answer a simple question - If a query takes X milliseconds, how much of it is spent in the storage engine, and how much in the MySQL server? Why do I think is important? Well, since I am working on MySQL performance, I want to be able to place my bets on where to spend my time optimizing MySQL. Lets take an example. If a query takes 50ms, and I am able to figure out that 40ms is in the mysql server and the remaining 10ms is in the storage engine, the first place I would want to take a look at optimizing the MySQL server. This also tells me that it does not matter what the storage engine is, I am being limited by the server. Similarly if I find the storage engine taking up most of the time, I can explore alternate storage engines. I know all you mysql experts will tell me that much of the attribution (of time) has got to do with the capability of the storage engine, what features it supports, etc. and you cannot really do the extrapolation. However, assuming falcon would have all the features of innodb, this becomes a valid question to ask. Where is my query spending its time.

I can think of 2 easy ways to answer this question. If you know of any additional methods please feel free to let me know.

Using Dtrace

I could write a Dtrace script to intercept all storage engine API calls (both the handler and handlerton) and calculate total time. I can then subtract this with the total time for the query and get the time attribution. There are a few difficulties with this method. I am able to write such a script, however, the cost of measurement is quite high. Many of the storage engine api calls take very little time, and hence the cost of intercepting them and gathering statistics distorts things quite a bit. The other problem is that there is no way for the script to work for all cases. Since the storage engine API's are mangled, the script will not work for different C++ compilers. The solution is to write a script to generate the dtrace script (which is what I have done), however, this does not really solve the question as I cannot trust the values reported because of the probe effect.

Using Sun Studio Performance Analyzer

Sun Studio Performance Analyzer has an option where you can look at the time spent inside a specific shared object. Hence, I could build the storage engine as a plugin ( for ex) and then do my experiment and then use the analyzer to figure out the amount of time spent in and answer my question. I need to try this out. (Thanks to Richard Smith for pointing this out)

If you know of any other methods please let me know!

Wednesday Jun 25, 2008

Improving filesort performance in MySQL

I recently filed Bug #37359 filesort can be more efficient based on some performance work with an industry standard benchmark. Read on if the internals of how MySQL implements filesort interests you.

Filesort, as the name implies, is used to sort records when there is an ORDER BY clause in the query. The reason it has the prefix "file" is because temporary files may be used to store intermediate results. filesort() uses a per thread sort buffer of size sort_buffer_size to sort the table. filesort() is implemented in sql/

filesort may not know how many records need to be sorted. It asks the storage engine for an estimate of the number of records in the table via estimate_rows_upper_bound(). If the number of records that fit in the sort buffer is less than the estimate, filesort will use temporary files to store intermediate results. The flow is as follows (or atleast the important steps to understand the bug mentioned above)

  1. records = Min(estimate records in table, records that can fit in sort buffer)
  2. Initialize the sort buffer via make_char_array
  3. Do the actual sort.
A pictorial way of looking at this is shown below. Width of the box is proportional to time taken to execute the function, and height is the stack depth. Move the mouse over the image to see more details. Source:

Initializing the sort buffer can be very expensive when the estimate for the number of rows is off. For example, in my benchmark, the table has 210 Million rows. Innodb returns 210 Million for estimate_rows_upper_bound(). The actual number of rows that fulfill the WHERE clause of my query is 3!. For a default sort buffer size of 2MB, the number of rows that fit in the sort buffer is 70,000. So filesort() unnecessarily initializes space for 70,000 rows, and does the sort. You might think initializing space for 70,000 rows is not a big deal, but when it constitutes 10% of the query execution time, it IS a big deal.

So how can we improve this?

  • Do not use filesort! Rewrite your query to make use of an index, or add an index which will help the optimizer avoid the filesort.
  • For cases where the estimate for number of queries is wrong, reduce the sort buffer size to a low value (mysql> set global sort_buffer_size=32\*1024;)
  • Enhance estimate_rows_upper_bound() to passdown the WHERE clause so that the storage engine can give a better estimate.

Wednesday Apr 02, 2008

Finding optimization opportunities in MySQL by looking at callstacks

Using callstacks to look at code is very useful. If you are not familiar with callstacks, I suggest you read my earlier blog about it. I was trying to understand the mysql code path using sysbench as the test and found something interesting. An image of the callstack is shown below. The SVG version, with much more information, is also available . The width of the block is proportional to the time it took for the function, and the height is the level (or depth) of the stack.

Mysql uses mysql_execute_command() to execute queries. Looking at the callstack you can see very clearly that mysql_execute_command() calls open_and_lock_tables which then tries to open tables via open_table(). The code path gets interesting here. As you can see in the image above, (and in the svg callstack), there is tight for loop. Looking closely (moving your mouse over the area in the svg file), you will see that the code is basically traversing a hash list! Not the best thing to do with a hashtable. Looking at the code, it is originating at line number 2781 in sql/ Can it be improved? Most probably. Will it help if we improve it? Definitely for some cases, maybe not for others.

I found a few more interesting places in the code, but I will save them for my future blogs :-)

Monday Feb 25, 2008

Maximizing Sysbench OLTP performance for MySQL

Maximizing Sysbench OLTP performance for MySQL

Sysbench is a popular open source benchmark used to measure performance of various parts of the operating systems and (one) application (database). Since we are talking about MySQL, I will concentrate on the database part of the test.

The oltp test in the sysbench benchmark creates a single table and runs a set of queries against it. Each row in the table is around 250 bytes and by default it creates 1000 rows. For our experiment we used 10 million rows. Allan has blogged about the details regarding the experiments; I will present an alternate view to those experiments.

The trick to getting good numbers with Sysbench and MySQL is very simple

  1. Maximize CPU utilization
  2. Reduce delays due to IO

Maximize CPU Utilization

It is very important to utilize all the resources in the system to get maximum performance. If you have a multi-core machine (many processors already have multiple cores), this means that you will have to execute several threads in parallel to to fully utilize all the cores in the system. For Sysbench, this is achieved by using the --num-threads argument. We got the best performance when the number of threads is usually close or equal to the number of cores in the system. You also need to increase the innodb threads concurrently executing inside innodb. The meaning of innodb_thread_concurrency has changed since MySQL 5.0.19, so be sure to set it to the right value. We found the best number to be 0 (unlimited threads).

Reduce delays due to IO

One way to eliminate IO delays is by not doing any IO i.e cache everything. Each row in the Sysbench test table is around 250 bytes. For a 10 million row table, it is about 2.5GB of data. Most systems come with more memory than that, and hence you should be able to cache the table in memory.

MySQL uses the innodb_buffer_size variable to determines the size of the cache it uses. If you have sufficient memory (more than 2.5GB in our case), you could cache the whole table. An alternative strategy (especially if you are using 32bit MySQL) is to choose a smaller value for the innodb_buffer_cache and let the table be cached in the filesystem buffer cache; but I suspect it is more efficient to cache the data at the innodb level. This should eliminate or considerably reduce all the reads.

To eliminate writes, you can either choose to do a read-only test, or use a cache-enabled disks for the writes. The cache can either be NVRAM, or the write cache on the disk. Note: Using cache that is not battery backed is very risky. Do it at your own risk.

If you are still seeing delays due to IO(which in turn translates to idle time on the system), you can try adding more threads to soak up the available CPU. You have to be very careful to find the right balance.

To recap, We got the best numbers using

  1. Use multiple sysbench threads via the --num-threads parameter
  2. Set innodb_thread_concurrency = 0 in my.cnf
  3. Set innodb_buffer_size equal to or greater than the table size

Thats it! Here is the my.cnf that gives me the best numbers.


innodb_data_home_dir = /mysqldata
innodb_data_file_path = ibdata1:2000M;ibdata2:10M:autoextend
innodb_log_group_home_dir = /mysqldata
innodb_buffer_pool_size = 4096M
innodb_additional_mem_pool_size = 20M
innodb_log_file_size = 400M
innodb_log_buffer_size = 64M
innodb_thread_concurrency = 0



« July 2016