Wednesday Apr 14, 2010

InnoDB now supports native AIO on Linux

Note: this article was originally published on http://blogs.innodb.com on April 14, 2010 by Inaam Rana.

With the exception of Windows, InnoDB has used ‘simulated AIO’ on all other platforms to perform certain IO operations. The IO requests that have been performed in a ‘simulated AIO’ way are the write requests and the readahead requests for the datafile pages. Let us first look at what ‘simulated AIO’ means in this context.

We call it ‘simulated AIO’ because it appears asynchronous from the context of a query thread, but from the OS perspective the IO calls are still synchronous. The query thread simply queues the request in an array and then returns to its normal work. One of the IO helper threads, which are background threads, then takes the request from the queue and issues a synchronous IO call (pread/pwrite), meaning it blocks on the IO call. Once it returns from the pread/pwrite call, this helper thread calls the IO completion routine on the block in question, which, in the case of a read, includes merging buffered operations, if any. In the case of a write, the block is marked as ‘clean’ and is removed from the flush_list. Some other bookkeeping also happens in the IO completion routine.

What we have changed in the InnoDB Plugin 1.1 is to use the native AIO interface on Linux. Note that this feature requires that your system has libaio installed on it. libaio is a thin wrapper around the kernelized AIO on Linux. It is different from Posix AIO which requires user level threads to service AIO requests. There is a new boolean switch, innodb_use_native_aio, to choose between simulated or native AIO, the default being to use native AIO.
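For example, a minimal my.cnf sketch (the option name is as described above; the value shown is only illustrative):

[mysqld]
# Native AIO is the default on Linux in Plugin 1.1.
# Set to 0 to fall back to the simulated AIO code path.
innodb_use_native_aio=1

You can verify the setting from a client with SHOW VARIABLES LIKE 'innodb_use_native_aio'; it can only be changed at server startup.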

How does this change the design of the InnoDB IO subsystem? Now the query thread, instead of enqueueing the IO request, actually dispatches the request to the kernel and returns to its normal work. The IO helper thread, instead of picking up enqueued requests, waits on the IO wait events for any completed IO requests. As soon as it is notified by the kernel that a certain request has been completed, it calls the IO completion routine on that request and then returns to waiting on the IO wait events. In this new design the IO requesting thread becomes a kind of dispatcher while the background IO thread takes on the role of a collector.

What will this buy us? The answer is simple – scalability. For example, consider a system which is heavily IO bound. In InnoDB one IO helper thread works on a maximum of 256 IO requests at one time. Assume that the heavy workload results in the queue being filled up. With simulated AIO the IO helper thread will go through these requests one by one, making a synchronous call for each request. This means serialisation, forcing the request that is serviced last to wait for the other 255 requests before it gets a chance. What this implies is that with simulated AIO there can be at most ‘n’ IO requests in parallel inside the kernel, where ‘n’ is the total number of IO helper threads (this is not entirely true because query threads are also allowed to issue synchronous requests, but I’ll gloss over that detail for now). In the case of native AIO all 256 requests are dispatched to the kernel, and if the underlying OS can service more requests in parallel then we’ll take advantage of that.
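If your storage can absorb more concurrent requests, you can also raise the number of IO helper threads. A hedged my.cnf sketch (innodb_read_io_threads and innodb_write_io_threads are not discussed in this post; the values below are only illustrative):

[mysqld]
# Each IO helper thread handles up to 256 pending requests,
# so more threads allow more IO in flight.
innodb_read_io_threads=8
innodb_write_io_threads=8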

The task of coalescing contiguous requests is now offloaded to the kernel/IO scheduler. What this means is that which IO scheduler you are using, or the properties of your RAID/disk controller, may now have more effect on the overall IO performance. This is also true because many more IO requests will now be inside the kernel than before. Though we have not run tests to specifically certify any particular IO scheduler, the conventional wisdom has been that for database engine workloads the noop or deadline scheduler would perhaps give optimal performance. I have heard that lately a lot of improvements have gone into cfq as well. It is for you to try and, as always, YMMV. We look forward to hearing your story.

NOTE: InnoDB has always used native AIO on Windows and it continues to do so in Plugin 1.1. innodb_use_native_aio has no effect on Windows.

Tuesday Apr 13, 2010

InnoDB recovery is now faster…much faster!

Note: this article was originally published on http://blogs.innodb.com on April 13, 2010 by Inaam Rana.

One of the well known and much written about complaints regarding InnoDB recovery is that it does not scale well on high-end systems. Well, not any more. In InnoDB Plugin 1.0.7 (which is GA) and Plugin 1.1 (which is part of MySQL 5.5.4) this issue has been addressed. Two major improvements, apart from some other minor tweaks, have been made to the recovery code. In this post I’ll explain these issues and our solutions for them.

The first issue, reported here, is about the available memory check eating up too much CPU. During recovery, the first phase, called the redo scan phase, is where we read the redo logs from disk and store them in a hash table. In the second phase, the redo application phase, these redo log entries are applied to the data pages. The hash table that stores the redo log entries grows in the buffer pool, i.e., memory for the entries is allocated in 16K blocks from the buffer pool. We have to ensure that the hash table does not end up allocating all the memory in the buffer pool, leaving us with no room to read in pages during the redo log application phase. For this we have to keep checking the size of the heap that we are using for allocating the memory for the hash table entries. So why would this kill performance? Because we do not have the total size of the heap available to us. We calculate it by traversing the list of blocks allocated so far. Imagine if we have gigabytes of redo log to apply (it can be up to 4G). That would mean hundreds of thousands of blocks in the heap! And we have to make this check roughly every time we read in a new redo page during the scan. That is an O(n * m) algorithm, where ‘n’ is the number of blocks in the heap and ‘m’ is the number of redo pages that have to be scanned.

What is the solution we came up with? Store the total size of a heap in its header. Simple and effective. Our algorithm now becomes O(m).

Let’s talk about the second issue, reported here. During the redo log application phase, data pages to which redo log entries are applied have to be inserted into a list called the flush_list, which is ordered by the LSN of the earliest modification to a page, i.e., oldest_modification. During normal operation the LSN increases monotonically, therefore insertion into the flush_list always happens at the head. But during recovery we have to linearly search the flush_list to find the appropriate spot for insertion. The length to which the flush_list can grow is the number of modified pages (called dirty pages in db parlance) we had at the time of the crash. On a high-end system with a multi-gigabyte buffer pool this number can be very high, i.e., a million or more dirty pages. A linear search for insertion won’t scale.

There has been talk in the community about how to fix this; various solutions have been suggested and some were implemented by the community as well. What we, at InnoDB, have finally implemented is an auxiliary data structure (a red-black tree in this case) which is active only during the recovery phase and which is used to speed up sorted insertions into the flush_list. The flush_list remains a list, and after recovery is over the red-black tree is discarded, as during normal operation we only ever append to the flush_list.

So much for the theory. Now let us see if we can walk the talk. To evaluate the effectiveness of this fix we’d need a crash when there are a lot of redo logs to apply and a lot of dirty pages. I asked Michael Izioumtchenko (InnoDB’s QA supremo) to come up with something. And he did the following:
The dataset was obtained by running a 60m sysbench read/write, uniform-distribution, in-memory workload with a prewarmed cache, using the following configuration parameters:
--innodb-buffer-pool-size=18g
--innodb-log-file-size=2047m
--innodb-adaptive-flushing=0
--innodb-io-capacity=100

The latter two are used to throttle flushing in order to maximize the number of dirty pages.

It took only about 20 minutes of running the workload to arrive at the test dataset, including cache prewarming.
So at the time of the crash we had:
Modified db pages: 1007907
Redo bytes: 3050455773
And the recovery times were:

Plugin 1.0.7 (also Plugin 1.1): 1m52s scan, 12m04s apply, total 13m56s
Plugin 1.0.6: 31m39s scan, 7h06m21s apply, total 7h38m
Plugin 1.0.7 (and Plugin 1.1) is 16.95x better on scan, 35.33x better on apply, and 32.87x better overall

Note that all this comes to you transparently. You don’t have to set any parameter to take advantage of this feature. My only suggestion would be to use log files as large as you can (there is a limit of 4G on the total log file size). I know users have been using smaller log files to avoid recovery running into hours. They have taken a hit on throughput during normal operation to avoid long recovery times. You don’t have to do that any more. InnoDB recovery will never run into hours. It’s a guarantee!
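For example, a minimal configuration sketch (the values are illustrative; in these releases the combined size of all log files is capped at 4G, and resizing the logs generally requires a clean shutdown and moving aside the old ib_logfile* files before restart):

[mysqld]
# Two log files of roughly 2G each, close to the 4G total limit.
innodb_log_file_size=2047M
innodb_log_files_in_group=2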

InnoDB Performance Schema

Note: this article was originally published on http://blogs.innodb.com on April 13, 2010 by Jimmy Yang.

Performance Schema Support in InnoDB

With the Plugin 1.1 release, InnoDB has full support for the Performance Schema, a new feature of the MySQL 5.5 release. This allows a user to peek into some critical server synchronization events and obtain their usage statistics. On the other hand, to make sense of the instrumented results you might need some understanding of InnoDB internals, especially in the area of synchronization with mutexes and rwlocks.

With this effort, the following four modules have been instrumented for the performance schema:

1. Mutex
2. RWLOCKs
3. File I/O
4. Thread

Almost all mutexes (42), rwlocks (10) and 6 types of threads are instrumented. Most mutex/rwlock instrumentations are turned on by default; a few of them are guarded by a special define. For file I/O, the statistics are categorized into data, log and temp file I/O.

This blog post gives you a quick overview of this new machinery.

Start the MySQL Server with Performance Schema

To start with, you probably want to take a quick look at the MySQL Performance Schema manual (http://dev.mysql.com/doc/refman/5.5/en/performance-schema.html), which gives you an overview of the general performance schema features.

The performance schema is built in by default in the MySQL 5.5 release. However, you do need to add “--performance_schema” to your server command line or enable the performance_schema system variable in your server configuration file in order to enable the performance schema. Otherwise, it will be disabled. Please note, you can specify the “performance_schema” variable with no value or a value of 1 to enable it, or with a value of 0 to disable it.
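For example, a minimal sketch of enabling it in the configuration file and then verifying it from a client (the variable name is as described above):

[mysqld]
performance_schema=1

mysql> SHOW VARIABLES LIKE 'performance_schema';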

When the server starts, please pay attention to the following lines in the server error log:

“100407 16:13:02 [Note] Buffered information: Performance schema enabled.”

This means the server has started with the performance schema running fine.

It could also display a message such as:

“100407 16:13:02 [Note] Buffered information:  Performance schema disabled (reason: start parameters)”

This shows the performance schema is disabled due to the lack of either the “performance_schema” boot option or the appropriate variable setting in the configuration file.

The third type of message would be “Performance schema disabled (reason: init failed)”, which is due to a performance schema initialization failure (this could be caused by, for example, a memory allocation failure). This message is relatively rare; I have not encountered it. If you do hit it, please check the other performance schema related system variables to see if any are out of a reasonable range.

Performance Schema Database and its Tables

Assuming the server starts fine with the Performance Schema enabled, the first stop you want to visit is probably the new database called “performance_schema”. All performance schema related tables are in this database:

mysql> use performance_schema

mysql> show tables;

Tables_in_performance_schema
COND_INSTANCES
FILE_INSTANCES
MUTEX_INSTANCES
RWLOCK_INSTANCES
EVENTS_WAITS_CURRENT
EVENTS_WAITS_HISTORY
EVENTS_WAITS_HISTORY_LONG
EVENTS_WAITS_SUMMARY_BY_EVENT_NAME
EVENTS_WAITS_SUMMARY_BY_INSTANCE
EVENTS_WAITS_SUMMARY_BY_THREAD_BY_EVENT_NAME
PROCESSLIST
SETUP_CONSUMERS
SETUP_INSTRUMENTS
SETUP_OBJECTS
SETUP_TIMERS
PERFORMANCE_TIMERS
FILE_SUMMARY_BY_EVENT_NAME
FILE_SUMMARY_BY_INSTANCE

These 18 tables can be categorized into a few big groups, such as “Instance” tables, “Wait” tables with history, “Wait” tables with summaries, and “Setup” tables.

In the next few sections, I will go through a few tables in this list that I think are important.

Find Instrumented Events with INSTANCE TABLES

To view which InnoDB events are active and being instrumented, please check the following four “Instance” tables for the corresponding modules:

MUTEX_INSTANCES
RWLOCK_INSTANCES
PROCESSLIST
FILE_INSTANCES

mysql> SELECT DISTINCT(NAME)
    ->   FROM MUTEX_INSTANCES
    ->  WHERE NAME LIKE '%innodb%';

name
wait/synch/mutex/innodb/analyze_mutex
wait/synch/mutex/innodb/mutex_list_mutex
wait/synch/mutex/innodb/ibuf_mutex
wait/synch/mutex/innodb/rseg_mutex
wait/synch/mutex/innodb/autoinc_mutex
wait/synch/mutex/innodb/flush_list_mutex
…
wait/synch/mutex/innodb/thr_local_mutex
wait/synch/mutex/innodb/srv_monitor_file_mutex
wait/synch/mutex/innodb/buf_pool_mutex
wait/synch/mutex/innodb/recv_sys_mutex
wait/synch/mutex/innodb/fil_system_mutex
wait/synch/mutex/innodb/trx_doublewrite_mutex
wait/synch/mutex/innodb/flush_order_mutex

35 rows in set (0.00 sec)

Please notice that there could be multiple instances of a mutex in the server:

mysql> SELECT COUNT(*)
    ->   FROM MUTEX_INSTANCES
    ->  WHERE NAME LIKE '%rseg_mutex%';

COUNT(*)
128

1 row in set (0.92 sec)

This is why we use the SELECT DISTINCT(NAME) clause in the initial query, to get only the distinct mutex names from the MUTEX_INSTANCES table. Without DISTINCT, there could be hundreds of mutex instances displayed. This also applies to the other instance tables.

Also please note, if a mutex has not yet been created, it will not be listed in the instance table, so you might see fewer events/instances than you expected.

One last point for this section: the buffer block mutex and rwlock are instrumented but excluded from performance schema instrumentation by default. The reason is that there is one mutex/rwlock pair per 16K buffer block. A server with a large buffer pool configuration could easily create thousands of instances of these mutexes/rwlocks. This easily exceeds the default limit on mutex/rwlock instances (1000), and the user would have to extend the limit by setting the system variables performance_schema_max_mutex_instances and/or performance_schema_max_rwlock_instances.
However, as we mentioned, the block mutex/rwlock are instrumented; to enable them, you need to change the code and undefine “PFS_SKIP_BUFFER_MUTEX_RWLOCK”.
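If you do enable them, a my.cnf sketch along these lines raises the instance limits mentioned above (the values are purely illustrative; size them to your buffer pool):

[mysqld]
# One mutex/rwlock pair exists per 16K buffer block, so a large
# buffer pool needs correspondingly large limits.
performance_schema_max_mutex_instances=1000000
performance_schema_max_rwlock_instances=1000000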

Find out what is going on with the EVENTS_WAITS_CURRENT table

The next table you might be interested in is the EVENTS_WAITS_CURRENT table:

mysql> SELECT THREAD_ID, EVENT_NAME, SOURCE
    ->   FROM EVENTS_WAITS_CURRENT
    ->  WHERE EVENT_NAME LIKE '%innodb%';

THREAD_ID EVENT_NAME SOURCE
2 wait/synch/mutex/innodb/ios_mutex srv0start.c:495
8 wait/synch/mutex/innodb/log_sys_mutex log0log.ic:405
9 wait/synch/mutex/innodb/kernel_mutex srv0srv.c:2182
10 wait/synch/mutex/innodb/thr_local_mutex thr0loc.c:127

4 rows in set (0.00 sec)

This table shows the latest instrumented activity for a particular thread. The nice part of it is that it has the exact file name and line number of each event, so in case there is a hang/blocking situation (due to a mutex/rwlock), you can tell which mutex or rwlock is actually involved.

Check into “Limited history” with HISTORY tables

There are a couple of “HISTORY” tables that record instrumented events. The EVENTS_WAITS_HISTORY table contains the most recent 10 events per thread, and EVENTS_WAITS_HISTORY_LONG contains the most recent 10,000 events by default. They also come with the “SOURCE” field with file name and line number, and you might be able to do some aggregation on them to find interesting behavior.

For example, the following query shows which mutex call sites have accumulated the most wait time in the history table:

mysql> SELECT EVENT_NAME, SUM(TIMER_WAIT), COUNT(*), SOURCE
    ->   FROM EVENTS_WAITS_HISTORY_LONG
    ->  WHERE EVENT_NAME LIKE '%innodb%'
    ->  GROUP BY SOURCE
    ->  ORDER BY SUM(TIMER_WAIT) DESC;

Or you can obtain the call sites with the highest average wait time:

mysql> SELECT EVENT_NAME, SUM(TIMER_WAIT)/COUNT(*), SOURCE
    ->   FROM EVENTS_WAITS_HISTORY_LONG
    ->  WHERE EVENT_NAME LIKE '%innodb%'
    ->  GROUP BY SOURCE
    ->  ORDER BY SUM(TIMER_WAIT) / COUNT(*) DESC;

As mentioned, the history tables have a limited size: 10 events per thread for EVENTS_WAITS_HISTORY and 10,000 events for EVENTS_WAITS_HISTORY_LONG. However, you can extend the history length of these two tables by changing the “performance_schema_events_waits_history_size” and “performance_schema_events_waits_history_long_size” system variables. performance_schema_events_waits_history_long_size can be extended to a maximum of a million rows. However, please do not expect this to always be enough: even with 1 million events configured, on a busy system it probably only covers a few seconds of server operation.
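For example, a my.cnf sketch using the startup variables named above (the values are only illustrative):

[mysqld]
performance_schema_events_waits_history_size=20
performance_schema_events_waits_history_long_size=1000000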


Find out aggregated information from SUMMARY Tables

To get the overall aggregated values for these instances, you need the “SUMMARY” tables. There are 5 summary tables:

EVENTS_WAITS_SUMMARY_BY_EVENT_NAME
EVENTS_WAITS_SUMMARY_BY_INSTANCE
EVENTS_WAITS_SUMMARY_BY_THREAD_BY_EVENT_NAME
FILE_SUMMARY_BY_EVENT_NAME
FILE_SUMMARY_BY_INSTANCE

As their names suggest, they are event statistics aggregated by different criteria. Digging into these tables gives you some idea of where the contention could be.

For example, the following query shows which mutex is the hottest (the timer values are in picoseconds):

mysql> SELECT EVENT_NAME, COUNT_STAR, SUM_TIMER_WAIT, AVG_TIMER_WAIT
    ->   FROM EVENTS_WAITS_SUMMARY_BY_EVENT_NAME
    ->  WHERE EVENT_NAME LIKE '%innodb%'
    ->  ORDER BY COUNT_STAR DESC;

EVENT_NAME COUNT_STAR SUM_TIMER_WAIT AVG_TIMER_WAIT
buf_pool_mutex 1925253 264662026992 137468
buffer_block_mutex 720640 80696897622 111979
kernel_mutex 243870 44872951662 184003
purge_sys_mutex 162085 12238011720 75503
…

This experiment shows that buf_pool_mutex has been the hottest mutex. However, aggregating on AVG_TIMER_WAIT shows that ibuf_mutex is the one with the longest average wait, even though it is much less frequently accessed.
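The FILE_SUMMARY tables can be queried in the same spirit for file I/O. A hedged sketch, assuming the standard 5.5 column names (COUNT_READ, COUNT_WRITE and the byte counters); verify them against your release:

mysql> SELECT EVENT_NAME, COUNT_READ, COUNT_WRITE,
    ->        SUM_NUMBER_OF_BYTES_READ, SUM_NUMBER_OF_BYTES_WRITE
    ->   FROM FILE_SUMMARY_BY_EVENT_NAME
    ->  WHERE EVENT_NAME LIKE '%innodb%';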

Also please note that these tables can be truncated, so you can essentially reset the wait values and start the counting and aggregation afresh.
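For instance (a sketch; any of the summary tables can be truncated the same way):

mysql> TRUNCATE TABLE EVENTS_WAITS_SUMMARY_BY_EVENT_NAME;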

Again, really understanding and interpreting the information from these tables requires some understanding of the internals in which these mutexes/rwlocks reside. It is targeted at advanced users and developers who want to analyze performance bottlenecks. However, ordinary users might still be able to infer certain information from it and find creative uses for these statistics. For example, on some I/O bound servers, you might find that the double write buffer mutex is among the top mutexes in terms of total time waited; you might then want to consider turning off the double write buffer option.

Performance Impact:
The last item we discuss is that the performance schema comes with a cost: it does have a visible performance impact. A simple DBT2 test with 50 warehouses and 32 connections on a server with a 2G buffer pool shows about an 8% performance impact with all performance schema events turned on. This is also confirmed by some sysbench tests.

In fact, to minimize the performance impact, the performance schema allows you to turn off counting on individual events with the SETUP tables: you can use SETUP_CONSUMERS to turn on/off logging into the history tables, and SETUP_INSTRUMENTS to turn on/off counting on a particular mutex/rwlock. However, turning off event counting cannot completely eliminate the performance impact of the performance schema. This is something for the performance schema to improve upon.
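As a sketch of how that looks (assuming the standard 5.5 SETUP table columns ENABLED and TIMED; adjust to your release), you could stop recording history rows and disable the InnoDB mutex instruments like this:

mysql> UPDATE SETUP_CONSUMERS SET ENABLED = 'NO'
    ->  WHERE NAME LIKE '%history%';

mysql> UPDATE SETUP_INSTRUMENTS SET ENABLED = 'NO', TIMED = 'NO'
    ->  WHERE NAME LIKE 'wait/synch/mutex/innodb/%';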

Summary:
In summary, we are providing a rich set of mutex, rwlock, I/O and thread usage information through Performance Schema instrumentation. It can be used to diagnose server performance bottlenecks, find possible hot spots in the server, and gain a better understanding of system behavior and access patterns in the modules where these mutexes/rwlocks reside. However, it does come with a cost to server performance itself, so it is more suitable for tuning and study on a development server. You might want to leave it turned off on a production server.

Better Scalability with Multiple Rollback Segments

Note: this article was originally published on http://blogs.innodb.com on April 13, 2010 by Sunny Bains.

Background

The original motivation behind this patch was the infamous Bug#26590: “MySQL does not allow more than 1023 open transactions”. Actually, the 1024 limit has to do with the number of concurrent update transactions that can run within InnoDB. Where does this magic number come from? 1024 is the total number of UNDO log list slots on one rollback segment header page, and in the past InnoDB created just one rollback segment header page during database creation. This rollback segment header page is anchored in the system header page; there is space there for 128 rollback segments, but only one was being created and used, resulting in the 1024 limit. Each slot in the rollback segment header array comprises {space_id, page_no}, where both space_id and page_no are of type uint32_t. Currently the space id is “unused” and always points to the system tablespace, which is tablespace 0. Now, on to the rollback segment header page. This page contains a rollback segment header (details of which are outside the scope of this blog entry :-) ), followed by an array of 1024 UNDO slots. Each slot is the base node of a file-based linked list of UNDO logs. Each node in this file-based list contains UNDO log records, containing the data updated by a transaction. A single UNDO log node can contain UNDO entries from several different transactions.

Performance ramifications

When a transaction is started it is allocated a rollback segment to write its modifications to. Multiple transactions can write to the same rollback segment, but only one transaction is allowed to write to any one UNDO slot during its lifetime. This should make it clear where the 1024 limit comes from. Each rollback segment is protected by its own mutex, and when we have a single rollback segment this rollback segment mutex can become a high-contention mutex.

Requirements

Backward compatibility in file formats is something we take very seriously at InnoDB. InnoDB has always had room for up to 128 rollback segments, but before this fix it created only one. We had to figure out a way to make the multiple rollback segments change backward compatible, without breaking any assumptions in the code of older versions of InnoDB about the absolute locations of system pages and without changes to system data. The 128 limit is a result of the latter. While there is space for 256 rollback segments, InnoDB uses only 7 bits from that field. Once we fix that we could, in the future, enable 256 rollback segments; however, 128 seems to be sufficient for now. There are other scalability issues that need to be addressed before 128K concurrent transactions become an issue :-) .

The solution

To keep backward compatibility, the additional rollback segments are created after the double write buffer when creating the rollback segments in a new instance. We keep the total number of rollback segments set to 128 rather than 256, and set the remaining slots to NULL, because older versions will try to scan up to 256. This means that older versions of InnoDB can also benefit from these extra rollback segments. If you create the extra rollback segments with a newer version but later revert to an older version, the older version should still be able to use the additional rollback segments. Newer versions of InnoDB that contain this fix will create the additional segments in existing instances only if the innodb_force_recovery flag is not set and the database instance was shut down cleanly. The additional segments are created by default when creating new instances.

Thursday Apr 01, 2010

InnoDB Plugin Doc now on dev.mysql.com

Note: this article was originally published on http://blogs.innodb.com on April 1, 2010 by John Russell.

The InnoDB Plugin manual is now available on the MySQL web site.

Tuesday Aug 11, 2009

New InnoDB Plugin with MORE Performance: Thanks, Community!

Note: this article was originally published on http://blogs.innodb.com on August 11, 2009 by Calvin Sun.

Today, the InnoDB team announced the latest release of the InnoDB Plugin, release 1.0.4. Some of the performance gains in this release are quite remarkable!

As noted in the announcement, this release contains contributions from Sun Microsystems, Google and Percona, Inc., for which we are very appreciative. The purpose of this post is to describe the general approach the InnoDB team takes toward third party contributions.

In principle, we appreciate third party contributions. However, we simply don’t have the resources to seriously evaluate every change that someone proposes, but when we do undertake to evaluate a patch, we have some clear criteria in mind:

  • The patch has to be technically sound, reliable, and effective
  • The change should fit with the architecture, and our overall plans and philosophy for InnoDB
  • The contribution must be available to us under a suitable license

Let’s consider, in general terms, what these criteria mean in practice.

We have to expend a fair bit of effort to carefully evaluate and possibly modify a patch before we can include it in the release. Some of the third party contributions we’ve seen have not been portable, or have been developed just for Linux. It can take time to find an approach that enables a platform to take advantage of a new feature, even if the platform has the required capabilities. Some of the patches we’ve evaluated have contained actual bugs that would impact reliability, cause deadlocks or have other negative implications. InnoDB is a clean and elegant piece of code, yet some of its internal algorithms and behaviors are subtle and complex. Therefore, changes in the “guts” of InnoDB (or any storage engine) must be done carefully and thoroughly tested. Some patches that have been offered make a difference, but only when compared to an inappropriate “baseline”. At any given point, we would look to include a patch only if it makes a significant improvement over the “best” version or configuration of InnoDB available at the time. We like to test each patch in isolation, to assess its individual value. This requires some rigorous performance testing, with multiple workloads.

From time to time, third parties have made suggestions for changes that may seem attractive at first, but don’t make sense longer term. In general, we may have a more comprehensive approach to a problem or requirement that we would like to implement, rather than incorporate a patch that would introduce a feature that would ultimately be made obsolete. We prefer to have fewer “knobs” and tuning complexity, so we’re more inclined to implement heuristic, self-tuning capabilities than we are to add new configuration parameters. Lastly, we take care to protect the ability to upgrade and downgrade user databases with the file format management features in the InnoDB Plugin. If a patch requires an on-disk change, we will defer its incorporation until the time comes to implement a new file format.

For us to be able to make continued investment in InnoDB, we must be able to license the software commercially. OEMs and ISVs who incorporate MySQL with InnoDB in their products may not wish to release their products in open source form. Therefore, for each contribution we are to accept, we must have clear legal rights to the change.

Beyond all that, of course, we take care to carefully document each new feature, both in terms of form and function. We try hard to explain the implications of a feature, providing information about what it does, and when and where to use a feature, as well as how to do so. And, generally speaking, we are committed to upward compatibility and support of a feature once it is introduced.

It’s pretty clear that the integrity of InnoDB, with its broad adoption and importance everywhere it is used, is paramount to you and to us. You can trust the InnoDB team to protect InnoDB now and in the future, while being open to suggestions and contributions. Let us know if you think we’re doing a good job!

Wednesday May 13, 2009

InnoDB Conference Presentations Now Online

Note: this article was originally published on http://blogs.innodb.com on May 13, 2009 by Calvin Sun.

Well, it took us a little while (we’ve been busy ;-) !), but we’ve now posted our presentations on InnoDB from the MySQL Conference and Expo 2009. You can download these presentations by Heikki Tuuri, Ken Jacobs and Calvin Sun from the InnoDB website.

Friday Apr 17, 2009

Only God can make random selections

Note: this article was originally published on http://blogs.innodb.com on April 17, 2009 by Vasil Dimov.

Recently, it was reported (see MySQL bug #43660) that “SHOW INDEXES/ANALYZE does NOT update cardinality for indexes of InnoDB table”. The problem appeared to happen only on 64-bit systems, but not 32-bit systems. The bug turns out to be a case of mistaken identity. The real criminal here wasn’t the SHOW INDEXES or the ANALYZE command, but something else entirely. It wasn’t specific to 64-bit platforms, either. Read on for the interesting story about this mystery and its solution …

InnoDB estimates statistics for the query optimizer by picking random pages from an index. Upon detailed analysis, we found that the algorithm that picks random pages for estimation always picked the same page, thus producing the same result every time. This made it appear that the index cardinality was not updated by ANALYZE TABLE. Going deeper, the reason the algorithm always selected the same page was that the random number generator generated numbers that, when divided by 3, always gave the same remainder (2).

The sampling algorithm selects a random leaf page by starting from the root page and then selecting a random record from it, descending into its child page and so on until it reaches a leaf page. In the particular case that was reported in the bug report, the root page contained only 3 records and the tree height was only 2 (i.e., the leaf pages were all just below the root page).

You can already guess what happened. The “random” numbers generated, not being so random, caused the algorithm to always pick the same record from the root page (the second one) and then descend to the leaf page below it. Every time. So, the 8 random pages that were sampled in order to get an estimate of the whole picture were in fact the same page, even in isolated ANALYZE TABLE runs.

So, clearly there was a problem with the random number generator. But why didn’t this problem seem to appear on 32-bit platforms? The random number generator, always generating numbers like 3k+2 of type unsigned long, at some point wrapped around 4 billion on 32-bit machines and started generating numbers like 3k+1. On 64-bit machines, where unsigned long is much bigger, this wrap did not occur. It would have, had we only enough time to wait: we would have had to run the test for 1000 years!

So, on 32-bit machines, at some point the first record from the root page was picked instead of the second one, and this caused some changes in the results produced by ANALYZE TABLE. Yet, on 64-bit machines, for all practical purposes, this “never” happens. By only looking at the symptoms, one would get the impression that the flaw existed only on 64-bit machines and that 32-bit systems were ok.

Well, what about the fix? A possible fix would be to change InnoDB in 64-bit environments to behave the same way it does in 32-bit environments. People who are used to the behavior of InnoDB on 32-bit machines and upgrade to a 64-bit machine might be satisfied, because the problem on 64-bit systems was “solved”. But in reality, this approach in no way would fix the underlying problem. The real solution is to replace the random number generator with a better one (fully realizing that algorithmic random number generators are only pseudo-random number generators).

Yet even that is not so simple. Making any change would have caused changes to index cardinality estimations, thereby causing changes in decisions made by the optimizer, resulting in different execution plans … and different, possibly worse, performance for queries. Because MySQL 5.1 and 5.0 are frozen for such drastic changes, we fixed this bug in the upcoming 1.0.4 release of the InnoDB Plugin.

In order to not break existing applications, and since many people wanted a fix for MySQL 5.0 and 5.1, we implemented this fix for MySQL under the control of a new configuration parameter (innodb_use_legacy_cardinality_algorithm), which is turned on by default, preserving past behavior. Because the “right fix” is to permanently change the random number generator, this new configuration parameter is not present in the InnoDB Plugin, and the “more random” random number generator will always be used.
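For those on MySQL 5.0/5.1 who prefer the new behavior, a minimal my.cnf sketch (the parameter is the one named above; 0 disables the legacy algorithm):

[mysqld]
innodb_use_legacy_cardinality_algorithm=0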

And that is the end of the case of mistaken identity. It turns out that it is really hard to generate truly random numbers, hence the title of this blog post.

Birds? Feathers? Plug In to a BoF on InnoDB!

Note: this article was originally published on http://blogs.innodb.com on April 17, 2009 by Calvin Sun.

There is going to be a lot of InnoDB news at the MySQL Conference and Expo … including some surprises! Wear your bird costume (or not), but please join us for a Birds of a Feather session about InnoDB products and technology at 7:30pm Wednesday, in Ballroom C.

You’ll have a chance to dig deeper into the topics you’ve learned about for the first time at the conference, and can chat with Heikki Tuuri, Calvin Sun and Ken Jacobs and others about all things InnoDB.

The session will be largely unstructured and open: we’ll discuss the latest version of the InnoDB Plugin and answer your technology questions about the newest InnoDB products. It’s also a great opportunity to network with other users and experts in InnoDB technology and learn how to get the best out of MySQL for your application. Come ask questions and get answers about this technology that is so critical to YOUR success!

See you there!

Tuesday Mar 31, 2009

Software is Hard Sometimes …

Note: this article was originally published on http://blogs.innodb.com on March 31, 2009 by Vasil Dimov.

Some months ago, Google released a patch for InnoDB that boosts performance on multi-core servers. We decided to incorporate the change into the InnoDB Plugin to make everybody happy: users of InnoDB don’t have to apply the patch, and Google no longer has to maintain the patch for new versions of InnoDB. And it makes us at Innobase happy because it improves our product (as you can read in this post about InnoDB Plugin release 1.0.3).

However, there are always technical and business issues to address. Given the low-level changes in the patch, was it technically sound? Was the patch stable and as rock solid as the rest of InnoDB? Although it was written for the built-in InnoDB in MySQL 5.0.37, we needed to adapt it to the InnoDB Plugin. Could we make the patch portable to many platforms? Could we incorporate the patch without legal risk (so it could be licensed under both the GPL and commercial terms)?

Fortunately Google generously donated the patch to us under the BSD license, so there was no concern about intellectual property (so long as we properly acknowledged the contribution, which we do in an Appendix to the manual and in the source code). So, while the folks at Google are known for writing excellent code, we had to thoroughly review and test the patch before it could be incorporated in a release of InnoDB.

The patch improves performance by implementing InnoDB mutexes and rw-locks with atomic memory instructions. The first issue that arose was that the patch assigned the integer value -1 to a pthread_t variable, to refer to a neutral/non-existent thread identifier. This approach worked for Google because they use InnoDB solely on Linux. As it happens, pthread_t is defined as an integer on Linux.

But we had problems when the patch was tested on FreeBSD. We still needed to reference a non-existent thread, but in some environments (e.g. some versions of HP-UX, Mac OS X) pthread_t is defined as a structure, not an integer. As we looked at it, the problem became more complex. For this scheme to work, it must be possible to change thread identifiers atomically, using a Compare-And-Swap instruction. Otherwise, there will be subtle, mysterious and nasty failures.

We thought about enabling this patch only on Linux. But that approach was not optimal, because the patch could work perfectly well on other platforms where pthread_t is the same size as the machine word, like FreeBSD and Solaris, for example. We could have simply enumerated the operating system names where the patch would be supported, but this too was far from perfect. The ability to use the atomic instructions depends not only on the operating system, but the GCC version (has to be 4.1 or newer) and the capabilities of the CPU (some CPUs do not support Compare-And-Swap!).

So, we developed a dynamic compile-time check for checking whether the target environment (pthread_t type and size, GCC version, CPU capabilities) supports the use of atomic instructions as intended. This seemed to be the best approach, but then another problem arose!

Dynamic compile-time checks (i.e. autoconf) are part of MySQL’s ./configure script. Innobase does not control this script, and so we must live with what is distributed by MySQL. To make things simple for users, we wanted to avoid asking users to re-create ./configure with the autotools. So we simply injected the required checks into the Makefile. This seemed to work fine. Only after the release of the InnoDB Plugin 1.0.3 did a small crack in this approach appear. Fortunately this problem turned out to be easy to fix, and it only occurs if one compiles with “make -j” (in order to perform the make in parallel). See Bug#43740 for details and for a fix.

The bottom line is that sometimes more goes on “behind the scenes” than you might expect when it comes to incorporating a third-party contribution into a product. We are grateful for the Google patch, and are glad that we have been able to include it in the InnoDB Plugin in a way that maximizes its portability, while keeping things as simple as possible for users.

Friday Mar 20, 2009

… and Who Could Forget Mark Callaghan?

Note: this article was originally published on http://blogs.innodb.com on March 20, 2009 by Calvin Sun.

Ooops! Mark Callaghan of Google is one of the world’s experts in InnoDB, and a frequent blogger on its performance characteristics. The InnoDB Plugin 1.0.3 is much more scalable on multi-core systems because of the contributions he has made (along with Ben Handy).

Mark will deliver a keynote on Google’s use of MySQL and InnoDB on Tuesday morning at the MySQL Conference, and another talk on Wednesday. As Mark says, “Although Innodb is not in the title, it is prominent in both of the talks I will do.”

Sorry the other post didn’t mention you, Mark. We won’t forget to be at BOTH your sessions, and we look forward to hearing what you have to say!

Thursday Mar 19, 2009

Talk, Talk, Talk: Innobase Speaks

Note: this article was originally published on http://blogs.innodb.com on March 19, 2009 by Calvin Sun.

That should read “Talks, Talks, Talks” … There will be several presentations by InnoDB experts at the upcoming 2009 MySQL Conference and Expo. Whether you’re a newbie or an experienced DBA deeply familiar with InnoDB, you won’t want to miss these important talks about InnoDB:

  • Innovative Technologies for Performance and Data Protection by Ken and Heikki, Tues, 11:55am
  • Crash Recovery and Media Recovery in InnoDB by Heikki, Wed, 2pm
  • InnoDB Internals: InnoDB File Formats & Source Code Structure by Heikki and Calvin, Wed, 5:15pm
  • Concurrency Control: How it Really Works by Heikki, Thurs, 2:50pm

Note the new times for the last two talks above. Be sure to check the conference schedule! Not much more to say about this topic, at least not here. Hear it all there!

People are Talkin’ … about InnoDB

Note: this article was originally published on http://blogs.innodb.com on March 19, 2009 by Calvin Sun.

InnoDB executes a lot of transactions for MySQL users. And users and technical experts in the InnoDB community exchange lots of information about InnoDB … transacting in ideas … all the time. A great opportunity to participate in some of these transactions is the upcoming MySQL Conference and Expo.

Does your InnoDB-based application need more transactions to execute more quickly and reliably? Come hear these noted speakers present information that can help you get the most out of InnoDB.

Wednesday Mar 11, 2009

Plug In for Performance and Scalability

Note: this article was originally published on http://blogs.innodb.com  on March 10, 2009 by Calvin Sun.

Why should you care about the latest “early adopter” release of the InnoDB Plugin, version 1.0.3? One word: performance! The release introduces these features:

  • Enhanced concurrency & scalability: the “Google SMP patch” using atomic instructions for mutexing
  • More efficient memory allocation: ability to use more scalable platform memory allocator
  • Improved out-of-the-box scalability: unlimited concurrent thread execution by default
  • Dynamic tuning: at run-time, enable or disable insert buffering and adaptive hash indexing

These new performance features can yield up to twice the throughput or more, depending on your workload, platform and other tuning considerations. In another post, we explore some details about these changes, but first, what do these enhancements mean for performance and scalability?

In brief, we’ve tested three different workloads (joins, DBT2 OLTP and a modified sysbench) using a memory-resident database. In all cases, the InnoDB Plugin scales significantly better than the built-in InnoDB in MySQL 5.1. And in some cases, the absolute level of performance is dramatically higher too! The charts below illustrate the kinds of performance gains we’ve measured with release 1.0.3 of the InnoDB Plugin. Your mileage may vary, of course. 

This release of the InnoDB Plugin incorporates a patch made by Ben Handy and Mark Callaghan at Google to improve multi-core scalability by using more efficient synchronization methods (mutexing and rw-locks) to reduce CPU utilization and contention. We’re grateful for this contribution, and you will be too!

Now to our test results …

Joins: The following chart shows the performance gains in performing joins, comparing the built-in InnoDB in MySQL (in blue) with the InnoDB Plugin 1.0.3 (in red).

As you can see from the blue bars in the above chart, with MySQL 5.1 using the built-in InnoDB, the total number of joins the system can execute declines as the number of concurrent users increases. In contrast, the InnoDB Plugin slightly improves performance even with one user, and maintains performance as the number of users rises. This performance improvement is due in large part to the use of atomics for mutexing in the InnoDB Plugin.

Transaction Processing (DBT2): The following chart illustrates a scalability improvement using the OLTP read/write DBT2 benchmark, again comparing the performance of the built-in InnoDB in MySQL with the performance of InnoDB Plugin 1.0.3.

Here, the InnoDB Plugin scales better than the built-in InnoDB from 16 to 32 users and produces about 12% more throughput with 64 concurrent users, as other bottlenecks are encountered or system capacity is reached. This improvement is likewise due primarily to the changes in mutexing.

Modified Sysbench: This test uses a version of the well-known sysbench workload, modified to include queries based on a secondary index, as suggested by Mark Callaghan of Google.

This time, the InnoDB Plugin shows significantly better scalability from 8 to 64 users than the built-in InnoDB in MySQL, yielding as much as 60% more throughput at 64 users. Like the previous examples, this improvement is largely due to the use of atomics for mutexing.

Modified Sysbench with tcmalloc: This test uses the same modified sysbench workload, but shows the difference between the built-in InnoDB (which uses the internal InnoDB memory allocator) and the InnoDB Plugin when using a more scalable memory allocator, in this case tcmalloc.

When the new configuration parameter innodb_use_sys_malloc is set to enable use of the memory allocator tcmalloc, the InnoDB Plugin really shines! Transaction throughput continues to scale, and the actual throughput with 64 users has nearly doubled!
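As a rough sketch of the setup (innodb_use_sys_malloc is the parameter named above; getting tcmalloc loaded into the server, for example via LD_PRELOAD or by linking against it, is a separate step outside this fragment):

[mysqld]
# Use the system/platform memory allocator instead of InnoDB's internal one.
innodb_use_sys_malloc=1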
