Monday Apr 16, 2012

Optimizing neighbor flush behavior

Note: this article was originally published on April 16, 2012 by Yasufumi Kinoshita.

The performance of InnoDB’s flush_list flushing determines the baseline performance of workloads that modify data. So it is important to optimize the flush behavior. In this post we’ll consider how to optimize the neighbor-flushing behavior.

Factor 1: Characteristics of storage

Depending on what limits your storage’s write IO throughput, you can classify the storage as either “write amount bound” (limited by the number of bytes written) or “write times bound” (limited by the number of write requests). The minimum unit of the InnoDB datafile is the page (16KB or less). InnoDB attempts to combine contiguous pages into a single IO, up to a maximum of 1 extent (1MB).

<one HDD>: Almost always “write times bound”, because head-seek time is the dominant factor in the access time of an HDD, and a write of around 1MB can be handled with a single head seek.

<RAID-HDD>: It depends on the stripe size of the RAID. In many cases the stripe size is set to 256KB ~ 1MB (much larger than the datafile page size), with the intention that one IO goes to one HDD (both to keep the sequential access advantage of HDD and to keep the ability to serve IO requests in parallel across the several HDDs in the RAID). With such a typical stripe size, RAID-HDD is “write times bound”. (With a small stripe size close to the page size, it would be “write amount bound”. But I don’t recommend such a small stripe size from the viewpoint of this post, because it just loses the sequential access advantage.)

<SSD>: It depends on the internal write unit of the SSD. For newer high-end SSDs, the unit is 4KB or more, but not larger than the InnoDB page size. Such high-end SSDs are “write amount bound”. However, the unit size varies widely with each SSD’s internal implementation. Low-end or older SSDs might have a unit size over 1MB (and their throughput might be slow) and might be “write times bound”. You can estimate the write unit size of your SSD by running a random write benchmark with several block sizes (4, 8, 16KB, …, 1MB): the largest block size at which the SSD still behaves as “write times bound” is the expected unit size.
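The estimation step above could be automated roughly as follows. This is only a sketch: it assumes you have already measured random-write IOPS per block size with a benchmark tool of your choice, and the function name and the `tolerance` threshold are hypothetical choices, not anything InnoDB provides. The idea is that while the device is “write times bound”, IOPS stays roughly flat as the block size grows; once the block size exceeds the write unit, IOPS starts to drop.

```python
def estimate_write_unit(iops_by_block_size, tolerance=0.8):
    """Return the largest block size (KB) that still looks "write times bound".

    iops_by_block_size: dict mapping block size in KB to measured
        random-write IOPS (you supply the measurements).
    tolerance: hypothetical threshold. A block size counts as
        "write times bound" while its IOPS is at least
        tolerance * (IOPS of the smallest tested block size).
    """
    sizes = sorted(iops_by_block_size)
    baseline = iops_by_block_size[sizes[0]]
    unit = sizes[0]
    for size in sizes:
        if iops_by_block_size[size] >= tolerance * baseline:
            unit = size   # IOPS still roughly flat: same cost per request
        else:
            break         # IOPS dropped: the byte count now dominates
    return unit
```

If the result equals the smallest block size you tested, the real unit may be that size or smaller, which for this post simply means the device is “write amount bound” at the InnoDB page size.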

Factor 2: Oldest modified age

The redo log in InnoDB is used in a circular fashion. The reusable redo space is limited by the oldest modification in the oldest modified block. That is, the oldest modified age, which equals current_LSN (Log Sequence Number) – the oldest modification LSN, cannot be higher than the log capacity of the redo log files. When there is no reusable redo space available, other modification operations cannot proceed until the oldest modified age is decreased by flushing the oldest dirty pages.
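The relationship between the oldest modified age and the log capacity can be written down as a small model. This is a simplified sketch of the arithmetic described above, not InnoDB’s actual code: the function names are hypothetical, and it ignores any safety margins the server keeps in practice.

```python
def oldest_modified_age(current_lsn, oldest_modification_lsn):
    """Age of the oldest unflushed modification, in bytes of redo."""
    return current_lsn - oldest_modification_lsn


def reusable_redo_space(current_lsn, oldest_modification_lsn, log_capacity):
    """Redo bytes that can still be written before modifications must wait."""
    age = oldest_modified_age(current_lsn, oldest_modification_lsn)
    return max(log_capacity - age, 0)


def must_wait_for_flush(current_lsn, oldest_modification_lsn, log_capacity):
    """True when no reusable redo space is left (simplified model)."""
    return reusable_redo_space(
        current_lsn, oldest_modification_lsn, log_capacity) == 0
```

In this model, flushing the oldest dirty page raises the oldest modification LSN, which lowers the age and frees redo space again.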

The flushing throughput of the oldest dirty pages decides the workload throughput. So the important question is how to use the limited write IO capacity effectively for flushing the “oldest” dirty pages.

Tuning flush_list flushing to be effective

The first priority of flushing is to reduce the oldest modified age, assuming there is no shortage of free blocks. So “flushing the oldest blocks only” is the basic strategy.

For “write amount bound” storage (e.g. high-end SSD), this is already the best strategy. It is equivalent to “innodb_flush_neighbors = false”.

On the other hand, for “write times bound” storage (e.g. HDD based), the contiguous dirty neighbors of the oldest dirty pages can be flushed without wasting write IO capacity, because of the sequential access advantage. So flushing the contiguous pages as well is really worth doing. But non-contiguous, non-oldest blocks should not be flushed at the same time, because a non-contiguous flush becomes another IO request and has a high probability of being treated as another raw block write in the storage (wasting write IO capacity).
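The “contiguous neighbors only” selection described above can be sketched as follows. This is an illustration under stated assumptions, not InnoDB’s implementation: the function name is hypothetical, dirty pages are modeled as a plain set of page numbers, and an extent is assumed to hold 64 pages (1MB / 16KB).

```python
PAGES_PER_EXTENT = 64  # assumption: 1MB extent / 16KB page


def contiguous_dirty_run(dirty_pages, oldest_page,
                         pages_per_extent=PAGES_PER_EXTENT):
    """Return the page numbers to flush together with oldest_page.

    Only dirty pages that form an unbroken run around oldest_page,
    within the same extent, are included, so the run can go out as
    a single IO request.
    """
    if oldest_page not in dirty_pages:
        raise ValueError("oldest_page must be dirty")
    extent_start = (oldest_page // pages_per_extent) * pages_per_extent
    extent_end = extent_start + pages_per_extent  # exclusive bound

    low = oldest_page
    while low - 1 >= extent_start and (low - 1) in dirty_pages:
        low -= 1   # extend the run downward while neighbors are dirty
    high = oldest_page
    while high + 1 < extent_end and (high + 1) in dirty_pages:
        high += 1  # extend the run upward while neighbors are dirty
    return list(range(low, high + 1))
```

A dirty page elsewhere in the same extent that is separated from the run by a clean page is left out, since writing it would cost a separate IO request.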

The traditional flush_neighbors implementation in InnoDB also flushes non-contiguous dirty blocks. That is not the best behavior for either type of storage, “write times bound” or “write amount bound”. In the MySQL labs release 2012 we have fixed this behavior to flush contiguous pages only, for “write times bound” storage.


In the end, the conclusion is as follows:

  • For HDD or HDD-RAID (stripe size about 256KB ~ 1MB): use the new flush_neighbors (flushing contiguous dirty blocks only)
  • For SSD (internal write unit size <= InnoDB data page size): disable flush_neighbors
