Improving MySQL/Innodb/ZFS read-only performance by 5x using SSDs for L2ARC
By Charles Suresh on Aug 18, 2009
Unfortunately, almost all of these sites focus on the use-case where the entire DB is stored on the SSD. This is in many ways the wrong way to use an SSD, given its high price-per-byte compared to disk. Ideally, you want the database blocks with high cache locality to live on the SSD while everything else remains on the high-capacity, low-price-per-byte disks.
Part of the reason that all of these bloggers store the entire database in the SSD, rather than just caching the hot parts there, probably stems from the fact that there aren't many filesystems out there that support this separation of hot-vs-not data. My knowledge of filesystem support for SSDs on other operating systems is pretty limited, but we do have that support in ZFS on Solaris (using the L2ARC and/or the ZIL). So as far as I know/care, ZFS is the only filesystem that can leverage the SSD as a cache, making use of the best of both worlds: the low latency of SSDs and the high capacity of disks.
I'll start off with a description of the test rig: we chose a relatively low-end configuration assumed to resemble what a typical MySQL configuration looks like out in the real world. The MySQL server is a Sun x4150 with a relatively small number of disks (4). The purpose of the study was to look at what benefits (if any) would accrue from using one or more SSDs in the ZFS L2ARC for a read-only test using sysbench. The actual ZFS commands used for the setup are listed in the footnotes. The SSDs used were a pair of 32 GByte Intel X25-Es. I used a database with 800M records (using about 190 GBytes of disk space), and the throughput was determined using a 5-minute sysbench read-only run with 10 threads scanning over all 800M rows. I'm hoping this represents a fairly typical MySQL config where the disk/IO subsystem is the bottleneck. If the disks aren't the bottleneck, there obviously isn't any point in using an SSD.
Testing MySQL/innodb using sysbench in the above config showed that with 1 SSD configured as the ZFS L2ARC, the throughput doubled compared to a baseline using only disks (config #1 listed below) and with the 2 SSD configuration, throughput went up ~5x. Now 5x is great but like the Rolling Stones said: "I can't get no satisfaction ... ". And like them, "I tried, and I tried" to look at why I wasn't getting the ideal 100x throughput I should expect if I'm to believe all the rah-rah-rah I hear about SSDs.
Now, the SSD used in the ZFS L2ARC is meant to be used as a cache and a cache is good only if the workload using it has good locality. So I decided to take a look at all the locations in the database that are being read as part of the test to check for temporal (and maybe even spatial) locality.
It turns out that this workload (for the environment it is running in) had very little locality of reference. Of all the database blocks read, only ~5% of the blocks are looked up more than once within a run, and only ~0.2% were referenced more than 2 times. The relationship between a cache's hit ratio and the resulting reduction in latency should be available in most CS architecture textbooks, but I didn't have one handy, and neither Wikipedia nor a cursory Google search turned up any sites that talk about it. So, for reference, here are some equations that express these relationships:
- h = hit ratio
- L = latency of the back-end
- l = latency of the cache
- r = latency ratio between back-end and cache = L/l
- x = latency reduction (i.e. overall latency with cache should be L/x)
- h = r*(x-1)/(x*(r-1)) # mnemonic!
- x = r/(r*(1-h)+h) # same thing, except x is expressed in terms of r and h
- r = x*h/(1+x*(h-1)) # same thing, except r is expressed in terms of x and h
If we let r=100 and h=0.05, then x = 100/95.05 ≈ 1.05, i.e. only about a 5% reduction in latency.
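These identities all follow from the standard expected-latency model, average latency = h*l + (1-h)*L. As a sanity check, here is a small Python sketch (variable names match the definitions above) that computes x for the r=100, h=0.05 case and confirms the three forms agree with each other:

```python
def x_from(r, h):
    """Latency reduction factor given latency ratio r and hit ratio h."""
    return r / (r * (1 - h) + h)

def h_from(r, x):
    """Hit ratio needed to achieve reduction x at latency ratio r."""
    return r * (x - 1) / (x * (r - 1))

def r_from(x, h):
    """Latency ratio required to achieve reduction x at hit ratio h."""
    return x * h / (1 + x * (h - 1))

r, h = 100, 0.05          # SSD ~100x faster than disk, 5% hit ratio
x = x_from(r, h)          # 100/95.05: only a ~5% latency reduction
print(round(x, 4))

# The three rearranged forms are mutually consistent:
assert abs(h_from(r, x) - h) < 1e-9
assert abs(r_from(x, h) - r) < 1e-6
```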
This should lead us to the logical/theoretical conclusion that the L2ARC should give almost no benefit at all (well, literally it should give ~5% improvement for a ~5% hit ratio). But, as the data showed, we were getting much more than a 5% improvement. So the new line of questioning was to explain the 2x-5x improvement in the face of a low locality of reference.
Since the only way to improve a cold cache workload is to somehow prefetch the blocks, it was clear that some kind of prefetching was going on. The ideal prefetch is when you have an Oracle backing MySQL (heh, ;-) this is still a techie blog - we aren't talking M&A here!) that can predict the future and serve up the blocks right before they are needed. Lacking such an oracle, we settle for the default ZFS block size to do the prefetch for us. The way this works is that the default 128K ZFS record holds eight 16K innodb blocks, so when innodb asks for block x, ZFS reads the whole record and caches blocks x .. x+7 in the ARC, and these blocks eventually make their way to the L2ARC. Since this is a workload with poor locality, there is a high probability that block x will never be referenced again for the rest of the test, but there is a much higher likelihood that one of the blocks x+1..x+7 will eventually be accessed and hit either in the ARC or the bigger L2ARC.
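To see why record-level prefetch helps even when no block is ever re-read, here is a toy simulation. All the sizes, the FIFO eviction policy, and the uniform-random read stream are my own illustrative assumptions, not the ARC's actual behavior or the real test's numbers. Each "record" holds 8 blocks; the read stream touches every block exactly once, so a cache of individual blocks never hits, while a cache of whole records hits whenever a previously fetched sibling block is still resident:

```python
import random

RECORD_BLOCKS = 8      # a 128K ZFS record holds eight 16K innodb blocks
NUM_BLOCKS = 80_000    # toy database size (illustrative)
CACHE_RECORDS = 4_000  # record-cache capacity (illustrative)

random.seed(1)
# Every block is read exactly once, in random order: zero re-reference.
reads = random.sample(range(NUM_BLOCKS), NUM_BLOCKS)

def hit_ratio(granularity, capacity):
    """FIFO cache keyed by the block-group of the given granularity."""
    cached, order, hits = set(), [], 0
    for b in reads:
        key = b // granularity
        if key in cached:
            hits += 1
        else:
            cached.add(key)
            order.append(key)
            if len(cached) > capacity:
                cached.discard(order.pop(0))
    return hits / len(reads)

# Same byte budget for both caches.
per_block = hit_ratio(1, CACHE_RECORDS * RECORD_BLOCKS)
per_record = hit_ratio(RECORD_BLOCKS, CACHE_RECORDS)
print(f"per-block cache:  {per_block:.2f} hit ratio")   # 0.00: no re-reads
print(f"per-record cache: {per_record:.2f} hit ratio")  # >0: prefetch hits
```

The per-block cache only ever holds blocks that were already demanded, so with zero re-reference it can never hit; the per-record cache gets its hits entirely from the seven sibling blocks dragged in on each miss, which is the same mechanism at work in the real test.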
So despite having little to no cache locality, prefetching database blocks into the cache turns out to be what causes the win in this test. With more SSDs, the hit rate obviously improves, and this is why we see the huge improvements even when the trace analysis shows that it shouldn't improve by much (at least theoretically).
Tuning down the block size of the ZFS file containing the MySQL/innodb database from the default of 128K to match the innodb block size of 16K is usually a win (as I discussed in my previous blog entry). But this tuning can be sub-optimal when using an SSD as a ZFS L2ARC for a read-mostly workload.
As you can see, the caching happened as a side-effect of leaving the ZFS block size at the default. If we had followed the conventional wisdom of bumping down the block size, it is clear that none of the blocks needed at some later point in time could have been prefetched, and this would result in slightly degraded performance. We could even go the other way and bump up the ZFS block size so that even more caching could be done. The actual benefit of doing this will depend on the size of your L2ARC and the "free" disk bandwidth that is available to feed the larger block size. I was using a relatively low-end config with 4 disks that were already running at full throttle, so I decided it wasn't worthwhile to do the experiment of increasing the ZFS block size.
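One way to frame that trade-off is the per-miss arithmetic: every cache miss drags in a whole ZFS record, so a larger record size buys more prefetched innodb blocks per miss at the cost of proportionally more disk bytes per miss. A quick sketch (the 256K row is the hypothetical "go the other way" case, not something I tested):

```python
INNODB_BLOCK = 16 * 1024  # innodb block size used in this test

for recordsize_kb in (16, 128, 256):
    record = recordsize_kb * 1024
    blocks = record // INNODB_BLOCK  # innodb blocks per ZFS record
    print(f"recordsize {recordsize_kb:>3}K: {blocks - 1:>2} block(s) "
          f"prefetched per miss, {blocks}x disk bytes per miss")
```

At 16K you get no prefetch at all, which is exactly why the usual block-size-matching tip backfires here; whether 256K pays off depends on having spare disk bandwidth to absorb the 16x reads per miss.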
So to summarize:
- Despite using a cache-busting, almost pure-random access workload, SSDs used in the ZFS L2ARC can give a pretty good performance boost. The corollary to that, of course, is that if your database workload has good cache locality, your performance could be even better.
- Matching the ZFS block size to the database block size is in general a good idea. This particular test, where I do a read-only test with L2ARC enabled using SSDs, turns out to be one of those corner cases where that tuning tip hurts more than it helps.
For those who want exact details about the configuration I tested, here you go:
- Hardware: Sun x4150, 2xQuad core Intel Xeon 3 GHz CPUs, 8GB memory
- Software: Solaris (Nevada 115) + MySQL 5.4
- Configuration: The MySQL database is on a ZFS filesystem which is striped over 4 disks. I have 2 Intel X25-E SSDs and I tested 3 configurations: