ZFS and Databases


Comparing UFS and ZFS out of the box, we find that ZFS performs slightly better than buffered UFS. We also demonstrate that it is possible to get performance improvements with ZFS by following a small set of recommendations, and we have identified a couple of tunings that help performance. These tunings will be on by default in future releases of ZFS.

We (PAE - Performance Engineering) recently completed a study to understand database performance with ZFS. Read on for more details and recommendations. You can also read Roch's blog on the same study.

Databases stress the filesystem in unique ways. Depending on the workload and configuration, you can have thousands of I/O operations per second. These I/Os are usually small (the database block size). All writes are synchronous. Reads can be random or sequential, and some writes are more critical than others. Depending on the configuration, reads are cached by the database program or by the filesystem (if supported/requested). In many cases where filesystems are used, the I/O is concentrated in a few files, which makes the per-file single writer lock very hot under certain configurations such as buffered UFS.

Since I/O is so important for databases, it is not surprising that there are a lot of heavyweight players in this arena. UFS, QFS, and VxFS are quite popular with customers as the underlying filesystem. So how does the new kid on the block (ZFS) do?

We used an internally developed benchmark called OLTP/Net to study database performance with ZFS. OLTP/Net (O-L-T-P slash Net) is an OLTP benchmark that simulates an online store. Its major feature is a set of tuning knobs that control the ratio of network I/O to disk I/O, the read/write nature of the transactions, the number of new connects/disconnects to the database, etc. This makes it quite easy to simulate customer situations in our labs. We use it quite extensively inside Sun to model real-world database performance, and have found and fixed quite a few performance issues using this workload.

For our ZFS study, we used the default settings for OLTP/Net. In this scenario, we have a read/write ratio of 2:1 and a network/disk I/O ratio of 10:1. Since our goal is to run like most customers, we controlled the number of users (load generators) so that the box was 60% utilized.

The hardware configuration consisted of a T2000 with 32x1200MHz CPUs and 32GB RAM, connected to 140 Fibre Channel JBOD disks. We used both Solaris 10 Update 2 and Solaris Nevada build 43 for the analysis. We created one big dynamically striped pool with all the disks and set the recordsize of this pool to 8K. Each disk was divided into two slices. These slices were allocated to UFS and ZFS in round-robin fashion to ensure that each filesystem got an equal number of inner and outer slices.
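The pool setup described above can be sketched with commands like the following (the device names are placeholders, not our actual configuration, which used 140 disks):

```shell
# One big dynamically striped pool across one slice of each data disk
# (c1t0d0s0 etc. are illustrative slice names)
zpool create dbpool c1t0d0s0 c1t1d0s0 c1t2d0s0   # ...and so on for each disk

# Match the pool's record size to the 8K database block size
zfs set recordsize=8k dbpool
```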

Normally for OLTP benchmark situations, we try to use the smallest database block size for best performance. When we started out with our study, we used a block size of 2048, as that gives us the best performance on other filesystems. But since we are trying to do what most customers would do, we switched over to a block size of 8192. We ran two kinds of tests: a cached database as well as a large (not cached) database. Details follow in the sections below.

Recommendations for ZFS and Databases

Most customers use buffered UFS filesystems, and ZFS already performs better than buffered UFS! Since we want to test performance, and we want ZFS to be super fast, we decided to compare ZFS with UFS directio. We noticed that UFS directio performs better than what we get with ZFS out of the box. With ZFS, not only was the throughput much lower, but we used more than twice the amount of CPU per transaction, and we were doing twice the I/O. The disks were also more heavily utilized.
We noticed that we were not only reading in more data, but also doing more I/O operations than needed. A little bit of dtracing quickly revealed that these reads were originating from the write code path! More dtracing showed that these were level 0 blocks being read in for the read-modify-write cycle. This led us to the FIRST recommendation:
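A sketch of the kind of DTrace one-liner that can show where such reads originate (the probe is a real ZFS kernel function, but which functions appear on the stack will vary by build):

```shell
# Aggregate kernel stacks behind ZFS buffer reads; reads triggered by
# the write path show write-side functions in the resulting stacks.
dtrace -n 'fbt::dbuf_read:entry { @[stack()] = count(); }'
```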
Match the database block size with the ZFS record size.
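For example, with an 8K database block size (filesystem names are illustrative). Note that recordsize only affects files written after it is set, so set it before creating the database files:

```shell
# Create a filesystem for the data files and match its record size
# to the 8K database block size
zfs create dbpool/data
zfs set recordsize=8k dbpool/data
```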
A look at the DBMS statistics showed that "log file sync" was one of the biggest wait events. Since the log files were in the same filesystem as the data, we noticed higher latency for log file writes. We then created a different filesystem (in the same pool), but set the record size to 128K, as log writes are typically large. We noticed a slight improvement in our numbers, but not the dramatic improvement we wanted. We then created a separate pool and used that pool for the database log files, and got quite a big boost in performance. This boost can be attributed to the decrease in write latency; the latency of database log writes is critical for OLTP performance. When we used one pool, the extra I/Os to the disks increased the latency of the database log writes, and thus hurt performance. Moving the logs to a dedicated pool improved the latency of those writes, giving a performance boost. This leads us to our SECOND recommendation:
If you have a write-heavy workload, you are better off putting the log files in a separate pool.
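A minimal sketch of the dedicated log pool, assuming placeholder device names and a filesystem layout of our choosing:

```shell
# Dedicated pool for the database log files, on its own disks
zpool create logpool c2t0d0s0 c2t1d0s0

# Log writes are typically large, so use the 128K record size
zfs create logpool/redo
zfs set recordsize=128k logpool/redo
```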
Looking at the extra I/O being generated by ZFS, we noticed that the reads from disk were 64K in size. This was puzzling, as the ZFS recordsize is 8K. More dtracing, and we figured out that the vdev_cache (or software track buffer) reads in quite a bit more than what we request. The default size of the read is 64K (8x more than what we request). Not surprisingly, the ZFS team is aware of this, and there are quite a few change requests (CRs) on the issue:

4933977: vdev_cache could be smarter about prefetching
6437054: vdev_cache: wise up or die
6457709: vdev_knob values should be determined dynamically

Tuning the vdev_cache to read in only 8K at a time decreased the amount of extra I/O by a big factor, and more importantly improved the latency of the reads too. This leads to our THIRD recommendation:
Tune down the vdev_cache using ztune.sh1 until 6437054 is fixed.
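If you don't have ztune.sh handy, the same effect can be sketched with mdb. The tunable name below (zfs_vdev_cache_bshift) is what later builds expose; names vary by release, so check your kernel before poking it:

```shell
# vdev_cache read size is 2^zfs_vdev_cache_bshift bytes.
# Drop it from the default 16 (64K) to 13 (8K).
# This is a live-kernel change; it does not persist across reboot.
echo 'zfs_vdev_cache_bshift/W 0t13' | mdb -kw
```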
OK, we achieved quite a big boost from all the above tunings, but we were still seeing high latency for our I/Os. We saw that the disks were busier during spa_sync time. Having read Eric Kustarz's blog about 'vq_max_pending', we tried playing with that value. We found that setting it to 5 gives us the best performance (for our disks, and our workload). Finding the optimal value involves testing multiple values -- a time-consuming affair. Luckily the fix is in the works:

6457709: vdev_knob values should be determined dynamically

So future releases of ZFS will have this auto-tuned. This leads us to our FOURTH recommendation:
Tune vq_max_pending using ztune.sh1 until 6457709 is fixed.
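Again, ztune.sh is the supported route; as an illustration only, on builds where the knob is exposed as a global (the name zfs_vdev_max_pending is an assumption -- older builds keep it inside the vdev_knob structure), the change looks like:

```shell
# Lower the per-vdev queue depth from the default to 5.
# The right value depends on your disks and workload; test before adopting.
echo 'zfs_vdev_max_pending/W 0t5' | mdb -kw
```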
We tried various other things. For example, we tried changing the frequency of spa_sync. The default is once every 5 seconds. We tried once every second, once every 30 seconds, and even once every hour. While in some cases we saw marginal improvement, we also noticed higher CPU utilization or heavy spinning on mutexes. Our belief is that this is something that is good out of the box, and we recommend you do not change it. We also tried changing the behaviour of the ZIL by modifying the zfs_immediate_write_sz value. Again, we did not see improvements. This leads to our FINAL recommendation:

Let ZFS auto-tune. It knows best. In cases where tuning helps, expect ZFS to incorporate that fix in future releases of ZFS.

In conclusion, you can improve the out-of-the-box performance of databases on ZFS by doing simple things. We have demonstrated that it is possible to run high-throughput workloads with the current release of ZFS, and we have also shown that huge improvements in database performance are quite possible in future versions. Given that ZFS is around a year old, this is amazing!

1ztune.sh: Roch's script.


Good stuff, as we are about to embark on our own measurements in the coming weeks. One thing that would be handy would be a set of graphs that demonstrate the improvements along the way as well as what the end-state relative to all the other file systems was. Also, having a clear understanding of how ZFS fared vs the other "classic" file systems in use for databases would be great.

Posted by Alexei Rodriguez on September 25, 2006 at 11:07 AM PDT #

There was a fair bit of speculation on opensolaris http://www.opensolaris.org/jive/thread.jspa?messageID=56899 and not enough measurement. It's good to see someone doing the experiment :) I think you have answered some of the questions asked in the thread, but there are still a couple unanswered. Is there a plan to open source this OLTP/Net tool?

Posted by Robin McDonald on September 25, 2006 at 03:31 PM PDT #

Regarding your second rule:

"We then created a separate pool and used that pool for the database log files. We got quite a big boost in performance. This performance boost could be attributed to the decrease in the write latency. Latency of database log writes is critical for OLTP performance. When we used one pool, the extra IOs to the disks increased the latency of the database log writes, and thus impacted performance. Moving the logs to a dedicated pool improved the latency of the writes, giving a performance boost."

Why does a second pool result in better latencies? Because this second pool is on dedicated disks? Then this should be added to the rule "second pool on dedicated disks" - how many disks did you use/do you recommend for this pool? - Or just to isolate the database file rw traffic from the logfile write traffic wrt. to read prioritization (then it would not be necessary to have this second pool on separate disks)?

You mention you used JBODs - presumably without a battery backed write cache found in RAID controllers. Using such a write cache for the DBMS logs to reduce latency especially on the commits of short running OLTP transactions is a standard best practice. What would be the impact of having the second zpool for the logs on such a device?

Posted by Franz Haberhauer on September 25, 2006 at 05:13 PM PDT #

I'd like to know why Sun continues doing ZFS testing on JBODs when most of their customers have storage on large scale devices with intelligent RAID controllers. ZFS's performance does not fare so well on such devices.

Posted by Jon Hamlin on September 25, 2006 at 11:09 PM PDT #

You said that all the disks were split into two slices, and alternate inner and outer slices given to UFS and ZFS. IIRC this means that ZFS would never be using a full disk with an EFI label on it, and more importantly, given it isn't using the full disk, it can't work with the disk's write cache. If you use full disks for the zpools, what difference does it make in this context?

Posted by Darren Moffat on September 26, 2006 at 01:14 AM PDT #

Alexei, we are continuing to work on database performance with ZFS. We plan on analyzing the performance of ZFS with different kinds of database workloads (DSS, batch, micro-benchmarks, etc.). For the raw numbers, you can visit Roch's blog at http://blogs.sun.com/roch

Posted by Neelakanth on September 26, 2006 at 01:34 AM PDT #

Robin, this is just the first round of blogs; stay tuned for more data. I do not believe we plan to open source the OLTP/Net benchmark soon. There are some complex ownership issues to be resolved before we do.

Posted by Neelakanth on September 26, 2006 at 01:36 AM PDT #

Darren, yes, you are correct. If we assign a slice, ZFS does not enable the write cache. To make a fair comparison, we turned on the write cache for ZFS and turned off the write cache for UFS. For what it is worth, the write cache does not make a big impact on throughput (except maybe for write caches on the disks that host the database log files). We can always counter the latency of other writes by putting more load (users) on the system.

Posted by Neelakanth on September 26, 2006 at 01:46 AM PDT #

Write cache on/off for ZFS only makes a difference when dealing with simple devices that have no other form of I/O concurrency (some SATA drives with older drivers and no NCQ, as I understand). The reason is that for synchronous ops, ZFS writes to the cache and then flushes the cache. This is not so different from bypassing the cache entirely.

Posted by Roch on September 26, 2006 at 05:54 PM PDT #

About intelligent RAID controllers not being fit for ZFS: I think this is a misconception. The RAID controller just needs to be configured to ignore the cache-flush commands, and things will just scream.

Posted by Roch on September 26, 2006 at 06:03 PM PDT #
