ZFS and Databases

Now that we have a reasonable understanding of the internal dynamics of ZFS, I figured it was time to tackle the next hurdle: running a database management system (DBMS). I know very little about DBMSes myself, so I teamed up with people who have tons of experience with them: my colleagues from Performance Engineering (PAE), Neelakanth (Neel) Nagdir and Sriram Gummuluru, with occasional words of wisdom from Jim Mauro as well.

Note that UFS (with DIO) has been heavily tuned over the years to provide very good support for DBMS. We are just beginning to explore the tweaks and tunings necessary to achieve comparable performance from ZFS in this specialized domain.

We knew that running a DBMS would be a challenge, since a database tickles filesystems in ways that are quite different from other types of loads. We had two goals. Primarily, we needed to understand how ZFS performs in a DB environment and in what specific areas it needs to improve. Secondly, we figured that whatever came out of the work could be used as blog material as well as best-practice recommendations. You're reading the blog material now; also watch this space for Best Practices updates.

Note that it was not a goal of this exercise to generate data for a world record press-release. (There is always a metric where this can be achieved.)


The workload we use in PAE to characterize DBMSes is called OLTP/Net. This benchmark was developed inside Sun for the purpose of engineering performance into DBMSes. Modeled on common transaction processing benchmarks, it is OLTP-like but with a higher network-to-disk ratio. This makes it more representative of real-world applications. Quoting from Neel's prose:

        "OLTP/Net, the New-Order transaction involves multi-hops as it
        performs Item validation, and inserts a single item per hop as
        opposed to block updates "

I hope that means something to you; Neel will be blogging on his own, if you need more info.

Reference Point

The reference performance point for this work is UFS (VxFS would also be an interesting data point, but I'm not tasked with improving that metric). For DB loads we know that UFS directio (DIO) provides a significant performance boost, so UFS with DIO is our target.

Platform & Configuration

Our platform was a Niagara T2000 (8 cores @ 1.2 GHz, 4 HW threads or strands per core) with 130 disks of 36 GB each attached in JBOD fashion. Each disk was partitioned into 2 equal slices: half of the surface was given to a Solaris Volume Manager (SVM) volume on which UFS would be built, and the other half was given to a ZFS pool.

The benchmark was designed to not fully saturate either the CPU or the disks. While we know that performance varies between inner & outer disk surface, we don't expect the effect to be large enough to require attention here.

Write Cache Enabled (WCE)

ZFS is designed to work safely, whether or not a disk write-cache is enabled (WCE). This stays true if ZFS is operating on a disk slice. However, when given a full disk, ZFS will turn _ON_ the write cache as part of the import sequence. That is, it won't enable write cache when given only a slice. So, to be fair to ZFS capabilities we manually turned ON WCE when running our test over ZFS.

UFS is not designed to work with WCE and will put data at risk if WCE is set, so we needed to turn it off for the UFS runs. We had to toggle WCE by hand like this because we did not have enough disks to give each filesystem its own full set. The performance we measured is therefore what would be expected when giving full disks to either filesystem. We note that, for the FC devices we used, WCE does not give ZFS a significant performance boost on this setup.

No Redundancy

For this initial effort we also did not configure any form of redundancy for either filesystem. ZFS RAID-Z does not really have an equivalent feature in UFS, so we settled on a simple stripe. We could eventually configure software mirroring on both filesystems, and we don't expect that to change our conclusions, but it will still be interesting follow-up work.

DBMS logging

Another thing we already know is that a DBMS's log writer latency is critical to OLTP performance. To improve on that metric, it's good practice to set aside a number of disks for the DBMS's logs. With this in hand, we managed to run our benchmark and get our target performance number (in relative terms, higher is better):

        UFS/DIO/SVM :           42.5
        Separate Data/log volumes


OK, so now we're ready. We load up Solaris 10 Update 2 (S10U2/ZFS), build a log pool and a data pool, and get going. Note that log writers actually generate a pattern of sequential I/O of varying sizes. That should map quite well to ZFS out of the box. But for the DBMS's data pool, we expect a very random pattern of reads and writes to DB records. A commonly known ZFS best practice when servicing fixed-record access is to match the ZFS recordsize property to the record size of the application. We note that UFS, by chance or by design, also works (at least on SPARC) using 8K records.
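
To get a feel for why the recordsize match matters, here is a back-of-the-envelope sketch. It assumes the ZFS default recordsize of 128K, an 8K DB block, and aligned accesses; the function name is made up for illustration and is not part of any ZFS tool.

```python
KIB = 1024


def bytes_touched_per_update(db_block: int, recordsize: int) -> int:
    """Bytes the filesystem handles to update one aligned DB block,
    assuming it always works on whole records."""
    records = -(-db_block // recordsize)  # ceiling division
    return records * recordsize


# Assumed sizes: 8K DB blocks, 128K default recordsize vs. a matched 8K.
default = bytes_touched_per_update(8 * KIB, 128 * KIB)
matched = bytes_touched_per_update(8 * KIB, 8 * KIB)
amplification = default // matched

print(f"default recordsize: {default} bytes per 8K update")
print(f"matched recordsize: {matched} bytes per 8K update")
print(f"amplification: {amplification}x")
```

On a live system the property is set per filesystem, along the lines of `zfs set recordsize=8k <pool>/<filesystem>`, and it only applies to files written after the change.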

2nd run ZFS/S10U2

So for a fair comparison, we set the recordsize to 8K for the data pool and run our OLTP/Net and....gasp!:

        ZFS/S10U2       :       11.0
        Data pool (8K record on FS)
        Log pool (no tuning)

So that's no good and we have our work cut out for us.

The role of Prefetch in this result

To some extent we already knew of a subsystem that commonly misbehaves (and which is being fixed as we speak): the vdev-level prefetch code (that I also refer to as the software track buffer). In this code, whenever ZFS issues a small read I/O to a device, it will, by default, go and fetch quite a sizable chunk of data (64K) located at the physical location being read. In itself, this should not increase the I/O latency, which is dominated by the head seek, and since the data is stored in a small fixed-size buffer we don't expect this to eat up too much memory either. However, in a heavy-duty environment like we have here, every extra byte that moves up or down the data channel occupies valuable space. Moreover, for a large DB, we really don't expect the speculatively read data to be used very much. So for our next attempt we'll tune down the prefetch buffer to 8K.
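
As a rough illustration of the data-channel cost, the sketch below compares the bytes moved per second with a 64K and an 8K prefetch size. The disk count comes from our platform; the 100 random reads per second per disk is a purely hypothetical figure, not a measurement, and the model pretends every disk serves data reads.

```python
KIB = 1024
MIB = 1024 * KIB


def channel_load_mib_s(disks: int, reads_per_disk_s: int, bytes_per_read: int) -> float:
    """Aggregate read traffic in MiB/s if every disk serves the same read rate."""
    return disks * reads_per_disk_s * bytes_per_read / MIB


DISKS = 130             # from the test platform
READS_PER_DISK_S = 100  # hypothetical figure, for illustration only

with_64k = channel_load_mib_s(DISKS, READS_PER_DISK_S, 64 * KIB)
with_8k = channel_load_mib_s(DISKS, READS_PER_DISK_S, 8 * KIB)

print(f"64K prefetch: {with_64k} MiB/s")
print(f" 8K prefetch: {with_8k} MiB/s")
print(f"ratio: {with_64k / with_8k}x")
```

Whatever the real read rate is, the ratio stays at 8x, and most of that extra traffic is data the DBMS never asked for.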

And the role of the vq_max_pending parameter

But we don't expect this to be quite sufficient here. My DBMS-savvy friends told me that the I/O latency of reads was quite large in our runs. Now, ZFS prioritizes reads over writes, so we thought we should be OK. However, during a pool transaction group sync, ZFS will issue quite a number of concurrent writes to each device. That number is capped by the vq_max_pending parameter, which defaults to 35. Clearly, during this phase a read, even if prioritized, will take somewhat longer to complete.
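
A very crude way to picture this, assuming the device services its queued I/Os one at a time in arrival order (real disks reorder, so treat this as an upper-bound sketch) and a hypothetical 5 ms per I/O:

```python
def worst_case_read_ms(pending_writes: int, service_ms: float) -> float:
    """Time until a read completes if it lands behind a full queue of writes
    and the device handles I/Os strictly in arrival order (an assumption)."""
    return (pending_writes + 1) * service_ms


SERVICE_MS = 5.0  # hypothetical per-I/O service time, not a measured value

default_queue = worst_case_read_ms(35, SERVICE_MS)  # vq_max_pending default
tuned_queue = worst_case_read_ms(10, SERVICE_MS)    # a smaller queue depth

print(f"behind 35 writes: {default_queue} ms")
print(f"behind 10 writes: {tuned_queue} ms")
```

The absolute numbers mean little, but the shape is the point: once writes are handed to the device, a prioritized read still has to wait behind them, and a shorter queue bounds that wait.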

3rd run, ZFS/S10U2 - tuned

So I wrote up a script to tune those 2 ZFS knobs. We could then run with a vdev prefetch buffer of 8K and a vq_max_pending of 10. This boosted our performance almost 2X:

        ZFS/S10U2       :       22.0
        Data pool (8K record on FS)
        Log pool (no tuning)
        vq_max_pending : 10
        vdev prefetch : 8K

But not quite satisfying yet.

ZFS/S10U2 known bug

We know of something else about ZFS. In the last few builds before S10U2, a little bug made its way into the code base. The effect of this bug was that for a full record rewrite, ZFS would actually read in the old block even though that data is not needed at all. Shouldn't be too bad, since perfectly aligned block rewrites of uncached data are not that common... except for databases, bummer.

So S10U2 is plagued with this issue, which affects DB performance and has no workaround. Our next step was therefore to move on to the latest ZFS bits.

4th run ZFS/Build 44

Build 44 of our next Solaris version has long had this particular issue fixed. There we topped our past performance with:

        ZFS/B44         :       33.0
        Data pool (8K record on FS)
        Log pool (no tuning)
        vq_max_pending : 10
        vdev prefetch : 8K

As we compare to umpty-years of super tuned UFS:

        UFS/DIO/SVM : 42.5
        Separate Data/log volumes


I think at this stage of ZFS, the results are neither great nor bad. We have achieved:

        UFS/DIO   : 100 %
        UFS       : xx   no directio (to be updated)
        ZFS Best  : 75%  best tuned config with latest bits. 
        ZFS S10U2 : 50%  best tuned config.
        ZFS S10U2 : 25%  simple tuning.
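
The percentages above are rounded. For reference, this small sketch recomputes them from the relative scores reported in each run; the labels in the dictionary are shorthand of my own.

```python
UFS_DIO = 42.5  # reference score: UFS/DIO/SVM

# Relative scores reported for each ZFS run.
results = {
    "ZFS/S10U2, recordsize only": 11.0,
    "ZFS/S10U2, tuned": 22.0,
    "ZFS/B44, tuned": 33.0,
}

# Express each score as a percentage of the UFS/DIO reference.
relative = {name: 100 * score / UFS_DIO for name, score in results.items()}

for name, pct in relative.items():
    print(f"{name}: {pct:.1f}% of UFS/DIO")
```

This works out to roughly 26%, 52% and 78% of the UFS/DIO number.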

To achieve acceptable performance levels, we needed:

The latest ZFS code base: ZFS improves fast these days, and we will need to keep tracking releases for a little while. The current OpenSolaris release, as well as the upcoming Solaris 10 Update 3 (this fall), should perform as well on these tests as the Build 44 results shown here.

1 data pool and 1 log pool: it is common practice to partition HW resources when we want proper isolation. Going forward, I think we will eventually get to the point where this is not necessary, but it seems an acceptable constraint for now.

Tuned vdev prefetch: the code is being worked on. I expect that in the near future this will not be necessary.

Tuned vq_max_pending: that may take a little longer. In a DB workload, latency is key and throughput secondary. There are a number of ideas that need to be tested which will help ZFS improve both average latency and latency fluctuations. This will help both the intent log (O_DSYNC write) latency and read latency.

Parting Words

As these improvements come out, they may well allow ZFS to catch up with or surpass our best UFS numbers. When you match that kind of performance with all the usability and data integrity features of ZFS, that's a proposition that becomes hard to pass up.

How are the disks partitioned? I think that could be of impact. Is UFS or ZFS first? A hard drive performs best at the beginning of the disk and worse further on. example: http://www.simplisoftware.com/Public/Content/Pages/Products/Benchmarks/HdTachUsage4.jpg

Posted by J. Resoort on September 22, 2006 at 05:08 AM MEST #

We don't think it matters much for this kind of benchmark, whose performance depends a lot more on random IOPS (head movements) than on pure throughput. Q: do we see inner/outer cylinders affecting IOPS capability? Our judgement is that it should not matter.

Posted by Roch on September 22, 2006 at 07:22 AM MEST #

How do these numbers compare to raw partitions? Rumour has it that UFS/DIO/SVM is 90-95% of raw partitions. I understand that managing raw partitions is a PITA, but if I don't have too many spindles/disks maybe raw partitions are still a viable option. -- prasad

Posted by Prasad on September 22, 2006 at 11:06 AM MEST #

J. Resoort, we assigned the slices on each disk on a round-robin basis, so both UFS and ZFS got the same number of inner and outer slices.

Posted by Neel on September 22, 2006 at 11:46 AM MEST #

Explain again why it is safe to turn on write caching for a database running on ZFS? What happens if a transaction is committed (but not written to disk because of write caching), and then there is a power cut?

Posted by AM on September 22, 2006 at 07:02 PM MEST #

Synchronous writes are handled by the ZIL. The ZIL may well issue a number of concurrent writes to some of the vdevs (which go to the write cache) _but_ it will not return control to the application before it has flushed the caches of all the devices in question. When a synchronous write completes, the data is always on stable storage.

Posted by Roch on September 25, 2006 at 10:24 AM MEST #
