ZFS and OLTP
By user13278091 on Sept. 22, 2006
ZFS and Databases
Now that we had developed enough understanding of the internal dynamics of ZFS, I figured it was time to tackle the next hurdle: running a database management system (DBMS). I know very little about DBMSes myself, so I teamed up with people who have tons of experience with them: my colleagues from Performance Engineering (PAE), Neelakanth (Neel) Nadgir and Sriram Gummuluru, with occasional words of wisdom from Jim Mauro as well.
Note that UFS (with DIO) has been heavily tuned over the years to provide very good support for DBMS. We are just beginning to explore the tweaks and tunings necessary to achieve comparable performance from ZFS in this specialized domain.
We knew that running a DBMS would be a challenge, since a database tickles filesystems in ways that are quite different from other types of loads. We had two goals. Primarily, we needed to understand how ZFS performs in a DB environment and in what specific areas it needs to improve. Secondly, we figured that whatever came out of the work could be used as blog material as well as best-practice recommendations. You're reading the blog material now; watch this space for Best Practices updates as well.
Note that it was not a goal of this exercise to generate data for a world record press-release. (There is always a metric where this can be achieved.)
The workload we use in PAE to characterize DBMSes is called OLTP/Net. This benchmark was developed inside Sun for the purpose of engineering performance into DBMSes. Modeled on common transaction-processing benchmarks, it is OLTP-like but with a higher network-to-disk ratio, which makes it more representative of real-world applications. Quoting from Neel's prose:
"OLTP/Net, the New-Order transaction involves multi-hops as it performs Item validation, and inserts a single item per hop as opposed to block updates "
I hope that means something to you; Neel will be blogging on his own, if you need more info.
The reference performance point for this work is UFS (VxFS would also be an interesting data point, but I'm not tasked with improving that metric). For DB loads we know that UFS directio (DIO) provides a significant performance boost, so that would be our target as well.
Platform & Configuration
Our platform was a Niagara T2000 (8 cores at 1.2GHz, 4 HW threads or strands per core) with 130 36GB disks attached in JBOD fashion. Each disk was partitioned into 2 equal slices: half of the surface was given to a Solaris Volume Manager (SVM) volume onto which UFS would be built, and the other half was given to the ZFS pool.
The benchmark was designed to not fully saturate either the CPU or the disks. While we know that performance varies between inner & outer disk surface, we don't expect the effect to be large enough to require attention here.
Write Cache Enabled (WCE)
ZFS is designed to work safely whether or not a disk's write cache is enabled (WCE). This stays true if ZFS is operating on a disk slice. When given a full disk, ZFS turns the write cache _ON_ as part of the import sequence; when given only a slice, it leaves the cache setting alone. So, to be fair to ZFS' capabilities, we manually turned WCE on when running our tests over ZFS.
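As a sketch, the whole-disk versus slice behavior looks like this (hypothetical device names; on our JBOD we operated on slices, so we toggled the cache by hand):

```shell
# Given a whole disk, ZFS labels it and enables the write cache itself
# (device names are hypothetical):
zpool create tank c1t0d0

# Given only a slice, ZFS leaves the write-cache setting alone:
zpool create tank c1t0d0s0

# To toggle WCE by hand, use format(1M) in expert mode (interactive):
# format -e  ->  select disk  ->  cache  ->  write_cache  ->  enable
```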
UFS is not designed to work with WCE and will put data at risk if WCE is set, so we needed to turn it off for the UFS runs. We resorted to this manual toggling only because we did not have enough disks to give each filesystem a full set of its own; the performance we measured is therefore what would be expected when giving full disks to either filesystem. We note that, for the FC devices we used, WCE does not provide ZFS a significant performance boost on this setup.
For this initial effort we also did not configure any form of redundancy for either filesystem. ZFS RAID-Z does not really have an equivalent feature in UFS, so we settled on a simple stripe. We could eventually configure software mirroring on both filesystems, but we don't expect that it would change our conclusions. Still, it will be interesting follow-up work.
Another thing we know already is that a DBMS's log-writer latency is critical to OLTP performance. In order to improve that metric, it's good practice to set aside a number of disks for the DBMS's logs. With this in hand, we managed to run our benchmark and get our target performance number (in relative terms; higher is better):
UFS/DIO/SVM : 42.5 Separate Data/log volumes
OK, so now we're ready. We load up Solaris 10 Update 2 (S10U2/ZFS), build a log pool and a data pool, and get going. Note that log writers actually generate a pattern of sequential I/O of varying sizes; that should map quite well onto ZFS out of the box. For the DBMS's data pool, however, we expect a very random pattern of reads and writes to DB records. A commonly known ZFS best practice when servicing fixed-record access is to match the ZFS recordsize property to that of the application. We note that UFS, by chance or by design, also works (at least on SPARC) using 8K records.
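The setup just described can be sketched as follows (a minimal two-slice-per-pool example with hypothetical device names; the real configuration striped across many more slices):

```shell
# Dedicated pool for the DBMS logs (sequential writes; no tuning needed):
zpool create logpool c1t0d0s0 c1t1d0s0

# Pool for the DBMS data files:
zpool create datapool c2t0d0s0 c2t1d0s0

# Match the filesystem recordsize to the 8K DB block size. Do this
# before creating the database files, since recordsize only applies
# to newly written files:
zfs set recordsize=8k datapool
```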
2nd run ZFS/S10U2
So for a fair comparison, we set the recordsize to 8K for the data pool and run our OLTP/Net and....gasp!:
ZFS/S10U2 : 11.0 Data pool (8K record on FS) Log pool (no tuning)
So that's no good and we have our work cut out for us.
The role of Prefetch in this result
To some extent we already knew of a subsystem that commonly misbehaves (and which is being fixed as we speak): the vdev-level prefetch code (which I also refer to as the software track buffer). In this code, whenever ZFS issues a small read I/O to a device, it will, by default, go and fetch quite a sizable chunk of data (64K) located at the physical location being read. In itself, this should not increase the I/O latency, which is dominated by the head seek; and since the data is stored in a small fixed-size buffer, we don't expect it to eat up too much memory either. However, in a heavy-duty environment like we have here, every extra byte that moves up or down the data channel occupies valuable space. Moreover, for a large DB, we really don't expect the speculatively read data to be used very much. So for our next attempt we'll tune the prefetch buffer down to 8K.
And the role of the vq_max_pending parameter
But we don't expect this to be quite sufficient here. My DBMS-savvy friends told me that the I/O latency of reads was quite large in our runs. Now, ZFS prioritizes reads over writes, so we thought we would be OK. However, during a pool transaction group sync, ZFS will issue quite a number of concurrent writes to each device; this number is governed by the vq_max_pending parameter, which defaults to 35. Clearly, during this phase, reads, even if prioritized, will take somewhat longer to complete.
3rd run, ZFS/S10U2 - tuned
So I wrote up a script to tune those 2 ZFS knobs. We could then run with a vdev prefetch buffer of 8K and a vq_max_pending of 10. This boosted our performance almost 2X:
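The script essentially boiled down to poking two kernel variables, something like the fragment below. The symbol names are assumptions based on the OpenSolaris source of the day; they are unstable internals, not a supported interface:

```shell
# Cap the concurrent I/Os queued per vdev at 10 (default 35).
# The 0t prefix tells mdb the value is decimal.
echo 'zfs_vdev_max_pending/W 0t10' | mdb -kw

# Shrink the vdev prefetch from 2^16 = 64K to 2^13 = 8K:
echo 'zfs_vdev_cache_bshift/W 0t13' | mdb -kw
```

For persistence across reboots, equivalent `set zfs:...` lines could go into /etc/system.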
ZFS/S10U2 : 22.0 Data pool (8K record on FS) Log pool (no tuning) vq_max_pending : 10 vdev prefetch : 8K
But not quite satisfying yet.
ZFS/S10U2 known bug
We know of something else about ZFS. In the last few builds before S10U2, a little bug made its way into the code base. The effect of this bug was that for a full record rewrite, ZFS would actually read in the old block even though its data was not needed at all. That shouldn't be too bad; perfectly aligned block rewrites of uncached data are not that common... except for databases. Bummer.
So S10U2 is plagued with this issue, which hurts DB performance and has no workaround. Our next step was therefore to move on to the latest ZFS bits.
4th run ZFS/Build 44
Build 44 of our next Solaris version has this particular issue fixed. There we topped our past performance with:
ZFS/B44 : 33.0 Data pool (8K record on FS) Log pool (no tuning) vq_max_pending : 10 vdev prefetch : 8K
Compare this to umpteen years of super-tuned UFS:
UFS/DIO/SVM : 42.5 Separate Data/log volumes
I think at this stage of ZFS, the results are neither great nor bad. We have achieved:
UFS/DIO : 100% (reference)
UFS : xx (no directio; to be updated)
ZFS Best : 75% (best tuned config with latest bits)
ZFS S10U2 : 50% (best tuned config)
ZFS S10U2 : 25% (simple tuning)
To achieve acceptable performance levels, you need:

The latest ZFS code base. ZFS improves fast these days, and we will need to keep tracking releases for a little while. The current OpenSolaris release, as well as the upcoming Solaris 10 Update 3 (this fall), should perform, for these tests, as well as the Build 44 results shown here.

One data pool and one log pool. It is common practice to partition HW resources when we want proper isolation. Going forward, I think we will eventually get to the point where this is not necessary, but it seems an acceptable constraint for now.

A tuned vdev prefetch. The code is being worked on; I expect that in the near future this will not be necessary.

A tuned vq_max_pending. This may take a little longer. In a DB workload, latency is key and throughput secondary. There are a number of ideas to be tested that will help ZFS improve both average latency and latency fluctuations. This will help the intent log (O_DSYNC write) latency as well as reads.
As those improvements come out, they may well allow ZFS to catch up with or surpass our best UFS numbers. When you match that kind of performance with all the usability and data-integrity features of ZFS, it becomes a proposition that's hard to pass up.