UFS versus ZFS with PostgreSQL, the saga continues
By paulvandenbogaard on Jun 17, 2008
The previous posts described some tests I did using a setup that was "tuned" to focus on CPU consumption. The idea was to minimize IO by caching as much of the needed data as possible and by using plenty of disk arrays with write caches to minimize write latency.
In my current tests I am interested in the IO components and how they influence throughput under two different file systems. In order to create a significant IO component the number of disks was limited to nine internal ones: no array, no write caches. In addition the load generator was changed to significantly increase the working set. Indeed, my 16GB of shared buffers is not enough to hold this working set.
The initial configuration was "tuned" to perform a checkpoint once every 15 minutes. Here I will describe the results when using a smaller checkpoint interval and varying the time "available" to write out all data for a checkpoint. The PostgreSQL parameters in effect here are checkpoint_timeout and checkpoint_completion_target.
The checkpoint_timeout was set to do a checkpoint every 300 seconds. The checkpoint_completion_target used to be 50%, which tells PostgreSQL to write out all the dirty buffers in a period of time that is 50% of the timeout value. Indeed, the old graphs show this nicely: during the first half of each checkpoint period a lot of IO can be seen that is not present during the second half.
The idea behind the tests described here was to spread the IO over time. This was done by reducing the timeout from 15 to 5 minutes and increasing the completion_target to 85%. This turned out not to be optimal: probably too many buffers were re-dirtied, so the same blocks were written over and over.
A smaller checkpoint_timeout seems fine for the time being. Increasing this parameter minimizes the rewriting of blocks, but the number of blocks to be written grows, resulting in a much more extreme IO burst once the checkpoint starts. With that same reduced checkpoint_timeout the completion_target was then set to 15% to see what happens. The results of the 85% and 15% tests can be seen in the graphs below. First the throughput graph.
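For reference, both settings live in postgresql.conf. The fragment below reflects the values described above; note that PostgreSQL expresses the completion target as a fraction, so 85% is written as 0.85 (the exact file layout is my sketch, but the parameter names are the real ones):

```
# postgresql.conf -- checkpoint settings used in these runs
checkpoint_timeout = 5min            # was 15min in the earlier tests
checkpoint_completion_target = 0.85  # spread checkpoint writes over 85% of the interval
                                     # (set to 0.15 for the 15% runs)
```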
The nice thing here is that ZFS shows the best throughput (on average about 50K TPM), and indeed with the 15% completion_target. The same target for UFS causes huge fluctuations in the throughput graph. Now let's look at the IOPS graph.
Although the pattern is clear, let's focus on a smaller time interval to see what happens during one checkpoint interval.
The 15% setting seems too demanding for the UFS case: it needs to keep writing data for the whole period. ZFS, on the other hand, puts out a burst of IOs in a 50 second window, and from then on the IOs settle to a constant amount of "background" IO for the rest of the checkpoint interval (most likely with a high content of WAL related output). 15% of 5 minutes is 45 seconds; indeed ZFS seems able to cope with this setting.
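To make the arithmetic explicit: the completion target gives the fraction of the checkpoint interval within which PostgreSQL aims to finish the checkpoint writes. A small sketch (plain Python, nothing PostgreSQL-specific):

```python
# Window PostgreSQL aims to spread checkpoint writes over:
# checkpoint_timeout * checkpoint_completion_target.
def checkpoint_write_window(timeout_s: float, completion_target: float) -> float:
    """Seconds available to write out dirty buffers for one checkpoint."""
    return timeout_s * completion_target

# The settings from the tests above:
print(checkpoint_write_window(300, 0.15))  # 45.0  -> close to the ~50 s ZFS burst
print(checkpoint_write_window(300, 0.85))  # 255.0
print(checkpoint_write_window(900, 0.50))  # 450.0 -> the original 15 min / 50% setup
```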
The above shows the write IOs per second. How much data is actually written is shown in the following graph.
ZFS is clearly pushing out much more data: during those initial 50 seconds up to 60MB/sec, after which a linear reduction can be seen. Although the IOPS are rather constant, the data volume is not; ZFS seems to reduce its write size. UFS pushes significantly less data out to the disks, yet it uses so many IO calls that the overall TPM throughput suffers.
Since ZFS is a logging file system, the data written contains both the application (PostgreSQL) data and the logging data. I'll do some extra tests to see how this affects the TPM results. I need to "sacrifice" a disk array with a 1GB write cache for this, to emulate a fast SSD-like disk, and will make sure it is dedicated to ZFS's so-called intent log.
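Attaching such a dedicated log device to a pool is a one-liner; something along these lines (the pool name "pgpool" and device "c4t0d0" are placeholders, not the actual hardware used):

```
# Attach a separate intent-log (slog) device to an existing pool.
zpool add pgpool log c4t0d0

# Verify that the log vdev shows up.
zpool status pgpool
```

With the log on the cached array, synchronous writes should hit the fast device while the regular pool disks handle the bulk data.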