PostgreSQL on UFS versus ZFS

Introduction


Up till now I have had the good fortune of doing my PostgreSQL tests on a Sun Fire V890 with 16 SPARC IV cores at 1350MHz. Attached were seven disk arrays: four with 256MB of write-through cache, the other three with 1GB of write-through cache. All spindles were set up as a stripe.


I always used UFS: the WAL was located on a stripe of the four smaller arrays; all application data tables on one array; all application indexes on another; and all other PostgreSQL related data on the third array. Both file systems holding application specific data were mounted with the directio option, as was the file system holding the WAL.
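
For illustration, such a directio mount on Solaris boils down to the forcedirectio mount option. A sketch of the corresponding /etc/vfstab entries could look like the following (device names and mount points are made up for this example, not the ones actually used):

/dev/dsk/c2t0d0s6  /dev/rdsk/c2t0d0s6  /pg/wal    ufs  2  yes  forcedirectio
/dev/dsk/c3t0d0s6  /dev/rdsk/c3t0d0s6  /pg/data   ufs  2  yes  forcedirectio
/dev/dsk/c4t0d0s6  /dev/rdsk/c4t0d0s6  /pg/index  ufs  2  yes  forcedirectio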


The load was always generated with an internal toolkit: an OLTP-ish workload with 128 concurrent users constantly sending queries and DML (no think times), set up so that there would be a fair amount of IO (a rather large working set).


Besides these seven arrays I had nine internal disks free. This blog entry describes the findings when using just these internal disks, configured in different ways to hold the data mentioned above. Next to the (array based) baseline, configurations using both UFS and ZFS were tested. The findings are presented below.


Results


The load generation tool reports its results through a TPM metric. The next graph shows this throughput for the baseline, for a single stripe over the nine internal disks holding a UFS file system, and for a single pool (of these same nine internal disks) on which a ZFS file system was configured.


The /etc/system settings used for all these tests are:



* file descriptor limits
set rlim_fd_max=260000
set rlim_fd_cur=260000
* maximum size of a single physical IO request
set maxphys=1048576
* TCP/IP squeue fanout/binding tuning
set ip:ip_squeue_fanout=1
set ip:ip_squeue_bind=0
* fsflush: run every second, spread a full memory scan over 900 seconds
set tune_t_fsflushr=1
set autoup=900
* disable UFS write throttling
set ufs:ufs_WRITES=0
* limit the file system page cache (segmap) to 25% of memory
set segmap_percent=25

My system has 48GByte of internal memory, of which PostgreSQL shared buffers take 16GByte, so I do not want the file systems to use more than 25% of my memory for caching. By default UFS uses 12%; ZFS, however, is a little more greedy by default. This setting at least ensures the database processes have an adequate amount of memory.
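
For completeness, the 16GByte of shared buffers corresponds to a single line in postgresql.conf (using the memory unit syntax; older PostgreSQL releases take the value as a number of 8KByte buffers):

shared_buffers = 16GB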


The ZFS file system was created with an 8KByte record size and compression turned off. No further settings were changed.
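
For reference, creating such a ZFS setup amounts to something along these lines (pool, file system, and disk names are placeholders, not the ones actually used):

zpool create pgpool c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0
zfs create pgpool/pgdata
zfs set recordsize=8k pgpool/pgdata
zfs set compression=off pgpool/pgdata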





[Figure fs_tpm.png: TPM throughput for the array baseline, UFS on the internal disks, and ZFS on the internal disks]



Before each test a backup of the freshly created database was restored. The caching behavior of ZFS shows quite nicely. Both UFS tests need some time to get started (remember that directio mounted file systems do not cache, so in the beginning the data needs to be read from disk instead of from the file system cache). The dips in throughput correspond nicely with checkpoint activity.

Checkpoints are definitely handled far better when using disk arrays with a memory-backed cache. At about 2000 seconds the throughput seems to become stable in the array case, while in both other cases the checkpoint load strongly affects the throughput.

Comparing the UFS and ZFS tests done using the internal disks, it seems that during throughput peak times they are quite comparable, although ZFS shows a little less throughput. The data above "just" presents the TPM. Let's look at the IO graphs. To make things "easier" I'll only show UFS versus ZFS using the internal disks. The above graphs come from a test where all database files are in the same file system, so iostat cannot discriminate between WAL, data, and index related IO.
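
The disk numbers below were collected with iostat; a typical invocation (the 10 second interval is arbitrary) would be something like:

iostat -xnz 10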

[Figure writerate.png: write rate (MB/sec) on the internal disks, UFS versus ZFS]

Yes indeed, there is a difference. The read rate is not depicted here: ZFS shows no reads at all during the interval, so the file system cache indeed works perfectly. For UFS there is an initial read rate of 13MByte/sec, which drops rapidly: at 1500 seconds only 200KByte/sec is left, and after 2500 seconds (from the start) this has been further reduced to less than 40KByte/sec.

However, the difference in write IO is significant. With ZFS the periods of write IO are longer (~200 sec with UFS versus ~380 sec with ZFS) and the amount of write IO is up to five times as large (18MB/sec versus 80MB/sec). The following graph shows the actual IOPS.



[Figure writeiorate.png: write IOPS on the internal disks, UFS versus ZFS]



This one shows that ZFS handles the disk devices much better: the average number of IOs is down by a factor of three when using ZFS.

Finally, looking at CPU consumption, the ZFS test uses a little less of this resource.





Next step



Some extra tests were done using two pools: one for the WAL and one for the rest of the database files. This was done to check whether the behavior could stem from the way the database uses its storage. The WAL pool has four disks; the other pool has five.
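
Again for illustration, this two-pool layout amounts to something like the following (placeholder disk and pool names):

zpool create walpool  c1t1d0 c1t2d0 c1t3d0 c1t4d0
zpool create datapool c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0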

Since a picture tells us more than a thousand words, here are the results. First the WAL output.



[Figure walwrite.png: write rate on the WAL pool, UFS versus ZFS]



Although they seem to differ, the amounts look more or less equal. Still, UFS is "faster," meaning the frequency of those (checkpoint) spikes is higher. Once more, since everything is in the ZFS file system cache, the ZFS test starts at full speed immediately, while the UFS one needs a warm-up period. However, the huge differences seen in the previous graphs do not show up here. Now let's look at the file system holding the other PostgreSQL data.



[Figure allwrite.png: write rate on the pool holding the remaining database files, UFS versus ZFS]

There it is! There is a striking difference between the UFS and ZFS file systems when doing non-WAL related write IO with PostgreSQL. And by the way, the difference is even bigger than in the 'all on one file system' tests: approximately 7MB/sec for UFS versus 85MB/sec for ZFS. The ZFS numbers are obtained through the zpool iostat command. I double checked them against the iostat numbers, and indeed the 85MB/sec figure is correct. No hidden undocumented features that could "explain" this difference. Need to dig deeper.
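
For those unfamiliar with it, the zpool iostat check mentioned here is simply an interval run against the pool (pool name as in the earlier sketch):

zpool iostat -v datapool 10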





Conclusion



Although from a client perspective the actual performance of ZFS is only a little less than that of UFS, the amount of IO (KB/sec) differs too much to be ignored. While ZFS seems to optimize by doing large sequential writes, the overall effect in an OLTP environment does not seem optimal. I hope to be able to do some further tests that could help shed some light on this remarkable finding.

Comments:

Could you get your hands on some SSD disks and run some numbers on them?

Posted by Mikael Gueck on June 12, 2008 at 09:05 PM PDT #
