PostgreSQL wal_sync_method and O_DIRECT on Solaris

Starting with PostgreSQL 8.2, I noticed in the documentation that if wal_sync_method is set to open_datasync (O_DSYNC) or open_sync (O_SYNC), PostgreSQL also sets the O_DIRECT flag (to bypass the file system buffer cache) wherever O_DIRECT is supported. PostgreSQL does this by checking whether O_DIRECT is defined and enabling it accordingly.
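
To make that compile-time check concrete, here is a minimal sketch in C of how the open flags could be assembled. It is not the actual PostgreSQL source: the helper name wal_open_flags and the choice of passing the setting as a string are assumptions made purely for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* Linux exposes O_DIRECT only with this defined */
#endif
#include <assert.h>
#include <fcntl.h>
#include <string.h>

/*
 * Hypothetical helper (not PostgreSQL source): build open(2) flags for a
 * WAL file from the wal_sync_method setting.
 */
static int
wal_open_flags(const char *wal_sync_method)
{
    int flags = O_RDWR;
    int sync_flag = 0;

    assert(wal_sync_method != NULL);

#ifdef O_DSYNC
    if (strcmp(wal_sync_method, "open_datasync") == 0)
        sync_flag = O_DSYNC;
#endif
    if (strcmp(wal_sync_method, "open_sync") == 0)
        sync_flag = O_SYNC;

    flags |= sync_flag;

#ifdef O_DIRECT
    /* Only add O_DIRECT when the platform headers define it at all. */
    if (sync_flag != 0)
        flags |= O_DIRECT;
#endif

    return flags;
}
```

On a platform whose headers do not define O_DIRECT, the second #ifdef block simply compiles away, so the caller gets synchronous writes that still pass through the file system buffer cache.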

That's great: since both O_DSYNC and O_SYNC are synchronous flags, there is little point in having the file system cache those writes (this may not hold for reads). If a synchronous write does not go directly to the disks, it generally does not get the best response time on completion of the write call. Even on Solaris, the general recommendation is that if you open files with O_DSYNC, it is better to write directly to the underlying disks.

However, there is a small problem: the O_DIRECT flag is not supported on Solaris. So the assumption that PostgreSQL writes directly to the disks on Solaris when wal_sync_method is open_datasync or open_sync is not true. Solaris has a different API, directio(3C), that must be used instead. Solaris has no other way of knowing that the application is asking to bypass the file system buffer cache. (The only alternative is for the Solaris administrator to mount the file system with the forcedirectio option, but that works at the file system level and affects every file on that file system.)

The directio(3C) API was introduced in Solaris 8, so applications using it should compile on Solaris 8, 9, 10 and all OpenSolaris-based distributions.

I did a quick test by modifying the BasicOpenFile function in fd.c in PostgreSQL 8.3beta1 to issue directio(fd, DIRECTIO_ON) while still using wal_sync_method=open_datasync, and saw performance improvements with the recompiled binary. Of course, my quick test turned on direct I/O for all files, and that is not something we want (remember the CLOG buffer thrashing from a couple of days ago?). It looks like we need a hacker to modify the code so that it advises DIRECTIO_ON for the XLOG, data and index files when open_datasync or open_sync is used as wal_sync_method and fsync is enabled.
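
For readers who want to try something similar, below is a rough sketch in C of the idea: open the file as usual, then pass the direct I/O hint on Solaris. This is not my actual patch and not PostgreSQL code. The wrapper name open_with_direct_hint and its want_direct parameter are hypothetical, and the __sun guard is an assumption about how one would keep the code portable.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>
#if defined(__sun)
#include <sys/fcntl.h>       /* declares directio() and DIRECTIO_ON on Solaris */
#endif

/*
 * Hypothetical wrapper (illustration only): open a file and, on Solaris,
 * advise the file system to bypass its buffer cache for this descriptor.
 * Returns the file descriptor, or -1 if open(2) fails.
 */
static int
open_with_direct_hint(const char *path, int flags, mode_t mode, int want_direct)
{
    int fd;

    assert(path != NULL);

    fd = open(path, flags, mode);
    if (fd < 0)
        return -1;

#if defined(__sun) && defined(DIRECTIO_ON)
    /*
     * directio(3C) is advisory. A file system that does not support it
     * returns an error, which is deliberately ignored here.
     */
    if (want_direct)
        (void) directio(fd, DIRECTIO_ON);
#else
    (void) want_direct;      /* no directio(3C) on this platform */
#endif

    return fd;
}
```

A caller would pass want_direct as nonzero only for the files that should bypass the cache (for example the XLOG files when a sync open method is in use), which avoids the turn-it-on-for-everything behavior of my quick test.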

 

Comments:

wal_sync_method only controls WAL writes, which (barring recovery) are blocks that are never read back again. That means any time spent caching that write is wasted, so bypassing the buffer cache is just fine.

Direct I/O on the database filesystems is a bad idea. While it may work fine on your class of hardware, smaller systems in particular rely on the OS caching to a significant degree, and sync writes can be extremely slow. The whole reason parameters like effective_cache_size exist is to account for the fact that the OS cache is buffering reads and writes in addition to what's inside shared_buffers.

The easiest way to work around this problem is to put the WAL xlog on another filesystem. Then you can mount that one with forcedirectio without any problems. There are some other possible performance improvements from separating out the WAL anyway, because there are never any seeks on that volume either (again, barring recovery).

I wasn't aware that O_DIRECT wasn't working properly on Solaris, don't use that OS much myself. It would be easy enough to fix the underlying code to properly use Solaris directio; there's already some cleanup of wal_sync_method like that on the PostgreSQL to-do list. But since splitting out the WAL volume and using forcedirectio essentially resolves this issue, that's why I suspect it hasn't come up as a priority to fix this yet.

Posted by Greg Smith on October 30, 2007 at 06:49 AM EDT #

I thought directio was only for ufs. Does that mean you are working on optimizing postgresql for ufs? That might be useful, but I thought zfs was considered the recommended choice now.

By the way, would it make sense in solaris to have O_DIRECT as an alias for DIRECTIO_ON? (or something similar, I have not checked the exact semantics)

Posted by Marc on October 30, 2007 at 06:52 AM EDT #

Hi Greg,

Yep, I get your point. wal_sync_method is only for the WAL logs, and maybe directio(3C) should be used only for the WAL.

However, I have seen improvements when it is used with data/index files also. So maybe there should be a postgresql.conf tunable for it.

As for why it should be done using directio(3C) compared to "forcedirectio" mount option for ufs:

Using forcedirectio as a mount option typically requires system administration rights, which a user might not have. With directio(3C), it's easier for developers to indicate the hint and for users to benefit transparently, without begging the system administrator to change the flag for the entire file system, which has a bigger impact. (This is quite useful in enterprise environments where a system administrator change sometimes requires a Vice President's approval.)

Hi Marc,
Yes, this is only for UFS. However, directio(3C) is just a hint and does not care about the filesystem. Currently it is only supported on UFS/NFS. On ZFS it might fail or might not do anything, but it doesn't hurt. See man directio for usage recommendations.

DIRECTIO_ON is not directly supported on the open(2) call, hence that won't work.

Also, I don't advise turning it on globally for all cases, but rather allowing it via a flag or through automatic detection of the sync flags.

As a side note, all commercial databases on Solaris already use directio(3C) to give the same hint for their log files, so this is nothing different. They also provide a flag/clause for doing the same on tablespaces, which eventually hold the data/index.

But I do appreciate the feedback on this.

Thanks.

Posted by Jignesh Shah on October 30, 2007 at 07:09 AM EDT #

Jignesh,

Why *doesn't* Solaris support O_DIRECT flag? Last I checked, that was a POSIX standard. Asking applications to support Solaris-specific directio calls isn't going to get a lot of people to port their applications from Linux.

Posted by Josh Berkus on December 11, 2007 at 10:17 AM EST #

directio() is advisory. It is not a substitute for O_DIRECT. O_SYNC or O_DSYNC should be used in place of O_DIRECT and then directio() could be used as a hint.

Posted by Mark Callaghan on November 06, 2008 at 03:57 AM EST #

About

Jignesh Shah is Principal Software Engineer in Application Integration Engineering, Oracle Corporation. AIE enables integration of ISV products including Oracle with Unified Storage Systems. You can also follow me on my blog http://jkshah.blogspot.com
