Sizing PostgreSQL 8.3 Logs

Steve (via comments) asked about guidelines for sizing PostgreSQL logs for X transactions. I thought for a while about how to answer that question. One approach was to start from the disk level and work my way up to the PostgreSQL logs; the other was to start from the PostgreSQL logs and work my way down to the disks. After some time, I took the latter approach. Why? Because by doing so I would also help people size their systems properly.

Most of the reasoning in this post, especially the parts related to the disk/storage side, comes from my interpretation of what I have heard from experts within and outside Sun. So while it may not be fully accurate, it sounds logical to me. Feel free to flame me if there is something illogical here.

Generally I have found that people approach their sizing in the following way: first they select their application, then their database, then either the operating system or the server itself, and they probably think last about the storage, or mostly carve it from existing SAN infrastructure.

To keep this post simple, let's assume that you have identified a Sun Fire X4450 (it's new and hence in fashion) for your OLTP application running on PostgreSQL 8.3. I would say that server can support 3000-4000 OLTP transactions per second. For sizing purposes, let's say we are targeting a maximum of 3000 tps on it. Now let's work our way through this.

If I understand PostgreSQL 8.3 right, then by default each transaction, when it commits, will result in a sync of the WAL record, which writes a page (typically 8 KB) to the log devices. (The defaults in PostgreSQL 8.3 are fsync = on, commit_delay = 0, synchronous_commit = on, full_page_writes = on.)

So let's size the pressure on the logs:

1 transaction = at least 1 I/O operation of 8 KB on the log filesystem

3000 tps = at least 3000 I/O operations per second (iops), or 24,000 KB/sec of writes (roughly 24 MB/sec for simplicity)
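The arithmetic above can be sketched as a quick back-of-the-envelope calculation (assuming the worst case of one full 8 KB page written per commit):

```python
# Worst-case WAL sizing: one synchronous 8 KB write per committed transaction.
tps = 3000            # target transactions per second
wal_write_kb = 8      # one full page per commit

iops = tps * 1                      # one I/O operation per transaction
throughput_kb_s = iops * wal_write_kb

print(iops)              # I/O operations per second on the log filesystem
print(throughput_kb_s)   # KB/sec of WAL writes (~24 MB/sec)
```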

Now most modern drives support 25-60 MB/sec writes, so the overall throughput would not seem to be a problem.

However, by the same token, most modern drives (some call them spindles in this conversation) typically only do about 120 iops on average (assuming the write cache is turned off on the drive for safety).

Aha, so a single disk cannot handle the logs. The number of drives I would need just for the log filesystem is 3000/120 = 25.

So we will require about 25 disks just to handle the log devices. Now each disk is generally about 73 GB nowadays, which means I would have to dedicate about 1,825 GB just for the logs. How much actual space is required for the log? Even if you tune checkpoint_segments to a high value like 512 or 1024 (which is pretty high, with each segment about 16 MB), that will still be roughly less than 36 GB (about half of one disk).
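The spindle count and WAL space estimates work out as follows (the 2x-plus-one segment formula is my reading of how far the 8.3 WAL directory can grow between checkpoints):

```python
# Spindle count needed for iops, and maximum WAL space on disk.
tps = 3000
disk_iops = 120                       # conservative per-spindle write iops
disks_needed = -(-tps // disk_iops)   # ceiling division -> spindles for iops

checkpoint_segments = 1024
segment_mb = 16
# In 8.3 the WAL can grow to roughly (2 * checkpoint_segments + 1) segments.
max_wal_gb = (2 * checkpoint_segments + 1) * segment_mb / 1024

print(disks_needed)        # spindles needed to sustain 3000 iops
print(round(max_wal_gb))   # GB of WAL space, well under one 73 GB disk
```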

It's still not over. Generally you will need protection for the log devices. If you do RAID 1+0, you end up with 50 disks. If you do RAID-5, then first of all you will need to make sure that the hardware controller can do 3000 parity calculations per second without really increasing your service times. At this point most people probably think I am losing my mind, since most people don't really allocate 50 disks to their logs. How do they then support the same transaction rate with far fewer spindles? Various technologies come to the rescue:

Let's look at the PostgreSQL 8.3 technologies that help us reduce the spindle count for the logs (and the risks, if any, associated with them):


1. full_page_writes: This option, a tunable in postgresql.conf, is turned on by default. By turning it off, PostgreSQL does not write the entire 8 KB page to the log but just the portion that actually changed. Say, for example, this reduces your average write from 8 KB to 2 KB; then the maximum throughput required drops from 24 MB/sec to about 6 MB/sec. Note, however, that this by itself does not reduce the 3000 iops done to the log filesystem. It also introduces the risk of a partial write: if a page being written to the actual table/index storage is only partially written when power fails, the information stored in the log will not be enough to recreate the page when, after power is restored, the PostgreSQL server tries to recover the database.
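Numerically, the trade-off looks like this (the 2 KB average record size is just the illustrative assumption from above, not a measured value):

```python
# full_page_writes = off shrinks throughput but not the iops count.
tps = 3000
avg_write_kb_full = 8      # full 8 KB page per commit
avg_write_kb_partial = 2   # assumed average changed bytes per page

full_mb_s = tps * avg_write_kb_full / 1024
partial_mb_s = tps * avg_write_kb_partial / 1024

print(round(full_mb_s, 1))     # ~23.4 MB/sec with full page writes
print(round(partial_mb_s, 1))  # ~5.9 MB/sec with partial records
print(tps)                     # still 3000 iops either way
```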

2. commit_delay: This option, a tunable in postgresql.conf, is set to 0 by default. Setting it to a positive number (> 0) changes quite a bit about how PostgreSQL writes to the logs. Say there are more than five (the default commit_siblings) transactions in flight and one of them wants to commit. The database server detects that there are chances of other transactions also completing, so it takes a quick sleep (equal to commit_delay, in microseconds) and allows other transactions that also want to commit into the WAL buffers. Once it wakes up (or some other transaction forces a sync first), it writes all committed transactions in one go. The drawback is that individual commits are delayed (probably increasing their response time by the length of the sleep), but the real advantage is that all those committing transactions now result in only one I/O operation. This is a huge win. For example, say that on average under heavy peak load it is doing three transactions per sync; then the iops requirement drops from 3000 to 1000 (which immediately reduces the need from 25 spindles to about 9). Also, there is no risk to the database, and the delay added to each transaction is typically close to the minimum sleep resolution the operating system supports. The more transactions combined within a slight delay, the lower the iops load on the logs.
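The spindle savings from grouping commits can be sketched as (three commits per sync is the assumed average from the example above):

```python
# If commit_delay groups an average of three commits per WAL sync,
# the iops requirement (and spindle count) drops proportionally.
tps = 3000
commits_per_sync = 3       # assumed average under peak load
disk_iops = 120            # per-spindle write iops

syncs_per_sec = tps / commits_per_sync
spindles = -(-int(syncs_per_sec) // disk_iops)   # ceiling division

print(syncs_per_sec)   # WAL syncs (iops) per second
print(spindles)        # spindles needed: down from 25 to 9
```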


(1 & 2 both on: if only commit_delay is on, then even though the sync is a single I/O operation, the transactions involved could be (and probably should be) operating on different pages, so it will have to write three 8 KB pages, or 24 KB, together. With full_page_writes off as well, it is probably much less than 24 KB.)

3. synchronous_commit: PostgreSQL 8.3 adds yet another technology to reduce the load on log devices. When set to off, this option avoids the synchronous commit (the syncing of the WAL buffers at commit) in most cases. A separate process syncs on a timer, which allows more and more transactions to be combined (up to the maximum the operating system allows) within a single I/O operation. This in fact allows a further reduction from the 9 disks in the earlier case to maybe 6 or so. However, there is a risk associated with synchronous_commit: unlike commit_delay, which delays the transactions, in this case a transaction is reported committed even though the WAL buffers have not yet been flushed/synced to the device. Hence there is a chance of losing transactions (within a window of about 3x wal_writer_delay) even though they were reported committed. Still, compared with the many people who resort to the unsafe fsync = off, this option leaves the database in a consistent state after a power failure.
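The at-risk window is easy to quantify (200 ms is the 8.3 default for wal_writer_delay; the 3x multiplier is the rule of thumb stated above):

```python
# With synchronous_commit = off, a commit can be reported before its
# WAL is flushed; the at-risk window is roughly 3x wal_writer_delay.
wal_writer_delay_ms = 200                      # 8.3 default
at_risk_window_ms = 3 * wal_writer_delay_ms

print(at_risk_window_ms)   # ms of reported-but-unflushed commits at risk
```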

We have seen how existing technologies in PostgreSQL 8.3 can be leveraged to reduce the spindles required for the log device from 25 to 9, or even lower to, say, 6. Now let's look at hardware technologies that help reduce the number of spindles required.

1. Write cache on the disk itself: Most modern disks now have a write cache on them. Turning it on can significantly increase the write iops of those drives. However, even though the drive tells the application the data has been written (to improve service time), the data may still be only in the cache and hence lost during a power failure, which could leave the database unusable.

2. Battery-backed write cache: This write cache is similar to the write cache on the disk, except it is generally external and backed by a battery that can sustain power failures (depending on the length of the failure). It is a more expensive setup, but the cost is offset by the savings in spindles, since you may end up needing only one spindle (plus one more for a mirror). (Note: we are just talking about the log devices here.) One thing to note is that writes from the cache to the disk still happen at "disk" speed, which means that if you have peak loads the cache can help a lot, but if you have a constant inflow of writes the cache may eventually fill up and have to wait for the disk to free up cache space. So it might still be worthwhile to have more spindles, depending on the cache size and how long the peak writes last.
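A rough way to check whether the cache can keep absorbing a sustained write load (the 512 MB cache size and 40 MB/sec sequential drain rate are purely illustrative assumptions):

```python
# Can a battery-backed cache absorb the WAL inflow indefinitely, or
# only buffer a burst? Compare inflow against the spindle's drain rate.
cache_mb = 512          # assumed controller cache size
incoming_mb_s = 24      # peak WAL write rate from the sizing above
drain_mb_s = 40         # assumed sequential write speed of one spindle

headroom_mb_s = drain_mb_s - incoming_mb_s
if headroom_mb_s >= 0:
    print("cache never fills at this inflow rate")
else:
    seconds_to_fill = cache_mb / -headroom_mb_s
    print(seconds_to_fill, "seconds until the cache is full")
```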

After this long post, we have still only sized the PostgreSQL 8.3 logs.



Excellent Post, can you size the rest?

Posted by TheCrook on February 10, 2008 at 07:23 PM EST #

The idea of SATA SSDs seems to be getting more and more tempting for things like this. Having your logs on an SSD that's capable of over 10k iops would negate a lot of issues; put 4 of them in a RAID 10 on a good SATA RAID card with battery-backed cache and you have a system that even under the most extreme load will still not be an issue and won't fail under a power problem... SSDs are getting more and more tempting each year, even if just for a small portion of your datasets.

Posted by Trophaeum on February 11, 2008 at 08:24 AM EST #

Couple of technical clarifications here. First, only the first transaction that touches a page after a checkpoint writes a full 8K to the write-ahead log (and there's the full_page_writes tunable for that behavior even, though it isn't safe to turn off). Additional changes to a section of data that's already been touched result in only writing what's been updated. So while there's an ugly period just after the checkpoint where things approach the worst case you describe, most of the time the actual throughput to the WAL is lower. How much lower depends on how randomly distributed your data is, but most workloads will get enough re-use of some popular pages to lower total throughput requirements.

Second, the WAL consists of sequential writes only, so no seeks are involved, and they get grouped if there are multiple transactions waiting to go. So while 120 iops is a reasonable estimate for any single client, if you have multiple clients generating data the number of I/O commits you can get per second goes up considerably--if there's a backlog of 5 clients' worth of data when the disk is ready to make a commit, they all go into that one write. That doesn't just happen when commit_delay is turned on; in every case the commit backlog is cleared when a new commit is done. As a ballpark estimate, I suggest that by the time you have 10 clients generating traffic, you can get closer to 500 iops to the WAL even out of uncached drives rather than 120.

As a data point, if you have a real battery-backed write cache I've configured and seen others configure multiple systems that reach 3000-4000 TPS using only a single WAL spindle. Adding more spindles doesn't scale the WAL fsync problem very well anyway--when there's no cache in the way that's driven more by rotational speed of the drives for the commits, so you're limited by the iops of a single drive even if you stripe more of them together.

Posted by Greg Smith on February 11, 2008 at 09:04 AM EST #

Hi Greg,

Thanks for the feedback and the technical clarifications.

1. Point taken on the worst-case scenarios. It's hard to come up with a number for re-use. For sizing purposes, I guess I can introduce an expected percentage of re-use, with normal being, say, 30%, though it is still very workload dependent.

2. This is where I see things a bit differently. It may be because on Solaris we always turn on forcedirectio (similar to O_DIRECT on Linux). With that, I always see 8 KB writes to the WAL, but with commit_delay it fluctuates from 64 KB to 256 KB. Considering that full_page_writes was turned on, I concluded that only one transaction was being flushed to disk unless you turn on commit_delay. I guess I will have to write a DTrace script to prove either case. But good feedback on it.

For para 3, I don't understand the concern. Unless you have concatenation happening instead of striping, you should be able to support more iops. Of course an individual I/O will still be no faster, but in my experience the controller algorithm should be able to spread the writes quite well among the disks.

Posted by Jignesh Shah on February 11, 2008 at 09:40 AM EST #

If you're using one of the sync write methods and turning on forcedirectio that may very well be changing how the writes are getting done. It might be worth running a test one day with wal_sync_method=fdatasync and with forcedirectio off to see if the behavior changes for you, such that writes even without commit_delay on are larger than 8K with many clients going.

My main point in paragraph 3 was that two drives don't process a single fsync any faster than a single one does; time to actually commit is still driven by rotation speed. There are situations where that can bottleneck operations independently of total write throughput. Just something to watch out for. I also thought it was worth pointing out just how much a good caching controller can help with this situation.

Posted by Greg Smith on February 12, 2008 at 06:49 PM EST #

Ah, found the message on this topic I wanted to pass along before (this is part of the potential 8.4 queue):

Posted by Greg Smith on February 12, 2008 at 07:17 PM EST #


Jignesh Shah is Principal Software Engineer in Application Integration Engineering, Oracle Corporation. AIE enables integration of ISV products including Oracle with Unified Storage Systems. You can also follow me on my blog

