PostgreSQL performance study: introduction

While working on PostgreSQL performance and scalability tuning I found a number of things worth mentioning. To talk to the community it would be nice if these findings were easily available, so here is the first post on this subject, with many more to come: an introduction to the environment I use for testing, some installation notes, and the IO patterns I observe.


While using an internal benchmark kit to generate load for a PostgreSQL 8.3beta setup, we observed some scalability issues. The setup under test is a Sun Fire V890 with 16 UltraSPARC IV cores @ 1350MHz, 64GB of internal memory, and seven disk arrays dedicated to the database. Load is generated on a Sun Fire T2000. The machines are connected through 1Gbit Ethernet.

The system runs Nevada, build 70a. The database is a self-compiled 64-bit binary optimized for this platform. The setup as a whole was created to ensure IO is not a bottleneck, enabling us to focus on CPU-related performance and scalability.

The first thing done was a suite of tests to discover anomalies over time; I wanted to make sure the load and database behaviour was "constant". Each test ran for 3600 seconds after a 100 second ramp-up. During the whole test a couple of utilities like mpstat, vmstat, prstat and intrstat were running to gather system-wide statistics. During most of the tests DTrace was used to measure areas of interest in the PostgreSQL code. This was done by "enhancing" the DTrace provider that comes with this version of PostgreSQL with a couple of extra probes.
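A collection wrapper along the following lines can start all the monitors for the length of one run. This is a sketch, not the script actually used: the sampling interval and output file names are illustrative.

```shell
#!/bin/sh
# Hypothetical wrapper: start the system-wide monitors for one
# benchmark run (100 s ramp-up + 3600 s measured = 3700 s).
# Interval and file names are illustrative, not the real ones.

collect_stats() {                    # usage: collect_stats <seconds> <interval>
    duration=$1; interval=$2
    count=$((duration / interval))   # samples per monitor

    mpstat     $interval $count > mpstat.out   &
    vmstat     $interval $count > vmstat.out   &
    iostat -xn $interval $count > iostat.out   &
    prstat -c  $interval $count > prstat.out   &
    intrstat   $interval $count > intrstat.out &

    wait                             # block until the run is over
}

# collect_stats 3700 10             # one full run, 10 s samples
```

Running the monitors in the background and waiting on them keeps the collection window aligned with the benchmark window.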

Load was varied by changing the number of concurrent users: 64, 128, 256, ..., 1280. The generator has a notion of TPM to measure throughput, and the TPM numbers of these tests will be compared. Note that these numbers are completely artificial and only usable for inter-test comparisons.

The postgresql.conf file contains the following settings that differ from the default values:

max_connections = 2100
shared_buffers = 2097152 # 16GB
max_fsm_pages = 204800
wal_sync_method = fdatasync
wal_buffers = 256
checkpoint_timeout = 900
checkpoint_segments = 128
commit_delay = 10
commit_siblings = 10
bgwriter_delay = 10
bgwriter_lru_maxpages = 1000
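As a sanity check on the sizes implied by these settings: shared_buffers and wal_buffers are counted in blocks of 8KB (PostgreSQL's default block size), and a WAL segment is 16MB. A quick shell calculation:

```shell
#!/bin/sh
# Sizes implied by the postgresql.conf settings above.
# 8192 bytes = PostgreSQL's default block size.
BLOCK=8192

echo "shared_buffers:  $((2097152 * BLOCK / 1073741824)) GB"  # -> 16 GB
echo "wal_buffers:     $((256 * BLOCK / 1024)) KB"            # -> 2048 KB = 2 MB
echo "checkpoint span: $((128 * 16)) MB of WAL"               # -> 2048 MB between checkpoints
```

So the `# 16GB` comment on shared_buffers checks out, and checkpoint_segments = 128 allows roughly 2GB of WAL to accumulate between checkpoints.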

IO Behaviour

Here are some graphs, starting with the IO behaviour. Please note that the database was created with an additional objective in mind: to learn the IO patterns of the different parts of this test suite. Load generator specific index files are placed in a dedicated tablespace, and so are the specific tables. Each of these tablespaces is placed on a disk array dedicated to it. The WAL data is also placed on a dedicated disk array, and all other IO goes to a fourth array. In the IO graphs these are called: index, table, wal, and pg. The graphs only show the write calls to the devices. The file system holding the non-specific data is a plain UFS file system; all the others are UFS mounted with both the forcedirectio option and the noatime option in place. For the directio-mounted file systems this means that the underlying disks do 8KB block IOs.
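A sketch of the mount layout just described (the device paths and mount points are hypothetical, chosen only to illustrate the one-array-per-purpose split):

```shell
# Hypothetical device paths and mount points. The "pg" array holds
# everything non-specific and is plain (buffered) UFS:
mount -F ufs /dev/dsk/c2t0d0s6 /pgdata/pg

# The table, index and WAL arrays are UFS with forcedirectio and
# noatime, so the underlying disks see plain 8 KB block IOs:
mount -F ufs -o forcedirectio,noatime /dev/dsk/c3t0d0s6 /pgdata/table
mount -F ufs -o forcedirectio,noatime /dev/dsk/c4t0d0s6 /pgdata/index
mount -F ufs -o forcedirectio,noatime /dev/dsk/c5t0d0s6 /pgdata/wal
```

The dedicated tablespaces then simply point at these mount points, e.g. `CREATE TABLESPACE ts_index LOCATION '/pgdata/index';` (tablespace names are again hypothetical).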

The pictures below depict the IO behaviour for the 64, 128, 512 and 1280 concurrent user runs, in that order. Note that the WAL writes are scaled to fit the pictures; in reality they are ten times higher.

[Four graphs: write IO per device (index, table, wal, pg) for 64, 128, 512 and 1280 concurrent users.]

I have many questions about this IO behaviour. However, this facet does not have the focus of attention: the arrays can do much more work. There was only a burst of read activity (not in the graphs) during the first couple of hundred seconds; after that the 16GB of shared buffers were adequate to buffer all data. Realistic? No. But good for focusing on CPU-related issues.

The only write factor that really influences the transaction times is the time it takes for the WAL to be written after a sync. However, the average service times are sub-millisecond; only during the checkpoints was this different.
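One way to verify those sub-millisecond service times is to average the asvc_t column of `iostat -xn` output for the WAL device. The sample data below is fabricated for illustration, and c5t0d0 as the WAL device name is an assumption; on the real system you would feed in the live iostat output instead.

```shell
#!/bin/sh
# Average active service time (asvc_t, in ms) for the WAL device
# from 'iostat -xn' style output. Sample lines are made up.
cat <<'EOF' > iostat.sample
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  850.0    0.0 6800.0  0.0  0.5    0.0    0.6   0  40 c5t0d0
    0.0  900.0    0.0 7200.0  0.0  0.6    0.0    0.8   0  45 c5t0d0
EOF

# asvc_t is the 8th column in this layout; average it over all
# samples for the device of interest.
awk '/c5t0d0/ { sum += $8; n++ } END { printf "avg asvc_t: %.2f ms\n", sum/n }' iostat.sample
# -> avg asvc_t: 0.70 ms
```

Anything averaging well under 1 ms outside the checkpoint windows would match the behaviour described above.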


It would be interesting to know what motivated you to make these configuration changes. In particular, why set the unusual commit_siblings and commit_delay settings? And why start so many connections if you only have 16 processors?

The biggest unexplained bit here is the forcedirectio choice. Postgres is designed around the assumption that the filesystem I/O (except possibly the WAL) will be buffered. It expects writes to be basically instantaneous and reads to benefit from a large filesystem cache, and it expects read-ahead on sequential reads to happen automatically: Postgres makes no attempt to read blocks it might need soon before it needs them, for example.

What would be useful to know in interpreting these results is whether the database was resident in Postgres shared buffers, or whether it was being read and written into the filesystem cache heavily even if it didn't have to be fetched from disk.

Posted by Greg on March 10, 2008 at 02:45 AM PDT #

Greg, these settings were found to give an optimal result in earlier tests; I think Jignesh Shah blogged about this earlier. Although there are up to 1280 clients, these clients are not active continuously: on average they sleep (200 msec) after each business activity, the so-called think time.
Why so many? If there were no scalability issues this would show up nicely. Since we expect scalability issues we need a range of client counts to get an impression of where the issues start and how they escalate. These tests were all done to map out the landscape and let us focus. More data is currently being analyzed, and I'll continue this blog with new findings as soon as possible. Perhaps some of your answers will be in there.
To answer your question about the number of connections: this is not necessarily a big number. I have seen many installations (with a commercial database) that run with many more users on midrange and higher-end servers.

The directio mount flag is used because it improves performance. File system read-ahead is unusable, and even counterproductive, since in a random IO environment it is highly unlikely that the next (sequential) block will be needed. The extra reads the file system does only become extra load on the IO subsystem.
As you said, directio mounting is especially beneficial for the WAL device, at least until the WAL segments are archived; but archiving is disabled in this setup.
Since only dirty blocks are written, chances are small that many blocks can be written in sequence, so buffering is not beneficial for write performance either. Finally, the database is the one and only place where knowledge is (or should be) available about what is needed and what is not. I therefore believe (and found this true in the vast majority of cases) that moving your file system cache into the shared buffer section of the database is the best way to use the available memory. Of course I am describing a situation where the database is the one and only application running on the server.
If we find that PostgreSQL should be more knowledgeable about read-ahead strategies, then we will most likely look at an implementation.
The blog mentioned that the 16GB of shared buffers seemed adequate to remove read IO, so I believe the whole dataset being used was available in memory. This was a deliberate attempt to focus on CPU-related matters.
Finally, the two commit_xxx settings drastically reduce the amount of write IO for the WAL device, although on my system this did not make a measurable difference to the transaction throughput as reported by the load generator.
I'll try to remember to post a couple of graphs that show this. Please help me remember.

Posted by Paul van den Bogaard on March 10, 2008 at 04:46 AM PDT #

