Tuesday Jul 22, 2008

Waving Commit Delay

In the previous entries many graphs were shown. The once showing throughput by the amount of commands handled had this typical sinusoidal pattern. I now think I found the cause of it. Although I definitely do not understand it (yet).

Since, as usual, a picture tells us more than a thousand words here is one with that pattern. The vertical axes shows the cumulative time by all PostgreSQL processes in a certain state:


Ok, what is shown here? The red graph depicts the spent while sleeping as a direct result of the commit_delay setting. Currently the average number of processes with active transactions when a flush is called is around 85. Therefore that part of the equation is true, so indeed it is the commit_delay setting that determines if the sleep is done: all other prerequisites are true (read the source if you do not understand :-) .)

Now when setting commit_delay to zero, the result becomes:


The time for XLogFlush now fully determines the time spent in the routine that handle the RecordTransactionCommit. It is also approximately 60% of the time spend handling commands.

The downside of not disabling commit_delay is total throughput. A little bit of delay reduced the amount of (WAL related) IO significantly. This was already shown in an earlier blog entry. However the behavior is much more constant without the delay. would it be beneficial if the same constancy could be realized in the commit_delay case?

To answer that one, some extra investigations are called for.

Saturday Jul 19, 2008

Througput spikes explained

In previous posts a typical pattern in handling commands was seen. The handling versus reading commands cumulative time consumption showed a pattern as can be seen in the next graph:


I had a hunch that this could be related to the client (load generator) behaviour. The client program is written in Java. It runs 128 application threads that continuously communicate with the database. Lots of rows inserted and retrieved. The number of objects created should be considerable. Garbage collection to clean up the Java heap might be the driving force for the above mentioned pattern.

My standard configuration runs with a 2 Giga byte heap. I decided to increase this to 3.2GB, and also increase the Eden size to 1.5GB. For those a little familiar with Java, this is the space in which the objects are created. The test resulted in the following graph:


Indeed it is the client who is responsible for these spikes. Using a larger heap means more objects can be created in the Java heap before a garbage collection is forced. Also a larger heap holds many more objects translating in longer GC times. Indeed the time spent reading commands has increased. Please note, garbage collections in Eden are "stop the world", meaning the client does not respond to the database process: during a GC the readcommand does take significant more time.

Wednesday Jul 16, 2008

Finding the throttle

In my previous blog entry I showed a behavioral pattern in PostgreSQL that needs attention. In this entry I will try to explain what I did to find the culprit.

The only statistics discussed were the ones in tcop/postgres.c These gave an impression of the global activity. During a checkpoint lots of time is spend handling commands. Although this time seems to be zero, it cummulates to figures as high as 1600 seconds. By looking at the other statistics during those checkpointing intervals some of these too show this accumulation hopefully too. This would make it easy to find the root cause. At 1218 seconds this pattern shows nicely. The next table shows the top time-consumers during that interval:

SLRUPhysicalWritePage::Part6 16 8368 523 8368
SLRUPhysicalWritePage::call 16 8370 523 8368
SLRUselectLRUPage::WritePage 1 8370 8370 8370
SLRUwritePage::call 16 8372 523 8370
postgresqlMain::readchar::B 9799 19100 1 4297
LWLock::Acquire::08::semaphoreTime 1711 21303 12 411
xlogflush::go::acq 1724 21349 12 411
SLRUselectLRUPage::Return3 768 344920 449 10574
SLRUselectLRUPage::Return1 2162 879268 406 10568
LWLock::Acquire::59::semaphoreTime 676 1059722 1567 8368
SimpleLRUWaitIO::bufferlock::Acquire 768 1060120 1380 8368
SLRUselectLRUPage::WaitIO 622 1061593 1706 8381
SimpleLRUWaitIO::call 747 1061932 1421 8381
SLRUselectLRUPage::Part3 623 1069964 1717 8381
SLRUselectLRUPage::call 2931 1224187 417 10574
SLRUreadPageReadOnly::SimpleReadPageCalls 1022 1224899 1198 10575
SLRUreadPage::call 2799 1224903 437 10575
SLRUreadPageReadOnly::call 1525 1225559 803 10576
postgresqlMain::handleCommand::E 9824 1267569 129 10577

Ok, what are these? These are the statistics. I try to use a #8220convention:#8221 A name consists of one or more tags separated by a colon pair. First part is the name of the function in which the statistic is used. Next part should indicate the part in the function that it means to time, and if present the final part tells more about the details being measured. Most statistics measure wall clock time, however it can be that another metric is used (number of elements in a list for example). Rows in the table are sorted by total time. The postgresqlMain::handleCommand::E is the statistic used to count time needed to handle an Execute command. During the last second it was called 9824 times, and the total time it took to do this was a little more than 1267 seconds. On average it took 129 msecs. However there once was a maximum handling time of a little over 10 seconds. (if time is the metric, the resolution is msec.) The functions in my copy of backend/access/transam/slru.c are now heavily instrumented. Statistic names starting with SLRU belong there. It looks like most of the commandHandling time comes from SLRUreadPageReadOnly: SLRUreadPageReadOnly::call depicts the total time spent in this function. However from the count value it is obvious that not all handleCommand visit this function. 1022 of these #8230ReadOnly function calls (SLRUreadPageReadOnly::SimpleReadPageCalls) call the SimpleLruReadPage function. Almost all of those 1267 seconds are spend in there. This function has 6 statistics build in. None of the others popup besides that one for the time consumption by the SLRUselectLRUPage::call. Since it was not obvious to me at that time what was going on in that function. Therefore I split this routine in three parts and measured time spend in each of them. Only part3 is visible in this Top-N table. Looking at the code I see this part contains an if-else construct in which either SimpleLruWaitIO or SimpleLruWritePage are called. The first one takes 1069 seconds for 747 calls, the second one takes only 8 seconds. However that is for just one call! This is a single write that takes forever. Why? And is this related to those 1069 seconds spent in the WaitIO part. Names used do point in that direction.The SimpleLruWaitIO routing too has 6 statistics of his own. One of them is: SimpleLRUWaitIO::bufferlock::Acquire.
The bufferlocks are assigned dynamically. This then, could be an explanation for the accumulation of time in LWLock 59. As you might recall in this one all statistics for all dynamic locks in the system are stored. The numbers do match. And indeed the maximum waiting time seen for LWLock::59:semaphoreTime is very similar to that WriteIO. It looks like all my client related processes are waiting for this single write that takes so long, since 128 \* 8 seconds translates to 1024 seconds total. Since this maximum time is for one single occurrence this also means that single block (did not actually look to the block size, so it is an assumption really) is used by all processes. If this is the case than there is something out there behaving as a classic scalability issue.

Time consuming write

Now why is that single write so terribly slow? Can I get more information out of the statistics I have? Before the SlruPhyscialWritePage routine is called one of the buffer locks is acquired in EXCLUSIVE mode. After this routine returns and some extra housekeeping is done this lock is released again. The PhyscialWritePage routine, which of course also has a fair amount of statistics, is the one where time is consumed. Meaning that that lock being held in EXCLUSIVE mode is our scalability issue! The PhyscialWritePage was also #8220magic#8221 to me, so simple separation by adding #8220PartX#8221s here and there learns me that Part6 is the time consumer. Since I know by traversing the code, that the last parameter passed to this routine has a value of null it must be the pg_fsync(fd) in this routine that causes the long waiting. This raises the next question. What is being synced? It must be data since we are in a checkpoint.

DTrace to the rescue
The little script I used for this basically starts on a fdsync system call. Before this happens it knows the file descriptors and file names that belong to each other. On each fdsync that takes longer than 2 seconds the filename that belongs to the file descriptor is printed.The script also counts all writes against the file descriptors. Basically the assumption is that all these write are out there waiting to be sync-ed. During the test three different filenames were shown as follows:
fsync 24637: Took 8 secs to write 122588 entries to "pg_tblspc/16384/16386/16387.1"
fsync 24637: Took 8 secs to write 128169 entries to "pg_tblspc/16384/16386/16387"fsync 24698: Took 7 secs to write 1 entries to "pg_clog/000C"
Here 24637 is the process id of the bgwriter. It is flushing single gigabyte portions at each flush. To all files of interest. (the ones above are application specific tablespaces/files). However writing a single block to clog also takes a very long time. Mmm that one is easy to resolve. Just wondering if it would make a difference. Moved pg_clog (including its content) to yet another disk. Created a softlink, and redid the test.

The CLOG solution
The next two graphs tell the story. These are the same graph as shown in the previous blog entry. The first one present the overall picture, the second one zooms in for a little more detail.
handlingcommandsclog.png handlingcommandsclogzoom.png

This one looks far better. However that pattern of "increased readcommand time / decreased handle time" is interesting too. And of course the questions:
  • CLOG is flushed by the fdsync. However I do not think this originates from the code pathes inspected so far. Just happened to be a side effect of the DTrace script and relatively ease to remove. So where did it come from?
  • Is it just the huge load of IOs on a ZFS file system that causes these waits, or can this be seen on UFS too for example?
  • Still there is a huge burst of writes (1GB) to each tablespace. Would it be worthwhile to sync in between checkpoints to spread the IO load over time? A single disk can sustain only so many IOs/sec. Spreading should indeed be advantaguous.

This screams for further inspection.

Checkpoint throttled throughput

In previous entries the checkpoint effect on throughput was mentioned but not explained. While looking at the IO behaviour comparing UFS to ZFS when running a PostgreSQL database instance, I see huge bursts of IO that also seem to throttle throughput in a significant way.

The original tests used a load generation tool that reported its measured throughput each 100 seconds. Since we all know that averaging removes the interesting information I redid a test with a 5 second reporting interval. Indeed this showed the effect of throughput much better: 0 TPM while checkpointing. Complete stand still of the application. Wouw, this needs further attention.

Instrumentation effect

Initially I used DTrace probes to instrument the code. However I found that for high frequency probes (firing over 1000 times a second on many processes) the instrumentation effect was not neglectable. Magne Mahre wrote a little library we started using to instrument "a light weight way". In my situation LWLock 25 was used to handle some synchronization issues. Every process counted locally and during commit time the global pool of counters was synced. This works rather well. However it is so unfortunate that dbwriter and logwriter commit at a very low rate. I had to call a "Copy local to global" routine to ensure the information gathered in these processes did also show in the final reports.

Unfortunately the final reports did show a new number one contender: lock 25. This method too did not turn out to be the light weight instrumentation method looked for. Although I must admit that in my attempt to understand I enter quite an amount of counters in the machine (currently I have over 550 new counters). Lots of these turn out not to increment.

I wrote a quick library that uses the atomic_xxx routines provided in the libc librarty that comes with Solaris. The now complete mutex and lock free way of instrumenting seems to provide me with the information I was looking for. Although I am still grasping to understand it completely. This library creates statistic objects in a new segment of shared memory. The advantage seems to be that any other process can attach to it, to display a real time view of the counters in there. I currently sample with a frequency of 1 Hz. And this adequately shows me the information.

A single statistic holds a running total, a counter of "times updated", the minimum value so far, and the maximum value so far used to update that statistic. Maybe a future version will also provide a histogram facility so a better understanding of the dynamics of certain statistic could be possible. Right now it is not implemented.


I currently using one of the earlier PostgreSQL 8.4 sources. All is build 64bit in an optimized way with Sun Studio 12 compilers and tools. Most tests run for an hour, after which I start mining the data. Current focus is on activities, or lack there of, during checkpoints. Checkpoints that are parameterized to happen every 300 seconds. And should finish in 15% of the time. All PostgreSQL related files reside on two ZFS pools. A WAL pool holds 4 internal disks. The other pool consists of 5 internal disks and holds all the other files. Each pool has its own memory backed up cache array for the ZFS Intent Log (see earlier post on UFS and ZFS).

Since the checkpoints are "tuned" to happen as fast as possible the concentration of IO seems to be high. However the IO subsystem can cope with this.

The next graph shows the throughput. A clear pattern is visible. I had a hunch that it might be related to garbage collection in the client. Not seeing throughput can also mean less activity on the load generator. Since this is a Java program GC needed to investigated. But ok, using jstat -gc it was very obvious this application behaves as it should be. No prolonged periods of GC-takes-over-the-world here. Once more the database was "prime suspect."

All tests are done with 128 clients.m


Mapping this to the IO happening is clearly shows in the following graph. Please note these TPM numbers are for 100 second interval and are clearly shown after that interval. Also the actual point "Time=zero seconds" is different on the load generating machine and the system under test.


And the IO graphs do not include the traffic to the ZIL holding devices. This is database related IO only. Seems obvious that the sudden burst in IO is related to the next dip in throughput. A dip that is about 20-25%.

From this graph my idea resulted to do some finer grained measurements. This to see if during those high IO peaks all throughput suffered. Redid the test, but now with a 5 second interval for reporting TPM..

Same information as the previous graph, but now a 5 second interval for TPM. Same overall pattern as the one above but now focussing on a smaller, arbitrary selected, interval.


Indeed the checkpoint seems to be the driving force for this sudden, steep, not long lasting, surge in throughput. However mapping the results from two applications running on different machines is a little tricky. As the saying goes: 'Use the source, Luke'

Using the source

In order to be able to exclude the data from the client I implemented some counters in tcop/postgres.c In the routine PostgresMain there is a "for ( ; ; )" loop in which data is read from the client, and acted upon. Which these I am now able to tell how much time (WAL clock) has elapsed handling data in different contexts (execute, parse, bind, sync, define) and how much the database process is waiting for the client. And of course it tells me the number of times this has been done. The counters are cumulating over time. However perl is quite happy to translate this into deltas per second.

Besides these counters I enthusiastically overloaded many source modules with too many statistics. In storage/lmgr/lwlock.c I placed multiple per non dynamic lock, and one where I cumulate all the information for the dynamic locks. In my source this one is numbered 59. The next table is a snippet where the tcop/postgres.c placed stats are shown together with those from the lwlock.c source module (only the most time consuming lwlocks during the time a checkpoint is happening. BTW the tests are now reduced to 1800 or so seconds since these patterns seem not to be affected by warm up phases while mimicking real life..

Here is the graph. I put the ceiling on 400M to make things a little more visible. The actual peaks are around 1600M. micro seconds. The third graph depicts the total time spend reading and handling commands.


Zooming in a little show a bit more detail:


Those short interruptions directly followed by a peak of the same magnitude are artifacts of measurements: if a timer stops in the beginning of the next interval, there is an extra cumulation in that next interval instead of the current one. However the peak at 910 and 1210 seconds are way beyond these artifacts. Total time is quit nice: 128 seconds. Recall we have 128 processes all measured each single second! The extra peaks/dips in reading/handling every 22 seconds or so is not yet explained. This one might be due to short garbage collection interruptions. It is indeed more time waiting for a command to arrive. Need to look into that one too some day soon...

Looking at the blue 'Totals' graph show a couple of seconds of "nothingness" Not reading, but also not executing! Followed by an extreme peak of handling time. BTW if a statistic timer starts in interval A, and ends in interval A+5 seconds, this adds zero's to A, A+1, ..., A+4, and 5 to interval A+5. There is indeed a repeating pattern of total zero throughput ( I am sure, I read the numbers.)

Thursday Jun 19, 2008

Running PostgreSQL on ZFS file system. Some considerations

Up till now I compared ZFS to UFS when deploying the PostgreSQL database files. Only these database processes were running on the system, and only all the PostgreSQL related files were on the filesystem being compared.

The file system was created on either a stripe set of nine disks using Solaris Volume Manager or a zpool using these same nine disks. The last blog entry reported ZFS outperforming UFS.

I did some extra tests to see if performance (read database throughput) could be increased by fiddling with some elements that are known to influence the IO part when using ZFS.

The ZFS Intent Log is known to consume bandwith. The effect of this element can be seen by disabling it altogether. Of course this is definitely NOT advised for production environments. (See http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#ZIL for some extra information.)

Did some tests with the ZIL disabled. Using 5, 15, 50 and 85% for the completion_target of PostgreSQL. The following graph shows the database transaction throughput:


Recalling the baseline I presented in an earlier blog using seven disk arrays with caches, that showed a throughput around 90K TPM this one comes very close. Only nine internal disks this time...

The previous throughput graph (... the saga continues) showed a rate of ~ 50K TPM: transactional integrity does not come for free.

Did some more tests where ZIL was enable, but new using a dedicated disk array (with 1GByte of write cache) for the intent log. The throughput result of this setup, the disabled-ZIL and the 'all on nine disks' tests are in the next graph (all have a completion_target setting of 15%):


The ZIL element obviously takes time. Time during which the throughput is cleary throttled while a database checkpoint is in progress. Using a dedicated fast device pays of to increase throughput and make it more constant over time. The IO (write part) was measured for both the ZIL device as for the data on the "nine disk" ZFS file system . Here is the result:


The checkpoint at 1800 seconds resulted in a 77 MByte write/sec for the ZIL only. As already suggested by Mikael Gueck I would surely like to do some extra tests with SSD disks. Guess I need to hold my breath a while, since right now I "just' do not have them.

However right now I feel ZFS is already a realistic alternative for a PostgreSQL setup.

Will do some more tests where I'll play with some of the other features of ZFS to get a feeling for their effect on my PostgreSQL environment.

Tuesday Jun 17, 2008

UFS versus ZFS with PostgreSQL, the saga continues

The previous posts described some tests I did using a setup that was "tuned" to focus on CPU consumption. The idea was to minimize IO as much as possibly by caching all data needed as much as possible and using plenty of disk arrays with write caches to minimize write latency.

In my current tests I am interested in the IO components and how they influence throughput when using two different file systems. In order to create a significant IO component the amount of disks was limited to nine internal ones: no array; no write caches. In addition the load generator was changed to significantly increase the working set data. Indeed my 16GB of shared buffers are not enough to buffer this working set.

The initial configuration was "tuned" to once in 15 minutes perform a checkpoint. Here I will describe the results when and using a smaller checkpoint interval and influencing the time "available" to write out all data due to a checkpoint. The PostgreSQL parameters in effect here are:


The checkpoint_timeout was set to do a checkpoint every 300 seconds. The checkpoint_completion_target used to be 50%. This influences PostgreSQL to write out all the dirty buffers in a period of time that is 50% of the timeout value. Indeed the old graphs show this nicely: first halve of each checkpoint period a lot of IO can be seen that is not present during the second halve of this period.

The idea behind the tests described here was to spread the IO over time. This was done by reducing the timeout from 15 to 5 minutes and increasing the completion_target to 85%. This turned out not to be optimal. Probably too many dirty blocks were re-dirtied again so a continuous writing of the same blocks was done over and over.

A smaller checkpoint_timeout seems fine for the time being. Increasing this parameter ensures that rewriting of blocks is minimized but the amount of blocks to be written increases resulting in a much more extreme IO workload once the checkpoint starts. With that same reduced checkpoint_timeout setting the completion_target was set to 15% to see what happens. The results of the 85% and 15% tests can be seen in the graphs below. First the throughput graph.


The nice thing here is that ZFS shows the best throughput (on average about 50K TPM.) And indeed for a 15% completion_target. Same target for UFS causes huge fluctuations in the throughput graph. Now lets look at the IOPS graph.


Although the pattern is clear lets focus on a smaller time interval to see what happens during one checkpoint interval


The 15% setting seems too much for the UFS case: it needs to continue writing data for the whole period. ZFS on the other hand, puts out a burst of IO's in a 50 second period and from then on the IO settle to a constant amount of "background" IO for the rest of the checkpoint interval (most likely this has a high content of WAL related output.) 15% of 5 minutes is 45 seconds. Indeed ZFS seems to be able to cope with this setting.

The above shows the write IO's per second. How much data is actual written is shown in the following graph.


ZFS is clearly pushing out much more data. During those initial 50 seconds up to 60MB/sec, after that a linear reduction of this amount can be seen. Although the IOPS are rather constant the amount is not. ZFS seems to reduce its write size. UFS pushes significantly less data out to the disks. However it uses so many IO calls that this reduces the overall TPM throughput.

Since ZFS is a logging file system the data written contains both the application (PostgreSQL) data and the logging data. I'll do some extra tests to see how this effects the TPM results. Need to "sacrifice" a 1GB cache disk array for this to emulate a fast SSD-like disk. Will make sure that this one is dedicated for ZFS its so called intent log.

Sunday Jun 15, 2008

UFS vs ZFS: commandline editing

My previous post on UFS vs ZFS showed too much of a difference. I looked at all my tests and they all showed this same difference. Still I wanted to make sure...

So I redid the test and .... gone were does 85MB ZFS writes. Doing these tests I rely on command line editing and command line history. Try to use the same command line as before. So I walked through my history and did not spot anything. On the first pass that is.

Than I "just" spotted it. Once more the devil is in the details. The command line used was

#zfs set recordsize=8k pg

And that # character was not the prompt but a comment marker... Well this explains a lot. The record size that in fact was used was the deault setting, 128K. Each 8KB write was in fact a 128KByte write.

So I redid the ZFS based tests and here are the results for the 'all on one volume of nine disks.'

First the Writes/sec:


And the corresponding IO load:


Finally here is the TPM throughput graph for the 'all on nine' tests.


Once more ZFS writes out more data. Looking at the TPM numbers the ZFS tests pushes a little less transactions through the system. Therefore the extra writing must be related to ZFS "overhead." The "good" thing is ZFS uses less but much larger IOs.

Although the maximum throughput is a little better for UFS, the ZFS behaviour is more constant. Overall the average looks very similar. No clear winner. But than it was not a contest, just a comparison. And surely I would like to understand those too noticable high-lows in throughput. Of course the writes are related to PostgreSQL its checkpoint configuration. This configuration was "tuned" for a certain set of tests were focus was on CPU consumption. In order to do this the checkpoints were moved to once per 900 seconds. Also PostgreSQL attempt to finish all checkpoint related IO in half of the checkpoint time. Another test is called for that causes a more constant and less extreme IO load.

And btw, the UFS tests were done without directio mounting: the ramp up is almost as fast as with ZFS.

Thursday Jun 12, 2008

PostgreSQL on UFS versus ZFS


Up till now I had the good fortune to do my PostgreSQL tests using a Sun Fire V890, 16 cores SPARC IV @1350MHz. Attached were seven disk arrays. Four of these with 256MB of write through cache; the other three with 1GB of write through cache. All spindels set up as a stripe.

Always used UFS, where WAL was located on a stripe of the four smaller arrays; all application data related tables on one array; all application indexes on one array; and all other PostgreSQL related data on the third array. Both file systems that held application specific data were mounted with directio option in place. As was the WAL holding file system.

Always used an internal toolkit to generate an OLTP-ish load with 128 concurrent users that were constantly sending their queries and DML (no think times), set up in such a way that there would be quite some IO (a rather large working set).

Next to these seven arrays I had nine internal disks free. This blog entry describes the findings when just using these internal disks in different ways. These internal disks were configure in different ways to hold the data mentioned above. Next to the (array based) baseline different configurations using both UFS and ZFS were tested. Below the findings are presented.


The load generation tool reports its results through a TPM metric. The next graph shows this throughput for the baseline, a single stripe over the nine internal disks holding an UFS file system and a single pool (of these same internal nine disks) on which a ZFS file system was configure.

The /etc/system settings used for all these tests are:

set rlim_fd_max=260000
set rlim_fd_cur=260000
set maxphys=1048576
set ip:ip_squeue_fanout=1
set ip:ip_squeue_bind=0
set tune_t_fsflushr=1
set autoup=900
set ufs:ufs_WRITES=0
set segmap_percent=25

My system has 48GByte internal memory. PostgreSQL shared buffers take 16GByte. So I do not want the filesystems to use more than 25% of my memory for caching. Default UfS uses 12%. However ZFS is a little more greedy by default. This setting at least ensures the database processes would have an adequate amount of memory.

The ZFS file system was created with an 8KByte record size and compression turned off. No further settings were changed.


Before each test a backup of the freshly created database was restored. The caching behavior of ZFS shows quite nicely. Both UFS tests need their time to get started (remember that directio mounted file systems do not cache, so in the beginning the data needs to be read from disks, instead from file system cache). The dips in throughput correspond nicely with checkpoint activity.

Checkpoints are definitely handled way better when using disk arrays with memory backed up cache. At about 2000 seconds the throughput seems become stable in the array case, while in both other cases the checkpoint load strongly effects the throughput.

Comparing the UFS and ZFS tests done using internal disks it seems that during throughput peek times they are quite comparable, although ZFS shows a little less throughput. The data above "just" presents the TPM. Lets look at the IO graphs. To make things "easier" I'll only show UFS vs ZFS using the internal disks. The above graphs come from a test where all database files are in the same file system. Therefore iostat cannot discriminate between WAL, data nor index based IOs.


Yes indeed, there is a difference. Read rate is not depicted here. ZFS shows no reads during the interval. Indeed the file system cache works perfectly. For UFS there is an initial rate of 13MByte/sec. This one reduces rapidly. At 1500 seconds there is 200KByte/sec left. After 2500 seconds (from the start) this has been further reduced to less than 40 KB/sec.

However the difference in write IO is significant. Using ZFS the periods of write IO is longer (from ~200 sec with UFS, to ~380 sec with ZFS) and the amount of write IO is up to five times as big (18MB/sec vs 80MB/sec). Now looking at the actual IOPS the following graph shows.


This one shows ZFS is handling the disk device much better. The average number of IO's is down by a factor of 3 when using ZFS.

Finally looking at CPU consumption it shows the ZFS test uses a little less of this resource.

Next step

Some extra tests were done where two pools are used. One for WAL and the other for the rest of the database files. This to check if the behavior could reside from the way the database uses its storage. The WAL pool has four disks, the other pool has five.

Since a picture tells us more than a thousand words, here are the results. First the WAL output.


Although they seem to differ, it looks more or less equal when considering amounts. Still the UFS is "faster," meaning the frequency of those spikes (checkpoint) is higher. Once more, since everything is in the ZFS file system cache, the ZFS test starts at once, while the UFS one needs a warm up period. However the huge differences seen in the previous graphs do not show here. Now lets look at the filesystem holding the other PostgreSQL data.


There it is! There is a striking difference between UFS and ZFS file systems when doing non wall related write IO with PostgreSQL. And by the way, the difference is bigger compared to the 'all on one file system' tests: approximately 7MB for UFS versus 85MB for ZFS. The ZFS numbers are obtained through zpool iostat command. I double checked the iostat numbers, and indeed the 85 MB number is correct. No hidden undocumented features that could "explain" this difference. Need to dig deeper.


Although the actual performance from a client perspective is just a little less when comparing ZFS with UFS, the amount of IO (KB/sec) is too different to be ignored. Although ZFS seems to optimize by large sequential writers the overall effect in an OLTP environment seems not to be optimal. I hope to be able to do some further tests that could help shed some light on this remarkable finding.

Friday Apr 18, 2008

Scalability issues due to waits

Previous entries showed lots of detail around different types of LWLocks, and the time spend waiting for them. In order to get a "helicopter" view all lock waiting times are combined. Also so timing is done in PostgresMain to understand the amount of time spend getting (waiting for) a new command, handling a command and pg_stat reporting.

In the previous post all tests were done from 64 to 1280 concurrent users. The findings in this post are all from 384 till 1024 users, in steps of 64. Each run took one hour. Measurement time was approx. 30 seconds less.

First a rather big table that holds all the numbers of interest :-) Lets explain the columns by position (all times not specified are in seconds):

  1. number of concurrent users;
  2. number of times the pg_stat updates were executed;
  3. total amount in time taken by these pg_stat updates;
  4. average time (msec) a single pg_stat update takes;
  5. number of commands read;
  6. total time for reading these commands;
  7. average time (msec) for reading a command;
  8. total time for handling the commands read;
  9. average time (msec) for handling a command;
  10. total number of times a LWLock was waited for;
  11. total amount of time waited for LWLocks;
  12. average time (msec) waiting for a LWLock;
  13. throughput per minute;
  14. average response time from client perspective;
  15. total amount of time (cols 3 + 6 + 8) explained.

   1        2       3     4         5         6       7        8       9       10          11     12     13     14    15              
384 30,774,906 1,498 0.05 134,478,799 1,257,700 9.35 111,570 0.83 30,583,924 47,298 1.55 41,771 0.017 3569
448 35,254,626 2,708 0.08 154,052,468 1,447,983 9.40 148,952 0.97 35,416,188 72,604 2.05 47,931 0.019 3570
512 39,472,668 1,966 0.05 172,484,080 1,628,241 9.44 197,524 1.15 41,839,911 109,362 2.61 54,144 0.022 3569
576 41,647,430 2,148 0.05 181,986,937 1,724,056 9.47 330,029 1.81 47,182,321 233,743 4.95 57,251 0.034 3569
640 41,903,717 2,260 0.05 183,106,564 1,740,012 9.50 543,114 2.97 48,511,160 444,870 9.17 57,943 0.057 3570
704 41,537,271 2,290 0.06 181,505,836 1,729,515 9.53 782,201 4.31 48,491,749 684,688 14.12 57,042 0.084 3571
768 41,011,402 2,279 0.06 179,208,259 1,710,304 9.54 1,030,160 5.75 47,853,511 933,159 19.50 56,190 0.113 3571
832 40,331,868 2,299 0.06 176,237,327 1,685,163 9.56 1,284,115 7.29 46,607,735 1,188,499 25.50 55,407 0.143 3571
896 39,857,972 2,268 0.06 174,167,499 1,668,546 9.58 1,529,312 8.78 46,385,722 1,434,691 30.93 55,103 0.172 3571
960 39,120,264 2,216 0.06 170,943,881 1,641,102 9.60 1,784,889 10.44 44,893,231 1,691,997 37.69 53,655 0.206 3571
1024 38,916,412 2,219 0.06 170,053,697 1,633,946 9.61 2,017,717 11.87 44,316,238 1,924,754 43.43 53,323 0.234 3568

The nice thing is to know that almost all time is explained (column 15). Tests took a little less than 3600 seconds.
Lets look at some of these numbers in a graphical way. First what is happening from a client side perspective when increasing the concurrency?

resptimeandthroughput.png Maximum throughput is reached with 640 concurrent users. The initial linear looking increase in response time is already clearly not linear at this concurrency level. Once the 704 mark is passed it looks linear once more although it rises much stronger. The transaction throughput numbers are in TPM (per minute.)
The next graph shows what is happening in the database back-end. Here the number of command read and the time spend handling these commands and the time spend waiting for LWLocks is shown. All these numbers are totals for all the back-ends in the test. Please note the concurrency numbers do not include the standard Postgres processes (dbwriter, wal log writer, etc).

Once more the number of commands read (and handled) tops at 640 users. This seems also to be the place were both "time needed to handle" and "time needed to acquire a lock" start rising in a linear way. The actual CPU consumption (difference between the red and the blue graph) looks rather constant in the third figure:

cpuUsage.png In this graph the average CPU usage to handle a single command is shown. There seems to be a marginal increase in CPU consumption with increase in concurrency. The CPU usage increase when going from 384 to 640 users is remarkable. It could be that -- but this is not proven by proper tests and instrumentation -- the processes starts spinning, in order to acquire a LWLock, more and more with increasing concurrency. At 640 than, most of this spinning is not adequate any more and the processes start to wait in order to acquire the lock. These lock waits take more and more time due to increased concurrency. This leads to a reduction in throughput (or in commands handled).
Knowing that all these tests indeed took an hour, it can be concluded that these 3600 seconds were basically used for waiting. For now I assume that the readCommand is -- as its name implies -- spending most of its time waiting for commands and data to arrive, in order to process them. The actual command handling is constant in its CPU consumption (around 0.5 msec.) The waiting time lost for acquiring LWLocks is however increasing when concurrency goes up. The amount of time spend in reading a command is rather constant. Total amount of CPU is 0.47 msec per command in the 384 user case. This translates to 64,272 sec (111570 - 47298). Total amount of time waited is 1,304,998 sec (1257700 + 47298.) Fortunately this is most of the time waiting for the clients to send a new command. However at 640 users (the maximum throughput) the wait time for locks has grown to 444,870 seconds. Something that escalated to 1,924,754 seconds in the 1024 user case.
The amount of CPU used is definitely an overestimate. The assumption here is "time not waited for LWLock is CPU time". The following calculation shows this cannot be true:-)
If, in the 384 user case, 64,272 seconds are used by CPUs in a period of 3569 seconds, it must be that the system has at least 18 CPUs. Also knowing the amount of idle time was approx. 37% this should mean these tests were run on a 28 CPU machine.For the 1024 user case there were 92963 seconds consumed: 26 CPUs. Idle time during this test was 7%, meaning 28 CPUs.The system used had just 16 CPUs! So there is wait time in other area's too. However compared to the wait time for the LWLocks this is barely worth any attention.However a significant amount of wait time is lost in reading a command. Approx. 9.5 msec are used per command. This could be due to the protocol used for communication with the client, or deficiencies in the benchmark tool. At least it motivates some extra attention. Currently the energy spend on SMP scalability is best used in the area's where wait time is observed. The above mentioned numbers show two area's worthy of more attention to obtain a more scalable and performing solution.

Wednesday Apr 09, 2008

PostgreSQL Lightweight Locks: Queue Length under high load

The previous blog entry (Using DTrace to measure Lightweight lock contention) discussed some findings with respect to contention on the lightweight locks in PostgreSQL. Using more extreme loads the ProcArrayLock turned out to be number one on the "waiting for locks" list.[Read More]

Thursday Apr 03, 2008

Using DTrace to measure Light Weight Lock contention

In an earlier post a table was shown with timing information for light weight locks. Here I will try to show a little more detail on the instrumentation effect. Please note these tests were done with PostgreSQL 8.4 devel.[Read More]

Monday Mar 10, 2008

Light Weight Lock Contention

After the IO effect had been removed the tests continued. Now able to focus on CPU related. As usual the quest is for finding where time flies by: either using the CPU or waiting for a resource.  When using higher loads the CPUs are not used to their fullest capacity: the processes must be waiting for resources.  Discarding network waits (it is here that commands come from) the next source are PostgreSQL its so called LWLocks.[Read More]

Sunday Mar 09, 2008

PostgreSQL performance study: introduction

While trying to improve the performance of PostgreSQL many tests are done. Currently the focus is on CPU related elements. In order to cleanup the picture some IO related optimizations were done. This post shows some graphs depicting the write side of the IO. This post is the first in a (possible) long series discussing findings, thought, etc.
[Read More]



« July 2016