Checkpoint throttled throughput
By paulvandenbogaard on Jul 16, 2008
In previous entries the checkpoint effect on throughput was mentioned but not explained. While looking at the IO behaviour comparing UFS to ZFS when running a PostgreSQL database instance, I see huge bursts of IO that also seem to throttle throughput in a significant way.
The original tests used a load generation tool that reported its measured throughput each 100 seconds. Since we all know that averaging removes the interesting information I redid a test with a 5 second reporting interval. Indeed this showed the effect of throughput much better: 0 TPM while checkpointing. Complete stand still of the application. Wouw, this needs further attention.
Initially I used DTrace probes to instrument the code. However I found that for high frequency probes (firing over 1000 times a second on many processes) the instrumentation effect was not neglectable. Magne Mahre wrote a little library we started using to instrument "a light weight way". In my situation LWLock 25 was used to handle some synchronization issues. Every process counted locally and during commit time the global pool of counters was synced. This works rather well. However it is so unfortunate that dbwriter and logwriter commit at a very low rate. I had to call a "Copy local to global" routine to ensure the information gathered in these processes did also show in the final reports.
Unfortunately the final reports did show a new number one contender: lock 25. This method too did not turn out to be the light weight instrumentation method looked for. Although I must admit that in my attempt to understand I enter quite an amount of counters in the machine (currently I have over 550 new counters). Lots of these turn out not to increment.
I wrote a quick library that uses the atomic_xxx routines provided in the libc librarty that comes with Solaris. The now complete mutex and lock free way of instrumenting seems to provide me with the information I was looking for. Although I am still grasping to understand it completely. This library creates statistic objects in a new segment of shared memory. The advantage seems to be that any other process can attach to it, to display a real time view of the counters in there. I currently sample with a frequency of 1 Hz. And this adequately shows me the information.
A single statistic holds a running total, a counter of "times updated", the minimum value so far, and the maximum value so far used to update that statistic. Maybe a future version will also provide a histogram facility so a better understanding of the dynamics of certain statistic could be possible. Right now it is not implemented.
I currently using one of the earlier PostgreSQL 8.4 sources. All is build 64bit in an optimized way with Sun Studio 12 compilers and tools. Most tests run for an hour, after which I start mining the data. Current focus is on activities, or lack there of, during checkpoints. Checkpoints that are parameterized to happen every 300 seconds. And should finish in 15% of the time. All PostgreSQL related files reside on two ZFS pools. A WAL pool holds 4 internal disks. The other pool consists of 5 internal disks and holds all the other files. Each pool has its own memory backed up cache array for the ZFS Intent Log (see earlier post on UFS and ZFS).
Since the checkpoints are "tuned" to happen as fast as possible the concentration of IO seems to be high. However the IO subsystem can cope with this.
The next graph shows the throughput. A clear pattern is visible. I had a hunch that it might be related to garbage collection in the client. Not seeing throughput can also mean less activity on the load generator. Since this is a Java program GC needed to investigated. But ok, using jstat -gc it was very obvious this application behaves as it should be. No prolonged periods of GC-takes-over-the-world here. Once more the database was "prime suspect."
All tests are done with 128 clients.m
Mapping this to the IO happening is clearly shows in the following graph. Please note these TPM numbers are for 100 second interval and are clearly shown after that interval. Also the actual point "Time=zero seconds" is different on the load generating machine and the system under test.
And the IO graphs do not include the traffic to the ZIL holding devices. This is database related IO only. Seems obvious that the sudden burst in IO is related to the next dip in throughput. A dip that is about 20-25%.
From this graph my idea resulted to do some finer grained measurements. This to see if during those high IO peaks all throughput suffered. Redid the test, but now with a 5 second interval for reporting TPM..
Same information as the previous graph, but now a 5 second interval for TPM. Same overall pattern as the one above but now focussing on a smaller, arbitrary selected, interval.
Indeed the checkpoint seems to be the driving force for this sudden, steep, not long lasting, surge in throughput. However mapping the results from two applications running on different machines is a little tricky. As the saying goes: 'Use the source, Luke'
Using the source
In order to be able to exclude the data from the client I implemented some counters in tcop/postgres.c In the routine PostgresMain there is a "for ( ; ; )" loop in which data is read from the client, and acted upon. Which these I am now able to tell how much time (WAL clock) has elapsed handling data in different contexts (execute, parse, bind, sync, define) and how much the database process is waiting for the client. And of course it tells me the number of times this has been done. The counters are cumulating over time. However perl is quite happy to translate this into deltas per second.
Besides these counters I enthusiastically overloaded many source modules with too many statistics. In storage/lmgr/lwlock.c I placed multiple per non dynamic lock, and one where I cumulate all the information for the dynamic locks. In my source this one is numbered 59. The next table is a snippet where the tcop/postgres.c placed stats are shown together with those from the lwlock.c source module (only the most time consuming lwlocks during the time a checkpoint is happening. BTW the tests are now reduced to 1800 or so seconds since these patterns seem not to be affected by warm up phases while mimicking real life..
Here is the graph. I put the ceiling on 400M to make things a little more visible. The actual peaks are around 1600M. micro seconds. The third graph depicts the total time spend reading and handling commands.
Zooming in a little show a bit more detail:
Those short interruptions directly followed by a peak of the same magnitude are artifacts of measurements: if a timer stops in the beginning of the next interval, there is an extra cumulation in that next interval instead of the current one. However the peak at 910 and 1210 seconds are way beyond these artifacts. Total time is quit nice: 128 seconds. Recall we have 128 processes all measured each single second! The extra peaks/dips in reading/handling every 22 seconds or so is not yet explained. This one might be due to short garbage collection interruptions. It is indeed more time waiting for a command to arrive. Need to look into that one too some day soon...
Looking at the blue 'Totals' graph show a couple of seconds of "nothingness" Not reading, but also not executing! Followed by an extreme peak of handling time. BTW if a statistic timer starts in interval A, and ends in interval A+5 seconds, this adds zero's to A, A+1, ..., A+4, and 5 to interval A+5. There is indeed a repeating pattern of total zero throughput ( I am sure, I read the numbers.)