Finding the throttle

In my previous blog entry I showed a behavioral pattern in PostgreSQL that needs attention. In this entry I will try to explain what I did to find the culprit.

The only statistics discussed were the ones in tcop/postgres.c These gave an impression of the global activity. During a checkpoint lots of time is spend handling commands. Although this time seems to be zero, it cummulates to figures as high as 1600 seconds. By looking at the other statistics during those checkpointing intervals some of these too show this accumulation hopefully too. This would make it easy to find the root cause. At 1218 seconds this pattern shows nicely. The next table shows the top time-consumers during that interval:

SLRUPhysicalWritePage::Part6 16 8368 523 8368
SLRUPhysicalWritePage::call 16 8370 523 8368
SLRUselectLRUPage::WritePage 1 8370 8370 8370
SLRUwritePage::call 16 8372 523 8370
postgresqlMain::readchar::B 9799 19100 1 4297
LWLock::Acquire::08::semaphoreTime 1711 21303 12 411
xlogflush::go::acq 1724 21349 12 411
SLRUselectLRUPage::Return3 768 344920 449 10574
SLRUselectLRUPage::Return1 2162 879268 406 10568
LWLock::Acquire::59::semaphoreTime 676 1059722 1567 8368
SimpleLRUWaitIO::bufferlock::Acquire 768 1060120 1380 8368
SLRUselectLRUPage::WaitIO 622 1061593 1706 8381
SimpleLRUWaitIO::call 747 1061932 1421 8381
SLRUselectLRUPage::Part3 623 1069964 1717 8381
SLRUselectLRUPage::call 2931 1224187 417 10574
SLRUreadPageReadOnly::SimpleReadPageCalls 1022 1224899 1198 10575
SLRUreadPage::call 2799 1224903 437 10575
SLRUreadPageReadOnly::call 1525 1225559 803 10576
postgresqlMain::handleCommand::E 9824 1267569 129 10577

Ok, what are these? These are the statistics. I try to use a #8220convention:#8221 A name consists of one or more tags separated by a colon pair. First part is the name of the function in which the statistic is used. Next part should indicate the part in the function that it means to time, and if present the final part tells more about the details being measured. Most statistics measure wall clock time, however it can be that another metric is used (number of elements in a list for example). Rows in the table are sorted by total time. The postgresqlMain::handleCommand::E is the statistic used to count time needed to handle an Execute command. During the last second it was called 9824 times, and the total time it took to do this was a little more than 1267 seconds. On average it took 129 msecs. However there once was a maximum handling time of a little over 10 seconds. (if time is the metric, the resolution is msec.) The functions in my copy of backend/access/transam/slru.c are now heavily instrumented. Statistic names starting with SLRU belong there. It looks like most of the commandHandling time comes from SLRUreadPageReadOnly: SLRUreadPageReadOnly::call depicts the total time spent in this function. However from the count value it is obvious that not all handleCommand visit this function. 1022 of these #8230ReadOnly function calls (SLRUreadPageReadOnly::SimpleReadPageCalls) call the SimpleLruReadPage function. Almost all of those 1267 seconds are spend in there. This function has 6 statistics build in. None of the others popup besides that one for the time consumption by the SLRUselectLRUPage::call. Since it was not obvious to me at that time what was going on in that function. Therefore I split this routine in three parts and measured time spend in each of them. Only part3 is visible in this Top-N table. Looking at the code I see this part contains an if-else construct in which either SimpleLruWaitIO or SimpleLruWritePage are called. The first one takes 1069 seconds for 747 calls, the second one takes only 8 seconds. However that is for just one call! This is a single write that takes forever. Why? And is this related to those 1069 seconds spent in the WaitIO part. Names used do point in that direction.The SimpleLruWaitIO routing too has 6 statistics of his own. One of them is: SimpleLRUWaitIO::bufferlock::Acquire.
The bufferlocks are assigned dynamically. This then, could be an explanation for the accumulation of time in LWLock 59. As you might recall in this one all statistics for all dynamic locks in the system are stored. The numbers do match. And indeed the maximum waiting time seen for LWLock::59:semaphoreTime is very similar to that WriteIO. It looks like all my client related processes are waiting for this single write that takes so long, since 128 \* 8 seconds translates to 1024 seconds total. Since this maximum time is for one single occurrence this also means that single block (did not actually look to the block size, so it is an assumption really) is used by all processes. If this is the case than there is something out there behaving as a classic scalability issue.

Time consuming write

Now why is that single write so terribly slow? Can I get more information out of the statistics I have? Before the SlruPhyscialWritePage routine is called one of the buffer locks is acquired in EXCLUSIVE mode. After this routine returns and some extra housekeeping is done this lock is released again. The PhyscialWritePage routine, which of course also has a fair amount of statistics, is the one where time is consumed. Meaning that that lock being held in EXCLUSIVE mode is our scalability issue! The PhyscialWritePage was also #8220magic#8221 to me, so simple separation by adding #8220PartX#8221s here and there learns me that Part6 is the time consumer. Since I know by traversing the code, that the last parameter passed to this routine has a value of null it must be the pg_fsync(fd) in this routine that causes the long waiting. This raises the next question. What is being synced? It must be data since we are in a checkpoint.

DTrace to the rescue
The little script I used for this basically starts on a fdsync system call. Before this happens it knows the file descriptors and file names that belong to each other. On each fdsync that takes longer than 2 seconds the filename that belongs to the file descriptor is printed.The script also counts all writes against the file descriptors. Basically the assumption is that all these write are out there waiting to be sync-ed. During the test three different filenames were shown as follows:
fsync 24637: Took 8 secs to write 122588 entries to "pg_tblspc/16384/16386/16387.1"
fsync 24637: Took 8 secs to write 128169 entries to "pg_tblspc/16384/16386/16387"fsync 24698: Took 7 secs to write 1 entries to "pg_clog/000C"
Here 24637 is the process id of the bgwriter. It is flushing single gigabyte portions at each flush. To all files of interest. (the ones above are application specific tablespaces/files). However writing a single block to clog also takes a very long time. Mmm that one is easy to resolve. Just wondering if it would make a difference. Moved pg_clog (including its content) to yet another disk. Created a softlink, and redid the test.

The CLOG solution
The next two graphs tell the story. These are the same graph as shown in the previous blog entry. The first one present the overall picture, the second one zooms in for a little more detail.
handlingcommandsclog.png handlingcommandsclogzoom.png

This one looks far better. However that pattern of "increased readcommand time / decreased handle time" is interesting too. And of course the questions:
  • CLOG is flushed by the fdsync. However I do not think this originates from the code pathes inspected so far. Just happened to be a side effect of the DTrace script and relatively ease to remove. So where did it come from?
  • Is it just the huge load of IOs on a ZFS file system that causes these waits, or can this be seen on UFS too for example?
  • Still there is a huge burst of writes (1GB) to each tablespace. Would it be worthwhile to sync in between checkpoints to spread the IO load over time? A single disk can sustain only so many IOs/sec. Spreading should indeed be advantaguous.

This screams for further inspection.


Post a Comment:
  • HTML Syntax: NOT allowed



« April 2014