PostgreSQL Lightweight Locks: Queue Length under High Load

The previous blog entry (Using DTrace to measure Lightweight lock contention) discussed some findings with respect to contention on the lightweight locks in PostgreSQL. Under more extreme loads the ProcArrayLock turned out to be number one on the "waiting for locks" list.

I thought it would be worthwhile to also get an impression of the length of these waiter queues. To obtain this information I changed the source code a little (lwlock.c): I added an inList element to the LWLock structure. Each time a process needs to wait for a lock to become available (just before starting the semaphore wait in LWLockAcquire()) this variable is incremented, and once the wait on the semaphore returns it is decremented. Both updates are done under the protection of the lock->mutex spinlock.

The current value of the inList element is passed to the lwlock_startwait probe, which is called just before the for(;;) { /* wait on the semaphore */ } loop is executed. The other parameters to this probe are of course lockid and mode.
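To make this concrete, here is a minimal sketch of such a change. The inList field and the probe follow the description above; the surrounding code is a simplified rendition of the 8.4-era LWLockAcquire() wait path, and the probe macro name is illustrative rather than the literal patch:

/* In the LWLock structure (lwlock.c): */
typedef struct LWLock
{
    slock_t     mutex;      /* protects the fields below */
    /* ... existing fields: exclusive, shared, head, tail ... */
    int         inList;     /* NEW: number of backends currently queued on this lock */
} LWLock;

/* In LWLockAcquire(lockid, mode), when the lock cannot be granted immediately: */
        int     queued;

        SpinLockAcquire(&lock->mutex);
        queued = ++lock->inList;            /* join the wait queue, under the spinlock */
        SpinLockRelease(&lock->mutex);

        /* custom probe: lockid, mode and the current queue length */
        TRACE_POSTGRESQL_LWLOCK_STARTWAIT(lockid, mode, queued);

        for (;;)
        {
            /* wait on the semaphore; a wakeup does not guarantee the lock was obtained */
            PGSemaphoreLock(&proc->sem, false);
            if (!proc->lwWaiting)
                break;
            extraWaits++;
        }

        SpinLockAcquire(&lock->mutex);
        lock->inList--;                     /* leave the wait queue */
        SpinLockRelease(&lock->mutex);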

Here are some results for the queue lengths: two sets of two columns each per user test, and two user tests (384 and 1280 users). The two columns in a set show the maximum and the average of the list length for a given lockid and lock mode. The first set is from a heavy run, the second from a lightweight run. All tests are done with PostgreSQL 8.4 devel, compiled with our latest compiler. LockID and mode are the first two cells of each row.

The heavy run is done with the DTrace script used for the data presented in the previous blog entry. The lightweight run uses a DTrace script that relies on the changed source (lwlock.c); this script is shown at the bottom of this blog entry.

By the way, the wall clock times (in the order of the test sets in the next table) are 3505, 3517, 3566, and 3574 seconds.


                ---- 384 users ----      ---- 1280 users ----
                 Heavy       Light        Heavy        Light
LockID  Mode    Max  Avg    Max  Avg     Max   Avg    Max   Avg
0 E 1 1 2 0 4 1 7 0
3 E 42 1 106 0 1,154 21 51 0
3 S - - - - 569 113 1 0
4 E 124 6 225 0 1,117 963 943 588
4 S 126 7 226 0 1,116 970 959 27
5 E 176 71 46 3 1,067 351 1,017 224
5 S 177 70 45 0 1,068 334 1,016 0
6 E 1 1 2 0 1 1 4 0
7 E 227 41 180 0 928 13 1,008 2
8 E 385 56 385 10 1,281 19 1,281 22
11 E 109 1 45 0 1,229 442 84 0
11 S 105 1 49 0 1,229 239 85 0
17 E 1 1 1 0 1 1 0 0
25 E 184 13 254 0 964 8 488 0
25 S 9 1 10 0 1,121 252 81 0

For the 384 user test case the average queue length is reduced to almost zero for all modes: barely any wait queues remain. Even for the 1280 user case most of the queues have disappeared. However, for the ProcArrayLock (id 4) there is still a sizeable queue of waiters on average. We need to look at the waiting times to see the effect of these queues.

Here the same information is presented as in that earlier blog entry, but now all times are in seconds. For each user count the first four columns are from the heavy test, the last four from the lightweight test.

For the 384 user tests:

                ---------- Heavy test ----------     ------- Lightweight test -------
                  Acquire           Waited              Acquire           Waited
LockID  Mode    Count     Time    Count     Time      Count     Time    Count     Time
3 E 4,328,730 205 132,580 122 6,783,614 0 765,257 96
4 E 4,328,587 14,224 2,190,914 14,016 6,783,877 0 2,960,167 1,594
4 S 10,197,325 2,557 381,405 2,209 28,064,153 0 704,473 349
5 E 9,282 187 13,300 186 9,631 0 13,508 7
5 S 22,926,841 871 6,285 90 106,924,214 0 11,096 5
7 E 16,411,884 135,452 2,732,079 134,857 25,166,994 0 6,020,423 1,729
8 E 2,911,491 181,859 2,708,528 181,677 6,328,096 0 5,959,058 35,831
11 E 4,361,735 655 349,049 504 6,892,708 0 352,448 92
11 S 3,573,270 267 136,207 172 6,995,489 0 162,538 25
25 E 48,205,700 63,320 3,241,790 62,072 120,873,295 0 13,794,350 5,838
25 S 472,079,693 8,654 6,350 298 745,844,079 0 5,188 3

For the 1280 user tests:


                ---------- Heavy test ----------     ------- Lightweight test -------
                  Acquire           Waited              Acquire           Waited
LockID  Mode    Count     Time    Count     Time      Count     Time    Count     Time
3 E 4,277,432 9,316 197,668 9,245 8,973,535 0 608,357 263
4 E 4,276,477 2,818,668 4,670,676 2,818,869 8,973,818 0 11,050,127 2,256,387
4 S 10,087,269 217,479 357,764 217,257 37,524,637 0 1,414,914 290,452
5 E 19,271 1,558 28,195 1,557 33,077 0 59,692 1,816
5 S 22,882,489 1,514 20,322 1,113 141,803,558 0 58,101 1,652
7 E 14,465,104 34,554 1,970,784 34,198 32,602,200 0 7,661,015 30,723
8 E 3,118,804 73,004 2,821,275 72,826 7,715,914 0 7,403,513 147,110
11 E 6,555,359 121,617 1,639,714 121,425 9,302,707 0 760,158 542
11 S 3,582,858 13,718 292,930 13,653 8,797,396 0 293,561 165
25 E 48,560,059 27,913 2,600,994 27,234 160,477,884 0 17,013,139 49,811
25 S 468,498,039 103,584 1,017,156 96,423 982,980,852 0 14,975 226

With respect to the results of these runs it is worthwhile to reveal the throughput numbers:

  • 384 users, lightweight: 41000
  • 384 users, heavyweight: 26400
  • 1280 users, lightweight: 55000
  • 1280 users, heavyweight: 26000

Looking at the 1280 user case shows the wait time for the ProcArrayLock reduced from 2818K to 2256K seconds. However, this lock has been acquired twice as often, and the total number of waits has grown drastically. Each single wait is much shorter in the lightweight test than in the heavy test: roughly 2,818,869 / 4,670,676 ≈ 0.60 seconds per wait versus 2,256,387 / 11,050,127 ≈ 0.20 seconds per wait. The overall effect can of course be seen in the throughput numbers: for the 1280 user case the throughput more than doubled when changing to lightweight instrumentation.

However, the ProcArrayLock problem we identified earlier is still the number one contender. Even the artificial lock 25 (an accumulation of all locks with id 25 and higher) has significantly less wait time, while being acquired many more times.

One noteworthy observation is that the number of times waited can be larger than the number of times a lock has been acquired. The wait is performed in a loop (the for(;;) shown in the sketch earlier): once the semaphore being waited for is released, a new attempt is made to acquire the lock, and unfortunately this is not always successful. This seems to happen only for 4E and 5E.
Right now I believe it is fair to say two things:
  1. Use instrumentation wisely. Be aware of the probe effect. Try to minimize the use of probes that fire often and that execute more complex actions;

  2. Further investigation is required to understand why the ProcArrayLock becomes a scalability factor. The following points need further attention:

    1. make sure it is not due to a benchmark side effect that makes this issue artificial;
      1. vary the load generation by stressing with fewer users but with smaller "think times";
      2. use different load generators.

    2. check if CPU cache misses could be a source of inefficient CPU consumption, increasing the lock holding time;
    3. test if the snapshot taking logic implementation "drives" this pattern in an unforeseen way (a rough sketch of this logic is shown below this list).
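Regarding that last point, the reason the snapshot logic is a prime suspect is that PostgreSQL takes its snapshot while holding the ProcArrayLock in shared mode, and visits one proc array entry per backend while the lock is held. Here is a very rough sketch of that pattern, abbreviated from the 8.4-era GetSnapshotData(); names are simplified and most details are omitted, so treat it as an illustration rather than the actual implementation:

Snapshot
GetSnapshotData(Snapshot snapshot)
{
    ProcArrayStruct *arrayP = procArray;
    int         index;

    /* ProcArrayLock is lock id 4 in the tables above */
    LWLockAcquire(ProcArrayLock, LW_SHARED);

    snapshot->xmax = ShmemVariableCache->latestCompletedXid;

    /*
     * A full pass over the proc array is made while the lock is held, so
     * the hold time grows with the number of connected backends
     * (384 versus 1280 users), and a new snapshot is taken at least once
     * per transaction.
     */
    for (index = 0; index < arrayP->numProcs; index++)
    {
        volatile PGPROC *proc = arrayP->procs[index];
        TransactionId xid = proc->xid;

        /* collect the running xids, track the oldest xmin, copy subxids, ... */
    }

    LWLockRelease(ProcArrayLock);
    return snapshot;
}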



Here is the DTrace script used for the lightweight tests:

BEGIN { started = timestamp; }

/*
 * Argument mapping as used below: arg0 = lock id, arg1 = mode,
 * arg2 = waited count, arg3 = queue length (inList), arg4 = time waited.
 */
postgresql*:postgres:LWLockAcquire:lwlock-acquire-total
/arg0 < 25/
{
    @lockAcquireCount[arg0, arg1] = count();
    @lockwaitCount[arg0, arg1]    = sum(arg2);
    @lockwaitTime[arg0, arg1]     = sum(arg4);
    @queueLengthMax[arg0, arg1]   = max(arg3);
    @queueLengthAvg[arg0, arg1]   = avg(arg3);
}

/* All locks with id 25 and higher are accumulated under id 25. */
postgresql*:postgres:LWLockAcquire:lwlock-acquire-total
/arg0 >= 25/
{
    @lockAcquireCount[25, arg1] = count();
    @lockwaitCount[25, arg1]    = sum(arg2);
    @lockwaitTime[25, arg1]     = sum(arg4);
    @queueLengthMax[25, arg1]   = max(arg3);
    @queueLengthAvg[25, arg1]   = avg(arg3);
}

:::tick-3600s { exit(0); }


:::END {
    printf("WALL: %d millisec\n", ((timestamp - started) / 1000000));
    printa("CA %2d %1d x %@15d\n", @lockAcquireCount);
    printa("CW %2d %1d x %@15d\n", @lockwaitCount);
    normalize(@lockwaitTime, 1000000);
    printa("TW %2d %1d x %@15d\n", @lockwaitTime);
    printa("QM %2d %1d %@10d\n", @queueLengthMax);
    printa("QA %2d %1d %@10d\n", @queueLengthAvg);
}

