Light Weight Lock Contention

During earlier tests it was discovered that PostgreSQL it's ProcArrayLock was the major source of contention when executing test with over 500 concurrent users. To put this into context some additional probes were added in the lwlock.c source and in postgres.c. In the latter the main routine(PostgresMain), that reads the user commands and executes them, determines the next thing that needs to be done like bind, parse, define and execute. By using these events it is possible to look at the LWLock contention in context of the current requests.

Four major statistics are measured with respect to these leight weight locks: the time to acquire the lock ; the time to wait for the lock before it is available; the time it takes to release it; and finally the time the lock was being held. The acquire time includes the wait time.
All times are reported in milliseconds and accummulated.

The acquire time --after correction for implicit wait time-- and release time are rather constant and small. The wait and held times are the interesting ones.

First, lets look at the numbers for a 64 user run and a 1280 user run. All lockid's uptill but not including 25 are predefined locks. All numbers starting at 25 are dynamically assigned. Locks with ID 25 or higher are cummulated and presented as lock with ID 25.

Here is some data about the consumption pattern of LWLocks. While doing a 3600 minute run there were 10 cycles of DTrace activity to collect this data. A single cycle consists of 1 minute of instrumentation followed by 4 minutes of no instrumentation. This will be refered to as a "partial (instrumentation) run."

All data from those 10 single minutes intervals are combined. For those locks that were acquired for more than 100000 times the data is shown. Data is sorted on the wait time column.

First column depicts the lockID, the second is a code of two letters. The first letter is either S (shared mode acquire) or E (exclusive mode acquire). The second letter is the context: P(arsing), B(inding), E(xecuting), S(ync). For a complete list check the source ;\^)

After these there are four groups. Each group consists of three columns: number of times, cummulative wait time in milliseconds and average wait time in milliseconds (cummulative divided by number). The groups are "Time Held", "Acquire Time (including wait time)", "Wait Time", and "Acquire Time without Wait Time." Tables are sorted for the column "WaitingTime."

64 users
               Holding                Acquiring                  Waiting           Acquire without Waiting   
ID       Count    Time   Avg    Count      Time     Avg  Count      Time     Avg    Count     Time     Avg
 5 SE   694364   16195 0.023   694364      8186   0.012     27         0   0.000   694364     8186   0.012
 5 SP   433670   10883 0.025   433670      5967   0.014      7         0   0.000   433670     5967   0.014
 5 SS   561073   12809 0.023   561073      6537   0.012     17         0   0.000   561073     6537   0.012
 5 SB   855804   20103 0.023   855804     10074   0.012     39         1   0.026   855804    10073   0.012
25 SE 17651780  419107 0.024 17651787    211544   0.012    186         6   0.032 17651787   211538   0.012
 4 SP   105640    3299 0.031   105640      1503   0.014    308        10   0.032   105640     1493   0.014
11 SE   133554    3405 0.025   133554      1731   0.013    283        13   0.046   133554     1718   0.013
11 EE   161634    3945 0.024   161634      2049   0.013    761        27   0.035   161634     2022   0.013
 4 SB   380304   12601 0.033   380304      5056   0.013    714        34   0.048   380304     5022   0.013
 4 SE   180781    5213 0.029   180781      2371   0.013    967        38   0.039   180781     2333   0.013
 4 EE   161423    3527 0.022   161423      2069   0.013   3455       223   0.065   161423     1846   0.011
 3 EE   161445    3797 0.024   161445      2442   0.015   8374       440   0.053   161445     2002   0.012
25 EP   224292    6479 0.029   224292      3993   0.018   9959       682   0.068   224292     3311   0.015
25 EE  1813561   87351 0.048  1813562     24029   0.013  15611      2341   0.150  1813562    21688   0.012
 7 EE   635664   16594 0.026   635664     11731   0.018  50265      3138   0.062   635664     8593   0.014
25 EB   852120   37442 0.044   852120     35556   0.042  95684      9798   0.102   852120    25758   0.030
 8 EE   161221  108131 0.671   161222    183921   1.141  63073    181181   2.873   161222     2740   0.017


1280 users 
 5 SS  1095699   29526 0.027  1095700     16210   0.015    239       483   2.021  1095700    15727   0.014
 5 SE  1357406   38023 0.028  1357407     22026   0.016    349       647   1.854  1357407    21379   0.016
 5 SP   851657   27240 0.032   851657     16652   0.020    325       649   1.997   851657    16003   0.019
 5 SB  1678278   49172 0.029  1678279     29385   0.018    684      1682   2.459  1678279    27703   0.017
 3 EE   315330    8923 0.028   315330     13394   0.042  14738      8524   0.578   315330     4870   0.015
11 SE   250017    7282 0.029   250017     19220   0.077   8783     15257   1.737   250017     3963   0.016
25 SE 34187469  988775 0.029 34187488    490078   0.014    398     19655  49.384 34187488   470423   0.014
11 EE   322850   11541 0.036   322850     43695   0.135  25076     36419   1.452   322850     7276   0.023
25 EP   445981   16483 0.037   445982     74782   0.168  23477     65533   2.791   445982     9249   0.021
25 EE  3460287 1258692 0.364  3460295    960810   0.278 176807    902399   5.104  3460295    58411   0.017
25 EB  1663721   74921 0.045  1663723   1308005   0.786 183328   1260363   6.875  1663723    47642   0.029
 7 EE  1138770   39522 0.035  1138771   1798318   1.579 169920   1772398  10.431  1138771    25920   0.023
 4 SP   206469   38776 0.188   206470   5182624  25.101   8794   5181440 589.202   206470     1184   0.006
 8 EE   230775   54803 0.237   230777   8466229  36.686 210054   8454877  40.251   230777    11352   0.049
 4 SE   384603   66350 0.173   384608   9997294  25.993  16936   9993060 590.048   384608     4234   0.011
 4 SB   742002  138343 0.186   742004  14315047  19.292  24277  14311237 589.498   742004     3810   0.005
 4 EE   305800   12427 0.041   305800 196006021 640.961 341076 200376943 587.485   305800 -4370922 -14.293

First some remarks. The negative numbers for 1280 users the ProcArrayLock (ID: 4) in exclusive mode in execute context (4 EE), are probably due to the way things are handled within DTrace with respect to timing and buffer management. I will try to find out what this can be. It seems to only happen (also for some other runs with higher number of users count) when the number of waits are bigger than the number of acquire requests. This occurs when too many processes are "fighting" for this lock. It basically means "the lock was released" but someone else was succesfull acquiring so I need to try again."

Looking at the actual time it takes to acquire a lock (last group) the time it takes is nicely constant when comparing a system under high load to a lightly loaded system. These numbers also map nicely into a single time slot of the Solaris scheduler (I always use the FX class when so many processes are running.) This means that fiddling with time slices to gain throughput will not show much effect. Also the time during which a lock is being held seems quite nice too. Other inspection with DTrace showed that (for locks with ID < 25) there were barely any system calls made. The locks are used for work that is highly dependend on CPU speed. Once more playing with time slices will not bring significant improvement.

When looking at the number of times a lock was held there is definitely no lineair relation comparing the 64 with the 1280 user test. Counts for 25 SE and 25 EE are about twice as high for the 1280 user case. Furher inspection shows that this is also true for all other locks.

From the above it looks ok to only look at the wait count and time characteristics while doing further comparisions when changing the load. This would at least make reading the data a little easier ;\^)

And before I forget, here is the DTrace script used to get this data:


self int handleCommand;
self int readyForQuery;
self char commandType;
self int commandTypeInitialized;

BEGIN { started = timestamp; }

/\* First the extra probes in postgres.c are used to place context. Since, when DTrace is started it can be
   there is not context! For this the charecter '-' is used
   The commands are only timed after their start has been seen
\*/
postgresql\*:postgres::ready_for_query_start
{
        self->readyForQuery = timestamp;
}

postgresql\*:postgres::ready_for_query_end
/self->readyForQuery /
{
        @count["readyForQuery", '-'] = count();
        @elapsed["readyForQuery", '-'] = sum(timestamp - self->readyForQuery);
        self->readyForQuery = 0;
}

postgresql\*:postgres::ignore_till_sync
{
        @count["ignore_till_sync", '-'] = count();
}

postgresql\*:postgres::handle_command_start
{
        self->commandTypeInitialized = 1;
}

postgresql\*:postgres::handle_command_start
/arg0 == 0/
{
        self->handleCommand = 0;
        @count["zero",'-'] = count();
}

postgresql\*:postgres::handle_command_start
/arg0 != 0/
{
        self->handleCommand = timestamp;
        self->commandType = (char)arg0;
}

postgresql\*:postgres::handle_command_end
/self->handleCommand/
{
        @count["handleCommand", arg0] = count();
        @elapsed["handleCommand", arg0] = sum(timestamp - self->handleCommand);
        self->handleCommand = 0;
        self->commandType = '-';
}

/\* Below the LWLock related probes are used to gather information
   All LWLocks are divided: up till ID 25 (not inclusive) are 
   locks with a well known ID. Locks starting with ID 25 and up 
   are  dynamic. Values are combined into grand totals for 25 and
   up.
\*/
postgresql\*:postgres:LWLockAcquire:lwlock-acquire
{
        self->ts[arg0,arg1]=timestamp; 
        self->mode[arg0] = arg1;
}

postgresql\*:postgres:LWLockRelease:lwlock-release
/self->commandTypeInitialized && arg0 < 25 && self->ts[arg0,self->mode[arg0] ] /
{
        this->mode = self->mode[arg0];
        @lockheldCount[arg0,this->mode,self->commandType] = count();
        @lockheldTime[arg0,this->mode,self->commandType] = sum (timestamp - self->ts[arg0,this->mode]);
        self->ts[arg0,this->mode] = 0;
}

postgresql\*:postgres:LWLockRelease:lwlock-release
/self->commandTypeInitialized && arg0 >= 25 && self->ts[arg0,self->mode[arg0]] / 
{
        this->mode = self->mode[arg0];
        @lockheldCount[25,this->mode,self->commandType] = count();
        @lockheldTime[25,this->mode,self->commandType] = sum (timestamp - self->ts[arg0,this->mode]);
        self->ts[arg0,this->mode] = 0;
}

postgresql\*:postgres:LWLockAcquire:lwlock-startwait
/arg0 < 25/
{
        self->tsw[arg0,arg1] = timestamp;
        @queueLengthStartwaitMax[ arg0,arg1 ] = max(arg2);
        @queueLengthStartwaitAvg[ arg0,arg1 ] = avg(arg2);
        @queueLengt2StartwaitMax[ arg0,arg1 ] = max(arg3);
        @queueLengt2StartwaitAvg[ arg0,arg1 ] = avg(arg3);
}

postgresql\*:postgres:LWLockAcquire:lwlock-startwait
/arg0 >= 25/
{
        self->tsw[arg0,arg1] = timestamp;
        @queueLengthStartwaitMax[25,arg1 ] = max(arg2);
        @queueLengthStartwaitAvg[25,arg1 ] = avg(arg2);
        @queueLengt2StartwaitMax[25,arg1 ] = max(arg3);
        @queueLengt2StartwaitAvg[25,arg1 ] = avg(arg3);
}

postgresql\*:postgres:LWLockAcquire:lwlock-endwait
/self->commandTypeInitialized && self->tsw[arg0,arg1] && (arg0 < 25)/
{
        @lockwaitCount[arg0,arg1,self->commandType] = count();
        @lockwaitTime[arg0,arg1,self->commandType] = sum(timestamp - self->tsw[arg0,arg1]);
        @cycleCountMax[arg0, arg1] = max(arg2);
        @cycleCountAvg[arg0, arg1] = avg(arg2);
        self->tsw[arg0,arg1] = 0;
}

postgresql\*:postgres:LWLockAcquire:lwlock-endwait
/self->commandTypeInitialized && self->tsw[arg0,arg1] && (arg0 >= 25)/
{
        @lockwaitCount[25,arg1,self->commandType] = count();
        @lockwaitTime[25,arg1,self->commandType] = sum(timestamp - self->tsw[arg0,arg1]);
        self->tsw[arg0,arg1] = 0;
        @cycleCountMax[25, arg1] = max(arg2);
        @cycleCountAvg[25, arg1] = avg(arg2);
}

postgresql\*:postgres:LWLockAcquire:lwlock-acquire-enter
{
        self->acqtime[arg0,arg1] = timestamp;
}

postgresql\*:postgres:LWLockAcquire:lwlock-acquire-return
/self->commandTypeInitialized && arg0 < 25 && self->acqtime[arg0,arg1]/
{
        @lockAcquireCount[ arg0,arg1,self->commandType ] = count();
        @lockAcquireTime[arg0,arg1,self->commandType ] = sum( timestamp - self->acqtime[arg0,arg1]);
        self->acqtime[arg0,arg1] = 0;
}

postgresql\*:postgres:LWLockAcquire:lwlock-acquire-return
/self->commandTypeInitialized && arg0 >= 25 && self->acqtime[arg0,arg1]/
{
        @lockAcquireCount[ 25,arg1,self->commandType ] = count();
        @lockAcquireTime[25,arg1,self->commandType ] = sum( timestamp - self->acqtime[arg0,arg1]);
        self->acqtime[arg0,arg1] = 0;
}

postgresql\*:postgres:LWLockRelease:lwlock-release-enter
/arg0 < 25/
{
        self->acqtime[arg0,self->mode[arg0] ] = timestamp;
}

postgresql\*:postgres:LWLockRelease:lwlock-release-enter
/arg0 >= 25/
{
        self->acqtime[arg0,self->mode[arg0] ] = timestamp;
}

postgresql\*:postgres:LWLockRelease:lwlock-release-return
/self->commandTypeInitialized && arg0 < 25 && self->acqtime[arg0, self->mode[arg0] ]/
{
        this->mode = self->mode[arg0];
        @lockReleaseCount[ arg0, this->mode,self->commandType ] = count();
        @lockReleaseTime[arg0, this->mode,self->commandType ] = sum( timestamp - self->acqtime[arg0, this->mode]);
        self->acqtime[arg0, this->mode] = 0;
        self->mode[arg0] = 0;
}

postgresql\*:postgres:LWLockRelease:lwlock-release-return
/self->commandTypeInitialized && arg0 >= 25 && self->acqtime[arg0, self->mode[arg0]]/
{
        this->mode = self->mode[arg0];
        @lockReleaseCount[25, this->mode,self->commandType] = count();
        @lockReleaseTime[25, this->mode,self->commandType ] = sum( timestamp - self->acqtime[arg0, this->mode]);
        self->acqtime[arg0, this->mode] = 0;
        self->mode[arg0] = 0;
}


:::tick-60s { exit(0); }

:::END {
        printf("WALL: %d millisec\\n", ((timestamp - started)/1000000) );

                                              printa("CH %2d %1d %2c %@15d\\n",@lockheldCount);
        normalize(@lockheldTime,1000000);     printa("TH %2d %1d %2c %@15d\\n",@lockheldTime);

                                              printa("CA %2d %1d %2c %@15d\\n", @lockAcquireCount);
        normalize(@lockAcquireTime, 1000000); printa("TA %2d %1d %2c %@15d\\n", @lockAcquireTime);

                                              printa("CR %2d %1d %2c %@15d\\n", @lockReleaseCount);
        normalize(@lockReleaseTime, 1000000); printa("TR %2d %1d %2c %@15d\\n", @lockReleaseTime);

                                              printa("CW %2d %1d %2c %@15d\\n",@lockwaitCount);
        normalize(@lockwaitTime,1000000);     printa("TW %2d %1d %2c %@15d\\n",@lockwaitTime);

        printa("QM %2d %1d %@10d\\n", @queueLengthStartwaitMax);
        printa("QA %2d %1d %@10d\\n", @queueLengthStartwaitAvg);

        printa("QM2 %2d %1d %@10d\\n", @queueLengt2StartwaitMax);
        printa("QA2 %2d %1d %@10d\\n", @queueLengt2StartwaitAvg);

        printa("CM %2d %1d %@10d\\n", @cycleCountMax);
        printa("CV %2d %1d %@10d\\n", @cycleCountAvg);

        normalize(@elapsed, 1000000);
        printa("C %10s  %2c  %@10d\\n", @count);
        printa("T %10s  %2c  %@10d\\n", @elapsed);
}
Comments:

Post a Comment:
  • HTML Syntax: NOT allowed
About

paulvandenbogaard

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today