Using DTrace to measure Light Weight Lock contention

In an earlier post a table was shown with timing information for light weight locks. Here I will try to show the instrumentation effect in a little more detail. Please note that these tests were done with PostgreSQL 8.4 devel.

Presented is data from a 384 user and a 1280 user test. Both tests were done with so-called partial and with full instrumentation. The same DTrace script is used as for the data in the previous blog entry; however, some extra context is provided here. This context comes from also measuring the number of executes, parses and binds that were involved.

During a full test DTrace runs continuously for the whole hour a single test lasts. In the partial case the same script ran for one minute, followed by a four minute period without instrumentation; this cycle repeated throughout the hour the test was running. This has one major consequence: the full DTrace tests measured for approximately one hour, while the partial tests measured for less than one sixth of an hour. In order to make the numbers comparable they are presented on a per second basis. For completeness the actual wall clock measurement time is also shown.
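
As an illustration only (a sketch, not necessarily the mechanism used for these tests), the one minute measurement window can be enforced from inside the D script itself; an outer loop that sleeps four minutes and restarts the script then produces the one-in-five duty cycle:

  /* terminate this measurement run after one minute; an external
     loop (not shown) restarts the script four minutes later */
  tick-60sec
  {
      exit(0);
  }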

The column TP shows the light weight lock ID (4: ProcArrayLock; 8: WALWriteLock) and the context mode (E: executing; P: parsing; B: binding). Next there are three sets of counts (COUNT), total times (TIME) and averages (AVG); times are in milliseconds. The first set covers holding the lock (HELD), the second acquiring the lock (ACQUIRE), and the last waiting to acquire the lock (WAIT). The acquire time includes the wait time. However, an acquire does not necessarily include a wait! The handleCommand and readCommand data come from probes inserted in the PostgresMain routine (backend/tcop/postgres.c), inside its "for (;;) {}" loop. Roughly, the time spent in readCommand (waiting for a command from the client) added to the time spent in handleCommand (processing that command) equals the time used for one iteration of this loop.
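
To give an idea of the mechanics, below is a minimal sketch of such a script. It is not the exact script used for these numbers (that one relies on extra probes compiled into LWLockAcquire/LWLockRelease, discussed further down); it only uses the lwlock probes that ship with a --enable-dtrace build of PostgreSQL 8.4 and it traces a single backend, so the ACQUIRE set is missing here. The exec_parse_message, exec_bind_message and exec_execute_message names are the static routines in backend/tcop/postgres.c and are assumed to be visible to the pid provider.

  /* run as: dtrace -qs lwlock.d -p <backend pid>
     4 = ProcArrayLock, 8 = WALWriteLock in the 8.4 LWLockId enum */

  pid$target::exec_parse_message:entry   { self->mode = "P"; self->inmsg = 1; }
  pid$target::exec_bind_message:entry    { self->mode = "B"; self->inmsg = 1; }
  pid$target::exec_execute_message:entry { self->mode = "E"; self->inmsg = 1; }

  pid$target::exec_parse_message:return,
  pid$target::exec_bind_message:return,
  pid$target::exec_execute_message:return
  {
      self->inmsg = 0;
  }

  /* fires once the lock is held */
  postgresql$target:::lwlock-acquire
  /self->inmsg && (arg0 == 4 || arg0 == 8)/
  {
      self->held[arg0] = timestamp;
      @heldCnt[arg0, self->mode] = count();
  }

  /* these two bracket the sleep inside LWLockAcquire */
  postgresql$target:::lwlock-wait-start
  /self->inmsg && (arg0 == 4 || arg0 == 8)/
  {
      self->wt[arg0] = timestamp;
      @waitCnt[arg0, self->mode] = count();
  }

  postgresql$target:::lwlock-wait-done
  /self->wt[arg0]/
  {
      @waitTime[arg0, self->mode] = sum(timestamp - self->wt[arg0]);
      self->wt[arg0] = 0;
  }

  postgresql$target:::lwlock-release
  /self->held[arg0]/
  {
      @heldTime[arg0, self->mode] = sum(timestamp - self->held[arg0]);
      self->held[arg0] = 0;
  }

  END
  {
      normalize(@heldTime, 1000000);   /* ns -> ms */
      normalize(@waitTime, 1000000);
      printa("lock %d mode %s: held %@d times for %@d ms, waited %@d times for %@d ms\n",
             @heldCnt, @heldTime, @waitCnt, @waitTime);
  }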

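The readCommand timing could, as a sketch, also be captured with the pid provider instead of inserted probes, assuming the static ReadCommand routine in backend/tcop/postgres.c survived compilation as a visible symbol (the numbers below come from the inserted PostgresMain probes, not from this):

  pid$target::ReadCommand:entry
  {
      self->rc = timestamp;
  }

  pid$target::ReadCommand:return
  /self->rc/
  {
      @cnt["readCommand"]  = count();
      @time["readCommand"] = sum(timestamp - self->rc);
      self->rc = 0;
  }

  END
  {
      normalize(@time, 1000000);       /* ns -> ms */
      printa("%-12s count %@d time %@d ms\n", @cnt, @time);
  }
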
First the results from the 384 user runs.

                HELD                       ACQUIRE                      WAIT
TP    COUNT     TIME    AVG      COUNT      TIME    AVG     COUNT      TIME    AVG
partial WAL, 435 sec
4E  2572.59   150.60   0.06    2572.59   4623.62   1.80    660.82   4523.92   6.85
8E   796.65   220.89   0.28     796.66  53639.95  67.33    739.88  53599.26  72.44
4B  2853.55   233.55   0.08    2853.55    754.84   0.26    106.45    658.33   6.18
full WAL, 3526 sec
4E  2619.11   156.19   0.06    2619.11   4251.13   1.62    678.02   4147.65   6.12
8E   834.33   228.23   0.27     834.33  55138.68  66.09    777.58  55085.88  70.84
4B  2903.03   242.36   0.08    2903.03    699.30   0.24    108.40    600.84   5.54

                        COUNT       TIME     AVG
partial
readCommand           23684.51  227058.51    9.59
handleCommand Bind     6633.42   17195.51    2.59
handleCommand Parse     793.29    1435.42    1.81
handleCommand Exec     6630.89  124305.46   18.75
full
readCommand           24113.79  233049.86    9.66
handleCommand Bind     6750.85   13557.35    2.01
handleCommand Parse     807.78    1294.89    1.60
handleCommand Exec     6750.81  134847.69   19.98

For the 1280 user case the results are:
                HELD                        ACQUIRE                        WAIT
TP    COUNT     TIME    AVG      COUNT       TIME     AVG     COUNT       TIME     AVG
partial WAL, 227 sec
4E  2424.23   287.60   0.12    2424.23  783245.16  323.09   1206.77  803197.79  665.57
8E   810.80   236.86   0.29     810.80   49875.65   61.51    732.74   50059.37   68.32
4B  2660.96   506.85   0.19    2660.97   62021.92   23.31     93.14   61992.47  665.62
full WAL, 3568 sec
4E  1958.48   231.36   0.12    1958.48  504593.33  257.64    959.00  504494.48  526.06
8E   666.15   185.80   0.28     666.15   10707.98   16.07    584.79   10669.66   18.25
4B  2153.02   420.62   0.20    2153.02   39286.40   18.25     72.99   39193.92  537.00

                        COUNT        TIME     AVG
partial
readCommand           22082.03   212631.35    9.63
handleCommand Bind     6185.32    72201.28   11.67
handleCommand Parse     738.13    22909.84   31.04
handleCommand Exec     6145.63   861439.90  140.17
full
readCommand           17865.86   174686.57    9.78
handleCommand Bind     5001.35    48387.27    9.67
handleCommand Parse     599.76    14448.09   24.09
handleCommand Exec     5001.06  1039141.95  207.78

Looking at these numbers it can be seen that readCommand takes the same amount of time in all cases: roughly 9.6 milliseconds. Comparing the handleCommand rows shows that both the bind and parse times go down under full instrumentation, while the execute time is marginally longer in the 384 user test and significantly longer in the 1280 user case.

Looking at the total read/handle counts there is not much difference between full and partial instrumentation in the 384 user case. The 1280 user case, however, shows an approximately 20% reduction in the read/handle counts when comparing full to partial instrumentation.

The time holding a lock is not affected by the different instrumentation methods. The percentage of acquire attempts that fall back into waiting, however, increases from 25.6% to 33.8% in the 1280 user case. In the 384 user case these percentages can be considered similar: 24.2% versus 24.6% (total wait counts divided by total acquire counts; for the partial run that is (660.82 + 739.88 + 106.45) / (2572.59 + 796.66 + 2853.55) = 24.2%).

Another thing that stands out is the change in lock wait time when comparing the ProcArrayLock to the WALWriteLock. In the 384 user case the WALWriteLock definitely is the major wait time consumer: 53.6 seconds are spent waiting for this lock in a one second time span by 390 processes (one for each user, plus the usual "standard" backend processes). That is 18.9% of the time. In the 1280 user case the wait time consumption role is taken over by the ProcArrayLock: a total of 865190 milliseconds (the 4E and 4B rows combined) is spent waiting by 1286 processes during a period of one second. That is 67% of the time.

While we have been reporting that we see a huge amount of contention on the ProcArrayLock, I believe we need to delve a little deeper to see if the instrumentation effect has anything to do with this. Using DTrace can be intrusive on a system. Although this can have a strong impact on application throughput, it should not alter the application's behaviour. Total intrusiveness has a strong relation with the number of probes firing and the complexity of the clauses.

Surely the probes used to measure time spent in the LWLock department fire frequently. My implementation uses five probes in the acquire routine: one on entry, one just before exit, two around the mutex wait, and the fifth one already available. The lock release routine has three probes: besides the one already available, one fires on entering the function and one fires on return. This means that for one acquire-release cycle six probes fire; when a mutex wait is needed this increases to eight. The 1280 user partial test shows almost 6000 cycles per second on these two locks (not counting all the other locks), of which 2000 induce a wait. That is 40,000 probes firing per second (6000 x 6 + 2000 x 2).

The mentioned increase in wait time on the ProcArrayLock could well be amplified by the instrumentation. DTrace probes that fire use CPU power. Once a process owns a lock, others need to wait for it. Due to probe CPU consumption there is less CPU available per unit of time, which means the lock holding time increases. In a highly concurrent environment this can lead to a worse than linear increase of the wait time to acquire this lock: the number of waiters arriving grows as the lock holding interval increases. However, I still cannot explain why the shift from the WALWriteLock to the ProcArrayLock is seen; the queueing effect should hold for the WALWriteLock too.
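
For reference, the probe points just described would look roughly like the provider declarations below (double underscores become dashes in the probe names; all names except lwlock__acquire and lwlock__release are made up here for illustration):

  provider postgresql {
      /* LWLockAcquire: the five probes described above */
      probe lwlock__acquire__entry(int, int);     /* on entry               */
      probe lwlock__mutex__wait__start(int);      /* before the mutex wait  */
      probe lwlock__mutex__wait__done(int);       /* after the mutex wait   */
      probe lwlock__acquire__exit(int, int);      /* just before exit       */
      probe lwlock__acquire(int, int);            /* already available      */

      /* LWLockRelease: the three probes */
      probe lwlock__release__entry(int);          /* on entry               */
      probe lwlock__release(int);                 /* already available      */
      probe lwlock__release__exit(int);           /* on return              */
  };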
Guess I still have plenty to do...
Right now the lesson learned is "be aware of the instrumentation effect" when using high frequency probes.


