Scalability issues due to waits

Previous entries showed lots of detail around the different types of LWLocks and the time spent waiting for them. In order to get a "helicopter" view, all lock waiting times are now combined. In addition, timing is done in PostgresMain to understand the amount of time spent getting (waiting for) a new command, handling a command, and doing pg_stat reporting.
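
The sketch below shows the kind of instrumentation meant here. It is a stand-alone illustration, not the actual change to PostgresMain: the helpers report_stats, read_command and handle_command are hypothetical stand-ins for the corresponding steps in the backend main loop, and their sleep durations merely mimic the averages measured further down.

    /*
     * Stand-alone sketch of the timing approach; NOT the real PostgresMain.
     * report_stats/read_command/handle_command are hypothetical stand-ins,
     * and their usleep() durations roughly mimic the measured averages.
     * Build with: cc -o timing timing.c
     */
    #include <stdio.h>
    #include <sys/time.h>
    #include <unistd.h>

    static double now_seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    /* stand-ins for the three phases of the backend main loop */
    static void report_stats(void)   { usleep(50); }            /* ~0.05 msec */
    static int  read_command(void)   { usleep(9500); return 1; } /* ~9.5 msec  */
    static void handle_command(void) { usleep(800); }           /* ~0.8 msec  */

    int main(void)
    {
        double t_stats = 0, t_read = 0, t_handle = 0;
        long   n_commands = 0;

        for (int i = 0; i < 100; i++)   /* the real loop runs until backend exit */
        {
            double t0 = now_seconds();
            report_stats();             /* pg_stat reporting                 */
            double t1 = now_seconds();
            if (!read_command())        /* wait for the next client command  */
                break;
            double t2 = now_seconds();
            handle_command();           /* execute the command               */
            double t3 = now_seconds();

            t_stats  += t1 - t0;
            t_read   += t2 - t1;
            t_handle += t3 - t2;
            n_commands++;
        }

        printf("commands=%ld stats=%.3fs read=%.3fs handle=%.3fs\n",
               n_commands, t_stats, t_read, t_handle);
        return 0;
    }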

In the previous post all tests ran from 64 to 1280 concurrent users. The findings in this post are from 384 up to 1024 users, in steps of 64. Each run took one hour; the measurement window was approximately 30 seconds shorter.

First a rather big table that holds all the numbers of interest :-) Let's explain the columns by position (all times not otherwise specified are in seconds; the average columns follow from the totals, as sketched right after this list):

  1. number of concurrent users;
  2. number of times the pg_stat updates were executed;
  3. total amount of time taken by these pg_stat updates;
  4. average time (msec) a single pg_stat update takes;
  5. number of commands read;
  6. total time for reading these commands;
  7. average time (msec) for reading a command;
  8. total time for handling the commands read;
  9. average time (msec) for handling a command;
  10. total number of times a LWLock was waited for;
  11. total amount of time waited for LWLocks;
  12. average time (msec) waiting for a LWLock;
  13. throughput per minute;
  14. average response time from client perspective;
  15. amount of time explained per session: (cols 3 + 6 + 8) divided by col 1.
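
For reference (my reading of the table, not spelled out explicitly): the average columns are simply the totals divided by the corresponding counts, converted to milliseconds. For the 384-user row, for example:

    \[
    \text{col}_4 = 1000 \cdot \frac{\text{col}_3}{\text{col}_2}
                 = 1000 \cdot \frac{1{,}498}{30{,}774{,}906} \approx 0.05\ \text{msec},
    \qquad
    \text{col}_7 = 1000 \cdot \frac{\text{col}_6}{\text{col}_5}
                 = 1000 \cdot \frac{1{,}257{,}700}{134{,}478{,}799} \approx 9.35\ \text{msec}
    \]
    \[
    \text{col}_9 = 1000 \cdot \frac{\text{col}_8}{\text{col}_5} \approx 0.83\ \text{msec},
    \qquad
    \text{col}_{12} = 1000 \cdot \frac{\text{col}_{11}}{\text{col}_{10}}
                    = 1000 \cdot \frac{47{,}298}{30{,}583{,}924} \approx 1.55\ \text{msec}
    \]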

   1           2      3     4            5          6     7          8      9          10         11     12      13     14    15
 384  30,774,906  1,498  0.05  134,478,799  1,257,700  9.35    111,570   0.83  30,583,924     47,298   1.55  41,771  0.017  3569
 448  35,254,626  2,708  0.08  154,052,468  1,447,983  9.40    148,952   0.97  35,416,188     72,604   2.05  47,931  0.019  3570
 512  39,472,668  1,966  0.05  172,484,080  1,628,241  9.44    197,524   1.15  41,839,911    109,362   2.61  54,144  0.022  3569
 576  41,647,430  2,148  0.05  181,986,937  1,724,056  9.47    330,029   1.81  47,182,321    233,743   4.95  57,251  0.034  3569
 640  41,903,717  2,260  0.05  183,106,564  1,740,012  9.50    543,114   2.97  48,511,160    444,870   9.17  57,943  0.057  3570
 704  41,537,271  2,290  0.06  181,505,836  1,729,515  9.53    782,201   4.31  48,491,749    684,688  14.12  57,042  0.084  3571
 768  41,011,402  2,279  0.06  179,208,259  1,710,304  9.54  1,030,160   5.75  47,853,511    933,159  19.50  56,190  0.113  3571
 832  40,331,868  2,299  0.06  176,237,327  1,685,163  9.56  1,284,115   7.29  46,607,735  1,188,499  25.50  55,407  0.143  3571
 896  39,857,972  2,268  0.06  174,167,499  1,668,546  9.58  1,529,312   8.78  46,385,722  1,434,691  30.93  55,103  0.172  3571
 960  39,120,264  2,216  0.06  170,943,881  1,641,102  9.60  1,784,889  10.44  44,893,231  1,691,997  37.69  53,655  0.206  3571
1024  38,916,412  2,219  0.06  170,053,697  1,633,946  9.61  2,017,717  11.87  44,316,238  1,924,754  43.43  53,323  0.234  3568

The nice thing is that almost all time is accounted for (column 15): each test took a little less than 3,600 seconds.
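
As a quick check for the 384-user row (my arithmetic, using the figures in the table):

    \[
    \frac{\text{col}_3 + \text{col}_6 + \text{col}_8}{\text{users}}
      = \frac{1{,}498 + 1{,}257{,}700 + 111{,}570}{384}
      = \frac{1{,}370{,}768}{384}
      \approx 3{,}570\ \text{seconds}
    \]

So practically the whole one-hour window of every session is accounted for by these three activities.
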
Let's look at some of these numbers graphically. First, what happens from a client-side perspective when concurrency increases?

resptimeandthroughput.png
Maximum throughput is reached at 640 concurrent users. The initially linear-looking increase in response time is clearly no longer linear at this concurrency level. Once the 704 mark is passed it looks linear again, although with a much steeper slope. The transaction throughput numbers are in TPM (transactions per minute).
The next graph shows what is happening in the database back-end: the number of commands read, the time spent handling these commands, and the time spent waiting for LWLocks. All these numbers are totals over all back-ends in the test. Please note that the concurrency numbers do not include the standard Postgres processes (background writer, WAL writer, etc.).

timespend.png
Once more the number of commands read (and handled) peaks at 640 users. This also seems to be the point where both "time needed to handle" and "time needed to acquire a lock" start rising linearly. The actual CPU consumption (the difference between the red and the blue graph) looks rather constant in the third figure:

cpuUsage.png
This graph shows the average CPU usage to handle a single command. There seems to be only a marginal increase in CPU consumption as concurrency increases, although the increase when going from 384 to 640 users is remarkable. It could be that -- but this is not proven by proper tests and instrumentation -- the processes spin more and more in order to acquire an LWLock as concurrency increases. At 640 users most of this spinning is no longer adequate and the processes start to sleep while waiting to acquire the lock. These lock waits take more and more time as concurrency increases further, which leads to a reduction in throughput (or in commands handled).
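
To make the spin-then-sleep hypothesis concrete, here is a minimal, self-contained sketch of such a lock. This is not the PostgreSQL LWLock implementation; the names (splock_t, SPINS_BEFORE_SLEEP) and the spin budget are invented for the example, which only illustrates the pattern of burning CPU for a while before going to sleep.

    /*
     * Illustrative spin-then-sleep lock; NOT the PostgreSQL LWLock code.
     * A process first spins, hoping the holder releases quickly (costs CPU),
     * and only then sleeps on a condition variable (costs wait time).
     * Build with: cc -o splock splock.c -pthread
     */
    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    #define SPINS_BEFORE_SLEEP 1000         /* arbitrary spin budget */

    typedef struct
    {
        atomic_flag     held;               /* the actual lock word        */
        pthread_mutex_t mtx;                /* protects the sleepers below */
        pthread_cond_t  cv;
    } splock_t;

    static void splock_init(splock_t *l)
    {
        atomic_flag_clear(&l->held);
        pthread_mutex_init(&l->mtx, NULL);
        pthread_cond_init(&l->cv, NULL);
    }

    static void splock_acquire(splock_t *l)
    {
        /* phase 1: spin, burning CPU, hoping for a quick release */
        for (int i = 0; i < SPINS_BEFORE_SLEEP; i++)
            if (!atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
                return;                     /* got it while spinning */

        /* phase 2: give up spinning and sleep until a release wakes us */
        pthread_mutex_lock(&l->mtx);
        while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
            pthread_cond_wait(&l->cv, &l->mtx);
        pthread_mutex_unlock(&l->mtx);
    }

    static void splock_release(splock_t *l)
    {
        atomic_flag_clear_explicit(&l->held, memory_order_release);
        pthread_mutex_lock(&l->mtx);
        pthread_cond_signal(&l->cv);        /* wake one sleeper, if any */
        pthread_mutex_unlock(&l->mtx);
    }

    /* tiny demo: two threads contending on the lock around a counter */
    static splock_t lock;
    static long counter;

    static void *worker(void *arg)
    {
        (void) arg;
        for (int i = 0; i < 100000; i++)
        {
            splock_acquire(&lock);
            counter++;
            splock_release(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        splock_init(&lock);
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld (expected 200000)\n", counter);
        return 0;
    }

The more threads contend, the more of the spin budget is wasted before a thread finally goes to sleep, which is exactly the shift from "CPU spent spinning" to "time spent waiting" speculated about above.
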
Knowing that all these tests indeed took an hour, it can be concluded that these 3,600 seconds were basically spent waiting. For now I assume that readCommand is -- as its name implies -- spending most of its time waiting for commands and data to arrive in order to process them. The actual command handling is constant in its CPU consumption (around 0.5 msec), and the amount of time spent reading a command is rather constant too. The waiting time lost acquiring LWLocks, however, increases as concurrency goes up.

The total amount of CPU is 0.47 msec per command in the 384-user case. This translates to 64,272 seconds (111,570 - 47,298). The total amount of time waited is 1,304,998 seconds (1,257,700 + 47,298). Fortunately most of this is time waiting for the clients to send a new command. However, at 640 users (the maximum throughput) the wait time for locks has grown to 444,870 seconds, and it escalates to 1,924,754 seconds in the 1024-user case.
The amount of CPU used is definitely an overestimate. The assumption here is that "time not spent waiting for an LWLock is CPU time". The following calculation shows this cannot be true :-)
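
In compact form (my rendering of the calculation that is spelled out below, using the idle-time percentages quoted there):

    \[
    \text{384 users:}\quad
      \frac{64{,}272\ \text{s CPU}}{3{,}569\ \text{s wall clock}} \approx 18\ \text{busy CPUs},
    \qquad
      \frac{18}{1 - 0.37} \approx 28\ \text{CPUs in total}
    \]
    \[
    \text{1024 users:}\quad
      \frac{92{,}963\ \text{s CPU}}{3{,}568\ \text{s wall clock}} \approx 26\ \text{busy CPUs},
    \qquad
      \frac{26}{1 - 0.07} \approx 28\ \text{CPUs in total}
    \]

Both estimates exceed the 16 CPUs actually present, which is the contradiction described next.
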
If, in the 384-user case, 64,272 seconds are used by CPUs in a period of 3,569 seconds, the system must have at least 18 CPUs. Knowing that the amount of idle time was approximately 37%, these tests should have been run on a 28-CPU machine. For the 1024-user case 92,963 seconds were consumed: 26 CPUs. Idle time during this test was 7%, again meaning 28 CPUs. The system used had just 16 CPUs! So there is wait time in other areas too. Compared to the wait time for the LWLocks, however, this is barely worth any attention.

A significant amount of wait time is also lost in reading a command: approximately 9.5 msec per command. This could be due to the protocol used for communication with the client, or to deficiencies in the benchmark tool. At least it motivates some extra attention. Currently the energy spent on SMP scalability is best used in the areas where wait time is observed. The numbers above show two areas worthy of more attention in order to obtain a more scalable and better-performing solution.
