Thundering Herd of Elephants for Scalability

During last year's PgCon 2008 I presented "Problems with PostgreSQL on Multi-core Systems". On slide 14 I discussed the results with IGEN at various think times and identified how difficult it is to scale with an increasing number of users. The chart showed that tests with a 200ms think time saturate at about 1000 active users, after which throughput starts to fall. On slides 16 and 17 I identified ProcArrayLock as the reason scalability tanks as the number of users increases.

Today I was intrigued by the same problem again while trying out PostgreSQL 8.4 builds. Once again I hit the bottleneck at about 1000 users, and I was frustrated that PostgreSQL cannot scale even with only 1 active socket (64 strands) of a Sun SPARC Enterprise T5240, which is a 2-socket UltraSPARC T2 Plus system.

Again I dug through the source code of a PostgreSQL 8.4 snapshot to see what could be done. While reviewing lwlock.c I thought of a change, quickly modified a couple of lines in the code, recompiled the database binaries, and re-ran the test. The results are shown below (8.4 vs 8.4fix).

The setup was quite standard (for how I test). The database and log files were in RAM (/tmp). The think time was about 200ms for each user between transactions. The results were quite pleasing. On the same system setup, TPM throughput went up by about 1.89x, and response time now shows a more gradual increase rather than a sharp knee.

So what was the change in the source code? Before I explain what I did, let me explain what PostgreSQL is trying to do in that code.

After a process acquires a lock and does its critical work, it gets ready to release the lock. At that point it picks the next postgres process (or processes) that it needs to wake up. This avoids starving processes that are waiting for exclusive locks, and I think it is also trying to avoid a Thundering Herd problem. That is a good thing to do on single-core systems, since CPU resources are sacred and using them effectively results in better performance. However, on systems with many CPU resources (say, the 256 threads of a Sun SPARC Enterprise T5440) this ends up creating an artificial bottleneck, since it is the application, not the system, that determines which process should wake up next and try to get the lock.

The change I made was to discard the selective waiter wake-up and just wake up all waiters for that lock, letting the processes, the OS dispatcher, and the CPU resources sort it out on their own (and that worked, and worked very well).


    //if (!proc->lwExclusive)
    if (1)
    {
        while (proc->lwWaitLink != NULL &&
               1)
               // !proc->lwWaitLink->lwExclusive)
            proc = proc->lwWaitLink;
    }
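
To make the difference between the two wake-up policies concrete, here is a minimal standalone sketch. It is not PostgreSQL code: the Waiter struct and the count_woken function are hypothetical stand-ins that only borrow the lwExclusive and lwWaitLink field names from the snippet above. It assumes the wait queue is a singly linked list, and that the stock policy wakes either a single exclusive waiter at the head or the run of consecutive shared waiters at the head, which is how I read the loop above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two PGPROC fields used by the wait queue. */
typedef struct Waiter {
    bool lwExclusive;           /* true if waiting for an exclusive lock */
    struct Waiter *lwWaitLink;  /* next waiter in the queue, or NULL */
} Waiter;

/*
 * Count how many waiters would be woken when the lock is released.
 *
 * wake_all == false: selective policy. An exclusive waiter at the head is
 *   woken alone; a shared waiter at the head is woken together with the
 *   consecutive shared waiters that follow it.
 * wake_all == true: the patched behavior. Every waiter in the queue is
 *   woken and left to compete for the lock.
 */
static int count_woken(const Waiter *head, bool wake_all)
{
    if (head == NULL)
        return 0;

    const Waiter *proc = head;
    int woken = 1;  /* the head of the queue is always woken */

    if (wake_all || !proc->lwExclusive) {
        while (proc->lwWaitLink != NULL &&
               (wake_all || !proc->lwWaitLink->lwExclusive)) {
            proc = proc->lwWaitLink;
            woken++;
        }
    }
    return woken;
}

/* Example queue: shared, shared, exclusive, shared. */
static void check_example(void)
{
    Waiter d = { false, NULL };
    Waiter c = { true,  &d };
    Waiter b = { false, &c };
    Waiter a = { false, &b };

    assert(count_woken(&a, false) == 2);  /* stops before the exclusive waiter */
    assert(count_woken(&a, true) == 4);   /* wakes the whole queue */
}
```

With the selective policy, the exclusive waiter and everything queued behind it stay asleep until a later release. With wake-all, all of them become runnable at once and the OS scheduler decides who gets the lock next.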


It's amazing that last year I tried so many different approaches to the problem, and such a simple approach proved to be more effective.

I have sent a proposal to the PostgreSQL Performance alias that a postgresql.conf tunable be defined for PostgreSQL 8.4, so people can tweak their instances to use the above method of waking all waiters without changing the existing behavior for most existing users.

In the meantime, if you compile your own PostgreSQL binaries and run PostgreSQL on any system with 16 or more cores/threads, try the workaround and provide feedback about the impact on your workloads.


Comments:

Jignesh,

This is indeed a very good achievement.

It is always best practice to write comments about any assumptions you are making in a given code block. It helps a review like yours quickly notice when reality changes for those assumptions.

Just wondering, how does MySQL work in a similar scenario?

Posted by VeryGood on March 12, 2009 at 04:19 AM EDT #

hey there,
Jignesh, wonderful hack.

I am doing some huge deployments with a few TB of data, and the machine has an Intel dual-core 2.6 processor.
Will I need your hack on that machine? I do compile PostgreSQL, and to date I have never found a serious bottleneck with a large volume of data on my machines.

Hi Jignesh,

Great work.
I am not so good at hacking the PostgreSQL source code, but I do compile it often for large servers.

I have so far used PostgreSQL with a few terabytes of data on machines with Core 2 Duo and AMD64 processors.

Do I need to make the change you specified?
I have not seen any slowdown or bottlenecks so far with the given large volumes of data, so I am not sure.
thanks.
Krishnakant.

Posted by Krishnakant on April 05, 2009 at 03:59 AM EDT #

About

Jignesh Shah is Principal Software Engineer in Application Integration Engineering, Oracle Corporation. AIE enables integration of ISV products including Oracle with Unified Storage Systems. You can also follow me on my blog http://jkshah.blogspot.com
