Wednesday Mar 11, 2009

Thundering Herd of Elephants for Scalability

During last year's PgCon 2008 I had presented about "Problems with PostgreSQL on Multi-core Systems".  On slide 14 I talked about the results with IGEN with various think times and had identied the problem of how it is difficult to scale with increasing number of users. The chart showed of how 200ms think time tests will saturate about 1000  active users and then throughput starts to fall.  On slide 16 & 17  I identified ProcArrayLock as the culprit of why scalability tanks with increasing number of users. 

Today I was again intrigued by the same problem as I was trying out PostgreSQL 8.4 builds and once again hit the bottleneck at about 1000 users and frustrated that PostgreSQL cannot scale even with only 1 active socket (64 strands)  of Sun Enterprise SPARC T5240 which is a 2 socket UltraSPARC T2 plus system.  

Again I was digging through the source code of PostgreSQL 8.4 snapshot to see what can be done. While  reviewing lwlock.c I thought of a change  and quickly  changed couple  of lines in the code and recompiled the database binaries and re-ran the test. The results are as show below (8.4 Vs 8.4fix)

The setup was quite standard (as how I test them). The database and log files are on RAM (/tmp). The think times were about 200ms for each user between transactions.  But the results was quite pleasing. On the same system setup,  TPM throughput went up about 1.89x up and response time now shows a more gradual increase rather than a knee-jerk response.

So what was the change in the source code? Well before I explain what I did, let me explain what is PostgreSQL trying to do in that code logic.

After a process acquires a lock and does its critical work, when it is ready to release the lock, it finds the next postgres process that it needs to wake up so effectively not causing starvation for processes trying to do exclusive locks  and also I think it is trying to avoid a Thundering Herd problem. It is a good thing to do for single core systems since CPU resources are sacred and effective usage results in better performane. However on systems with many CPU resources (say like 256 threads of Sun SPARC Enterprise T5440) this ends up artificially bottlenecking since it is not the system but the application determining which next process should wake up and try to get the lock.

The change I did was discard the selective process waiter wake-up and just wake up all waiters waiting for that lock and let the processes, OS Dispatcher, and CPU resources do its magic on its own way (and that worked,  worked very well.)

          //if (!proc->lwExclusive)
           if (1)
                            while (proc->lwWaitLink != NULL &&
                                          // !proc->lwWaitLink->lwExclusive)
                                       proc = proc->lwWaitLink;

Its amazing that last year I had tried so many different approaches to the problem and a mere simple approach proved to be more effective.

I have put a proposal to the PostgreSQL Performance alias that a postgresql.conf tunable be defined for PostgreSQL 8.4 so people can tweak their instances to use the above method of awaking all waiters without impacting the existing behavior for most existing users.

In the meantime if you do compile your own PostgreSQL binaries, try the workaround if you are using PostgreSQL on ay 16 or more cores/threads system and provide feedback about the impact on your workloads.

Wednesday Jun 20, 2007

PostgreSQL 8.2.4 Scaling on Sun Fire T2000 with Solaris 10 11/06

As I move into my new role of working on Open Source Databases, I decided to do a "checkpoint" as they say in the PostgreSQL world. The idea is to "test-it-myself" of how PostgreSQL scales on Solaris.

Using a setup configuration consisting of Sun Enterprise T2000 (I still call it Sun Fire T2000 sometimes) provided by my well known team member, I decided to start with doing a small scalability test with PostgreSQL on Sun Fire T2000 server. I decided the best approach to do a scalability study is to start with only one core active on Sun Fire T2000 (other disabled via psradm) and measure postgreSQL with pre-defined criterias of load execution. If the executions pass the criterias, then I increase the load and carry out another tests till the number of cores start to saturate and the pre-defined criterias of doing load executions starts to fail. Thats when additional UltraSPARC T1 core on Sun Fire T2000s is turned on and the test is repeated till all cores are active and saturated.

With the test defined and the configurations set, I downloaded the latest PostgreSQL 8.2.4 source code and compiled it with the latest Sun Studio 12 compiler tools on Solaris 10 11/06 (all on Sun Fire T2000). With the newly generated binaries, Using my new favorite postgreSQL database layout option on Sun StorageTek 2540 and my tested postgresql.conf tunables, I initialized the database and created the schema for the OLTP workload.

Carrying out the test as planned and normalizing the throughput gathered from the series of test, I charted the throughput versus the number of cores acvite and compared it against what could mean a linear scaling.

Before looking at the chart, remember that each core on UltraSPARC T1 chip shows in Solaris as four virtual CPUs. With the numbering scheme on current T2000 servers are 0-3 virtual CPUs are fore core 1, 4-7 virtual CPUs are for core 2 and so on. (Somehow it did not made logical sense to activate partial threads on a core and hence in the tests that I did at any given time they are either active or inactive.) I had used psradm to turn on or off the virtual CPUs.

The resulting chart is as follows:

As you can observe PostgreSQL 8.2.4 on Solaris achieved more than 6.5X throughput on 8 cores (each core being set of 4 virtual CPU threads) as compared to throughput with a single core activated. Also I would like to add that at the 8 core run, the cpu on the driver machines were also nearing their peak value and could have caused some degradation on the 8 core test. (Since I also observed about 15-18% idle time with all cores active.) But still I think the results were beyond my expecations from what I expected of an Open Source Database. Why? Because even at this rate (of dropping scaling percentage), PostgreSQL 8.2.4 can scale well beyond 25-30 such cores or 100-120 virtual CPUs before the absolute throughput of PostgreSQL on UltraSPARC T1 cores is saturated.


Jignesh Shah is Principal Software Engineer in Application Integration Engineering, Oracle Corporation. AIE enables integration of ISV products including Oracle with Unified Storage Systems. You can also follow me on my blog


« July 2016