Wednesday Mar 11, 2009

Thundering Herd of Elephants for Scalability

During last year's PgCon 2008 I presented "Problems with PostgreSQL on Multi-core Systems". On slide 14 I talked about the results with IGEN at various think times and identified how difficult it is to scale with an increasing number of users. The chart showed how the 200ms think time tests saturate at about 1000 active users, after which throughput starts to fall. On slides 16 and 17 I identified ProcArrayLock as the culprit behind scalability tanking as the number of users increases.

Today I was again intrigued by the same problem while trying out PostgreSQL 8.4 builds: once again I hit the bottleneck at about 1000 users, frustrated that PostgreSQL cannot scale even with only one active socket (64 strands) of a Sun SPARC Enterprise T5240, which is a two-socket UltraSPARC T2 Plus system.

Again I was digging through the source code of the PostgreSQL 8.4 snapshot to see what could be done. While reviewing lwlock.c I thought of a change, quickly modified a couple of lines, recompiled the database binaries, and re-ran the test. The results are shown below (8.4 vs 8.4fix).

The setup was quite standard (for how I test): the database and log files are on RAM (/tmp), and the think time was about 200ms for each user between transactions. The results were quite pleasing. On the same system setup, TPM throughput went up about 1.89x, and response time now shows a gradual increase rather than a knee-jerk jump.

So what was the change in the source code? Before I explain what I did, let me explain what PostgreSQL is trying to do in that code.

After a process acquires a lock and does its critical work, when it is ready to release the lock it finds the next postgres process that it needs to wake up. This avoids starving processes that are waiting for exclusive locks, and I think it is also trying to avoid a thundering herd problem. That is a good thing to do on single-core systems, where CPU resources are scarce and effective usage results in better performance. However, on systems with many CPU resources (say the 256 threads of a Sun SPARC Enterprise T5440), this ends up being an artificial bottleneck, since it is the application, not the system, determining which process should wake up next and try to get the lock.

The change I made was to discard the selective waiter wake-up and just wake up all waiters waiting on that lock, letting the processes, the OS dispatcher, and the CPU resources work their magic in their own way (and that worked, and worked very well):

            // if (!proc->lwExclusive)
            if (1)
                while (proc->lwWaitLink != NULL
                       /* && !proc->lwWaitLink->lwExclusive */)
                    proc = proc->lwWaitLink;

It's amazing that last year I tried so many different approaches to this problem, and a mere simple approach proved to be the most effective.

I have put a proposal to the PostgreSQL performance mailing list that a postgresql.conf tunable be defined for PostgreSQL 8.4, so people can tweak their instances to use the above method of waking all waiters without impacting the existing behavior for most existing users.
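No such setting actually existed; purely as an illustration of what the proposed knob might look like (the name here is invented), it could be a simple boolean GUC:

```
# Hypothetical postgresql.conf setting -- name invented for illustration,
# not an actual 8.4 parameter.
# on  = wake all waiters when an LWLock is released
# off = selective wake-up (the existing default behavior)
wake_all_lwlock_waiters = off
```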

In the meantime, if you do compile your own PostgreSQL binaries, try the workaround if you are using PostgreSQL on any system with 16 or more cores/threads, and provide feedback about the impact on your workloads.

Wednesday Jan 09, 2008

Multi-cores, Operating Systems, Database, Applications and IT Projects

As the days go by, more and more multi-core systems are popping up everywhere. In fact, with the advent of the new four-socket quad-core Sun Fire X4450, 16-core systems are soon becoming common out there.

However, personally, I think these systems are still underutilized relative to their true potential. Of course virtualization is one way of increasing utilization, but that is like working around the symptoms rather than fixing the real problem. Software applications have fundamentally lagged behind microprocessor innovations, and operating systems have lagged too. Solaris, however advanced an operating system, still has a lot to achieve in this area. Yes, the kernel is multi-threaded, and yes, it can easily scale to hundreds of cores, but that scaling is generally achieved by creating copies of the process (or multiple connections, or multiple threads, however you look at it) at the APPLICATION level. One area that generally falls behind is the operating system's own utility commands: tar, cp, compress, or pick your favorite /usr/bin command. These utility programs will generally end up using only one core or virtual CPU. Now, Solaris does provide a framework API for multi-threaded programs, but it is still surprising to me that making the basic utilities use the available resources is not a bigger priority. I think there is a significant loss of productivity in waiting for utilities to complete while the system is practically running idle loops on the rest of the cores.

A database is an application on top of an operating system, and a few commercial databases have handled these challenges and adapted to multi-core systems well. Even the open-source databases are not far behind, as seen in the multi-threaded nature of MySQL and the scaling of PostgreSQL on multi-core systems. However, one area where even PostgreSQL lacks multi-core support is its "utilities". Load, index creation, backup, and restore are to me all utilities of a database system, and when they run there are generally a lot of eyes watching and waiting for them to complete. Now, a lot can be done by using multiple connections: break up the work and execute the bits that can run in parallel over different connections. But again, to me that looks like the buck being conveniently passed to the APPLICATION to avail itself of the feature.

Similarly, as the buck passes from one SERVICE PROVIDER to its APPLICATION, it picks up baggage, and eventually the "eyes" of the end users, waiting to see their "simple" tasks done on systems boasting 16x the compute potential, watch them take a long time to finish. What is the result? Loss of productivity and efficiency, power wasted running idle threads, and so on. Users then generally "curse" and "yell" just to kill the time while waiting for mundane tasks to finish on these multi-core systems.

Now, if you look at each core as a resource (like an IT software programmer), you have 16 programmers available to crunch code (or numbers, in the case of cores). For an IT project manager, budget is what generally limits the number of programmers. But when you have already paid for and gotten 16 programmers, a good IT project manager will try to utilize them in the best possible way. After all, the buck stops at the IT project manager's desk.

The question is: what is the best possible way to utilize the 16 programmers for a task? I think a good project manager will not call in all 16 programmers and tell them to do all 'n' programming tasks, asking every one of them to assign 1/n of their time to each task. That just creates chaos. Why? Because the assumptions are that the "skills" of all the programmers are the same, that all will take the same time to finish a particular task, and that having all of them do all the tasks is the shortest way to complete them all. Any good IT project manager will tell you that every one of those assumptions is wrong. So why in the world do we assume them on multi-core systems?

So what are we missing in "software"? Yep, you got it: we need the equivalent of an IT project manager to solve the chaos of multi-core systems, one that understands the "experience" and potential of each of its compute resources (programmers). But wait, Solaris engineers are already saying we have that project manager: it is called the Solaris scheduler. Duh... then what is wrong? Think about it again from an IT project manager's viewpoint. An IT project manager takes input from an IT director or senior manager regarding the priorities of tasks, and then works with a senior IT architect to break the project into a series of "serial" and "parallel" tasks, assuming it has 16 skilled programmers, with the goal of finishing the "highest" priority projects first. The way most operating system schedulers currently work is simply to find an available compute resource on which to continue computing the "serial" task, with no intelligence about how the task could be broken into serial and parallel components. So there is a difference between a scheduler and a project manager.

But then the question arises: can one project manager be intelligent about everything in the IT department? Maybe not; maybe that's why there is a hierarchy, where our project manager banks on assistant project managers with knowledge of particular tasks to help it chart out work it does not know itself. Which means every application needs to provide a "project manager assistant" to guide the operating system's project manager on how best to run the application's tasks. Currently, except for "nice" priority bribes to the operating system scheduler, an application does not provide much feedback on how it thinks it can be executed optimally on multi-core systems.

Come to think of it, these IT project managers do have a role to play in solving engineering problems.

Now the onus is on the various operating system, database, and application architects to provide a framework through which applications can feed their own application-project-manager input to the operating system's project manager. Maybe we should call it Project "Project Manager". :-)

In the meanwhile, what we can all do is start thinking about ways to convert at least the "utilities" that we own to utilize the multi-core resources on the system. Change software to adapt to multi-core, one utility at a time.


Sunday Jan 14, 2007

DB2 on UltraSPARC T1 (aka Niagara I) based system Benchmark Publication

On Jan 9, 2007, Sun published a SPECjAppServer2004 benchmark with WebLogic/DB2/Solaris 10 using Sun Fire T2000 servers with UltraSPARC T1 processors and a Sun StorEdge 3320 storage array. The result is 801.70 SPECjAppServer2004 JOPS@Standard.

This was the first public benchmark ever to use DB2 V8.2 on a Sun Fire T2000. The published benchmark runs DB2 on a Sun Fire T2000 with six 1GHz UltraSPARC T1 cores. The DB2 license that would be required for this config is 6 x 30 PVU = 180 PVUs (or, in the old terminology, about 1.8 CPU licenses). This shows that the combination of DB2 on the Sun Fire T2000 is an attractive platform considering various metrics like database license price, power rating, space occupied by the server, etc.

Disclosure Statement:
SPECjAppServer2004 Sun Fire T2000 (8 cores, 1 chip) 801.70 JOPS@Standard.
SPEC and SPECjAppServer are registered trademarks of the Standard Performance Evaluation Corporation. All results as of 01/15/07.


Jignesh Shah is a Principal Software Engineer in Application Integration Engineering at Oracle Corporation. AIE enables integration of ISV products, including Oracle's, with Unified Storage Systems. You can also follow me on my blog.
