The perils of negative scalability

I've seen the following issue confound customers and colleagues of late, so I thought it worth a blog entry.

Let's say you have an application that exhibits negative scalability. That is, if you were to plot throughput on the Y-axis and concurrency on the X-axis, the curve would be concave downward -- performance climbs to an apex and then falls off. (How can this happen? A common reason is that communication and synchronization overheads start to dominate, overwhelming the benefit of additional concurrency.) Under such circumstances it's common to introduce some form of admission control -- simple back-off, say, or more elaborate mechanisms -- to restrict concurrency. Ideally, this yields an asymptotic curve where performance remains constant after reaching the peak, avoiding the fall-off.
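To make the shape concrete, here is a toy analytic model in the spirit of Gunther's Universal Scalability Law. The coefficients and the `throughput_with_cap` helper are illustrative assumptions, not anything measured from a real system:

```python
def throughput(n, contention=0.02, coherency=0.01):
    """Toy scalability model: relative throughput at n threads.

    The denominator grows with a serialization (contention) term and a
    pairwise-communication (coherency) term, so the curve rises to an
    apex and then falls off -- negative scalability.
    The coefficients are illustrative, not measured.
    """
    return n / (1.0 + contention * (n - 1) + coherency * n * (n - 1))

def throughput_with_cap(n, cap):
    """Idealized admission control: never admit more than `cap`
    threads, so the curve plateaus at the peak instead of falling."""
    return throughput(min(n, cap))

curve = [throughput(n) for n in range(1, 65)]
peak_n = max(range(1, 65), key=throughput)   # apex of the raw curve
```

With these made-up coefficients the raw curve peaks at around 10 threads and then declines steadily, while the capped curve holds the peak value at every higher concurrency level.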

If you tune the performance of such an application through the usual measure-analyze-modify cycle but pay attention only to the throughput values at high concurrency levels, you might be badly misled. The usual development feedback loop can fail because poor "optimizations" that slow down the code may serve as inadvertent implicit back-off -- a crude form of contention management or admission control -- that attenuates the negative scalability at higher concurrency levels but also needlessly impairs performance at lower concurrency levels. Ideally, back-off should be applied only as needed, in response to contention.
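A toy contention model can reproduce the trap. Everything below is illustrative -- the constants, the quadratic penalty, and the two "builds" are assumptions for the sketch, not measurements:

```python
def toy_throughput(n, work=9.0, critical=1.0, coherency=1.0):
    """Toy contention model: each operation does `work` units of
    parallelizable work plus `critical` units under a shared lock;
    cache traffic among the expected contenders adds a quadratic
    penalty. All constants are illustrative, not measured."""
    rho = critical / (work + critical)   # fraction of time contending
    return n / (work + critical + coherency * (n * rho) ** 2)

# The tuned build.
fast = [toy_throughput(n) for n in range(1, 65)]

# A "pessimized" build: 10 extra units of straight-line work per op.
# The extra work acts as inadvertent back-off: each thread contends
# for the lock a smaller fraction of the time, so rho -- and the
# quadratic contention term -- shrink.
slow = [toy_throughput(n, work=19.0) for n in range(1, 65)]
```

At one thread the fast build wins by about 2x, but at 64 threads the slow build wins; judging the change only by the high-concurrency numbers would bless the pessimization.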

A related effect is that inserting diagnostic probes can yield better performance in the region of negative scalability, because the probe overhead itself acts as back-off -- a performance "Heisenbug" where the code runs faster when execution is more closely observed.

The take-away is that we should be careful to measure the performance of any proposed change over a wide range of concurrency levels, not just at the extremes.
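The advice above can be sketched as a small sweep harness. This is a hypothetical harness, not tooling from the post; a production version would also pin threads, warm up, and take multiple samples per point:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def sweep(workload, thread_counts, ops_per_thread=1000):
    """Run `workload` at each concurrency level and report the whole
    throughput curve (ops/sec per level), not just the endpoints."""
    results = {}
    for n in thread_counts:
        def worker():
            for _ in range(ops_per_thread):
                workload()
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as pool:
            # One worker per thread; wait for all of them to finish.
            for f in [pool.submit(worker) for _ in range(n)]:
                f.result()
        elapsed = time.perf_counter() - start
        results[n] = n * ops_per_thread / elapsed
    return results

# Sweep a trivial CPU-bound workload across several concurrency levels.
curve = sweep(lambda: sum(range(50)), [1, 2, 4, 8])
```

Reviewing the full `curve`, rather than only the value at the highest thread count, is what catches an "optimization" that merely trades low-concurrency speed for implicit back-off.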

An interesting human analog is Brooks's law. The same issue -- the overhead of orchestrating large numbers of workers, whether humans or threads -- may underlie both effects.

Comments:

I've seen it happen, usually related to access to databases. You correct a bottleneck in your application and, all of a sudden, all those requests "fly" straight to the database and bring it down to its knees, ending up with worse performance, because you did not know the bottleneck was actually "protecting" the database from excessive load.
Nothing that cannot be fixed, but it is true that it is sometimes difficult to explain how "fixing" something can result in worse performance... until you fix the hidden-and-now-uncovered problem.
S!

Posted by GreenEyed on September 28, 2009 at 12:25 AM EDT #

Good point ...

This type of effect (or close analogs) manifests in quite a number of ways. A few others that come to mind are CSMA/CD communication protocols arbitrating for a shared resource (spectrum, access to the physical "wire", etc.) and "anticipatory scheduling" (http://en.wikipedia.org/wiki/Anticipatory_scheduling).

Regards, Dave

Posted by guest on September 28, 2009 at 03:16 AM EDT #

See also: www.cs.sfu.ca/~fedorova/papers/wiosca06.pdf.

Alexandra Fedorova, Margo Seltzer and Michael D. Smith, "A Non-Work-Conserving Operating System Scheduler for SMT Processors," in Proceedings of the Workshop on the Interaction between Operating Systems and Computer Architecture (WIOSCA), in conjunction with ISCA-33, June 2006.

Posted by David Dice on October 26, 2009 at 09:42 AM EDT #
