Wednesday Sep 23, 2009

The perils of negative scalability

I've seen the following issue confound customers and colleagues of late, so thought it worth a blog entry.

Lets say you have an application that exhibits negative scalability. That is, if you were to plot throughput on the Y-axis and concurrency on the X-axis the shape of the curve would be convex downward -- performance climbs up to an apex and then falls off. (How can this happen? A common reason is that the communication overheads start to dominate and the benefit of concurrency is overcome by communication and synchronization costs). Under such circumstances it's common to introduce some type of admission control -- say, simple back-off or more elaborate mechanisms -- to restrict concurrency. Ideally, this yields an asymptotic curve where performance remains constant after reaching the peak, avoiding any fall-off.

If you tune the performance of such an application using the usual measure-analyze-modify cycle but pay attention only to the throughput values at high concurrency levels then you might be badly misled. The usual development feedback loop can fail because poor "optimizations" that slow down the code may actually serve as inadvertent implicit back-off (contention management or admission control) that will attenuate the negative scalability at higher concurrency levels but also needlessly impair performance at lower concurrency levels. Ideally, back-off should be applied only as needed, in response to contention.

A related effect is that inserting diagnostic probes might yield better performance in the region of negative scalability because of probe overhead -- a performance "Heisenbug" where performance improves when execution is more closely observed.

The take-away is that we should be careful to measure the performance of any proposed change over a wide range of concurrency values, and not just at the extremes.

An interesting human analog is Brooks's law. The same issues -- the overheads of orchestrating large numbers of humans or threads -- may underly both effects.

Friday May 29, 2009

Instruction selection for volatile fences : MFENCE vs LOCK:ADD

In the past the JVM has used MFENCE but, because of latency issues on AMD processors and potential pipeline issues on modern Intel processors it appears that a LOCK:ADD of 0 to the top of stack is preferable (Gory Details).

Sunday May 24, 2009

memcpy() concurrency curiosities

I was recently asked to diagnose a problem a customer was encountering that involved Java and the JNI getIntArrayElements() and releaseIntArrayElements() primitives. The outcome of the exploration was sufficiently interesting that I thought I'd mention it here in order that other JNI users might avoid the same pitfall. Briefly, the customer was running a highly threaded application on a machine with lots of available concurrency. The Java threads would repeatedly read from an array of int[]s that was effectively immutable. References to the array were also passed to native C code via JNI calls. Those native methods would access the array via the getIntArrayElements() and releaseIntArrayElements() primitives, but in no case did the native code ever modify the array contents. The problem was that the Java readers would occasionally observe scenarios that suggested that the contents of the array were transiently "flickering" to unexpected values -- ephemeral corruption -- and then returning to the expected values. (Recall that the array was effectively immutable, so this shouldn't have been happening). I carefully checked the Java code and native code and quickly convinced myself the source of the bug was elsewhere.

I next switched my attention to the implementation of getIntArrayElements() and releaseIntArrayElements(). Briefly GetIntArrayElements() will, as used by this particular application, (a) switch the caller's thread state back to "InVM". If there's a stop-the-world garbage collection event going on the state transition operator should block the caller until the GC completes. At that point the thread can safely access the heap to extract the array data. The "InVM" thread state is such that GC should be inhibited while we're copying the data out. (b) malloc a properly sized array to hold the data; (c) memcpy() the data from the heap data into that just allocated array; (d) restore the thread state; and (e) return a pointer to the just allocated array to the caller. ReleaseIntArrayElements() operates in a similar fashion but reverses the memcpy() direction (copying into the heap) and then releases the allocated array. Note that in our case the source and destination memcpy() regions are completely disjoint. My first thought was that something might have been wrong with the state transition operations, allowing what should have been stop-the-world GC to inadvertently run concurrently with the memcpy() operations. Further investigation ruled that out as there were no GC events when the apparent corruption was observed.

Upon reflection, however, I recalled that some optimized memcpy() implementations may execute explicitly coded benign speculative stores into the destination region. Another possibility is the block-initializing-store (BIS) used by memcpy(). A concurrent observer might fetch from a memcpy() target cache line in the narrow timing window between when the line was zeroed and when the contents were copied back into the line. In either case, the memcpy() in releaseIntArrayCritical could transiently mutate the contents of the java array in the heap to some unexpected value. By the time the memcpy() finishes, of course, the destination buffer will again contain the expected contents. I quickly confirmed this scenario might occur by writing a quick throw-away multi-threaded C test case that modeled the JNI behavior and demonstrated the effect. The underlying mechanism at work deserves a bit more explanation. Lets say we're writing C code and have an effectively immutable array A. A contains a known set of values. Concurrently, we have a 1st thread reading from A and 2nd thread memcpy()ing precisely the same set of values known to be in A back into A. At first glance it's not unreasonable to assume that the reader (the 1st thread) will always see the expected values in A. That is, we're assuming that memcpy()ing the same value back into an array is idempotent. The memcpy() operation is certainly idempotent from the point of view of the caller of memcpy(), but interestingly it's not idempotent from the point of view concurrent observers. With optimized memcpy() implementations concurrent observers can see "odd" intermediate states. Naively, you'd expect that concurrent observers would see the array as unchanging, but that isn't the case on all platforms. The case at hand the memcpy() in releaseIntArrayElements() is the source of the problem, as array in heap might "flicker". To further confirm this diagnosis I provided the customer with a simplistic byte-by-byte unoptimized memcpy() implementation in an LD_PRELOAD-able shared object. And indeed, the problem wasn't evident when running with this simple memcpy() instead of the optimized system form.

While technicall legaly, such behavior in memcpy() -- and the JNI operators that call memcpy() -- is, at best, surprising and violates the principle of least astonishment. Strictly speaking, memcpy() provides no particular guarantees about concurrently observable intermediate states, but rather only specifies the state of the destination buffer at the end of the invocation. That is, while a memcpy() is in flight the destination buffer might contain transient or ephemeral values which could be observed by other threads. Overwriting an array with an exact copy of the array will be idempotent from the perspective of the caller after memcpy() completes, but concurrent observers of the array are permitted to see transient ephemeral "garbage" values. My first thought when we ran into this behavior some years ago was that it was a bug in the memcpy() implementation, but upon deeper reflection I concluded the bug was in my errant assumptions about how memcpy() worked. Having said that, while such behavior is technically permissible I'm also sympathetic that such effects are, at best, unexpected. I's not unreasonable to assume the destination buffer would remain unmolested. One could imagine customers easily falling into this trap, in both C code and Java.




« April 2014