memcpy() concurrency curiosities

I was recently asked to diagnose a problem a customer was encountering that involved Java and the JNI GetIntArrayElements() and ReleaseIntArrayElements() primitives. The outcome of the exploration was sufficiently interesting that I thought I'd mention it here so that other JNI users might avoid the same pitfall. Briefly, the customer was running a highly threaded application on a machine with lots of available concurrency. The Java threads would repeatedly read from an int[] array that was effectively immutable. References to the array were also passed to native C code via JNI calls. Those native methods would access the array via the GetIntArrayElements() and ReleaseIntArrayElements() primitives, but in no case did the native code ever modify the array contents. The problem was that the Java readers would occasionally observe scenarios suggesting that the contents of the array were transiently "flickering" to unexpected values -- ephemeral corruption -- and then returning to the expected values. (Recall that the array was effectively immutable, so this shouldn't have been happening). I carefully checked the Java code and the native code and quickly convinced myself the source of the bug was elsewhere.

I next switched my attention to the implementation of GetIntArrayElements() and ReleaseIntArrayElements(). Briefly, GetIntArrayElements() will, as used by this particular application, (a) switch the caller's thread state to "InVM"; if there's a stop-the-world garbage collection event going on, the state transition operator should block the caller until the GC completes, at which point the thread can safely access the heap to extract the array data, and the "InVM" thread state is such that GC should be inhibited while we're copying the data out; (b) malloc() a properly sized buffer to hold the data; (c) memcpy() the data from the heap into that just-allocated buffer; (d) restore the thread state; and (e) return a pointer to the just-allocated buffer to the caller. ReleaseIntArrayElements() operates in a similar fashion but reverses the memcpy() direction (copying from the buffer back into the heap) and then frees the buffer. Note that in our case the source and destination memcpy() regions are completely disjoint. My first thought was that something might have been wrong with the state transition operations, allowing what should have been stop-the-world GC to inadvertently run concurrently with the memcpy() operations. Further investigation ruled that out, as there were no GC events when the apparent corruption was observed.
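
Schematically, the copy-out / copy-back scheme looks something like the following. This is only a simplified sketch under my own naming, not the actual HotSpot source: the HeapIntArray type and the transition_to_vm()/transition_from_vm() helpers are placeholders for the real VM internals, and error handling is elided.

    #include <stdlib.h>
    #include <string.h>
    #include <jni.h>                     /* jint, jsize, JNI_ABORT, JNI_COMMIT */

    /* Placeholder stand-ins for VM internals; names are hypothetical. */
    typedef struct { jint *elements; jsize length; } HeapIntArray;
    static void transition_to_vm(void)   { /* blocks here if a stop-the-world GC is in progress */ }
    static void transition_from_vm(void) { /* re-enables GC for this thread */ }

    /* Steps (a)-(e): copy the heap array out into a private malloc'd buffer. */
    static jint *get_int_array_elements_sketch(HeapIntArray *a) {
        transition_to_vm();                                   /* (a) */
        jint *buf = malloc(a->length * sizeof(jint));         /* (b) */
        memcpy(buf, a->elements, a->length * sizeof(jint));   /* (c) heap -> buffer */
        transition_from_vm();                                 /* (d) */
        return buf;                                           /* (e) */
    }

    /* Reverse direction: copy the buffer back into the heap, then free it.
     * The copy-back memcpy() is the one concurrent Java readers can observe. */
    static void release_int_array_elements_sketch(HeapIntArray *a, jint *buf, jint mode) {
        if (mode != JNI_ABORT) {
            transition_to_vm();
            memcpy(a->elements, buf, a->length * sizeof(jint));  /* buffer -> heap */
            transition_from_vm();
        }
        if (mode != JNI_COMMIT) free(buf);
    }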

Upon reflection, however, I recalled that some optimized memcpy() implementations may execute explicitly coded benign speculative stores into the destination region. Another possibility is the block-initializing-store (BIS) used by memcpy(): a concurrent observer might fetch from a memcpy() target cache line in the narrow timing window between when the line was zeroed and when the contents were copied back into the line. In either case, the memcpy() in ReleaseIntArrayElements() could transiently mutate the contents of the Java array in the heap to some unexpected value. By the time the memcpy() finishes, of course, the destination buffer will again contain the expected contents. I quickly confirmed this scenario might occur by writing a quick throw-away multi-threaded C test case that modeled the JNI behavior and demonstrated the effect.

The underlying mechanism at work deserves a bit more explanation. Let's say we're writing C code and have an effectively immutable array A. A contains a known set of values. Concurrently, we have a 1st thread reading from A and a 2nd thread memcpy()ing precisely the same set of values known to be in A back into A. At first glance it's not unreasonable to assume that the reader (the 1st thread) will always see the expected values in A. That is, we're assuming that memcpy()ing the same values back into an array is idempotent. The memcpy() operation is certainly idempotent from the point of view of the caller of memcpy(), but interestingly it's not idempotent from the point of view of concurrent observers. With optimized memcpy() implementations, concurrent observers can see "odd" intermediate states. Naively, you'd expect that concurrent observers would see the array as unchanging, but that isn't the case on all platforms. In the case at hand the memcpy() in ReleaseIntArrayElements() is the source of the problem, as the array in the heap might "flicker". To further confirm this diagnosis I provided the customer with a simplistic byte-by-byte unoptimized memcpy() implementation in an LD_PRELOAD-able memcpy.so shared object. And indeed, the problem wasn't evident when running with this simple memcpy() instead of the optimized system form.
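
For the curious, a minimal model of that throw-away test case might look something like the sketch below. The array size and sentinel value are arbitrary, and whether the reader ever reports "flicker" depends entirely on the platform's memcpy() implementation; with a strictly byte-at-a-time memcpy() it should never fire. (Strictly speaking the reader and writer race on A, which is exactly the situation being modeled.)

    /* Reader thread scans A for unexpected values while a writer thread
     * repeatedly memcpy()s the very same values out of A and back into A,
     * modeling GetIntArrayElements() followed by ReleaseIntArrayElements(). */
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define N        4096
    #define SENTINEL 0x12345678

    static int A[N];    /* the "effectively immutable" array under observation */
    static int B[N];    /* disjoint staging buffer; receives and returns the same values */

    static void *writer(void *arg) {
        (void)arg;
        for (;;) {
            memcpy(B, A, sizeof(A));    /* models the copy-out in GetIntArrayElements()      */
            memcpy(A, B, sizeof(A));    /* models the copy-back in ReleaseIntArrayElements() */
        }
        return NULL;
    }

    static void *reader(void *arg) {
        (void)arg;
        for (;;) {
            for (int i = 0; i < N; i++) {
                int v = A[i];
                if (v != SENTINEL)      /* should "never" happen: A is never logically changed */
                    printf("flicker at %d: observed %x\n", i, (unsigned)v);
            }
        }
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < N; i++) A[i] = SENTINEL;
        pthread_t t1, t2;
        pthread_create(&t1, NULL, writer, NULL);
        pthread_create(&t2, NULL, reader, NULL);
        pthread_join(t1, NULL);         /* threads run until the process is killed */
        return 0;
    }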

While technically legal, such behavior in memcpy() -- and the JNI operators that call memcpy() -- is, at best, surprising and violates the principle of least astonishment. Strictly speaking, memcpy() provides no particular guarantees about concurrently observable intermediate states, but rather only specifies the state of the destination buffer at the end of the invocation. That is, while a memcpy() is in flight the destination buffer might contain transient or ephemeral values which could be observed by other threads. Overwriting an array with an exact copy of the array will be idempotent from the perspective of the caller after memcpy() completes, but concurrent observers of the array are permitted to see transient ephemeral "garbage" values. My first thought when we ran into this behavior some years ago was that it was a bug in the memcpy() implementation, but upon deeper reflection I concluded the bug was in my errant assumptions about how memcpy() worked. Having said that, while such behavior is technically permissible I'm also sympathetic that such effects are, at best, unexpected. It's not unreasonable to assume the destination buffer would remain unmolested. One could imagine customers easily falling into this trap, in both C code and Java.
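
For reference, the diagnostic byte-by-byte memcpy() mentioned above is conceptually nothing more than the loop below. The build and interposition commands in the comment are only indicative and will vary by toolchain; the point is simply that each destination byte is written exactly once, with its final value, so there's no window in which a concurrent observer can see anything unexpected.

    /* Trivial byte-at-a-time memcpy() -- a diagnostic replacement, not a
     * performant one.  Built as a shared object and interposed with
     * LD_PRELOAD, along these lines (exact flags depend on the toolchain):
     *     gcc -O2 -shared -fPIC -o memcpy.so memcpy.c
     *     LD_PRELOAD=./memcpy.so java ...
     */
    #include <stddef.h>

    void *memcpy(void *dst, const void *src, size_t n) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        while (n--)
            *d++ = *s++;    /* each destination byte written once, with its final value */
        return dst;
    }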

Comments:

This is interesting. Why would memcpy() do this? Some sort of alignment issue?

What does the JNI code do? Were you able to switch to JNI_ABORT mode for the release?

Posted by Bob Lee on May 24, 2009 at 07:12 AM EDT #

Hi Bob, I believe that memcpy() issues the speculative stores to cover odd-ball alignment cases as you conjectured. They're precautionary to cover a case that probably doesn't happen too often, and typically those stores will be overwritten by the proper values later in the operation. The saving from this approach seems to be in the elimination of conditional branches.

In the particular case I mentioned it turns out that they can't readily change the source code, at least not over the short term, so I believe they might use the LD_PRELOAD mechanism to provide immediate relief. I have mixed feelings about that given that I only intended memcpy.so as a diagnostic tool to help narrow down the problem.

Regards, -Dave

Posted by David Dice on May 24, 2009 at 11:56 AM EDT #

Interesting mismatch of C and Java memory models. But what is the correct idiom to use in C JNI code to read (and not write) arrays and keep any concurrently running Java code from seeing spurious values? Or is this a bug in the Release<Type>ArrayElements() implementation, which needs to not use memcpy() for placing back elements? Does JNI guarantee you can read arrays without disturbing the Java memory model?

Posted by Mark Wielaard on May 24, 2009 at 08:14 PM EDT #

Hi Dave,

I don't quite follow this. If the pointer to the memcpy'ed buffer isn't returned until after it completes, what reader was concurrently looking into that buffer??? If you had a concurrent reader of the buffer then they would be able to see a partially copied array - which would be completely broken and obviously a lack of synchronization between the reader and writer.

Posted by David Holmes on May 24, 2009 at 08:19 PM EDT #

Doh! Never mind. It's the memcpy back into the heap that allows concurrent reads. Ouch!

Posted by David Holmes on May 24, 2009 at 08:21 PM EDT #

Great post.

I wonder how much time the speculative store "optimizations" really save over a more traditional 3 part implementation? Such as:
1. optional copy until aligned
2. aligned copying
3. optional trailing partial block copy

I suspect not too much, although the real answer probably depends on median value of n passed to memcpy.
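
Roughly the shape I have in mind, purely for illustration -- it assumes the source is just as aligned as the destination for the word loads, and it ignores the strict-aliasing and wide-block tricks a production memcpy() would have to deal with:

    #include <stddef.h>
    #include <stdint.h>

    /* Traditional three-part copy: byte prologue until dst is word-aligned,
     * word-at-a-time bulk copy, byte epilogue for the trailing remainder. */
    void *memcpy3(void *dst, const void *src, size_t n) {
        unsigned char *d = dst;
        const unsigned char *s = src;

        /* 1. optional copy until aligned */
        while (n && ((uintptr_t)d & (sizeof(long) - 1))) { *d++ = *s++; n--; }

        /* 2. aligned copying, one word at a time */
        while (n >= sizeof(long)) {
            *(long *)d = *(const long *)s;
            d += sizeof(long); s += sizeof(long); n -= sizeof(long);
        }

        /* 3. optional trailing partial block copy */
        while (n--) *d++ = *s++;
        return dst;
    }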

I suspect this web page highlights some of the issues that the memcpy implementor was trying to deal with.
http://forums.sun.com/thread.jspa?threadID=5263939

I have to admit to writing some code like this once on a PPC where I increased performance of a copy loop using "dcbz", which zeros a cache line and avoids reading data into cache that you are about to write. If that cache line were to be cast out before line-size bytes are written, another processor would see zeros. In my case that was OK because this was data that was being transformed by a single processor before it was going to be DMA'd.

Posted by L. Bakst on May 24, 2009 at 09:10 PM EDT #

Found the answer to my own question. Bob Lee gave the hint already in his first comment. To keep the concurrent Java code from seeing intermediate values being written back to the array, you need to use the JNI_ABORT flag to Release<Type>ArrayElements().

If you use JNI_ABORT, the function frees the memory allocated for the native array without copying back the new contents, which is what you want if you are only reading the array contents in the C JNI code.
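
In code, the read-only idiom looks roughly like the following. The class and method names are made up purely for illustration; only the Get/Release pairing with JNI_ABORT matters.

    #include <jni.h>

    /* Read-only access to a Java int[]: release with JNI_ABORT so the
     * native buffer is freed without ever being copied back into the heap. */
    JNIEXPORT jlong JNICALL
    Java_Example_sumArray(JNIEnv *env, jobject self, jintArray arr) {
        jsize len = (*env)->GetArrayLength(env, arr);
        jint *elems = (*env)->GetIntArrayElements(env, arr, NULL);
        if (elems == NULL)
            return 0;                    /* OutOfMemoryError already pending */

        jlong sum = 0;
        for (jsize i = 0; i < len; i++)
            sum += elems[i];

        /* JNI_ABORT: free the buffer, skip the copy-back memcpy(). */
        (*env)->ReleaseIntArrayElements(env, arr, elems, JNI_ABORT);
        return sum;
    }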

Posted by Mark Wielaard on May 24, 2009 at 09:46 PM EDT #

Following up on Mark's comments ... I believe the memcpy() and JNI issues encountered by the customer are well outside the domain of the JMM. While the behavior seen is *technically* legal I believe it also violates the principle of least astonishment.

Following up on L. Bakst's comments ... First, thanks for the pointer to the discussion of memcpy() on Solaris. I'm not sure the issue is driven directly by T1-specific optimizations, as we've seen memcpy() oddities dating back to well before the T1. My guess is that it's simply a ploy to reduce hard-to-predict conditional branches. (Spelunking the source history would probably provide a definitive answer).

As an aside, at one point in time the JVM used the platform memcpy() for intra-heap copying of arrays of references. Concurrent readers could sometimes see "half-written" references, however, even though the transfer wasn't a simple byte-by-byte copy. That, in turn, could lead to type system failure and other issues, so we're now careful to use explicit word-size copying and avoid memcpy() for references.
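
Schematically the safe form amounts to something like the loop below: one aligned full-word store per reference, so a concurrent reader can never see a torn pointer. This is only an illustration, not the actual VM copy stubs, which also have to deal with overlap, GC barriers, compressed references and so on.

    #include <stddef.h>
    #include <stdint.h>

    /* Copy an array of references one whole machine word at a time; the
     * volatile qualifier discourages the compiler from turning the loop
     * back into a call to memcpy().  Illustrative only. */
    static void copy_reference_words(volatile uintptr_t *dst,
                                     const uintptr_t *src, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];    /* single aligned word store per reference */
    }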

Posted by David Dice on May 25, 2009 at 11:51 AM EDT #

Why is the memcpy back into the heap happening at all if the array is immutable?

Posted by AnonymousCoward on May 25, 2009 at 09:10 PM EDT #

Anon: I believe it's because releaseIntArrayElements() is called, which copies the buffer back to the heap to preserve any changes and then frees it. Bob Lee & Mark W. suggest using JNI_ABORT to prevent the copy-back and just free the buffer. (Correct me if I'm wrong).

Posted by guest on May 25, 2009 at 09:39 PM EDT #

The GC should use a memfence instruction before leaving the "inVM" state to flush all the queued writes. This is a GC bug.

Posted by guest on May 26, 2009 at 03:25 AM EDT #

Following up on the posting from 204.131.51.65 regarding the use of a fence after changing thread state (InVM) ... My description was schematic and necessarily abridged many details. Obviously there's an MFENCE or MEMBAR #StoreLoad in the state transition operators to safely coordinate the Dekker-like protocol used by the mutator threads and the VM and collector threads. The particular bug under discussion has nothing to do with platform memory models, however. It'd still arise on platforms with a strong sequentially consistent memory model (where fences were unnecessary).

Posted by David Dice on May 26, 2009 at 03:58 AM EDT #

Interesting post. Assuming you were changing the same array from different threads, would you have to lock it before using getIntArrayElements(), releaseIntArrayElements()? If so, this behavior sounds "correct" even if it's surprising. If not, I would consider this a definite JNI bug.

At the very least, we should probably add a warning to the JNI documentation that unless you use JNI_ABORT you are essentially modifying the array and that has concurrency implications. It's definitely not obvious.

Posted by Gili on May 26, 2009 at 06:59 AM EDT #

Quick question - does the Hotspot intrinsic System.arraycopy() implementation have this problem? I started following it through the C2 compiler code, but I eventually hit x86 assembler, at which point I gave up.

If it did, of course, that would be a horrible Java memory model violation that should probably be fixed...

Posted by Jeremy Manson on May 31, 2009 at 09:02 PM EDT #

Jeremy, arraycopy() for reference arrays -- intrinsified or the "vanilla" form called from the interpreter -- shouldn't use memcpy() because, as I might have mentioned in replies to earlier comments, doing so could lead to type system failure (viewing partial reference words) and all manner of subsequent pathologies. We fixed that many years ago. I'd need to check to see how the other forms (int, char, etc.) are handled. Since we don't really have arrays of volatiles, do you have any expectations?

Posted by David Dice on June 01, 2009 at 01:42 AM EDT #

Hi Dave,

For non-volatile Java array stores and loads, the MM requires that you see a value that was written to the memory location you are reading, per the causality rules. That's the so-called "not-out-of-thin-air" guarantee. If you see a garbage value, that would violate the model (except in the case of non-atomic writes to 64-bit scalar values).

In the JNI case, you could argue (as, indeed, you have) that the accesses to the array are native stores and loads, so weird results are par for the course. But you can't really argue that about System.arraycopy(), which is "pure Java". In that case, you shouldn't see garbage results. That's why I was curious.

Whether the specific results you got were legal would depend on what the results were, of course. For example, if you temporarily saw a location a[i] revert to its previous value, that would be fine. But if it were a 32-bit int, and you saw the first 16 bits from the new value, and the second 16 bits from the previous value, that would be illegal.

Posted by Jeremy Manson on June 01, 2009 at 01:18 PM EDT #

On the UltraSPARC T2Plus CPU where this problem was encountered, the optimized memcpy routine uses what is called a Block Initializing Store.

To quote the PRM:

This special store will not generally fetch the line from memory
(initializing it with zeros instead).

Programming Note – Overlapped copies must avoid issuing block-init stores to lines
before all loads to that line, otherwise the load may see the side-effect zeroing

Posted by Paul R on June 01, 2009 at 04:05 PM EDT #

Following up on Jeremy's comments ... thanks for the clarification. arraycopy() is aggressively intrinsified as it's critical for some benchmarks that "scroll" 80x25 pseudo-screens. At least at one point in time we were vulnerable to word/byte atomicity issues in arraycopy(). That was a long time ago, but the code in question attracts optimizations. I took a very brief look and my current read is that it's safe and correct -- crucially, it avoids memcpy() and always copies units that are either the same size as or larger than the array element type.

Posted by David Dice on June 02, 2009 at 01:50 AM EDT #

Following up on Paul's comments ...

In our case the arrays are completely disjoint, so the caveat in the PRM about overlap shouldn't apply. Furthermore, the intent of the warning is to caution the writer of memcpy()-like operations to issue all their LDs before they issue the block-initializing-store. The concern is only about what (and when) the processor executing the BIS might observe, and not what other CPUs might observe. (I don't have a PRM in front of me, but IIRC some of the "block" ASIs require before-and-after bracketing with MEMBAR #Sync as they provide weaker ordering and aren't necessarily interlocked).

I still suspect the byte stores found in the memcpy() prolog and epilog code. But another possibility is that the BIS forces a destination line into local M-state in the cache of the processor executing the memcpy() and then fills the line with zeros. Some other processor then reads that line and sees the intermediate 0s in the window between the BIS and the point in time when memcpy() writes the contents into the line. (Conceptually this is no different than memcpy() just explicitly zeroing target lines and then copying into them -- a window exists).

The reason I still suspect the byte stores is that in the past we've seen transient single-byte ephemera when copying arrays of references. But the other case seems equally likely. When the transient was observed did we ever manage to capture and report any of the unexpected values?

Posted by David Dice on June 02, 2009 at 02:22 AM EDT #

hi

Posted by guest on June 11, 2009 at 06:37 AM EDT #

Sorry to be a bit late to the party...forgive what might seem a simple question, but (at least by the terms of the C standard) doesn't this operation require memmove() rather than memcpy()?

memcpy() decorates its arguments with 'restrict', which certainly suggests to me that a copy from a to a is illegal.

Posted by guest on July 08, 2011 at 10:51 AM EDT #

The actual case was a copy from A to B and then B back to A while concurrent readers were observing A. A and B are disjoint.

Regards, -Dave

Posted by guest on July 08, 2011 at 03:12 PM EDT #
