A race in LockSupport park() arising from weak memory models

I recently diagnosed the root cause of a concurrency bug, CR 6822370,
and thought it sufficiently interesting to share the details. (CR 6822370 actually represents a cluster of bugs that are now thought to be related by a common underlying issue). Briefly, we have a lost wakeup bug in the native C++ Parker::park() platform-specific
infrastructure code that implements java.util.concurrent.LockSupport.park(). The lost wakeup arises from a race, which in turn is admitted by architectural reordering that occurs because of missing memory barrier instructions. The lost wakeup may manifest as various 'hangs' or instances of progress failure.

For background, a Parker is a native HotSpot C++ type that implements a construct somewhat like a restricted-range binary semaphore, except that park() is allowed to return spuriously. See LockSupport for details. If a thread is blocked in park() we're guaranteed that a subsequent unpark() will make it ready. (A perfectly legal but low-quality implementation of park() and unpark() would be empty methods, in which case the program degenerates to simple spinning. And in fact that's the litmus test for correct park()-unpark() usage). A Parker instance is associated with at most one thread at any one time, and only that designated thread may invoke park() on that particular Parker. Any thread, however, may call unpark() on a given Parker. Furthermore, Parker instances are type-stable and immortal, at least in Sun's HotSpot implementation. On Solaris and Linux a Parker contains a pthread condvar (named _cond), a mutex (named _mutex), and a volatile integer, _counter, that represents the state of the semaphore. Despite its name, _counter only takes on the values 0
(neutral) and 1 (signaled). Each thread has a Parker instance dedicated to use by LockSupport.
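
To make that litmus test concrete, here is a minimal C++ sketch (my own illustration, not HotSpot code) in which park() and unpark() are the perfectly legal but low-quality empty implementations; the canonical re-check loop remains correct and simply degenerates to spinning:

#include <atomic>
#include <thread>

// Deliberately empty park()/unpark(): a legal but low-quality
// implementation. Because correct usage must tolerate spurious
// returns from park(), the program below still terminates; it
// merely degenerates to spinning.
static void park()   { }
static void unpark() { }

static std::atomic<int> C{0};   // the condition of interest

int main() {
  std::thread consumer([] {
    while (C.load() == 0)       // always re-check after park() returns
      park();
  });
  std::thread producer([] {
    C.store(1);                 // make the condition true...
    unpark();                   // ...then wake the (possibly parked) consumer
  });
  producer.join();
  consumer.join();
  return 0;
}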

Parker::park() and unpark()

// Redacted and annotated for clarity
void Parker::unpark() {
  pthread_mutex_lock(_mutex);
  int s = _counter;
  _counter = 1;
  pthread_mutex_unlock(_mutex);
  if (s < 1) {
    // We signal after having dropped the lock to minimize the hold time and
    // in case the underlying implementation doesn't provide wait morphing.
    pthread_cond_signal(_cond);
  }
}

void Parker::park() {
  if (_counter > 0) {
    // Optional optimization to avoid taking the lock
    _counter = 0;
    return;
  }
  if (pthread_mutex_trylock(_mutex) != 0) {
    // Another optional optimization
    // Must be a concurrent unpark() - just return
    return;
  }
  if (_counter > 0) {  // no wait needed
    _counter = 0;
    pthread_mutex_unlock(_mutex);
    return;
  }
  pthread_cond_wait(_cond, _mutex);
  _counter = 0;
  pthread_mutex_unlock(_mutex);
}

Failure scenario


  1. Let's suppose we have a thread T1 whose LockSupport/Parker instance has _counter=1.

  2. T1 calls LockSupport.park(), which then invokes Parker::park(). park() fetches _counter and observes 1, then sets _counter=0 and returns. There are no atomic or
    fence instructions in this path. Crucially, the store of 0 into _counter languishes in the processor's local store buffer and is not yet visible to other processors.

  3. T1 returns from park(), checks the condition of interest and observes that it's not yet signaled, and again calls park(). The store to _counter from step (2) is still languishing in the store buffer and is not yet globally visible. Control reaches the very top of park().

  4. Thread T2 causes the condition of interest to enter signaled state via some stores. These are typically stores to Java volatile variables, so the JIT will emit the necessary memory barriers after the stores.

  5. Thread T2 then calls unpark(T1). The unpark() operator stores 1 into _counter. We'll assume that this store becomes globally visible immediately. T2 returns from unpark().

  6. Thread T1's old store of 0 into _counter from step (2) finally drains from its local store buffer and becomes globally visible, overwriting the value 1 just written by T2 in step (5). With due lack of formal precision, _counter is now "globally" 0.

  7. Thread T1 in park() fetches _counter, observes 0, and then blocks on the condvar. We have a lost wakeup — T1 stalls.

Remarks and analysis


  • The underlying issue is a classic optimization in unpark() which tries to avoid taking a lock. The _counter instance variable is often but not always accessed under _mutex. If the underlying platform provided sequential consistency then the code as shown above would be correct, but x86 and SPARC provide weaker memory consistency models, allowing memory or visibility order to differ from program order. (Refer to the excellent papers by Peter Sewell et al. for background). Given those architectural reorderings, the program admits a race which can in turn result in missed wakeups.

  • The problem would only manifest when we were using the -UseMembar optimization that lets us remove fences from certain hot thread state transition paths that need to coordinate safepoints between mutator threads and the JVM. This feature is enabled by default, but we can turn it off with the -XX:+UseMembar switch, which causes the JVM to emit normal fence instructions in the state transition paths. (That particular optimization is an example of asymmetric Dekker synchronization). Crucially, the park() path contains such a state transition. In reality the fence emitted by the +UseMembar switch was simply covering up the otherwise latent Parker:: bug. +UseMembar constitutes a work-around. Sensitivity to UseMembar was initially confounding but ultimately a valuable clue.

    After thinking about various timing scenarios I settled on the one given above as the most likely culprit. To support that hypothesis I wrote a simple C model of the pathology and verified that it would "fail" in a similar fashion. Having collected data with the C model on various platforms I suspect that processors where stores can languish in the store buffer for longer periods are more exposed to the bug.

  • Inserting appropriate barrier instructions after both stores of 0 into _counter in park() provides a solution; a sketch follows this list. Furthermore, we're not formally guaranteed that pthread_mutex_unlock() has barrier semantics, so to be conservative we need a barrier in that location as well. For our purposes a fence instruction prevents subsequent loads (subsequent in program order) from executing before prior stores become globally visible. We typically use volatile to control for compile-time reordering and fences to control for architectural reordering.

  • The bug will not manifest on uniprocessors or environments where threads are otherwise constrained to just a single processor.

  • The bug is a "day-one" bug and present in all versions of HotSpot.

  • Parker::park() and unpark() reside in os_linux.cpp, os_solaris.cpp and os_windows.cpp for Linux, Solaris and Windows, respectively.

  • The built-in synchronized implementation uses a different park mechanism (ParkEvent::) whereas the java.util.concurrent infrastructure uses Parker::. Only Parker:: is vulnerable.

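To illustrate the barrier placement described in the list above, here is a hedged sketch of the repaired park(). It is not the actual HotSpot changeset (linked in the comments below); FULL_FENCE() is a placeholder for a #StoreLoad barrier, e.g. OrderAccess::fence() in HotSpot terms, or MFENCE on x86:

// A sketch of the repaired park(): a #StoreLoad barrier after each of
// the three stores of 0 into _counter. FULL_FENCE() is a placeholder
// (OrderAccess::fence() in HotSpot terms; MFENCE on x86). The fences
// that follow pthread_mutex_unlock() are the conservative ones, since
// unlock() is not formally guaranteed to have barrier semantics.
void Parker::park() {
  if (_counter > 0) {
    _counter = 0;
    FULL_FENCE();   // store must be globally visible before we re-read
                    // the application condition and possibly park again
    return;
  }
  if (pthread_mutex_trylock(_mutex) != 0) {
    // Must be a concurrent unpark() - just return
    return;
  }
  if (_counter > 0) {  // no wait needed
    _counter = 0;
    pthread_mutex_unlock(_mutex);
    FULL_FENCE();
    return;
  }
  pthread_cond_wait(_cond, _mutex);
  _counter = 0;
  pthread_mutex_unlock(_mutex);
  FULL_FENCE();
}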

Appendix - elaborated scenario

We have a volatile Java variable C, initially 0. Thread T1's Parker::_counter value is 1.


  • Thread T1 executes: while (C == 0) park() ;
  • Thread T2 executes: C=1; unpark(T1)

Timing:


  1. Thread T1 loads C, observes 0, and calls park()
  2. Thread T1 in 1st park() invocation:

    • Load _counter (1)
    • Store _counter (0) — languishes in store buffer
    • return from park()

  3. Thread T1: Load C (0)
  4. Thread T1: call park() a 2nd time
  5. Thread T2: Store C=1; MFENCE (by virtue of Java volatile).
    MFENCE is a completely local operation, influencing only T2.
  6. Thread T2: call unpark(T1)

    • lock _mutex with an atomic instruction such as CAS or LOCK:CMPXCHG
    • Load _counter (1) — T1's store of 0 still languishes in T1's store buffer, so T2 legally observes the initial value; with s == 1 the signal is skipped
    • Store _counter (1)
    • unlock _mutex

  7. Thread T1's store of 0 into _counter finally drains from the store buffer and becomes
    globally visible, overwriting the value 1 just stored by T2
  8. Thread T1 continues in 2nd invocation of park()

    • Load _counter (0)
    • lock _mutex
    • Load _counter (0)
    • block on _cond

Another way to think of the problem is via the Dekker-duality. Observe that T1 executes { Store _counter; Load C; } while T2 executes { Store C; MFENCE; Load _counter; }. Note the missing MFENCE from T1's path. The duality is slightly obscured because the store into _counter occurs in the 1st invocation of park(), which returns immediately, and the load of C occurs in the application code. And in fact that's what distinguishes this bug - the failure idiom is distributed over two invocations of park().
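
To see the duality in isolation, here is a self-contained litmus-test sketch (my own construction, in the spirit of the C model mentioned earlier; not the author's code). Relaxed atomics stand in for the fence-free accesses, and the fence in t2 models the JIT's MFENCE. The bad outcome, T1 reading C == 0 while T1's store of 0 lands last in coherence order, is forbidden under sequential consistency but allowed under TSO, and is exactly the lost wakeup. The window is narrow, so the test may need many iterations and may not fire at all on a given machine:

#include <atomic>
#include <cstdio>
#include <thread>

// Triangular-race litmus test. T1 = { Store _counter=0; Load C } with
// no fence; T2 = { Store C=1; fence; Store _counter=1 }. Failure means
// T1 saw C==0 (so it would park again) AND _counter finished at 0
// (T1's buffered store overwrote T2's 1): a wakeup that never arrives.
std::atomic<int> counter, C;
int r1;

int main() {
  for (int i = 0; i < 100000; ++i) {
    counter.store(1, std::memory_order_relaxed);  // Parker starts signaled
    C.store(0, std::memory_order_relaxed);        // condition not yet set
    std::atomic_thread_fence(std::memory_order_seq_cst);

    std::thread t1([] {  // 1st park() fast path, then re-check C
      counter.store(0, std::memory_order_relaxed);  // no fence: the bug
      r1 = C.load(std::memory_order_relaxed);
    });
    std::thread t2([] {  // set the condition, then unpark()
      C.store(1, std::memory_order_relaxed);
      std::atomic_thread_fence(std::memory_order_seq_cst);  // JIT's MFENCE
      counter.store(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();

    if (r1 == 0 && counter.load(std::memory_order_relaxed) == 0) {
      std::printf("triangular race observed at iteration %d\n", i);
      return 0;
    }
  }
  std::printf("not observed (the window is narrow)\n");
  return 0;
}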

Update: Scott Owens in Peter Sewell's group coined the term triangular race in his 2010 ECOOP paper "Reasoning about the Implementation of Concurrency Abstractions on x86-TSO" to describe this type of situation.

Comments:

Hi David,
I can't understand the following:
"Load _counter (0) — observes value from memory, completely legal"
The initial value of counter is 1, and T1's store of 0 is still in the local store buffer. So where does 0 come from?

Btw, is park/unpark intended to work with NON volatile variables? I.e., is the following legal provided that C is a plain non-volatile var?
--------------------------------------
#T1: while (C == 0) park() ;
#T2: C=1; unpark(T1)
--------------------------------------
Thank you.

Posted by Dmitriy V'jukov on November 30, 2009 at 06:09 PM EST #

David,

You do not need a memory barrier in park() after *both* stores (actually there are three); you need it only after the first store.
Also, the absence of a memory barrier in pthread_mutex_unlock() has nothing to do with this issue; it's completely harmless. But you indeed need a memory barrier before the load from _counter in unpark(), otherwise you will get the same deadlock if the user state is *non*-volatile.
Also, there is a small (mostly harmless) liveness violation.
I've described all this in more detail here:
http://groups.google.com/group/relacy/browse_frm/thread/78b09addba8a1d78
--
Dmitriy V'jukov

Posted by Dmitry V'jukov on November 30, 2009 at 11:38 PM EST #

Dmitriy,

Thanks for the comments. Relacy is a terrific tool, btw.

For background, the executions we're discussing involve mixed C++ code and Java code compiled by a JIT which might lead to some misunderstanding. In the "elaborated scenario" above I called out the variable C as Java volatile. (That's a strong requirement. If C were non-volatile then there's no guarantee of the MFENCE being emitted in the T2 path. Furthermore, the JIT would invariably transform the code in T1 into {if (C==0) for(;;) park();}) We're guaranteed -- as a property of the implementation -- that the JIT will emit an MFENCE or equivalent instruction after the store into C but before invoking unpark(T1).

As an aside, on all our reference platforms pthread_mutex_lock() will execute an atomic instruction having barrier semantics, so there's yet another barrier in the path, although we'd *never* want to depend on such implementation-specific knowledge.

As for the liveness issue you mention, technically you're correct. In practice, however, we depend on eventual consistency. It's almost always faster to let the pending stores (the stores that will update the condition-of-interest) drain and become visible to the consumer than it is to yield. Yielding also has rather horrible performance issues on some of our reference platforms.

Regards
Dave

Posted by David Dice on December 01, 2009 at 09:20 AM EST #

Hi Dave,

> Relacy is a terrific tool, btw.
Thank you.

Regarding "{if (C==0) for(;;) park();}" optimization. I can't understand from the documentation as to whether Parker must work for plain variables, however I would expect it to act as a synchronization primitive, i.e. to establish synchronizes-with edge between unpark and park (unpark acts as a volatile store and park acts as a volatile load, otherwise it does not provides sequential consistency). And if one would replace unpark/park pair with volatile store/load pair, then "{if (C==0) for(;;) park();}" optimization will be illegal.

Regarding the liveness issue, actually I meant some wicked compiler optimizations. Especially if park/unpark is used with a plain var+Fences API (no orderAccesses, just orderReads/orderWrites). I think that the store to user state may sink into the critical section in unpark. Or the load from user state may hoist out of the loop ("{if (C==0) for(;;) park();}").

Posted by Dmitriy V'jukov on December 02, 2009 at 08:41 PM EST #

Humm... So why did you fix it by inserting *3* memory barriers into park()?
http://hg.openjdk.java.net/jdk7/hotspot-rt/hotspot/rev/95e9083cf4a7

My bet is that only *1* is required.
If the consumer acquires the mutex in park() and then observes counter=1 => he synchronizes with the producer that set counter=1, so the consumer MUST see the state change.

Posted by Dmitriy V'jukov on December 02, 2009 at 08:46 PM EST #

Hi Dmitriy, Regarding the examples with the variable C. If C is a Java volatile then in the producer code the JIT will emit an MFENCE (or equivalent instruction) before the call to unpark(). In the consumer path the JIT isn't allowed to elide or otherwise optimize away the loads of C in the park() loop. But if C is non-volatile then we have an application error. There's no guarantee of an MFENCE in the producer path between the store and the call to unpark(), and furthermore the JIT can legally transform the loop as I mentioned above. It can't elide the park() call, but it'll simply wrap park() in an indefinite loop. Regards, Dave

Posted by David Dice on December 03, 2009 at 02:51 AM EST #

Hi Dave,

Did the original author get confused about the semantics of volatile in C++, or was this just simply a straightforward oversight?

Jeremy

Posted by Jeremy Manson on December 06, 2009 at 07:41 AM EST #

Hi Jeremy, Doug wrote Parker:: (I wrote ParkEvent::). Given that, it must be a simple oversight. It was a moderately subtle race, and furthermore it was masked until recently by the fences in the state transition code, as well as by the fact that stores didn't languish in store buffers for very long until the advent of recent processors.

As an aside, recent discussions on the C-I mailing list about interrupts (posting and recognition forming an HBO edge) gave me pause about that mechanism as well. I suspect we have some related issues in those paths.

Regards, Dave

Posted by David Dice on December 06, 2009 at 08:31 AM EST #

Can you explain why there are 3 memory barriers in your patch?

Posted by gustav on December 08, 2009 at 07:08 AM EST #

Any time we store 0 into _counter in park() we want a #storeload fence before returning to the application loop. There's no guarantee that pthread_mutex_unlock() has #storeload semantics on all our platforms, thus the additional 2 fences.

Posted by dave dice on December 08, 2009 at 07:23 AM EST #

David, I do not agree with you here.
Even if mutex acquisition does not provide a full memory fence, it still provides sequential consistency, because all operations on a particular mutex can be thought of as operations on a single variable, which in turn provides a total order.
With regard to Dekker algorithm this is expressed as:

std::atomic<int> flag1;
std::atomic<int> flag2;
std::atomic<int> mtx;

// thread 1
flag1.store(1, std::memory_order_relaxed);
mtx.exchange(0, std::memory_order_acq_rel);
flag2.load(std::memory_order_relaxed);

// thread 2
flag2.store(1, std::memory_order_relaxed);
mtx.exchange(0, std::memory_order_acq_rel);
flag1.load(std::memory_order_relaxed);

Even if exchange(std::memory_order_acq_rel) does not include full memory fence (#StoreLoad style membar), it still makes the algorithm work, i.e. outcome flag1 == flag2 == 0 is impossible.

Back to Parker. The mutex guarantees a total order over all acquisitions. So 2 cases are possible: the order of acquisition is either producer then consumer, or consumer then producer. If the producer acquires the mutex first, then the order of operations is:

// producer:
user_state = 1;
mtx.lock();
if (counter == 0)
  signal(); // not taken in this execution
counter = 1;
mtx.unlock(); // mtx.store(0, memory_order_release)

// then consumer:
mtx.lock(); // mtx.load(memory_order_acquire)
if (counter == 0)
  wait(); // not taken in this execution
counter = 0;
mtx.unlock();
R = user_state;

It's guaranteed that the consumer will see R == 1, because the producer made a store-release to mtx and then the consumer made a load-acquire from mtx, so the visibility of user_state is transferred from producer to consumer.

Case 2: consumer acquires mutex first:

// consumer:
mtx.lock();
if (counter == 0)
  wait(); // taken in this execution
counter = 0;
mtx.unlock();
R = user_state;

// then producer:
user_state = 1;
mtx.lock();
if (counter == 0)
  signal(); // taken in this execution
counter = 1;
mtx.unlock();

It's guaranteed that the producer will observe counter == 0 and thus signal the consumer, because the visibility of counter is once again transferred by means of the mutex.

What am I missing? David, can you provide a particular execution in which a full memory fence is required after all stores to counter on the consumer side?

Posted by Dmitry V'jukov on December 15, 2009 at 02:56 AM EST #

So the patch got promoted in the mercurial repo recently and is hence getting into the next jdk build even though the author won't or can't explain his design?

Posted by gustav on December 26, 2009 at 10:15 AM EST #

Gustav, I intend to get around to a detailed reply to Dmitriy's question and comments at a later time, but at the moment I'm engaged with other activities. I think, too, that the confusion is rooted in the particular context, so when I have time I'll try to make it more clear.

Regards
Dave

p.s., Doug Lea and I have a different ultimate fix to the problem, but given that we had a very short window in which to get the code into the JDK release in question, it was deemed prudent to provide a very targeted and conservative change set.

p.s., you may find the following of interest: http://www.cl.cam.ac.uk/~so294/documents/draft-scTSO.pdf

Posted by dave dice on December 26, 2009 at 02:34 PM EST #

Hi,

I have a couple of (untimely) questions:

Assuming that the cache line where *counter* resides is exclusive to PROC(T2), it seems to me that when the write from T1 goes into the store buffer, an invalidate message is sent to PROC(T2) regarding that cache line. So, shouldn't the CAS issued by T2 when acquiring the mutex drain PROC(T2)'s invalidate queue?

Also, is the following scenario possible? It would also justify the problem.

- T1 loads _counter == 1
- T1 loads C == 0 (*reordered*; maybe a speculative load)
- T2 stores C = 1
- T2 executes unpark, storing _counter = 1
- T1 stores _counter = 0 (*reordered*)
- T1 loops and blocks.

Thanks,
Duarte

Posted by Duarte Nunes on July 29, 2011 at 10:05 AM EDT #

Hi Duarte,

Regarding your question about probes, I believe there's significant variation in when probes are sent out relative to when stores enter and exit the store buffer. If you probe aggressively on ingress then you have to be prepared for the case that you lost the line between ingress and egress. (Locking the line might result in deadlock, at least in naive systems).

Regarding the alternative possible scenario, I'd need to give that some thought (I'm traveling with my family as I write this. Metro stations aren't conducive to thinking about potential orderings (:>)). To allow the scenario you describe a LD and ST (in program order) by T1 would have to be reordered, correct? On SPARC and x86 LD;ST won't be reordered. ST;LD is the only vulnerable sequence.

Regards, -Dave

Posted by dave dice on August 01, 2011 at 04:34 AM EDT #

Hi Dave,

Ah, I didn't consider that a line could be lost. I have to read more on that.

I actually described the instruction sequence with the ST;LD already reordered, the program order being _counter = 0; C == 0.

Thanks for the answers,
-Duarte

PS: The turnstiles always remind me of memory barriers :)

Posted by Duarte Nunes on August 01, 2011 at 11:38 AM EDT #
