Thursday Apr 14, 2011

Flat-Combining NUMA Locks

Flat-Combining NUMA Locks to appear in SPAA 2011.

Tuesday Apr 05, 2011

Cache index-aware memory allocation

Cache Index-Aware Memory Allocation to appear in
International Symposium on Memory Management (ISMM) 2011. As a consequence of the design, this "malloc" allocator is also NUMA-aware and NUMA-friendly, assuming the system uses the usual default first-touch NUMA page placement policy.


Cache index-aware memory allocation
Yehuda Afek, Dave Dice, Adam Morrison
ISMM '11 Proceedings of the international symposium on Memory management, 2011


Tuesday Mar 01, 2011

crazy concurrency

Crazy Concurrency -- high-end nerdcore.

Tuesday Feb 15, 2011

atomic fetch-and-add vs compare-and-swap

There are a number of cases where an atomic fetch-and-add instruction might yield better performance than the classic Load;Φ;CAS loop, where CAS is the atomic compare-and-swap instruction. The x86 architecture exposes LOCK:XADD which I'll use in the discussion below. (If you don't need the original value you can also use LOCK:INC or LOCK:ADD instead of LOCK:XADD).


  1. CAS is "optimistic" and admits failure, whereas XADD does not. With XADD there's no explicit window of vulnerability to remote interference, and thus no need for a retry loop. Arguably, XADD has better progress properties, assuming the underlying XADD implementation doesn't have an implicit loop, but even in that case the window would be narrower than with Load;Φ;CAS.

  2. If you use the typical Load;Φ;CAS idiom, assuming normal snoop-based cache coherence, the Load may cause a read-to-share bus transaction to get the underlying cache line into S or E state. The CAS, which effectively has store semantics with respect to the cache coherence protocol, may induce another bus transaction to upgrade the line to M state. So in the worst case the idiom might incur two bus transactions, but an XADD implementation will usually drive the line directly into M state, requiring just one bus transaction. Of course you might be able to speculate on the value and have a fast-path that tries a "naked" CAS without any prior load. (I use this frequently in native HotSpot code). Also, it's possible for sophisticated processor implementations to perform coherence speculation and aggressively probe the target line into M state. Finally, in some cases you can profitably insert a prefetch-for-write (PREFETCHW) instruction prior to the load to avoid the upgrade transaction. But that approach has to be applied carefully as in some cases it can do more harm than good. Given all that, XADD, where available, has an advantage.

  3. Let's say you were trying to increment a variable with the usual Load;INC;CAS loop. When the CAS starts failing with sufficient frequency you can find that the branch to exit the loop (normally taken under no or light contention) starts to predict toward the failure path where we stay in the loop. So when the CAS ultimately succeeds, you'll incur a branch mispredict when trying to exit the loop. This can be quite painful on processors with deep pipelines and lots of out-of-order speculative machinery. Typically, this is in a piece of code where you don't want a long stall. Relatedly, when the CAS starts to fail frequently, the branch begins to predict that control stays in the loop and in turn the loop runs faster and we cycle around more quickly to the CAS. Typically, we'd want some back-off in the loop. Under light load with infrequent failures the mispredict to retry the loop served as a potentially useful implicit back-off. But under higher load we lose the benefit of the implicit back-off arising from the mispredict. With XADD there's no loop, so none of these issues arise.

Even though CAS has a higher (infinite) consensus number, there are times when fetch-and-add -- which has a consensus number of just 2 -- is preferable.
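
To make the contrast concrete, here is a minimal sketch using C++ atomics rather than raw assembly (the shared counter and function names are mine, not from any particular code base). Compiled for x86, fetch_add typically lowers to LOCK:XADD, while the loop lowers to a load followed by a LOCK:CMPXCHG retry loop:

#include <atomic>

std::atomic<long> counter{0};   // hypothetical shared counter

// Classic Load;Phi;CAS idiom: optimistic, with a retry loop and a window of
// vulnerability between the load and the CAS.
long increment_with_cas() {
  long observed = counter.load(std::memory_order_relaxed);
  while (!counter.compare_exchange_weak(observed, observed + 1)) {
    // compare_exchange_weak refreshes 'observed' on failure; a real
    // implementation might add back-off here under contention.
  }
  return observed;                // prior value
}

// Fetch-and-add: no retry loop and no explicit window of vulnerability.
long increment_with_xadd() {
  return counter.fetch_add(1);    // typically a single LOCK:XADD on x86
}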

Update: see CR7023898

Monday Feb 14, 2011

False sharing induced by card table marking

Garbage-collected runtime environments frequently use card tables in conjunction with write barriers to accelerate root-scanning during garbage collection. (See A Fast write barrier for generational garbage collectors by Urs Holzle, ECOOP-OOPSLA 1993 for details). Briefly, and skipping a few details, this design partitions the heap into an array of power-of-two sized card pages. When a mutator stores into a reference field of an object the runtime write barrier code will mark the card table entry for the card page containing (covering) that field as dirty. In the HotSpot JVM the card page size is 512 bytes and the card table is implemented as a simple array of bytes. That is, a given card table entry, which represents the state of a card page, is just one byte. The write barrier is emitted directly by the JIT and is usually just a shift and store instruction. In a subsequent non-moving minor GC, the collector can avoid scanning reference fields in a card that is not dirty.

This design is well-proven and widely employed but unfortunately it can result in performance issues in highly concurrent environments. Let's say our cache line size is 64 bytes, which is fairly common in modern processors. This means that 64 cards (64 × 512 bytes = 32KB) will share a cache line in the card table. So reference stores by different threads that just happen to fall within the same 32KB region cause writes to the same cache line underlying the card table. This can result in excessive write invalidation and cache coherence traffic, which can reduce performance and impede scaling. Interestingly, the impact can worsen after a full/moving GC: threads tend to allocate into different address ranges by virtue of thread-local allocation buffers (TLABs), but after a full collection the remaining objects tend to be more tightly packed and thus more prone to the problem. Furthermore, most card table stores are redundant, as often the card is already marked dirty. This suggests a simple solution: instead of using an unconditional store in the barrier, we first check the card table entry and only store if it is clean. This slightly increases the barrier path-length and adds a conditional branch -- unless we were to be somewhat clever with conditional moves, annulling a redundant store by changing the destination address to be a thread-local dummy variable. On the other hand it avoids the false sharing problem. (For historical background, some years ago Doug Lea noticed an odd slow-down after a full GC in some concurrent Java code. He contacted me and I speculated that false sharing in the card table could be the issue. We conjured up a JVM with an experimental -XX:+UseCondCardMark flag that let us emit write barriers as either the usual unconditional store, or a conditional form that avoids redundant stores. The conditional form provided relief).
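
To make the two barrier flavors concrete, here is a minimal sketch. The function names are mine and the card-table base handling is simplified; the real barrier is a couple of instructions emitted inline by the JIT, and a card value of 0 is assumed to denote dirty, as in HotSpot:

#include <cstdint>

// Unconditional form: always store, so unrelated threads writing references in
// the same 32KB region keep writing to the same card-table cache line.
inline void card_mark_unconditional(volatile unsigned char* card_table, void* field_addr) {
  card_table[reinterpret_cast<std::uintptr_t>(field_addr) >> 9] = 0;  // 512-byte cards => shift by 9
}

// Conditional form (-XX:+UseCondCardMark): check first and skip the redundant
// store, trading a load and branch for much less write-invalidation traffic.
inline void card_mark_conditional(volatile unsigned char* card_table, void* field_addr) {
  volatile unsigned char* card = &card_table[reinterpret_cast<std::uintptr_t>(field_addr) >> 9];
  if (*card != 0) {
    *card = 0;
  }
}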

I ran into the problem recently when experimenting with some concurrent queue implementations, which are reference-heavy, on a 256-way Sun/Oracle T5440. This is a 4-socket system where each socket contains a Niagara T2+ UltraSPARC processor having 64 logical processors. My benchmark has 50 producer threads and 50 consumer threads and measures and reports the message throughput over a timing interval. In the default configuration we can pass about 8.8K messages per msec. Using the same JVM, when I add the -XX:+UseCondCardMark flag we can achieve 52.5K messages per msec, clearly demonstrating the magnitude of the effect.

I should also note that we ran into this same issue when experimenting with Java-level lock elision using hardware transactional memory. If two unrelated concurrent transactions happened to store into reference fields in the same 32KB region we'd have false aborts because of write-sharing on the card table cache line. Again, -XX:+UseCondCardMark provided relief.

Update: see CR7029167

Thursday Jan 20, 2011

MultiLane : a concurrent blocking multiset

MultiLane : a concurrent blocking multiset - under submission.


It's rather trivial to extend the general idea and construct a deque instead of a queue-like construct.

Friday Jan 14, 2011

Partitioned Ticket Lock

The Partitioned Ticket Lock - under submission.

Friday Oct 15, 2010

Submissions are being accepted for VEE 2011

Submissions are now being accepted for VEE 2011 (more formally, the 2011 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments). VEE is co-located with ASPLOS this year.

Tuesday Sep 07, 2010

Cache index-aware memory allocation

Cache Index-Aware Memory Allocation (under submission) by Dave Dice, Yehuda Afek and Adam Morrison. We describe a relatively simple set of changes to malloc allocators that may reduce the conflict miss rate of applications accessing blocks returned by malloc.

Wednesday Jul 21, 2010

Solaris scheduling : SPARC and CPUIDs

Since it's a commonly asked question and a source of confusion, I thought I'd write up the following.

First, I should introduce some terminology and state the mapping between Solaris logical CPUIDs and physical resources. On a Solaris/SPARC Niagara-2 system the logical CPUIDs map geographically to physical IDs and resources. You can interpret a logical CPUID as a bit field, from the low-order bits up: 2 bits of strand, 1 bit of thread group, 3 bits of core, with the remaining high-order bits selecting the die. That is, you have 8 cores per die, 2 thread groups per core, and 4 strands per thread group. All the logical processors on a given core share a level-1 cache. The "ThreadGroup" is a rather obscure name for a pipeline. On a Niagara-1, for instance, there is only 1 pipeline per core, but you have 2 per core on an N2. You can query the CPUID on which a thread is currently running with getcpuid(), which is extremely fast.
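
As an illustration, decoding a logical CPUID into its geographic components looks roughly like the sketch below. This is just my rendering of the layout described above, not a supported interface; the authoritative topology data comes from the kernel.

struct CpuGeo {
  int die;
  int core;          // relative to the die
  int threadGroup;   // pipeline, relative to the core
  int strand;        // relative to the thread group
};

// Decode a Niagara-2 logical CPUID: 2 bits of strand, 1 bit of thread group,
// 3 bits of core, and the remaining high-order bits selecting the die.
static CpuGeo decodeCpuId(int cpuid) {
  CpuGeo g;
  g.strand      =  cpuid       & 0x3;   // 4 strands per thread group
  g.threadGroup = (cpuid >> 2) & 0x1;   // 2 thread groups (pipelines) per core
  g.core        = (cpuid >> 3) & 0x7;   // 8 cores per die
  g.die         =  cpuid >> 6;          // 64 logical CPUs per die
  return g;
}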

Assuming a simple model where all your threads remain runnable and there's no preemption ...

With unbound threads, the Solaris scheduler will balance (disperse fairly) 1st over dies, then over cores, then over pipelines, in order to maximize true parallelism and, to the extent possible, avoid contention over shared resources by placing threads as 'far away' from each other as possible. This is usually the best policy -- particularly for completely independent threads -- but beware that it ignores the issue of inter-chip coherency costs. If you have high coherence costs (writes to shared variables) then packing a group of communicating threads on-chip can sometimes be better than letting them disperse over multiple chips. (As an aside, you're typically much better off letting the scheduler assign CPUs than trying to bind threads yourself. Naive binding -- say, with sequential CPUIDs -- will almost always result in suboptimal performance).

In the future these policies might shift to allow better power management by trying to keep dies or cores "parked" (idle, drawing less power). Digressing slightly, recent Linux schedulers do try to impose some type of power management by default, making it sometimes hard to squeeze maximal performance out of an MP Nehalem/AMD system and introducing confounding factors for those of us benchmarking.

If the threads are entirely CPU-bound then typically the scheduler will place them and the thread:cpu relationship then becomes fairly stable. Even if the thread blocks briefly, if it comes back ONPROC (running) in under 3 msecs it's considered to have residual affinity and will go back to the processor where it last ran, barring gross dispatch queue length imbalance. Check the OpenSolaris kernel sources for "rechoose_interval" if you're curious.

For background, each core has its own local dispatch queue and makes its own local scheduling decisions. There's no centralized global scheduler agent and no centralized scheduling data structures (with the exception of a queue for unbound real-time threads). The scheduling policies and parameters are designed such that the local & independent decisions collectively result in achieving the desired global scheduling policy. (If you're biologically inspired, think of a communal insect colony). At a high level the scheduler attempts to maximize aggregate useful work completed in unit-time. Specifically, the scheduler tries to maximize the # of cpus doing useful work as well as attempting, where possible, to maintain affinity (minimize migration). With CMT/CMP/HT systems it also tries to disperse work (LWPs) over cores. As I mentioned above, over time more constraints are being added to the scheduler's responsibilities, such as trying to minimize or reduce power consumption, striking a balance between performance and energy.

The system disperses threads via stealing (pulling, where idle CPUs steal work from other dispatch queues) and queue balancing (pushing, where if there's a gross imbalance in dispatch queue depth, a local scheduler will try to pass some blocked threads to other less-loaded dispatch queues).

As an aside, threads in the real-time scheduling class cause more centralized dispatch. If they're unbound they can induce lots of preemption and migration, but that's a topic for another day.

Tuesday Jun 01, 2010

Europar 2010 : Transactional Mutex Locks

Transactional Mutex Locks by Luke Dalessandro, Dave Dice, Michael Scott, Nir Shavit and Michael Spear will appear in Europar 2010.

Monday Apr 05, 2010

TLRW: Return of the Read-Write Lock

TLRW: Return of the Read-Write Lock by Dave Dice and Nir Shavit will appear in SPAA 2010. (The paper introduces the concept of the TLRW-ByteLock).

Simplifying Concurrent Algorithms by Exploiting Hardware Transactional Memory

Simplifying Concurrent Algorithms by Exploiting Hardware Transactional Memory will appear in SPAA 2010.

Wednesday Feb 03, 2010

QPI Quiescence

It's not uncommon to find Dekker-like idioms in modern concurrent programs. On platforms with weaker memory models -- say where a store followed by a load in program order can be reordered by the architecture to appear as a load and then a store in the effective memory order (sometimes called the "visibility order") -- programs must use barrier instructions to enforce memory ordering to implement Dekker correctly. For the purposes of discussion and assuming a relatively common system model we'll define memory order as the order of operations as they appear at the interface between the processor and the first-level coherent cache. Examples of barriers are MFENCE on x86 and MEMBAR #storeload on SPARC. In addition, x86 and SPARC TSO memory models allow only one variety of architectural reordering, the store-then-load form noted above. (For simplicity we'll restrict the discussion to TSO-like memory consistency models). On some platforms barriers introduce significant local latency. Perversely, we sometimes find that atomic instructions which have barrier semantics (are barrier-equivalent) are faster than the purpose-defined barrier instructions. A simplistic barrier implementation might simply quiesce the pipeline and wait for the store buffer to drain. To allay a common misconception it's worth pointing out that barriers -- sometimes called fences -- are typically implemented as processor-local operations and don't cause any distinguished action on the bus or interconnect and instead simply instruct the processor to ensure that prior stores become visible before subsequent loads (subsequent and prior refer to the barrier in program order). That is, they don't force anything to happen -- such as coherence messages on the bus -- that were not already destined to occur. Instead, they simply enforce an order, momentarily reconciling program and memory order. Crucially, at least with current x86 and SPARC implementations, barriers don't force anything to occur off-processor. That also means they don't impede or impair scalability. There's no fundamental reason, however, why barriers should be so slow. The processor implementation is free to speculate over the barrier, for instance, as long as stores in the speculative episode are not made visible and loads in the episode are tracked for coherence. And in fact on at least one processor, barrier instructions effectively have 0 latency.

Returning to the Dekker idiom, threads T1 and T2 might coordinate as follows: T1 might execute (ST A; barrier; LD B) and T2 executes (ST B; barrier; LD A), and in particular we refer to this pattern as the Dekker duality. As a concrete example, we coordinate thread state transitions in the HotSpot JVM via a similar protocol, where T1 is a Java thread (mutator) executing the reentry path from a JNI call, T2 has the role of the VM thread coordinating a stop-the-world safepoint, A is a variable that indicates T1's thread state (executing outside the JVM on a JNI call, or executing inside the managed runtime), and B indicates if a stop-the-world safepoint is pending. Critically, if T1 is running on a JNI call and attempts to return back into the managed environment while a safepoint is executing, we need to stall T1 at the point of ingress, as the VM expects that Java threads will not access the heap during a safepoint. (Among other uses, safepoints are employed for certain types of garbage collection operations, for instance where we don't want the collector and the Java threads accessing the heap simultaneously). For the purposes of illustration I'm showing just a single mutator thread T1 and a single VM thread T2, but in practice the mechanism is much more general. T1's path, above, is likely to execute much more often than T2's, as JNI calls could be expected to occur more frequently than safepoints. As such, to improve performance we'd like to elide the barrier instruction from T1's path. Asymmetric Dekker Synchronization is a family of related mechanisms that allow us to safely remove the barrier from T1 while shifting the responsibility of dealing with T1's potential reorderings to T2. We call it asymmetric because to be profitable T1's path needs to run much more frequently than T2's. We then call T1's path the fast-path and T2's the slow-path. (This mechanism can be enabled and disabled by way of the -XX:+/-UseMembar switch).
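
For reference, here is a minimal sketch of the symmetric form of the duality in C++. It is illustrative only -- the real HotSpot paths manipulate thread-state and safepoint structures, not two bare flags -- with A and B standing in for T1's thread state and the safepoint-pending flag; the seq_cst accesses make the compiler emit the required StoreLoad barrier on TSO platforms. The asymmetric variant removes the barrier from t1() and compensates in t2() as described above.

#include <atomic>

std::atomic<int> A{0};   // stand-in for T1's thread state
std::atomic<int> B{0};   // stand-in for "safepoint pending"

void t1() {                                 // fast path (JNI reentry)
  A.store(1, std::memory_order_seq_cst);    // ST A
  if (B.load(std::memory_order_seq_cst)) {  // barrier implied between ST and LD
    // a safepoint is pending: stall here before touching the heap
  }
}

void t2() {                                 // slow path (VM thread)
  B.store(1, std::memory_order_seq_cst);    // ST B
  if (A.load(std::memory_order_seq_cst)) {  // barrier implied
    // T1 is inside the managed runtime: wait for it to reach a safe state
  }
}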

The Asymmetric Dekker Synchronization document mentioned above enumerates a number of ways in which we might allow T1 and T2 to coordinate while still removing the barrier from T1's hot path, including signals, cross-calls (inter-processor interrupts), page-protection mechanisms, etc. On Windows T2 might simply invoke the FlushProcessWriteBuffers facility, which seems to precisely match our needs. (Some time ago I filed an RFE -- request-for-enhancement -- for Solaris to provide a similar facility). Still, we're always looking for better ways to implement our asymmetric protocol, which almost brings us to QPI quiescence, but first we need some historical background.

Long ago Intel implemented atomics with a global bus lock. It was called the #LOCK signal and driven by the LOCK: prefix on instructions, thus the names we have today. Bus locking was conceptually simple as most multiprocessor Intel systems used a common front-side bus (FSB) between the processor and memory. Unrelated atomic operations, however, could impair overall performance as #LOCK had to quiesce the bus. The old FSB was a split-transaction request-response bus, and allowed multiple requests in-flight at a given time, so asserting #LOCK too often could rob the system of performance. Bus locking also supported atomics that spanned 2 cache lines. Intel subsequently switched to so-called cache-locking, where atomic read-modify-write instructions were implemented directly in the local cache of the processor executing the atomic, avoiding the need to lock the shared bus. From the perspective of the bus such atomic operations are no different than a store. (All SPARC systems that I know of use cache-locking). Cache-locking was a good step forward as atomics now scale ideally if there's no sharing of the underlying cache lines. Despite that, Intel preserved bus locking to handle the exotic legacy case of atomics that span cache lines (split atomics), which, by definition, are misaligned accesses. For this odd case the best solution was to simply resort to bus locking so the two lines underlying the operand could be accessed atomically. Note that Intel and AMD appear to frown upon such behavior in their reference manuals, but the processors still support it for legacy reasons, at least as of today.

With the advent of QuickPath Interconnect (QPI) Intel eliminated the common FSB and switched to a topology more akin to AMD's HyperTransport. Nehalem is the first processor to use QPI. But even with QPI the architects needed a way to support those legacy split atomics. To that end, QPI has the ability to quiesce the whole system to allow the split atomic to execute. It appears that QPI quiescence also drains the pipes and forces at least the equivalent of barrier semantics over the whole system. That is, split atomics may serve as a way to force a system-wide "remote" barrier.
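
For the curious, a split atomic is easy to provoke. The sketch below is a deliberate misuse, strictly for experimentation and never for production code: it places a 4-byte operand so that it straddles two 64-byte lines and applies a LOCK'd add to it via GCC's __sync builtin.

// Provoke a split (cache-line-crossing) LOCK'd atomic on x86.
// Deliberately non-portable; for experiments only.
int provoke_split_atomic() {
  alignas(64) static char buffer[128];                                 // line-aligned backing store
  volatile int* split = reinterpret_cast<volatile int*>(buffer + 62);  // 4-byte word spans two lines
  return __sync_fetch_and_add(split, 1);                               // misaligned LOCK'd RMW
}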

Additional remarks



  • Beware, it's not clear that QPI quiescence is actually safe and provides true quiescence. Empirical tests with a program designed to stress the -UseMembar facility and inspired by a simple hunch about the QPI implementation suggest that it does, but absence of evidence isn't evidence of absence -- we might yet find Karl Popper's black swan.

  • At best, QPI quiescence should be considered an academic curiosity and should never be used in production code. I'm sure processor vendors would be loath to endorse such stunts because it could ultimately limit their latitude in future bus designs. QPI quiescence is simply an implementation artifact and not an architecturally defined or guaranteed trait. Even if the facility were somehow blessed I doubt vendors would want to expose it by means of split atomics, so perhaps a new instruction might be called for. (Put another way, there are two issues: should the facility be provided at all, and if so, how should it be exposed to programmers?) So for the moment QPI quiescence is only good for prototyping what might be accomplished with a hypothetical instruction.

  • It's possible such mechanism might be applicable to certain forms of RCU (read-copy-update).

  • Dmitriy V'jukov recently posted some timing results for FlushProcessWriteBuffers. It's pretty efficient on his system. I hope to be able to run similar benchmarks to measure the cost of QPI quiescence in the near future, at which point I'll post the results here. (Dmitriy is also the author of the relacy race detector which I recommend). It's worth noting that implementations of facilities such as FlushProcessWriteBuffers can be made very efficient. For example the implementation might be able to avoid a cross-call to a processor if it's known that no thread in the process is executing on that processor, using the knowledge that context switches are serializing events.


    Some extremely preliminary data shows that QPI quiescence by way of a split atomic incurs a local penalty of about 4800 cycles on an i7-920, and the degree of impact on the progress of other processors is very much a function of the miss rate of those processors.

  • I mentioned the idea of QPI quiescence to Dmitriy, who in turn pointed out a relevant article on QPI in Dr. Dobb's. The "Locks" section is particularly interesting. As Dmitriy noted, if it's possible to use quiescence for hot-plugging then it's not unreasonable to think that both the bus and processors are completely quiesced with no pending stores languishing in store buffers, which is precisely the behavior in which we're interested.

  • Is there a working analog to QPI quiescence on AMD's coherent hypertransport? This is left as an exercise for the curious reader.

  • Can QPI quiescence lead to a break-down of performance isolation in virtual machines running on the same physical system? (That's pretty easy to test even in the absence of virtual machines, with a simple multithreaded program or a few single-threaded processes).

Related reading



  • For another example of the Dekker duality in the HotSpot JVM and further discussion about the challenges of weak memory models
    refer to a previous blog entry about a long-standing bug in the park-unpark subsystem.

  • Biased locking, which is used in the HotSpot JVM, is a mechanism that attempts to reduce the impact of high-latency atomic instructions. Interestingly, as processor vendors make improvements in the latency of such instructions there may come a time in the near future when biased locking is no longer profitable, at least on some platforms.

  • US07644409 and US7475395

Tuesday Jan 26, 2010

A scheduling and dispatching oddity on Linux

While benchmarking a concurrent application on Linux I ran into an odd problem worth relating. Specifically, I'm using ubuntu 9.10 with linux kernel 2.6.31-1 running on a 1x4x2 Core i7-920 "Nehalem" (1 package; 4 cores/package; 2 logical processors/core via hyperthreading). I'd noticed that our scaling numbers were a bit odd, with more than the usual fall off past 4 threads (it's a 1x4x2 system so we expect some fade past 4 threads, even for ideally scalable benchmarks) and more variance than I expected. The benchmark harness runs for fixed time and reports aggregate progress over the measurement interval, and of course the system is otherwise idle. As a sanity check the benchmark also reports the total user CPU time consumed over the measurement interval. Interestingly, the CPU time values were unexpectedly low, considerably less than min(#threads, #cpus) * MeasurementInterval. All the threads should stay running, or at least ready, for the duration of the interval. Note too that the effect is independent of and still occurs with long intervals, so it's not a simple issue of allowing the scheduler time to steal and rebalance or otherwise converge on a state where the runnable threads are well-dispersed over the processors.

It appeared that the scheduler wasn't aggressively dispatching ready threads onto idle CPUs. Put another way there were prolonged periods where we had both idle CPUs and ready threads at the same time -- the kernel was failing to saturate the available processors.

To avoid the problem I initially tried binding threads to processors via sched_setaffinity, which provided complete relief. Still, I'm cautious about binding because it requires knowledge of the platform topology. On SPARC/CMT/Solaris, for instance, logical CPUIDs map to physical resources geographically in the following manner: the bits in the logical CPUID select, from most significant bits to least significant, chip then core then pipeline ("thread group" in Sun terminology) then strand. So if you just bind threads by "natural" order (thread N to CPUID N) then you'll end up with many threads sharing some cores and other cores completely idle, which is likely undesirable and may yield skewed scaling results. This, btw, is a common benchmarking pitfall. On Solaris/SPARC you're better off letting the kernel disperse threads onto processors as it'll balance 1st over chips, then cores, then pipelines, which is optimal for independent threads to make headway. (That policy is clearly not optimal if there's sharing -- particularly if there are writers -- in which case you might win by packing the threads less "distant" from each other, for some interconnect distance metric, and assuming the increased cache pressure and replication doesn't do too much harm). Unlike SPARC/Solaris, the logical operating system-level CPUID to physical resource mappings on my ubuntu/x64 system are well-dispersed if you use natural CPU assignment, but there's no hard guarantee of that property although I vaguely recall that Intel advises a certain canonical mapping. In more detail, the logical CPUID to physical mapping on my system -- as discovered by iterating over the logical CPUID values and using the CPUID instruction to query physical resources -- is: 0 to C0S0, 1 to C1S0, 2 to C2S0, 3 to C3S0, 4 to C0S1, 5 to C1S1, 6 to C2S1, 7 to C3S1, where C is the core# and S is the relative strand# on the core.
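
For completeness, the binding work-around itself is just a few lines. This is a sketch, not the benchmark harness: the CPU numbering must be discovered per-machine, as above, rather than assumed.

#define _GNU_SOURCE
#include <sched.h>

// Bind the calling thread to a single logical CPU.
static int bind_self_to_cpu(int cpu) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  return sched_setaffinity(0, sizeof(set), &set);   // pid 0 == the calling thread
}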

I'm guessing that the Linux kernel I'm using institutes policies that attempt to balance power with performance whereas Solaris currently optimizes for performance. After further poking through the Linux kernel sources I realized we could adjust the scheduling policy more to our liking via tunables exposed via the /proc file system. At that point I came upon the tune-sched-domains script that makes it easy to quickly adjust scheduler tunables. (Note that the script assumes bash). First, run tune-sched-domains with no arguments and examine the SD_WAKE_AFFINITY and SD_WAKE_IDLE settings. We want SD_WAKE_AFFINITY clear and SD_WAKE_IDLE set. (If I'm interpreting the comments in the kernel code correctly, WAKE_AFFINITY appears to try to place the wakee on the same CPU as the waker, presuming they communicate through memory that's already in the local cache, while WAKE_IDLE instructs the kernel to aggressively wake idle CPUs when making threads ready). If necessary, compute a new SD_ mask value and run the script again, passing the value (in decimal) as an argument to the script. These settings provided relief for the under-utilization problem.

In addition I noticed that the HotSpot JVM performed much better on multi-threaded workloads under the settings mentioned above.

While I didn't have time for the experiments, it may be the case that adjusting the LB_BIAS flag may also provide relief.

Monday Nov 30, 2009

A race in LockSupport park() arising from weak memory models

I recently diagnosed the root cause of a concurrency bug, CR6822370,
and thought it sufficiently interesting to share the details. (CR 6822370 actually represents a cluster of bugs that are now thought to be related by a common underlying issue). Briefly, we have a lost wakeup bug in the native C++ Parker::park() platform-specific
infrastructure code that implements java.util.concurrent.LockSupport.park(). The lost wakeup arises from a race that itself arises because of architectural reordering that in turn occurs because of missing memory barrier instructions. The lost wakeup may manifest as various 'hangs' or instances of progress failure.

For background, a Parker is a native HotSpot C++ type that implements a construct that's somewhat like a restricted-range binary semaphore, except that park() is allowed to return spuriously. See LockSupport for details. If a thread is blocked in park() we're guaranteed that a subsequent unpark() will make it ready. (A perfectly legal but low-quality implementation of park() and unpark() would be empty methods, in which case the program degenerates to simple spinning. And in fact that's the litmus test for correct park()-unpark() usage). A Parker instance is associated with at most one thread at any one time and only that designated thread may invoke park() on that particular Parker. Any thread, however, may call unpark() on a given Parker. Furthermore, Parker instances are type-stable and immortal, at least in Sun's HotSpot implementation. On Solaris and Linux a Parker contains a pthread condvar (named _cond), a mutex (named _mutex), and a volatile integer, _counter, that represents the state of the semaphore. Despite its name, _counter only takes on the values 0
(neutral) and 1 (signaled). Each thread has a Parker instance dedicated to use by LockSupport.

Parker::park() and unpark()

// Redacted and annotated for clarity
void Parker::unpark() {
  pthread_mutex_lock(_mutex);
  int s = _counter;
  _counter = 1;
  pthread_mutex_unlock(_mutex);
  if (s < 1) {
    // We signal after having dropped the lock to minimize the hold time and
    // in case the underlying implementation doesn't provide wait morphing.
    pthread_cond_signal(_cond);
  }
}

void Parker::park() {
  if (_counter > 0) {
    // Optional optimization to avoid taking the lock
    _counter = 0;
    return;
  }
  if (pthread_mutex_trylock(_mutex) != 0) {
    // Another optional optimization
    // Must be a concurrent unpark() - just return
    return;
  }
  if (_counter > 0) {   // no wait needed
    _counter = 0;
    pthread_mutex_unlock(_mutex);
    return;
  }
  pthread_cond_wait(_cond, _mutex);
  _counter = 0;
  pthread_mutex_unlock(_mutex);
}

Failure scenario


  1. Let's suppose we have a thread T1 whose LockSupport/Parker instance has _counter=1.

  2. T1 calls LockSupport.park() which then invokes Parker::park(). The park() operator fetches _counter, observes 1, then sets _counter=0 and returns. There are no atomic or
    fence instructions in this path. Crucially, the store of 0 into _counter languishes in the processor's local store buffer and is not yet visible to other processors.

  3. T1 returns from park(), checks for the condition of interest and observes that it's not yet signaled, and again calls park(). The store to _counter from step (2) is still languishing in the store buffer and is not yet globally visible. Control reaches the very top of park().

  4. Thread T2 causes the condition of interest to enter signaled state via some stores. These are typically stores to Java volatile variables, so the JIT will emit the necessary memory barriers after the stores.

  5. Thread T2 then calls unpark(T1). The unpark() operator stores 1 into _counter. We'll assume that this store becomes globally visible immediately. T2 returns from unpark().

  6. Thread T1's old store of 0 into _counter from step (2) finally drains from its local store buffer and becomes globally visible, overwriting the value 1 just written by T2 in step (5). With due lack of formal precision, _counter is now "globally" 0.

  7. Thread T1 in park() fetches _counter, observes 0, and then blocks on the condvar. We have a lost wakeup — T1 stalls.

Remarks and analysis


  • The underlying issue is a classic optimization in park() which tries to avoid taking a lock. The _counter instance variable is often but not always accessed under _mutex. If the underlying platform provided sequential consistency then the code as shown above would be correct, but x86 and SPARC provide weaker memory consistency models, allowing memory or visibility order to differ from program order. (Refer to the excellent papers by Peter Sewell et al. for background). Given those architectural reorderings, the program admits a race which can in turn result in missed wakeups.

  • The problem would only manifest when we were using the -UseMembar optimization that lets us remove fences from certain hot thread state transition paths that need to coordinate safepoints between mutator threads and the JVM. This feature is enabled by default, but we can turn it off with the -XX:+UseMembar switch, which causes the JVM to emit normal fence instructions in the state transition paths. (That particular optimization is an example of asymmetric Dekker synchronization). Crucially, the park() path contains such a state transition. In reality the fence emitted by the +UseMembar switch was simply covering up the otherwise latent Parker:: bug. +UseMembar constitutes a work-around. Sensitivity to UseMembar was initially confounding but ultimately a valuable clue.

    After thinking about various timing scenarios I settled on the one given above as the most likely culprit. To support that hypothesis I wrote a simple C model of the pathology and verified that it would "fail" in a similar fashion. Having collected data with the C model on various platforms I suspect that processors where stores can languish in the store buffer for longer periods are more exposed to the bug.

  • Inserting appropriate barrier instructions after both stores of 0 into _counter in park() provides a solution; a sketch of the repaired park() appears after the appendix below. Furthermore, we're not formally guaranteed that pthread_mutex_unlock() has barrier semantics, so to be conservative we need a barrier in that location as well. For our purposes a fence instruction prevents subsequent loads (subsequent in program order) from executing before prior stores become globally visible. We typically use volatile to control for compile-time reordering and fences to control for architectural reordering.

  • The bug will not manifest on uniprocessors or environments where threads are otherwise constrained to just a single processor.

  • The bug is a "day-one" bug and present in all versions of HotSpot.

  • Parker::park() and unpark() reside in os_linux.cpp, os_solaris.cpp and os_windows.cpp for Linux, Solaris and Windows, respectively.

  • The built-in synchronized implementation uses a different park mechanism (PlatformPark::) whereas the java.util.concurrent infrastructure uses Parker::. Only Parker:: is vulnerable.

  • Additional reading:

Appendix - elaborated scenario

We have a volatile Java variable C, initially 0. Thread T1's Parker::_counter value is 1.


  • Thread T1 executes: while (C == 0) park() ;
  • Thread T2 executes: C=1; unpark(T1)

Timing:


  1. Thread T1 loads C, observes 0, and calls park()
  2. Thread T1 in 1st park() invocation:

    • Load _counter (1)
    • Store _counter (0) — languishes in store buffer
    • return from park()

  3. Thread T1: Load C (0)
  4. Thread T1: call park() a 2nd time
  5. Thread T2: Store C=1; MFENCE (by virtue of Java volatile).
    MFENCE is a completely local operation, influencing only T2.
  6. Thread T2: call unpark(T1)

    • lock _mutex with an atomic instruction such as CAS or LOCK:CMPXCHG
    • Load _counter (0) — observes value from memory, completely legal
    • Store _counter (1)
    • unlock _mutex

  7. Thread T1's store of 0 into _counter finally drains from the store buffer and becomes
    globally visible, overwriting the value 1 just stored by T2
  8. Thread T1 continues in 2nd invocation of park()

    • Load _counter (0)
    • lock _mutex
    • Load _counter (0)
    • block on _cond

Another way to think of the problem is via the Dekker duality. Observe that T1 executes { Store _counter; Load C; } while T2 executes { Store C; MFENCE; Load _counter; }. Note the missing MFENCE from T1's path. The duality is slightly obscured because the store into _counter occurs in the 1st invocation of park(), which returns immediately, and the load of C occurs in the application code. And in fact that's what distinguishes this bug: the failure idiom is distributed over two invocations of park().
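
For concreteness, here is a sketch of the repair described in the remarks above (not the literal patch that shipped). OrderAccess::fence() is HotSpot's full barrier, which on x86 emits MFENCE or an equivalent LOCK'd instruction.

// Redacted sketch of a repaired park(): each store of 0 into _counter is now
// followed by a fence so it can't languish in the store buffer past the
// caller's subsequent loads, and we don't rely on pthread_mutex_unlock()
// having barrier semantics.
void Parker::park() {
  if (_counter > 0) {
    _counter = 0;
    OrderAccess::fence();
    return;
  }
  if (pthread_mutex_trylock(_mutex) != 0) {
    return;
  }
  if (_counter > 0) {   // no wait needed
    _counter = 0;
    pthread_mutex_unlock(_mutex);
    OrderAccess::fence();
    return;
  }
  pthread_cond_wait(_cond, _mutex);
  _counter = 0;
  pthread_mutex_unlock(_mutex);
  OrderAccess::fence();
}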

Update: Scott Owens in Peter Sewell's group coined the term triangular race in his 2010 ECOOP paper "Reasoning about the Implementation of Concurrency Abstractions on x86-TSO" to describe this type of situation.

Thursday Oct 29, 2009

ROCK Hardware Transactional memory - Sun technical report

We recently published an expanded version of our earlier ASPLOS paper as a Sun Labs technical report.

Wednesday Sep 23, 2009

The perils of negative scalability

I've seen the following issue confound customers and colleagues of late, so thought it worth a blog entry.

Let's say you have an application that exhibits negative scalability. That is, if you were to plot throughput on the Y-axis and concurrency on the X-axis the shape of the curve would be convex downward -- performance climbs up to an apex and then falls off. (How can this happen? A common reason is that the communication overheads start to dominate and the benefit of concurrency is overcome by communication and synchronization costs). Under such circumstances it's common to introduce some type of admission control -- say, simple back-off or more elaborate mechanisms -- to restrict concurrency. Ideally, this yields an asymptotic curve where performance remains constant after reaching the peak, avoiding any fall-off.

If you tune the performance of such an application using the usual measure-analyze-modify cycle but pay attention only to the throughput values at high concurrency levels then you might be badly misled. The usual development feedback loop can fail because poor "optimizations" that slow down the code may actually serve as inadvertent implicit back-off (contention management or admission control) that will attenuate the negative scalability at higher concurrency levels but also needlessly impair performance at lower concurrency levels. Ideally, back-off should be applied only as needed, in response to contention.

A related effect is that inserting diagnostic probes might yield better performance in the region of negative scalability because of probe overhead -- a performance "Heisenbug" where performance improves when execution is more closely observed.

The take-away is that we should be careful to measure the performance of any proposed change over a wide range of concurrency values, and not just at the extremes.

An interesting human analog is Brooks's law. The same issues -- the overheads of orchestrating large numbers of humans or threads -- may underlie both effects.

Monday Jun 01, 2009

X86 platform memory model - clarifications in recent Intel manuals

Intel has taken steps toward addressing some of the concerns about potential platform memory model relaxations that I identified in a
previous blog entry. Specifically see section 7.2.3.7 of their
latest Software Developer's Manual which now guarantees global store ordering.

Friday May 29, 2009

Instruction selection for volatile fences : MFENCE vs LOCK:ADD

In the past the JVM has used MFENCE, but because of latency issues on AMD processors and potential pipeline issues on modern Intel processors, it appears that a LOCK:ADD of 0 to the top of stack is preferable (Gory Details).
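
For illustration, the two idioms look roughly like this when hand-coded with GCC-style inline assembly for x86-64; the JIT of course emits the instruction directly, and the function names are mine.

static inline void fence_mfence() {
  __asm__ __volatile__("mfence" ::: "memory");
}

static inline void fence_lock_add() {
  // LOCK:ADD of 0 to the top of stack: a full StoreLoad barrier with ordinary
  // store semantics, often cheaper than MFENCE on the processors noted above.
  __asm__ __volatile__("lock; addl $0,(%%rsp)" ::: "memory", "cc");
}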

Sunday May 24, 2009

memcpy() concurrency curiosities

I was recently asked to diagnose a problem a customer was encountering that involved Java and the JNI getIntArrayElements() and releaseIntArrayElements() primitives. The outcome of the exploration was sufficiently interesting that I thought I'd mention it here in order that other JNI users might avoid the same pitfall. Briefly, the customer was running a highly threaded application on a machine with lots of available concurrency. The Java threads would repeatedly read from an array of int[]s that was effectively immutable. References to the array were also passed to native C code via JNI calls. Those native methods would access the array via the getIntArrayElements() and releaseIntArrayElements() primitives, but in no case did the native code ever modify the array contents. The problem was that the Java readers would occasionally observe scenarios that suggested that the contents of the array were transiently "flickering" to unexpected values -- ephemeral corruption -- and then returning to the expected values. (Recall that the array was effectively immutable, so this shouldn't have been happening). I carefully checked the Java code and native code and quickly convinced myself the source of the bug was elsewhere.

I next switched my attention to the implementation of GetIntArrayElements() and ReleaseIntArrayElements(). Briefly, GetIntArrayElements() will, as used by this particular application: (a) switch the caller's thread state back to "InVM". If there's a stop-the-world garbage collection event going on, the state transition operator should block the caller until the GC completes. At that point the thread can safely access the heap to extract the array data. The "InVM" thread state is such that GC should be inhibited while we're copying the data out. (b) malloc a properly sized array to hold the data; (c) memcpy() the data from the heap into that just-allocated array; (d) restore the thread state; and (e) return a pointer to the just-allocated array to the caller. ReleaseIntArrayElements() operates in a similar fashion but reverses the memcpy() direction (copying into the heap) and then releases the allocated array. Note that in our case the source and destination memcpy() regions are completely disjoint. My first thought was that something might have been wrong with the state transition operations, allowing what should have been stop-the-world GC to inadvertently run concurrently with the memcpy() operations. Further investigation ruled that out as there were no GC events when the apparent corruption was observed.

Upon reflection, however, I recalled that some optimized memcpy() implementations may execute explicitly coded benign speculative stores into the destination region. Another possibility is the block-initializing-store (BIS) used by memcpy(). A concurrent observer might fetch from a memcpy() target cache line in the narrow timing window between when the line was zeroed and when the contents were copied back into the line. In either case, the memcpy() in ReleaseIntArrayElements() could transiently mutate the contents of the Java array in the heap to some unexpected value. By the time the memcpy() finishes, of course, the destination buffer will again contain the expected contents. I quickly confirmed this scenario might occur by writing a quick throw-away multi-threaded C test case that modeled the JNI behavior and demonstrated the effect. The underlying mechanism at work deserves a bit more explanation. Let's say we're writing C code and have an effectively immutable array A. A contains a known set of values. Concurrently, we have a 1st thread reading from A and a 2nd thread memcpy()ing precisely the same set of values known to be in A back into A. At first glance it's not unreasonable to assume that the reader (the 1st thread) will always see the expected values in A. That is, we're assuming that memcpy()ing the same value back into an array is idempotent. The memcpy() operation is certainly idempotent from the point of view of the caller of memcpy(), but interestingly it's not idempotent from the point of view of concurrent observers. With optimized memcpy() implementations concurrent observers can see "odd" intermediate states. Naively, you'd expect that concurrent observers would see the array as unchanging, but that isn't the case on all platforms. In the case at hand the memcpy() in ReleaseIntArrayElements() is the source of the problem, as the array in the heap might "flicker". To further confirm this diagnosis I provided the customer with a simplistic byte-by-byte unoptimized memcpy() implementation in an LD_PRELOAD-able memcpy.so shared object. And indeed, the problem wasn't evident when running with this simple memcpy() instead of the optimized system form.
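
The throw-away test is easy to reconstruct. Here is a rough model of the effect -- mine, not the original, and formally a data race, which is exactly the point:

#include <cstring>
#include <cstdio>
#include <thread>

static int source[1024];
static int target[1024];

int main() {
  for (int i = 0; i < 1024; i++) source[i] = target[i] = 42;

  // Copier: repeatedly memcpy() identical contents over the target array.
  std::thread copier([] {
    for (;;) std::memcpy(target, source, sizeof(target));
  });

  // Observer: the target "should" always read 42, but an optimized memcpy()
  // using speculative or block-initializing stores can expose transient values.
  std::thread observer([] {
    volatile int* p = target;     // defeat load hoisting
    for (;;) {
      for (int i = 0; i < 1024; i++) {
        int v = p[i];
        if (v != 42) std::printf("flicker at [%d]: %d\n", i, v);
      }
    }
  });

  copier.join();      // runs until interrupted
  observer.join();
  return 0;
}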

While technically legal, such behavior in memcpy() -- and the JNI operators that call memcpy() -- is, at best, surprising and violates the principle of least astonishment. Strictly speaking, memcpy() provides no particular guarantees about concurrently observable intermediate states, but rather only specifies the state of the destination buffer at the end of the invocation. That is, while a memcpy() is in flight the destination buffer might contain transient or ephemeral values which could be observed by other threads. Overwriting an array with an exact copy of the array will be idempotent from the perspective of the caller after memcpy() completes, but concurrent observers of the array are permitted to see transient ephemeral "garbage" values. My first thought when we ran into this behavior some years ago was that it was a bug in the memcpy() implementation, but upon deeper reflection I concluded the bug was in my errant assumptions about how memcpy() worked. Having said that, while such behavior is technically permissible I'm also sympathetic that such effects are, at best, unexpected. It's not unreasonable to assume the destination buffer would remain unmolested. One could imagine customers easily falling into this trap, in both C code and Java.

Wednesday Mar 19, 2008

Carmina Burana Reinterpreted

Among other things, musicologists recontextualize and reinterpret older works for the current generation. If you love Carmina Burana but struggle with medieval high German or ecclesiastic "church" Latin, musicologists have an easy-to-digest answer (SFW).

Friday Mar 14, 2008

CAS and Cache Trivia - Invalidate or Update In-Place

In a previous blog entry on biased locking I noted that CAS (the atomic Compare-And-Swap instruction) usually has the same semantics as store with respect to caches and the interconnect. It's worth calling out an interesting exception, however. For background, the level-1 data cache -- often shortened to L1D$ or just D$ -- of Niagara-1 and Niagara-2 processors uses a write-around policy. All stores go over the cross-bar, but if the line is also present in the D$, the store instruction (ST) will also update the line in-place. This is where CAS differs from ST. If the line targeted by the CAS is also in the D$ of the processor executing the CAS, the line will be invalidated instead of being updated in-place. Of course it's not uncommon for the line to be present in the cache given the prevalence of the LD...CAS usage idiom. More importantly, it's extremely common for a thread to access the same line in short order after a CAS. This is where the CAS-Invalidates policy can impact performance. A good example is a lock that's acquired via CAS where the lock metadata is collocated with the data covered by the lock. Obviously, after having acquired a lock a thread is quite likely to access data protected by that same lock. The first data access to the just-CASed-to line will miss in the D$. Another example would be a thread that repeatedly locks and unlocks the same lock. Assuming the lock and unlock operators both use a LD...CAS idiom, even if the line containing the lock metadata wasn't accessed within the critical section, the CAS in the lock operation will cause the LD in the subsequent unlock to miss, and the CAS in unlock will cause the LD in the subsequent lock operation to miss. Thankfully on Niagara-family processors a miss in the D$ that hits in the level-2 cache can be satisfied fairly quickly. Still, we'd like to avoid that D$ miss if at all possible. I'd argue that to the extent feasible, processor designers should avoid a CAS-Invalidates policy, which I believe to be an implementation artifact and not strictly fundamental to CAS. CAS-Invalidates is cache-antagonistic to common synchronization usage.
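
As a concrete illustration of the collocated-lock case, consider the sketch below (mine, not taken from any particular runtime), where the lock word shares a cache line with the data it protects. On a processor with a CAS-Invalidates policy, the CAS in acquire() knocks the line out of the D$, so the very first access to value inside the critical section takes a D$ miss.

#include <atomic>
#include <cstdint>

struct alignas(64) Counter {
  std::atomic<std::intptr_t> lockWord{0};   // 0 == free, 1 == held
  long value{0};                            // collocated with the lock metadata
};

static void acquire(Counter& c) {
  std::intptr_t expected = 0;
  while (!c.lockWord.compare_exchange_weak(expected, 1)) {   // LD...CAS idiom
    expected = 0;                                            // spin; no back-off shown
  }
}

static void release(Counter& c) {
  c.lockWord.store(0, std::memory_order_release);
}

static void add_one(Counter& c) {
  acquire(c);
  c.value++;          // on a CAS-Invalidates D$ this first access misses
  release(c);
}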

About

Dave is a senior research scientist in the Scalable Synchronization Research Group within Oracle Labs : Google Scholar.
