By Dave on Feb 03, 2010
It's not uncommon to find Dekker-like idioms in modern concurrent programs. On platforms with weaker memory models -- say where a store followed by a load in program order can be reordered by the architecture to appear as a load and then a store in the effective memory order (sometimes called the "visibility order") -- programs must use barrier instructions to enforce memory ordering to implement Dekker correctly. For the purposes of discussion and assuming a relatively common system model we'll define memory order as the order of operations as they appear at the interface between the processor and the first-level coherent cache. Examples of barriers are MFENCE on x86 and MEMBAR #storeload on SPARC. In addition, x86 and SPARC TSO memory models allow only one variety of architectural reordering, the store-then-load form noted above. (For simplicity we'll restrict the discussion to TSO-like memory consistency models). On some platforms barriers introduce significant local latency. Perversely, we sometimes find that atomic instructions which have barrier semantics (are barrier-equivalent) are faster than the purpose-defined barrier instructions. A simplistic barrier implementation might simply quiesce the pipeline and wait for the store buffer to drain. To allay a common misconception it's worth pointing out that barriers -- sometimes called fences -- are typically implemented as processor-local operations and don't cause any distinguished action on the bus or interconnect and instead simply instruct the processor to ensure that prior stores become visible before subsequent loads (subsequent and prior refer to the barrier in program order). That is, they don't force anything to happen -- such as coherence messages on the bus -- that were not already destined to occur. Instead, they simply enforce an order, momentarily reconciling program and memory order. Crucially, at least with current x86 and SPARC implementations, barriers don't force anything to occur off-processor. That also means they don't impede or impair scalability. There's no fundamental reason, however, why barriers should be so slow. The processor implementation is free to speculate over the barrier, for instance, as long as stores in the speculative episode are not made visible and loads in the episode are tracked for coherence. And in fact on at least one processor, barrier instructions effectively have 0 latency.
Returning to the Dekker idiom, threads T1 and T2 might coordinate as follows: T1 might execute (ST A; barrier; LD B) and T2 executes (ST B; barrier; LD A), and in particular we refer to this pattern as the Dekker duality. As a concrete example, we coordinate thread state transitions in the HotSpot JVM via a similar protocol, where T1 is a Java thread (mutator) executing the reentry path from a JNI call, T2 has the role of the VM thread coordinating a stop-the-world safepoint, A is a variable that indicates T1's thread state (executing outside the JVM on a JNI call, or executing inside the managed runtime), and B indicates if a stop-the-world safepoint is pending. Critically, if T1 is running on a JNI call and attempts to return back into the managed environment while a safepoint is executing, we need to stall T1 at the point of ingress, as the VM expects that Java threads will not access the heap during a safepoint. (Among other uses, Safepoints are employed for certain types of garbage collection operations, for instance where we don't want the collector and the Java threads accessing the heap simultaneously). For the purposes of illustration I'm showing just a single mutator thread T1 and a single VM thread T2, but in practice the mechanism is much more general. T1's path, above, is likely to execute much more often than T2's, as JNI calls could be expected to occur more frequently than safepoints. As such, to improve performance we'd like to elide the barrier instruction from T1's path. Asymmetric Dekker Synchronization is a family of related mechanisms that allow us to safely remove the barrier from T1 while shifting the responsibility of dealing with T1's potential reorderings to T2. We call it asymmetric because to be profitable T1's path needs to run much more frequently than T2's. We then call T1's path the fast-path and T2's the slow-path. (This mechanism can enabled and disabled by way of the -XX:+/-UseMembar switch ).
The Asymmetric Dekker Synchronization document mentioned above enumerates a number of ways in which we might allow T1 and T2 to coordinate while still removing the barrier from T1's hot path, including signals, cross-calls (inter-processor interrupts), page-protection mechanisms, etc. On windows T2 might simply invoke the FlushProcessWriteBuffers facility, which seems to precisely match our needs. (Some time ago I filed an RFE -- request-for-enhancement -- for Solaris to provide a similar facility). Still, we're always looking for better ways to implement our asymmetric protocol, which almost brings us to QPI quiescence, but first we need some historical background.
Long ago Intel implemented atomics with a global bus lock. It was called the #LOCK signal and driven by the LOCK: prefix on instructions, thus the names we have today. Bus locking was conceptually simple as most multiprocess Intel systems used a common front-side bus (FSB) between the processor and memory. Unrelated atomic operations, however, could impair overall performance as #LOCK had to quiesce the bus. The old FSB was a split-transaction request-response bus, and allowed multiple requests in-flight at a given time, so to assert #LOCK too often could rob the system of performance. Bus locking also supported atomics that spanned 2 cache lines. Intel subsequently switched to so-called cache-locking, where atomic read-modify-write instructions were implemented directly in the local cache of the processor executing the atomic, avoiding the need to lock the shared bus. From the perspective of the bus such atomic operations are no different than a store. (All SPARC systems that I know of use cache-locking). Cache-locking was a good step forward as atomics now scale ideally if there's no sharing of the underlying cache lines. Despite that, Intel preserved bus locking to handle the exotic legacy case of atomics that span cache lines (split atomics), which, by definition, are misaligned accesses. For this odd case the best solution was to simply resort to bus locking so the two lines underlying the operand could be accessed atomically. Note that Intel and AMD appear to frown up such behavior in their reference manuals, but the processors still support it for legacy reasons, at least as of today.
With the advent of QuickPath Interconnect (QPI) Intel eliminated the common FSB and switched to a topology more akin to AMD's hypertransport. Nehalem is the first
processor to use QPI. But even with QPI the architects needed a way to support those legacy split atomics. To that end, QPI has the ability of quiesce the whole system to allow the split atomic to execute. It appears that QPI quiescence also drains the pipes and forces at least of the equivalent of barrier semantics over the whole system. That is, split atomics may serve as a way to force a system wide "remote" barrier.
Beware, it's not clear that QPI quiescence is actually safe and provides true quiescence. Empirical tests with a program designed to stress the -UseMembar facility and inspired by a simple hunch about the QPI implementation suggest so, but absence of evidence isn't evidence of absence --we might yet find Karl Popper's black swan.
At best, QPI quiescence should be considered an academic curiosity and should never be used in production code. I'm sure processor vendors would be loath to endorse such stunts because it could ultimately limit their latitude in future bus designs. QPI quiescence is simply an implementation artifact and not an architecturally defined or guaranteed trait. Even if the facility were somehow blessed I doubt vendors would want to expose it by means of split atomics so perhaps a new instruction might be called for. (Put another way, there are two issues, should the facility be provided, and if so how to expose it to programmers). So for the moment QPI quiescence is only good for prototyping what \*might\* be accomplished with a hypothetical instruction.
It's possible such mechanism might be applicable to certain forms of RCU (read-copy-update).
Dmitriy V'jukov recently posted some timing results for FlushProcessWriteBuffers. It's pretty efficient on his system. I hope to be able to run similar benchmarks to measure the cost of QPI quiescence in the near future, at which point I'll post the results here. (Dmitriy is also the author of the relacy race detector which I recommend). It's worth noting that implementations of facilities such as FlushProcessWriteBuffers can be made very efficient. For example the implementation might be able to avoid a cross-call to a processor if it's known that no thread in the process is executing on that processor, using the knowledge that context switches are serializing events.
Some extremely preliminary data shows that QPI quiescence by way of a split atomic incurs a local penalty of about 4800 cycles on an i7-920, and the degree of impact on the progress of other processors is very much a function of the miss rate of those processors.
I mentioned the idea of QPI quiescence to Dmitriy, who in turn pointed point out a relevant article on QPI in Dr. Dobbs. The "Locks" section is particularly interesting. As Dmitry noted, if it's possible to use quiescence for hot-plugging then it's not unreasonable to think that both the bus and processors are completely quiesced with no pending stores languishing in store buffers, which is precisely the behavior in which we're interested.
Is there a working analog to QPI quiescence on AMD's coherent hypertransport? This is left as an exercise for the curious reader.
Can QPI quiescence lead to a break-down of performance isolation in virtual machines running on the same physical system? (That's pretty easy to test even in the absence of virtual machines, with a simple multithreaded program or a few single-threaded processes).
For another example of the Dekker duality in the HotSpot JVM and further discussion about the challenges of weak memory models
refer to a previous blog entry about a long-standing bug in the park-unpark subsystem.
Biased locking, which is used in the HotSpot JVM, is a mechanism that attempts to the reduce the impact of high-latency atomic instructions. Interestingly, as processor vendors make improvements in the latency of such instructions there may come a time in the near future when biased locking is no longer profitable, at least on some platforms.