atomic fetch-and-add vs compare-and-swap

There are a number of cases where an atomic fetch-and-add instruction might yield better performance than the classic Load;Φ;CAS loop, where CAS is the atomic compare-and-swap instruction. The x86 architecture exposes LOCK:XADD, which I'll use in the discussion below. (If you don't need the original value, you can also use LOCK:INC or LOCK:ADD instead of LOCK:XADD.)
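
For concreteness, here's a minimal Java sketch contrasting the two forms (the class and method names are mine, purely for illustration). The first method is the classic Load;Φ;CAS retry loop; the second delegates to AtomicInteger.getAndAdd, which a JIT that intrinsifies it can compile down to a single LOCK:XADD on x86.

    import java.util.concurrent.atomic.AtomicInteger;

    // Illustration only: the two idioms discussed above.
    class FetchAddDemo {
        private final AtomicInteger count = new AtomicInteger();

        // Classic Load;Φ;CAS retry loop.
        int getAndIncrementViaCas() {
            for (;;) {
                int current = count.get();      // Load
                int next = current + 1;         // Φ -- the update function
                if (count.compareAndSet(current, next))
                    return current;             // CAS succeeded
                // CAS failed: a remote update intervened; retry.
            }
        }

        // Fetch-and-add form: no explicit retry loop. A JIT that
        // intrinsifies getAndAdd can emit a single LOCK:XADD here.
        int getAndIncrementViaXadd() {
            return count.getAndAdd(1);
        }
    }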


  1. CAS is "optimistic" and admits failure, whereas XADD does not. With XADD there's no explicit window of vulnerability to remote interference, and thus no need for a retry loop. Arguably, XADD has better progress properties, assuming the underlying XADD implementation doesn't have an implicit loop; even if it does, the window would be narrower than with Load;Φ;CAS.

  2. If you use the typical Load;Φ;CAS idiom, assuming normal snoop-based cache coherence, the Load may cause a read-to-share bus transaction to get the underlying cache line into S or E state. The CAS, which effectively has store semantics with respect to the cache coherence protocol, may induce another bus transaction to upgrade the line to M state. So in the worst case the idiom might incur two bus transactions, whereas an XADD implementation will usually drive the line directly into M state, requiring just one bus transaction. Of course you might be able to speculate on the value and have a fast path that tries a "naked" CAS without any prior load. (I use this frequently in native HotSpot code; see the sketch after this list.) Also, it's possible for sophisticated processor implementations to perform coherence speculation and aggressively probe the target line into M state. Finally, in some cases you can profitably insert a prefetch-for-write instruction prior to the load to avoid the upgrade transaction. But that approach has to be applied carefully, as in some cases it can do more harm than good. Given all that, XADD, where available, has an advantage.

  3. Let's say you're trying to increment a variable with the usual Load;INC;CAS loop. When the CAS starts failing with sufficient frequency, you can find that the branch to exit the loop (normally taken under no or light contention) starts to predict toward the failure path. So when the CAS ultimately succeeds, you'll incur a branch mispredict, which can be quite painful on processors with deep pipelines and lots of out-of-order speculative machinery. Typically, this is in a piece of code where you don't want a long stall. XADD has no loop, and thus no such issues.
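
To make point 2 concrete, here's a sketch of the "naked" CAS fast path (in Java; the lock class is hypothetical). Rather than loading the lock word first, we speculate that the lock is free and CAS straight away, so the common path drives the line directly into M state with a single coherence transaction.

    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical spin lock illustrating the "naked CAS" fast path
    // from point 2 above. 0 = unlocked, 1 = locked.
    class NakedCasLock {
        private final AtomicInteger lockWord = new AtomicInteger(0);

        void lock() {
            // Fast path: no prior load. Speculate that the lock is free
            // and CAS directly; the CAS's store semantics pull the line
            // straight into M state -- one coherence transaction.
            if (lockWord.compareAndSet(0, 1))
                return;
            // Slow path: spin reading (the line can stay in S state)
            // and only retry the CAS when the lock looks free.
            for (;;) {
                while (lockWord.get() != 0)
                    Thread.onSpinWait();    // spin hint, JDK 9+
                if (lockWord.compareAndSet(0, 1))
                    return;
            }
        }

        void unlock() {
            lockWord.set(0);
        }
    }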

Update: see CR7023898

Comments:

Well if that ain't the nitty-gritty, I don't know what is! B^>

Thanks

Damon

Posted by Damon Hart-Davis on February 15, 2011 at 04:30 AM EST #

Ahh, this is something that has bothered me in the past. I'm at a loss to understand when (if ever) load-inc-cas is better than the natively HW-provided lock:fetch_add. The Java AtomicInteger.incrementAndGet seems to use the loop version even on HW that has fetch-and-add; is that likely to change going forward?

On a slightly tangential note, it should not be difficult for HW to provide wait-free behaviour for fetch_add. Consider that lock:f_a is a blocking instruction (no context switch can occur until it completes); in that case even the simplest round-robin algorithm granting cache line ownership in FIFO order should solve the problem. I was very interested in this for implementing a ticket spin lock in Java but gave up after a while.

Posted by bank kus on February 15, 2011 at 05:52 AM EST #

Hi Damon, Doug Lea and I have been discussing the AtomicInteger/long getAndIncrement issue and will probably see about adding new unsafe primitives that will be emitted by the JIT as LOCK:XADD on x86. So yes, I think this will be fixed at some point.
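
For illustration, such a primitive might take roughly this shape (a hypothetical sketch, as it would sit inside an Unsafe-like class; the point is that the JIT can recognize the method and replace the whole loop with a single LOCK:XADD where the hardware supports it):

    // Hypothetical sketch of an unsafe-level getAndAddInt primitive.
    // The Java fallback is the familiar Load;Φ;CAS loop; a JIT that
    // intrinsifies the method can emit LOCK:XADD instead on x86.
    public final int getAndAddInt(Object o, long offset, int delta) {
        int v;
        do {
            v = getIntVolatile(o, offset);                      // Load
        } while (!compareAndSwapInt(o, offset, v, v + delta));  // CAS
        return v;  // return the original value, as XADD does
    }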

And I agree that most HW could provide a fetch-add implementation with good progress properties, although cache line arbitration and access fairness can be a tricky business in practice.

Regards, -Dave

Posted by guest on February 15, 2011 at 06:30 AM EST #

> the Load may cause a read-to-share bus transaction to get the underlying cache line into S or E state

Hi Dave,

I made the same speculation; that is, I expected load-CAS to be ~2 times slower than XADD under heavy contention from several cores. But my experiments (on Intel x86) didn't confirm that: CAS was only moderately slower than XADD, so two cache-coherence transactions can't be what's happening.
At the time I concluded that the hardware does indeed speculatively issue a prefetch-for-write in this case. Can you provide any other details on this? Perhaps you know of some hardware that does not do that.
All the other points indeed make sense; wait-free progress is always nice to have, all else being equal.
Thank you.

Posted by Dmitry Vyukov on February 16, 2011 at 05:55 PM EST #

Hi Dmitry, were you on a Nehalem? I believe the Nehalems have coherence speculation which can mitigate the impact of the 2nd bus transaction. Perhaps they even eliminate it. (Some SPARC processors will issue RTO for simple LD;Phi;ST sequences). My data was taken on a Nehalem and showed XADD performing about 1.3x better under load than a LD;Phi;CAS loop. Regards, -Dave

Posted by guest on February 17, 2011 at 12:56 AM EST #

It was on an Intel Q6600, a very early multicore.
I think that 1.3x means there aren't two separate transactions in most cases.

Posted by Dmitry Vyukov on February 17, 2011 at 03:43 AM EST #

When I get some time I'll look at the available performance counters. Without resorting to the manual, I vaguely recall there are counters for RTS probes, RTO probes and RTS-RTO upgrades. If so, that'd yield useful information. Regards, -Dave

Posted by guest on February 17, 2011 at 03:54 AM EST #

When we fix addAndGet, decrementAndGet, getAndAdd, getAndDecrement, getAndIncrement and incrementAndGet to be implemented in terms of LOCK:ADD, we should also change getAndSet to bind to XCHG.
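
For reference, here's roughly the loop shape getAndSet compiles to today (a sketch against the public AtomicInteger API; the class and method names are mine). XCHG swaps unconditionally, so it needs no comparison and no retry loop at all:

    import java.util.concurrent.atomic.AtomicInteger;

    // Sketch: the CAS-loop form of getAndSet that LOCK:XCHG replaces.
    class GetAndSetDemo {
        static int getAndSetViaCas(AtomicInteger a, int newValue) {
            for (;;) {
                int current = a.get();      // Load
                if (a.compareAndSet(current, newValue))
                    return current;         // CAS succeeded
                // CAS failed: another update slipped in; retry.
            }
        }
    }
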
-Dave

Posted by guest on February 23, 2011 at 12:57 AM EST #

Are the primitives that you were considering adding going to be exposed in a public API?

I am starting to worry about the large number of low-level primitives that are being exposed in j.u.concurrent. They may have the effect of ossifying applications that use them.

Hopefully, one day very soon, the JIT compiler will be able to recognize those load-inc-cas loops in your implementation of getAndIncrement and automatically replace them with the lock:xadd instruction; then there's no need to expose more primitives, and applications can continue to use the more meaningful, higher-level public APIs yet still get the best performance.

Posted by Ben L. Titzer on February 28, 2011 at 05:29 AM EST #

The idea is to shift the implementation of something like getAndIncrement from LD;phi;CAS to LOCK:XADD. That'd happen without accreting any new public APIs -- the changes would be under the covers in the unsafe primitives and the JUC implementation.

As you say, we could do idiom recognition in the JIT and accomplish the same thing without changing any unsafes or any of the implementation in JUC. That sounds trivial but usually involves touching much more code than you'd think. Since Doug presciently designed the API set to allow the substitution of improved implementations, it probably makes the most sense to take that approach. Regards, -Dave

Posted by guest on February 28, 2011 at 05:41 AM EST #
