An Oracle blog about Transactional locks

atomic fetch-and-add vs compare-and-swap

Dave Dice
Senior Research Scientist

There are a number of cases where an atomic fetch-and-add instruction might yield better performance than the classic Load;Φ;CAS loop, where CAS is the atomic compare-and-swap instruction. The x86 architecture exposes LOCK:XADD which I'll use in the discussion below. (If you don't need the original value you can also use LOCK:INC or LOCK:ADD instead of LOCK:XADD).
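
For concreteness, here is a minimal sketch of the two idioms using C11 atomics (the function names are mine; on x86, GCC and Clang typically compile these to a LOCK CMPXCHG retry loop and a single LOCK XADD, respectively):

```c
#include <stdatomic.h>

// Classic Load;Φ;CAS retry loop: read the current value, compute the
// new value, and try to install it with CAS; retry if another thread
// changed the variable in the window between the load and the CAS.
long cas_increment(_Atomic long *p) {
    long old = atomic_load_explicit(p, memory_order_relaxed);
    // On failure, atomic_compare_exchange_weak refreshes 'old' with the
    // currently observed value, so we just loop and retry.
    while (!atomic_compare_exchange_weak(p, &old, old + 1)) {
    }
    return old;  // value observed immediately before our increment
}

// Fetch-and-add: one atomic instruction (LOCK XADD on x86); there is
// no retry loop and no explicit window for remote interference.
long xadd_increment(_Atomic long *p) {
    return atomic_fetch_add(p, 1);  // returns the prior value
}
```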

  1. CAS is "optimistic" and admits failure, whereas XADD does not. With XADD there's no explicit window of vulnerability to remote interference, and thus no need for a retry loop. Arguably, XADD has better progress properties, assuming the underlying XADD implementation doesn't have an implicit loop, but even in that case the window would be narrower than with Load;Φ;CAS.
  2. If you use the typical Load;Φ;CAS idiom, assuming normal snoop-based cache coherence, the Load may cause a read-to-share bus transaction to get the underlying cache line into S or E state. The CAS, which effectively has store semantics with respect to the cache coherence protocol, may induce another bus transaction to upgrade the line to M state. So in the worst case the idiom might incur two bus transactions, but an XADD implementation will usually drive the line directly into M state, requiring just one bus transaction. Of course you might be able to speculate on the value and have a fast path that tries a "naked" CAS without any prior load. (I use this frequently in native HotSpot code). Also, it's possible for sophisticated processor implementations to perform coherence speculation and aggressively probe the target line into M state. Finally, in some cases you can profitably insert a prefetch-for-write (PREFETCHW) instruction prior to the load to avoid the upgrade transaction. But that approach has to be applied carefully as in some cases it can do more harm than good. Given all that, XADD, where available, has an advantage.
  3. Let's say you were trying to increment a variable with the usual Load;INC;CAS loop. When the CAS starts failing with sufficient frequency, you can find that the branch to exit the loop (normally taken under no or light contention) starts to predict toward the failure path, where we stay in the loop. So when the CAS ultimately succeeds, you'll incur a branch mispredict when trying to exit the loop. This can be quite painful on processors with deep pipelines and lots of out-of-order speculative machinery, and it typically occurs in a piece of code where you don't want a long stall. Relatedly, once the branch begins to predict that control stays in the loop, the loop runs faster and we cycle around more quickly to the CAS. Typically, we'd want some back-off in the loop. Under light load with infrequent failures, the mispredict on retry served as a potentially useful implicit back-off, but under higher load we lose that benefit. There's no loop, and thus none of these issues, with XADD.
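
To address the re-arrival problem in point 3, a CAS loop can add explicit bounded exponential back-off. A hypothetical sketch (the function name and spin thresholds are mine and untuned; _mm_pause is the x86 PAUSE hint, so this sketch is x86-specific):

```c
#include <stdatomic.h>
#include <immintrin.h>  // _mm_pause (x86 PAUSE spin-wait hint)

// CAS increment with bounded exponential back-off: after a failed CAS,
// pause for a growing number of iterations so contending threads don't
// all re-arrive at the CAS in lock-step.
long cas_increment_backoff(_Atomic long *p) {
    int spins = 1;
    long old = atomic_load_explicit(p, memory_order_relaxed);
    while (!atomic_compare_exchange_weak(p, &old, old + 1)) {
        for (int i = 0; i < spins; i++)
            _mm_pause();   // tell the core we're in a spin-wait loop
        if (spins < 64)    // cap the back-off at an arbitrary bound
            spins <<= 1;
    }
    return old;  // value observed immediately before our increment
}
```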

Even though CAS has a higher (infinite) consensus number, there are times when fetch-and-add -- which has a consensus number of just 2 -- is preferable.

Update: see CR7023898

Comments (10)
  • Damon Hart-Davis Tuesday, February 15, 2011

    Well if that ain't the nitty-gritty, I don't know what is! B^>



  • bank kus Tuesday, February 15, 2011

    Ahh, this is something that has bothered me in the past. I'm at a loss to understand when (if ever) load-inc-CAS is better than the natively HW-provided lock:fetch_add. The Java AtomicInteger.incrementAndGet seems to use the loop version even on hardware that has fetch-and-add; is that likely to change going forward?

    On a slightly tangential note, it should not be difficult for HW to provide wait-free behaviour for fetch_add. Consider that lock:f_a is a blocking instruction (can't context-switch until it completes); in that case even the simplest round-robin algorithm to grant cache-line ownership in FIFO order should solve the problem. I was very interested in this for implementing a ticket spin lock in Java but gave up after a while.

  • guest Tuesday, February 15, 2011

    Hi Damon, Doug Lea and I have been discussing the AtomicInteger/long getAndIncrement issue and will probably see about adding new unsafe primitives that will be emitted by the JIT as LOCK:XADD on x86. So yes, I think this will be fixed at some point.

    And I agree that most HW could provide a fetch-add implementation with good progress properties, although cache line arbitration and access fairness can be a tricky business in practice.

    Regards, -Dave

  • Dmitry Vyukov Wednesday, February 16, 2011

    > the Load may cause a read-to-share bus transaction to get the underlying cache line into S or E state

    Hi Dave,

    I made the same speculation, that is, I expected load-CAS to be ~2 times slower than XADD under heavy contention from several cores. But my experiments (on Intel x86) didn't confirm that: CAS was only moderately slower than XADD, so two cache-coherence transactions can't be what actually happens.

    At the time I concluded that the hardware indeed speculatively issues a prefetch-for-write in this case. Can you provide any other details on this? Perhaps you know of some hardware that does not do that.

    All the other points indeed make sense; wait-free progress is always nice to have, all else being equal.

    Thank you.

  • guest Thursday, February 17, 2011

    Hi Dmitry, Were you on a Nehalem? I believe the Nehalems have coherence speculation which can mitigate the impact of the 2nd bus txn. Perhaps they even eliminate it. (Some SPARC processors will issue RTO for simple LD;Phi;ST sequences). My data was taken on a Nehalem and showed about 1.3x better for XADD under load when compared to a LD;Phi;CAS loop. Regards, -Dave

  • Dmitry Vyukov Thursday, February 17, 2011

    It was on Intel Q6600 - very early multicore.

    I think that 1.3x means that there aren't two separate transactions in most cases.

  • guest Thursday, February 17, 2011

    When I get some time I'll look at the available performance counters. Without resorting to the manual, I vaguely recall there are counters for RTS probes, RTO probes and RTS-RTO upgrades. If so, that'd yield useful information. Regards, -Dave

  • guest Wednesday, February 23, 2011

    When we fix addAndGet, decrementAndGet, getAndAdd, getAndDecrement, getAndIncrement and incrementAndGet to be implemented in terms of LOCK:ADD, we should also change getAndSet to bind to XCHG.
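
For reference, the getAndSet operation mentioned above can be expressed with C11's atomic_exchange, which on x86 compiles to XCHG (the function name below is mine):

```c
#include <stdatomic.h>

// getAndSet in C11 atomics: atomically install a new value and return
// the old one. On x86 this compiles to XCHG, which is implicitly
// LOCK-prefixed when its destination is memory, so no retry loop is needed.
long get_and_set(_Atomic long *p, long newval) {
    return atomic_exchange(p, newval);
}
```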


  • Ben L. Titzer Monday, February 28, 2011

    Are the primitives that you were considering adding going to be exposed in a public API?

    I am starting to worry about the large number of low-level primitives that are being exposed in j.u.concurrent. They may have the effect of ossifying applications that use them.

    Hopefully one day very soon the JIT compiler can recognize those load-inc-CAS loops in your implementation of getAndIncrement and automatically replace them with the lock:xadd instruction. Then there's no need to expose more primitives, and applications can continue to use the more meaningful, higher-level public APIs yet still get the best performance.

  • guest Monday, February 28, 2011

    The idea is to shift the implementation of something like getAndIncrement from LD;phi;CAS to LOCK:XADD. That'd happen without accreting any new public APIs -- the changes would be under the covers in the unsafe primitives and the JUC implementation.

    As you say, we could do idiom recognition in the JIT and accomplish the same thing without changing any unsafes or any of the implementation in JUC. That sounds trivial but usually involves touching much more code than you'd think. Since Doug presciently designed the API set to allow the substitution of improved implementations, it probably makes the most sense to take that approach. Regards, -Dave
