There are a number of cases where an atomic fetch-and-add instruction might yield better performance than the classic Load;Φ;CAS loop, where CAS is the atomic compare-and-swap instruction. The x86 architecture exposes LOCK:XADD which I'll use in the discussion below. (If you don't need the original value you can also use LOCK:INC or LOCK:ADD instead of LOCK:XADD).
- CAS is "optimistic" and admits failure, whereas XADD does not. With XADD there's no explicit window of vulnerability to remote interference, and thus no need for a retry loop. Arguably, XADD has better progress properties, assuming the underlying XADD implementation doesn't have an implicit loop, but even in that case the window would be narrower than with Load;Φ;CAS.
- If you use the typical Load;Φ;CAS idiom, assuming normal snoop-based cache coherence, the Load may cause a read-to-share bus transaction to get the underlying cache line into S or E state. The CAS, which effectively has store semantics with respect to the cache coherence protocol, may induce another bus transaction to upgrade the line to M state. So in the worst case the idiom might incur two bus transactions, but an XADD implementation will usually drive the line directly into M state, requiring just one bus transaction. Of course you might be able to speculate on the value and have fast-path that tries a "naked" CAS without any prior load. (I use this frequently in native HotSpot code). Also, it's possible for sophisticated processor implementations to perform coherence speculation and aggressively probe the target line into M state. Finally, in some cases you can profitably insert a prefetch-for-write (PREFETCHW) instruction prior to the load to avoid the upgrade transaction. But that approach has to be applied carefully as in some cases it can do more harm than good. Given all that, XADD, where available, has an advantage.
- Lets say you were trying to increment a variable with the usual Load;INC;CAS loop. When the CAS starts failing with sufficient frequency you can find that the branch to exit the loop (normally taken under no or light contention) starts to predict toward the failure path where we stay in the loop. So when the CAS ultimately succeeds, you'll incur a branch mispredict when trying to exit the loop. This can be quite painful on processors with deep pipelines and lots of out-of-order speculative machinery. Typically, this is in a piece of code where you don't want a long stall. Relatedly, when the CAS starts to fail frequently, the branch begins to predict that control stays in the loop and in turn the loop runs faster and we cycle around more quickly to the CAS. Typically, we'd want some back-off in the loop. Under light load with infrequent failures the mispredict to retry the loop served as a potentially useful implicit back-off. But under higher load we lose the benefit of implicit back-off arising from the mispredict. There's no loop and no such issues with XADD
Even though CAS has a higher (infinite) consensus number, there are times when fetch-and-add -- which has a consensus number of just 2 -- is preferable.
Update: see CR7023898