atomic fetch-and-add vs compare-and-swap
By Dave on Feb 15, 2011
There are a number of cases where an atomic fetch-and-add instruction might yield better performance than the classic Load;Φ;CAS loop, where CAS is the atomic compare-and-swap instruction. The x86 architecture exposes LOCK:XADD which I'll use in the discussion below. (If you don't need the original value you can also use LOCK:INC or LOCK:ADD instead of LOCK:XADD).
- CAS is "optimistic" and admits failure, whereas XADD does not. With XADD there's no explicit window of vulnerability to remote interference, and thus no need for a retry loop. Arguably, XADD has better progress properties, assuming the underlying XADD implementation doesn't have an implicit loop, but even in that case the window would be narrower than with Load;Φ;CAS.
- If you use the typical Load;Φ;CAS idiom, assuming normal snoop-based cache coherence, the Load may cause a read-to-share bus transaction to get the underlying cache line into S or E state. The CAS, which effectively has store semantics with respect to the cache coherence protocol, may induce another bus transaction to upgrade the line to M state. So in the worst case the idiom might incur two bus transactions, but an XADD implementation will usually drive the line directly into M state, requiring just one bus transaction. Of course you might be able to speculate on the value and have fast-path that tries a "naked" CAS without any prior load. (I use this frequently in native HotSpot code). Also, it's possible for sophisticated processor implementations to perform coherence speculation and aggressively probe the target line into M state. Finally, in some cases you can profitably insert a prefetch-for-write instruction prior to the load to avoid the upgrade transaction. But that approach has to be applied carefully as in some cases it can do more harm than good. Given all that, XADD, where available, has an advantage.
- Lets say you were trying to increment a variable with the usual Load;INC;CAS loop. When the CAS starts failing with sufficient frequency you can find that the branch to exit the loop (normally taken under no or light contention) starts to predict toward the failure path. So when the CAS ultimately succeeds, you'll incur a branch mispredict, which can be quite painful on processors with deep pipelines and lots of out-of-order speculative machinery. Typically, this is in a piece of code where you don't want a long stall. There's no loop and no such issues with XADD.
Update: see CR7023898