TXPAUSE : polite waiting for hardware transactional memory
By Dave on Oct 25, 2012
Classic locks are an appropriate tool to prevent potentially conflicting operations A and B, invoked by different threads, from running at the same time. In a sense the locks cause either A to run before B or vice-versa. Similarly, we can replace the locks with hardware transactional memory, or use transactional lock elision to leverage potential disjoint access parallelism between A and B. But often we want A to wait until B has run. In a Pthreads environment we'd usually use locks in conjunction with condition variables to implement our "wait until" constraint. MONITOR-MWAIT is another way to wait for a memory location to change, but it only allows us to track one cache line and it's only available on x86. There's no similar "wait until" construct for hardware transactions. At the instruction-set level a simple way to express "wait until" in transactions would be to add a new TXPAUSE instruction that could be used within an active hardware transaction. TXPAUSE would politely stall the invoking thread, possibly surrendering or yielding compute resources, while at the same time continuing to track the transaction's address-set. Once a transaction has executed TXPAUSE it can only abort. Ideally that'd happen when some other thread modifies a variable that's in the transaction's read-set or write-set. And since we're aborting all writes would be discarded. In a sense this gives us multi-location MWAIT but with much more flexibility. We could also augment the TXPAUSE with a cycle-count bound to cap the time spent stalled. (For a discussion of the benefits of polite waiting, see this blog entry on the use of WRPAUSE.)
A related concept for software transactional memory is the retry facility proposed by Harris et al. in Composable memory transactions.
I should note that we can already use hardware transactions to enter a tight spin loop in a transaction to wait for updates to the address-set, which will in turn cause an abort. Assuming that the implementation monitors the address-set via cache-coherence probes, by waiting in this fashion we actually communicate via the probes, and not via memory values. That is, the updating thread signals the waiter via probes instead of by traditional memory values. But TXPAUSE takes that a step further and gives us a polite way to spin within transactions.
Lets consider a classic 'polite' test-and-test-and-set loop as might be find in a simple spin lock. The contending threads loop, loading the lock word value to see if the lock has transitioned from LOCKED to UNLOCKED state. If they observe the UNLOCKED they'll then try to acquire the lock with an atomic operation such as XCHG or compare-and-swap. While busy-waiting, the cache line underlying the lockword might be in MESI "S" = shared state in the contending thread's local cache. When the owner ultimately drops the lock -- typically by using a STORE instruction to overwrite LOCKED with UNLOCKED -- the line will change from "S" to "I" (Invalid) in the contending thread's cache. The contending thread continues to loop. The next load by the contending thread misses because the line is in local "I" state, and then causes a read-to-share (RTS) coherence operation to get the line back in the local cache in "S" state. Assuming the spinning thread sees UNLOCKED it'll then try the atomic read-modify-write instruction to acquire the lock. This causes another read-to-owner (RFO or RTO) coherence operation to upgrade the line from "S" to "M" so the atomic can run. So the operations in the local cache were : a remote invalidation from the owner, an RTS via the load and then an RTO via the atomic. (This is the best case, btw).
Now lets say the spinning thread uses a hardware transaction to wait. Assuming Haswell-like hardware, the contending thread would execute an XBEGIN instruction to start the transaction, and then load the lock word. If the lock word is LOCKED, then the transaction simply enters a tight loop, where the only exit from the loop is via abort. If the lock word happened to contain UNLOCKED then the transaction can try to acquire the lock with a store of LOCKED followed by a COMMIT instruction. In that case we're done and the thread has acquired ownership. The case where we spin in the transaction is more interesting. The transaction will have issued a load on the lock word, so the line is in local "S" state. The store used by the owner to subsequently drop the lock will cause the line underlying the lock to be invalidated in the spinning thread's local cache. This causes an abort. After the abort, the contending thread can try an atomic compare-and-swap to acquire lock. This will typically transition the cache line from I to M by virtue of an RTO coherence operation. Note that we've saved the intermediate RTS operation by spinning in this manner, so, even absent TXPAUSE, it can be useful to use transactions to wait for values in memory to change. (As an aside, ignoring the use of hardware transactions, MOESI protocols are much more tolerant of spin locks than are MESI systems).
See also US20100169895.
Keywords: Hardware Transactional Memory; HTM; Haswell; RTM; TSX; spin lock; busy-wait
On the topic of new instructions, it might be useful to have a selective memory fence instruction that specified prior and subsequent addresses. Currently we might write "ST A=1; MEMBAR StoreLoad; LD B". A selective fence would allow us to write "ST A=1; MEMBAR StoreLoad A, B; LD B;" The selective fence would ensure that the store to A was visible before the fetch of B, but without imposing full store-load barrier semantics for all accesses. A degenerate form might be expressed as "MEMBAR A" which would specify that all subsequent (in program order) accesses to A would be visible before any subsequent memory accesses. On a simple in-order processor with a store buffer, "MEMBAR A" could simply wait while an address matching "A" appear in the store buffer. When A finally drained out to visible & coherent space, the MEMBAR would complete. It's possibly such selective fences might give the CPU designers more latitude to improve performance.