Hardware-assisted transactional read-set validation
By Dave on Oct 02, 2006
The LoadLocked and StoreConditional primitives, commonly referred to as LL-SC, provide for optimistic concurrency control. To atomically increment a shared word in memory, we execute LL on the variable to fetch it into a register, increment that register, and then execute SC to conditionally store the updated register value back into memory, overwriting the previously fetched value. The SC instruction fails if another processor managed to update the variable between the LL and SC, in which case the program would typically retry the operation. In most LL-SC implementations, the processor executing LL tracks the location using the existing snoop-based cache-coherency mechanism, watching for remote updates. In a sense, LL-SC provides a degenerate form of hardware transactional memory (HTM) where the read-set (the set of shared variables read during the transaction) and the write-set (the set of shared variables speculatively stored into during the transaction) are collectively restricted to just one location. LL-SC is available on PowerPC, MIPS, Alpha and forthcoming multiprocessor ARM processors. A related instruction, compare-and-swap (CAS), is available on SPARC, IA32 and AMD64. CAS and LL-SC are key primitives for lock-free algorithms. LL-SC can be extended to allow multiple disjoint locations in the read-set and write-set, yielding a more general HTM.
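The atomic increment described above can be sketched as a CAS retry loop, here using GCC's `__atomic` builtins (an illustrative sketch, not any particular platform's library code). On an LL-SC machine the loop body would instead be LL, a register increment, and SC, with the same retry-on-failure shape:

```c
#include <stdint.h>

/* Sketch: atomically increment a shared word with a CAS retry loop,
 * as on SPARC/IA32/AMD64. */
static uint32_t atomic_increment(uint32_t *addr) {
    uint32_t old, next;
    do {
        old  = __atomic_load_n(addr, __ATOMIC_ACQUIRE); /* fetch current value */
        next = old + 1;                                 /* increment the register */
        /* The CAS fails if another processor updated *addr between
         * the load and the CAS, in which case we simply retry. */
    } while (!__atomic_compare_exchange_n(addr, &old, next, 0,
                                          __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE));
    return next;
}
```

The read-modify-CAS shape here is exactly the degenerate one-location transaction mentioned above: the "read-set" and "write-set" are both just `addr`.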
First generation HTMs, even when available, are likely to place restrictions on the size of the read-set and write-set. As such Software Transactional Memories (STMs) are of interest. STMs can be implemented on today's stock hardware.
Both STMs and HTMs typically ensure that the data observed by a transaction (that is, its read-set) is mutually consistent. For instance, let's say that we have shared variables A and B with the invariant that (A+B) is even. A and B are both initially 5. Transaction T1 reads A and observes 5. Transaction T2 updates A to 4 and B to 6. Transaction T1 then reads B and observes 6. We say that T1 has read inconsistent data. In this case T1 must abort and retry the transaction. Databases often provide the so-called ACID transactional properties - Atomicity, Consistency, Isolation and Durability - whereas transactional memory implementations usually provide just ACI.
HTMs provide consistency by snooping for remote invalidations to the current transaction's read-set. STMs typically provide consistency in one of two ways. First, the STM can implement so-called visible readers. Conceptually, each transacted-upon location is covered by a read-write lock. A visible reader acquires the read-lock for each element of the read-set at the time the transactional read is performed. In a sense, this is pessimistic: the read-lock ensures that the values previously read in the transaction remain unchanged by preventing an update transaction from acquiring the write-lock until the reader's transaction completes. In the example above, transaction T1 would acquire the read-lock covering A, preventing transaction T2 from acquiring the write-lock needed to update A until T1 completes. Unfortunately, visible readers require that readers write to shared transactional metadata in order to acquire and release the read-write locks. See the following blog entry for a discussion of why we want to minimize writes to shared data and metadata.
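A minimal sketch of visible readers, with each transactional location covered by its own read-write lock - here a simple counter-based lock, with illustrative names (`txloc_t`, `tx_read_visible`, `tx_write_visible`) rather than the metadata layout of any real STM:

```c
#include <stdint.h>

/* Each transactional location carries its own read-write lock.
 * rwlock >= 0 counts active visible readers; -1 marks a writer. */
typedef struct {
    int32_t rwlock;   /* shared metadata: readers must WRITE to it */
    long    value;
} txloc_t;

/* Transactional read: acquire the read-lock (note the store to shared
 * metadata - the cost of visibility) and hold it until commit/abort.
 * Returns 0 if a writer holds the lock, in which case the caller retries. */
static int tx_read_visible(txloc_t *loc, long *out) {
    int32_t r = loc->rwlock;
    if (r < 0) return 0;                           /* writer present */
    if (!__atomic_compare_exchange_n(&loc->rwlock, &r, r + 1, 0,
                                     __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
        return 0;                                  /* contention: retry */
    *out = loc->value;
    return 1;
}

/* Update transaction: the write-lock is unobtainable while any reader
 * holds the read-lock - that is what keeps the readers' view unchanged. */
static int tx_write_visible(txloc_t *loc, long v) {
    int32_t expected = 0;
    if (!__atomic_compare_exchange_n(&loc->rwlock, &expected, -1, 0,
                                     __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
        return 0;                                  /* readers or a writer present */
    loc->value = v;
    return 1;
}
```

In the A/B example, T1's read-lock on A is precisely what makes T2's update attempt fail until T1 completes.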
STMs such as TL associate a versioned write-lock with each transactional location. The version number is advanced after each update. A versioned write-lock is similar in spirit to a Linux seqlock. The STM tracks the version numbers associated with the read-set and validates the read-set at commit-time. This provides invisible readers: no writes (stores) to shared metadata are necessary to track the read-set. The technique is optimistic in that it detects and recovers from inconsistent data, whereas visible readers prevent inconsistent data.
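The versioned write-lock idea can be sketched as follows (illustrative names and layout, not the actual TL or TL2 code). An odd version marks the location as write-locked; a reader records the version it saw and revalidates it at commit-time, performing no stores to shared metadata:

```c
#include <stdint.h>

/* A versioned write-lock covering one transactional location,
 * seqlock-like: odd version => locked by a writer. */
typedef struct {
    volatile uint64_t version;
    long value;
} vlock_t;

/* Transactional read: record the version alongside the value.
 * No store to shared metadata - the reader is invisible. */
static long tx_read(vlock_t *loc, uint64_t *seen) {
    *seen = loc->version;
    return loc->value;
}

/* Commit-time validation: the read is consistent iff the version was
 * even (unlocked) when sampled and is unchanged now. */
static int tx_validate(const vlock_t *loc, uint64_t seen) {
    return (seen & 1) == 0 && loc->version == seen;
}

/* Writer: lock (version goes odd), update, unlock (version advances
 * to a strictly newer even value, spoiling any outstanding reads). */
static void tx_update(vlock_t *loc, long v) {
    loc->version += 1;    /* acquire */
    loc->value = v;
    loc->version += 1;    /* release and advance */
}
```

The optimism is visible in the shape of the code: `tx_read` never blocks a writer; it is `tx_validate` that detects, after the fact, that the value read is no longer trustworthy.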
STM read-set consistency management of either flavor - visible or invisible readers - can constitute a considerable fraction of the transactional overhead. In addition, for most transactions the size of the read-set dominates the size of the write-set. (This also holds for normal code - loads greatly outnumber stores. Depending on which literature you prefer, the load:store ratio is usually given as 4:1 or 3:1). Nir Shavit and I propose a hybrid scheme where read-set consistency is managed by hardware but the write-set is managed in software through locking, as is done in TL(TL1) or TL2. A traditional HTM as described in the literature handles both the read-set and write-set in hardware, whereas we propose decoupling read-set and write-set management. The goal is to reduce transactional latency by making a hopefully modest change in the processor, leveraging existing cache-coherency constructs. To be profitable, the hardware read-set tracking would need to be more efficient - have lower latency - than the explicit version number checks used by STMs such as TL(TL1) or TL2. (The version number checks ensure that the read-set is mutually consistent).
STMs such as TL1 periodically revalidate the read-set during a transaction. This is necessary as a transaction might have read a set of values that are inconsistent and then, based on that inconsistent data, enter an infinite loop, generate traps, or otherwise misbehave. We refer to such a transaction as a zombie: it's fated to eventually abort but it's still running. We can avoid zombie transactions by revalidating the entire read-set after each transactional load. That's sufficient to avoid zombies, but the cost is quadratic in the size of the read-set. Instead, TL1 admits zombies (transiently) and validates periodically, but the TL1 runtime environment must then provide zombie containment to gracefully tolerate the misbehavior of zombie transactions. Note that TL2 does not admit zombie execution, potentially making it more suitable for unmanaged C or C++ runtime environments. Relatedly, STMs that use visible readers avoid zombies too, but at the cost of considerable read-write coherency traffic. If we had hardware-assisted read-set validation, and notification that a thread's read-set had been "spoiled" by a remote write were immediate (say, by interrupt delivery), then we could use schemes such as TL1 while still avoiding zombies and the complexity of zombie containment.
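The revalidation step itself is a simple scan; the question is how often it runs. A sketch of TL1-style read-set revalidation (illustrative names and layout, not the actual TL1 code):

```c
#include <stddef.h>
#include <stdint.h>

/* Each transactional location is covered by a versioned lock;
 * each read-set entry records the version observed at read time. */
typedef struct { volatile uint64_t version; long value; } vlock_t;
typedef struct { vlock_t *loc; uint64_t seen; } rs_entry_t;

/* Returns 1 iff every previously read location is still at the version
 * observed.  Calling this after EVERY transactional load avoids zombies
 * but costs O(n) per load - O(n^2) over the transaction.  TL1 instead
 * calls it periodically, transiently admitting (and containing) zombies. */
static int revalidate(const rs_entry_t *rs, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (rs[i].loc->version != rs[i].seen)
            return 0;   /* read-set spoiled: abort and retry */
    return 1;
}
```

With immediate hardware notification of read-set interference, this software scan - and the zombie-containment machinery that compensates for running it only periodically - could be dispensed with.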
Hardware read-set management could be accomplished in various ways. First, we might modify the existing data cache (D$) to tag and track the lines containing read-set elements. Remote snoop invalidations of such tracked cache lines that occur before commit-time would result in a forced abort. This leverages the existing cache-coherency protocol. One problem with this approach is that loads during a transaction might cause earlier transactional loads from the same transaction to be displaced from the cache because of insufficient cache associativity. If that were to happen we'd lose the ability to track the displaced line, and thus the ability to guarantee consistency, and the transaction would need to abort. Retrying is useless, of course, as the resource deficit is local and would simply recur. To avoid such problems we might use a high-associativity cache similar to a victim cache. Such a specialized cache wouldn't necessarily need to contain data, but rather just tags and tracking metadata. This idea is vaguely similar to the store-miss accelerator proposed by Chou et al. in IEEE Micro '05. Both of these approaches have capacity limits, however, and some transactions may not be locally feasible. To avoid capacity limits we might provide each processor with a local hardware Bloom filter keyed by physical address. The contents of the filter would be reset at the start of a transaction. As transactional reads were performed, those addresses would be added to the local Bloom filter. That is, the Bloom filter would track the read-set. Remote invalidation probes would test the Bloom filter and, on a hit, cause the current transaction to abort. This mechanism admits false-positive aborts, but with proper sizing of the Bloom filter they can be made relatively rare. (Critically, a Bloom filter is conservative and does not admit false negatives - it will never miss an instance of read-set interference. In the absence of remote coherency traffic, all local transactions will be feasible).
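A software model of the proposed per-processor read-set filter may make the false-positive/false-negative asymmetry concrete. The filter size and hash functions here are illustrative choices, not a proposed hardware design:

```c
#include <stdint.h>

/* Model of a per-processor Bloom filter tracking the transactional
 * read-set, keyed by (physical) address.  Sizing is illustrative. */
#define FILTER_BITS 1024
typedef struct { uint64_t bits[FILTER_BITS / 64]; } bloom_t;

/* Reset at the start of each transaction. */
static void bloom_reset(bloom_t *f) {
    for (int i = 0; i < FILTER_BITS / 64; i++) f->bits[i] = 0;
}

/* A cheap mixing hash; real hardware would use something simpler still. */
static uint32_t bloom_hash(uint64_t addr, uint32_t salt) {
    addr ^= (uint64_t)salt * 0x9e3779b97f4a7c15ULL;
    addr ^= addr >> 33;
    addr *= 0xff51afd7ed558ccdULL;
    addr ^= addr >> 33;
    return (uint32_t)(addr % FILTER_BITS);
}

/* On each transactional read: add the address to the filter. */
static void bloom_add(bloom_t *f, uint64_t addr) {
    for (uint32_t k = 1; k <= 2; k++) {
        uint32_t h = bloom_hash(addr, k);
        f->bits[h / 64] |= 1ULL << (h % 64);
    }
}

/* On each remote invalidation: probe the filter; a hit aborts the
 * current transaction.  May return a false positive (spurious abort)
 * but never a false negative (missed read-set interference). */
static int bloom_probe(const bloom_t *f, uint64_t addr) {
    for (uint32_t k = 1; k <= 2; k++) {
        uint32_t h = bloom_hash(addr, k);
        if (!(f->bits[h / 64] & (1ULL << (h % 64)))) return 0;
    }
    return 1;
}
```

Unlike the cache-based schemes, nothing here is ever displaced: an address added to the filter stays tracked for the rest of the transaction, which is why local capacity aborts disappear.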
For additional details see the following write up.
Most modern speculative Out-of-Order (OoO) processors already provide coherence tracking for speculatively loaded operands. Transparently to the programmer, as an optimization, loads might be executed more aggressively than program order would seem to allow. In that case the speculatively loaded-from addresses are tracked. If another processor writes to a speculatively loaded-from location before the commit "wave front" passes the load, the processor will quietly rerun the load and discard any speculative state that depends on it. This tracking permits OoO execution while preserving the platform's memory model. I recommend "Memory Ordering: a Value-based Approach" and "Deconstructing Commit".
Finally, Chris Purcell and Tim Harris mention a clever idea in their paper Implementing Multi-word Atomic Snapshots on Current Hardware where they use existing performance counters available on most processors to detect read-set interference. Their scheme works for read-only transactions and requires two passes over the read-set.
Further reading: Bulk Disambiguation of Speculative Threads in Multiprocessors by Ceze et al.