SeqLocks in Java - readers shouldn't write synchronization metadata
By Dave on Sep 05, 2006
Lets say you have data that's currently protected by a synchronized block, a ReentrantLock, or a ReentrantReadWriteLock. Most of the accessor operations are pure reads - operations that write are much less frequent. You'll enjoy better parallelism with a read-write lock, of course, but even then, the act of acquiring the read-write lock (even for reading) can cause stores into the lock metadata. If you're running on a traditional Symmetric Multi-Processor (SMP) system, concurrent threads attempting to acquire the read lock can generate coherency traffic -- cache line "sloshing" as the cache line containing the read-write lock metadata is repeatedly written and invalidated, migrating between the processor's caches. See the entry on biased locking for a description of the economics of cache coherency.
Using something akin to Linux's SeqLocks can provide a viable alternative. SeqLocks provide all the parallelism of read-write locks but pure read operations can be accomplished without writing to metadata (that is, without writing to the SeqLock). As such, concurrent readers using SeqLocks can avoid the cache line sloshing that would otherwise occur when acquiring and releasing read-locks. In addition, a pure read operation can operate without any potentially costly compare-and-swap (CAS) atomic instructions. As constituted in Linux, a SeqLock is simply a volatile word where the least-significant bit is a "LOCKED" indicator and the remainder of the word serves as a version number. A pure reader will initially read the SeqLock value, spinning as long as the LOCKED bit is set. Once the reader has observed a SeqLock value with the LOCKED bit clear, it records that value and then proceeds to speculatively read the shared data (the data protected by the SeqLock), and finally validate that the observed data is consistent by verifying that the SeqLock value remains the same. If the SeqLock value changed then the reader must re-run the read attempt. Otherwise the read attempt was successful. Critically, read operations are optimistic. Writers will try to CAS to swing the LOCKED bit from 0 to 1, at which point they have exclusive access to the data and can then safely write. Writers are immune to abort. When the writes are completed the writer simply increments the SeqLock, which clears the LSB and increments the version number field. A reader can also attempt to upgrade itself to a writer by attempting to CAS the previously observed SeqLock value into LOCKED state. That CAS, if successful, will both validate that the previously observed data is consistent and will acquire the write lock. To avoid any practical concerns about roll-over and ABA aliasing, a SeqLock is typically 64-bits. Of course you have to be careful with regards to compile-time reordering, ensuring that the SeqLock and the accessed data fields are appropriately marked volatile. In addition, depending on the memory model of the platform, various barrier or "fence" instructions might be needed to prevent architectural reordering (where the program order differs from the memory visibility order). See slides 19 and 20 for more details on reordering.
As a variation, instead of a LOCKED bit we might use a distinguished LOCKED value. Of course that LOCKED value would never appear as a legal sequence number. The writer would use CAS to swing the version number from V to LOCKED; perform the writes; and then use a simple store to change LOCKED to V+1. In addition, if we knew we only had one writer then we could forgo the CAS that sets the LOCKED bit or LOCKED encoding and use simple stores.
In the degenerate case where the data being protected is a counter or "clock" that takes on monotonically increasing values but where number of bits required for the counter exceeds the atomic word size of the platform, we might simply use the SeqLock itself as the low-order data word.
Note that SeqLocks could equally well be implemented with version numbers that count down or advance using a Gray encoding. We never compare SeqLock version numbers for magnitude, but only for equality. The critical property is that the LOCKED state be detectable and that subsequently generated version numbers differ from any previously generated versions that might still be in circulation (say, in the hands of a reader). This property ensures that readers avoid false-positive verification. Interestingly, in Java we can implement the SeqLock as a reference to Object. LOCKED is encoded as a null reference. The writer would release the SeqLock by: Lock = new Object(). This variant leverages Java's garbage collection mechanism to ensure that no "old" references exist in the hands of readers before allowing an object reference to potentially recycle. This approach places additional stress on the allocator and collector, but modern JVMs, particularly those with thread-local allocation buffers (TLABs) are up to the task.
Caveat Utor: SeqLocks require special care, particularly for pure read transactions. If a reader sees inconsistent data in the midst of the transaction we say that reader is a "zombie". It's possibly read inconsistent data and is fated to die; it will eventually abort, but it hasn't yet reached the validation phase. While operating as a zombie, the reader might inadvertently enter an infinite loop, throw an expected exception, or trigger an assertion. To avoid this, the developer may need to insert validation tests as necessary into read accessors. Validation ensures that the SeqLock value observed at the the start of the transaction remains the same. That, in turn, implies that data read up to that point in the read attempt is mutually consistent and there has been no interference from writers. If the SeqLock value differs, however, the reader should abort and retry the operation. In some simple cases it's safe to elide validation. For instance say your have a class with with 3 instance fields "int A, B, C;" and you want to implement the pure reader sumABC() with SeqLocks. It's perfectly safe to read and test the SeqLock, sum A, B and C into a tentative result, and then validate the SeqLock version before returning the result. In a more complicated data structure with dependent data you'd be better advised to validate in the middle of the read transaction, however. Lets say you have an instance field R that refers to some boxed value V. By convention R is non-null while the SeqLock is unlocked, but writers may store a transient null value into R before ultimately storing a non-null reference and releasing the lock. To implement a pure reader getRV(), which conceptually returns R.V, the reader would fetch R into a temporary T, validate, fetch T.V and then validate again before returning the fetched T.V value. Finally, since readers may abort, a reader shouldn't perform any I/O operations or cause any changes to the state of the system that can't be rolled back. Writers are pessimistic; they use mutual exclusion and can't abort so I/O is safe.
As described, SeqLocks provide a policy of writer preference. Successfully acquiring write access will kill -- eventually cause aborts in -- any concurrent read transactions.
For Java, the SeqLock can easily be implemented with java.util.concurrent AtomicLong . It's also critical that fields reference in SeqLock transaction are volatile or final. Since Java doesn't directly support arrays of volatiles, the java.util.concurrent Atomic\*Array constructs will prove useful.
SeqLocks are useful where the frequency of pure-read operations greatly dominates the frequency of update transactions. Beware, however, that there are pitfalls. First, there's the issue of dependent data and the need for validation mentioned above. In addition, the write-lock is a spinlock, so it should be employed only for very short bounded critical sections. If a critical section in the pure-read transaction is excessively long, then that operation may be more vulnerable to interference from writers, as well as wasting more cycles on a futile transaction if it needs to abort. Finally, as the data should be marked volatile, the performance may vary with the platform. Depending on the memory model employed by the platform, the just-in-time compiler may need to inject load-load barrier instructions to ensure that the loads of the SeqLock and the data fields are not reordered relative to each other. On some platforms such barrier instructions incur high latency.
Regarding spinlocks, lets say that a low priority thread acquires a write-lock and is preempted. Other higher priority threads are made runnable and then spin, attempting to acquire the spinlock held by the preempted thread. To avoid livelock and ensure progress, the spinlock mechanism assumes that the holder, even though running at a lower priority, will eventually receive CPU cycles from the scheduler and will be able to exit the critical section and release the lock. That is, spinlocks assume (a) preemptive scheduling, and (b) that the scheduler will avoid complete starvation, and that all threads, even low priority threads, will eventually receive CPU cycles even if higher priority threads remain runnable. This is certainly true with the Linux O(1) scheduler. Solaris (in the IA and TS scheduling class) and Windows have an anti-languishing feature where threads stranded on the ready queue will periorically receieve temporary priority boosts until they are eventually allowed to run. Conversely, spinlocks are not appropriate for environments that lack preemption or that have true real-time threads.
Depending the specifics of the implementation, it's possible that Hardware Transactional Memories (HTMs) might also be able to execute pure-reader transactions without any writes to shared data (metadata). The hardware might leverage the existing snoop-based cache coherency protocol to monitor and detect modifications or interference to the read-set (the read-set is the set of shared variables fetched during a transaction). Say processor P1 executes a transactional load. If the cache line underlying the transactionally fetched variable wasn't already locally in M, S or E MESI state, the transactional load would probe the line into S-state or E-state. If another processor P2 wrote to the line, it would need probe the line into M-state, which would cause invalidation. That invalidation could be detected by P1 as interference.
Finally, it's worth remarking that various Software Transactional Memory (STM) implementations use SeqLocks, either embedded in each object or as an array of SeqLocks that is hashed into given a transactional variable's address. In STM terminology we say that SeqLocks provide invisible readers, as readers don't write to shared metadata. In STM implementations such as TL or TL2 pure read transactions can be accomplished without any stores to shared data or metadata.
See also: The Solaris "Locking Strategy for high-resolution timing".