Biased Locking in HotSpot

The biased locking scheme in HotSpot arose from the following paper authored by myself, Mark Moir, and Bill Scherer. Briefly, the compare-and-swap (CAS) operations normally used to acquire a Java monitor incur considerable local latency in the executing CPU. Since most objects are locked by at most one thread during their lifetime, we allow that thread to bias an object toward itself. Once biased, that thread can subsequently lock and unlock the object without resorting to expensive atomic instructions. Obviously, an object can be biased toward at most one thread at any given time. (We refer to that thread as the bias holding thread). If another thread tries to acquire a biased object, however, we need to revoke the bias from the original thread. (At this juncture we can either rebias the object or simply revert to normal locking for the remainder of the object's lifetime). The key challenge in revocation is to coordinate the revoker and the revokee (the bias holding thread) -- we must ensure that the revokee doesn't lock or unlock the object during revocation. As noted in the paper, revocation can be implemented in various ways -- signals, suspension, and safepoints to name a few. Critically, for biased locking to prove profitable, the benefit conferred by avoiding the atomic instructions must exceed the revocation costs. Another, equivalent, way to conceptualize biased locking is that the original owner simply defers unlocking the object until it encounters contention. Biased locking bears some resemblance to the concept of lock reservation, which is well described in Kawachiya's dissertation. It's also similar in spirit to "opportunistic locks" (oplocks) found in various file systems; the scheme described in Mike Burrows's "How to implement unnecessary mutexes" (in Computer Systems: Theory, Technology and Applications, Dec. 2003); and Christian Siebert's one-sided mutual exclusion primitives.
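
To make the fast path concrete, here is a rough conceptual sketch. This is not the HotSpot implementation -- HotSpot encodes the bias owner in the object's mark word rather than in separate fields, and the slow-path details are elided here -- it only illustrates why the common case needs no atomic instruction:

```java
// Conceptual sketch only. HotSpot keeps the bias owner in the object header
// (mark word); this model just shows the control flow of the fast path.
final class BiasedLockSketch {
    private volatile Thread biasOwner;   // thread the "object" is biased toward
    private int recursions;              // lock count maintained by the bias owner

    void lock() {
        Thread self = Thread.currentThread();
        if (biasOwner == self) {
            recursions++;                // fast path: no CAS, no fence
            return;
        }
        slowPathLockOrRevoke(self);      // unbiased, or biased toward another thread
    }

    void unlock() {
        if (biasOwner == Thread.currentThread()) {
            recursions--;                // again, no atomic instruction
            return;
        }
        slowPathUnlock();
    }

    private void slowPathLockOrRevoke(Thread self) { /* CAS to claim the bias, or revoke */ }
    private void slowPathUnlock()               { /* revert to the standard monitor protocol */ }
}
```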

Biased locking is strictly a response to the latency of CAS. It's important to note that CAS incurs local latency, but does not impact scalability on modern processors. A common myth is that each CAS operation "goes on the bus", and, given that the interconnect is a fixed and contended resource, use of CAS can impair scalability. That's false. CAS can be accomplished locally -- that is, with no bus transactions -- if the line is already in M-state. CAS is usually implemented on top of the existing MESI snoop-based cache coherence protocol, but in terms of the bus, CAS is no different than a store. For example, let's say you have a true 16-way system. You launch a thread that CASes 1 billion times to a thread-private location, measuring the elapsed time. If you then launch 16 threads, all CASing to thread-private locations, the elapsed time will be the same. The threads don't interfere with or impede each other in any way. If you launch 16 threads all CASing to the same location you'll typically see a massive slow-down because of interconnect traffic. (The sole exception to that claim is Sun's Niagara, which can gracefully tolerate sharing on a massive scale as the L2$ serves as the interconnect). If you then change that CAS to a normal store you'll also see a similar slow-down; as noted before, in terms of coherency bus traffic, CAS isn't appreciably different than a normal store. Some of the misinformation regarding CAS probably arises from the original implementation of lock:cmpxchg (CAS) on Intel processors. The lock: prefix caused the LOCK# signal to be asserted, acquiring exclusive access to the bus. This didn't scale of course. Subsequent implementations of lock:cmpxchg leverage the cache coherency protocol -- typically snoop-based MESI -- and don't assert LOCK#. Note that lock:cmpxchg will still drive LOCK# in one extremely exotic case -- when the memory address is misaligned and spans 2 cache lines. Finally, we can safely use cmpxchg on uniprocessors but must use lock:cmpxchg on multiprocessor systems. Lock:cmpxchg incurs more latency, but then again it's a fundamentally different instruction than cmpxchg. Lock:cmpxchg is serializing, providing bidirectional mfence-equivalent semantics. (Fence or barrier instructions are never needed on uniprocessors.) This fact might also have contributed to the myth that CAS is more expensive on MP systems. But of course lock:cmpxchg incurs no more latency on a 2x system than on an 8x system.
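
For concreteness, a minimal sketch of that experiment using AtomicLong for the CAS. The thread and iteration counts are illustrative, and on real hardware you'd want proper warm-up; the point is only that the "private" timing should be roughly flat as threads are added, while the "shared" timing degrades:

```java
import java.util.concurrent.atomic.AtomicLong;

// Compare threads CASing on thread-private locations vs. all threads CASing
// on one shared location.
public class CasScalingDemo {
    static long run(int threads, boolean shared) throws InterruptedException {
        final AtomicLong sharedTarget = new AtomicLong();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                AtomicLong target = shared ? sharedTarget : new AtomicLong(); // private per thread
                for (long n = 0; n < 100_000_000L; n++) {
                    long v = target.get();
                    target.compareAndSet(v, v + 1);   // one CAS per iteration
                }
            });
        }
        long start = System.nanoTime();
        for (Thread t : ts) t.start();
        for (Thread t : ts) t.join();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("private: " + run(16, false) + " ns");
        System.out.println("shared:  " + run(16, true)  + " ns");
    }
}
```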

While digressing on the topic of bus operations, let's say that a load is followed closely in program order by a store or CAS to the same cache line. If the cache line is not present in the issuing processor, the load will generate a request-to-share transaction to get the line in S-state, and the store or CAS will result in a subsequent request-to-own transaction to force the line into M-state. This 2nd transaction can be avoided on some platforms by using a prefetch-for-write instruction before the load, which will force the line directly into M-state. It's also worth mentioning that on typical classic SMP systems, pure read-sharing is very efficient: all the requesting processors can have the cache line(s) replicated in their caches. But if even one processor is writing to a shared cache line, those writes will generate considerable cache coherence traffic; assuming a write-invalidate cache coherence policy (as opposed to write-update), the readers will continually re-load the cache line just to have it subsequently invalidated by the writer(s). Put differently, loads to a cache line are cheap if other processors are loading from but not storing to that same line. Stores are cheap only if no other processors are concurrently storing to or loading from that same line. (We can draw an imprecise analogy between cache coherency protocols and read-write locks in that for a given cache line there can be only one writer at any given time -- the processor with the line in M-state. Multiple readers of the line are allowed, and of course the lifetime of a reader can't overlap a write. Unlike traditional read-write locks, however, the cache coherency protocol allows writers to invalidate readers, so we can't push the analogy too far. In a twisted sense, the coherency protocol is obstruction-free). Coherency bandwidth is a fixed and contended global resource, so in addition to local latency, excessive sharing traffic will impact overall scalability and impede the progress of threads running on other processors. A so-called coherency miss -- for example a load on processor P1 where processor P2 has the cache line in M-state -- is typically much slower than a normal miss (except on Niagara). Recall too that acquiring a lock involves a store (CAS, really) to the lock metadata, so if you have threads on processors P1 and P2 iterating, acquiring the same lock, the lock acquisition itself will generate coherency traffic and result in "sloshing" of the cache line(s) holding the metadata. Generally, excessive coherency traffic is to be avoided on classic SMP systems. But as usual there's an exception to any rule, and in this case that exception is Sun's Niagara, which can tolerate sharing gracefully. We should now return to the topic of CAS latency, as that issue motivates the need for biased locking.
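
Before moving on, a simple way to observe the lock-metadata "sloshing" mentioned above is to compare two threads hammering one shared lock against each thread hammering its own private lock. This is only a sketch (class and timing details are illustrative; absolute numbers vary wildly by platform), but the shared-lock case adds coherence traffic on the lock word that the private case avoids:

```java
import java.util.concurrent.locks.ReentrantLock;

// Two threads repeatedly acquiring one shared lock vs. each acquiring its own
// private lock. The shared case sloshes the line holding the lock metadata.
public class LockSloshingDemo {
    static long run(boolean shareLock) throws InterruptedException {
        final ReentrantLock sharedLock = new ReentrantLock();
        Runnable worker = () -> {
            ReentrantLock lock = shareLock ? sharedLock : new ReentrantLock();
            for (int i = 0; i < 50_000_000; i++) {
                lock.lock();        // CAS on the lock metadata
                lock.unlock();
            }
        };
        Thread t1 = new Thread(worker), t2 = new Thread(worker);
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join();  t2.join();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("private locks: " + run(false) + " ns");
        System.out.println("shared lock:   " + run(true)  + " ns");
    }
}
```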

CAS and the fence (AKA barrier or MEMBAR) instructions are said to be serializing. On multiprocessor systems, serializing instructions control the order in which stores and loads become visible with respect to memory. Such instructions are implemented in a rather crude fashion on most current processors. Typically, a serializing instruction will bring the CPU to a near halt, kill and inhibit any out-of-order (OoO) instructions, and wait for the local store buffer to drain. That's OK for simple processors, but it results in considerable loss of performance on more sophisticated OoO processors. There's no fundamental reason why CAS or fence should be slow. Until recently the trend was toward worse CAS and fence performance, but it appears that this trend may be reversing. IBM made considerable improvements in fence latency between the Power4 and Power5. Likewise, Intel made remarkable improvements for both lock:cmpxchg and mfence latency in the recent Core2 processors.

In the case of biased locking -- since we know that the object is unshared -- it's reasonable to assume that the actions of the bias holding thread would keep the cache lines underlying the object's lock word and data in M- or E-state.

Interestingly, if a memory barrier, or "fence," instruction is sufficiently faster than CAS, we can implement biased locking revocation with a Dekker-like mechanism. If you're interested in delving deeper you might find the following of interest: Asymmetric Dekker Synchronization. (That same document also describes an optimization used in HotSpot where we were able to eliminate a MEMBAR from the state transition code path on multiprocessor systems).
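
For the curious, here is a rough symmetric sketch of such a Dekker-style handshake. The field and method names are illustrative, the fences use the Java 9 VarHandle API purely for exposition, and the asymmetric variant described in the document above would remove the fence from the owner's fast path and compensate on the revoker's side (signals, cross-calls, or safepoints):

```java
import java.lang.invoke.VarHandle;

// Symmetric Dekker-style sketch of bias revocation (not the HotSpot code).
final class DekkerRevocationSketch {
    private volatile boolean ownerInLockUnlock; // set by the bias holding thread
    private volatile boolean revokeRequested;   // set by the revoking thread

    // Fast path executed by the bias holding thread around lock/unlock.
    boolean tryFastPath() {
        ownerInLockUnlock = true;
        VarHandle.fullFence();          // store must be visible before the load
        if (revokeRequested) {
            ownerInLockUnlock = false;
            return false;               // fall back to the standard slow path
        }
        // ... perform the biased lock or unlock ...
        ownerInLockUnlock = false;
        return true;
    }

    // Executed by a thread that wants to revoke the bias.
    void revoke() {
        revokeRequested = true;
        VarHandle.fullFence();          // same handshake from the other side
        while (ownerInLockUnlock) {
            Thread.onSpinWait();        // wait out any in-flight fast path
        }
        // Any later fast path will now see revokeRequested and fall back, so
        // the revoker can safely examine the lock state and revoke or rebias.
    }
}
```

The Dekker property is that when both sides store their flag, fence, and then load the other side's flag, at least one of them observes the other's store, so the revoker can never conclude "no fast path in flight" while the owner simultaneously concludes "no revocation requested".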

Kenneth Russell and David Detlefs (formerly of SunLabs) have a paper appearing in OOPSLA'06 -- Eliminating Synchronization-Related Atomic Operations with Biased Locking and Bulk Rebiasing -- where they describe a scheme to reduce the cost of revocation.

It's important to note that biased locking is permitted under the Java Memory Model as clarified by JSR-133. Refer to the Java Language Specification, 3rd edition, Chapter 17, Section 4. Doug Lea's JSR-133 Cookbook is an excellent resource for JVM developers or curious Java programmers.

Finally, there's not currently space in the mark word to support both an identity hashCode() value as well as the thread ID needed for the biased locking encoding. Given that, you can avoid biased locking on a per-object basis by calling System.identityHashCode(o). If the object is already biased, assigning an identity hashCode will result in revocation, otherwise, the assignment of a hashCode() will make the object ineligible for subsequent biased locking. This property is an artifact of our current implementation, of course.
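
In code, the per-object opt-out looks like this (a small illustrative example, not a recommendation):

```java
public class DisableBiasPerObject {
    public static void main(String[] args) {
        Object o = new Object();
        // Installing the identity hash consumes the header bits needed for the
        // bias encoding, so this particular object can no longer be biased
        // (and any existing bias is revoked).
        System.identityHashCode(o);

        synchronized (o) {
            // acquired via the ordinary (stack-lock / inflated) protocol
        }
    }
}
```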

As an aside, the latency of atomic read-modify-write operations such as CAS was in part what motivated myself and Alex Garthwaite to develop our Mostly Lock-Free Malloc (ISMM 2002) mechanism, which uses processor-local allocation without requiring atomics or the binding of threads to processors.

See also US07814488 (2002).

Comments:

Thanks for writing this up, hope there's more to come.

Posted by Mikael Gueck on August 19, 2006 at 12:39 AM EDT #

btw, if you know Russian there are several other posts related to JIT optimization (including Biased Locking) here: http://blogs.sun.com/vmrobot/category/Compilers

Posted by katya on September 03, 2006 at 01:14 AM EDT #

Hi,

I read this article with interest. Biased locking was introduced in JDK 1.6 and improved uncontended synchronization by removing atomic locks. However, I have read elsewhere that atomic locks (AtomicInteger etc.) are used in the java.util.concurrent libs and improve performance. Can you explain the conflict, please?

Thank you

Posted by andy black on June 22, 2009 at 05:25 AM EDT #

Hello Andy,

Instead of 'atomic locks' let's say 'atomic instructions', which, on x86, typically have a LOCK: prefix (except XCHG).

AtomicInteger could legitimately be implemented in a number of ways, such as via locks ("synchronized" or a j.u.c.ReentrantLock) or via atomic read-modify-write instructions such as CAS/LOCK:CMPXCHG etc. We happen to use the latter in HotSpot. That gives us a non-blocking implementation but at the cost of an atomic instruction. (Atomics, BTW, are becoming cheaper as processor designers put more effort into the implementation. Intel has clearly paid close attention to atomic latency -- it's much better on Nehalem, for instance). The non-blocking property is generally desirable.
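
For illustration, the classic non-blocking retry loop looks roughly like this. This is only a sketch -- the shipping AtomicInteger defers to JVM intrinsics (LOCK:XADD or LOCK:CMPXCHG on x86) rather than a hand-written loop:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a getAndIncrement built from a load and a CAS retry loop.
final class CasLoopCounter {
    private final AtomicInteger value = new AtomicInteger();

    int getAndIncrement() {
        for (;;) {
            int current = value.get();                      // LD
            if (value.compareAndSet(current, current + 1))  // CAS
                return current;                             // success: old value
            // CAS failed: another thread updated the value; just retry
        }
    }
}
```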

Hypothetically, if we implemented AtomicInteger with synchronized then it could also benefit from biased locking in the case where only one thread ever accessed a given instance. But broadly, we expect the j.u.c primitives to be used in the case where there's real sharing. That is, they're optimized for the case of real concurrent access, which is the right trade-off.

Regards, -Dave

Posted by guest on June 22, 2009 at 05:54 AM EDT #

Is AtomicInteger.getAndIncrement() guaranteed to complete on all hw platforms? The way I see it, for hw to provide this guarantee, lock:addAndIncrement on that platform must not allow a thread context switch until it completes. So on a 16-thread system, let's assume all 16 threads went for this at the same time; they are all guaranteed to complete before other threads run on those hw threads.

Posted by bank kus on December 11, 2009 at 08:29 PM EST #

The progress properties aren't specified as they depend on the capabilities of the JVM and the underlying platform. On x86 with LOCK:XADD it could be wait-free. On other platforms with CAS it'll be lock-free (LD...CAS in a loop). On some platforms and JVMs the implementation may be lock-based.

Regarding context switching, if you use LOCK:XADD or some other similar atomic then either the atomic instruction completes or it doesn't. If the implementation uses a LD...CAS or LL...SC idiom then we can be preempted between the LD and the CAS, or between the LL and the SC. That's harmless, however, as at worst we'll just retry.

Posted by dave dice on December 13, 2009 at 12:51 AM EST #

Hi Dave
Thanks for the clarification. Curious how we implement fair locks in JUC? Clearly ticket locking cannot be implemented on hw that doesn't have a wait-free fetchAndIncrement.

Posted by bank kus on December 13, 2009 at 08:22 AM EST #

Hi Bank, good question.

The fair locks are "fair" in the sense that succession is managed by passing ownership directly to another thread, instead of simply releasing the lock and and using competitive succession, which can allow so-called "barging" by other threads trying to gain the lock. We typically use an explicit list of threads blocked on a lock as we want to unpark the specific successor. (We do this because we can't allow unbounded spinning).

IIRC, fair locks use CAS in the "doorway" protocol (at least on our reference platforms), so it's possible that a thread that arrived 1st in the doorway might ultimately enqueue behind some other thread that arrived more recently at the lock doorway. So it's not strictly fair in that sense.
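
For comparison, the ticket lock you mention would look roughly like the sketch below. This is not how the j.u.c fair locks are built -- note the unbounded spinning, which is exactly what we avoid by parking and unparking a specific successor -- it just shows why a wait-free fetch-and-increment in the doorway gives strict FIFO ordering:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a ticket lock: the doorway is one fetch-and-increment, and
// threads are served strictly in ticket order.
final class TicketLock {
    private final AtomicInteger nextTicket = new AtomicInteger();
    private final AtomicInteger nowServing = new AtomicInteger();

    void lock() {
        int myTicket = nextTicket.getAndIncrement(); // wait-free doorway
        while (nowServing.get() != myTicket) {
            Thread.onSpinWait();                     // unbounded spinning!
        }
    }

    void unlock() {
        nowServing.incrementAndGet();                // hand off to the next ticket
    }
}
```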

Regards
Dave

Posted by David Dice on December 14, 2009 at 02:34 AM EST #

Interesting -- I had a similar implementation for a FairLock (or one I thought was fair) where I wouldn't release the lockvar but instead do lock handover. This way tryLock stays fair too if there are queued contenders.

What I realized is that with handover to a list of contenders there's still a problem: adding a thread to the list itself is not wait-free. Perhaps the JDK implementation doesn't suffer from this.

Btw this blog has been very educational for me; wish I had found it earlier :-)

Posted by bank kus on December 14, 2009 at 06:43 AM EST #

Hi Banks,

MCS and CLH are good algorithms (at least for starting points) if you want to construct a lock with a wait-free doorway. CLH arguably has better execution properties but also requires a spare dummy node per lock instance.

Regards, -Dave.
p.s., some of the lock queues for "synchronized" etc. in hotspot native code are lock-free. I implemented them as such to avoid "double contention", which can occur if we have a hot lock and some "inner" lock that protects the queues is also hot.
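
For reference, a textbook CLH lock looks roughly like this (a sketch, not HotSpot or j.u.c code). Note the single wait-free swap in the doorway and the spare dummy node installed as the initial tail:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a CLH queue lock: each thread enqueues with one swap on the tail
// and then spins locally on its predecessor's node.
final class CLHLock {
    static final class Node { final AtomicBoolean locked = new AtomicBoolean(); }

    private final AtomicReference<Node> tail = new AtomicReference<>(new Node()); // dummy node
    private final ThreadLocal<Node> myNode = ThreadLocal.withInitial(Node::new);
    private final ThreadLocal<Node> myPred = new ThreadLocal<>();

    void lock() {
        Node node = myNode.get();
        node.locked.set(true);
        Node pred = tail.getAndSet(node);     // wait-free doorway
        myPred.set(pred);
        while (pred.locked.get()) {
            Thread.onSpinWait();              // spin on the predecessor only
        }
    }

    void unlock() {
        Node node = myNode.get();
        node.locked.set(false);               // release the successor
        myNode.set(myPred.get());             // recycle the predecessor's node
    }
}
```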

Posted by dave dice on December 14, 2009 at 07:00 AM EST #

Dave - I read with interest your QRL paper and the other associated materials in your post. For those who are interested, you can get at the OOPSLA paper directly from Sun, in case you don't have an ACM account:

http://java.sun.com/javase/technologies/hotspot/publications/biasedlocking_oopsla2006_wp.pdf

I am specifically interested in the scheme you mention in 3.2 of the QRL paper regarding using a half-word store followed by a full-word read to ensure the read cannot be re-ordered before the write.

In particular, I've been trying to determine if this will work under the x86 memory model. Intel provides a pretty detailed description of the IA-32 behavior in:

http://www.intel.com/Assets/PDF/manual/253668.pdf

... section 8.2 - but there doesn't seem to be enough information to determine whether a re-ordering that would break the QRL semantics is possible.

For example, under "8.2.3.5 Intra-Processor Forwarding Is Allowed", we have a scenario where a CPU executing a write may see this right "before" writes of other CPUs - and that those other CPUs will see their writes first. This seems dangerous for the QRL trick, since it may be possible for the biased thread to see its own half-word write before a successful revoking CAS by another CPU is seen, but the revoking CPU may see the opposite case - CAS succeeded and no half-word write yet.

This leads to the odd case where both the write and CAS have succeeded, which is clearly impossible under sequential consistency, or any model where memory progresses through a series of well-defined states. It is not clear to me whether it is allowed under the Intel model, however, because (a) they don't discuss extensively the relationship between regular (no LOCK/fence) reads/stores and interlocked operations on different CPUs (they do point out that they can't be relatively reordered on a single CPU), (b) they don't say whether two reads/stores are considered to be to the "same address" if the operands have different lengths but one is contained within the other, and (c) they don't specifically discuss CAS, which is a compound, conditional read-then-write operation - and whether it is possible for an apparent re-ordering of a store on one CPU (due to intra-processor forwarding) that would have caused a CAS to fail to instead allow the CAS to succeed, but also let the local CPU see the store.

Well, that was a mouthful! Any insight you have would be appreciated. It is my understanding that the current mechanism in the JVM is not to use stores at all, favoring instead an invasive revocation which retroactively modifies the locking records, so I guess this topic doesn't directly apply there (but I'm also curious why this approach was favored over one which uses the regular-store trick you describe, which is presumably quite cheap).

Posted by Travis Downs on January 04, 2010 at 08:09 AM EST #

Hi Travis,

The collocation trick (sometimes referred to as "word tearing", although that term is slightly overloaded), while interesting, isn't guaranteed safe. As a practical concern, if you execute a subword ST followed by a fullword LD that covers that stored-to address, most processors will stall, allowing the store to drain before the load is satisfied. (We're assuming a fullword store to the address isn't already in the store buffer, in which case an implementation *might* merge the fullword store with the subsequent subword store and then satisfy the load by lookaside). In other words, on most implementations the ST;LDX trick has MEMBAR #storeload semantics, but this property isn't formally defined by the architecture, so we'd never consider using it in a product.

And in fact since this was written AMD and Intel have clarified their memory models, clearly indicating that such unsavory tricks aren't safe in the long term. (Perhaps it works today, but it may not on tomorrow's processor).

Furthermore, the performance of the technique, even if it were somehow deemed safe, isn't profoundly better than a traditional fence, as the processor needs to stall.

See also http://blogs.sun.com/dave/entry/instruction_selection_for_volatile_fences for some more up to date information on fence/barrier/membar costs.

As you point out, in our current implementation in hotspot we use a safepoint rendezvous to revoke bias. That's rather heavy-handed but it's entirely safe.

Regards, Dave

Posted by David Dice on January 04, 2010 at 02:28 PM EST #

Thanks, Dave, for your quick and comprehensive response.

I guess the warning in 8.1.2.2 of the current Intel guide is along these lines:

"Software should access semaphores (shared memory used for signalling between
multiple processors) using identical addresses and operand lengths. For example, if
one processor accesses a semaphore using a word access, other processors should
not access the semaphore using a byte access."

Not rigorous by any means (why they start referring to "semaphores" here rather than simple memory location access as in the remainder of the section is beyond me) - but a hint that this so-called "word tearing" may be a no-no.

Posted by Travis Downs on January 04, 2010 at 05:30 PM EST #

Hi Travis, I agree -- there's a good bit of interpretation required when reading the various memory model documents. BTW, I highly recommend the following: http://www.cl.cam.ac.uk/~so294/documents/draft-scTSO.pdf.

Regards, Dave

Posted by David Dice on January 05, 2010 at 12:33 AM EST #

I was re-reading your paper today, and looking at the HotSpot source code, and was curious why, when revoking a bias due to a lock attempt on an object by a thread other than the bias owner and discovering that the bias owner has the object locked, in addition to making the bias owner's thread appear to have lightweight-locked the object, you don't also inflate the object's monitor.

Is this to avoid mixing locking and revoking concerns, or perhaps in the hope that the bias owner might unlock the object before the revoking thread actually runs?

It just seems that since you're already at a safepoint, you could avoid some of the spin-nastiness associated with inflating.

Posted by graham sanderson on January 06, 2010 at 03:07 PM EST #

Hi Graham,

You're correct that if object O is biased toward T1, T1 has O logically locked, and T2 tries to lock O, resulting in revocation of T1's bias from O, then we could transition O directly to the inflated state. Broadly, though, we were trying to keep the state transitions as simple as possible. As you point out, for the sake of code stewardship sometimes it's better to avoid mixing concerns. Also, the inflation path, as you'd expect, is buried in the slow-path contended "enter" mechanism. It's prudent to leave it that way. Also, the dominant case is that O, while biased, is logically unlocked.

Regards, -Dave

Posted by David Dice on January 07, 2010 at 07:07 AM EST #

Some info on System.identityHashCode: using it to avoid biased locking causes another problem with the default settings: it uses os::random(), which contends on singleton state.
The latter results in a lot of sharing. -XX:hashCode=-1 disables it and relies on Marsaglia's random, which has thread-local state, i.e. it is concurrency friendly.

Posted by bestsss on April 04, 2013 at 07:17 AM EDT #

Thanks for the comment regarding identityHashCode. I implemented the alternative -XX:hashCode=N variants a few years ago but the J2SE team hasn't had time to switch the defaults. As noted, the default is racy -- that's legal and arguably benign -- but undesirable. The global state for os::random() is also a coherence hotspot. We can see the impact of this with a simple benchmark where otherwise independent threads assign hashCodes (we assign lazily) at a high rate. The coherence traffic on the os::random() variables quickly chokes scalability. Using alternative schemes provides relief. That's precisely why I provided the Marsaglia shift-xor PRNG.
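
A Marsaglia shift-xor generator with purely thread-local state is only a few lines. The following is an illustrative Java sketch of the idea (the actual HotSpot code is native; names and seeding here are assumptions, not the HotSpot implementation):

```java
// Per-thread Marsaglia shift-xor PRNG: no shared state, so no coherence traffic.
final class ThreadLocalXorShift {
    private static final ThreadLocal<int[]> STATE =
        ThreadLocal.withInitial(() -> new int[] {
            (int) System.nanoTime() | 1    // any non-zero seed will do
        });

    static int next() {
        int[] s = STATE.get();
        int x = s[0];
        x ^= x << 13;                      // Marsaglia's shift-xor steps
        x ^= x >>> 17;
        x ^= x << 5;
        s[0] = x;
        return x;
    }
}
```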

Regards, -Dave

Posted by guest on June 03, 2013 at 10:13 AM EDT #
