Java Synchronization Optimizations

Tim Bray’s recent blog, http://www.tbray.org/ongoing/When/200x/2005/06/12/Threads has given me the kick in the ribs I needed to share some of what we’ve been working on in JVM performance at Sun. So Paul Hohensee and I came up with the following sneek peak. Tim raises a good point about the use of Java(TM) synchronization in J2SE library code; synchronization is heavily used in the core collection classes to ensure threads don’t step on shared data. Sure, there are alternative APIs that a developer can use when they know their application design ensures that won’t happen, most notably StringBuilder instead of StringBuffer and ArrayList instead of Vector. But in many cases code changes are not an option, so we’ve been working on ways for the Hotspot(TM) JVM (Java Virtual Machine) to optimize synchronization. First, a little background on object locking. When a thread attempts to lock a Java object, it will find that either no thread owns the lock or that another thread already does. In the latter case, we say that the lock is “contended”. On the other hand, if every thread that locks an object finds it unowned, we say that the lock is “uncontended”. As long as a lock is uncontended, it stays in what we call a “thin” state, in which the lock state is recorded only in the object header and thread-local storage (in the Hotspot JVM, the thread stack). The JVM executes fast path code code to acquire and release thin lock ownership. The fast path typically involves one or two compare-exchange instructions (lock;cmpxchg on x86 and cas on SPARC). The Hotspot JVM uses two, one each for acquire and release. When a lock becomes contended, it’s “inflated” into the “fat” state, where it’s represented by a shared data structure built around an OS mutex. Threads add themselves to a waiters list and park on the mutex, to be woken up by the runtime system at some future time. We call this process the slow path because it’s really, really expensive. It’s possible for a contended lock to eventually become uncontended. When that happens, it may be “deflated” into back into the “thin” state. So much for the background. It turns out that most locks are uncontended in Java programs, so fast path performance is relatively more important than slow path. But the fast path can be slow, because a compare-exchange instruction is atomic and must wait for all incomplete instructions and memory operations to finish. On modern multi-issue and out-of-order processors with NUMA memory systems, that can take hundreds of cycles. This is really obvious in benchmarks like SPECjbb2000 (http::/www.spec.org), where uncontended locking takes up to 20% of the total time on x86 boxes, and on the SciMark2 Monte Carlo sub-test (http://math.nist.gov/scimark2), where it can take more than half the total time. We’re working on various ways to reduce or eliminate uncontended locking overhead in the Hotspot JVM. There’s basically three approaches. First, we can get rid of the lock entirely. If the JVM can determine that the object being locked will never be accessed by any thread but the one that allocates it, then its lock can never be contended, so the JVM never has to lock it. The analysis required to figure this out is called “escape” analysis, because it determines whether or not an object can “escape” from the prison of its allocating thread. I’ve pointed to some references on how this is done at the end of this note. Second, we can notice that an unlock of an object is immediately followed by a lock of the same object and eliminate both the unlock and lock. This process is called “lock coarsening”. It does enlarge the size of the locked region, which might become a problem if the lock ever becomes contended: a long locked region means threads might have to wait awhile to acquire the lock. This is usually a non-issue though, because most locks are uncontended. Plus, locked regions are usually quite small to begin with, so making them a bit bigger is no big deal. Simple lock coarsening has been implemented in the latest mustang builds starting at build 38. See: http://www.java.net/download/jdk6/binaries/ Finally, we can make the fast path faster by eliminating the compare-exchange instruction(s) and replacing them by a compare-and-branch sequence. This is pretty tricky stuff: see the references for various ways of doing this safely. Combining these three approaches can basically eliminate most of the synchronization overhead in a typical Java application. Coming soon to a JVM near you. References: Escape analysis: http://www.research.ibm.com/people/g/gupta/toplas03.pdf http://www.research.ibm.com/people/j/jdchoi/escape-pointsto.html Thin locks: http://www.research.ibm.com/people/d/dfb/papers/Bacon98Thin.pdf Faster fast path: http://www.usenix.org/publications/library/proceedings/jvm01/full_papers/dice/dice.pdf http://www.research.ibm.com/trl/projects/jit/paper/p020-kawachiya.ps
Comments:

I heard during a Q&A during the Java Performance BOF at JavaOne last night that some of these sync optimizations will be in 1.5.0_06? Could u give a little more details on what will be in? Also, if we have a shared static variable that is mostly being read and updated very infrequently, shall we use say the atomic variable features in the java.util.concurrent package or should we stick with synchronized given the optimizations you mentioned?

Posted by Sam To on June 28, 2005 at 10:50 AM EDT #

Interesting stuff. Minor nit: could you make the urls into actual links? Ta.

Posted by Mike Moran on September 06, 2005 at 10:40 PM EDT #

s 深圳市凡超美迪科技有限公司位于繁华的深圳市布吉镇京南路,公司主营“美迪王”<a href="http://www.mdwkod.net/" target="_blank"><strong>点歌机</strong></a>由公司独立研制开发,该技术在KOD行业中遥遥领先,并申请专利。经过近几年的发展,现已成为深圳KOD<a href="http://www.mdwkod.net/company.asp">点歌机</a>、投币点歌机设备开发、生产、销售、客户服务同步发展的电子公司

Posted by fds on August 30, 2007 at 12:58 PM EDT #

Post a Comment:
  • HTML Syntax: NOT allowed
About

dagastine

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today