Monday Jun 01, 2009

X86 platform memory model - clarifications in recent Intel manuals

Intel has taken steps toward addressing some of the concerns about potential platform memory model relaxations that I identified in a previous blog entry. Specifically, see the relevant section of their latest Software Developer's Manual, which now guarantees global store ordering.

Friday May 29, 2009

Instruction selection for volatile fences : MFENCE vs LOCK:ADD

In the past the JVM has used MFENCE, but because of latency issues on AMD processors and potential pipeline issues on modern Intel processors, it appears that a LOCK:ADD of 0 to the top of stack is preferable (Gory Details).
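
The two idioms can be sketched as GCC-style inline assembly. This is a minimal x86-64 sketch, not the JVM's actual code emission, and the function names are mine:

```c
/* Full StoreLoad fence via MFENCE. */
static inline void fence_mfence(void) {
    __asm__ __volatile__("mfence" ::: "memory");
}

/* Full StoreLoad fence via a LOCK:ADD of 0 to the top of stack.  Same
 * ordering guarantee as MFENCE, but often cheaper on AMD and on modern
 * Intel processors.  Adding 0 has no semantic effect, and the stack
 * line is almost certainly already held exclusive by this thread. */
static inline void fence_lock_add(void) {
    __asm__ __volatile__("lock; addl $0,(%%rsp)" ::: "memory", "cc");
}
```

Both forms serialize the store buffer; the choice between them is purely a matter of latency on a given microarchitecture.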

Sunday May 24, 2009

memcpy() concurrency curiosities

I was recently asked to diagnose a problem a customer was encountering that involved Java and the JNI GetIntArrayElements() and ReleaseIntArrayElements() primitives. The outcome of the exploration was sufficiently interesting that I thought I'd mention it here so that other JNI users might avoid the same pitfall. Briefly, the customer was running a highly threaded application on a machine with lots of available concurrency. The Java threads would repeatedly read from an int[] array that was effectively immutable. References to the array were also passed to native C code via JNI calls. Those native methods would access the array via the GetIntArrayElements() and ReleaseIntArrayElements() primitives, but in no case did the native code ever modify the array contents. The problem was that the Java readers would occasionally observe scenarios suggesting that the contents of the array were transiently "flickering" to unexpected values -- ephemeral corruption -- and then returning to the expected values. (Recall that the array was effectively immutable, so this shouldn't have been happening). I carefully checked the Java code and native code and quickly convinced myself the source of the bug was elsewhere.

I next switched my attention to the implementation of GetIntArrayElements() and ReleaseIntArrayElements(). Briefly, GetIntArrayElements() will, as used by this particular application, (a) switch the caller's thread state back to "InVM". If there's a stop-the-world garbage collection event going on, the state transition operator should block the caller until the GC completes. At that point the thread can safely access the heap to extract the array data. The "InVM" thread state is such that GC should be inhibited while we're copying the data out. (b) malloc() a properly sized array to hold the data; (c) memcpy() the data from the heap into that just-allocated array; (d) restore the thread state; and (e) return a pointer to the just-allocated array to the caller. ReleaseIntArrayElements() operates in a similar fashion but reverses the memcpy() direction (copying into the heap) and then releases the allocated array. Note that in our case the source and destination memcpy() regions are completely disjoint. My first thought was that something might have been wrong with the state transition operations, allowing what should have been stop-the-world GC to inadvertently run concurrently with the memcpy() operations. Further investigation ruled that out, as there were no GC events when the apparent corruption was observed.

Upon reflection, however, I recalled that some optimized memcpy() implementations may execute explicitly coded benign speculative stores into the destination region. Another possibility is the block-initializing-store (BIS) used by memcpy(). A concurrent observer might fetch from a memcpy() target cache line in the narrow timing window between when the line was zeroed and when the contents were copied back into the line. In either case, the memcpy() in ReleaseIntArrayElements() could transiently mutate the contents of the Java array in the heap to some unexpected value. By the time the memcpy() finishes, of course, the destination buffer will again contain the expected contents. I quickly confirmed this scenario might occur by writing a throw-away multi-threaded C test case that modeled the JNI behavior and demonstrated the effect. The underlying mechanism at work deserves a bit more explanation. Let's say we're writing C code and have an effectively immutable array A containing a known set of values. Concurrently, we have one thread reading from A and a second thread memcpy()ing precisely the same set of values known to be in A back into A. At first glance it's not unreasonable to assume that the reader will always see the expected values in A. That is, we're assuming that memcpy()ing the same values back into an array is idempotent. The memcpy() operation is certainly idempotent from the point of view of the caller, but interestingly it's not idempotent from the point of view of concurrent observers. With optimized memcpy() implementations, concurrent observers can see "odd" intermediate states. Naively, you'd expect that concurrent observers would see the array as unchanging, but that isn't the case on all platforms. In the case at hand, the memcpy() in ReleaseIntArrayElements() is the source of the problem, as the array in the heap might "flicker".
To further confirm this diagnosis I provided the customer with a simplistic byte-by-byte unoptimized memcpy() implementation in an LD_PRELOAD-able shared object. And indeed, the problem wasn't evident when running with this simple memcpy() instead of the optimized system form.
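
For readers who want to experiment, here's a minimal sketch in the spirit of that throw-away test case. The names, sizes, and iteration count are mine; whether you actually observe flicker depends on the platform, the libc's memcpy() implementation, and timing. Note that, strictly speaking, the racy reads make this a deliberately ill-synchronized program:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <string.h>

#define N 1024
#define ITERS 50000

static int expected[N];          /* the known, fixed contents        */
static int shared_buf[N];        /* the "effectively immutable" array */
static atomic_int done;

/* Writer: repeatedly memcpy() the very same values back into
 * shared_buf.  Idempotent from the caller's perspective; concurrent
 * readers may nonetheless observe transient intermediate states with
 * some optimized memcpy() implementations. */
static void *writer(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++)
        memcpy(shared_buf, expected, sizeof shared_buf);
    atomic_store(&done, 1);
    return NULL;
}

/* Reader: count apparent "flickers".  The unsynchronized reads are
 * the whole point of the model -- this is deliberately racy code. */
static long count_anomalies(void) {
    long anomalies = 0;
    while (!atomic_load(&done))
        for (int i = 0; i < N; i++)
            if (shared_buf[i] != expected[i])
                anomalies++;
    return anomalies;
}

/* Returns the number of anomalies one concurrent reader observed. */
long run_model(void) {
    for (int i = 0; i < N; i++)
        expected[i] = shared_buf[i] = 0x5A5A5A5A;
    atomic_store(&done, 0);
    pthread_t w;
    pthread_create(&w, NULL, writer, NULL);
    long a = count_anomalies();
    pthread_join(w, NULL);
    return a;
}
```

A zero anomaly count is a perfectly legal outcome; a nonzero count is permitted memcpy() behavior, not a bug.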

While technically legal, such behavior in memcpy() -- and the JNI operators that call memcpy() -- is, at best, surprising and violates the principle of least astonishment. Strictly speaking, memcpy() provides no particular guarantees about concurrently observable intermediate states, but rather only specifies the state of the destination buffer at the end of the invocation. That is, while a memcpy() is in flight the destination buffer might contain transient or ephemeral values which could be observed by other threads. Overwriting an array with an exact copy of itself is idempotent from the perspective of the caller after memcpy() completes, but concurrent observers of the array are permitted to see transient ephemeral "garbage" values. My first thought when we ran into this behavior some years ago was that it was a bug in the memcpy() implementation, but upon deeper reflection I concluded the bug was in my errant assumptions about how memcpy() worked. Having said that, while such behavior is technically permissible, I'm also sympathetic that such effects are, at best, unexpected. It's not unreasonable to assume the destination buffer would remain unmolested. One could imagine customers easily falling into this trap, in both C code and Java.

Wednesday Mar 19, 2008

Carmina Burana Reinterpreted

Among other things, musicologists recontextualize and reinterpret older works for the current generation. If you love Carmina Burana but struggle with medieval high German or ecclesiastic "church" Latin, musicologists have an easy-to-digest answer (SFW).

Friday Mar 14, 2008

CAS and Cache Trivia - Invalidate or Update In-Place

In a previous blog entry on biased locking I noted that CAS (the atomic Compare-And-Swap instruction) usually has the same semantics as store with respect to caches and the interconnect. It's worth calling out an interesting exception, however. For background, the level-1 data cache -- often shortened to L1D$ or just D$ -- of Niagara-1 and Niagara-2 processors uses a write-around policy. All stores go over the cross-bar, but if the line is also present in the D$, the store instruction (ST) will also update the line in-place. This is where CAS differs from ST. If the line targeted by the CAS is also in the D$ of the processor executing the CAS, the line will be invalidated instead of being updated in-place. Of course it's not uncommon for the line to be present in the cache, given the prevalence of the LD...CAS usage idiom. More importantly, it's extremely common for a thread to access the same line shortly after a CAS. This is where the CAS-Invalidates policy can impact performance. A good example is a lock that's acquired via CAS where the lock metadata is collocated with the data covered by the lock. Obviously, after having acquired a lock, a thread is quite likely to access data protected by that same lock. The first data access to the just-CASed-to line will miss in the D$. Another example would be a thread that repeatedly locks and unlocks the same lock. Assuming the lock and unlock operators both use a LD...CAS idiom, even if the line containing the lock metadata wasn't accessed within the critical section, the CAS in the lock operation will cause the LD in the subsequent unlock to miss, and the CAS in unlock will cause the LD in the subsequent lock operation to miss. Thankfully, on Niagara-family processors a miss in the D$ that hits in the level-2 cache can be satisfied fairly quickly. Still, we'd like to avoid that D$ miss if at all possible.
I'd argue that to the extent feasible, processor designers should avoid a CAS-Invalidates policy, which I believe to be an implementation artifact and not strictly fundamental to CAS. CAS-Invalidates is cache-antagonistic to common synchronization usage.
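
To make the pattern concrete, here's a minimal sketch (using the GCC __sync builtins; the struct and names are mine) of a CAS-acquired lock whose metadata is collocated with the data it protects -- exactly the layout where a CAS-Invalidates policy forces a D$ miss on the first post-acquire data access:

```c
/* Lock word collocated with the data it protects.  On most layouts
 * both fields share a cache line -- the case where CAS-Invalidates
 * hurts, because the data access right after the CAS misses in the D$. */
struct counter {
    volatile long lock;          /* 0 = free, 1 = held */
    long value;                  /* data covered by the lock */
};

void counter_increment(struct counter *c) {
    /* LD...CAS acquire idiom: peek first, then CAS to acquire. */
    for (;;) {
        if (c->lock == 0 && __sync_bool_compare_and_swap(&c->lock, 0, 1))
            break;
    }
    c->value++;                  /* first post-CAS access to the line */
    __sync_lock_release(&c->lock);
}
```

Under a CAS-Invalidates policy, the `c->value++` immediately after a successful CAS misses in the D$ even though the CAS itself just touched that very line.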

Wednesday Feb 13, 2008

Rock-style Transactional Memory - Lock Elision for Java, etc.

In Applications of the Adaptive Transactional Memory Test Platform, to appear in Transact 2008, we show how Rock-like transactional memory can be applied to various components of the Solaris software stack, including transactional lock elision for "synchronized" blocks in Java.

Monday Jan 28, 2008

National Wear Red Day

February 1st is National Wear Red Day. This is particularly dear to my heart because of my wife's experience with cardiac arrest last May 15th.

Friday Jan 18, 2008

Questions about Java-SE on Solaris?

Feel free to post questions to our Ask the Experts forum.

Wednesday Jan 16, 2008

Java Memory Model concerns on Intel and AMD systems

The Java Memory Model (JMM) was recently clarified by JSR-133, with the corresponding changes incorporated into chapter 17 of the Java Language Specification, 3rd edition. Doug Lea's excellent JSR-133 Cookbook reinterprets JSR-133 from the perspective of, and for the benefit of, JVM implementers. A JVM must reconcile the JMM and the memory consistency model of the underlying platform. Intel/AMD (x86) and SPARC Total Store Order (TSO) define relatively strong memory consistency models; the only architectural reordering of concern is that a store followed by a load in program order can be reordered by the platform such that the load executes before the store becomes globally visible. If we require that the store become visible before the load executes, then a serializing instruction -- typically an atomic instruction such as CAS, or a fence (MFENCE, MEMBAR #StoreLoad) -- must execute between the store and the load in question.

The JMM defines a strong memory model akin to sequential consistency (SC) for volatile accesses. On Intel, AMD and SPARC processors, it's sufficient for the JVM to execute a fence instruction after all volatile stores. In practice this means that while translating Java bytecode to native code the just-in-time compiler, or JIT, emits a fence after all volatile stores. In addition to avoiding architectural reordering through the use of fence instructions, the JIT will also avoid compile-time reordering of volatile accesses. To be somewhat more precise, a volatile load has acquire semantics and a volatile store has release semantics.
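
In C11-atomics terms, the recipe can be sketched roughly as follows. This models the JIT's code shape, it is not HotSpot source, and the names are mine:

```c
#include <stdatomic.h>

static int payload;              /* ordinary (non-volatile) field  */
static atomic_int ready;         /* models a Java volatile field   */

/* Models the JIT's shape for a Java volatile store: the store with
 * release semantics, followed by a full (StoreLoad) fence. */
void publish(int v) {
    payload = v;                                   /* plain store */
    atomic_store_explicit(&ready, 1, memory_order_release);
    atomic_thread_fence(memory_order_seq_cst);     /* fence after volatile store */
}

/* Models a Java volatile load: acquire semantics suffice on TSO. */
int is_ready(void) {
    return atomic_load_explicit(&ready, memory_order_acquire);
}

int read_payload(void) { return payload; }
```

Placing the full fence after the store (rather than before the load) is the usual choice because volatile loads vastly outnumber volatile stores in typical Java code.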

Of late, however, both Intel in their Intel® 64 Architecture Memory Ordering White Paper and AMD (in section 7.2 "Multiprocessor Memory Access Ordering" of their recently updated systems programming guide) have relaxed the definition of their platform memory models. Under their previously defined memory models, for instance, if MFENCE instructions appeared between all store-load pairs you'd effectively have sequential consistency. That no longer holds. Instead of sequential consistency we'll have slightly weaker causal consistency. (As an aside, I wonder if these specification changes apply to existing processors already in the field -- that is, they clarify the behavior of existing processors -- or if they reflect future or planned processors? I'd hope the latter). Intel claims to have analyzed a large body of existing code in the field and believes that no programs will observe the change or be adversely affected. Strictly speaking, however, existing JVMs that emit MFENCE instructions after volatile stores would be in violation of the JMM when running on processors that actually implemented causal consistency instead of the previous TSO-like model. Collectively, we could clarify the JMM yet again to admit causal consistency for volatiles. Another option would be to change the code emission in the JIT to use locked instructions or XCHG instead of MFENCE. By my reading of the new Intel and AMD documents that'd be sufficient to put the JVM into compliance with the JMM on processors with the relaxed memory model. That's likely slower, however.

Readers interested in this topic would also likely enjoy Hans Boehm's presentation Getting C++ Threads Right which touches on the analogous problem for the new C++0x memory model (Youtube video).

Update 2008-3-7: See Rick Hudson's IA Memory Ordering talk.

Update 2008-10-27: See, in particular, the POPL 2009 submission The Semantics of X86 Multiprocessor Machine Code.

Thursday Jan 10, 2008

Java Thread Priority Revisited in JDK7

In an earlier blog entry I described how Java thread priorities relate to native thread priority. We recently ran into bug 6518490 that forced me to change the default mapping of priorities on Solaris™ as previously described.

Briefly, in the Solaris IA (interactive) or TS (timeshare) scheduling classes, the effective priority of a thread is a function of the thread's assigned priority and a thread-specific system-managed behavioral factor. (That's a gross oversimplification. If you're curious, I recommend Solaris Internals, 2nd Edition, Section 3.7, followed by a skim of ts.c.) In the case of Java, the JVM maps the logical Java priority set via Thread.setPriority() to a TS- or IA-specific priority, and then calls priocntl() to set the assigned priority for the thread. The scheduler avoids indefinite starvation of low priority threads by periodically applying a boost (this is the system-managed behavioral factor) to threads that have languished too long on the ready queue without having been run. In addition, threads that block voluntarily -- as opposed to exhausting their time slice and requiring involuntary preemption -- may receive a boost to their effective priority by way of changes to the previously mentioned behavioral factor.

Let's say we have a 2x multiprocessor system, otherwise idle, with threads T1 and T2 at high assigned priority and T3 at low assigned priority. T1 and T2 loop, yielding; alternatively they might pass a "token" back and forth via a pipe. Critically, they block often, so they receive the aforementioned boost to their effective priorities. Thread T3 simply loops, computing the digits of π. T3 will languish on the ready queue. Over time the scheduler will boost T3's effective priority, but unfortunately that boost isn't sufficient to push T3's priority beyond the boosted effective priorities of T1 and T2, so T3 can starve indefinitely.

To avoid the problem I've inverted the default for the -XX:[+/-]UseThreadPriorities flag from True to False on Solaris. The default setting for the flag remains unchanged on other platforms. Running all Java threads at the same priority sidesteps the problem and avoids starvation. I considered just posting a work-around and leaving the flag polarity unchanged, but the bug is difficult to diagnose properly. A work-around, albeit simple, is useless if a customer is forced to diagnose and identify a rather opaque and exotic bug. Put another way, a work-around is no better than the diagnostic procedure used to identify the associated malady. The change to UseThreadPriorities is in JDK7 and is propagating back through JDK6 and JDK5. If you desperately want Java thread priorities to map to underlying native priorities you can use -XX:+UseThreadPriorities, but in a sense you'll be accepting the risk of the hang. You have to explicitly opt in to enable priority mapping. Priorities are a quality-of-implementation concern, not a correctness concern. Recall that thread priorities are strictly advisory, and the JVM has license and latitude to implement or change them as needed. They should never be depended upon to ensure progress (relative to other threads) or lack of progress (i.e., synchronization).

Finally, this change isn't as significant as it might seem. Under IA and TS, the upper half of the logical priority range was previously mapped to native "high" and the lower half was mapped over the range of native priorities. The only observable difference between the old behavior and running with thread priorities disabled is that Java threads assigned priorities in the lower half of the logical range will now run at the same effective (native) priority as threads in the upper half. When competing for CPU cycles with other threads, these low priority threads will receive relatively more cycles from the scheduler than they did in the past.

Tuesday Apr 24, 2007

More musical parody - interpret this as you may

Alanis Morissette parodies "My Humps". Deep social commentary, shallow parody, both, or something else altogether? It's your call.

Sunday Mar 11, 2007

kill-dash-nine song

Would you consider it good or bad if you understand all or most of the references in kill-dash-nine song? Consider it a geek litmus test. (NSFW, NPC2007)

Monday Feb 26, 2007

Understanding Tradeoffs in Software Transactional Memory

Understanding Tradeoffs in Software Transactional Memory, by Nir Shavit and myself, was accepted at CGO-2007. The paper describes the TL1 (Transactional Locking) mechanism. The slides presented at the conference are available as well. (We first described TL1 at Transact'06 in Ottawa). TL1's successor, TL2, authored by myself, Nir Shavit and Ori Shalev, was published in DISC'06.

Briefly, STMs such as TL1 periodically revalidate the read-set during a transaction. This is necessary as a transaction might have read a set of values that are inconsistent and then, based on that inconsistent data, enter an infinite loop, generate traps, or otherwise misbehave. We refer to such a transaction as a zombie: it has observed inconsistent state and can never successfully commit, but it's still running. We could avoid zombie transactions by revalidating the entire read-set after each transactional load. That's sufficient to avoid zombies, but the cost is quadratic in the size of the read-set. Instead, TL1 admits zombies (transiently) and validates periodically. The period of zombie execution is bounded. But the TL1 runtime environment must also provide zombie containment to gracefully tolerate the misbehavior of zombie transactions.
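
A minimal sketch of the periodic-validation idea follows. The data structures, names, and validation period are mine, not TL1's actual implementation:

```c
#include <stddef.h>

#define MAX_READS 64
#define VALIDATE_PERIOD 8        /* validate every 8th transactional load */

/* One read-set entry: the location's version word and the version we saw. */
struct rs_entry { const unsigned long *vlock; unsigned long seen; };

struct txn {
    struct rs_entry rs[MAX_READS];
    size_t n_reads;
    int aborted;
};

/* Full read-set revalidation: O(read-set size) per call. */
int txn_validate(const struct txn *t) {
    for (size_t i = 0; i < t->n_reads; i++)
        if (*t->rs[i].vlock != t->rs[i].seen)
            return 0;            /* a version moved: inconsistent */
    return 1;
}

/* Transactional load: record the version observed, and revalidate only
 * periodically.  Between validations the transaction may run as a
 * zombie, but only for a bounded number of loads. */
long txn_load(struct txn *t, const long *addr, const unsigned long *vlock) {
    if (t->n_reads == MAX_READS) {       /* read-set overflow: abort */
        t->aborted = 1;
        return 0;
    }
    t->rs[t->n_reads].vlock = vlock;
    t->rs[t->n_reads].seen = *vlock;
    t->n_reads++;
    if (t->n_reads % VALIDATE_PERIOD == 0 && !txn_validate(t))
        t->aborted = 1;
    return *addr;
}
```

Validating after every load would call txn_validate() n times over read-sets of average size n/2, hence the quadratic cost; validating every VALIDATE_PERIOD loads trades that cost for a bounded window of zombie execution.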

Another way to avoid zombies is to use so-called visible readers. Conceptually each transactional variable maps to a read-write lock. The first time a transaction reads from a location it acquires the read-lock. And of course transactional writes acquire the lock for writing. Visible readers avoid zombie execution but at the cost of considerable read-write coherency traffic on the read-write locks.

TL2 does not admit zombie execution, potentially making it more suitable for unmanaged C or C++ runtime environments. And it manages to avoid zombies with validation costs only proportional to the read-set size, and without visible readers.

Sunday Feb 25, 2007

Geocaching - In the median strip

Warning: This video clip could be spoiler for the final of a rather interesting multicache.

Monday Nov 13, 2006

Open Source Java - Crossing the Rubicon

At this point I should probably employ a pithy turn of phrase or draw a clever analogy between Sun's announcement of open source Java and some interesting historical event. But if you've read this far I'll spare you that pain. There are considerable high-level implications to Sun's actions, but those are well-covered elsewhere so I'll stick to local prognostication. The HotSpot sources are now accessible through an easy-to-view opengrok instance, similar in spirit to that used by Open Solaris. And of course there's svn access or zipballs too. Bloggers will be able to annotate their blogs with pointers into the source, providing a much richer narrative and enabling better engagement with the larger community. My personal hope is that the HotSpot JVM will be more appealing to academic and research environments. I look forward to those collaborations.
Alea Iacta Est.

Monday Oct 02, 2006

Hardware-assisted transactional read-set validation

The LoadLocked and StoreConditional primitives, commonly referred to as LL-SC, provide for optimistic concurrency control. To atomically increment a shared word in memory, we could execute LL on the variable, fetching it into a register; increment that register; and then try to execute SC to conditionally store the updated register value back into memory, overwriting the previously fetched value. The SC instruction will fail if another processor managed to update the variable between the LL and SC, in which case the program might typically retry the operation. In most LL-SC implementations, the processor executing LL tracks the location using the existing snoop-based cache-coherency mechanism, watching for remote updates. In a sense, LL-SC provides a degenerate form of hardware transactional memory (HTM) where the read-set (the set of shared variables read during the transaction) and the write-set (the set of shared variables speculatively stored into during the transaction) are collectively restricted to just one location. LL-SC is available on PowerPC, MIPS, Alpha and forthcoming multiprocessor ARM processors. A related instruction, compare-and-swap (CAS), is available on SPARC, IA32 and AMD64. CAS and LL-SC are key primitives for lock-free algorithms. LL-SC can be extended to allow multiple disjoint locations in the read-set and write-set, yielding a more general HTM.
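
The retry idiom for the increment example looks roughly like this in C, here sketched with the GCC __sync CAS builtin standing in for SC:

```c
/* Atomic increment via the LL-SC retry idiom.  A plain load plays the
 * role of LL and a CAS plays the role of SC: the conditional "store"
 * succeeds only if no other processor updated the word in between. */
long atomic_increment(volatile long *addr) {
    long old, nv;
    do {
        old = *addr;        /* "LoadLocked": optimistic fetch */
        nv = old + 1;
    } while (!__sync_bool_compare_and_swap(addr, old, nv));  /* "SC" */
    return nv;
}
```

One subtle difference: true LL-SC fails on any intervening write to the line, whereas CAS compares values and can therefore fall prey to the ABA problem; for a simple counter the distinction doesn't matter.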

First generation HTMs, even when available, are likely to place restrictions on the size of the read-set and write-set. As such Software Transactional Memories (STMs) are of interest. STMs can be implemented on today's stock hardware.

Both STMs and HTMs typically ensure that the data observed by a transaction (that is, its read-set) is mutually consistent. For instance, let's say that we have shared variables A and B with the invariant that (A+B) is even. A and B are both initially 5. Transaction T1 reads A and observes 5. Transaction T2 then updates A to 4 and B to 6. T1 next reads B and observes 6. We say that T1 has read inconsistent data. In this case T1 must abort and retry the transaction. Databases often provide the so-called ACID transactional properties - Atomicity, Consistency, Isolation and Durability - whereas transactional memory implementations usually provide just ACI.

HTMs provide consistency by snooping for remote invalidations to the current transaction's read-set. STMs typically provide consistency in one of two ways. First, the STM can implement so-called visible readers. Conceptually, each transacted upon location is covered by a read-write lock. A visible reader acquires the read-lock for each element of the read-set at the time the transactional read is performed. In a sense, this is pessimistic as the read-lock ensures that the values previously read in the transaction remain unchanged; the read-lock prevents an update transaction from acquiring the write-lock until the reader's transaction completes. In the example above transaction T1 would acquire the read-lock covering A, preventing transaction T2 from acquiring the write-lock needed to update A until T1 completes. Unfortunately visible readers require that readers write shared transactional metadata to acquire and release the read-write locks. See the following blog entry for a discussion of why we want to minimize writes to shared data and metadata.

STMs such as TL associate a versioned write-lock with each transactional location. The version number is advanced after each update. A versioned write-lock is similar in spirit to a Linux seqlock. The STM tracks the version numbers associated with the read-set and validates the read-set at commit-time. This provides invisible readers; no writes (stores) to shared metadata are necessary to track the read-set. The technique is optimistic in that it detects and recovers from inconsistent data, whereas visible readers prevent inconsistent data.
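
A versioned write-lock can be sketched with C11 atomics roughly as follows. This is a single-location toy, not TL2 itself; the low bit serves as the write-lock and the remaining bits as the version:

```c
#include <stdatomic.h>

/* One transactional location guarded by a versioned write-lock:
 * low bit = write-locked, remaining bits = version (even when free). */
typedef struct {
    atomic_ulong vlock;
    long value;
} tvar;

/* Writer: lock, mutate, then unlock while advancing the version. */
void tvar_store(tvar *t, long v) {
    unsigned long ver = atomic_load(&t->vlock) & ~1UL;
    while (!atomic_compare_exchange_weak(&t->vlock, &ver, ver | 1UL))
        ver &= ~1UL;                         /* retry until we lock it */
    t->value = v;
    atomic_store(&t->vlock, (ver | 1UL) + 1UL);  /* unlock, version+2 */
}

/* Invisible reader: sample the version, read, then revalidate.
 * Returns 1 and fills *out on a consistent read; 0 means
 * interference was detected and the caller should retry.  Note the
 * reader performs no stores to shared metadata. */
int tvar_load(tvar *t, long *out) {
    unsigned long v1 = atomic_load(&t->vlock);
    if (v1 & 1UL)
        return 0;                            /* write in progress */
    long val = t->value;
    unsigned long v2 = atomic_load(&t->vlock);
    if (v1 != v2)
        return 0;                            /* version moved under us */
    *out = val;
    return 1;
}
```

An STM generalizes this by recording the sampled versions for every location in the read-set and replaying the v1-equals-v2 check over all of them at commit-time.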

STM read-set consistency management of either flavor - visible or invisible readers - can constitute a considerable fraction of the transactional overhead. In addition, for most transactions the size of the read-set typically dominates the size of the write-set. (This also holds for normal code - loads greatly outnumber stores. Depending on which literature you prefer, the load:store ratio is usually given as 4:1 or 3:1). Nir Shavit and I propose a hybrid scheme where read-set consistency is managed by hardware but the write-set is managed in software through locking, as is done in TL (TL1) or TL2. A traditional HTM as described in the literature handles both the read-set and write-set in hardware, whereas we propose decoupling read-set and write-set management. The goal is to reduce transactional latency by making a hopefully modest change in the processor, leveraging existing cache coherency constructs. To be profitable, the hardware read-set tracking would need to be more efficient -- have lower latency -- than the explicit version number checks used by STMs such as TL (TL1) or TL2. (The version number checks ensure that the read-set is mutually consistent).

STMs such as TL1 periodically revalidate the read-set during a transaction. This is necessary as a transaction might have read a set of values that are inconsistent and then, based on that inconsistent data, enter an infinite loop, generate traps, or otherwise misbehave. We refer to such a transaction as a zombie: it's fated to eventually abort but it's still running. We can avoid zombie transactions by revalidating the entire read-set after each transactional load. That's sufficient to avoid zombies, but the cost is quadratic in the size of the read-set. Instead, TL1 admits zombies (transiently) and validates periodically. But the TL1 runtime environment must also provide zombie containment to gracefully tolerate the misbehavior of zombie transactions. Note that TL2 does not admit zombie execution, potentially making it more suitable for unmanaged C or C++ runtime environments. Relatedly, STMs that use visible readers avoid zombies too, but at the cost of considerable read-write coherency traffic. If we have hardware-assisted read-set validation, and notification that a thread's read-set has been "spoiled" by a remote write is immediate (say, by interrupt delivery), then we can use schemes such as TL1 but still avoid zombies and the complexity of zombie containment.

Hardware read-set management could be accomplished in various ways. First, we might modify the existing data cache (D$) to tag and track the lines containing read-set elements. Remote snoop invalidations to such tracked cache lines that occur before commit-time would result in in an forced abort. This leverages the existing cache-coherency protocol. One problem with this approach is that loads during a single transaction might cause earlier transactional loads from the same transaction to be displaced from the cache because of insufficient cache associativity. If that were to happen we'd lose the ability to track displaced line and thus the ability to guarantee consistency. The transaction would need to abort. Retrying is useless, of course, as the resource deficit is local and would simply recur. To avoid such problems we might use a high-associativity cache similar to a victim cache. Such as specialized cache wouldn't necessarily need to contain data, but rather just tags and tracking metadata. This idea is vaguely similar to the store-miss accelerator proposed by Chou et al., in IEEE Micro'05. Both of these approaches have capacity limits, however, and some transactions may not be locally feasible. To avoid capacity limits we might provide each processor with a local hardware Bloom filter keyed by physical address. The contents of the filter would be reset at the start of a transaction. As transactional reads were performed, those addresses would be added to the local Bloom filter. That is, the Bloom filter would track the read-set. Remote invalidation probes would test the Bloom filter and if there was a hit, cause the current transaction to abort. This mechanism admits false-positive aborts, but by proper sizing of the Bloom filter they could be made relatively rare. (Critically, a Bloom filter is conservative and does not admit false-negatives - it will not fail to miss any instances of read-set interference. 
In the absence of remote coherency traffic, all local transactions will be feasible).
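
A software sketch of the proposed per-processor Bloom filter follows; the sizes and hash functions are arbitrary placeholders of mine, and real hardware would key on physical addresses:

```c
#include <stdint.h>
#include <string.h>

/* Per-processor read-set Bloom filter.  Conservative: remote probes
 * may see false positives (spurious aborts) but never false negatives. */
#define FILTER_BITS 4096
static unsigned char filter[FILTER_BITS / 8];

static unsigned hash1(uintptr_t a) {
    return (unsigned)((a * 2654435761UL) >> 16) % FILTER_BITS;
}
static unsigned hash2(uintptr_t a) {
    return (unsigned)((a * 40503UL + 0x9e3779b9UL) >> 8) % FILTER_BITS;
}

/* Reset at the start of each transaction. */
void filter_reset(void) { memset(filter, 0, sizeof filter); }

/* Called on each transactional read: add the address to the read-set. */
void filter_add(uintptr_t addr) {
    unsigned h1 = hash1(addr), h2 = hash2(addr);
    filter[h1 / 8] |= (unsigned char)(1u << (h1 % 8));
    filter[h2 / 8] |= (unsigned char)(1u << (h2 % 8));
}

/* Called on each remote invalidation probe: nonzero means possible
 * read-set interference, so abort the current transaction. */
int filter_test(uintptr_t addr) {
    unsigned h1 = hash1(addr), h2 = hash2(addr);
    return ((filter[h1 / 8] >> (h1 % 8)) & 1) &&
           ((filter[h2 / 8] >> (h2 % 8)) & 1);
}
```

Because membership is recorded as set bits rather than tracked lines, the filter never overflows: the cost of a too-small filter is extra false-positive aborts, not lost tracking.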

For additional details see the following write up.

Most modern speculative out-of-order (OoO) processors already provide coherence tracking for speculatively loaded operands. Transparently to the programmer, as an optimization, loads might be executed more aggressively than program order would seem to allow. In that case the speculatively loaded-from addresses are tracked. If another processor writes to a speculatively loaded-from location before the commit "wave front" passes the load, the processor will quietly rerun the load and discard any speculative state that depends on it. This tracking allows OoO execution but preserves the platform's memory model. I recommend "Memory Ordering: a Value-based Approach" and "Deconstructing Commit".

Finally, Chris Purcell and Tim Harris mention a clever idea in their paper Implementing Multi-word Atomic Snapshots on Current Hardware where they use existing performance counters available on most processors to detect read-set interference. Their scheme works for read-only transactions and requires two passes over the read-set.

Further reading: Bulk Disambiguation of Speculative Threads in Multiprocessors by Ceze, et. al.

Monday Sep 25, 2006

Parallelism Manifesto

Java provides threads to express parallelism and locks to restrict or prune that parallelism. This subtractive model seems somewhat awkward: threads give parallelism and locks take it away. (I'm intentionally ignoring lock-free programming and optimistic transactions for the purposes of this rant. Note too, that multithreaded programs and thus locks are still required on uniprocessors, but I'd argue that threads are used on uniprocessor systems only to achieve IO-compute overlap. Better asynchronous IO facilities provide a better solution than threads.) Think of taking one drug [parallelism] as a remedy for a disease [limited single-CPU performance] and then needing a second drug [locks] to counteract the side-effects [races] of that first drug [parallelism]. Remember, true parallelism is not a feature -- it's just a remedy, and a problematic one at that. Just as Java removed explicit memory management from C and C++, I expect that the next generation of programming languages may remove threads and locking as we know them today. Explicit threading will still be available, of course, but ideally I'd like to see the world switch to a model where, by default, threads are managed by the runtime environment. Ultimately, we'd disappear threads (sadly, disappear has entered regular usage as a transitive verb). Programmers would describe potential parallelism with new language constructs but avoid explicitly prescribing threads. Such a descriptive model is better than our current prescriptive model, as we trade two mechanisms (threads and locks) for one construct used to express potential parallelism. By default, execution would be serial unless the programmer explicitly marked it as potentially parallel. That's inverted from our current model. The managed runtime would automatically provision the process with real threads as needed.

Of course you can approximate a nearly thread-free programming model today with deft use of thread pools in the java.util.concurrent libraries. But instead of a library implementation I'm proposing first-class language integration. Relatedly, one might correctly argue that the descriptive model described above could be emulated today with very light-weight threads. In such a model we'd use threads and implicit cyclic barriers to express the parallel regions. The "body" of the thread's run() method would typically be short and not contain any synchronization operations. While workable, that approach isn't a complete solution. Not all concurrency problems map cleanly or easily onto the fork-join model. See also Cilk and Fork-Join parallelism.
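The thread-pool approximation described above can be sketched with java.util.concurrent: the programmer describes independent subtasks and joins them at an implicit barrier, leaving thread provisioning to the pool. This is a minimal illustration, not code from the original entry; the class and method names are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Describe potential parallelism as subtasks; the pool decides how many
// real threads to use. invokeAll() acts as the implicit cyclic barrier.
public class ParallelSum {
    static long sum(int[] data, int nChunks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors());
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            int chunk = (data.length + nChunks - 1) / nChunks;
            for (int lo = 0; lo < data.length; lo += chunk) {
                final int from = lo, to = Math.min(lo + chunk, data.length);
                tasks.add(() -> {               // short, synchronization-free body
                    long s = 0;
                    for (int i = from; i < to; i++) s += data[i];
                    return s;
                });
            }
            long total = 0;
            for (Future<Long> f : pool.invokeAll(tasks)) total += f.get();
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

Note that the subtask bodies contain no locking at all; the only coordination is the join at the barrier.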

The key economy we're grappling with is really human scalability (developers), not machine scalability. (Programmer scalability is just as important as program scalability). Parallelism has arrived on our desktops and the free lunch is over. Traditional concurrent programming has been error-prone and difficult, making it expensive and best left to a small priesthood of specialists. That situation can't stand. It's probably true that humans are naturally better at sequential reasoning and programming (c.f., dichotic listening). You can even draw a sketchy analogy between human multitasking (NY Times, 2007-3-26) and parallel programming. To the extent possible, we need to provide languages and constructs that allow programmers to "think sequentially" but "execute concurrently". There's no reason that the programmer who has domain expertise in international tax law should also be required in the future to become a concurrency expert.

Sun's Fortress language is an interesting step in the right direction. Fortress provides loop-level parallelism by default. (If you're interested in how concurrency is expressed in modern programming languages then IBM's X10 is also worth a look).

Note that we're not describing automatic parallelization, where the compiler or runtime analyzes the dependencies in a serial instruction stream, converting implicit instruction-level parallelism (ILP) to thread-level parallelism (TLP). That's the holy grail, but it remains an open research topic. It's also likely that for most codes, the benefit is bounded. This is the same reason we're seeing diminishing returns from out-of-order and speculative execution in modern processors.

Since locks are problematic, a useful interim step is to provide transactions in lieu of locks. Think synchronized but without the need to specify an object. Some researchers prefer atomic { ...code... } instead, but it's all syntactic sugar over the same concept. Briefly, transactions provide correctness as a first-order property. The programmer doesn't need to be concerned about which variables are protected by which locks or locking protocol. Deadlock is impossible as well - atomic sections are composable. As a secondary benefit, depending on the application and the implementation of the transactional subsystem, the program might enjoy more potential parallelism and thus improved throughput. There's often no need for the programmer to design complex and error-prone fine-grained locking protocols to extract additional parallelism. (Of course hand-coded algorithms will always have a performance edge, but we hope that STMs will often be able to achieve competitive performance). This particular area is where STM (Software Transactional Memory) research intersects language design. As you'd expect, much of this remains an open -- and very active -- area of research. If you're curious, you might enjoy reading about the atomjava project as well as watching this video interview with Simon Peyton-Jones and Tim Harris of Microsoft Research or the following Google TechTalk video on transactional memory.

As an aside, some have suggested MPI or OpenMPI as a workable parallel programming model. Message passing has a certain aesthetic appeal, but I don't see it as the way forward. See Gosling's blog entry on the topic. Relatedly, a segment of the concurrency community believes that Erlang (or a programming model like Erlang) is the right path for the future of concurrency. Erlang doesn't allow traditional threads and shared memory, but instead provides efficient asynchronous message passing between processes. A process cannot address, read, or mutate the memory of another process. Many of the bugs we face today in a classic threads + shared-memory programming model aren't even expressible. Minimizing shared mutable data is generally a good practice; Erlang simply takes that philosophy to the extreme and eliminates shared mutable data, providing communication through explicit message passing instead. (Of course you could approximate the Erlang model in Java by partitioning data as follows: Mutable unshared data would be thread-local -- one thread couldn't even reach another thread's unshared data. Shared immutable data is benign. Finally, you could ban shared mutable data and communicate through java.util.concurrent queues with serialized objects.) The Erlang programming model is seductive, but we don't know if it scales out or maps to large problems where state can end up dispersed. Collectively, we have more experience with the traditional threads memory model. Interestingly, David Patterson mentioned in his Feb 2007 Stanford EE380 talk that recent programmer productivity studies suggested message passing might be more difficult to master. (See also The Landscape of Parallel Computing Research: A View from Berkeley.) Distributed programming brings its own benefits as well as problems. I suspect in the future we'll use hybrid environments with message passing between nodes and threads with shared memory for intra-node cooperation.

It's also worth noting that concurrency isn't a panacea and must be used with care. It's entirely possible to encounter negative scalability. In the classic case, if you plot throughput on the Y-axis and concurrency on the X-axis you'll see either a concave graph or a monotonically descending line. If the communication costs (shared memory coherency traffic, synchronization, etc.) exceed the benefits derived from compute parallelism then you may encounter this problem. In addition, we'd like to use fine-grained parallelism to maximize system utilization, but fine-grained parallelism typically incurs higher communication costs.

Finally, it's worth commenting on language closures. Normally you wouldn't think there's much connection between closures and parallelism, but in practice closures are an excellent way of expressing certain actions. Think of a closure being passed into a ParallelApply() method that operates on a collection and you'll quickly see the benefits of closures. BTW, the Scala programming language provides a rather elegant closure model. I've only played with it briefly, but Scala looks promising. It's a nice shiny new language but the compiler generates Java bytecode so you can take advantage of your highly optimized JVM and mature Java libraries. Closures are also a hot topic for Java, of course. For a balanced look at the design choices I recommend John Rose's blog entry.
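To make the ParallelApply() idea concrete, here is a hedged sketch in Java. ParallelApply is the hypothetical method named above, not a real library API; this version leans on parallel streams purely for illustration.

```java
import java.util.Arrays;
import java.util.function.IntUnaryOperator;

// Hypothetical ParallelApply: the "closure" op is applied to every element
// of the collection, and the runtime is free to run chunks in parallel.
public class ParallelApply {
    static int[] apply(int[] a, IntUnaryOperator op) {
        return Arrays.stream(a).parallel().map(op).toArray();
    }
}
```

The caller expresses only *what* to apply; how much real parallelism to use is left to the runtime, which is exactly the descriptive model argued for above.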

Monday Sep 11, 2006

A tool to decode hs_err files

The Perl script parses an existing hs_err file and annotates it with additional information. In particular the script will decode the rather cryptic "Error ID" string as well as make an attempt to decorate the module-relative offsets (e.g., [jvm.dll+0x1234]) found in native stack listings with more useful symbolic names.

The script is provided as-is and is entirely unsupported. It works only on J2SE 6 reference platforms. To get started run with the --help option. Ideally, future versions of the JVM will integrate this feature and there'll be no need for the script.

Friday Sep 08, 2006

MVM in HotSpot - Bright, Shiny Objects

MVM -- the Multi-Tasking Virtual Machine technology developed by SunLabs -- is extremely useful in some environments. There's considerable debate, however, about why MVM is not part of J2SE. For balanced background reading I'd recommend starting with the MVM article, Chet Haase's blog entry on MVM, and Greg Czajkowski's blog. (Greg was one of the creators of MVM while he was at SunLabs). Briefly, the JVM can be thought of as providing a virtual machine. (That's a tautology, but bear with me). Similarly, Java threads are vaguely analogous to virtual processors. There's no concept, however, of a process or address-space isolation in the JVM. MVM adds that missing construct under the name of task. MVM allows independent Java tasks to coexist and run safely in a single JVM without any risk that the tasks will interfere with each other. One of the key mechanisms MVM uses to provide isolation is the replication of class static data; each task binds to a private instance of class static data. MVM also leverages Java's existing type-safety properties: a thread in one task, for instance, can't forge a reference to objects used in other tasks. Most operating systems provide address-space isolation via the hardware memory-management unit (MMU), but MVM provides isolation via software type-safety. (This is actually an old idea. The Burroughs B5000 series provided full application isolation but without MMU isolation and without the concept of hardware privilege. In fact the JVM borrows a number of key concepts from the B5000, but that's a story for another day). AppDomains in Microsoft's CLR are similar to MVM tasks. MVM offers the possibility of a number of direct benefits such as fast inter-task communication, decreased startup latency, fine-grained resource management, and improved memory footprint.

MVM is extremely useful - you might even say required - in environments where the platform doesn't provide a process-like construct. This could be because the operating system is too small or because the MMU doesn't provide address-space isolation. MVM can step in and make up for the missing platform features, enabling the user to run multiple independent Java applications. This makes MVM a good choice for small embedded environments often inhabited by J2ME. J2SE, however, typically runs on more capable systems, such as Solaris, Linux, or Windows, that already provide mature and efficient process-based isolation. At that point, we have to ask ourselves whether we want to reinvent the wheel in J2SE and try to do in user mode what the operating system and processor already do quite well.

The Isolate JSR (JSR121) is often conflated with MVM. If a JVM employs MVM technology, then isolates would be the logical way to express a task. But isolates can also be implemented without MVM, using the existing task-per-process model.

Properties of MVM:

Startup MVM provides exceptionally fast startup for the 2nd and subsequent tasks. In a sense, the JVM is already on hot standby, running some tasks, but ready to accept and start a new task without the need to create a new process and run through JVM initialization. It's worth noting, however, that class data sharing (CDS), which was developed after MVM, has improved startup performance in J2SE. A simple alternative to MVM is to use class loader-based isolation. Under loader-based isolation each "task" runs under its own class loader instance. Application classes will have private per-task class static data. It's not perfect isolation, however, as the static variables in the system classes loaded by the boot class loader will not be replicated. That means that one task might take actions that modify class static variables in the system classes and another task could see those changes. Depending on your environment, of course, this is unlikely to be a practical concern. If you're interested in this approach you might find NailGun of use.

It's worth mentioning that MVM is not multi-user. A single MVM process cannot normally be safely shared by multiple users, so each user would need at least one MVM process. In addition, MVM provides data isolation at the Java level, but not for native code that might be called by Java. Application-specific state could inadvertently be captured and stored by native JNI code. MVM can't and doesn't prevent that from happening.

Footprint It's commonly thought that MVM would offer considerable footprint savings. That is, running applications A, B and C in a single MVM process will consume much less memory than running A, B and C as distinct traditional JVM processes. This sounds like a compelling benefit and so the point deserves more attention. Let's consider the memory consumption of application A running in a traditional JVM. First, we'll have all the code associated with the JVM itself (the .text segment of jvm.dll or libjvm.so). That's already shared by virtue of the operating system, however. Next, there's mutable native data; that's unshared and largely unsharable. The same applies to thread stacks -- unshared and entirely unsharable. Next, we have plain vanilla mutable live heap data. As you'd expect that's unshared and unsharable. Critically, for a large class of enterprise applications the live heap data dominates footprint. Next we have the class file image data. The class files for rt.jar are shared to the extent possible by CDS. The in-memory class file images for application A are not currently shared in J2SE 6. Likewise, the code emitted by the just-in-time compiler (JIT) is unshared. So if we run A, B and C under MVM we won't actually enjoy much savings as compared to running three processes. Paraphrasing James Carville, "It's the heap!". MVM could share some in-memory class file data images and metadata, and potentially some emitted code, but that's it. Those components of footprint are typically just a small fraction of the heap. Very small applications (consider A, B and C as ls, cat and ps written in Java), however, would see more relative footprint benefit under MVM as they would typically have small heaps.

I should point out that we're at risk of engaging in circular reasoning. Small applications (e.g., ls or cat) might be less common in Java because of startup and footprint issues. Because the class of small applications is itself small, J2SE might continue to focus on larger enterprise-class applications, perpetuating the cycle. MVM would certainly provide excellent startup time for such small applications. In addition, for such small applications the heap wouldn't dominate footprint and the sharing fraction under MVM would be much higher. We have to beware that we don't preclude an entirely new class of applications.

Resource Control MVM holds the promise of very fine-grained resource control. But again, this might be better handled by the host operating system or virtual machine monitor (VMM). You can achieve much the same result with existing JVM command line switches, Solaris zones or virtualization mechanisms such as VMware or Xen. Virtual machines make an excellent "container" for resource management in addition to providing checkpoint-restart capabilities that can be used to circumvent startup latency.

MVM is great technology, and in fact I'd claim it's mandatory for J2ME on smaller platforms. It's arguably superfluous for J2SE, however, as the underlying operating systems already provide safe and efficient page-level sharing for the major sharable components in the JVM's address space. If you're interested in MVM-like capabilities -- particularly startup -- I'd recommend giving NailGun a try. Chris Oliver's JApplication takes an even leaner approach to the problem. Also take a look at Steve Bohne's blog entry, which helps dispel some of the myths around the JVM's memory footprint. For an alternate viewpoint see MVM Fact and Fiction.

Tuesday Sep 05, 2006

SeqLocks in Java - readers shouldn't write synchronization metadata

Let's say you have data that's currently protected by a synchronized block, a ReentrantLock, or a ReentrantReadWriteLock. Most of the accessor operations are pure reads - operations that write are much less frequent. You'll enjoy better parallelism with a read-write lock, of course, but even then, the act of acquiring the read-write lock (even for reading) can cause stores into the lock metadata. If you're running on a traditional Symmetric Multi-Processor (SMP) system, concurrent threads attempting to acquire the read lock can generate coherency traffic -- cache line "sloshing" as the cache line containing the read-write lock metadata is repeatedly written and invalidated, migrating between the processors' caches. See the entry on biased locking for a description of the economics of cache coherency.

Using something akin to Linux's SeqLocks can provide a viable alternative. SeqLocks provide all the parallelism of read-write locks but pure read operations can be accomplished without writing to metadata (that is, without writing to the SeqLock). As such, concurrent readers using SeqLocks can avoid the cache line sloshing that would otherwise occur when acquiring and releasing read-locks. In addition, a pure read operation can operate without any potentially costly compare-and-swap (CAS) atomic instructions. As constituted in Linux, a SeqLock is simply a volatile word where the least-significant bit is a "LOCKED" indicator and the remainder of the word serves as a version number. A pure reader will initially read the SeqLock value, spinning as long as the LOCKED bit is set. Once the reader has observed a SeqLock value with the LOCKED bit clear, it records that value and then proceeds to speculatively read the shared data (the data protected by the SeqLock), and finally validate that the observed data is consistent by verifying that the SeqLock value remains the same. If the SeqLock value changed then the reader must re-run the read attempt. Otherwise the read attempt was successful. Critically, read operations are optimistic. Writers will try to CAS to swing the LOCKED bit from 0 to 1, at which point they have exclusive access to the data and can then safely write. Writers are immune to abort. When the writes are completed the writer simply increments the SeqLock, which clears the LSB and increments the version number field. A reader can also attempt to upgrade itself to a writer by attempting to CAS the previously observed SeqLock value into LOCKED state. That CAS, if successful, will both validate that the previously observed data is consistent and will acquire the write lock. To avoid any practical concerns about roll-over and ABA aliasing, a SeqLock is typically 64-bits. 
Of course you have to be careful with regards to compile-time reordering, ensuring that the SeqLock and the accessed data fields are appropriately marked volatile. In addition, depending on the memory model of the platform, various barrier or "fence" instructions might be needed to prevent architectural reordering (where the program order differs from the memory visibility order). See slides 19 and 20 for more details on reordering.
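A minimal Java sketch of the scheme just described, assuming a 64-bit AtomicLong whose least-significant bit is the LOCKED indicator. This is illustrative only and not from the original entry; the class and field names are invented, and volatile fields plus the AtomicLong accesses stand in for the explicit fences discussed above.

```java
import java.util.concurrent.atomic.AtomicLong;

// SeqLock sketch protecting a pair of coordinates. Even seq values mean
// unlocked; the writer CASes seq from even v to odd v+1, writes, then
// stores v+2 to release and bump the version.
public class SeqLockPoint {
    private final AtomicLong seq = new AtomicLong(0);
    private volatile int x, y;   // volatile prevents compile-time reordering

    void write(int nx, int ny) {
        long v;
        do {
            v = seq.get();
        } while ((v & 1) != 0 || !seq.compareAndSet(v, v + 1));
        x = nx;            // exclusive access while LOCKED bit is set
        y = ny;
        seq.set(v + 2);    // clears LOCKED, advances the version number
    }

    int[] read() {
        for (;;) {
            long v = seq.get();
            if ((v & 1) != 0) continue;        // writer active; spin
            int rx = x, ry = y;                 // speculative, optimistic reads
            if (seq.get() == v) return new int[]{rx, ry};  // validate, else retry
        }
    }
}
```

Note that read() performs no stores and no CAS, which is the whole point: pure readers leave the synchronization metadata untouched.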

As a variation, instead of a LOCKED bit we might use a distinguished LOCKED value. Of course that LOCKED value would never appear as a legal sequence number. The writer would use CAS to swing the version number from V to LOCKED; perform the writes; and then use a simple store to change LOCKED to V+1. In addition, if we knew we only had one writer then we could forgo the CAS that sets the LOCKED bit or LOCKED encoding and use simple stores.

In the degenerate case where the data being protected is a counter or "clock" that takes on monotonically increasing values but where the number of bits required for the counter exceeds the atomic word size of the platform, we might simply use the SeqLock itself as the low-order data word.

Note that SeqLocks could equally well be implemented with version numbers that count down or advance using a Gray encoding. We never compare SeqLock version numbers for magnitude, but only for equality. The critical property is that the LOCKED state be detectable and that subsequently generated version numbers differ from any previously generated versions that might still be in circulation (say, in the hands of a reader). This property ensures that readers avoid false-positive verification. Interestingly, in Java we can implement the SeqLock as a reference to Object. LOCKED is encoded as a null reference. The writer would release the SeqLock by: Lock = new Object(). This variant leverages Java's garbage collection mechanism to ensure that no "old" references exist in the hands of readers before allowing an object reference to potentially recycle. This approach places additional stress on the allocator and collector, but modern JVMs, particularly those with thread-local allocation buffers (TLABs) are up to the task.
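The object-reference variant might look like the following sketch, where LOCKED is encoded as a null reference and release allocates a fresh Object; the garbage collector guarantees a reader's saved reference can't be recycled while the reader still holds it. Names are invented; this is an illustration, not the original code.

```java
import java.util.concurrent.atomic.AtomicReference;

// SeqLock as a reference to Object: null means LOCKED, and each release
// installs a brand-new Object, so previously observed "versions" can
// never falsely match (GC prevents premature recycling).
public class RefSeqLock {
    private final AtomicReference<Object> lock =
        new AtomicReference<>(new Object());
    private volatile int data;

    void write(int v) {
        Object obs;
        do {
            obs = lock.get();
        } while (obs == null || !lock.compareAndSet(obs, null));  // acquire
        data = v;
        lock.set(new Object());    // release with a fresh version object
    }

    int read() {
        for (;;) {
            Object obs = lock.get();
            if (obs == null) continue;        // writer active
            int v = data;                     // speculative read
            if (lock.get() == obs) return v;  // identity compare validates
        }
    }
}
```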

Caveat Utor: SeqLocks require special care, particularly for pure read transactions. If a reader sees inconsistent data in the midst of the transaction we say that reader is a "zombie". It has possibly read inconsistent data and is fated to die; it will eventually abort, but it hasn't yet reached the validation phase. While operating as a zombie, the reader might inadvertently enter an infinite loop, throw an unexpected exception, or trigger an assertion. To avoid this, the developer may need to insert validation tests as necessary into read accessors. Validation ensures that the SeqLock value observed at the start of the transaction remains the same. That, in turn, implies that the data read up to that point in the read attempt is mutually consistent and there has been no interference from writers. If the SeqLock value differs, however, the reader should abort and retry the operation. In some simple cases it's safe to elide validation. For instance, say you have a class with three instance fields "int A, B, C;" and you want to implement the pure reader sumABC() with SeqLocks. It's perfectly safe to read and test the SeqLock, sum A, B and C into a tentative result, and then validate the SeqLock version before returning the result. In a more complicated data structure with dependent data you'd be better advised to validate in the middle of the read transaction, however. Let's say you have an instance field R that refers to some boxed value V. By convention R is non-null while the SeqLock is unlocked, but writers may store a transient null value into R before ultimately storing a non-null reference and releasing the lock. To implement a pure reader getRV(), which conceptually returns R.V, the reader would fetch R into a temporary T, validate, fetch T.V and then validate again before returning the fetched T.V value. Finally, since readers may abort, a reader shouldn't perform any I/O operations or cause any changes to the state of the system that can't be rolled back.
Writers are pessimistic; they use mutual exclusion and can't abort so I/O is safe.
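The dependent-data getRV() example above, with its mid-transaction validation, might be sketched as follows. This is a hedged illustration with invented names, not the original code.

```java
import java.util.concurrent.atomic.AtomicLong;

// Dependent-data SeqLock reader: R may be transiently null while a writer
// holds the lock, so the reader validates after fetching R (before
// dereferencing it) and again before returning R.V.
public class SeqLockBox {
    static final class Boxed {
        final int v;
        Boxed(int v) { this.v = v; }
    }

    private final AtomicLong seq = new AtomicLong(0);   // even = unlocked
    private volatile Boxed r = new Boxed(0);            // non-null when unlocked

    void setRV(int nv) {
        long s;
        do {
            s = seq.get();
        } while ((s & 1) != 0 || !seq.compareAndSet(s, s + 1));
        r = null;               // transient null is legal while LOCKED
        r = new Boxed(nv);
        seq.set(s + 2);         // release
    }

    int getRV() {
        for (;;) {
            long s = seq.get();
            if ((s & 1) != 0) continue;        // writer active
            Boxed t = r;                       // fetch R into temporary T
            if (seq.get() != s) continue;      // validate BEFORE dereferencing T
            int v = t.v;                       // safe: t was observed consistent
            if (seq.get() == s) return v;      // final validation
        }
    }
}
```

Skipping the middle validation here could mean dereferencing a transient null written by an in-flight writer, which is exactly the zombie hazard described above.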

As described, SeqLocks provide a policy of writer preference. Successfully acquiring write access will kill -- eventually cause aborts in -- any concurrent read transactions.

For Java, the SeqLock can easily be implemented with a java.util.concurrent AtomicLong. It's also critical that fields referenced in a SeqLock transaction are volatile or final. Since Java doesn't directly support arrays of volatiles, the java.util.concurrent Atomic*Array constructs will prove useful.

SeqLocks are useful where the frequency of pure-read operations greatly dominates the frequency of update transactions. Beware, however, that there are pitfalls. First, there's the issue of dependent data and the need for validation mentioned above. In addition, the write-lock is a spinlock, so it should be employed only for very short bounded critical sections. If a critical section in the pure-read transaction is excessively long, then that operation may be more vulnerable to interference from writers, as well as wasting more cycles on a futile transaction if it needs to abort. Finally, as the data should be marked volatile, the performance may vary with the platform. Depending on the memory model employed by the platform, the just-in-time compiler may need to inject load-load barrier instructions to ensure that the loads of the SeqLock and the data fields are not reordered relative to each other. On some platforms such barrier instructions incur high latency.

Regarding spinlocks, let's say that a low priority thread acquires a write-lock and is preempted. Other higher priority threads are made runnable and then spin, attempting to acquire the spinlock held by the preempted thread. To avoid livelock and ensure progress, the spinlock mechanism assumes that the holder, even though running at a lower priority, will eventually receive CPU cycles from the scheduler and will be able to exit the critical section and release the lock. That is, spinlocks assume (a) preemptive scheduling, and (b) that the scheduler will avoid complete starvation, and that all threads, even low priority threads, will eventually receive CPU cycles even if higher priority threads remain runnable. This is certainly true with the Linux O(1) scheduler. Solaris (in the IA and TS scheduling classes) and Windows have an anti-languishing feature where threads stranded on the ready queue will periodically receive temporary priority boosts until they are eventually allowed to run. Conversely, spinlocks are not appropriate for environments that lack preemption or that have true real-time threads.

Depending on the specifics of the implementation, it's possible that Hardware Transactional Memories (HTMs) might also be able to execute pure-reader transactions without any writes to shared data (metadata). The hardware might leverage the existing snoop-based cache coherency protocol to monitor and detect modifications or interference to the read-set (the read-set is the set of shared variables fetched during a transaction). Say processor P1 executes a transactional load. If the cache line underlying the transactionally fetched variable wasn't already locally in M, S or E MESI state, the transactional load would probe the line into S-state or E-state. If another processor P2 wrote to the line, it would need to probe the line into M-state, which would cause invalidation. That invalidation could be detected by P1 as interference.

Finally, it's worth remarking that various Software Transactional Memory (STM) implementations use SeqLocks, either embedded in each object or as an array of SeqLocks that is hashed into given a transactional variable's address. In STM terminology we say that SeqLocks provide invisible readers, as readers don't write to shared metadata. In STM implementations such as TL or TL2 pure read transactions can be accomplished without any stores to shared data or metadata.

See also: The Solaris "Locking Strategy for high-resolution timing".

Monday Sep 04, 2006

BerkeleyDB JE (Java Edition) on a T1000/T2000 Niagara

Charles Lamb's blog on the topic of scalable Java performance on a Niagara-class system makes excellent reading.

Friday Sep 01, 2006

Musing about Lock-free and wait-free algorithms for real-time environments

Intuitively, programmers tend to think that lock-free algorithms are a good match for real-time environments. On a multiprocessor system, however, a low-priority thread might repeatedly perform a short lock-free update that causes a higher priority thread trying a longer update to fail and retry repeatedly. In theory this could lead to indefinite postponement and lack of progress for the higher priority thread. (Because the operation attempted by the higher priority thread is longer, it's also more vulnerable to interference). Wait-free algorithms provide a solution, of course. Wait-freedom guarantees that each individual thread will make progress after a fixed amount of (virtual) processing time. Unfortunately we don't know of as many wait-free algorithms as we do lock-free. But if you care about real-time, RTJ (real-time Java) provides locks with all the semantics you'd want.

The same argument, by the way, will apply to many hardware and software transactional memory implementations. A lower priority thread may be able to cause a transaction in a higher priority thread to repeatedly abort.

Thursday Aug 31, 2006

java.util.concurrent ReentrantLock vs synchronized() - which should you use?

In J2SE 6 there's little difference - either should suffice in most circumstances. ReentrantLock might be more convenient if you need to implement a hand-over-hand locking protocol. A classic example of hand-over-hand locking would be a thread that traverses a linked list, locking the next node and then unlocking the current node. That's hard to express with Java's lexically balanced synchronized construct. Synchronized, however, is a mature and first-class language feature.
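The hand-over-hand pattern described above might be sketched roughly as follows. This is a minimal illustration with invented names; production code would need try/finally discipline around the lock operations and support for interior insertion and removal.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hand-over-hand (lock-coupling) traversal: acquire the next node's lock
// before releasing the current node's, so the chain can't be mutated out
// from under the traversal. Awkward with lexically scoped synchronized.
public class LockCoupledList {
    static final class Node {
        final ReentrantLock lock = new ReentrantLock();
        final int value;
        Node next;
        Node(int value) { this.value = value; }
    }

    private final Node head = new Node(0);   // sentinel, never matched

    void append(int v) {
        Node cur = head;
        cur.lock.lock();
        while (cur.next != null) {
            Node nxt = cur.next;
            nxt.lock.lock();      // grab successor...
            cur.lock.unlock();    // ...then let go of current
            cur = nxt;
        }
        cur.next = new Node(v);
        cur.lock.unlock();
    }

    boolean contains(int v) {
        Node cur = head;
        cur.lock.lock();
        while (cur.next != null) {
            Node nxt = cur.next;
            nxt.lock.lock();
            cur.lock.unlock();
            cur = nxt;
            if (cur.value == v) { cur.lock.unlock(); return true; }
        }
        cur.lock.unlock();
        return false;
    }
}
```

At any moment a traversal holds at most two node locks, so concurrent traversals can proceed through disjoint regions of the list in parallel.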

With respect to performance, both mechanisms are up to the task. (It's worth noting that the synchronization primitives in modern JVMs provide latency and throughput performance that is typically better than that of the native pthread_mutex constructs). The builtin synchronized construct currently has a few advantages, such as lock coarsening, biased locking (see below) and the potential for lock elision via escape analysis. Those optimizations aren't currently implemented for ReentrantLock and friends. There's no fundamental reason, however, why a JVM couldn't eventually apply such optimizations to ReentrantLock.

The synchronized implementation also provides adaptive spinning, whereas ReentrantLock currently does not. Adaptive spinning employs a two-phase spin-then-block strategy. Briefly, on multiprocessor systems a contended synchronized enter attempt will spin briefly before blocking in order to avoid context switching. Context switching is wasted work -- it doesn't contribute toward forward progress of the application. Worse, it causes TLBs and caches to be repopulated when the blocked thread eventually resumes. (This is the so-called "cache reload transient"). The spin duration varies as a function of the success/failure ratio of recent spin attempts on that same monitor, so the mechanism adapts automatically to parallelism, current system load, application modality, critical section length, etc. In addition, we avoid spinning for a lock where the current lock owner is itself blocked and unlikely to release the lock in a timely fashion. On Solaris our checks can be more refined, determining if the target thread is ONPROC (running), for instance, via the contract private thr_schedctl interface. And it should go without saying that we spin "politely", using a backoff to avoid generating excessive and wasteful traffic on the coherency bus, as well as using PAUSE on IA32 and AMD64 platforms. We'll likely add spinning support to ReentrantLock in a future release.

If you're curious about the inner workings of ReentrantLock, see Doug Lea's The java.util.concurrent Synchronizer Framework. If you're curious about adaptive spinning see synchronizer.cpp in the J2SE 6 source kit.

Finally, ReentrantLock and synchronized are equivalent with respect to the clarified Java Memory Model (JMM, JSR133).



