Java Memory Model concerns on Intel and AMD systems

The Java Memory Model (JMM) was recently clarified by JSR-133, with the corresponding changes incorporated into chapter 17 of the Java Language Specification, 3rd edition. Doug Lea's excellent JSR-133 Cookbook reinterprets JSR-133 from the perspective of, and for the benefit of, JVM implementers. A JVM must reconcile the JMM with the memory consistency model of the underlying platform. Intel/AMD (x86) and SPARC Total Store Order (TSO) define relatively strong memory consistency models; the only architectural reordering of concern is that a store followed by a load in program order can be reordered by the platform, such that the load executes before the store becomes visible. If we require that store to become visible before the load executes, then a serializing instruction -- typically an atomic instruction, such as CAS, or a fence (MFENCE, MEMBAR #StoreLoad) -- must execute between the store and load in question.

The JMM defines a strong memory model akin to sequential consistency (SC) for volatile accesses. On Intel, AMD, and SPARC processors it is sufficient for the JVM to execute a fence instruction after every volatile store. In practice this means that while translating Java bytecode to native code, the just-in-time compiler (JIT) emits a fence after each volatile store. In addition to preventing architectural reordering with fence instructions, the JIT also avoids compile-time reordering of volatile accesses. To be somewhat more precise, a volatile load has acquire semantics and a volatile store has release semantics.
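As a concrete illustration of why that trailing fence matters, consider the classic store-buffering (Dekker-style) litmus test written with Java volatiles. This is a hypothetical sketch (the class and method names are mine, not from the post); because the fields are volatile, the JMM forbids the outcome r1 == r2 == 0, which a fence-free x86 execution could otherwise produce via store buffering:

```java
// Store-buffering litmus test with Java volatiles. Because x and y are
// volatile, the JIT emits a StoreLoad barrier (e.g. MFENCE or a locked
// instruction on x86) between each thread's store and subsequent load,
// so the outcome r1 == r2 == 0 is forbidden by the JMM.
public class StoreBufferingDemo {
    static volatile int x, y;
    static int r1, r2;

    // Runs one iteration; returns true if the forbidden outcome appeared.
    static boolean forbiddenOutcome() throws InterruptedException {
        x = 0; y = 0;
        Thread t1 = new Thread(() -> { x = 1; r1 = y; });
        Thread t2 = new Thread(() -> { y = 1; r2 = x; });
        t1.start(); t2.start();
        t1.join(); t2.join();
        return r1 == 0 && r2 == 0;
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 10_000; i++) {
            if (forbiddenOutcome()) {
                throw new AssertionError("volatile SC violated: r1 == r2 == 0");
            }
        }
        System.out.println("forbidden outcome never observed");
    }
}
```

If x and y were plain (non-volatile) fields, r1 == r2 == 0 would be allowed, and is readily observable on real x86 hardware.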

Of late, however, both Intel (in their Intel® 64 Architecture Memory Ordering White Paper) and AMD (in section 7.2, "Multiprocessor Memory Access Ordering," of their recently updated systems programming guide) have relaxed the definitions of their platform memory models. Under the previously defined memory models, for instance, if MFENCE instructions appeared between all store-load pairs you'd effectively have sequential consistency. That no longer holds: instead of sequential consistency we have slightly weaker causal consistency. (As an aside, I wonder if these specification changes apply to existing processors already in the field -- that is, they clarify the behavior of existing processors -- or if they reflect future or planned processors. I'd hope the latter). Intel claims to have analyzed a large body of existing code in the field and believes that no programs will observe the change or be adversely affected. Strictly speaking, however, existing JVMs that emit MFENCE instructions after volatile stores would be in violation of the JMM when running on processors that actually implemented causal consistency instead of the previous TSO-like model. One option would be to clarify the JMM yet again to admit causal consistency for volatiles. Another would be to change the code emission in the JIT to use locked instructions or XCHG instead of MFENCE. By my reading of the new Intel and AMD documents that would be sufficient to bring the JVM into compliance with the JMM on processors with the relaxed memory model. That's likely slower, however.
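The locked-instruction alternative can be sketched at the Java library level (my example, not from the post): AtomicInteger.getAndSet is typically compiled to a locked XCHG on x86, and a locked instruction is itself fully serializing, so a store performed as an exchange needs no separate trailing fence.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: performing a "volatile store" as an atomic exchange.
// On x86, getAndSet typically compiles to a locked XCHG, which drains
// the store buffer before any subsequent load executes -- the locked
// instruction doubles as the StoreLoad barrier, so no MFENCE follows it.
public class XchgStoreDemo {
    static final AtomicInteger flag = new AtomicInteger(0);

    static int storeThenLoad(AtomicInteger other) {
        flag.getAndSet(1);   // store + full StoreLoad barrier in one instruction
        return other.get();  // this load cannot pass the exchange above
    }

    public static void main(String[] args) {
        System.out.println(storeThenLoad(new AtomicInteger(7))); // prints 7
    }
}
```

Whether exchanging costs more or less than a plain store followed by MFENCE is exactly the performance question raised above; it varies by microarchitecture.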

Readers interested in this topic would also likely enjoy Hans Boehm's presentation Getting C++ Threads Right which touches on the analogous problem for the new C++0x memory model (Youtube video).

Update 2008-3-7: See Rick Hudson's IA Memory Ordering talk.

Update 2008-10-27: See, in particular, the POPL 2009 submission The Semantics of X86 Multiprocessor Machine Code.


Always enjoy reading your posts! I didn't realize Intel and AMD were making this kind of change.

Posted by huntch (aka charlie hunt) on January 19, 2008 at 03:59 AM EST #

Doug Lea's JSR-133 Cookbook, while an excellent intro doc, is sometimes a little bit confusing.

For example, it states that "on the processors discussed below, a StoreLoad is strictly necessary only for separating stores from subsequent loads of the same location(s) as were stored before the barrier".

At first, I interpreted this to mean that StoreLoads can be eliminated when they separate a store from a load on a *different* location. But consider the following program.

Volatiles u, v.
Initially u = v = 0.

T1          | T2
11: u = 1;  | 21: v = 1;
12: u = 2;  | 22: v = 2;
13: r1 = v; | 23: r2 = u;

The JMM doesn't allow the final result r1 = r2 = 1 with volatiles u, v. I cannot see how to impose a total synchronization order, compatible with program order, that allows *both* reads to see 1. Sequential consistency is safe.

On the other hand, on machines where all barriers other than StoreLoad are no-ops, the above program would seem to need no barriers at all, since, by my understanding of the quotation, the potential StoreLoads between 12 and 13 and between 22 and 23 could be removed. But then the program would behave exactly like one with non-volatile u and v, for which the JMM allows r1 = r2 = 1. This contradicts the reasoning above: the barriers are needed even though the quotation may suggest the contrary.
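The two-thread program above can be transcribed directly into Java (my transcription; class and method names are illustrative). With volatile u and v, the outcome r1 == r2 == 1 is forbidden for exactly the reason argued: r1 == 1 would place 13 before 22 in the synchronization order, while r2 == 1 would place 23 before 12, and together with program order (12 < 13, 22 < 23) that forms a cycle.

```java
// Transcription of the litmus test above. With volatile u and v the
// final result r1 == r2 == 1 is forbidden: reading v == 1 at 13 forces
// 13 < 22 in the synchronization order, and reading u == 1 at 23 forces
// 23 < 12, which is cyclic with program order. Without volatile, the
// outcome is allowed.
public class CausalityDemo {
    static volatile int u, v;
    static int r1, r2;

    static boolean forbiddenOutcome() throws InterruptedException {
        u = 0; v = 0;
        Thread t1 = new Thread(() -> { u = 1; u = 2; r1 = v; }); // 11, 12, 13
        Thread t2 = new Thread(() -> { v = 1; v = 2; r2 = u; }); // 21, 22, 23
        t1.start(); t2.start();
        t1.join(); t2.join();
        return r1 == 1 && r2 == 1;
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 10_000; i++) {
            if (forbiddenOutcome()) {
                throw new AssertionError("r1 == r2 == 1 observed");
            }
        }
        System.out.println("r1 == r2 == 1 never observed");
    }
}
```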

Fortunately, the cookbook later prescribes issuing a StoreLoad barrier to separate a volatile store from a volatile load, without further case analysis.

Moreover, I interpret the JSR-133 Cookbook as saying that the purpose of barriers is to avoid reorderings at the processor level (including write buffers, caches and execution units), not necessarily at the shared memory level. To me, this means that the memory actions are *issued* to memory according to the barriers, not that they are actually *seen* by other CPUs in the same order as issued.

This, too, is a little bit confusing to me.

On the other hand, the hardware docs of both Intel and AMD are even more confusing. For example, the Intel doc you refer to doesn't mention xFENCE instructions at all. What about their guarantees? The doc chooses to remain silent.

Posted by Raffaello Giulietti on January 28, 2008 at 03:02 AM EST #

Hi Dave
In answer to your parenthetical "aside" in para 3 of this entry, we've specifically asked Intel the question of whether this represents a change in behaviour or a simple clarification of existing behaviour.

The response from the authors of the Intel document is that this is a clarification. There is no change in the behaviour of new CPUs relative to the behaviour of older CPUs.

Of course it's absolutely true to say that this means that mfence is not an adequate enforcement of sc semantics, but then again it turns out that it never was.


Posted by Paul Murray on February 19, 2008 at 09:56 PM EST #

Hi David,

I think the JMM also needs an update regarding its guarantees for final fields. Also, some examples in the JMM and some of the J2SE sources use non-final fields as if they were final. I've put notes on it at

Posted by Peter Kehl on April 25, 2008 at 09:39 PM EDT #

Hi - Do we know if membar #storeload also flushes the store buffer?

P0:
-- Store A
-- membar #StoreLoad
-- Load A

P1:
-- Load A

What guarantees that P1's load sees P0's write if that store is stuck in P0's store buffer?

Regards banks

Posted by bank kus on March 17, 2011 at 04:58 PM EDT #

Hi Banks, One possible way to implement membar #storeload is to simply stall and wait for the store buffer to drain into visible cache-coherent space. An implementation might also defer the stall until the first load subsequent to the membar. I believe a slightly more sophisticated mechanism could allow speculation over the membar. Specifically, the processor could start a speculative episode at the membar. Subsequent loads would be snooped (tracked) via the cache coherence protocol for remote invalidation until the last store prior to the membar became visible, at which time the processor could commit the speculative state. If a snooped location changed because of remote updates, the processor would abandon the speculative state, roll back, and replay. (These speculative episodes are actually quite similar to hardware transactional memory, and in fact a naive HTM can be constructed by "simply" exposing the mechanism via the ISA).

As for your example, there's no such guarantee as there's no happens-before-ordering edge in the communication graph.

The membar really deals with memory consistency, not cache coherence.

Regards, -Dave

Posted by David Dice on March 17, 2011 at 05:42 PM EDT #

I am interested in this discussion and surprised by Paul's comment -- "Of course it's absolutely true to say that this means that mfence is not an adequate enforcement of sc semantics, but then again it turns out that it never was" -- which I was not aware of before. I thought the mfence would have the barrier effect of forbidding any read/write operations after the mfence from being reordered prior to the mfence, and of flushing cached values to main memory for proper visibility.

I indeed don't have a full understanding of the current situation after reading the posts.

I know from above and the JSR-133 Cookbook that an alternative to mfence for implementing volatile store is to use an atomic instruction (for example XCHG on x86).

Then, may I ask whether such an alternative has been adopted, and whether there still exist any problems with Java volatile field handling in current common implementations (e.g., the JDK and IBM's) of the new JMM on 32- or 64-bit x86 architectures?

Thanks for clarification.

Posted by King Tin on March 28, 2011 at 02:57 AM EDT #

Or indeed, I should ask more clearly: after Intel and AMD relaxed their memory models, do current JVM implementations of the new JMM still ensure sequential consistency (SC) for volatile accesses, or do they provide only the weaker causal consistency you mentioned? I believe only the former is correct.

I haven't got this after reading the posts. Sorry.


Posted by King Tin on March 28, 2011 at 03:06 AM EDT #

Hi King,

The epilog to this post: the vendors yet again clarified their memory models. Arguably, this subsequent clarification renders current JVM implementations once again compliant with the JMM, and volatiles have SC behavior.

Regards, -Dave

Posted by guest on March 28, 2011 at 03:14 AM EDT #

It's fine then. Thank you for your clarification.

Posted by King Tin on March 28, 2011 at 03:43 AM EDT #


Dave is a senior research scientist in the Scalable Synchronization Research Group within Oracle Labs.

