Instruction selection for volatile fences: MFENCE vs LOCK:ADD

In the past the JVM has used MFENCE, but because of latency issues on AMD processors and potential pipeline issues on modern Intel processors, it appears that a LOCK:ADD of 0 to the top of stack is preferable (Gory Details).
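
For concreteness, the two idioms look roughly as follows. This is a minimal sketch using GCC-style inline assembly on x86-64, with illustrative function names; it is not the actual JIT-emitted code.

    #include <cstdio>

    // Full fence via MFENCE.
    static inline void fence_mfence() {
        __asm__ __volatile__("mfence" ::: "memory");
    }

    // Full fence via a locked add of zero to the word at top-of-stack.
    // The LOCK prefix gives the read-modify-write full-fence semantics;
    // adding 0 leaves the stack word unchanged (it does clobber the flags).
    static inline void fence_lock_add() {
        __asm__ __volatile__("lock addl $0, (%%rsp)" ::: "memory", "cc");
    }

    int main() {
        fence_mfence();
        fence_lock_add();
        std::puts("both fences executed");
        return 0;
    }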

Comments:

Hi,

But you still make the choice at run-time based on the exact CPU flavour underneath you, yes? It's not going to be always LOCK:ADD or always MFENCE across all x86/x64 CPUs for a given JVM revision, is it?

Rgds

Damon

Posted by Damon Hart-Davis on May 29, 2009 at 07:18 AM EDT #

Hello Damon -- correct, but it's fairly important to get the defaults correct so that the JVMs we implement today work well on the platforms of tomorrow. There are still platforms where MFENCE is a clear winner (older AMD processors, for instance). We'd very much like to avoid impairing performance on such systems. At least from what we can see today, however, LOCK:ADD looks like the instruction of choice for the near future. Regards, -Dave

Posted by David Dice on May 29, 2009 at 08:17 AM EDT #

It's worth something like 2.5% on SPECjbb2005 to use LOCK:ADD and 14% on Derby on SPECjvm2008. Oh and you're welcome ;)

Posted by Azeem Jiva on May 29, 2009 at 09:20 AM EDT #

Hi Azeem, that's on AMD, correct? I believe the difference was much less on modern Intel Core* processors. Regards, -Dave

Posted by David Dice on May 29, 2009 at 09:34 AM EDT #

Yeah that was on the latest quad core Opterons.

Posted by Azeem Jiva on May 29, 2009 at 10:39 AM EDT #

Thanks. I confess I was surprised when we encountered the MFENCE latency behavior on AMD. I had subsequent conversations with folks at AMD who confirmed our observations. Broadly, we expected that MFENCE/MEMBAR would have the same or better latency than LOCK:ADD/CAS instructions. While not always strictly true, it was a reasonable rule of thumb. Both instructions have bidirectional fence semantics, and the ADD or CMPXCHG is, in a sense, doing more work, leading me to suspect an implementation artifact rather than something fundamental. Interestingly -- and this is the point I was trying to make in the posting -- while on Nehalem we find MFENCE has good "simple" latency, it doesn't appear to pipeline as well as LOCK:ADD. Regards, -Dave
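
A crude sketch of how the back-to-back behavior might be compared is below. It assumes GCC-style inline assembly on x86-64; the names and methodology are illustrative only (a careful measurement would pin the thread, serialize RDTSC, warm up, and run many trials).

    #include <cstdint>
    #include <cstdio>
    #include <x86intrin.h>   // __rdtsc

    static void fence_mfence()   { __asm__ __volatile__("mfence" ::: "memory"); }
    static void fence_lock_add() { __asm__ __volatile__("lock addl $0, (%%rsp)" ::: "memory", "cc"); }

    // Average cycles for n back-to-back fences of one flavour.
    static uint64_t cycles_per_fence(void (*fence)(), int n) {
        uint64_t start = __rdtsc();
        for (int i = 0; i < n; ++i) fence();
        return (__rdtsc() - start) / (uint64_t)n;
    }

    int main() {
        const int n = 1000000;
        std::printf("mfence  : ~%llu cycles each\n",
                    (unsigned long long)cycles_per_fence(fence_mfence, n));
        std::printf("lock add: ~%llu cycles each\n",
                    (unsigned long long)cycles_per_fence(fence_lock_add, n));
        return 0;
    }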

Posted by David Dice on May 29, 2009 at 10:58 AM EDT #

That is quite odd, especially given that in addition to the "extra" work implied by the RMW CAS operation, the barrier semantics of MFENCE are apparently weaker than the full interlock of LOCK CMPXCHG, as you point out elsewhere (http://blogs.sun.com/dave/entry/java_memory_model_concerns_on).

Perhaps it is a result of the relative popularity of CAS as the basis for many lock and lock-free algorithms of all stripes that the vendors have simply focused on the low-hanging fruit - leading to a catch-22 situation where everyone keeps using the LOCK instructions since they are no worse than MFENCE (and maybe stronger), so they keep getting improved, and so on.

Perhaps the fact that CMPXCHG has a load/store to attach itself to helps in tracking the speculative nature of the instructions through the various ROB and store queues, while MFENCE is implemented more cheaply as a simple pipeline flush.

Posted by Travis Downs on January 04, 2010 at 05:27 PM EST #

Corrected link in my immediately preceding post:

http://blogs.sun.com/dave/entry/java_memory_model_concerns_on

Posted by Travis Downs on January 04, 2010 at 05:28 PM EST #

Hi Travis, my impression is that vendors have associated extra semantics with MFENCE, potentially including: recognition of asynchronous pending memory errors, flushing the pipe for self-modifying code, etc. Regards, Dave

Posted by David Dice on January 05, 2010 at 12:38 AM EST #

I'll admit I looked at this instruction in awe for a bit :-)
lock:add [TOS],0

I take it TOS is used because it has very good dcache residency?
OR
perhaps add,0 is fast-pathed in the processor such that no actual loads/stores are done, and being a locked instruction it prevents reordering.

Posted by bank kus on March 12, 2011 at 07:36 PM EST #

We use top-of-stack because the underlying cache line is likely to already be in M-state, requiring no coherence probes to update. (It's frequently written to but unshared.) The slight downside to top-of-stack is that the ADD kills the integer condition codes. The JIT is smart enough to include these in global value numbering optimizations, so having an idiom that kills them is annoying. A slightly better trick is as follows. When we have the thread's "self" pointer manifested in a register (common on IA32, always the case on x64 where we dedicate a register to that purpose), we just execute XCHG R, R->SelfField, where R is the thread register and SelfField is a field in the thread structure that refers to the thread structure itself (self-referential).
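
A rough sketch of the idiom, assuming GCC-style inline assembly on x86-64; the ThreadShadow type and _self field names are illustrative, not the actual HotSpot thread layout.

    #include <cstdio>

    struct ThreadShadow {
        ThreadShadow* _self;              // always points back at this structure
        // ... other per-thread fields would live here ...
        ThreadShadow() : _self(this) {}
    };

    // XCHG with a memory operand is implicitly LOCKed, so it acts as a full
    // fence. Because _self is self-referential, the register value is
    // unchanged afterwards, and unlike ADD the instruction does not touch
    // the condition codes.
    static inline void fence_via_self_xchg(ThreadShadow* t) {
        __asm__ __volatile__("xchgq %0, %1"
                             : "+r"(t), "+m"(t->_self)
                             :
                             : "memory");
    }

    int main() {
        ThreadShadow t;
        fence_via_self_xchg(&t);
        std::printf("_self still correct: %d\n", t._self == &t);
        return 0;
    }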

Regards, -Dave

Posted by guest on March 13, 2011 at 04:10 AM EDT #

>> because the underlying cache line is
>> likely to already be in M-state

I see. The underlying load through ESP could miss in the L1d; is that a worry, or does it have good residency characteristics in practice?

>> XCHG R, R->SelfField
Neat! Though the R->SelfField access will perhaps suffer cache misses; or, again, is it usually resident in practice?

In general I'm wondering if these zero-side-effect synchronizing memops mimicking mfence could suffer worse latencies due to L1/L2 misses (not the coherence cost but the full cache-line miss cost), giving them a far more dangerous worst case (something microbenchmarks will probably not show).

Regards
banks

Posted by bank kus on March 13, 2011 at 08:02 AM EDT #

Even if the underlying LD hit in the local cache, if the line were only in S-state then the CPU would have to probe to upgrade it to M-state. The top-of-stack is very likely to have been written to and resident, so there are very high odds it's already in M-state. Likewise, the thread structure is likely to be largely immune to sharing and, depending on the fields in a line, resident and already in M-state.

You're right that these instructions could cause problems if they start missing, but it's unlikely they will, and if they do, in the case of top-of-stack it's very likely the thread was going to write to that line anyway in the near future, so effectively the atomic doesn't make anything worse.

And in general it's unfortunate that mfence is freighted with additional semantics. Using a dummy atomic operation is counter-intuitive and, ideally, we'd have a low-cost fence instruction that didn't itself access memory.

Regards, Dave

Posted by guest on March 13, 2011 at 08:16 AM EDT #

>> it's very likely the thread was going to write to that line anyway in the near future,
>> so effectively the atomic doesn't make anything worse.

Ahh!! Makes a lot of sense, thanks; all very useful information for me.

Regards - banks

Posted by bank kus on March 13, 2011 at 08:55 AM EDT #
