False sharing induced by card table marking

Garbage-collected runtime environments frequently use card tables in conjunction with write barriers to accelerate root-scanning during garbage collection. (See "A Fast Write Barrier for Generational Garbage Collectors" by Urs Hölzle, ECOOP-OOPSLA 1993, for details.) Briefly, and skipping a few details, this design partitions the heap into an array of power-of-two sized card pages. When a mutator stores into a reference field of an object, the runtime write barrier code marks the card table entry for the card page containing (covering) that field as dirty. In the HotSpot JVM the card page size is 512 bytes and the card table is implemented as a simple array of bytes: a given card table entry, which represents the state of one card page, is just one byte. The write barrier is emitted directly by the JIT and is usually just a shift followed by a store instruction. In a subsequent non-moving minor GC, the collector can avoid scanning reference fields in any card that is not dirty.
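
To make that concrete, here is a minimal sketch of the unconditional barrier, modeling heap addresses as longs and the card table as a byte[]. This is illustrative Java, not HotSpot source; the JIT emits the equivalent shift-and-store inline.

    // Illustrative model of a card table with 512-byte card pages.
    // (Assumption for this sketch: a byte value of 0 denotes "dirty".)
    final class CardTable {
        static final int CARD_SHIFT = 9;   // log2(512): selects the card page
        static final byte DIRTY = 0;
        static byte[] cards;               // one byte of state per card page
        static long heapBase;              // lowest heap address the table covers

        // Unconditional post-write barrier: a shift and a store, executed
        // after every reference store, even when the card is already dirty.
        static void postWriteBarrier(long fieldAddr) {
            cards[(int) ((fieldAddr - heapBase) >>> CARD_SHIFT)] = DIRTY;
        }
    }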

This design is well-proven and widely employed, but unfortunately it can result in performance problems in highly concurrent environments. Let's say our cache line size is 64 bytes, which is fairly common in modern processors. That means 64 cards -- covering 32KB of heap (64 × 512 bytes) -- share a single cache line in the card table. So reference stores by different threads that just happen to fall within the same 32KB heap region cause writes to the same cache line underlying the card table. This can result in excessive write invalidation and cache coherence traffic, which can reduce performance and impede scaling. Interestingly, the impact can worsen after a full (moving) GC: threads tend to allocate into different address ranges by virtue of thread-local allocation buffers (TLABs), but after a full collection the surviving objects tend to be more tightly packed and thus more prone to the problem.

Furthermore, most card table stores are redundant, as the card is often already marked dirty. This suggests a simple solution: instead of using an unconditional store in the barrier, we first check the card table entry and only store if it is clean. This slightly increases the barrier path length and adds a conditional branch -- unless we were to be somewhat clever with conditional moves, annulling a redundant store by redirecting the destination address to a thread-local dummy variable. On the other hand, it avoids the false sharing.

(For historical background, some years ago Doug Lea noticed an odd slow-down after a full GC in some concurrent Java code. He contacted me and I speculated that false sharing in the card table could be the issue. We conjured up a JVM with an experimental -XX:+UseCondCardMark flag that let us emit write barriers as either the usual unconditional store or a conditional form that avoids redundant stores. The conditional form provided relief.)
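Returning to the conditional barrier: in the same illustrative model as above, the form behind -XX:+UseCondCardMark looks roughly like this (again a sketch, not the actual JIT-emitted code):

    // Conditional post-write barrier: load the card first and store only if
    // it isn't already dirty. Redundant marks now cost a load and a branch
    // instead of a coherence-visible write.
    static void condPostWriteBarrier(long fieldAddr) {
        int idx = (int) ((fieldAddr - heapBase) >>> CARD_SHIFT);
        if (cards[idx] != DIRTY) {
            cards[idx] = DIRTY;    // only the 1st store to a clean card writes
        }
    }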

I ran into the problem recently when experimenting with some concurrent queue implementations, which are reference-heavy, on a 256-way Sun/Oracle T5440. This is a 4-socket system where each socket contains a Niagara T2+ UltraSPARC processor having 64 logical processors. My benchmark has 50 producer threads and 50 consumer threads and measures and reports the message throughput over a timing interval. In the default configuration we can pass about 8.8K messages per msec. Using the same JVM, when I add the -XX:+UseCondCardMark flag we can achieve 52.5K messages per msec, clearly demonstrating the magnitude of the effect.
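
The benchmark itself isn't shown here, but a harness of the general shape described -- all class names, queue choice, and parameters below are illustrative, not the original code -- might look like:

    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical throughput harness: 50 producers and 50 consumers share a
    // queue; we report messages passed per millisecond over a fixed interval.
    public class QueueThroughput {
        static final int PAIRS = 50;
        static final long RUN_MS = 10000L;
        static volatile boolean done = false;

        public static void main(String[] args) throws Exception {
            // Bounded queue keeps memory sane if producers outrun consumers.
            final LinkedBlockingQueue<Object> q = new LinkedBlockingQueue<Object>(1024);
            final AtomicLong received = new AtomicLong();
            Thread[] threads = new Thread[PAIRS * 2];
            for (int i = 0; i < PAIRS; i++) {
                threads[i] = new Thread(new Runnable() {
                    public void run() {                  // producer
                        while (!done) { q.offer(new Object()); }
                    }
                });
                threads[PAIRS + i] = new Thread(new Runnable() {
                    public void run() {                  // consumer
                        while (!done) { if (q.poll() != null) received.incrementAndGet(); }
                    }
                });
            }
            for (Thread t : threads) t.start();
            Thread.sleep(RUN_MS);
            done = true;
            for (Thread t : threads) t.join();
            System.out.println(received.get() / RUN_MS + " messages/msec");
        }
    }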

I should also note that we ran into this same issue when experimenting with Java-level lock elision using hardware transactional memory. If two unrelated concurrent transactions happened to store into reference fields in the same 32KB region we'd have false aborts because of write-sharing on the card table cache line. Again, -XX:+UseCondCardMark provided relief.

Update: see CR7029167

Comments:

Will the -XX:+UseCondCardMark flag be coming to a JVM near me soon or will the conditional form simply become the default behavior going forward?

Posted by Dane Foster on February 14, 2011 at 06:11 AM EST #

Hi Dane, if your thread counts are low, or if they're high but you're on a system where coherence and communication are rather inexpensive (say, a single-socket Niagara) then using the conditional card mark isn't profitable and can impart a very minor penalty. So there's no simple answer. In an ideal world we'd start out with the usual unconditional stores, detect write-invalidations on the card table at runtime through sampling, and then recompile just the offending write barrier sites at runtime. Regards, -Dave

Posted by David Dice on February 14, 2011 at 06:20 AM EST #

p.s., initially, I'd like to see the flag incorporated with the default set to "off". That gives customers both a diagnostic tool to check for the effect, and, if necessary, a work-around. The more elaborate adaptive techniques could come later. Regards, -Dave

Posted by David Dice on February 14, 2011 at 06:24 AM EST #

Dave,
What about using non-temporal stores like MASKMOVQ?

Posted by Dmitry Zaslavsky on February 14, 2011 at 08:28 AM EST #

Dmitry, The non-temporal stores are cache-coherent, so we'll still have all the usual write invalidation overheads. Regards, -Dave

Posted by guest on February 14, 2011 at 08:41 AM EST #

Dave,
(I may be way off. I don't have the hardware to test either.) From this doc:
www.intel.com/performance/resources/briefs/memband.pdf

"Of note, on the Westmere EX processor, which is the next generation Intel processor following the Intel
Xeon processor 7500, the IOH coherency buffer, noted above, has been moved into the processor. The
result of this micro-architecture change means non-cacheable write transactions will not cause the extra
read transaction to occur..."

In my understanding it is this "extra" read that causes the penalty.

Posted by Dmitry Zaslavsky on February 14, 2011 at 09:18 AM EST #

Hi Dmitry, I think the challenge to using something like MASKMOVQ is that we actually depend on cache coherence. If we were to use MASKMOVQ, I believe one processor could cause another processor's stores of "dirty" indicator bytes into that same line to be lost, as MASKMOVQ doesn't 1st read the line. Lost card mark stores would be fatal.
Regards, -Dave

Posted by guest on February 14, 2011 at 09:38 AM EST #

Hi Dave - this brings me to a larger question. What about cache line false sharing of variables in Java programs, like, say, an array being worked on by different threads? Does the compiler detect this and realign? In native programs one could explicitly specify cache line alignment for the array elements.

Regards
banks

Posted by bank kus on February 15, 2011 at 04:59 AM EST #

Hi Banks, Good question -- unfortunately there's no facility for forcing the alignment of objects and arrays. You can try aggressive padding to reduce false sharing, but you have to be careful because the object layout manager may place the fields differently than you wrote them. It's not uncommon for it to place the long fields 1st for denser packing, and also to cluster all the reference fields so GC scanning is more efficient. (As an aside, I've always thought it might be useful to cluster the read-only final fields together.)
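
For illustration only (class and field names are invented, and, per the caveat above, layout is ultimately up to the JVM), the manual padding idiom looks like:

    // Best-effort padding: surround the hot field with enough long fields to
    // push neighboring data onto other cache lines. The JVM may still reorder
    // these, so treat this as a heuristic, not a guarantee.
    class PaddedCounter {
        long p0, p1, p2, p3, p4, p5, p6;   // padding before the hot field
        volatile long value;               // frequently written, contended field
        long q0, q1, q2, q3, q4, q5, q6;   // padding after the hot field
    }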

Regards, -Dave

Posted by guest on February 15, 2011 at 06:19 AM EST #

Wonderful article on the garbage collection mechanism of Java.

Posted by Lava Kafle on February 15, 2011 at 03:31 PM EST #

[DaveD] Dmitry, The non-temporal stores are cache-coherent, so we'll still have all the usual write invalidation overheads.

Hi Dave - I've often tried to piece together how NTS works behind the scenes. Is the following correct?

== Goals ==:
<a> avoid cache line pollution at all costs
<b> if possible, speed up write bandwidth and reduce write latency

== Steps ==:
<a> write-invalidate the cache line, or Read_to_Own the cache line (without actually reading the cache, if such a message exists in the coherence protocol)
<b> send the write to the write-combining buffer
<c> flush the WC buffer when full ... repeat

It appears that with this you pay the _full cost_ of coherence on every first cache line write. It will avoid the read-cache-line cost and cache pollution, yes. The write-bandwidth speed-up, though, should be minimal given that write-combining buffers are small.

Does the above sequence match your expectation? Regards - banks

Posted by bank kus on February 26, 2011 at 12:35 PM EST #

Hi Banks, I agree -- the non-temporal store implementation is somewhat more opaque than I'd prefer. Very old AMD documents (perhaps obsolete or non-authoritative) suggested that NTS would 1st invalidate the line out of the cache hierarchy. The NTS stores would then collect in the local WC store buffers, of which there are typically a small number (1-4). When a WC buffer was flushed to memory (by simple displacement, remote probes, or various local instructions), the processor would write back the contents. Evidently, if only some parts of a WC line had been written, the CPU could use smaller bus operations to perform the write-back(s). I'm not sure this accurately describes modern implementations, however.

Given that most card table writes are redundant -- we're storing dirty over dirty -- it seems possible that NTS could actually generate more bus traffic than the conditional card marking.

Another challenge to using NTS is that some concurrent GC algorithms are sensitive to the memory visibility ordering of the GC write barrier relative to the reference store, possibly necessitating a ST-ST memory barrier. (Sadly, the term "barrier" is overloaded.) For classic stop-the-world collectors we simply need to ensure (a) that safepoints don't occur between the reference store and the GC write barrier, and (b) that when a thread advertises that it has reached a safepoint, all prior stores by that thread are visible.

Regards, -Dave

Posted by guest on February 28, 2011 at 03:33 AM EST #
