For background on the membar elision techniques and the serialization page, see the following: 7644409; Asymmetric Dekker Synchronization; and QPI Quiescence. On normal x86 and SPARC systems these are strictly local latency optimizations (because MEMBAR is a local operation), although on some systems where fences have global effects, they may actually improve scalability. As an aside, such optimizations may no longer be profitable on modern processors, where the cost of fences has decreased steadily. Relatedly, on larger systems, the TLB shootdown activity (interprocessor interrupts, etc.) associated with mprotect(PROT_NONE) may constitute a system-wide scaling impediment. So the prevailing trend is away from such techniques, and back toward fences. Similar arguments apply to biased locking, another local latency optimization, which may have outlived its usefulness.
A colleague in Oracle Labs ran into a puzzling JNI performance problem. It originally manifested in a complex environment, but he managed to reduce the problem to a simple test case where a set of independent concurrent threads make JNI calls to targets that return immediately. Scaling starts to fade at a suspiciously low number of threads. (I eliminated the usual thermal, energy and hyperthreading concerns).
On a hunch, I tried +UseMembar, and the scaling was flat. The problem appears to be false sharing for the store accesses into the serialization page. If you're following along in the openjdk source code, the culprits appear to be write_memory_serialize_page() and MacroAssembler::serialize_memory(). The “hash” function that selects an offset in the page (to reduce false sharing) needs improvement. And since the membar elision code was written, I believe biased locking forced the thread instances to be aligned on 256-byte boundaries, which contributes in part to the poor hash distribution. On a whim, I added an “Ordinal” field to the thread structure and initialized it in the Thread ctor by fetch-and-add of a static global. The 5th created thread will have Ordinal==5, etc. I then changed the hash function in the files mentioned above to generate an offset calculated via ((Ordinal*128) & (PageSize-1)). “128” is important, as that’s the alignment/padding unit to avoid false sharing on x86. (The unit of coherence on x86 is a 64-byte cache line, but Intel notes in their manuals that you need 128 to avoid false sharing. Adjacent sector prefetch makes it 128 bytes, effectively.) This provided relief.
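To make the offset calculation concrete, here is a minimal standalone sketch of the Ordinal scheme. It is not the HotSpot code: `ThreadSketch`, `g_next_ordinal`, `kPageSize`, `kSlotStride` and `serialize_page_offset` are hypothetical names for illustration, and the 4K page size is an assumption. The counter starts at 1 so that the 5th created thread gets Ordinal==5.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical constants for illustration only.
constexpr std::size_t kPageSize   = 4096;  // assumed 4K base page
constexpr std::size_t kSlotStride = 128;   // padding unit to avoid false sharing on x86

// Stand-in for the static global that the Thread ctor bumps via fetch-and-add.
// Starts at 1 so the Nth created thread observes Ordinal == N.
static std::atomic<std::uint32_t> g_next_ordinal{1};

// Hypothetical stand-in for the thread structure with the added Ordinal field.
struct ThreadSketch {
    std::uint32_t ordinal;
    ThreadSketch()
        : ordinal(g_next_ordinal.fetch_add(1, std::memory_order_relaxed)) {}
};

// Byte offset into the serialization page: ((Ordinal*128) & (PageSize-1)).
// The mask only works because the page size is a power of two.
inline std::size_t serialize_page_offset(std::uint32_t ordinal) {
    static_assert((kPageSize & (kPageSize - 1)) == 0,
                  "page size must be a power of two");
    const std::size_t offset =
        (static_cast<std::size_t>(ordinal) * kSlotStride) & (kPageSize - 1);
    assert(offset % kSlotStride == 0);  // always lands on a 128-byte boundary
    return offset;
}
```

With these assumed values, ordinals 1, 2, 3 map to offsets 128, 256, 384, and the mapping wraps every 32 threads, so ordinals 32 and 64 both land on offset 0.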
With 128-byte units and a 4K base page size, we have only 32 unique “slots” on the serialization page. It might make sense to increase the serialization region to multiple pages, with the number of pages possibly a function of the number of logical CPUs. That is, to reduce the odds of collisions, it probably makes sense to conservatively over-provision the region. (mprotect() operations on contiguous regions of virtual pages are only slightly more expensive than mprotect() operations on a single page, at least on x86 or SPARC. So switching from a single page to multiple pages shouldn’t result in any performance loss.) Ideally we’d index with the CPUID, but I don’t see that happening, as getting the CPUID in a timely fashion can be problematic on some platforms. We could still have very poor distribution with the Ordinal scheme I mentioned above. Slightly better than the Ordinal approach might be to try to balance the number of threads associated with each of the slots. This could be done in the thread ctor. It’s still palliative, as you could have a poor distribution over the set of threads using JNI at any given moment. But something like that, coupled with increasing the size of the region, would probably work well.
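As a rough illustration of the two ideas above (a multi-page region sized from the CPU count, and balancing threads across slots at construction time), here is a small sketch. Everything in it is hypothetical: the names `slots_in_region`, `pages_for_cpus` and `SlotBalancer`, the 4x over-provisioning factor, and the least-loaded policy are assumptions for illustration, not anything in the JDK. The balancer is not thread-safe as written; a real version would need a lock or atomics around the counts.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical constants for illustration only.
constexpr std::size_t kRegionPageSize   = 4096;  // assumed 4K base page
constexpr std::size_t kRegionSlotStride = 128;   // false-sharing padding unit

// Number of 128-byte slots available in a region of `pages` pages.
constexpr std::size_t slots_in_region(std::size_t pages) {
    return pages * kRegionPageSize / kRegionSlotStride;
}

// Assumed over-provisioning policy: `factor` slots per logical CPU,
// rounded up to whole pages, never fewer than one page.
constexpr std::size_t pages_for_cpus(std::size_t logical_cpus,
                                     std::size_t factor = 4) {
    const std::size_t slots_per_page = kRegionPageSize / kRegionSlotStride;
    const std::size_t wanted = logical_cpus * factor;
    const std::size_t pages = (wanted + slots_per_page - 1) / slots_per_page;
    return pages == 0 ? 1 : pages;
}

// Hands each new thread the currently least-populated slot.
// NOT thread-safe: callers must serialize acquire()/release().
class SlotBalancer {
public:
    explicit SlotBalancer(std::size_t pages)
        : counts_(slots_in_region(pages), 0) {
        assert(pages > 0);
    }

    // Called from the (hypothetical) thread ctor; returns a byte offset.
    std::size_t acquire() {
        auto it = std::min_element(counts_.begin(), counts_.end());
        ++*it;
        return static_cast<std::size_t>(it - counts_.begin()) * kRegionSlotStride;
    }

    // Called from the (hypothetical) thread dtor with the offset acquire() returned.
    void release(std::size_t offset) {
        const std::size_t idx = offset / kRegionSlotStride;
        assert(idx < counts_.size() && counts_[idx] > 0);
        --counts_[idx];
    }

private:
    std::vector<std::size_t> counts_;  // threads currently assigned to each slot
};
```

Unlike the plain Ordinal mapping, this reuses slots freed by exited threads, but it only balances over all live threads, so the subset making JNI calls at a given moment can still collide.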
p.s., the mprotect()-based serialization technique is safe only on systems that have a memory consistency model that's TSO or stronger. And the access to the serialization page has to be a store. Because of memory model issues, a load isn't sufficient.
Update: friends in J2SE have filed an RFE as JDK-8143878.