In an ideal world the clock sources underlying nanoTime() would be causal. But that's not always the case on all platforms, in which case the JVM has to enforce the property by tracking the maximum observed time in a memory location and only returning the larger of the underlying source that variable. In turn, this creates a coherence hotspot that can impede scaling. (In some cases we even seen a concave scaling curve, particularly on NUMA systems).
I've provided a simple harness that models the nanoTime() implementation on Solaris. It creates T threads, each of which loops calling MonoTime(). At the end of a 10 second measurement interval it reports throughput -- the aggregate number of MonoTime() calls completed by that cohort of threads. On a SPARC T5 for 1,2,4,8 and 16 threads, I observe 11M, 14M, 25M, 29M and 29M iterations completed, respectively. You might naively expect ideal scaling up the number of cores (16 in this case), but that's not the case. (Note that I just happened to use a T5 for data collection. Gethrtime() is causal on a T5).
MonoTime can also take a "granularity" argument, which allows a trade-off between the granularity of the returned value and the update rate on the variable that tracks the maximum observed time. (The command line is "./MonoTime Threads GranularityInNsecs"). 0 reflects the existing nanoTime() implementation. If I use a granularity of 1000 then for 1,2,4,8 and 16 threads I observe 14M, 27M, 54M, 104M, and 181M iterations. (The improvement at 1 thread arises because of reduced local latency as we use atomic "CAS" less, but the improved scalability proper comes from reduced coherence traffic on the variable that tracks the maximum returned value). MonoTime
This might make an interesting "-XX:" switch for HotSpot, both as a diagnostic test for the existence of the problem, and as a potential work around.
Java's nanoTime() API guarantees a monotonic (really, non-retrograde) relative clock source. It's also expected to be causal in the following sense. Say thread A calls nanoTime(), gets value V, and then stores V into memory. Some other thread B then observes A's store of V, then calls nanoTime() itself and gets W. We expect W should be greater than or equal to V.