Friday Nov 23, 2012

Java @Contended annotation to help reduce false sharing

See this posting by Aleksey Shipilev for details -- @Contended is something we've wanted for a long time. The JVM provides automatic layout and placement of fields. Usually it'll (a) sort fields by descending size to improve footprint, and (b) pack reference fields so the garbage collector can process a contiguous run of reference fields when tracing. @Contended gives the program a way to provide more explicit guidance with respect to concurrency and false sharing. Using this facility we can sequester hot frequently written shared fields away from other mostly read-only or cold fields. The simple rule is that read-sharing is cheap, and write-sharing is very expensive. We can also pack fields together that tend to be written together by the same thread at about the same time.

More generally, we're trying to influence relative field placement to minimize coherency misses. In a simple single-threaded environment fields that are accessed closely together in time should be placed proximally in space to promote cache locality. That is, temporal locality should condition spatial locality. Fields accessed together in time should be nearby in space. That having been said, when threads are accessing our fields concurrently we have to be careful to avoid false sharing and excessive invalidation from coherence traffic. As such, we try to cluster or otherwise sequester fields that tend to written at approximately the same time by the same thread onto the same cache line. Note that there's a tension at play: if we try too hard to minimize single-threaded capacity misses then we can end up with excessive coherency misses running in a parallel environment. In native C/C++ code it's fairly typical for programmers to use informed concurrency-aware structure layout. @Contended should give use the same capability in Java, although in native code the binding of fields to offsets happens at compile-time, while it happens at load-time for the Java. It's worth pointing out that in the general case there is no single optimal layout for both single-thread and multithreaded environments. And the ideal layout problem itself is NP-hard.

Ideally, a JVM would employ hardware monitoring facilities to detect sharing behavior and change the layout on the fly. That's a bit difficult as we don't yet have the right plumbing to provide efficient and expedient information to the JVM. Hint: we need to disintermediate the OS and hypervisor. Another challenge is that raw field offsets are used in the unsafe facility, so we'd need to address that issue, possibly with an extra level of indirection.

Finally, I'd like to be able to pack final fields together as well, as those are known to be read-only.

Thursday Nov 22, 2012

NUMA-aware placement of communication variables

For classic NUMA-aware programming I'm typically most concerned about simple cold, capacity and compulsory misses and whether we can satisfy the miss by locally connected memory or whether we have to pull the line from its home node over the coherent interconnect -- we'd like to minimize channel contention and conserve interconnect bandwidth. That is, for this style of programming we're quite aware of where memory is homed relative to the threads that will be accessing it. Ideally, a page is collocated on the node with the thread that's expected to most frequently access the page, as simple misses on the page can be satisfied without resorting to transferring the line over the interconnect. The default "first touch" NUMA page placement policy tends to work reasonable well in this regard. When a virtual page is first accessed, the operating system will attempt to provision and map that virtual page to a physical page allocated from the node where the accessing thread is running. It's worth noting that the node-level memory interleaving granularity is usually a multiple of the page size, so we can say that a given page P resides on some node N. That is, the memory underlying a page resides on just one node.

But when thinking about accesses to heavily-written communication variables we normally consider what caches the lines underlying such variables might be resident in, and in what states. We want to minimize coherence misses and cache probe activity and interconnect traffic in general. I don't usually give much thought to the location of the home NUMA node underlying such highly shared variables. On a SPARC T5440, for instance, which consists of 4 T2+ processors connected by a central coherence hub, the home node and placement of heavily accessed communication variables has very little impact on performance. The variables are frequently accessed so likely in M-state in some cache, and the location of the home node is of little consequence because a requester can use cache-to-cache transfers to get the line.

Or at least that's what I thought. Recently, though, I was exploring a simple shared memory point-to-point communication model where a client writes a request into a request mailbox and then busy-waits on a response variable. It's a simple example of delegation based on message passing. The server polls the request mailbox, and having fetched a new request value, performs some operation and then writes a reply value into the response variable. As noted above, on a T5440 performance is insensitive to the placement of the communication variables -- the request and response mailbox words. But on a Sun/Oracle X4800 I noticed that was not the case and that NUMA placement of the communication variables was actually quite important.

For background an X4800 system consists of 8 Intel X7560 Xeons . Each package (socket) has 8 cores with 2 contexts per core, so the system is 8x8x2. Each package is also a NUMA node and has locally attached memory. Every package has 3 point-to-point QPI links for cache coherence, and the system is configured with a twisted ladder "mobius" topology. The cache coherence fabric is glueless -- there's not central arbiter or coherence hub. The maximum distance between any two nodes is just 2 hops over the QPI links. For any given node, 3 other nodes are 1 hop distant and the remaining 4 nodes are 2 hops distant.

Using a single request (client) thread and a single response (server) thread, a benchmark harness explored all permutations of NUMA placement for the two threads and the two communication variables, measuring the average round-trip-time and throughput rate between the client and server. In this benchmark the server simply acts as a simple transponder, writing the request value plus 1 back into the reply field, so there's no particular computation phase and we're only measuring communication overheads. In addition to varying the placement of communication variables over pairs of nodes, we also explored variations where both variables were placed on one page (and thus on one node) -- either on the same cache line or different cache lines -- while varying the node where the variables reside along with the placement of the threads. The key observation was that if the client and server threads were on different nodes, then the best placement of variables was to have the request variable (written by the client and read by the server) reside on the same node as the client thread, and to place the response variable (written by the server and read by the client) on the same node as the server. That is, if you have a variable that's to be written by one thread and read by another, it should be homed with the writer thread. For our simple client-server model that means using split request and response communication variables with unidirectional message flow on a given page. This can yield up to twice the throughput of less favorable placement strategies.

Our X4800 uses the QPI 1.0 protocol with source-based snooping. Briefly, when node A needs to probe a cache line it fires off snoop requests to all the nodes in the system. Those recipients then forward their response not to the original requester, but to the home node H of the cache line. H waits for and collects the responses, adjudicates and resolves conflicts and ensures memory-model ordering, and then sends a definitive reply back to the original requester A. If some node B needed to transfer the line to A, it will do so by cache-to-cache transfer and let H know about the disposition of the cache line. A needs to wait for the authoritative response from H. So if a thread on node A wants to write a value to be read by a thread on node B, the latency is dependent on the distances between A, B, and H. We observe the best performance when the written-to variable is co-homed with the writer A. That is, we want H and A to be the same node, as the writer doesn't need the home to respond over the QPI link, as the writer and the home reside on the very same node. With architecturally informed placement of communication variables we eliminate at least one QPI hop from the critical path.

Newer Intel processors use the QPI 1.1 coherence protocol with home-based snooping. As noted above, under source-snooping a requester broadcasts snoop requests to all nodes. Those nodes send their response to the home node of the location, which provides memory ordering, reconciles conflicts, etc., and then posts a definitive reply to the requester. In home-based snooping the snoop probe goes directly to the home node and are not broadcast. The home node can consult snoop filters -- if present -- and send out requests to retrieve the line if necessary. The 3rd party owner of the line, if any, can respond either to the home or the original requester (or even to both) according to the protocol policies. There are myriad variations that have been implemented, and unfortunately vendor terminology doesn't always agree between vendors or with the academic taxonomy papers. The key is that home-snooping enables the use of a snoop filter to reduce interconnect traffic. And while home-snooping might have a longer critical path (latency) than source-based snooping, it also may require fewer messages and less overall bandwidth. It'll be interesting to reprise these experiments on a platform with home-based snooping.

While collecting data I also noticed that there are placement concerns even in the seemingly trivial case when both threads and both variables reside on a single node. Internally, the cores on each X7560 package are connected by an internal ring. (Actually there are multiple contra-rotating rings). And the last-level on-chip cache (LLC) is partitioned in banks or slices, which with each slice being associated with a core on the ring topology. A hardware hash function associates each physical address with a specific home bank. Thus we face distance and topology concerns even for intra-package communications, although the latencies are not nearly the magnitude we see inter-package. I've not seen such communication distance artifacts on the T2+, where the cache banks are connected to the cores via a high-speed crossbar instead of a ring -- communication latencies seem more regular.

Finally, I've seen strong hints that the placement of threads relative to lock metadata accessed by those threads plays an important part in performance for contended locks.

Tuesday Nov 06, 2012

C-states and P-states : confounding factors for benchmarking

I was recently looking into a performance issue in the java.util.concurrent (JUC) fork-join pool framework related to particularly long latencies when trying to wake (unpark) threads in the pool. Eventually I tracked the issue down to the power & scaling governor and idle-state policies on x86. Briefly, P-states refer to the set of clock rates (speeds) at which a processor can run. C-states reflect the possible idle states. The deeper the C-state (higher numerical values) the less power the processor will draw, but the longer it takes the processor to respond and exit that sleep state on the next idle to non-idle transition. In some cases the latency can be worse than 100 microseconds. C0 is normal execution state, and P0 is "full speed" with higher Pn values reflecting reduced clock rates. C-states are P-states are orthogonal, although P-states only have meaning at C0. You could also think of the states as occupying a spectrum as follows : P0, P1, P2, Pn, C1, C2, ... Cn, where all the P-states are at C0. Our fork-join framework was calling unpark() to wake a thread from the pool, and that thread was being dispatched onto a processor at deep C-state, so we were observing rather impressive latencies between the time of the unpark and the time the thread actually resumed and was able to accept work. (I originally thought we were seeing situations where the wakee was preempting the waker, but that wasn't the case. I'll save that topic for a future blog entry). It's also worth pointing out that higher P-state values draw less power and there's usually some latency in ramping up the clock (P-states) in response to offered load.

The issue of C-states and P-states isn't new and has been described at length elsewhere, but it may be new to Java programmers, adding a new confounding factor to benchmarking methodologies and procedures. To get stable results I'd recommend running at C0 and P0, particularly for server-side applications. As appropriate, disabling "turbo" mode may also be prudent. But it also makes sense to run with the system defaults to understand if your application exhibits any performance sensitivity to power management policies.

The operating system power management sub-system typically control the P-state and C-states based on current and recent load. The scaling governor manages P-states. Operating systems often use adaptive policies that try to avoid deep C-states for some period if recent deep idle episodes proved to be very short and futile. This helps make the system more responsive under bursty or otherwise irregular load. But it also means the system is stateful and exhibits a memory effect, which can further complicate benchmarking. Forcing C0 + P0 should avoid this issue.

About

Dave is a senior research scientist in the Scalable Synchronization Research Group within Oracle Labs : Google Scholar.

Search

Categories
Archives
« November 2012 »
SunMonTueWedThuFriSat
    
1
2
3
4
5
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
24
25
26
27
28
29
30
 
       
Today