Thursday Oct 25, 2012

NUMA-aware constructs for java.util.concurrent

The constructs in the java.util.concurrent JSR-166 "JUC" concurrency library are currently NUMA-oblivious. That's because we currently don't have the topology discovery infrastructure and underpinnings in place that would allow and enable NUMA-awareness. But some quick throw-away prototypes show that it's possible to write NUMA-aware library code. I happened to use JUC Exchanger as a research vehicle. Another interesting idea is to adapt fork-join work-stealing to favor stealing from queues associated with 'nearby' threads.

Performance triage

Folks often ask me how to approach a suspected performance issue. My personal strategy is informed by the fact that I work on concurrency issues. (When you have a hammer everything looks like a nail, but I'll try to keep this general). A good starting point is to ask yourself if the observed performance matches your expectations. Expectations might be derived from known system performance limits, prototypes, and other software or environments that are comparable to your particular system-under-test. Some simple comparisons and microbenchmarks can be useful at this stage. It's also useful to write some very simple programs to validate some of the reported or expected system limits. Can that disk controller really tolerate and sustain 500 reads per second? To reduce the number of confounding factors it's better to try to answer that question with a very simple targeted program. And finally, nothing beats having familiarity with the technologies that underlying your particular layer.

On the topic of confounding factors, as our technology stacks become deeper and less transparent, we often find our own technology working against us in some unexpected way to choke performance rather than simply running into some fundamental system limit. A good example is the warm-up time needed by just-in-time compilers in Java Virtual Machines. I won't delve too far into that particular hole except to say that it's rare to find good benchmarks and methodology for java code. Another example is power management on x86. Power management is great, but it can take a while for the CPUs to throttle up from low(er) frequencies to full throttle. And while I love "turbo" mode, it makes benchmarking applications with multiple threads a chore as you have to remember to turn it off and then back on otherwise short single-threaded runs may look abnormally fast compared to runs with higher thread counts. In general for performance characterization I disable turbo mode and fix the power governor at "performance" state. Another source of complexity is the scheduler, which I've discussed in prior blog entries.

Lets say I have a running application and I want to better understand its behavior and performance. We'll presume it's warmed up, is under load, and is an execution mode representative of what we think the norm would be. It should be in steady-state, if a steady-state mode even exists. On Solaris the very first thing I'll do is take a set of "pstack" samples. Pstack briefly stops the process and walks each of the stacks, reporting symbolic information (if available) for each frame. For Java, pstack has been augmented to understand java frames, and even report inlining. A few pstack samples can provide powerful insight into what's actually going on inside the program. You'll be able to see calling patterns, which threads are blocked on what system calls or synchronization constructs, memory allocation, etc. If your code is CPU-bound then you'll get a good sense where the cycles are being spent. (I should caution that normal C/C++ inlining can diffuse an otherwise "hot" method into other methods. This is a rare instance where pstack sampling might not immediately point to the key problem). At this point you'll need to reconcile what you're seeing with pstack and your mental model of what you think the program should be doing. They're often rather different. And generally if there's a key performance issue, you'll spot it with a moderate number of samples.

I'll also use OS-level observability tools to lock for the existence of bottlenecks where threads contend for locks; other situations where threads are blocked; and the distribution of threads over the system. On Solaris some good tools are mpstat and too a lesser degree, vmstat. Try running "mpstat -a 5" in one window while the application program runs concurrently. One key measure is the voluntary context switch rate "vctx" or "csw" which reflects threads descheduling themselves. It's also good to look at the user; system; and idle CPU percentages. This can give a broad but useful understanding if your threads are mostly parked or mostly running. For instance if your program makes heavy use of malloc/free, then it might be the case you're contending on the central malloc lock in the default allocator. In that case you'd see malloc calling lock in the stack traces, observe a high csw/vctx rate as threads block for the malloc lock, and your "usr" time would be less than expected.

Solaris dtrace is a wonderful and invaluable performance tool as well, but in a sense you have to frame and articulate a meaningful and specific question to get a useful answer, so I tend not to use it for first-order screening of problems. It's also most effective for OS and software-level performance issues as opposed to HW-level issues. For that reason I recommend mpstat & pstack as my the 1st step in performance triage. If some other OS-level issue is evident then it's good to switch to dtrace to drill more deeply into the problem.

Only after I've ruled out OS-level issues do I switch to using hardware performance counters to look for architectural impediments.

TXPAUSE : polite waiting for hardware transactional memory

Classic locks are an appropriate tool to prevent potentially conflicting operations A and B, invoked by different threads, from running at the same time. In a sense the locks cause either A to run before B or vice-versa. Similarly, we can replace the locks with hardware transactional memory, or use transactional lock elision to leverage potential disjoint access parallelism between A and B. But often we want A to wait until B has run. In a Pthreads environment we'd usually use locks in conjunction with condition variables to implement our "wait until" constraint. MONITOR-MWAIT is another way to wait for a memory location to change, but it only allows us to track one cache line and it's only available on x86. There's no similar "wait until" construct for hardware transactions. At the instruction-set level a simple way to express "wait until" in transactions would be to add a new TXPAUSE instruction that could be used within an active hardware transaction. TXPAUSE would politely stall the invoking thread, possibly surrendering or yielding compute resources, while at the same time continuing to track the transaction's address-set. Once a transaction has executed TXPAUSE it can only abort. Ideally that'd happen when some other thread modifies a variable that's in the transaction's read-set or write-set. And since we're aborting all writes would be discarded. In a sense this gives us multi-location MWAIT but with much more flexibility. We could also augment the TXPAUSE with a cycle-count bound to cap the time spent stalled.

I should note that we can already use hardware transactions to enter a tight spin loop in a transaction to wait for updates to the address-set, which will in turn cause an abort. Assuming that the implementation monitors the address-set via cache-coherence probes, by waiting in this fashion we actually communicate via the probes, and not via memory values. That is, the updating thread signals the waiter via probes instead of by traditional memory values. But TXPAUSE takes that a step further and gives us a polite way to spin within transactions.

Lets consider a classic 'polite' test-and-test-and-set loop as might be find in a simple spin lock. The contending threads loop, loading the lock word value to see if the lock has transitioned from LOCKED to UNLOCKED state. If they observe the UNLOCKED they'll then try to acquire the lock with an atomic operation such as XCHG or compare-and-swap. While busy-waiting, the cache line underlying the lock word might be in MESI "S" = shared state in the contending thread's local cache. When the owner ultimately drops the lock -- typically by using a STORE instruction to overwrite LOCKED with UNLOCKED -- the line will change from "S" to "I" (Invalid) in the contending thread's cache. The contending thread continues to loop. The next load by the contending thread misses because the line is in local "I" state, and then causes a read-to-share (RTS) coherence operation to get the line back in the local cache in "S" state. Assuming the spinning thread sees UNLOCKED it'll then try the atomic read-modify-write instruction to acquire the lock. This causes another read-to-owner (RFO or RTO) coherence operation to upgrade the line from "S" to "M" so the atomic can run. So the operations in the local cache were : a remote invalidation from the owner, an RTS via the load and then an RTO via the atomic. (This is the best case, btw).

Now lets say the spinning thread uses a hardware transaction to wait. Assuming Haswell-like hardware, the contending thread would execute an XBEGIN instruction to start the transaction, and then load the lock word. If the lock word is LOCKED, then the transaction simply enters a tight loop, where the only exit from the loop is via abort. If the lock word happened to contain UNLOCKED then the transaction can try to acquire the lock with a store of LOCKED followed by a COMMIT instruction. In that case we're done and the thread has acquired ownership. The case where we spin in the transaction is more interesting. The transaction will have issued a load on the lock word, so the line is in local "S" state. The store used by the owner to subsequently drop the lock will cause the line underlying the lock to be invalidated in the spinning thread's local cache. This causes an abort. After the abort, the contending thread can try an atomic compare-and-swap to acquire lock. This will typically transition the cache line from I to M by virtue of an RTO coherence operation. Note that we've saved the intermediate RTS operation by spinning in this manner, so, even absent TXPAUSE, it can be useful to use transactions to wait for values in memory to change. (As an aside, ignoring the use of hardware transactions, MOESI protocols are much more tolerant of spin locks than are MESI systems).

Keywords: Hardware Transactional Memory; HTM; Haswell; RTM; TSX; spin lock; busy-wait

On the topic of new instructions, it might be useful to have a selective memory fence instruction that specified prior and subsequent addresses. Currently we might write "ST A=1; MEMBAR StoreLoad; LD B". A selective fence would allow us to write "ST A=1; MEMBAR StoreLoad A, B; LD B;" The selective fence would ensure that the store to A was visible before the fetch of B, but without imposing full store-load barrier semantics for all accesses. A degenerate form might be expressed as "MEMBAR A" which would specify that all subsequent (in program order) accesses to A would be visible before any subsequent memory accesses. On a simple in-order processor with a store buffer, "MEMBAR A" could simply wait while an address matching "A" appear in the store buffer. When A finally drained out to visible & coherent space, the MEMBAR would complete. It's possibly such selective fences might give the CPU designers more latitude to improve performance.




« October 2012 »