Thursday Oct 25, 2012

NUMA-aware constructs for java.util.concurrent

The constructs in the java.util.concurrent JSR-166 "JUC" concurrency library are currently NUMA-oblivious. That's because we currently don't have the topology discovery infrastructure and underpinnings in place that would allow and enable NUMA-awareness. But some quick throw-away prototypes show that it's possible to write NUMA-aware library code. I happened to use JUC Exchanger as a research vehicle. Another interesting idea is to adapt fork-join work-stealing to favor stealing from queues associated with 'nearby' threads.

Performance triage

Folks often ask me how to approach a suspected performance issue. My personal strategy is informed by the fact that I work on concurrency issues. (When you have a hammer everything looks like a nail, but I'll try to keep this general). A good starting point is to ask yourself if the observed performance matches your expectations. Expectations might be derived from known system performance limits, prototypes, and other software or environments that are comparable to your particular system-under-test. Some simple comparisons and microbenchmarks can be useful at this stage. It's also useful to write some very simple programs to validate some of the reported or expected system limits. Can that disk controller really tolerate and sustain 500 reads per second? To reduce the number of confounding factors it's better to try to answer that question with a very simple targeted program. And finally, nothing beats having familiarity with the technologies that underlying your particular layer.

On the topic of confounding factors, as our technology stacks become deeper and less transparent, we often find our own technology working against us in some unexpected way to choke performance rather than simply running into some fundamental system limit. A good example is the warm-up time needed by just-in-time compilers in Java Virtual Machines. I won't delve too far into that particular hole except to say that it's rare to find good benchmarks and methodology for java code. Another example is power management on x86. Power management is great, but it can take a while for the CPUs to throttle up from low(er) frequencies to full throttle. And while I love "turbo" mode, it makes benchmarking applications with multiple threads a chore as you have to remember to turn it off and then back on otherwise short single-threaded runs may look abnormally fast compared to runs with higher thread counts. In general for performance characterization I disable turbo mode and fix the power governor at "performance" state. Another source of complexity is the scheduler, which I've discussed in prior blog entries.

Lets say I have a running application and I want to better understand its behavior and performance. We'll presume it's warmed up, is under load, and is an execution mode representative of what we think the norm would be. It should be in steady-state, if a steady-state mode even exists. On Solaris the very first thing I'll do is take a set of "pstack" samples. Pstack briefly stops the process and walks each of the stacks, reporting symbolic information (if available) for each frame. For Java, pstack has been augmented to understand java frames, and even report inlining. A few pstack samples can provide powerful insight into what's actually going on inside the program. You'll be able to see calling patterns, which threads are blocked on what system calls or synchronization constructs, memory allocation, etc. If your code is CPU-bound then you'll get a good sense where the cycles are being spent. (I should caution that normal C/C++ inlining can diffuse an otherwise "hot" method into other methods. This is a rare instance where pstack sampling might not immediately point to the key problem). At this point you'll need to reconcile what you're seeing with pstack and your mental model of what you think the program should be doing. They're often rather different. And generally if there's a key performance issue, you'll spot it with a moderate number of samples.

I'll also use OS-level observability tools to lock for the existence of bottlenecks where threads contend for locks; other situations where threads are blocked; and the distribution of threads over the system. On Solaris some good tools are mpstat and too a lesser degree, vmstat. Try running "mpstat -a 5" in one window while the application program runs concurrently. One key measure is the voluntary context switch rate "vctx" or "csw" which reflects threads descheduling themselves. It's also good to look at the user; system; and idle CPU percentages. This can give a broad but useful understanding if your threads are mostly parked or mostly running. For instance if your program makes heavy use of malloc/free, then it might be the case you're contending on the central malloc lock in the default allocator. In that case you'd see malloc calling lock in the stack traces, observe a high csw/vctx rate as threads block for the malloc lock, and your "usr" time would be less than expected.

Solaris dtrace is a wonderful and invaluable performance tool as well, but in a sense you have to frame and articulate a meaningful and specific question to get a useful answer, so I tend not to use it for first-order screening of problems. It's also most effective for OS and software-level performance issues as opposed to HW-level issues. For that reason I recommend mpstat & pstack as my the 1st step in performance triage. If some other OS-level issue is evident then it's good to switch to dtrace to drill more deeply into the problem.

Only after I've ruled out OS-level issues do I switch to using hardware performance counters to look for architectural impediments.

TXPAUSE : polite waiting for hardware transactional memory

Classic locks are an appropriate tool to prevent potentially conflicting operations A and B, invoked by different threads, from running at the same time. In a sense the locks cause either A to run before B or vice-versa. Similarly, we can replace the locks with hardware transactional memory, or use transactional lock elision to leverage potential disjoint access parallelism between A and B. But often we want A to wait until B has run. In a Pthreads environment we'd usually use locks in conjunction with condition variables to implement our "wait until" constraint. MONITOR-MWAIT is another way to wait for a memory location to change, but it only allows us to track one cache line and it's only available on x86. There's no similar "wait until" construct for hardware transactions. At the instruction-set level a simple way to express "wait until" in transactions would be to add a new TXPAUSE instruction that could be used within an active hardware transaction. TXPAUSE would politely stall the invoking thread, possibly surrendering or yielding compute resources, while at the same time continuing to track the transaction's address-set. Once a transaction has executed TXPAUSE it can only abort. Ideally that'd happen when some other thread modifies a variable that's in the transaction's read-set or write-set. And since we're aborting all writes would be discarded. In a sense this gives us multi-location MWAIT but with much more flexibility. We could also augment the TXPAUSE with a cycle-count bound to cap the time spent stalled.

I should note that we can already use hardware transactions to enter a tight spin loop in a transaction to wait for updates to the address-set, which will in turn cause an abort. Assuming that the implementation monitors the address-set via cache-coherence probes, by waiting in this fashion we actually communicate via the probes, and not via memory values. That is, the updating thread signals the waiter via probes instead of by traditional memory values. But TXPAUSE takes that a step further and gives us a polite way to spin within transactions.

Lets consider a classic 'polite' test-and-test-and-set loop as might be find in a simple spin lock. The contending threads loop, loading the lock word value to see if the lock has transitioned from LOCKED to UNLOCKED state. If they observe the UNLOCKED they'll then try to acquire the lock with an atomic operation such as XCHG or compare-and-swap. While busy-waiting, the cache line underlying the lock word might be in MESI "S" = shared state in the contending thread's local cache. When the owner ultimately drops the lock -- typically by using a STORE instruction to overwrite LOCKED with UNLOCKED -- the line will change from "S" to "I" (Invalid) in the contending thread's cache. The contending thread continues to loop. The next load by the contending thread misses because the line is in local "I" state, and then causes a read-to-share (RTS) coherence operation to get the line back in the local cache in "S" state. Assuming the spinning thread sees UNLOCKED it'll then try the atomic read-modify-write instruction to acquire the lock. This causes another read-to-owner (RFO or RTO) coherence operation to upgrade the line from "S" to "M" so the atomic can run. So the operations in the local cache were : a remote invalidation from the owner, an RTS via the load and then an RTO via the atomic. (This is the best case, btw).

Now lets say the spinning thread uses a hardware transaction to wait. Assuming Haswell-like hardware, the contending thread would execute an XBEGIN instruction to start the transaction, and then load the lock word. If the lock word is LOCKED, then the transaction simply enters a tight loop, where the only exit from the loop is via abort. If the lock word happened to contain UNLOCKED then the transaction can try to acquire the lock with a store of LOCKED followed by a COMMIT instruction. In that case we're done and the thread has acquired ownership. The case where we spin in the transaction is more interesting. The transaction will have issued a load on the lock word, so the line is in local "S" state. The store used by the owner to subsequently drop the lock will cause the line underlying the lock to be invalidated in the spinning thread's local cache. This causes an abort. After the abort, the contending thread can try an atomic compare-and-swap to acquire lock. This will typically transition the cache line from I to M by virtue of an RTO coherence operation. Note that we've saved the intermediate RTS operation by spinning in this manner, so, even absent TXPAUSE, it can be useful to use transactions to wait for values in memory to change. (As an aside, ignoring the use of hardware transactions, MOESI protocols are much more tolerant of spin locks than are MESI systems).

See also US20100169895.

Keywords: Hardware Transactional Memory; HTM; Haswell; RTM; TSX; spin lock; busy-wait

On the topic of new instructions, it might be useful to have a selective memory fence instruction that specified prior and subsequent addresses. Currently we might write "ST A=1; MEMBAR StoreLoad; LD B". A selective fence would allow us to write "ST A=1; MEMBAR StoreLoad A, B; LD B;" The selective fence would ensure that the store to A was visible before the fetch of B, but without imposing full store-load barrier semantics for all accesses. A degenerate form might be expressed as "MEMBAR A" which would specify that all subsequent (in program order) accesses to A would be visible before any subsequent memory accesses. On a simple in-order processor with a store buffer, "MEMBAR A" could simply wait while an address matching "A" appear in the store buffer. When A finally drained out to visible & coherent space, the MEMBAR would complete. It's possibly such selective fences might give the CPU designers more latitude to improve performance.

Wednesday Oct 24, 2012

Polite busy-waiting with WRPAUSE on SPARC

Unbounded busy-waiting is an poor idea for user-space code, so we typically use spin-then-block strategies when, say, waiting for a lock to be released or some other event. If we're going to spin, even briefly, then we'd prefer to do so in a manner that minimizes performance degradation for other sibling logical processors ("strands") that share compute resources. We want to spin politely and refrain from impeding the progress and performance of other threads — ostensibly doing useful work and making progress — that run on the same core. On a SPARC T4, for instance, 8 strands will share a core, and that core has its own L1 cache and 2 pipelines. On x86 we have the PAUSE instruction, which, naively, can be thought of as a hardware "yield" operator which temporarily surrenders compute resources to threads on sibling strands. Of course this helps avoid intra-core performance interference. On the SPARC T2 our preferred busy-waiting idiom was "RD %CCR,%G0" which is a high-latency no-nop. The T4 provides a dedicated and extremely useful WRPAUSE instruction. The processor architecture manuals are the authoritative source, but briefly, WRPAUSE writes a cycle count into the the PAUSE register, which is ASR27. Barring interrupts, the processor then delays for the requested period. There's no need for the operating system to save the PAUSE register over context switches as it always resets to 0 on traps.

Digressing briefly, if you use unbounded spinning then ultimately the kernel will preempt and deschedule your thread if there are other ready threads than are starving. But by using a spin-then-block strategy we can allow other ready threads to run without resorting to involuntary time-slicing, which operates on a long-ish time scale. Generally, that makes your application more responsive. In addition, by blocking voluntarily we give the operating system far more latitude regarding power management. Finally, I should note that while we have OS-level facilities like sched_yield() at our disposal, yielding almost never does what you'd want or naively expect.

Returning to WRPAUSE, it's natural to ask how well it works. To help answer that question I wrote a very simple C/pthreads benchmark that launches 8 concurrent threads and binds those threads to processors 0..7. The processors are numbered geographically on the T4, so those threads will all be running on just one core. Unlike the SPARC T2, where logical CPUs 0,1,2 and 3 were assigned to the first pipeline, and CPUs 4,5,6 and 7 were assigned to the 2nd, there's no fixed mapping between CPUs and pipelines in the T4. And in some circumstances when the other 7 logical processors are idling quietly, it's possible for the remaining logical processor to leverage both pipelines -- "pipeline fusion". Some number T of the threads will iterate in a tight loop advancing a simple Marsaglia xor-shift pseudo-random number generator. T is a command-line argument. The main thread loops, reporting the aggregate number of PRNG steps performed collectively by those T threads in the last 10 second measurement interval. The other threads (there are 8-T of these) run in a loop busy-waiting concurrently with the T threads. We vary T between 1 and 8 threads, and report on various busy-waiting idioms. The values in the table are the aggregate number of PRNG steps completed by the set of T threads. The unit is millions of iterations per 10 seconds. For the "PRNG step" busy-waiting mode, the busy-waiting threads execute exactly the same code as the T worker threads. We can easily normalize the scores and compute the average rate of progress for individual worker threads by dividing the aggregate score by the number of worker threads T. I should note that the PRNG steps are extremely cycle-heavy and access almost no memory, so arguably this microbenchmark is not as representative of "normal" code as it could be. And for the purposes of comparison I included a row in the table that reflects a waiting policy where the waiting threads call poll(NULL,0,1000) and block in the kernel. Obviously this isn't busy-waiting, but the data is interesting for reference.

As we can see in the table below, WRPAUSE provides a good way to spin politely. And for short-term waiting it's much more efficient than parking in the kernel and potentially creating software timers for timed OS-level waiting. So we have a new facility that's as polite and effective -- with respect to sibling interference -- as is parking a thread, but that avoids the trip to the kernel and the other overheads associated with context switching. It's worth pointing out that the T=1 and T=2 scores for poll() and WRPAUSE forms are about equal because at T=1 we're leveraging both pipelines. And 3348 units of work is the approximate cycle cap for a core.

Aggregate progress
T = #worker threads
Wait Mechanism for 8-T threadsT=1T=2T=3T=4T=5T=6T=7T=8
Park thread in poll() 32653347334833483348334833483348
no-op 415 831 124316482060249729303349
RD %ccr,%g0 "pause" 14262429269228623013316232553349
PRNG step 412 829 124616702092251029303348
WRPause(8000) 32443361333133483349334833483348
WRPause(4000) 32153308331533223347334833473348
WRPause(1000) 30853199322432513310334833483348
WRPause(500) 29173070315032223270330933483348
WRPause(250) 26942864294930773205338833483348
WRPause(100) 21552469262227902911321433303348


Dave is a senior research scientist in the Scalable Synchronization Research Group within Oracle Labs : Google Scholar.


« October 2012 »