Monday Oct 28, 2013
Friday May 31, 2013
By Dave on May 31, 2013
"Lightweight Contention Management for Efficient Compare-and-Swap Operations" by Dave Dice, Danny Hendler and Ilya Mirsky will appear in Euro-Par 2013. A longer technical report is also available in arxiv.
Monday Jan 07, 2013
By Dave on Jan 07, 2013
As a side note, it seems Vyukov's distributed reader-writer lock is similar to the concept of sub-latches and super-latches found in some database implementations.
Friday Nov 23, 2012
By Dave on Nov 23, 2012
See this posting by Aleksey Shipilev for details -- @Contended is something we've wanted for a long time. The JVM provides automatic layout and placement of fields. Usually it'll (a) sort fields by descending size to improve footprint, and (b) pack reference fields so the garbage collector can process a contiguous run of reference fields when tracing. @Contended gives the programmer a way to provide more explicit guidance with respect to concurrency and false sharing. Using this facility we can sequester hot, frequently written shared fields away from other mostly read-only or cold fields. The simple rule is that read-sharing is cheap, and write-sharing is very expensive. We can also pack fields together that tend to be written together by the same thread at about the same time.
More generally, we're trying to influence relative field placement to minimize coherency misses. In a simple single-threaded environment, fields that are accessed closely together in time should be placed proximally in space to promote cache locality. That is, temporal locality should condition spatial locality: fields accessed together in time should be nearby in space. That having been said, when threads are accessing our fields concurrently we have to be careful to avoid false sharing and excessive invalidation from coherence traffic. As such, we try to cluster or otherwise sequester fields that tend to be written at approximately the same time by the same thread onto the same cache line. Note that there's a tension at play: if we try too hard to minimize single-threaded capacity misses then we can end up with excessive coherency misses when running in a parallel environment. In native C/C++ code it's fairly typical for programmers to use informed, concurrency-aware structure layout. @Contended should give us the same capability in Java, although in native code the binding of fields to offsets happens at compile-time, while it happens at load-time for Java. It's worth pointing out that in the general case there is no single layout that is optimal for both single-threaded and multithreaded environments. And the ideal layout problem itself is NP-hard.
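As a rough illustration of that kind of hand-rolled, concurrency-aware layout in C (the structure and field names are invented for the example, and 64 bytes is an assumed line size), one might sequester a hot, frequently written field on its own cache line:

```c
#include <stdatomic.h>

#define CACHE_LINE 64                  /* assumed cache-line size */

struct session {                       /* hypothetical structure for illustration */
    /* Read-mostly fields packed together: read-sharing is cheap. */
    int   id;
    int   flags;
    void *config;                      /* effectively read-only after initialization */

    /* Hot, frequently written field sequestered on its own cache line so
       writes to it don't invalidate the read-mostly fields above. */
    _Alignas(CACHE_LINE) _Atomic long hits;
    char pad[CACHE_LINE - sizeof(_Atomic long)];   /* keep trailing fields off this line */
};
```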
Ideally, a JVM would employ hardware monitoring facilities to detect sharing behavior and change the layout on the fly. That's a bit difficult as we don't yet have the right plumbing to provide efficient and expedient information to the JVM. Hint: we need to disintermediate the OS and hypervisor. Another challenge is that raw field offsets are used by the Unsafe facility, so we'd need to address that issue, possibly with an extra level of indirection.
Finally, I'd like to be able to pack final fields together as well, as those are known to be read-only.
Thursday Nov 22, 2012
By Dave on Nov 22, 2012
For classic NUMA-aware programming I'm typically most concerned about simple cold, capacity and compulsory misses and whether we can satisfy the miss from locally connected memory or whether we have to pull the line from its home node over the coherent interconnect -- we'd like to minimize channel contention and conserve interconnect bandwidth. That is, for this style of programming we're quite aware of where memory is homed relative to the threads that will be accessing it. Ideally, a page is collocated on the node with the thread that's expected to most frequently access the page, as simple misses on the page can then be satisfied without resorting to transferring the line over the interconnect. The default "first touch" NUMA page placement policy tends to work reasonably well in this regard. When a virtual page is first accessed, the operating system will attempt to provision and map that virtual page to a physical page allocated from the node where the accessing thread is running. It's worth noting that the node-level memory interleaving granularity is usually a multiple of the page size, so we can say that a given page P resides on some node N. That is, the memory underlying a page resides on just one node.
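As a minimal sketch of cooperating with the first-touch policy (the buffer size is arbitrary and explicit processor binding is elided), we let the thread that will do most of the accessing perform the initial writes, so the underlying pages end up homed on its node:

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE (64UL * 1024 * 1024)

static void *worker(void *arg) {
    (void) arg;
    /* Because of first-touch placement, writing the pages here -- in the
       thread that will use them -- rather than in the parent thread tends
       to home the physical pages on this thread's node. */
    char *buf = malloc(BUF_SIZE);
    memset(buf, 0, BUF_SIZE);            /* first touch happens on this node */
    /* ... do the node-local work on buf ... */
    free(buf);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}
```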
But when thinking about accesses to heavily written communication variables we normally consider which caches the lines underlying such variables might be resident in, and in what states. We want to minimize coherence misses, cache probe activity, and interconnect traffic in general. I don't usually give much thought to the location of the home NUMA node underlying such highly shared variables. On a SPARC T5440, for instance, which consists of 4 T2+ processors connected by a central coherence hub, the home node and placement of heavily accessed communication variables have very little impact on performance. The variables are frequently accessed, so the lines are likely in M-state in some cache, and the location of the home node is of little consequence because a requester can use cache-to-cache transfers to get the line.
Or at least that's what I thought. Recently, though, I was exploring a simple shared-memory point-to-point communication model where a client writes a request into a request mailbox and then busy-waits on a response variable. It's a simple example of delegation based on message passing. The server polls the request mailbox, and having fetched a new request value, performs some operation and then writes a reply value into the response variable. As noted above, on a T5440 performance is insensitive to the placement of the communication variables -- the request and response mailbox words. But on a Sun/Oracle X4800 I noticed that this was not the case: NUMA placement of the communication variables was actually quite important.
For background, an X4800 system consists of 8 Intel X7560 Xeons. Each package (socket) has 8 cores with 2 contexts per core, so the system is 8x8x2. Each package is also a NUMA node and has locally attached memory. Every package has 3 point-to-point QPI links for cache coherence, and the system is configured with a twisted-ladder "mobius" topology. The cache coherence fabric is glueless -- there's no central arbiter or coherence hub. The maximum distance between any two nodes is just 2 hops over the QPI links. For any given node, 3 other nodes are 1 hop distant and the remaining 4 nodes are 2 hops distant.
Using a single request (client) thread and a single response (server) thread, a benchmark harness explored all permutations of NUMA placement for the two threads and the two communication variables, measuring the average round-trip time and throughput rate between the client and server. In this benchmark the server acts as a simple transponder, writing the request value plus 1 back into the reply field, so there's no particular computation phase and we're only measuring communication overheads. In addition to varying the placement of communication variables over pairs of nodes, we also explored variations where both variables were placed on one page (and thus on one node) -- either on the same cache line or different cache lines -- while varying the node where the variables reside along with the placement of the threads. The key observation was that if the client and server threads were on different nodes, then the best placement of variables was to have the request variable (written by the client and read by the server) reside on the same node as the client thread, and to place the response variable (written by the server and read by the client) on the same node as the server. That is, if you have a variable that's to be written by one thread and read by another, it should be homed with the writer thread. For our simple client-server model that means using split request and response communication variables with unidirectional message flow on a given page. This can yield up to twice the throughput of less favorable placement strategies.
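A minimal sketch of the communication pattern (this is not the actual harness: the names, iteration count, and absence of thread binding and explicit NUMA placement are all simplifications) looks roughly like this, with the request and response words kept on separate cache lines so each can be homed independently:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define ITERS 1000000

/* Split, padded mailboxes.  Ideally the request word is homed on the
   client's node and the response word on the server's node. */
static _Alignas(64) _Atomic long request;
static _Alignas(64) _Atomic long response;

static void *server(void *arg) {
    (void) arg;
    long last = 0;
    for (;;) {
        long r;
        while ((r = atomic_load(&request)) == last)
            ;                               /* poll the request mailbox */
        if (r < 0)
            break;                          /* shutdown sentinel */
        atomic_store(&response, r + 1);     /* act as a simple transponder */
        last = r;
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, server, NULL);
    for (long i = 1; i <= ITERS; i++) {
        atomic_store(&request, i);          /* client posts a request ...      */
        while (atomic_load(&response) != i + 1)
            ;                               /* ... and busy-waits on the reply */
    }
    atomic_store(&request, -1);             /* ask the server to exit */
    pthread_join(t, NULL);
    printf("completed %d round trips\n", ITERS);
    return 0;
}
```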
Our X4800 uses the QPI 1.0 protocol with source-based snooping. Briefly, when node A needs to probe a cache line it fires off snoop requests to all the nodes in the system. Those recipients then forward their responses not to the original requester, but to the home node H of the cache line. H waits for and collects the responses, adjudicates and resolves conflicts, ensures memory-model ordering, and then sends a definitive reply back to the original requester A. If some node B needed to transfer the line to A, it will do so by cache-to-cache transfer and let H know about the disposition of the cache line. A still needs to wait for the authoritative response from H. So if a thread on node A wants to write a value to be read by a thread on node B, the latency is dependent on the distances between A, B, and H. We observed the best performance when the written-to variable is co-homed with the writer A. That is, we want H and A to be the same node: the writer doesn't need the home to respond over a QPI link because the writer and the home reside on the very same node. With architecturally informed placement of communication variables we eliminate at least one QPI hop from the critical path.
Newer Intel processors use the QPI 1.1 coherence protocol with home-based snooping. As noted above, under source-snooping a requester broadcasts snoop requests to all nodes. Those nodes send their responses to the home node of the location, which provides memory ordering, reconciles conflicts, etc., and then posts a definitive reply to the requester. With home-based snooping the probes go directly to the home node and are not broadcast. The home node can consult snoop filters -- if present -- and send out requests to retrieve the line if necessary. The 3rd-party owner of the line, if any, can respond either to the home or to the original requester (or even to both) according to the protocol policies. There are myriad variations that have been implemented, and unfortunately terminology doesn't always agree between vendors or with the academic taxonomy papers. The key point is that home-snooping enables the use of a snoop filter to reduce interconnect traffic. And while home-snooping might have a longer critical path (latency) than source-based snooping, it may also require fewer messages and less overall bandwidth. It'll be interesting to reprise these experiments on a platform with home-based snooping.
While collecting data I also noticed that there are placement concerns even in the seemingly trivial case when both threads and both variables reside on a single node. Internally, the cores on each X7560 package are connected by an internal ring. (Actually there are multiple contra-rotating rings). And the last-level on-chip cache (LLC) is partitioned into banks or slices, with each slice associated with a core on the ring topology. A hardware hash function associates each physical address with a specific home bank. Thus we face distance and topology concerns even for intra-package communication, although the latencies are not nearly of the magnitude we see inter-package. I've not seen such communication-distance artifacts on the T2+, where the cache banks are connected to the cores via a high-speed crossbar instead of a ring -- communication latencies seem more uniform.
Finally, I've seen strong hints that the placement of threads relative to lock metadata accessed by those threads plays an important part in performance for contended locks.
Tuesday Nov 06, 2012
By Dave on Nov 06, 2012
I was recently looking into a performance issue in the java.util.concurrent (JUC) fork-join pool framework related to particularly long latencies when trying to wake (unpark) threads in the pool. Eventually I tracked the issue down to the power and scaling governor and idle-state policies on x86. Briefly, the processors that were to run the unparked threads were sitting in deep C-states and at low P-states, and the time needed to exit those idle states and ramp back up to full speed showed up as wake-up latency.
The issue of C-states and P-states isn't new and has been described at length elsewhere, but it may be new to Java programmers, adding a new confounding factor to benchmarking methodologies and procedures. To get stable results I'd recommend running at C0 and P0, particularly for server-side applications. As appropriate, disabling "turbo" mode may also be prudent. But it also makes sense to run with the system defaults to understand if your application exhibits any performance sensitivity to power management policies.
The operating system's power management subsystem typically controls the P-states and C-states based on current and recent load. The scaling governor manages P-states. Operating systems often use adaptive policies that try to avoid deep C-states for some period if recent deep-idle episodes proved to be very short and futile. This helps make the system more responsive under bursty or otherwise irregular load. But it also means the system is stateful and exhibits a memory effect, which can further complicate benchmarking. Forcing C0 + P0 should avoid this issue.
Thursday Oct 25, 2012
By Dave on Oct 25, 2012
The constructs in the java.util.concurrent JSR-166 "JUC" concurrency library are currently NUMA-oblivious. That's because we currently don't have the topology discovery infrastructure and underpinnings in place that would allow and enable NUMA-awareness. But some quick throw-away prototypes show that it's possible to write NUMA-aware library code. I happened to use JUC Exchanger as a research vehicle. Another interesting idea is to adapt fork-join work-stealing to favor stealing from queues associated with 'nearby' threads.
By Dave on Oct 25, 2012
Folks often ask me how to approach a suspected performance issue. My personal strategy is informed by the fact that I work on concurrency issues. (When you have a hammer everything looks like a nail, but I'll try to keep this general). A good starting point is to ask yourself if the observed performance matches your expectations. Expectations might be derived from known system performance limits, prototypes, and other software or environments that are comparable to your particular system-under-test. Some simple comparisons and microbenchmarks can be useful at this stage. It's also useful to write some very simple programs to validate some of the reported or expected system limits. Can that disk controller really tolerate and sustain 500 reads per second? To reduce the number of confounding factors it's better to try to answer that question with a very simple targeted program. And finally, nothing beats familiarity with the technologies underlying your particular layer.
On the topic of confounding factors, as our technology stacks become deeper and less transparent, we often find our own technology working against us in some unexpected way to choke performance, rather than simply running into some fundamental system limit. A good example is the warm-up time needed by just-in-time compilers in Java Virtual Machines. I won't delve too far into that particular hole except to say that it's rare to find good benchmarks and methodology for Java code. Another example is power management on x86. Power management is great, but it can take a while for the CPUs to throttle up from low(er) frequencies to full throttle. And while I love "turbo" mode, it makes benchmarking applications with multiple threads a chore, as you have to remember to turn it off and then back on; otherwise short single-threaded runs may look abnormally fast compared to runs with higher thread counts. In general for performance characterization I disable turbo mode and fix the power governor at the "performance" state. Another source of complexity is the scheduler, which I've discussed in prior blog entries.
Let's say I have a running application and I want to better understand its behavior and performance. We'll presume it's warmed up, is under load, and is in an execution mode representative of what we think the norm would be. It should be in steady state, if a steady-state mode even exists. On Solaris the very first thing I'll do is take a set of "pstack" samples. Pstack briefly stops the process and walks each of the stacks, reporting symbolic information (if available) for each frame. For Java, pstack has been augmented to understand Java frames, and even to report inlining. A few pstack samples can provide powerful insight into what's actually going on inside the program. You'll be able to see calling patterns, which threads are blocked on what system calls or synchronization constructs, memory allocation, etc. If your code is CPU-bound then you'll get a good sense of where the cycles are being spent. (I should caution that normal C/C++ inlining can diffuse an otherwise "hot" method into other methods. This is a rare instance where pstack sampling might not immediately point to the key problem). At this point you'll need to reconcile what you're seeing with pstack against your mental model of what you think the program should be doing. They're often rather different. And generally, if there's a key performance issue, you'll spot it with a moderate number of samples.
I'll also use OS-level observability tools to look for bottlenecks where threads contend for locks, other situations where threads are blocked, and the distribution of threads over the system. On Solaris some good tools are mpstat and, to a lesser degree, vmstat. Try running "mpstat -a 5" in one window while the application program runs concurrently. One key measure is the voluntary context switch rate "vctx" or "csw", which reflects threads descheduling themselves. It's also good to look at the user, system, and idle CPU percentages. This can give a broad but useful understanding of whether your threads are mostly parked or mostly running. For instance, if your program makes heavy use of malloc/free, then it might be the case that you're contending on the central malloc lock in the default allocator. In that case you'd see malloc calling lock routines in the stack traces, observe a high csw/vctx rate as threads block on the malloc lock, and your "usr" time would be less than expected.
Solaris dtrace is a wonderful and invaluable performance tool as well, but in a sense you have to frame and articulate a meaningful and specific question to get a useful answer, so I tend not to use it for first-order screening of problems. It's also most effective for OS- and software-level performance issues, as opposed to HW-level issues. For that reason I recommend mpstat and pstack as the first step in performance triage. If some other OS-level issue is evident then it's good to switch to dtrace to drill more deeply into the problem.
Only after I've ruled out OS-level issues do I switch to using hardware performance counters to look for architectural impediments.
By Dave on Oct 25, 2012
Classic locks are an appropriate tool to prevent potentially conflicting operations A and B, invoked by different threads, from running at the same time. In a sense the locks cause either A to run before B or vice-versa. Similarly, we can replace the locks with hardware transactional memory, or use transactional lock elision to leverage potential disjoint-access parallelism between A and B. But often we want A to wait until B has run. In a Pthreads environment we'd usually use locks in conjunction with condition variables to implement our "wait until" constraint. MONITOR-MWAIT is another way to wait for a memory location to change, but it only allows us to track one cache line and it's only available on x86. There's no similar "wait until" construct for hardware transactions. At the instruction-set level a simple way to express "wait until" in transactions would be to add a new TXPAUSE instruction that could be used within an active hardware transaction. TXPAUSE would politely stall the invoking thread, possibly surrendering or yielding compute resources, while at the same time continuing to track the transaction's address-set. Once a transaction has executed TXPAUSE it can only abort. Ideally that'd happen when some other thread modifies a variable that's in the transaction's read-set or write-set. And since we're aborting, all writes would be discarded. In a sense this gives us multi-location MWAIT but with much more flexibility. We could also augment TXPAUSE with a cycle-count bound to cap the time spent stalled. (For a discussion of the benefits of polite waiting, see this blog entry on the use of WRPAUSE.)
A related concept for software transactional memory is the retry facility proposed by Harris et al. in Composable memory transactions.
I should note that we can already use hardware transactions to enter a tight spin loop in a transaction to wait for updates to the address-set, which will in turn cause an abort. Assuming that the implementation monitors the address-set via cache-coherence probes, by waiting in this fashion we actually communicate via the probes, and not via memory values. That is, the updating thread signals the waiter via probes instead of by traditional memory values. But TXPAUSE takes that a step further and gives us a polite way to spin within transactions.
Let's consider a classic 'polite' test-and-test-and-set loop as might be found in a simple spin lock. The contending threads loop, loading the lock word value to see if the lock has transitioned from LOCKED to UNLOCKED state. If they observe UNLOCKED they'll then try to acquire the lock with an atomic operation such as XCHG or compare-and-swap. While busy-waiting, the cache line underlying the lock word might be in MESI "S" (shared) state in the contending thread's local cache. When the owner ultimately drops the lock -- typically by using a STORE instruction to overwrite LOCKED with UNLOCKED -- the line will change from "S" to "I" (Invalid) in the contending thread's cache. The contending thread continues to loop. The next load by the contending thread misses because the line is in local "I" state, and causes a read-to-share (RTS) coherence operation to get the line back into the local cache in "S" state. Assuming the spinning thread sees UNLOCKED, it'll then try the atomic read-modify-write instruction to acquire the lock. This causes another read-to-own (RFO or RTO) coherence operation to upgrade the line from "S" to "M" so the atomic can run. So the operations in the local cache were: a remote invalidation from the owner, an RTS via the load, and then an RTO via the atomic. (This is the best case, btw).
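A minimal sketch of that test-and-test-and-set idiom in C11 atomics (no back-off, fairness, or polite-waiting instructions; those would go in the inner loop):

```c
#include <stdatomic.h>

#define UNLOCKED 0
#define LOCKED   1

typedef struct { _Atomic int word; } spinlock_t;

static void spin_lock(spinlock_t *l) {
    for (;;) {
        /* Polite phase: load-only spinning keeps the line in local "S" state. */
        while (atomic_load_explicit(&l->word, memory_order_relaxed) == LOCKED)
            ;   /* a PAUSE / WRPAUSE-style delay would normally go here */

        /* Observed UNLOCKED: try to take it with an atomic exchange (the RTO). */
        if (atomic_exchange_explicit(&l->word, LOCKED,
                                     memory_order_acquire) == UNLOCKED)
            return;
    }
}

static void spin_unlock(spinlock_t *l) {
    /* The owner drops the lock with a plain store, invalidating waiters' copies. */
    atomic_store_explicit(&l->word, UNLOCKED, memory_order_release);
}
```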
Now let's say the spinning thread uses a hardware transaction to wait. Assuming Haswell-like hardware, the contending thread would execute an XBEGIN instruction to start the transaction, and then load the lock word. If the lock word is LOCKED, then the transaction simply enters a tight loop, where the only exit from the loop is via abort. If the lock word happened to contain UNLOCKED then the transaction can try to acquire the lock with a store of LOCKED followed by a COMMIT instruction. In that case we're done and the thread has acquired ownership. The case where we spin in the transaction is more interesting. The transaction will have issued a load on the lock word, so the line is in local "S" state. The store used by the owner to subsequently drop the lock will cause the line underlying the lock to be invalidated in the spinning thread's local cache. This causes an abort. After the abort, the contending thread can try an atomic compare-and-swap to acquire the lock. This will typically transition the cache line from "I" to "M" by virtue of an RTO coherence operation. Note that we've saved the intermediate RTS operation by spinning in this manner, so, even absent TXPAUSE, it can be useful to use transactions to wait for values in memory to change. (As an aside, ignoring the use of hardware transactions, MOESI protocols are much more tolerant of spin locks than are MESI systems).
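Here's a rough sketch of that transactional-waiting idiom using the Intel RTM intrinsics (a sketch only, assuming a Haswell-class part and a compiler with RTM support; retry policies and a real fallback path are elided, and this is not production lock code):

```c
#include <immintrin.h>   /* _xbegin / _xend / _XBEGIN_STARTED; compile with -mrtm */
#include <stdatomic.h>

#define UNLOCKED 0
#define LOCKED   1

static _Atomic int lockword = UNLOCKED;

static void htm_wait_and_lock(void) {
    for (;;) {
        if (_xbegin() == _XBEGIN_STARTED) {
            if (atomic_load_explicit(&lockword, memory_order_relaxed) == UNLOCKED) {
                /* Lock is free: take it inside the transaction and commit. */
                atomic_store_explicit(&lockword, LOCKED, memory_order_relaxed);
                _xend();
                return;
            }
            /* Lock is held: wait by spinning inside the transaction.  The
               owner's store of UNLOCKED invalidates our read-set and aborts
               us; a TXPAUSE-style instruction would let us wait here
               politely instead of burning cycles. */
            for (;;)
                ;
        }
        /* We aborted, so the lock word probably just changed hands: try to
           grab it directly with an atomic exchange ("I" to "M" via one RTO). */
        if (atomic_exchange_explicit(&lockword, LOCKED,
                                     memory_order_acquire) == UNLOCKED)
            return;
    }
}
```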
See also US20100169895.
Keywords: Hardware Transactional Memory; HTM; Haswell; RTM; TSX; spin lock; busy-wait
On the topic of new instructions, it might be useful to have a selective memory fence instruction that specified prior and subsequent addresses. Currently we might write "ST A=1; MEMBAR StoreLoad; LD B". A selective fence would allow us to write "ST A=1; MEMBAR StoreLoad A, B; LD B". The selective fence would ensure that the store to A was visible before the fetch of B, but without imposing full store-load barrier semantics on all accesses. A degenerate form might be expressed as "MEMBAR A", which would specify that all prior (in program order) accesses to A must be visible before any subsequent memory accesses. On a simple in-order processor with a store buffer, "MEMBAR A" could simply wait while any store to an address matching "A" remains in the store buffer. When A finally drained out to visible and coherent space, the MEMBAR would complete. It's possible such selective fences might give the CPU designers more latitude to improve performance.
Wednesday Oct 24, 2012
By Dave on Oct 24, 2012
Unbounded busy-waiting is a poor idea for user-space code, so we typically use spin-then-block strategies when, say, waiting for a lock to be released or for some other event. If we're going to spin, even briefly, then we'd prefer to do so in a manner that minimizes performance degradation for other sibling logical processors ("strands") that share compute resources. We want to spin politely and refrain from impeding the progress and performance of other threads -- ostensibly doing useful work and making progress -- that run on the same core. On a SPARC T4, for instance, 8 strands will share a core, and that core has its own L1 cache and 2 pipelines. On x86 we have the PAUSE instruction, which, naively, can be thought of as a hardware "yield" operator that temporarily surrenders compute resources to threads on sibling strands. Of course this helps avoid intra-core performance interference. On the SPARC T2 our preferred busy-waiting idiom was "RD %CCR,%G0", which is a high-latency no-op. The T4 provides a dedicated and extremely useful WRPAUSE instruction. The processor architecture manuals are the authoritative source, but briefly, WRPAUSE writes a cycle count into the PAUSE register, which is ASR27. Barring interrupts, the processor then delays for the requested period. There's no need for the operating system to save the PAUSE register over context switches as it always resets to 0 on traps.
Digressing briefly, if you use unbounded spinning then ultimately the kernel will preempt and deschedule your thread if there are other ready threads that are starving. But by using a spin-then-block strategy we can allow other ready threads to run without resorting to involuntary time-slicing, which operates on a long-ish time scale. Generally, that makes your application more responsive. In addition, by blocking voluntarily we give the operating system far more latitude regarding power management. Finally, I should note that while we have OS-level facilities like sched_yield() at our disposal, yielding almost never does what you'd want or naively expect.
Returning to WRPAUSE, it's natural to ask how well it works. To help answer that question I wrote a very simple C/pthreads benchmark that launches 8 concurrent threads and binds those threads to processors 0..7. The processors are numbered geographically on the T4, so those threads will all be running on just one core. Unlike the SPARC T2, where logical CPUs 0,1,2 and 3 were assigned to the first pipeline, and CPUs 4,5,6 and 7 were assigned to the 2nd, there's no fixed mapping between CPUs and pipelines in the T4. And in some circumstances when the other 7 logical processors are idling quietly, it's possible for the remaining logical processor to leverage both pipelines -- "pipeline fusion". Some number T of the threads will iterate in a tight loop advancing a simple Marsaglia xor-shift pseudo-random number generator. T is a command-line argument. The main thread loops, reporting the aggregate number of PRNG steps performed collectively by those T threads in the last 10 second measurement interval. The other threads (there are 8-T of these) run in a loop busy-waiting concurrently with the T threads. We vary T between 1 and 8 threads, and report on various busy-waiting idioms. The values in the table are the aggregate number of PRNG steps completed by the set of T threads. The unit is millions of iterations per 10 seconds. For the "PRNG step" busy-waiting mode, the busy-waiting threads execute exactly the same code as the T worker threads. We can easily normalize the scores and compute the average rate of progress for individual worker threads by dividing the aggregate score by the number of worker threads T. I should note that the PRNG steps are extremely cycle-heavy and access almost no memory, so arguably this microbenchmark is not as representative of "normal" code as it could be. And for the purposes of comparison I included a row in the table that reflects a waiting policy where the waiting threads call poll(NULL,0,1000) and block in the kernel. Obviously this isn't busy-waiting, but the data is interesting for reference.
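For reference, a Marsaglia-style xor-shift step of the kind the worker threads advance (the exact generator, constants, and word size used by the benchmark may differ; this is just a representative sketch) is only a few shifts and xors per iteration:

```c
#include <stdint.h>

/* One step of a 64-bit Marsaglia xor-shift generator. */
static inline uint64_t xorshift64_step(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Worker loop: advance the PRNG in a tight, cycle-bound loop. */
static uint64_t worker_steps(uint64_t iters) {
    uint64_t s = 88172645463325252ULL;   /* any non-zero seed */
    for (uint64_t i = 0; i < iters; i++)
        (void) xorshift64_step(&s);
    return s;                            /* return the state so the loop isn't optimized away */
}
```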
As we can see in the table below, WRPAUSE provides a good way to spin politely. And for short-term waiting it's much more efficient than parking in the kernel and potentially creating software timers for timed OS-level waiting. So we have a new facility that's as polite and effective -- with respect to sibling interference -- as is parking a thread, but that avoids the trip to the kernel and the other overheads associated with context switching. It's worth pointing out that the T=1 and T=2 scores for poll() and WRPAUSE forms are about equal because at T=1 we're leveraging both pipelines. And 3348 units of work is the approximate cycle cap for a core.
T = number of worker threads; values are millions of PRNG iterations per 10-second interval.

| Wait mechanism for the 8-T waiting threads | T=1 | T=2 | T=3 | T=4 | T=5 | T=6 | T=7 | T=8 |
|---|---|---|---|---|---|---|---|---|
| Park thread in poll() | 3265 | 3347 | 3348 | 3348 | 3348 | 3348 | 3348 | 3348 |
| RD %ccr,%g0 "pause" | 1426 | 2429 | 2692 | 2862 | 3013 | 3162 | 3255 | 3349 |
Tuesday Jul 03, 2012
By Dave on Jul 03, 2012
In a prior blog entry I noted that Solaris used a "maximum dispersal" placement policy to assign nascent threads to their initial processors. The general idea is that threads should be placed as far away from each other as possible in the resource topology in order to reduce resource contention between concurrently running threads. This policy assumes that resource contention -- pipelines, memory channel contention, destructive interference in the shared caches, etc. -- will likely outweigh (a) any potential communication benefits we might achieve by packing our threads more densely onto a subset of the NUMA nodes, and (b) benefits of NUMA affinity between memory allocated by one thread and accessed by other threads. We want our threads spread widely over the system and not packed together. Conceptually, when placing a new thread, the kernel picks the least loaded NUMA node (the node with the lowest aggregate load average), and then the least loaded core on that node, etc. Furthermore, the kernel places threads onto resources -- sockets, cores, pipelines, etc. -- without regard to the thread's process membership. That is, initial placement is process-agnostic. Keep reading, though. This description is incorrect.
On Solaris 10 on a SPARC T5440 with 4 x T2+ NUMA nodes, if the system is otherwise unloaded and we launch a process that creates 20 compute-bound concurrent threads, then typically we'll see a perfect balance with 5 threads on each node. We see similar behavior on an 8-node x86 x4800 system, where each node has 8 cores and each core is 2-way hyperthreaded. So far so good; this behavior seems in agreement with the policy I described in the 1st paragraph.
I recently tried the same experiment on a 4-node T4-4 running Solaris 11. Both the T5440 and T4-4 are 4-node systems that expose 256 logical thread contexts. To my surprise, all 20 threads were placed onto just one NUMA node while the other 3 nodes remained completely idle. I checked the usual suspects such as processor sets inadvertently left around by colleagues, processors left offline, and power management policies, but the system was configured normally. I then launched multiple concurrent instances of the process, and, interestingly, all the threads from the 1st process landed on one node, all the threads from the 2nd process landed on another node, and so on. This happened even if I interleaved thread creation between the processes, so I was relatively sure the effect wasn't related to thread creation time, but rather that placement was a function of process membership.
At this point I consulted the Solaris sources and talked with folks in the Solaris group. The new Solaris 11 behavior is intentional. The kernel is no longer using a simple maximum-dispersal policy, and thread placement is process-membership-aware. Now, even if other nodes are completely unloaded, the kernel will still try to pack new threads onto the home lgroup (socket) of the primordial thread until the load average of that node reaches 50%, after which it will pick the next least loaded node as the process's new favorite node for placement. On the T4-4 we have 64 logical thread contexts (strands) per socket (lgroup), so if we launch 48 concurrent threads we will find 32 placed on one node and 16 on some other node. If we launch 64 threads we'll find 32 and 32. That means we can end up with our threads clustered on a small subset of the nodes in a way that's quite different from what we've seen on Solaris 10. So we have a policy that allows process-aware packing but reverts to spreading threads onto other nodes if a node becomes too saturated. It turns out this policy was enabled in Solaris 10, but certain bugs suppressed the mixed packing/spreading behavior.
There are configuration variables in /etc/system that allow us to dial the affinity between nascent threads and their primordial thread up and down: see lgrp_expand_proc_thresh, specifically. In the OpenSolaris source code the key routine is mpo_update_tunables(). This method reads the /etc/system variables and sets up some global variables that will subsequently be used by the dispatcher, which calls lgrp_choose() in lgrp.c to place nascent threads. Lgrp_expand_proc_thresh controls how loaded an lgroup must be before we'll consider homing a process's threads to another lgroup. Tune this value lower to have it spread your process's threads out more.
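As a hedged illustration only, an /etc/system entry for this tunable would look something like the following; the numeric value shown is a made-up placeholder rather than a recommendation, so consult the Solaris tunables documentation for the real default and units before changing anything:

```
* Illustrative /etc/system fragment -- the value below is a placeholder only.
* Lower values cause a process's threads to spill onto other lgroups sooner.
set lgrp_expand_proc_thresh = 4
```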
To recap, the 'new' partial packing placement policy is as follows. Threads from the same process are packed onto a subset of the strands of a socket (50% for T-series). Once that socket reaches the 50% threshold the kernel then picks another preferred socket for that process. Threads from unrelated processes are spread across sockets. More precisely, different processes may have different preferred sockets (lgroups). Beware that I've simplified and elided details for the purposes of explication. The truth is in the code.
- It's worth noting that initial thread placement is just that. If there's a gross imbalance between the load on different nodes then the kernel will migrate threads to achieve a better and more even distribution over the set of available nodes. Once a thread runs and gains some affinity for a node, however, it becomes "stickier" under the assumption that the thread has residual cache residency on that node, and that memory allocated by that thread resides on that node given the default "first-touch" page-level NUMA allocation policy. Exactly how the various policies interact, and which have precedence under what circumstances, could be the topic of a future blog entry.
- The scheduler is work-conserving.
- The x4800 mentioned above is an interesting system. Each of the 8 sockets houses an Intel 7500-series processor. Each processor has 3 coherent QPI links and the system is arranged as a glueless 8-socket twisted-ladder "mobius" topology. Nodes are either 1 or 2 hops distant over the QPI links. As an aside, the mapping of logical CPUIDs to physical resources is rather interesting on Solaris/x4800. On SPARC/Solaris the CPUID layout is strictly geographic, with the highest-order bits identifying the socket, the next lower bits identifying the core within that socket, followed by the pipeline (if present) and finally the logical thread context ("strand") on the core. But on Solaris on the x4800 the CPUID layout is as follows: bit [6] identifies the hyperthread on a core; bits [5:3] identify the socket, or package in Intel terminology; and bits [2:0] identify the core within a socket. Such low-level details should be of interest only if you're binding threads -- a bad idea, as the kernel typically handles placement best -- or if you're writing NUMA-aware code that's aware of the ambient placement and makes decisions accordingly.
- Solaris introduced the so-called critical-threads mechanism, which is expressed by putting a thread into the FX scheduling class at priority 60. The critical-threads mechanism applies to placement on cores, not on sockets, however. That is, it's an intra-socket policy, not an inter-socket policy.
- Solaris 11 introduces the Power Aware Dispatcher (PAD), which packs threads instead of spreading them out in an attempt to keep sockets or cores at lower power levels. Maximum dispersal may be good for performance but is anathema to power management. PAD is off by default, but power management policies constitute yet another confounding factor with respect to scheduling and dispatching.
- The new policy is a compromise between packing and maximum dispersal. If your threads communicate heavily -- one thread reads cache lines last written by some other thread -- then the new dense packing policy may improve performance by reducing traffic on the coherent interconnect. On the other hand, if the threads in your process communicate rarely, then it's possible the new packing policy might result in contention on shared computing resources. Unfortunately there's no simple litmus test that says whether packing or spreading is optimal in a given situation. The optimal answer varies by system load, application, number of threads, and platform hardware characteristics. Currently we don't have the necessary tools and sensoria to decide at runtime, so we're reduced to an empirical approach where we run trials and try to decide on a placement policy. The situation is quite frustrating. Relatedly, it's often hard to determine just the right level of concurrency to maximize throughput. (Understanding constructive vs destructive interference in the shared caches would be a good start. We could augment the lines with a small tag field indicating which strand last installed or accessed a line. Given that, we could add new CPU performance counters that tallied misses where a thread evicts a line it installed and misses where a thread displaces a line installed by some other thread.)
Friday May 25, 2012
By Dave on May 25, 2012
HCLH Locks are NUMA-friendly hierarchical queue-based CLH locks. Subsequently, we've discovered that for correctness threads must be bound 1:1. Absent such binding the locks are vulnerable to exclusion failure related to node circulation & lifecycle issues. That particular invariant wasn't made clear in the HCLH paper.
Thursday Dec 22, 2011
By Dave on Dec 22, 2011
Lock Cohorting: A General Technique for Designing NUMA Locks by Dave Dice, Virendra Marathe and Nir Shavit in PPoPP 2012. The Cohort lock technique allows developers to construct NUMA-aware locks from NUMA-oblivious locks.
Thursday Nov 03, 2011
By Dave on Nov 03, 2011
Title: 7th Workshop on Transactional Computing
Deadline: December 1, 2011
Workshop: February 26, 2012
Location: New Orleans, Louisiana, USA (with PPoPP 2012)
TRANSACT 2012 is a forum for the presentation of research on all aspects of transactional computing. The scope of the workshop is intentionally broad, with the goal of encouraging interaction across the languages, architecture, systems, database, and theory communities. Papers may address implementation techniques, foundational results, applications and workloads, or experience with working systems. Environments of interest include the full range from multithreaded or multicore processors to high-end parallel computing.
Tuesday Nov 01, 2011
Monday Aug 01, 2011
Tuesday Jun 21, 2011
By Dave on Jun 21, 2011
The schedctl facility in Solaris allows a thread to request that the kernel defer involuntary preemption for a brief period. The mechanism is strictly advisory -- the kernel can opt to ignore the request. Schedctl is typically used to bracket lock critical sections. That, in turn, can avoid convoying -- threads piling up on a critical section behind a preempted lock-holder -- and other lock-related performance pathologies. If you're interested, see the man pages for schedctl_start() and schedctl_stop() and the schedctl.h include file. The implementation is very efficient. schedctl_start(), which asks that preemption be deferred, simply stores into a thread-specific structure -- the schedctl block -- that the kernel maps into user-space. Similarly, schedctl_stop() clears the flag set by schedctl_start() and then checks a "preemption pending" flag in the block. Normally this will be false, but if it is set, schedctl_stop() will yield to politely grant the CPU to other threads. Note that you can't abuse this facility for long-term preemption avoidance as the deferral is brief. If your thread exceeds the grace period the kernel will preempt it and transiently degrade its effective scheduling priority. Further reading: US05937187 and various papers by Andy Tucker.
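A minimal sketch of the usual bracketing idiom (error handling omitted; schedctl_init() returns a per-LWP block, cached here in a thread-local variable):

```c
#include <pthread.h>
#include <schedctl.h>

static __thread schedctl_t *scp;     /* one schedctl block per thread */

void guarded_update(pthread_mutex_t *m) {
    if (scp == NULL)
        scp = schedctl_init();       /* map this thread's schedctl block */

    schedctl_start(scp);             /* please defer preemption ... */
    pthread_mutex_lock(m);
    /* ... short critical section ... */
    pthread_mutex_unlock(m);
    schedctl_stop(scp);              /* ... and yield here if preemption is pending */
}
```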
We'll now switch topics to the implementation of the "synchronized" locking construct in the HotSpot JVM. If a lock is contended then on multiprocessor systems we'll spin briefly to try to avoid context switching. Context switching is wasted work and inflicts various cache and TLB penalties on the threads involved. If context switching were "free" then we'd never spin to avoid switching, but that's not the case. We use an adaptive spin-then-park strategy. One potentially undesirable outcome is that we can be preempted while spinning. When our spinning thread is finally rescheduled the lock may or may not be available. If not, we'll spin and then potentially park (block) again, thus suffering a 2nd context switch. Recall that the reason we spin is to avoid context switching. To avoid this scenario I've found it useful to use schedctl to request preemption deferral while spinning. While spinning, though, I've arranged for the code to periodically check or poll the "preemption pending" flag. If that's found to be set, we simply abandon our spinning attempt and park immediately. This avoids the double context-switch scenario above. This particular usage of schedctl is inverted in the sense that we cover the spin loop instead of the critical section. (I've experimented with extending the schedctl preemption deferral period over the critical section -- more about that in a subsequent blog entry).
One annoyance is that the schedctl blocks for the threads in a given process are tightly packed on special pages mapped from kernel space into user-land. As such, writes to the schedctl blocks can cause false sharing on other adjacent blocks. Hopefully the kernel folks will make changes to avoid this by padding and aligning the blocks to ensure that one cache line underlies at most one schedctl block at any one time. It's vaguely ironic that a facility designed to improve cooperation between threads suffers from false sharing.
Schedctl also exposes a thread's scheduling state. So if thread T2 holds a lock L, and T1 is contending for L, T1 can check T2's state to see whether it's running (ONPROC in Solaris terminology), ready, or blocked. If T2 is not running then it's usually prudent for T1 to park instead of continuing to spin, as the spin attempt is much more likely to be futile.
Tuesday Jun 14, 2011
By Dave on Jun 14, 2011
When the number of "lanes" is just 1, a partitioned ticket lock degenerates to a classic ticket lock.
We can easily provide K-exclusion by setting the number of lanes to K or larger and initializing the first K slots to 0 through K-1. For instance, if we have 8 lanes and K=3, the slots would be initialized to 0,1,2,0,0,0,0,0 respectively.
SPAA '11 Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures, 2011
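A minimal sketch of the lane/slot structure described above, using C11 atomics (a simplification: back-off, polite waiting, and wrap-around concerns are elided, and the lane count and cache-line size here are assumptions):

```c
#include <stdatomic.h>

#define LANES      8        /* number of grant slots; an assumption for this sketch */
#define CACHE_LINE 64

typedef struct {
    _Atomic long request;                          /* next ticket to hand out */
    struct {
        _Alignas(CACHE_LINE) _Atomic long grant;   /* one padded grant slot per lane */
    } slot[LANES];
} ptl_t;

/* K-exclusion initialization: the first K slots get 0..K-1 so the first K
   tickets proceed immediately.  K = 1 yields ordinary mutual exclusion. */
static void ptl_init(ptl_t *l, int k) {
    atomic_store(&l->request, 0);
    for (int i = 0; i < LANES; i++)
        atomic_store(&l->slot[i].grant, (i < k) ? i : 0);
}

static long ptl_acquire(ptl_t *l) {
    long t = atomic_fetch_add(&l->request, 1);     /* take a ticket */
    while (atomic_load(&l->slot[t % LANES].grant) != t)
        ;                                          /* spin on our own lane only */
    return t;                                      /* ticket is needed at release */
}

static void ptl_release(ptl_t *l, long t, int k) {
    /* Pass the baton K tickets ahead; with K = 1 this is the classic
       "grant = my ticket + 1" of a plain ticket lock. */
    atomic_store_explicit(&l->slot[(t + k) % LANES].grant, t + k,
                          memory_order_release);
}
```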
Friday Jun 10, 2011
By Dave on Jun 10, 2011
Of late system address space has been growing at a rate of between 1 and 1.5 addressable bits per year. Interestingly, though, we're now adding processors to systems faster than we're adding memory. That is, processors/GB RAM is currently increasing. If that trend holds true and programming models remain the same, then, arguably, synchronization will become even more important as more processors will be coordinating access to a given amount of memory. Proper fine-grained orchestration will become even more critical.
Saturday Apr 16, 2011
By Dave on Apr 16, 2011
Two sections of American Popular Music at Bridgewater State University are discussing the music of the last decade. The textbook stops around 1999. They are looking for your help. Imagine that it is 10 years from now and you must update the American popular music textbook to include a chapter on the years 2000-2011. What artists/groups do you feel absolutely must be in that chapter?
The survey takes about a minute to complete.
Dave is a senior research scientist in the Scalable Synchronization Research Group within Oracle Labs : Google Scholar.