Wednesday Oct 24, 2012

Polite busy-waiting with WRPAUSE on SPARC

Unbounded busy-waiting is an poor idea for user-space code, so we typically use spin-then-block strategies when, say, waiting for a lock to be released or some other event. If we're going to spin, even briefly, then we'd prefer to do so in a manner that minimizes performance degradation for other sibling logical processors ("strands") that share compute resources. We want to spin politely and refrain from impeding the progress and performance of other threads — ostensibly doing useful work and making progress — that run on the same core. On a SPARC T4, for instance, 8 strands will share a core, and that core has its own L1 cache and 2 pipelines. On x86 we have the PAUSE instruction, which, naively, can be thought of as a hardware "yield" operator which temporarily surrenders compute resources to threads on sibling strands. Of course this helps avoid intra-core performance interference. On the SPARC T2 our preferred busy-waiting idiom was "RD %CCR,%G0" which is a high-latency no-nop. The T4 provides a dedicated and extremely useful WRPAUSE instruction. The processor architecture manuals are the authoritative source, but briefly, WRPAUSE writes a cycle count into the the PAUSE register, which is ASR27. Barring interrupts, the processor then delays for the requested period. There's no need for the operating system to save the PAUSE register over context switches as it always resets to 0 on traps.

Digressing briefly, if you use unbounded spinning then ultimately the kernel will preempt and deschedule your thread if there are other ready threads than are starving. But by using a spin-then-block strategy we can allow other ready threads to run without resorting to involuntary time-slicing, which operates on a long-ish time scale. Generally, that makes your application more responsive. In addition, by blocking voluntarily we give the operating system far more latitude regarding power management. Finally, I should note that while we have OS-level facilities like sched_yield() at our disposal, yielding almost never does what you'd want or naively expect.

Returning to WRPAUSE, it's natural to ask how well it works. To help answer that question I wrote a very simple C/pthreads benchmark that launches 8 concurrent threads and binds those threads to processors 0..7. The processors are numbered geographically on the T4, so those threads will all be running on just one core. Unlike the SPARC T2, where logical CPUs 0,1,2 and 3 were assigned to the first pipeline, and CPUs 4,5,6 and 7 were assigned to the 2nd, there's no fixed mapping between CPUs and pipelines in the T4. And in some circumstances when the other 7 logical processors are idling quietly, it's possible for the remaining logical processor to leverage both pipelines -- "pipeline fusion". Some number T of the threads will iterate in a tight loop advancing a simple Marsaglia xor-shift pseudo-random number generator. T is a command-line argument. The main thread loops, reporting the aggregate number of PRNG steps performed collectively by those T threads in the last 10 second measurement interval. The other threads (there are 8-T of these) run in a loop busy-waiting concurrently with the T threads. We vary T between 1 and 8 threads, and report on various busy-waiting idioms. The values in the table are the aggregate number of PRNG steps completed by the set of T threads. The unit is millions of iterations per 10 seconds. For the "PRNG step" busy-waiting mode, the busy-waiting threads execute exactly the same code as the T worker threads. We can easily normalize the scores and compute the average rate of progress for individual worker threads by dividing the aggregate score by the number of worker threads T. I should note that the PRNG steps are extremely cycle-heavy and access almost no memory, so arguably this microbenchmark is not as representative of "normal" code as it could be. And for the purposes of comparison I included a row in the table that reflects a waiting policy where the waiting threads call poll(NULL,0,1000) and block in the kernel. Obviously this isn't busy-waiting, but the data is interesting for reference.

As we can see in the table below, WRPAUSE provides a good way to spin politely. And for short-term waiting it's much more efficient than parking in the kernel and potentially creating software timers for timed OS-level waiting. So we have a new facility that's as polite and effective -- with respect to sibling interference -- as is parking a thread, but that avoids the trip to the kernel and the other overheads associated with context switching. It's worth pointing out that the T=1 and T=2 scores for poll() and WRPAUSE forms are about equal because at T=1 we're leveraging both pipelines. And 3348 units of work is the approximate cycle cap for a core.

Aggregate progress
T = #worker threads
Wait Mechanism for 8-T threadsT=1T=2T=3T=4T=5T=6T=7T=8
Park thread in poll() 32653347334833483348334833483348
no-op 415 831 124316482060249729303349
RD %ccr,%g0 "pause" 14262429269228623013316232553349
PRNG step 412 829 124616702092251029303348
WRPause(8000) 32443361333133483349334833483348
WRPause(4000) 32153308331533223347334833473348
WRPause(1000) 30853199322432513310334833483348
WRPause(500) 29173070315032223270330933483348
WRPause(250) 26942864294930773205338833483348
WRPause(100) 21552469262227902911321433303348

Tuesday Jul 03, 2012

Thread placement policies on NUMA systems - update

In a prior blog entry I noted that Solaris used a "maximum dispersal" placement policy to assign nascent threads to their initial processors. The general idea is that threads should be placed as far away from each other as possible in the resource topology in order to reduce resource contention between concurrently running threads. This policy assumes that resource contention -- pipelines, memory channel contention, destructive interference in the shared caches, etc -- will likely outweigh (a) any potential communication benefits we might achieve by packing our threads more densely onto a subset of the NUMA nodes, and (b) benefits of NUMA affinity between memory allocated by one thread and accessed by other threads. We want our threads spread widely over the system and not packed together. Conceptually, when placing a new thread, the kernel picks the least loaded node NUMA node (the node with lowest aggregate load average), and then the least loaded core on that node, etc. Furthermore, the kernel places threads onto resources -- sockets, cores, pipelines, etc -- without regard to the thread's process membership. That is, initial placement is process-agnostic. Keep reading, though. This description is incorrect.

On Solaris 10 on a SPARC T5440 with 4 x T2+ NUMA nodes, if the system is otherwise unloaded and we launch a process that creates 20 compute-bound concurrent threads, then typically we'll see a perfect balance with 5 threads on each node. We see similar behavior on an 8-node x86 x4800 system, where each node has 8 cores and each core is 2-way hyperthreaded. So far so good; this behavior seems in agreement with the policy I described in the 1st paragraph.

I recently tried the same experiment on a 4-node T4-4 running Solaris 11. Both the T5440 and T4-4 are 4-node systems that expose 256 logical thread contexts. To my surprise, all 20 threads were placed onto just one NUMA node while the other 3 nodes remained completely idle. I checked the usual suspects such as processor sets inadvertently left around by colleagues, processors left offline, and power management policies, but the system was configured normally. I then launched multiple concurrent instances of the process, and, interestingly, all the threads from the 1st process landed on one node, all the threads from the 2nd process landed on another node, and so on. This happened even if I interleaved thread creating between the processes, so I was relatively sure the effect didn't related to thread creation time, but rather that placement was a function of process membership.

I this point I consulted the Solaris sources and talked with folks in the Solaris group. The new Solaris 11 behavior is intentional. The kernel is no longer using a simple maximum dispersal policy, and thread placement is process membership-aware. Now, even if other nodes are completely unloaded, the kernel will still try to pack new threads onto the home lgroup (socket) of the primordial thread until the load average of that node reaches 50%, after which it will pick the next least loaded node as the process's new favorite node for placement. On the T4-4 we have 64 logical thread contexts (strands) per socket (lgroup), so if we launch 48 concurrent threads we will find 32 placed on one node and 16 on some other node. If we launch 64 threads we'll find 32 and 32. That means we can end up with our threads clustered on a small subset of the nodes in a way that's quite different that what we've seen on Solaris 10. So we have a policy that allows process-aware packing but reverts to spreading threads onto other nodes if a node becomes too saturated. It turns out this policy was enabled in Solaris 10, but certain bugs suppressed the mixed packing/spreading behavior.

There are configuration variables in /etc/system that allow us to dial the affinity between nascent threads and their primordial thread up and down: see lgrp_expand_proc_thresh, specifically. In the OpenSolaris source code the key routine is mpo_update_tunables(). This method reads the /etc/system variables and sets up some global variables that will subsequently be used by the dispatcher, which calls lgrp_choose() in lgrp.c to place nascent threads. Lgrp_expand_proc_thresh controls how loaded an lgroup must be before we'll consider homing a process's threads to another lgroup. Tune this value lower to have it spread your process's threads out more.

To recap, the 'new' partial packing placement policy is as follows. Threads from the same process are packed onto a subset of the strands of a socket (50% for T-series). Once that socket reaches the 50% threshold the kernel then picks another preferred socket for that process. Threads from unrelated processes are spread across sockets. More precisely, different processes may have different preferred sockets (lgroups). Beware that I've simplified and elided details for the purposes of explication. The truth is in the code.


  • It's worth noting that initial thread placement is just that. If there's a gross imbalance between the load on different nodes then the kernel will migrate threads to achieve a better and more even distribution over the set of available nodes. Once a thread runs and gains some affinity for a node, however, it becomes "stickier" under the assumption that the thread has residual cache residency on that node, and that memory allocated by that thread resides on that node given the default "first-touch" page-level NUMA allocation policy. Exactly how the various policies interact and which have precedence under what circumstances could the topic of a future blog entry.

  • The scheduler is work-conserving.

  • The x4800 mentioned above is an interesting system. Each of the 8 sockets houses an Intel 7500-series processor. Each processor has 3 coherent QPI links and the system is arranged as a glueless 8-socket twisted ladder "mobius" topology. Nodes are either 1 or 2 hops distant over the QPI links. As an aside the mapping of logical CPUIDs to physical resources is rather interesting on Solaris/x4800. On SPARC/Solaris the CPUID layout is strictly geographic, with the highest order bits identifying the socket, the next lower bits identifying the core within that socket, following by the pipeline (if present) and finally the logical thread context ("strand") on the core. But on Solaris on the x4800 the CPUID layout is as follows. [6:6] identifies the hyperthread on a core; bits [5:3] identify the socket, or package in Intel terminology; bits [2:0] identify the core within a socket. Such low-level details should be of interest only if you're binding threads -- a bad idea, the kernel typically handles placement best -- or if you're writing NUMA-aware code that's aware of the ambient placement and makes decisions accordingly.

  • Solaris introduced the so-called critical-threads mechanism, which is expressed by putting a thread into the FX scheduling class at priority 60. The critical-threads mechanism applies to placement on cores, not on sockets, however. That is, it's an intra-socket policy, not an inter-socket policy.

  • Solaris 11 introduces the Power Aware Dispatcher (PAD) which packs threads instead of spreading them out in an attempt to be able to keep sockets or cores at lower power levels. Maximum dispersal may be good for performance but is anathema to power management. PAD is off by default, but power management polices constitute yet another confounding factor with respect to scheduling and dispatching.

  • The new policy is a compromise between packing and maximum dispersal. If your threads communicate heavily -- one thread reads cache lines last written by some other thread -- then the new dense packing policy may improve performance by reducing traffic on the coherent interconnect. On the other hand if your threads in your process communicate rarely, then it's possible the new packing policy might result on contention on shared computing resources. Unfortunately there's no simple litmus test that says whether packing or spreading is optimal in a given situation. The optimal answer varies by system load, application, number of threads, and platform hardware characteristics. Currently we don't have the necessary tools and sensoria to decide at runtime, so we're reduced to an empirical approach where we run trials and try to decide on a placement policy. The situation is quite frustrating. Relatedly, it's often hard to determine just the right level of concurrency to maximize throughput. (Understanding constructive vs destructive interference in the shared caches would be a good start. We could augment the lines with a small tag field indicating which strand last installed or accessed a line. Given that, we could add new CPU with performance counters that tallied misses where a thread evicts a line it installed and misses where a thread displaces a line installed by some other thread.)

Friday May 25, 2012

HCLH Locks - an additional constraint

HCLH Locks are NUMA-friendly hierarchical queue-based CLH locks. Subsequently, we've discovered that for correctness threads must be bound 1:1. Absent such binding the locks are vulnerable to exclusion failure related to node circulation & lifecycle issues. That particular invariant wasn't made clear in the HCLH paper.

Tuesday Mar 20, 2012

Foxboro says "No Dice"

Thursday Dec 22, 2011

Lock Cohorting - to appear in PPoPP 2012

Lock Cohorting: A General Technique for Designing NUMA Locks by Dave Dice, Virendra Marathe and Nir Shavit in PPoPP 2012. The Cohort lock technique allows developers to construct NUMA-aware locks from NUMA-oblivious locks.

Thursday Nov 03, 2011

Call for papers : Transact 2012

Title: 7th Workshop on Transactional Computing
Deadline: December 1, 2011
Workshop: February 26, 2012
Location: New Orleans, Louisiana, USA (with PPoPP 2012)

TRANSACT 2012 is a forum for the presentation of research on all aspects of transactional computing. The scope of the workshop is intentionally broad, with the goal of encouraging interaction across the languages, architecture, systems, database, and theory communities. Papers may address implementation techniques, foundational results, applications and workloads, or experience with working systems. Environments of interest include the full range from multithreaded or multicore processors to high-end parallel computing.

Tuesday Nov 01, 2011

MONITOR-MWAIT for spin loops

Some ramblings regarding the use of MONITOR-MWAIT for busy-wait loops.

Monday Aug 01, 2011

Make your ambition match your education

Seen in the DC Metro system in ads for a certain institution of higher education. How sad.

Tuesday Jun 21, 2011

Inverted schedctl usage in the JVM

The schedctl facility in Solaris allows a thread to request that the kernel defer involuntary preemption for a brief period. The mechanism is strictly advisory - the kernel can opt to ignore the request. Schedctl is typically used to bracket lock critical sections. That, in turn, can avoid convoying -- threads piling up on a critical section behind a preempted lock-holder -- and other lock-related performance pathologies. If you're interested see the man pages for schedctl_start() and schedctl_stop() and the schedctl.h include file. The implementation is very efficient. schedctl_start(), which asks that preemption be deferred, simply stores into a thread-specific structure -- the schedctl block -- that the kernel maps into user-space. Similarly, schedctl_stop() clears the flag set by schedctl_stop() and then checks a "preemption pending" flag in the block. Normally, this will be false, but if set schedctl_stop() will yield to politely grant the CPU to other threads. Note that you can't abuse this facility for long-term preemption avoidance as the deferral is brief. If your thread exceeds the grace period the kernel will preempt it and transiently degrade its effective scheduling priority. Further reading : US05937187 and various papers by Andy Tucker.

We'll now switch topics to the implementation of the "synchronized" locking construct in the HotSpot JVM. If a lock is contended then on multiprocessor systems we'll spin briefly to try to avoid context switching. Context switching is wasted work and inflicts various cache and TLB penalties on the threads involved. If context switching were "free" then we'd never spin to avoid switching, but that's not the case. We use an adaptive spin-then-park strategy. One potentially undesirable outcome is that we can be preempted while spinning. When our spinning thread is finally rescheduled the lock may or may not be available. If not, we'll spin and then potentially park (block) again, thus suffering a 2nd context switch. Recall that the reason we spin is to avoid context switching. To avoid this scenario I've found it useful to enable schedctl to request deferral while spinning. But while spinning I've arranged for the code to periodically check or poll the "preemption pending" flag. If that's found set we simply abandon our spinning attempt and park immediately. This avoids the double context-switch scenario above. This particular usage of schedctl is inverted in the sense that we cover the spin loop instead of the critical section. (I've experimented with extending the schedctl preemption deferral period over the critical section -- more about that in a subsequent blog entry).

One annoyance is that the schedctl blocks for the threads in a given process are tightly packed on special pages mapped from kernel space into user-land. As such, writes to the schedctl blocks can cause false sharing on other adjacent blocks. Hopefully the kernel folks will make changes to avoid this by padding and aligning the blocks to ensure that one cache line underlies at most one schedctl block at any one time. It's vaguely ironic that a facility designed to improve cooperation between threads suffers from false sharing.

Schedctl also exposes a thread's scheduling state. So if thread T2 holds a lock L, and T1 is contending for L, T1 can check T2's state to see whether it's running (ONPROC in Solaris terminology), ready, or blocked. If T2 is not running then it's usually prudent for T1 to park instead of continuing to spin, as the spin attempt is much more likely to be futile.

Tuesday Jun 14, 2011

MultiLane : a concurrent blocking multiset

MultiLane : a concurrent blocking multiset appeared in SPAA 2011.

It's rather trivial to extend the general idea and construct a deque instead of a queue-like construct.

Partitioned Ticket Lock

Partitioned Ticket Lock appeared in SPAA 2011.

When the number of "lanes" is just 1, a partitioned ticket lock is just a degenerate instance of a classic ticket lock.

We can easily provide K-exclusion by setting the number of lanes to K or larger and initialization the 1st K slots to 0 through K-1. For instance if we have 8 lanes and K=3, the the slots would be initialized to 0,1,2,0,0,0,0,0 respectively.

ACM DL Author-ize serviceBrief announcement: a partitioned ticket lock
David Dice
SPAA '11 Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures, 2011

Friday Jun 10, 2011

Trends in processor count and memory growth

Of late system address space has been growing at a rate of between 1 and 1.5 addressable bits per year. Interestingly, though, we're now adding processors to systems faster than we're adding memory. That is, processors/GB RAM is currently increasing. If that trend holds true and programming models remain the same, then, arguably, synchronization will become even more important as more processors will be coordinating access to a given amount of memory. Proper fine-grained orchestration will become even more critical.

Saturday Apr 16, 2011

Writing Musical History

Two sections of American Popular Music at Bridgewater State University are discussing the music of the last decade. The textbook stops around 1999. They are looking for your help. Imagine that it is 10 years from now and you must update the American popular music textbook to include a chapter on the years 2000-2011. What artists/groups do you feel absolutely must be in that chapter?

The survey takes about a minute to complete.

Thursday Apr 14, 2011

Flat-Combining NUMA Locks

Flat-Combining NUMA Locks to appear in SPAA 2011.

Tuesday Apr 05, 2011

Cache index-aware memory allocation

Cache Index-Aware Memory Allocation to appear in
International Symposium on Memory Managment (ISMM) 2011. As a consequence of the design, this "malloc" allocator is also NUMA-aware and NUMA-friendly assuming the system uses the usual default first-touch NUMA page placement policy.

ACM DL Author-ize serviceCache index-aware memory allocation
Yehuda Afek, Dave Dice, Adam Morrison
ISMM '11 Proceedings of the international symposium on Memory management, 2011

Tuesday Mar 01, 2011

crazy concurrency

Crazy Concurrency -- high-end nerdcore.

Tuesday Feb 15, 2011

atomic fetch-and-add vs compare-and-swap

There are a number of cases where an atomic fetch-and-add instruction might yield better performance than the classic Load;Φ;CAS loop, where CAS is the atomic compare-and-swap instruction. The x86 architecture exposes LOCK:XADD which I'll use in the discussion below. (If you don't need the original value you can also use LOCK:INC or LOCK:ADD instead of LOCK:XADD).

  1. CAS is "optimistic" and admits failure, whereas XADD does not. With XADD there's no explicit window of vulnerability to remote interference, and thus no need for a retry loop. Arguably, XADD has better progress properties, assuming the underlying XADD implementation doesn't have an implicit loop, but even in that case the window would be narrower than with Load;Φ;CAS.

  2. If you use the typical Load;Φ;CAS idiom, assuming normal snoop-based cache coherence, the Load may cause a read-to-share bus transaction to get the underlying cache line into S or E state. The CAS, which effectively has store semantics with respect to the cache coherence protocol, may induce another bus transaction to upgrade the line to M state. So in the worst case the idiom might incur two bus transactions, but an XADD implementation will usually drive the line directly into M state, requiring just one bus transaction. Of course you might be able to speculate on the value and have fast-path that tries a "naked" CAS without any prior load. (I use this frequently in native HotSpot code). Also, it's possible for sophisticated processor implementations to perform coherence speculation and aggressively probe the target line into M state. Finally, in some cases you can profitably insert a prefetch-for-write (PREFETCHW) instruction prior to the load to avoid the upgrade transaction. But that approach has to be applied carefully as in some cases it can do more harm than good. Given all that, XADD, where available, has an advantage.

  3. Lets say you were trying to increment a variable with the usual Load;INC;CAS loop. When the CAS starts failing with sufficient frequency you can find that the branch to exit the loop (normally taken under no or light contention) starts to predict toward the failure path where we stay in the loop. So when the CAS ultimately succeeds, you'll incur a branch mispredict when trying to exit the loop. This can be quite painful on processors with deep pipelines and lots of out-of-order speculative machinery. Typically, this is in a piece of code where you don't want a long stall. Relatedly, when the CAS starts to fail frequently, the branch begins to predict that control stays in the loop and in turn the loop runs faster and we cycle around more quickly to the CAS. Typically, we'd want some back-off in the loop. Under light load with infrequent failures the mispredict to retry the loop served as a potentially useful implicit back-off. But under higher load we lose the benefit of implicit back-off arising from the mispredict. There's no loop and no such issues with XADD

Even though CAS has a higher (infinite) consensus number, there are times when fetch-and-add -- which has a consensus number of just 2 -- is preferable.

Update: see CR7023898

Monday Feb 14, 2011

False sharing induced by card table marking

Garbage-collected runtime environments frequently use card tables in conjunction with write barriers to accelerate root-scanning during garage collection. (See A Fast write barrier for generational garbage collectors by Urs Holzle, ECOOP-OOPSLA 1993 for details). Briefly, and skipping a few details, this design partitions the heap into an array of power-of-two sized card pages. When a mutator stores into a reference field of an object the runtime write barrier code will mark the card table entry for the card page containing (covering) that field as dirty. In the HotSpot JVM the card page size is 512 bytes and the card table is implemented as a simple array of bytes. That is, a given card table entry, which represents the state of a card page, is just one byte. The write barrier is emitted directly by the JIT and is usually just a shift and store instruction. In a subsequent non-moving minor GC, the collector can avoid scanning reference fields in a card that is not dirty.

This design is well-proven widely employed but unfortunately it can result in performance issues in highly concurrent environments. Lets say our cache line size is 64 bytes, which is fairly common in modern processors. This means that 64 cards (32KB = 64\*512) will share a cache line in the card table. So reference stores by different threads that just happen to fall within the same 32KB region cause writes to the same cache line underlying the card table. This can result in excessive write invalidation and cache coherence traffic, which can reduce performance and impede scaling. Interestingly, the impact can worsen after a full/moving GC as threads tend to allocate into different address ranges by virtue of thread-local allocation buffers (TLABs), but after a full collection the remaining objects tend to be more tightly packed and thus more prone to the problem. Furthermore, most card table stores are redundant, as often the card is already marked dirty. This suggests a simple solution: instead of using an unconditional store in the barrier, we first check the card table entry and only store if it is clean. This slightly increases the barrier path-length and adds a conditional branch -- unless we were to be somewhat clever with conditional moves by annulling a redundant store by changing the destination address to be a thread-local dummy variable. On the other hand it avoids the problem. (For historical background, some years ago Doug Lea noticed an odd slow-down after a full GC in some concurrent Java code. He contacted me and I speculated that false sharing in the card table could be the issue. We conjured up a JVM with an experimental -XX:+UseCondCardMark flag that let us emit write barriers as either the usual unconditional store, or a conditional form that avoids redundant stores. The conditional form provided relief).

I ran into the problem recently when experimenting with some concurrent queue implementations, which are reference-heavy, on a Sun 256-way Sun/Oracle T5440. This is 4-socket system where each socket contains a Niagara T2+ UltraSPARC processor having 64 logical processors. My benchmark has 50 producer threads and 50 consumer threads and measures and reports the message throughput over a timing interval. In the default configuration we can pass about 8.8K messages per msec. Using the same JVM, when I add the -XX:+UseCondCardMark flag we can achieve 52.5K messages per msec, clearly demonstrating the magnitude of the effect.

I should also note that we ran into this same issue when experimenting with Java-level lock elision using hardware transactional memory. If two unrelated concurrent transactions happened to store into reference fields in the same 32KB region we'd have false aborts because of write-sharing on the card table cache line. Again, -XX:+UseCondCardMark provided relief.

Update: see CR7029167

Thursday Jan 20, 2011

MultiLane : a concurrent blocking multiset

MultiLane : a concurrent blocking multiset - under submission.

It's rather trivial to extend the general idea and construct a deque instead of a queue-like construct.

Friday Jan 14, 2011

Partitioned Ticket Lock

The Partitioned Ticket Lock - under submission.

Friday Oct 15, 2010

Submissions are being accepted for VEE 2010

Submissions are now being accepted for VEE 2010 (more formally, the 2011 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments). VEE is co-located with ASPLOS this year.

Tuesday Sep 07, 2010

Cache index-aware memory allocation

Cache index-Aware memory allocation(under submission) by Dave Dice, Yehuda Afek and Adam Morrison. We describe a relatively simple set of changes to malloc allocators that may reduce the conflict miss rate of applications accessing blocks returned by malloc.

Wednesday Jul 21, 2010

Solaris scheduling : SPARC and CPUIDs

Since it's a commonly asked question and source of confusion I thought I'd write up the following.

First, I should introduce some terminology and state the mapping between solaris logical CPUIDs and physical resources. On a Solaris/SPARC Niagara-2 system the logical CPUIDs map geographically to physical IDs and resources. You can interpret a logical CPUID as follows: (DieNumber: D; CoreNumber:3 ; ThreadGroup:1 ; Strand:2). That is, you have 8 cores per die, 2 thread groups per core, and 4 strands per thread group. All the logical processors on a given core share an level-1 cache. The "ThreadGroup" is a rather obscure name for a pipeline. On a Niagara-1, for instance, there is only 1 pipeline per core, but you have 2 per core on an N2. You can query the CPUID on which a thread is currently running with getcpuid(), which is extremely fast.

Assuming a simple model where all your threads remain runnable and there's no preemption ...

With unbound threads, the solaris scheduler will balance (disperse fairly) 1st over dies, then over cores, then over pipelines, in order to maximize true parallelism and, to the extent possible, avoiding contention over shared resources by placing threads as 'far away' from each other as possible. This is usually the best policy -- particularly for completely independent threads -- but beware that it ignores the issue of inter-chip coherency costs. If you have high coherence costs (writes to shared variables) then packing a group of communicating threads on-chip can sometimes be better than letting them disperse over multiple chips. (As an aside, you're typically much better off letting the scheduler assign CPUs than by trying to bind yourself. Naive binding -- say, with sequential CPUIDs -- will almost always result in suboptimal performance).

In this future these polices might shift to allow better power management by trying to keep dies or cores "parked" (idle, drawing less power). Digressing slightly, recent linux schedulers _do try to impose some type of power management by default, making it sometimes hard to squeeze maximal performance out of a MP Nehalem/AMD system and introducing confounding factors for those of us benchmarking.

If the threads are entirely CPU-bound then typically the scheduler will place them and the thread:cpu relationship then becomes fairly stable. Even if the thread blocks briefly, if it comes back ONPROC (running) in under 3 msecs it's considered to have residual affinity and will go back to the processor where it last ran, barring gross dispatch queue length imbalance. Check the OpenSolaris kernel sources for "rechoose_interval" if you're curious.

For background, each core has its own local dispatch queue and makes its own local scheduling decisions. There's no centralized global scheduler agent and no centralized scheduling data structures (with the exception of a queue for unbound real-time threads). The scheduling policies and parameters are designed such that the local & independent decisions collectively result in achieving the desired global scheduling policy. (If you're biologically inspired, think of a communal insect colony). At a high level the scheduler attempts to maximize aggregate useful work completed in unit-time. Specifically, the scheduler tries to maximize the # of cpus doing useful work as well as attempting, where possible, to maintain affinity (minimize migration). With CMT/CMP/HT systems it also tries to disperse work (LWPs) over cores. As I mentioned above, over time more constraints are being added to the scheduler's responsibilities, such as trying to minimize or reduce power consumption, striking a balance between performance and energy.

The system disperses threads via stealing (pulling, where idle CPU steal from other dispatch queues) and queue balancing (pushing, where if there's a gross imbalance in dispatch queue depth, a local scheduler will try to pass some blocked threads to other less-loaded dispatch queues).

As an aside, threads in the real-time scheduling class cause more centralized dispatch. If they're unbound they can induce lots of preemption and migration, but that's a topic for another day.

Tuesday Jun 01, 2010

Europar 2010 : Transactional Mutex Locks

Transactional Mutex Locks by Luke Dalessandro, Dave Dice, Michael Scott, Nir Shavit and Michael Spear will appear in Europar 2010.

Monday Apr 05, 2010

TLRW: Return of the Read-Write Lock

TLRW: Return of the Read-Write Lock by Dave Dice and Nir Shavit will appear in SPAA 2010. (The paper introduces the concept of the TLRW-ByteLock).


Dave is a senior research scientist in the Scalable Synchronization Research Group within Oracle Labs : Google Scholar.


« February 2016