Solaris scheduling : SPARC and CPUIDs
By Dave on Jul 21, 2010
Since it's a commonly asked question and source of confusion I thought I'd write up the following.
First, I should introduce some terminology and state the mapping between Solaris logical CPUIDs and physical resources. On a Solaris/SPARC Niagara-2 system the logical CPUIDs map geographically to physical IDs and resources. You can interpret a logical CPUID as a bit field: (DieNumber: D; CoreNumber: 3; ThreadGroup: 1; Strand: 2), where each number is the width of that field in bits and the die number takes the remaining D high-order bits. That is, you have 8 cores per die, 2 thread groups per core, and 4 strands per thread group. All the logical processors on a given core share a level-1 cache. The "ThreadGroup" is a rather obscure name for a pipeline. On a Niagara-1, for instance, there is only 1 pipeline per core, but you have 2 per core on an N2. You can query the CPUID on which a thread is currently running with getcpuid(), which is extremely fast.
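To make the mapping concrete, here is a minimal Python sketch that splits a logical CPUID into its fields. It assumes the fields are packed in the order listed above, with the strand in the low-order bits and the die number in whatever bits remain at the top. The names decode_cpuid and CpuLocation are mine, for illustration only, and are not part of any Solaris API.

```python
from typing import NamedTuple

# Field widths from the text: 3 bits of core, 1 bit of thread group, 2 bits of strand.
CORE_BITS, GROUP_BITS, STRAND_BITS = 3, 1, 2


class CpuLocation(NamedTuple):
    die: int
    core: int
    thread_group: int  # i.e. the pipeline within the core
    strand: int


def decode_cpuid(cpuid: int) -> CpuLocation:
    """Split a logical CPUID into (die, core, thread_group, strand).

    Assumes strand occupies the lowest bits and die the highest.
    """
    strand = cpuid & ((1 << STRAND_BITS) - 1)
    thread_group = (cpuid >> STRAND_BITS) & ((1 << GROUP_BITS) - 1)
    core = (cpuid >> (STRAND_BITS + GROUP_BITS)) & ((1 << CORE_BITS) - 1)
    die = cpuid >> (STRAND_BITS + GROUP_BITS + CORE_BITS)
    return CpuLocation(die, core, thread_group, strand)


if __name__ == "__main__":
    for cpuid in (0, 7, 8, 64, 77):
        print(cpuid, decode_cpuid(cpuid))
```

Under that assumption, CPUIDs 0 through 7 all land on core 0 of die 0, which is why sequentially numbered CPUs are close neighbours rather than independent resources.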
Assuming a simple model where all your threads remain runnable and there's no preemption ...
With unbound threads, the Solaris scheduler will balance (disperse fairly) first over dies, then over cores, then over pipelines, in order to maximize true parallelism and, to the extent possible, avoid contention over shared resources by placing threads as 'far away' from each other as possible. This is usually the best policy -- particularly for completely independent threads -- but beware that it ignores the issue of inter-chip coherency costs. If you have high coherence costs (writes to shared variables) then packing a group of communicating threads on-chip can sometimes be better than letting them disperse over multiple chips. (As an aside, you're typically much better off letting the scheduler assign CPUs than trying to bind threads yourself. Naive binding -- say, with sequential CPUIDs -- will almost always result in suboptimal performance).
In the future these policies might shift to allow better power management by trying to keep dies or cores "parked" (idle, drawing less power). Digressing slightly, recent Linux schedulers do try to impose some type of power management by default, sometimes making it hard to squeeze maximal performance out of an MP Nehalem/AMD system and introducing confounding factors for those of us benchmarking.
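The difference between dispersal and naive sequential binding is easy to see with a small Python sketch. This only models the outcome of the policy described above (dies first, then cores, then pipelines) on the 8 x 2 x 4 Niagara-2 layout. It is not how the kernel computes placement, and the helper names are hypothetical.

```python
CORES_PER_DIE = 8
GROUPS_PER_CORE = 2
STRANDS_PER_GROUP = 4
CPUS_PER_DIE = CORES_PER_DIE * GROUPS_PER_CORE * STRANDS_PER_GROUP  # 64


def fields(cpuid):
    """Split a CPUID into (die, core, thread_group, strand)."""
    die, rest = divmod(cpuid, CPUS_PER_DIE)
    core, rest = divmod(rest, GROUPS_PER_CORE * STRANDS_PER_GROUP)
    thread_group, strand = divmod(rest, STRANDS_PER_GROUP)
    return die, core, thread_group, strand


def dispersed_order(n_dies):
    """CPUIDs ordered so the die varies fastest, then core, then pipeline."""
    def key(cpuid):
        die, core, group, strand = fields(cpuid)
        return (strand, group, core, die)
    return sorted(range(n_dies * CPUS_PER_DIE), key=key)


def pipelines_used(cpuids):
    """Count the distinct (die, core, thread_group) pipelines touched."""
    return len({fields(c)[:3] for c in cpuids})


if __name__ == "__main__":
    sequential = list(range(4))          # naive binding to CPUIDs 0..3
    dispersed = dispersed_order(2)[:4]   # first four slots on a 2-die box
    print("sequential:", sequential, "pipelines:", pipelines_used(sequential))
    print("dispersed: ", dispersed, "pipelines:", pipelines_used(dispersed))
```

Four threads bound to CPUIDs 0 through 3 all share a single pipeline in this model, whereas the dispersed order gives each thread a pipeline of its own.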
If the threads are entirely CPU-bound then typically the scheduler will place them and the thread:cpu relationship then becomes fairly stable. Even if the thread blocks briefly, if it comes back ONPROC (running) in under 3 msecs it's considered to have residual affinity and will go back to the processor where it last ran, barring gross dispatch queue length imbalance. Check the OpenSolaris kernel sources for "rechoose_interval" if you're curious.
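As a rough illustration of that rule, here is a Python sketch of the decision made when a thread becomes runnable again. The 3 msec threshold is the one quoted above. The imbalance_limit parameter and the queue-length test are placeholders I made up to stand in for "gross imbalance", so treat this as a model of the idea rather than the kernel's actual logic.

```python
RECHOOSE_INTERVAL_MS = 3.0  # threshold quoted in the text


def pick_cpu(last_cpu, off_cpu_ms, queue_len, imbalance_limit=2):
    """Choose a CPU for a thread that has just become runnable.

    queue_len maps CPUID -> current dispatch queue length.
    imbalance_limit is a hypothetical stand-in for "gross imbalance".
    """
    if off_cpu_ms < RECHOOSE_INTERVAL_MS:
        shortest = min(queue_len.values())
        # Residual affinity: go back home unless home is badly overloaded.
        if queue_len[last_cpu] - shortest <= imbalance_limit:
            return last_cpu
    # Affinity has expired (or home is swamped): take the shortest queue.
    return min(queue_len, key=queue_len.get)


if __name__ == "__main__":
    print(pick_cpu(5, 1.0, {5: 1, 9: 0}))   # brief block, balanced queues
    print(pick_cpu(5, 10.0, {5: 1, 9: 0}))  # blocked too long
    print(pick_cpu(5, 1.0, {5: 6, 9: 0}))   # brief block, home overloaded
```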
For background, each core has its own local dispatch queue and makes its own local scheduling decisions. There's no centralized global scheduler agent and no centralized scheduling data structures (with the exception of a queue for unbound real-time threads). The scheduling policies and parameters are designed such that the local & independent decisions collectively result in achieving the desired global scheduling policy. (If you're biologically inspired, think of a communal insect colony). At a high level the scheduler attempts to maximize aggregate useful work completed in unit-time. Specifically, the scheduler tries to maximize the # of cpus doing useful work as well as attempting, where possible, to maintain affinity (minimize migration). With CMT/CMP/HT systems it also tries to disperse work (LWPs) over cores. As I mentioned above, over time more constraints are being added to the scheduler's responsibilities, such as trying to minimize or reduce power consumption, striking a balance between performance and energy.
The system disperses threads via stealing (pulling, where idle CPUs steal from other dispatch queues) and queue balancing (pushing, where if there's a gross imbalance in dispatch queue depth, a local scheduler will try to pass some blocked threads to other less-loaded dispatch queues).
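The two mechanisms can be sketched in Python as a toy model of per-CPU queues. The function names, the gross_imbalance threshold, and the choice of which thread to move are all assumptions made for illustration. The real dispatcher also takes priorities and topology into account.

```python
from collections import deque


def steal(queues, idle_cpu):
    """Pull: an idle CPU takes one thread from the longest other queue."""
    others = [c for c in queues if c != idle_cpu]
    if not others:
        return False
    victim = max(others, key=lambda c: len(queues[c]))
    if not queues[victim]:
        return False  # nothing to steal anywhere
    queues[idle_cpu].append(queues[victim].pop())
    return True


def push(queues, busy_cpu, gross_imbalance=2):
    """Push: move threads off busy_cpu while its queue is grossly deeper
    than the shortest queue. Returns the number of threads moved."""
    target = min(queues, key=lambda c: len(queues[c]))
    moved = 0
    while len(queues[busy_cpu]) - len(queues[target]) > gross_imbalance:
        queues[target].append(queues[busy_cpu].pop())
        moved += 1
    return moved


if __name__ == "__main__":
    queues = {0: deque(["a", "b", "c"]), 1: deque()}
    steal(queues, idle_cpu=1)
    print({cpu: list(q) for cpu, q in queues.items()})
```

Both functions use only what a single CPU can see and decide locally, which is the point: there is no central agent in this model either.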
As an aside, threads in the real-time scheduling class cause more centralized dispatch. If they're unbound they can induce lots of preemption and migration, but that's a topic for another day.