While benchmarking a concurrent application on Linux I ran into an odd problem worth relating. Specifically, I'm using ubuntu 9.10 with linux kernel 2.6.31-1 running on a 1x4x2 Core2 i7-920 "Nehalem" (1 package; 4 cores/package; 2 logical processors/core via hyperthreading). I'd noticed that our scaling numbers were a bit odd, with more than the usual fall off past 4 threads (it's a 1x4x2 system so we expect some fade past 4 threads, even for ideally scalable benchmarks) and more variance than I expected. The benchmark harness runs for fixed time and reports aggregate progress over the measurement interval, and of course the system is otherwise idle. As a sanity check the benchmark also reports the total user CPU time consumed over the measurement interval. Interestingly, the CPU times values were unexpectedly low, considerably less than min(#threads,#cpus) \* MeasurementInterval. All the threads should stay running, or least ready, for the duration of the interval. Note too that the effect is independent of and still occurs with long intervals, so it's not a simple issue of allowing the scheduler time to steal and rebalance or otherwise converge on a state where the runnable threads are well-dispersed over the processors.
It appeared that the scheduler wasn't aggressively dispatching ready threads onto idle CPUs. Put another way there were prolonged periods where we had both idle CPUs and ready threads at the same time -- the kernel was failing to saturate the available processors.
To avoid the problem I initially tried binding threads to processors via sched_setaffinity, which provided complete relief. Still, I'm cautious about binding because it requires knowledge of the platform topology. On SPARC/CMT/Solaris, for instance, logical CPUIDs map to physical resources geographically in the following manner: the bits in the logical CPUID select, from most significant bits to least significant, chip then core then pipeline ("thread group" in Sun terminology) then strand. So if you just bind threads by "natural" order (thread N to CPUID N) then you'll end up with many threads sharing some cores and other cores completely idle, which is likely undesirable and may yield skewed scaling results. This, btw, is a common benchmarking pitfall. On Solaris/SPARC you're better off letting the kernel disperse threads onto processors as it'll balance 1st over chips, then cores, then pipelines, which is optimal for independent threads to make headway. (That policy is clearly not optimal if there's sharing -- particular if there are writers -- in which case you might win by packing the threads less "distant" from each other, for some interconnect distance metric, and assuming the increased cache pressure and replication doesn't do too much harm). Unlike SPARC/Solaris, the logical operating system-level CPUID to physical resource mappings on my ubuntu/x64 system are well-dispersed if you use natural CPU assignment, but there's no hard guarantee of that property although I vaguely recall that Intel advises a certain canonical mapping. In more detail, the logical CPUID to physical mapping on my system -- as discovered by iterating over the logical CPUID values and using the CPUID instruction to query physical resources -- is : 0 to C0S0, 1 to C1S0, 2 to C2S0, 3 to C3S0, 4 to C0S1, 5 to C1S1, 6 to C2S1, 7 to C3S1, where C is the core# and S is the relative strand# on the core.
I'm guessing that the linux kernel I'm using institutes polices that attempt to balance power with performance whereas Solaris currently optimizes for performance. After further poking through the Linux kernel sources I realized we could adjust the scheduling policy more to our liking via tunables exposed via the /proc file system. At that point I came upon the tune-sched-domains script that makes it easy to quickly adjust scheduler tunables. (Note that the script assumes bash). First, run tune-sched-domains with no arguments and examine the SD_WAKE_AFFINITY and SD_WAKE_IDLE settings. We want SD_WAKE_AFFINITY clear and SD_WAKE_IDLE set. (If I'm interpreting the comments in the kernel code correctly, WAKE_AFFINITY appears to try to place the wakee on the same CPU as the waker, presuming they communicate through memory that's already in the local cache, while WAKE_IDLE instructs the kernel to aggressively wake idle CPUs when making threads ready). If necessary, compute a new SD_ mask value and run the script again, passing the value (in decimal) as an argument to the script. These settings provided relief for the under-utilization problem.
In addition I noticed that the HotSpot JVM performed much better on multi-threaded workloads under the settings mentioned above.
While I didn't have time for the experiments, it may be the case that adjusting the LB_BIAS flag may also provide relief.