A scheduling and dispatching oddity on Linux

While benchmarking a concurrent application on Linux I ran into an odd problem worth relating. Specifically, I'm using Ubuntu 9.10 with Linux kernel 2.6.31-1 running on a 1x4x2 Core i7-920 "Nehalem" (1 package; 4 cores/package; 2 logical processors/core via hyperthreading). I'd noticed that our scaling numbers were a bit odd, with more than the usual fall-off past 4 threads (it's a 1x4x2 system, so we expect some fade past 4 threads even for ideally scalable benchmarks) and more variance than I expected. The benchmark harness runs for a fixed time and reports aggregate progress over the measurement interval, and of course the system is otherwise idle. As a sanity check the benchmark also reports the total user CPU time consumed over the measurement interval. Interestingly, the CPU time values were unexpectedly low, considerably less than min(#threads, #cpus) * MeasurementInterval. All the threads should stay running, or at least ready, for the duration of the interval. Note too that the effect is independent of interval length and still occurs with long intervals, so it's not a simple matter of allowing the scheduler time to steal and rebalance or otherwise converge on a state where the runnable threads are well dispersed over the processors.

It appeared that the scheduler wasn't aggressively dispatching ready threads onto idle CPUs. Put another way, there were prolonged periods where we had both idle CPUs and ready threads at the same time -- the kernel was failing to saturate the available processors.

To avoid the problem I initially tried binding threads to processors via sched_setaffinity, which provided complete relief. Still, I'm cautious about binding because it requires knowledge of the platform topology. On SPARC/CMT/Solaris, for instance, logical CPUIDs map to physical resources in the following manner: the bits in the logical CPUID select, from most significant to least significant, chip, then core, then pipeline ("thread group" in Sun terminology), then strand. So if you bind threads in "natural" order (thread N to CPUID N) you'll end up with many threads sharing some cores while other cores sit completely idle, which is likely undesirable and may yield skewed scaling results. This, btw, is a common benchmarking pitfall. On Solaris/SPARC you're better off letting the kernel disperse threads onto processors, as it balances first over chips, then cores, then pipelines, which is optimal for independent threads to make headway. (That policy is clearly not optimal if there's sharing -- particularly if there are writers -- in which case you might win by packing the threads less "distant" from each other, for some interconnect distance metric, assuming the increased cache pressure and replication doesn't do too much harm).

Unlike SPARC/Solaris, the logical operating-system-level CPUID to physical resource mappings on my Ubuntu/x64 system are well-dispersed under natural CPU assignment, although there's no hard guarantee of that property; I vaguely recall that Intel advises a certain canonical mapping. In more detail, the logical CPUID to physical mapping on my system -- as discovered by iterating over the logical CPUID values and using the CPUID instruction to query physical resources -- is: 0 to C0S0, 1 to C1S0, 2 to C2S0, 3 to C3S0, 4 to C0S1, 5 to C1S1, 6 to C2S1, 7 to C3S1, where C is the core# and S is the relative strand# on the core.

I'm guessing that the Linux kernel I'm using institutes policies that attempt to balance power with performance, whereas Solaris currently optimizes for performance. After further poking through the Linux kernel sources I realized we could adjust the scheduling policy more to our liking via tunables exposed through the /proc file system. At that point I came upon the tune-sched-domains script, which makes it easy to quickly adjust scheduler tunables. (Note that the script assumes bash). First, run tune-sched-domains with no arguments and examine the SD_WAKE_AFFINITY and SD_WAKE_IDLE settings. We want SD_WAKE_AFFINITY clear and SD_WAKE_IDLE set. (If I'm interpreting the comments in the kernel code correctly, WAKE_AFFINITY appears to try to place the wakee on the same CPU as the waker, presuming they communicate through memory that's already in the local cache, while WAKE_IDLE instructs the kernel to aggressively wake idle CPUs when making threads ready). If necessary, compute a new SD_ mask value and run the script again, passing the value (in decimal) as an argument. These settings provided relief from the under-utilization problem.

In addition I noticed that the HotSpot JVM performed much better on multi-threaded workloads under the settings mentioned above.

While I didn't have time for the experiments, adjusting the LB_BIAS flag might also provide relief.

Comments:

The Linux scheduler also attempts to avoid waking up CPUs unnecessarily for short-run threads, which is good since short-run threads are typical behaviour.
Make sure your benchmark is actually running the threads for a reasonable length of time.

Posted by Noel Grandin on January 26, 2010 at 08:44 PM EST #

Noel, The effect was evident even after 60 seconds. Perhaps that wasn't long enough though. Do you have a sense for how long it would take the scheduler to react?

Regards, -Dave

Posted by David Dice on January 27, 2010 at 12:48 AM EST #

I would expect that from Windows on notebook, Moblin and Ubuntu Netbook Remix. But not from plain Linux. Yikes!

Posted by Dmitry V'jukov on January 27, 2010 at 03:36 AM EST #

Dave,

Were the threads blocking into the kernel for any reason?

I could imagine that a standard spin-then-block mutex could cause some thread blocking. Once combined with the above comment that Linux tries to keep CPUs sleeping as long as it can, might that partly explain the odd results?

- Milo

Posted by Milo Martin on January 27, 2010 at 10:34 AM EST #

Hi Milo, I eventually reduced the test case to a degenerate form where the threads made no calls (no possibility of taking locks for PLT resolution -- dynamic linking) and only used memory accesses to communicate their progress. Relatedly, if I was reading /proc correctly we always had some threads (the set varied, likely because of preemption) that were ready but not running. And as you'd expect all the threads got service -- ostensibly via preemption -- but they were being dispatched onto only a subset of the processors. Also, if there were blocking (say on something implicit like a demand-zero-fill page in the kernel) then I wouldn't expect binding or adjusting the SD_ flags to provide relief.

Regards, -Dave


Posted by Dave Dice on January 28, 2010 at 01:11 AM EST #


