Thread placement policies on NUMA systems - update

In a prior blog entry I noted that Solaris used a "maximum dispersal" placement policy to assign nascent threads to their initial processors. The general idea is that threads should be placed as far away from each other as possible in the resource topology in order to reduce resource contention between concurrently running threads. This policy assumes that resource contention -- for pipelines, memory channels, destructive interference in the shared caches, etc -- will likely outweigh (a) any potential communication benefits we might achieve by packing our threads more densely onto a subset of the NUMA nodes, and (b) any benefits of NUMA affinity between memory allocated by one thread and accessed by other threads. We want our threads spread widely over the system and not packed together. Conceptually, when placing a new thread, the kernel picks the least loaded NUMA node (the node with the lowest aggregate load average), and then the least loaded core on that node, etc. Furthermore, the kernel places threads onto resources -- sockets, cores, pipelines, etc -- without regard to the thread's process membership. That is, initial placement is process-agnostic. Keep reading, though. This description is incorrect.

On Solaris 10 on a SPARC T5440 with 4 x T2+ NUMA nodes, if the system is otherwise unloaded and we launch a process that creates 20 compute-bound concurrent threads, then typically we'll see a perfect balance with 5 threads on each node. We see similar behavior on an 8-node x86 x4800 system, where each node has 8 cores and each core is 2-way hyperthreaded. So far so good; this behavior seems in agreement with the policy I described in the 1st paragraph.

I recently tried the same experiment on a 4-node T4-4 running Solaris 11. Both the T5440 and T4-4 are 4-node systems that expose 256 logical thread contexts. To my surprise, all 20 threads were placed onto just one NUMA node while the other 3 nodes remained completely idle. I checked the usual suspects such as processor sets inadvertently left around by colleagues, processors left offline, and power management policies, but the system was configured normally. I then launched multiple concurrent instances of the process, and, interestingly, all the threads from the 1st process landed on one node, all the threads from the 2nd process landed on another node, and so on. This happened even if I interleaved thread creation between the processes, so I was relatively sure the effect wasn't related to thread creation time, but rather that placement was a function of process membership.
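As an aside, a quick way to see where a process's threads have landed is with lgrpinfo(1), plgrp(1) and prstat(1M). This is just a sketch -- "spinner" is a hypothetical name for the test process, and output formats vary by Solaris release:

    # Sketch only: "spinner" is a hypothetical name for the 20-thread test process.
    lgrpinfo                        # enumerate the lgroups (NUMA nodes) and their CPUs
    plgrp $(pgrep spinner)          # report each LWP's home lgroup
    prstat -mLp $(pgrep spinner)    # per-LWP microstate accounting (utilization per thread)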

At this point I consulted the Solaris sources and talked with folks in the Solaris group. The new Solaris 11 behavior is intentional. The kernel is no longer using a simple maximum dispersal policy, and thread placement is process membership-aware. Now, even if other nodes are completely unloaded, the kernel will still try to pack new threads onto the home lgroup (socket) of the primordial thread until the load average of that node reaches 50%, after which it will pick the next least loaded node as the process's new favorite node for placement. On the T4-4 we have 64 logical thread contexts (strands) per socket (lgroup), so if we launch 48 concurrent threads we will find 32 placed on one node and 16 on some other node. If we launch 64 threads we'll find 32 and 32. That means we can end up with our threads clustered on a small subset of the nodes in a way that's quite different from what we saw on Solaris 10. So we have a policy that allows process-aware packing but reverts to spreading threads onto other nodes if a node becomes too saturated. It turns out this policy was enabled in Solaris 10, but certain bugs suppressed the mixed packing/spreading behavior.
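To make that arithmetic concrete, here's a toy model of the packing decision. This is emphatically not the kernel's lgrp_choose() logic -- just the 50%-per-socket bookkeeping described above, with the T4-4's 64 strands per socket hard-wired as an assumption:

    #include <stdio.h>

    /*
     * Toy model of the partial packing policy: fill the process's preferred
     * socket to 50% of its strands, then move on to the next socket.
     * Not kernel code -- just the arithmetic from the paragraph above.
     */
    int
    main(void)
    {
        const int strands_per_socket = 64;              /* T4-4 */
        const int threshold = strands_per_socket / 2;   /* the 50% trip point */
        const int nthreads = 48;
        int socket = 0, on_socket = 0;

        for (int t = 0; t < nthreads; t++) {
            if (on_socket == threshold) {               /* preferred socket "full" */
                printf("socket %d: %d threads\n", socket, on_socket);
                socket++;
                on_socket = 0;
            }
            on_socket++;
        }
        printf("socket %d: %d threads\n", socket, on_socket);
        return 0;   /* prints 32 on socket 0 and 16 on socket 1 for 48 threads */
    }

Change nthreads to 64 and you get the 32-and-32 split mentioned above.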

There are configuration variables in /etc/system that allow us to dial the affinity between nascent threads and their primordial thread up and down: see lgrp_expand_proc_thresh, specifically. In the OpenSolaris source code the key routine is mpo_update_tunables(). This routine reads the /etc/system variables and sets up some global variables that will subsequently be used by the dispatcher, which calls lgrp_choose() in lgrp.c to place nascent threads. The lgrp_expand_proc_thresh variable controls how loaded an lgroup must be before we'll consider homing a process's threads to another lgroup. Lower the value to spread your process's threads out sooner.
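For illustration only, the tunable is set with the usual /etc/system syntax. The value below is a placeholder rather than a recommendation -- the units are the kernel's internal load representation, so read mpo_update_tunables() for your release before adopting any particular number:

    * /etc/system fragment -- sketch only; the value is a placeholder, not advice.
    * Lower values cause a process's threads to spill to other lgroups sooner.
    set lgrp_expand_proc_thresh = 0x1000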

To recap, the 'new' partial packing placement policy is as follows. Threads from the same process are packed onto a subset of the strands of a socket (50% for T-series). Once that socket reaches the 50% threshold the kernel then picks another preferred socket for that process. Threads from unrelated processes are spread across sockets. More precisely, different processes may have different preferred sockets (lgroups). Beware that I've simplified and elided details for the purposes of explication. The truth is in the code.

Remarks:


  • It's worth noting that initial thread placement is just that. If there's a gross imbalance between the load on different nodes then the kernel will migrate threads to achieve a better and more even distribution over the set of available nodes. Once a thread runs and gains some affinity for a node, however, it becomes "stickier" under the assumption that the thread has residual cache residency on that node, and that memory allocated by that thread resides on that node given the default "first-touch" page-level NUMA allocation policy. Exactly how the various policies interact and which have precedence under what circumstances could be the topic of a future blog entry.

  • The scheduler is work-conserving.

  • The x4800 mentioned above is an interesting system. Each of the 8 sockets houses an Intel 7500-series processor. Each processor has 3 coherent QPI links and the system is arranged as a glueless 8-socket twisted ladder "mobius" topology. Nodes are either 1 or 2 hops distant over the QPI links. As an aside, the mapping of logical CPUIDs to physical resources is rather interesting on Solaris/x4800. On SPARC/Solaris the CPUID layout is strictly geographic, with the highest order bits identifying the socket, the next lower bits identifying the core within that socket, followed by the pipeline (if present) and finally the logical thread context ("strand") on the core. But on Solaris on the x4800 the CPUID layout is as follows: bit [6:6] identifies the hyperthread on a core; bits [5:3] identify the socket, or package in Intel terminology; bits [2:0] identify the core within a socket (see the decode sketch after this list). Such low-level details should be of interest only if you're binding threads -- a bad idea, as the kernel typically handles placement best -- or if you're writing NUMA-aware code that's aware of the ambient placement and makes decisions accordingly.

  • Solaris introduced the so-called critical-threads mechanism, which is expressed by putting a thread into the FX scheduling class at priority 60 (there's a small priocntl sketch after this list). The critical-threads mechanism applies to placement on cores, not on sockets, however. That is, it's an intra-socket policy, not an inter-socket policy.

  • Solaris 11 introduces the Power Aware Dispatcher (PAD), which packs threads instead of spreading them out in an attempt to keep sockets or cores at lower power levels. Maximum dispersal may be good for performance but is anathema to power management. PAD is off by default, but power management policies constitute yet another confounding factor with respect to scheduling and dispatching.

  • The new policy is a compromise between packing and maximum dispersal. If your threads communicate heavily -- one thread reads cache lines last written by some other thread -- then the new dense packing policy may improve performance by reducing traffic on the coherent interconnect. On the other hand, if the threads in your process communicate rarely, then the new packing policy might result in contention on shared computing resources. Unfortunately there's no simple litmus test that says whether packing or spreading is optimal in a given situation. The optimal answer varies by system load, application, number of threads, and platform hardware characteristics. Currently we don't have the necessary tools and sensoria to decide at runtime, so we're reduced to an empirical approach where we run trials and try to decide on a placement policy. The situation is quite frustrating. Relatedly, it's often hard to determine just the right level of concurrency to maximize throughput. (Understanding constructive vs destructive interference in the shared caches would be a good start. We could augment the lines with a small tag field indicating which strand last installed or accessed a line. Given that, we could add new CPU performance counters that tallied misses where a thread evicts a line it installed and misses where a thread displaces a line installed by some other thread.)
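Following up on the x4800 CPUID layout in the list above, here's a trivial decode of those bit fields. It's a sketch that simply assumes the bit [6] / bits [5:3] / bits [2:0] layout described there; psrinfo -vp remains the authoritative way to inspect the real mapping on any given box:

    #include <stdio.h>

    /*
     * Decode a Solaris logical CPUID on the x4800 per the layout above:
     *   bit  [6]   : hyperthread (strand) within the core
     *   bits [5:3] : socket (package, in Intel terminology)
     *   bits [2:0] : core within the socket
     * Sketch only -- verify against psrinfo -vp on your own system.
     */
    static void
    decode_cpuid(int cpuid)
    {
        int strand = (cpuid >> 6) & 0x1;
        int socket = (cpuid >> 3) & 0x7;
        int core   =  cpuid       & 0x7;
        printf("cpuid %3d -> socket %d core %d strand %d\n",
            cpuid, socket, core, strand);
    }

    int
    main(void)
    {
        for (int cpuid = 0; cpuid < 128; cpuid++)   /* 8 sockets x 8 cores x 2 strands */
            decode_cpuid(cpuid);
        return 0;
    }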
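And following up on the critical-threads remark, putting a process into FX at priority 60 from the shell looks roughly like the following. The PID is hypothetical, and priocntl(1) acts on whole processes; per-LWP control requires priocntl(2) with P_LWPID from inside the process:

    # Sketch only: 1234 is a hypothetical PID.
    # -c FX selects the fixed-priority class; -m sets fxuprilim, -p sets fxupri.
    priocntl -s -c FX -m 60 -p 60 -i pid 1234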

Comments:

So either minimize resource contention or minimize communication.

Adding a thread creation time flag to let the user pass this hint to the OS might not be a bad idea even though I guess the OS will dynamically rebalance.

Posted by bank kus on July 03, 2012 at 03:26 PM EDT #

Exactly. Regarding the flag, it's often the case that we have no idea about a thread's sharing/resource patterns at creation time, particularly given thread pools where a thread's role can change over time as it's applied to different tasks. Even if you have a guess about whether a thread is communication-bound or resource-bound, the optimal answer will vary considerably by platform. Regards, -Dave

Posted by guest on July 03, 2012 at 03:34 PM EDT #

Hi Dave,

Thanks for this post - I found it *very* interesting indeed. I'm doing some low latency tuning work with HotSpot 6-32 on Linux 64 bit (2 CPU / 8 core server using RHL : 2.6.18-194.el5), and have some interesting observations re. NUMA and its use with CMS.

I've found that setting the -XX:+UseNUMA flag actually degrades GC times by a small amount, so I'm guessing it must have something to do with the thread placement algorithm. I've also found that reducing the number of GC and CMS threads to 4 apiece (-XX:ParallelGCThreads=4 & -XX:ParallelCMSThreads=4) improves the times quite nicely (to mostly under 1 msec - typically between 550 - 850 nano), otherwise times are averaging around 1.2 - 1.5 msecs. I'd be very interested to hear your input on this.

Cheers,

Adrian.

Posted by Adrian Nakon on July 20, 2012 at 04:20 AM EDT #

Hi Adrian,

In the past, we observed circumstances where the placement, distribution, and balance policies on Linux seemed to privilege power management over conservation of work. I'm not sure if that's the case today, however. (I've mostly switched to >= 3.2 kernels and don't have access to any 2.6-based systems on which to experiment.) The -XX:+UseNUMA switch enables quite a few different paths in the GC code. IIRC, one of the key changes is to try to encourage NUMA affinity between TLABs (thread local allocation buffers and young generation areas) and the underlying heap pages. That is, to the extent possible, we want the pages underlying a TLAB to reside on the same node as the thread associated with the TLAB. In turn, this may avoid dragging the constituent cache lines over the interconnect, which improves latency and reduces channel contention on the interconnect. The latter is important as, in terms of growth trends, the interconnect is the slowest-growing of all the critical links.

In your particular case I'm not sure that thread placement explains your observations. It may simply be that your particular system is over-provisioned at 8 threads and you're seeing destructive interference.

Regards, -Dave

Posted by guest on July 20, 2012 at 11:13 AM EDT #

Hi Dave,

Many Thanks for your reply.

Hmmm.... lots of food for thought there... . So, from what you say, there should be some real advantage to switching on the NUMA support. The TLAB explanation makes absolute sense - the last thing you'd want is jumping across nodes in order to access the memory you're after.

One of the ideas I had behind reducing the thread counts was the possibility of the GC threads interfering with each other due to contention etc. It certainly did make a difference in the case of these two settings - are there any more I should look out for?

I read your article re. Card Marking and false sharing - I haven't seen this setting in the v6-32 JVM that I'm using - is it a v7 addition?

If you know of any articles / info on low latency tuning for HotSpot I'd be very much obliged.

Thanks Again,

Adrian.

Posted by Adrian Nakon on July 20, 2012 at 11:34 AM EDT #

Hi Adrian,

The benefits of UseNUMA on a NUMA system vary with the NUMA ratio, number of nodes, platform, OS and application. The decision isn't always cut-and-dried. (If it were, then we'd surely configure UseNUMA automatically.) Unfortunately when you don't get the expected results it's hard to figure out why, and what you might do about it -- either via switches, or perhaps even changes to your code. For NUMA issues I usually start performance diagnosis by checking and ruling out placement and OS-level issues, and then progress to using performance counters. The latter can be tricky as you need a firm understanding of the hardware, OS, JVM and application. You can also proceed empirically by varying switches and looking for sensitivity. This is often a good way to attack the problem.

Regarding card marking, I'm not sure if the UseCondCardMark changes have been back-ported into the v6-32 stream. For a given release, I usually just try the switch as I don't keep track of what gets back-ported into the legacy code bases (:>). I'm usually focused on the next release.

As for latency and tuning, the blogs by members of the GC team are a good place to start. I'd recommend https://blogs.oracle.com/jonthecollector/.

Regards, -Dave

Posted by dave on July 20, 2012 at 12:01 PM EDT #

Hi Dave,

Thanks for your help and input - much appreciated! :)

Cheers,

Adrian.

Posted by Adrian Nakon on July 20, 2012 at 12:08 PM EDT #
