Help for the NUMA Weary

When you think about running on a machine with a non-uniform memory architecture (NUMA), do you think, "Cool, some memory is really close"? Or do you think, "Why are my memory latency choices bad, worse and worst"? Or are you like me and try not to think about it at all?

Well for all of the above, this option rocks.

Executive summary: There is an option that attempts to improve the performance of an application on a NUMA machine by increasing the application's use of lower latency memory. This is done by creating per cpu pools of memory in the young generation. A thread AA that runs on a cpu XX will get objects allocated out of the pool for XX. When cpu XX first touches a new page of memory, Solaris tries to assign to XX memory that is closer to XX. Additionally, if a thread AA has run on XX, Solaris tries to keep AA running on XX. When this all works as we hope, AA is accessing the memory that is closer to it more of the time. This option does not improve GC performance but improves the performance of application threads. I'll say something about support for non-Solaris platforms at the end.

So if you're asking yourself if you should care about NUMA, two examples of Sun servers with NUMA architectures are the Sun Sparc E6900 and the AMD Opteron X4600. Sun's early chip multithreading (CMT) boxes (T1000, T2000, T5120 and T5220) are not NUMA, but the later CMT T5140 and T5240 are.

In the above summary I used the term "per cpu pools" for brevity when I should really have used the term "per node pools" to be more precise. Nodes in this context have 1 or more cpu's, local memory and interconnect as a minimum. I'll try to be more precise below, but if you see node and think cpu, it's close enough.

On a NUMA system there is some memory that is closer (local) to the cpu's in a node and some memory that is farther away (remote). Local and remote are relative to a node. On Solaris 10 the OS tries to improve the performance of a thread AA executing on a node RR by increasing AA's use of memory local to RR. This is achieved by a "first touch" policy (my words, not a technical term). AA can make a call to get memory, but physical memory is not committed to AA until the memory is used (first touched). When AA executing on RR first touches memory, Solaris tries to assign it memory that is local to RR. If there is no available local memory, AA will get memory remote from RR.
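To make the first touch idea concrete, here is a toy model in Java. It is only a sketch of the policy as described above, not how Solaris implements it: the class `FirstTouchModel`, its page table map and the "any node with free memory" fallback are all made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of a "first touch" placement policy (not Solaris code).
class FirstTouchModel {
    private final int[] freePagesPerNode;                          // free physical pages on each node
    private final Map<Long, Integer> pageToNode = new HashMap<>(); // virtual page -> node backing it

    FirstTouchModel(int[] freePagesPerNode) {
        this.freePagesPerNode = freePagesPerNode.clone();
    }

    /** A thread running on `node` touches `page`; returns the node whose memory backs the page. */
    int touch(long page, int node) {
        Integer backing = pageToNode.get(page);
        if (backing != null) {
            return backing; // already committed: a later touch does not move the page
        }
        int chosen = pickNode(node);
        freePagesPerNode[chosen]--;
        pageToNode.put(page, chosen);
        return chosen;
    }

    /** Prefers the local node; otherwise falls back to any node with free memory (assumed). */
    private int pickNode(int local) {
        if (freePagesPerNode[local] > 0) {
            return local;
        }
        for (int n = 0; n < freePagesPerNode.length; n++) {
            if (freePagesPerNode[n] > 0) {
                return n;
            }
        }
        throw new IllegalStateException("no free physical memory");
    }

    public static void main(String[] args) {
        FirstTouchModel os = new FirstTouchModel(new int[] {1, 2});
        System.out.println(os.touch(10, 0)); // 0: local memory was available
        System.out.println(os.touch(11, 0)); // 1: node 0 is out of memory, so the page is remote
        System.out.println(os.touch(10, 1)); // 0: page 10 was placed by its first touch
    }
}
```

The second call is the case to keep in mind: the thread is on node 0 but ends up with remote memory because nothing local was free.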

The UseNUMA feature takes advantage of this policy to get better locality for an application thread that allocates objects and then uses them (as opposed to an architecture where one thread allocates objects and some other thread uses them).

In JDK6 update 2 we added -XX:+UseNUMA in the throughput collector (UseParallelGC). When you turn this feature on, the JVM divides the young generation into separate pools, 1 pool for each node. When AA allocates an object, the JVM looks to see what node AA is on and then allocates the object from the pool for that node. In the diagram AA is running on RR and RR has its pool in the young generation.

Combine this with a first touch policy and the memory in the pool for RR is first touched by a thread running on RR and so is likely to be local to RR. And as I mentioned above, if AA has run on RR, Solaris will try to keep AA executing on RR. So best case is that you have AA accessing local memory most of the time. It may sound a bit like wishful thinking, but we've seen very nice performance improvements on some applications.
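The per node pools can be pictured with a small Java sketch. This is a toy model under my own assumptions, not the HotSpot code: `PerNodePools`, the equal initial split and the bump-pointer `Pool` are hypothetical, and the node of the allocating thread is simply passed in as a parameter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of a young generation split into one pool per node.
class PerNodePools {
    /** One contiguous address range [start, end) with a bump pointer. */
    static final class Pool {
        final long start;
        final long end;
        long top;

        Pool(long start, long end) {
            this.start = start;
            this.end = end;
            this.top = start;
        }

        /** Returns the address of the new object, or -1 if the pool is exhausted. */
        long allocate(long size) {
            if (top + size > end) {
                return -1;
            }
            long addr = top;
            top += size;
            return addr;
        }
    }

    final List<Pool> pools = new ArrayList<>();

    /** Splits the young generation evenly, one pool per node (an assumed starting point). */
    PerNodePools(long youngGenBytes, int nodes) {
        long perNode = youngGenBytes / nodes;
        for (int n = 0; n < nodes; n++) {
            pools.add(new Pool(n * perNode, (n + 1) * perNode));
        }
    }

    /** Allocates from the pool that belongs to the node the thread is running on. */
    long allocate(int currentNode, long size) {
        return pools.get(currentNode).allocate(size);
    }

    public static void main(String[] args) {
        PerNodePools young = new PerNodePools(1024, 2);
        System.out.println(young.allocate(0, 100)); // 0
        System.out.println(young.allocate(1, 100)); // 512
        System.out.println(young.allocate(0, 100)); // 100
        System.out.println(young.allocate(1, 500)); // -1: the pool for node 1 is exhausted
    }
}
```

Because every address in a pool is only ever handed out to threads on one node, the first touch of each page in that pool comes from that node.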

Contrast this with the allocation without per node pools. As a thread does allocations, it marches deeper and deeper into the young generation. A thread actually does allocations out of thread local buffers (TLAB's) but even so, the TLAB's for a thread are generally scattered throughout the young generation and it is even more wishful thinking to expect those TLAB's to all be mapped to local memory.

This part is extra credit and you don't really need to know about it to use UseNUMA. Solaris has the concept of locality groups or lgroups. You can read more about lgroups at

Locality Group APIs

A node has an lgroup and within that lgroup are the resources that are closer to the node. There is actually a hierarchy of lgroups, but let's talk as if a node has an lgroup that has its closest resources (local resources) and the resources farther away are just someplace else (remote resources). Thread AA running on RR can ask what lgroup MM it is in and can ask if a page of memory is in MM. This type of information is used by the page scanner that I describe below.

There are a couple of caveats.

The young generation is divided into per node pools. If any of these pools are exhausted, a minor collection is done. That's potentially wasteful of memory so to ameliorate that, the sizes of the pools are adjusted dynamically so that threads that do more allocation get larger pools.
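One simple way to picture that adjustment is to split the space in proportion to how much each pool allocated since the last collection. The Java sketch below does exactly that and nothing more. It is an assumption for illustration, not the JVM's actual sizing policy; `PoolResizer`, the `minBytes` floor and the rounding rule are all hypothetical.

```java
import java.util.Arrays;

// Hypothetical proportional resizing of per node pools (not the real JVM policy).
class PoolResizer {
    /**
     * Splits totalBytes among the pools in proportion to the bytes each pool
     * allocated since the last collection, with at least minBytes per pool.
     */
    static long[] resize(long totalBytes, long minBytes, long[] allocatedBytes) {
        int n = allocatedBytes.length;
        long spare = totalBytes - minBytes * n; // what is left after the guaranteed minimum
        if (spare < 0) {
            throw new IllegalArgumentException("totalBytes too small for the minimum pool size");
        }
        long sum = 0;
        for (long a : allocatedBytes) {
            sum += a;
        }
        long[] sizes = new long[n];
        long assigned = 0;
        for (int i = 0; i < n; i++) {
            // With no allocation at all, fall back to an even split.
            long share = (sum == 0) ? spare / n : spare * allocatedBytes[i] / sum;
            sizes[i] = minBytes + share;
            assigned += sizes[i];
        }
        sizes[0] += totalBytes - assigned; // rounding leftovers go to the first pool
        return sizes;
    }

    public static void main(String[] args) {
        long[] sizes = resize(1000, 100, new long[] {300, 100});
        System.out.println(Arrays.toString(sizes)); // [700, 300]
    }
}
```

The node whose threads allocated three times as much ends up with the larger pool, so it takes longer for either pool to fill up and trigger a minor collection.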

In situations where memory is tight and there are several processes running on the system, the per node pools can be a mixture of local and remote memory. That simply comes about when RR first touches a page in its pool and there is no local memory available. It just gets remote memory. To try to increase the amount of local memory in the pool, there is a scanner that looks to see if a page in a pool for RR is in the lgroup MM for RR. If the page is not, the scanner releases that page back to the OS in the hope that, when AA on RR again first touches that page, the OS will allocate memory in MM for that page. Recall that eden in the young generation is usually empty after a minor collection so these pages can be released. The scanner also looks for small pages in the pool. On Solaris you can have a mixture of pages of different sizes in a pool and performance can be improved by using more large pages and fewer small pages. So the scanner also releases small pages in the hope that it will be allocated a large page the next time it uses the memory. This scanning is done after a collection and only scans a certain number of pages (NUMAPageScanRate) per collection so as to bound the amount of scanning done per collection.
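Here is a rough Java sketch of such a bounded scan. It models only what is described above: release a page if it is remote or small, and look at no more than a fixed number of pages per collection. The names `PoolPageScanner` and `Page` are hypothetical, and resuming the next scan where the previous one stopped is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of the bounded page scan (not the JVM implementation).
class PoolPageScanner {
    /** What the scanner needs to know about one page of a pool. */
    static final class Page {
        final int lgroup;    // lgroup the page is currently mapped in
        final boolean large; // true for a large page
        boolean released;    // true once the page has been given back to the OS

        Page(int lgroup, boolean large) {
            this.lgroup = lgroup;
            this.large = large;
        }
    }

    private int cursor = 0; // where the next scan resumes (assumed behavior)

    /** Scans at most scanRate pages and releases remote or small ones; returns the number released. */
    int scan(List<Page> pool, int homeLgroup, int scanRate) {
        int released = 0;
        int limit = Math.min(scanRate, pool.size());
        for (int i = 0; i < limit; i++) {
            Page p = pool.get(cursor);
            cursor = (cursor + 1) % pool.size();
            if (!p.released && (p.lgroup != homeLgroup || !p.large)) {
                p.released = true;
                released++;
            }
        }
        return released;
    }

    public static void main(String[] args) {
        List<Page> pool = new ArrayList<>();
        pool.add(new Page(1, true));  // local and large: keep
        pool.add(new Page(2, true));  // remote: release
        pool.add(new Page(1, false)); // small: release
        pool.add(new Page(2, false)); // remote and small: release
        PoolPageScanner scanner = new PoolPageScanner();
        System.out.println(scanner.scan(pool, 1, 3)); // 2
        System.out.println(scanner.scan(pool, 1, 3)); // 1
    }
}
```

The cap on pages per scan is what keeps the cost of a collection bounded, at the price of needing several collections to clean up a large pool.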

To review, if you have

  • Thread AA running on node RR and the JVM allocating objects for AA in the pool for RR.
  • Solaris mapping memory for the pool for RR in the lgroup MM (i.e., local to RR) based on first touch.
  • Solaris keeping thread AA running on node RR.

then your application will run faster. And all you have to do is turn on -XX:+UseNUMA.
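If you want to double check that the flag actually made it onto the command line, a small Java program can look at the JVM's input arguments. This is only a sketch: the class name `NumaFlagCheck` is made up, it assumes the last -XX:+UseNUMA or -XX:-UseNUMA switch is the one that counts, and it reports what was requested, not whether the JVM found a NUMA machine to use it on.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper: reports whether -XX:+UseNUMA was passed to this JVM.
class NumaFlagCheck {
    /** Returns true if the last UseNUMA switch among the JVM arguments turns the feature on. */
    static boolean useNumaRequested(List<String> jvmArgs) {
        boolean enabled = false;
        for (String arg : jvmArgs) {
            if (arg.equals("-XX:+UseNUMA")) {
                enabled = true;
            } else if (arg.equals("-XX:-UseNUMA")) {
                enabled = false;
            }
        }
        return enabled;
    }

    public static void main(String[] args) {
        // Arguments given to the JVM itself, not to the application's main method.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("UseNUMA requested: " + useNumaRequested(jvmArgs));
    }
}
```

After compiling, running it as `java -XX:+UseParallelGC -XX:+UseNUMA NumaFlagCheck` should report that the option was requested.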

An implementation on Linux is in the works and will be in an upcoming update of JDK6. The API's are different for binding the per node pools to local memory (e.g., the JVM requests that pages be bound rather than relying on first touch) but you really don't need to know about any differences. Just turn it on. We've looked at an implementation for Windows platforms and have not figured out how to do it yet.

If you would like to know a little more about dealing with NUMA machines, you might find this useful.

Increasing Application Performance on NUMA Architectures


That looks really attractive - looking forward to when the linux implementation comes out :-)

Posted by Alex Lam on May 19, 2008 at 05:47 AM PDT #

> We've looked at an implementation for windows platforms and have not figured out how to do it yet. reads:
"In the Microsoft Windows environment ... The function to set memory affinity for a thread is VirtualAlloc( )[9]. This function gives the
developer the choice to bind memory immediately on allocation or to defer binding until first touch."
It seems it could be implemented in Solaris style.

Posted by guest on June 26, 2008 at 01:00 PM PDT #
