
Jon Masamitsu's Weblog

Help for the NUMA Weary

When you think about running on a machine with a non-uniform memory architecture (NUMA),
do you think, "Cool, some memory is really close"? Or do you think, "Why are my
memory latency choices bad, worse and worst"? Or are you like me and try not to
think about it at all?

Well, for all of the above, this option rocks.


Executive summary: There is an option that attempts to improve the performance of an application
on a NUMA machine by increasing the application's use of lower-latency memory. It does this
by creating per-cpu pools of memory in the young generation. A thread AA that
runs on a cpu XX gets its objects allocated out of the pool for XX. When cpu XX first touches
a new page of memory, Solaris tries to assign XX memory that is closer to XX. Additionally, if a
thread AA has run on XX, Solaris tries to keep AA running
on XX. When this all works as we hope,
AA is accessing the memory that is closer to it more of the time.
This option does not improve GC performance
but improves the performance of application threads.
I'll say something about support for
non-Solaris platforms at the end.

So if you're asking yourself whether you should care about NUMA, two
examples of Sun servers with NUMA architectures are the Sun Fire E6900
(SPARC based) and the Sun Fire X4600 (AMD Opteron based). Sun's early
chip multithreading (CMT) boxes
(T1000, T2000, T5120 and T5220) are not NUMA, but the later CMT
T5140 and T5240 are.

In the above summary I used the term "per-cpu pools" for brevity when
I should really have used the more precise term "per-node pools".
A node in this context has, at a minimum, one or more cpus, local memory, and
an interconnect. I'll try to be more precise below, but
if you see node and think cpu, it's close enough.

On a NUMA system some memory is closer (local) to the cpus in
a node and some memory
is farther away (remote); local and remote are relative to a node.
On Solaris 10 the OS tries to improve the performance of a thread AA
executing on a node RR by increasing AA's use of memory local to RR.
This is achieved by a "first touch" policy (my words,
not a technical term). AA can make a call to get memory, but physical memory is
not committed to AA until the memory is used (first touched).
When AA executing on RR
first touches memory, Solaris tries to assign it memory that is local to RR.
If no local memory is available, AA gets memory remote from RR.

The UseNUMA feature takes advantage of this policy to get
better locality for an application thread that allocates objects
and then uses them (as opposed to an architecture where one
thread allocates objects and some other thread uses them).
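To make that concrete, here is a minimal Java sketch of the pattern UseNUMA is
designed to help (the class name, thread count, and array size are mine, purely
for illustration): each worker allocates its own data and then uses it on the
same thread.

    // NUMA-friendly: each thread allocates its own data and uses it
    // itself. With -XX:+UseNUMA the allocation comes from the
    // young-generation pool for the node the thread is running on,
    // and first touch tends to back those pages with local memory.
    public class NumaFriendly {
        public static void main(String[] args) throws InterruptedException {
            Thread[] workers = new Thread[4];
            for (int i = 0; i < workers.length; i++) {
                workers[i] = new Thread(new Runnable() {
                    public void run() {
                        long[] data = new long[1 << 20]; // allocated by this thread
                        long sum = 0;
                        for (int j = 0; j < data.length; j++) {
                            data[j] = j;    // touched by the allocating thread ...
                            sum += data[j]; // ... and used by the same thread
                        }
                        System.out.println(Thread.currentThread().getName()
                                + " sum = " + sum);
                    }
                });
                workers[i].start();
            }
            for (int i = 0; i < workers.length; i++) {
                workers[i].join();
            }
        }
    }

A producer/consumer design, where a different thread (possibly on another node)
consumes the objects, is exactly the case this option can't help with.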

In JDK6 update 2 we added -XX:+UseNUMA for the throughput collector (UseParallelGC).
When
you turn this feature on, the JVM divides the young generation
into separate pools, one pool for each node. When AA allocates an object, the
JVM looks to see what node AA is on and then allocates
the object from the pool
for that node. So if AA is running on RR, AA's objects come from RR's pool in
the young generation.
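Turning the feature on is just a matter of flags; MyApp below is a stand-in
for your own application:

    java -XX:+UseParallelGC -XX:+UseNUMA MyApp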

Combine this with
the first touch policy: the memory in the pool for RR is first touched
by a thread running on RR and so is likely to be local to RR.
And, as I mentioned above, if AA has run on RR, Solaris will try to keep AA executing on
RR. So the best case is that
AA is accessing local memory most of the time. It may sound
a bit like wishful thinking, but we've seen very nice performance
improvements on some applications.

Contrast this with allocation without per-node pools. As a
thread does allocations, it marches deeper and deeper into the
young generation. A thread actually allocates out of thread-local
allocation buffers (TLABs), but even so, the TLABs for a thread are
generally scattered throughout the young generation, and it is
even more wishful thinking to expect those TLABs to all be mapped to
local memory.

This part is extra credit and you don't really need to know about it
to use UseNUMA. Solaris has the concept of locality groups
or lgroups. You can read more about lgroups at

Locality Group APIs

A node has an lgroup, and within that lgroup are the resources that are closer to the
node. There is actually a hierarchy of lgroups, but let's talk as if a node has
an lgroup that holds its closest resources (local resources) while the resources farther
away are just someplace else (remote resources). A thread AA running on RR can ask
what lgroup MM it is in and can ask whether a page of memory is in MM. This type of
information is used by the page scanner that I describe below.

There are a couple of caveats.

The young generation is divided into per-node pools. If
any of these pools is exhausted, a minor collection is done. That's potentially wasteful
of memory, so to ameliorate it, the sizes of the pools are adjusted dynamically so
that threads that do more allocation get larger pools.
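As an aside, I believe this dynamic resizing has its own switch,
UseAdaptiveNUMAChunkSizing, which is on by default; if I'm right about that,
you could pin the pool sizes while experimenting with something like

    java -XX:+UseParallelGC -XX:+UseNUMA -XX:-UseAdaptiveNUMAChunkSizing MyApp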

In situations where memory is tight and there are several processes running on the
system, the per-node pools can be a mixture of local and remote memory. That simply
comes about when a thread on RR first touches a page in RR's pool and there is no local memory
available:
it just gets remote memory. To try to increase the amount of local memory in the
pool, there is a scanner that checks whether each page in the pool for RR is in the lgroup
MM for RR. If a page is not, the scanner releases it back to the OS in the
hope that, when a thread on RR again first touches that page, the OS will allocate memory in MM for
it. Recall that eden in the young generation is usually empty
after a minor collection, so these pages can be released.
The scanner also looks for small pages in the pool. On Solaris you
can have a mixture of page sizes in a pool, and performance
can be improved by using more large pages and fewer small pages. So the scanner
also releases small pages in the hope that the memory will be backed by a large page
the next time it is used. This scanning is done
after a collection and only scans a certain number of pages (NUMAPageScanRate)
per collection so as to bound the amount of scanning done per collection.
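The scan rate is settable if you want to experiment with how quickly pages
migrate back to local memory; the value below is just an example, not a
recommendation:

    java -XX:+UseParallelGC -XX:+UseNUMA -XX:NUMAPageScanRate=512 MyApp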

To review, if you have


  • Thread AA running on node RR and the JVM allocating objects for AA in the pool for RR.

  • Solaris mapping memory for the pool for RR in the lgroup MM (i.e., local to RR) based on first touch.

  • Solaris keeping thread AA running on node RR.

then your application will run faster.
And all you have to do is turn on -XX:+UseNUMA.

An implementation on Linux is in the works and will be in an upcoming update of JDK6.
The APIs for binding the per-node pools to local memory are different (e.g.,
the JVM requests that pages be bound rather than relying on first touch),
but you really don't need to know about any differences. Just turn it on.
We've looked at an implementation for Windows
platforms and have not figured out how to do it yet.

If you would like to know a little more about dealing with NUMA machines,
you might find this useful.

Increasing Application Performance on NUMA Architectures

Comments (2)
  • Alex Lam Monday, May 19, 2008

    That looks really attractive - looking forward to when the linux implementation comes out :-)


  • guest Thursday, June 26, 2008

    > We've looked at an implementation for windows platforms and have not figured out how to do it yet.

    http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40555.pdf reads:

    "In the Microsoft Windows environment ... The function to set memory affinity for a thread is VirtualAlloc( )[9]. This function gives the developer the choice to bind memory immediately on allocation or to defer binding until first touch."

    It seems it could be implemented in Solaris style.

