One of my colleagues did an excellent bit of analysis recently, it pulls together a fair number of related topics, so I hope you'll find it interesting.
We'll start with NUMA. Non-Uniform Memory Access. This is in contrast to UMA - Uniform Memory Access. This relates to memory latency - how long does it take to get data from memory to the processor. If you take a single CPU box, the memory latency is basically a measurement of the wires between the processor and the memory chips, it typically is about 90ns, can be as low as 60ns. For a 3GHz chip this is from around 200 to 300 cycles, which is a fair length of time.
Suppose we add a second chip into the system. The memory latency increases because there's now a bunch of communication that needs to happen between the two chips. The communication consists of things like checking that more recent data is not in the cache of the other chip, co-ordinating access to the same memory bank, accessing memory that is controlled by the other processor. The upshot of all this is that memory latency increases. However, that's not all.
If you have two chips together with a bunch of memory, you can have various configurations. The most likely one is that each chip gets half the memory. If one chip has to access memory that the other chip owns, this is going to take longer than if the memory is attached to that chip. Typically you might find that local memory takes 90ns to access, and remote memory 120ns.
One way of dealing with this disparity is to interleave the memory, so one cacheline will be local, the next remote. Doing this you'll see an average memory latency of 105ns. Although the memory latency is longer than the optimal, there's nothing a programmer (or an operating system) can do about it.
However, those of use who care about performance will jump through hoops of fire to get that lower memory latency. Plus as the disparity in memory latency grows larger, it makes less and less sense to average the cost. Imagine a situation on a large multi-board system where the on-board memory latency might be 150ns, but the cross-board latency would be closer to 300ns (I should point out that I'm using top-of-the-head numbers for all latencies, I'm not measuring them on any systems). The impact of doing this averaging could be a substantial slow-down in performance for any application that doesn't fit into cache (which is most apps). (There are other reasons for not doing this, such as limiting the amount of traffic that needs to go across the various busses.)
So most systems with more than one CPU will see some element of NUMA. Solaris has contained some memory placement optimisations MPO since Solaris 9. These optimisations attempt to allocate memory locally to the processor that is running the application. OpenSolaris has the lgrpinfo command that provides an interface to see the levels of memory locality in the system.
Solaris will attempt to schedule threads so that they remain in their locality group - taking advantage of the local memory. Another way of controlling performance is to use binding to keep processes, or threads on a particular processor. This can be done through the pbind command. Processor sets can performance a similar job (as can zones, or even logical domains), or directly through processor_bind.
Binding can be a tricky thing to get right. For example in a system where there are multiple active users, it is quite possible to end up in a situation where one virtual processor is oversubscribed with processes, whilst another is completely idle. However, in situations where this level of control enables better performance then binding can be hugely helpful.
One situation where binding is commonly used is for running OpenMP programs. In fact, it is so common that the OpenMP library has built in support for binding through the environment variable SUNW_MP_PROCBIND. This variable enables the user to specify which threads are bound to which logical processors.
It is worth pointing out that binding does not just help memory locality issues. Another situation where binding helps is thread migrations. This is the situation where an interrupt, or another thread requires attention and this causes the thread currently running on the processor to be descheduled. In some situations the descheduled thread will get scheduled onto another virtual processor. In some instances that may be the correct decision. In other instances it may result in lower than expected performance because the data that the thread needs is still in the cache on the old processor, and also because the migration of that thread may cause a cascade of migrations of other threads.
The particular situation we hit was that one code when bound showed bimodal distributions of runtimes. It had a 50% chance of running fast or slow. We were using OpenMP as well as the SUNW_MP_PROCBIND environment variable, so in theory we'd controlled for everything. However, the program didn't hit the parallel section until after a few minutes of running, and examining what was happening using both pbind and also the Performance Analyzer indicated what the problem was.
The environment variable SUNW_MP_PROCBIND currently binds threads once the code reaches the parallel region. Until that point the process is unbound. Since the process is unbound, Solaris can schedule it to any available virtual CPU. During the unbound time, the process allocated the memory that it needed, and the MPO feature of Solaris ensured that the memory was allocated locally. I'm sure you can see where this is heading.... Now, once the code hit the parallel region, the binding occurred, and the main thread was bound to a particular locality group, half the time this group was the same group where it had been running before, and half the time it was a different locality group. If the locality group was the same, then memory would be local, otherwise memory would be remote (and performance slower).
We put together a LD_PRELOAD library to prove it. The following code has a parallel section in it which gets called during initialisation. This ensures that binding has already taken place by the time the master thread starts.
#pragma omp parallel sections
#pragma omp section
The code is compiled and used with:
$ cc -O -xopenmp -G -Kpic -o par.so par.c
$ LD_PRELOAD=./par.so ./a.out