CMT and Solaris Performance

I (too) long ago mentioned that I would talk a bit about CMT processor architectures, and the performance enhancements we've made (and are making) to Solaris as a result. In this post, I would at least like to get started. :)
Some background... (why / what is CMT)
Looking behind us, we see there have been some clear trends in processor evolution: caches have grown larger, clock speeds have grown faster, and pipelines have become more sophisticated. These trends have held true for so long that we've practically come to expect them. Look at clock speed. For how long has it been the de-facto rule of thumb for classifying and comparing processor performance? The notion that, given two processors, the one with the higher clock speed is faster has been beaten into us. But how reliable is this metric these days? How much correlation is there in practice between clock speed and performance?

In reality, it's pretty clear that cranking up clock speed to squeeze out more performance has become a game of diminishing returns. As clock speeds have increased, so too has the impact of latency on overall performance. What's the point of being able to chew through instructions quickly if much of the processor's time is spent stalled, waiting for memory? Thanks to Amdahl we know that at some point a higher clock speed won't help if latency dominates. One must either reduce the latency the processor experiences, or find something more productive to do during that latency than stalling. Sure, caches can help reduce memory latency, but they have tradeoffs too: as they grow larger, they become slower and more expensive. Whizzier pipeline designs may help the processor chew through more instructions in a given cycle, but stalls are still a problem. What to do?
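To make that concrete, here's a rough back-of-the-envelope calculation (the numbers here are made up for illustration, not measurements): assume one instruction per cycle on a cache hit, a 10% miss rate, and 100ns of memory latency per miss. Doubling the clock barely moves the needle:

#include <stdio.h>

int
main(void)
{
        double miss_rate = 0.10;        /* fraction of instructions that stall */
        double mem_latency_ns = 100.0;  /* memory latency per miss, in ns */
        double clocks_ghz[] = { 1.5, 3.0 };
        int i;

        for (i = 0; i < 2; i++) {
                double cycle_ns = 1.0 / clocks_ghz[i];
                /* average time per instruction = compute time + stall time */
                double avg_ns = cycle_ns + miss_rate * mem_latency_ns;
                printf("%.1f GHz: %.2f ns per instruction\n",
                    clocks_ghz[i], avg_ns);
        }
        return (0);
}

With these (made up) numbers, 1.5GHz works out to about 10.7ns per instruction and 3.0GHz to about 10.3ns... roughly a 3% gain for twice the clock, because the stalls dominate.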
A throughput approach
Chip Multi-Threaded (CMT) architectures diverge somewhat from the traditional ways of getting more work out of a processor. Rather than butting heads with the laws of physics in an attempt to quickly burn through a single instruction stream (stumbling and stalling along the way), CMT processors do more by allowing multiple threads to execute in parallel. The nature of the parallelism depends on how the processor implements CMT, but in general this is a hallmark characteristic, in contrast to traditional (non-CMT) processors that can execute at most one thread at a time. Because multiple threads can execute in parallel, each CMT processor presents itself to the operating system as multiple "logical" CPUs, upon which threads may be scheduled to run. One nice benefit of CMT is that if the operating system already works in traditional multi-processor configurations, then typically little (or no) change is required to get it to function on the CMT system.

But what about performance? Herein lies the challenge. Typically, the logical CPUs presented by a CMT processor share some of the processor's physical resources and/or facilities, and depending on the CMT implementation this sharing can be extensive. This means that a given thread executing on a CMT processor may impact (for better or worse) the performance of the other threads executing on the same chip, depending on the nature of the sharing. Before I dig too deep here, and to give some context, let's look at some common CMT implementations:

Some processors (such as the UltraSPARC-IV and the dual-core AMD Opteron) implement CMT by incorporating multiple traditional processor cores in a single physical package. This technique is known as Chip Multi-Processing (or CMP), and from a performance perspective it is generally the tamest. In some cases, one or more on-chip caches may be shared between the processor cores (allowing threads to constructively or destructively interfere with each other), and shared logic and datapaths to cache, memory, and memory controllers may exist, which can also cause contention and bandwidth issues.

Far more interesting are the multi-threaded processor architectures, where each core may present multiple logical CPUs, sometimes called "strands". Examples include Intel's Hyper-Threaded P4/Xeon processors and Sun's forthcoming Niagara processor (Niagara is actually a threaded CMP: 4 threads per core, with up to 8 cores per chip). This architecture directly addresses the latency problem by allowing the processor core to execute instructions from a different instruction stream (thread) if a given thread stalls. Threads executing on the CPUs presented by a given core will typically share everything except a set of registers (necessary for maintaining distinct thread state), so one thread's performance impact on another (on the same core) can be considerable. Everything, including the caches, TLBs, ALUs/FPUs, and the pipeline itself, is shared.

Looking ahead, it's quite possible (even probable) that future processors will have many cores, each with many threads. Various sharing relationships will likely exist between the various logical CPUs: some may share caches, some may share pipelines, some may share floating point units, some may share something else, while others (on different physical processors) may share nothing at all. All sorts of new opportunities will arise (and already have) for threads to contend for resources and impact each other via cache interference... or not, depending on where (on which logical CPUs) they are dispatched to run, and which processor resources the threads need.
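As a thought experiment (this is my own sketch, not the actual Solaris data structures), one way to think about these relationships is as a hierarchy of CPU groups, each tagged with the resource its members share:

typedef enum {
        SHARE_NONE,             /* separate physical processors */
        SHARE_MEMORY_BUS,       /* e.g. cores behind one memory controller */
        SHARE_CACHE,            /* cores sharing an on-chip cache */
        SHARE_PIPELINE          /* strands sharing a core's pipeline */
} share_type_t;

typedef struct cpu_group {
        share_type_t            cg_shared;      /* what the members share */
        int                     cg_ncpus;       /* logical CPUs in the group */
        const int               *cg_cpu_ids;    /* which logical CPU ids */
        int                     cg_nchildren;
        struct cpu_group        **cg_children;  /* finer-grained groups */
} cpu_group_t;

The lowest group containing two given logical CPUs then tells the dispatcher how "close" they are, and how expensive it might be to double them up.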

What follows is a brief description of some of the enhancements we've already made to Solaris to improve performance on CMT processors (specifically CMP and Hyper-Threaded P4/Xeon). Andrei, Sasha, and I have more recently been working extensively on the dispatcher implementation in preparation for Niagara and beyond. Once the pearly gates of OpenSolaris have opened, I'll follow up with a tour of the implementation of the enhancements described below, and will talk more about our work in progress and the direction we're headed.
Load balancing, logical vs. physical awareness
To improve performance (and performance determinism) on CMP and Hyper-Threaded P4/Xeon processors, our first goal was to give the Solaris dispatcher awareness of which logical CPUs share which physical processors, and to put some load balancing policy in place. Given 4 physical processors, each with 2 logical CPUs, if one were to run 4 threads, mpstat(1M) should clearly show one thread per chip (with no doubling up). This policy seeks to minimize contention over any per-physical-processor resources/facilities. Once this awareness was in place, we enhanced psrinfo(1M) to display the logical-to-physical CPU mappings:

esaxe@badaboom$ psrinfo -vp
The physical processor has 2 virtual processors (0, 4)
  x86 (chipid 0x0 GenuineIntel family 15 model 2 step 6 clock 3000 MHz)
        Intel(r) Xeon(tm) MP CPU 3.00GHz
The physical processor has 2 virtual processors (1, 5)
  x86 (chipid 0x1 GenuineIntel family 15 model 2 step 6 clock 3000 MHz)
        Intel(r) Xeon(tm) MP CPU 3.00GHz
The physical processor has 2 virtual processors (2, 6)
  x86 (chipid 0x2 GenuineIntel family 15 model 2 step 6 clock 3000 MHz)
        Intel(r) Xeon(tm) MP CPU 3.00GHz
The physical processor has 2 virtual processors (3, 7)
  x86 (chipid 0x3 GenuineIntel family 15 model 2 step 6 clock 3000 MHz)
        Intel(r) Xeon(tm) MP CPU 3.00GHz

The above output is from a 4-processor Hyper-Threaded Xeon MP box. You can see that CPUs 0 and 4 are Hyper-Threads from the same chip. If I run 4 busy loops, we see the dispatcher in action via mpstat(1M):

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0   315  207   37    0    3    0    0    25    0   0   0 100
  1    0   0    0    18    3    0    5    0    0    0     0  100   0   0   0
  2    0   0    0    15    2    0    5    0    0    0     0  100   0   0   0
  3    1   0   49    16    7   17    0    1    2    0    74    0   0   0 100
  4    0   0    0    15    1    0    6    0    0    0     0  100   0   0   0
  5    0   0    0    13    0    8    0    1    0    0     0    0   0   0 100
  6    0   0    0    33    0   49    0    1    0    0     3    0   0   0 100
  7    0   0    0    13    0    0    5    0    0    0     0  100   0   0   0

The "100" in the usr column indicates the CPU is at 100% user time, clearly a busy loop. With processors that are either multi-core, or multi-threaded, only a single level of load balancing (at the physical processor level) is needed. When multi-core, multi-threaded, multi-processor configurations emerge, the dispatcher will need to balance across additional levels as well.
Threaded processor enhancements
On threaded processor architectures, Solaris makes use of the "halt" (or equivalent) instruction to "suspend" CPUs running in idle() when no work is available. When a thread becomes runnable, a cross trap is sent to "wake up" a designated CPU. We found that on our 2-processor Hyper-Threaded P4 test box, not halting idle CPUs cost somewhere around a ~30% performance hit on a single-threaded benchmark, and around 47% higher power consumption (as measured on an idle system).
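To illustrate the idea, here's a highly simplified sketch (the types and helper names below are mine, not the actual Solaris idle loop):

typedef struct cpu {
        int     cpu_id;
        int     cpu_halted;     /* nonzero while sitting in "hlt" */
} cpu_t;

/* Hypothetical primitives assumed to exist for this sketch: */
extern int  cpu_has_runnable_work(cpu_t *);
extern void halt_until_interrupt(void);         /* e.g. "sti; hlt" on x86 */
extern void send_cross_trap(int cpu_id);        /* interrupt a remote CPU */

static void
cpu_idle(cpu_t *cp)
{
        while (!cpu_has_runnable_work(cp)) {
                cp->cpu_halted = 1;
                /*
                 * Re-check for work after advertising the halt, to close
                 * the window against a wakeup that raced with us.
                 */
                if (!cpu_has_runnable_work(cp))
                        halt_until_interrupt();
                cp->cpu_halted = 0;
        }
}

static void
cpu_wakeup(cpu_t *cp)
{
        if (cp->cpu_halted)
                send_cross_trap(cp->cpu_id);    /* bring it out of halt */
}

A halted strand gives its pipeline resources back to its sibling, which is where both the performance and power wins come from.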

Solaris also makes use of the Pentium 4 pause instruction which, when inserted in spin loops, lessens the performance impact of a spinning thread on another thread running on the same chip. This instruction was added to spin loops both in the kernel and in libc.
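The pattern looks roughly like this (a sketch using gcc-style inline assembly; the function names are mine):

static inline void
cpu_pause(void)
{
        __asm__ __volatile__("pause");  /* hint: this thread is just spinning */
}

void
spin_until_set(volatile int *flag)
{
        while (*flag == 0)
                cpu_pause();            /* be polite to the sibling strand */
}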
More to come
OK, I'm going to have to cut this post here. Obviously, what I've touched on is just the tip of the iceberg. I haven't even talked about application performance issues, the importance of workload observability on CMT, or the API conundrum.
Comments:

Thanks for the extra info

Posted by netegis on October 31, 2005 at 04:54 PM PST #

I need some information about the Solaris scheduler, something like graphs or diagrams.

Please can you help me?

Thank you.

Posted by Sebastian on October 16, 2007 at 09:38 AM PDT #

Hi Sebastian,

I recently gave a presentation to the SVOSUG group about OpenSolaris scheduling and processor management.

You can find the presentation here:
http://blogs.sun.com/esaxe/resource/cpu_sched_svosug.pdf

Also, a good (more comprehensive) resource is the "Solaris Internals" book.

These would be good places to start...

Thanks,
-Eric

Posted by esaxe on October 16, 2007 at 10:03 AM PDT #

I know pidlock is used to protect process create/exit/swap and so on. When a user kernel thread sleeps while holding pidlock, does the dispatcher stop working, or will the system hang? So what is the real function of pidlock?

Thanks!

Posted by minibao on October 13, 2008 at 06:11 PM PDT #

>minibao wrote:

I know pidlock is used to protect process create/exit/swap and so on. When a user kernel thread sleeps while holding pidlock, does the dispatcher stop working, or will the system hang? So what is the real function of pidlock?

Hi Minibao,
pidlock protects the process list. Grabbing that lock prevents new processes from joining (or leaving) that list. One common use is to grab pidlock when searching the process list to find a process of interest (ensuring the list stays consistent), and then to grab the process lock (p_lock) for the individual process before dropping pidlock.
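In simplified form, that pattern looks something like this sketch (not verbatim kernel source):

proc_t *
find_and_lock(pid_t pid)
{
        proc_t *p;

        mutex_enter(&pidlock);                  /* freeze the process list */
        for (p = practive; p != NULL; p = p->p_next) {
                if (p->p_pid == pid) {
                        mutex_enter(&p->p_lock); /* hold the process itself */
                        break;
                }
        }
        mutex_exit(&pidlock);                   /* list no longer needed */
        return (p);                             /* NULL if not found */
}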

The scheduler/dispatcher does not require any adaptive lock (like pidlock) for its operation... so if a thread grabs the lock and goes to sleep for a long time, this may cause lock contention issues for operations like fork(), etc., but it won't impede the internal operation of the scheduler/dispatcher subsystem.

Posted by esaxe on October 16, 2008 at 05:51 AM PDT #
