What Drove Processor Design Toward Chip Multithreading (CMT)?

I thought of a way of explaining the benefit of CMT (or more specifically, interleaved multithreading - see this article for details) using an analogy the other day. Bear with me as I wax lyrical on computer history...

Way back in the origins of the computer, there was only one process (and only one processor). There was no operating system, so in turn there were no concepts like:

  • scheduling
  • I/O interrupts
  • time-sharing
  • multi-threading

What am I getting at? Well, let me pick out a few of the advances in computing, so I can explain why interleaved multithreading is simply the next logical step.

The first computer operating systems (such as GM-NAA I/O) simply replaced (automated) some of the tasks that had been undertaken manually by a computer operator - load a program, load some utility routines the program could use (e.g. I/O routines), record some accounting data at the completion of the job. They did nothing during the execution of the job, but then there was nothing for them to do - no other work could proceed while the processor sat effectively idle, such as when waiting for an I/O to complete.

Then multi-programming operating systems were developed. Suddenly we had the opportunity to use the otherwise wasted CPU resource while one program was stalled on an I/O. In this case the O.S. would switch in another program. Generically this is known as scheduling, and operating systems developed (and still develop) more sophisticated ways of sharing out the CPU resource in order to achieve the greatest/fairest/best utilization.

At this point we had enshrined in the OS the idea that CPU resource was precious, not plentiful, and there should be features designed into the system to minimize its waste. This would reduce or delay the need for that upgrade to a faster computer as we continued to add new applications and features to existing applications. This is analogous to conserving water to offset the need for new dams & reservoirs.

With CMT, we have now taken this concept into silicon. If we think of a load or store to or from main (uncached) memory as a type of I/O, then thread switching in interleaved multithreading is just like the idea of a voluntary context switch. We are not giving up the CPU for the duration of the "I/O", but we are giving up the execution unit, knowing that if there is another thread that can use it, it will.
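
To make the analogy concrete, here is a toy simulation of a single-issue core that interleaves hardware threads, issuing each cycle from whichever thread is not stalled on memory. It is a sketch only - the burst lengths and stall latencies are invented, and it models no real SPARC pipeline:

    # Toy model of interleaved multithreading: each hardware thread
    # alternates a short compute burst with a long memory stall, and the
    # core issues one instruction per cycle from any thread that is ready.
    # (Illustrative numbers only - not a model of any real SPARC core.)

    def simulate(n_threads, compute=3, stall=20, cycles=100_000):
        ready_at = [0] * n_threads      # cycle when each thread may issue again
        left = [compute] * n_threads    # compute cycles left in current burst
        busy = 0
        for cycle in range(cycles):
            for i in range(n_threads):
                if ready_at[i] <= cycle:    # thread i is not stalled
                    busy += 1               # core issues one instruction
                    left[i] -= 1
                    if left[i] == 0:        # burst done: "I/O" to main memory
                        ready_at[i] = cycle + stall
                        left[i] = compute
                    break                   # one issue slot per cycle
        return busy / cycles

    for n in (1, 2, 4, 8):
        print(f"{n} thread(s): core busy {simulate(n):.0%} of cycles")

With these made-up latencies a single thread keeps the core busy only ~14% of the time, while eight threads drive it close to 100% - the same effect an OS scheduler achieves by switching processes across an I/O wait, just at the granularity of a memory stall.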

In a way, we are delaying the need to increase the clock rate or pipelining abilities of the cores by taking this step.

Now the underlying details of the implementation can be more complex than this (and they are getting more complex as we release newer CPU architectures like the UltraSPARC T2 Plus - see the T5140 Systems Architecture Whitepaper for details), but this analogy to I/Os and context switches works well for me in understanding why we have chosen this direction.

To continue to throw engineering resources at faster, more complicated CPU cores seems akin to the philosophy of the mainframe (the closest descendant of the earliest computers) - just make it do more of the same type of workload.

See here for the full collection of UltraSPARC T2 Plus blogs

Comments:

Nice writeup! Like you said, it is very simple (and obvious). Instead of the CPU waiting for the thread to complete IO, let another thread run on the CPU. Similarly, instead of waiting for a cache miss, let another [hardware] thread use the computing resource!

Posted by Neel on April 10, 2008 at 02:26 AM PDT #

We have very recently been benchmarking some hardware-threaded processors on some very large Siebel systems (T5220 and M5000). In getting to grips with some odd-looking system-level stats, it's become obvious that those using traditional CPU utilisation metrics for capacity planning are in for a rude awakening. I'd independently come to the conclusion that the hardware threading support on the SPARC64 VI is analogous to an OS despatching another process when the one currently in execution is blocked by an I/O request. In this case the re-scheduling is done by the core itself and the "I/O" is typically a main-memory fetch.

The issue that is now concerning me is the use of CPU utilisation measures as reported by (or even seen by) the OS. Hardware threading is a type of CPU virtualisation - in effect a virtual CPU. Unfortunately the OS generally knows and works on the statistics of the vCPU, and not the underlying utilisation of the core resources (itself a complicated game on a fine-grained processor like the T2, not necessarily easily characterised by a simple metric). That affects the way the OS behaves in areas such as prioritisation, scheduling, etc.

However, of more immediate concern is the measurement of capacity and the extrapolation of low and medium CPU utilisation figures on machines with hardware thread support. It has become clear, at least on the M5000 which we've tested more thoroughly, that CPU utilisation at low and medium levels is a very misleading statistic. A system reporting 45% CPU utilisation can be pushed to saturation with an increase in workload of no more than 15-20%. Response times on CPU-intensive threads and reported CPU utilisation can suddenly increase very rapidly indeed as the hidden high level of core utilisation becomes apparent (effectively queueing by hardware threads for core resources, in addition to the normal OS queueing apparent at high CPU utilisation).
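
As a back-of-envelope sketch of why this happens (the numbers here are invented for illustration, not measured SPARC64 VI figures): assume a single thread running alone already drives the core's shared pipeline to 85% of its two-thread peak.

    # Toy model of why reported vCPU utilisation understates core
    # saturation on a 2-thread core. The 0.85 figure is an assumption
    # chosen for illustration, not a measured SPARC64 VI characteristic.

    core_peak       = 1.00   # throughput with both hardware threads busy
    one_thread_gets = 0.85   # assumed single-thread fraction of that peak

    # With one runnable software thread per 2-thread core, the OS sees
    # one of two vCPUs busy, i.e. ~50% reported CPU utilisation ...
    reported_util = 0.50

    # ... but the real throughput headroom before saturation is only:
    headroom = core_peak / one_thread_gets - 1
    print(f"reported utilisation:     {reported_util:.0%}")   # 50%
    print(f"real throughput headroom: {headroom:.0%}")        # ~18%

Under that assumption the OS appears to promise roughly 2x headroom, while the core can actually absorb less than 20% more work - the same shape as the 45% / 15-20% observation above.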

We've yet to get to grips with testing T5220s at such flat-out loads, but we have a new project coming along where throughput is an important issue. Already we have seen projects where hardware-requirement extrapolations have been made from relatively low CPU utilisation levels. I've already taken some steps to get the test teams to consider these issues (at least there is corestat with the T1/T2). However, there is almost no general awareness of this issue among the IT people I know. It's actually analogous to vCPUs under a hypervisor, but at least there we have the hypervisor itself to get statistics from. With hardware threads in cores we are faced with no common architectural model and, in many cases, a complete lack of measures.

Posted by Steve Jones on May 18, 2008 at 06:18 PM PDT #
