What Drove Processor Design Toward Chip Multithreading (CMT)?

I thought of a way of explaining the benefit of CMT (or more specifically, interleaved multithreading - see this article for details) using an analogy the other day.
Bear with me as I wax lyrical on computer history...

Back in the origins of the computer, there was only one process
(and only one processor). There was no operating system, so in turn
there were no concepts like:

  • scheduling
  • I/O interrupts
  • time-sharing
  • multi-threading

What am I getting at? Well, let me pick out a few of the advances in
computing, so I can explain why interleaved multithreading is simply the next logical step.

The first computer operating systems
simply replaced (automated) some of the tasks that were undertaken
manually by a computer operator - load a program, load some utility
routines that could be used by the program (e.g. I/O routines), record
some accounting data at the completion of the job. They did nothing
during the execution of the job, but then they had nothing to do - no
other work could be done while the processor was effectively idle, such
as when waiting for an I/O to complete.

Then multiprogramming operating systems were developed. Suddenly we
had the opportunity to use the otherwise wasted CPU resource while one
program was stalled on an I/O: in this case the OS would switch in
another program. Generically this is known as scheduling, and
operating systems developed (and continue to develop) more sophisticated
ways of sharing out the CPU resources in order to achieve the
greatest/fairest/best utilization.
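The payoff of switching work in while a program is blocked can be sketched with a toy example (plain Python threads, with a sleep standing in for a real device wait - the numbers and job structure are mine, not from any system described here). Two jobs that are both "waiting for I/O" overlap, so total wall time is close to the longest single wait, not the sum of the waits:

```python
# Toy illustration: overlapping I/O waits instead of serializing them.
import threading
import time

IO_WAIT = 0.3  # seconds each job spends "blocked on I/O" (simulated)

def job():
    time.sleep(IO_WAIT)   # blocked; the scheduler is free to run other work

start = time.time()
threads = [threading.Thread(target=job) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start

# Serial execution would take about 2 * IO_WAIT; overlapped, about IO_WAIT.
print(f"two jobs, elapsed ~ {elapsed:.2f}s (serial would be ~ {2 * IO_WAIT:.2f}s)")
```

Run it and the elapsed time comes out near one wait, not two - the same arithmetic that motivated multiprogramming in the first place.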

At this point we had enshrined in the OS the idea that CPU resource
was precious, not plentiful, and there should be features designed
into the system to minimize its waste. This would reduce or delay the
need for that upgrade to a faster computer as we continued to add new
applications and features to existing applications. This is analogous
to conserving water to offset the need for new dams & reservoirs.

With CMT, we have now taken this concept into silicon. If
we think of a load or store to or from main (uncached) memory as a
type of I/O, then thread switching in interleaved multithreading is
just like the idea of a voluntary context switch.
We are not giving up the CPU for the duration of the "I/O", but we are
giving up the execution unit, knowing that if there is another thread
that can use it, it will.
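To make the analogy concrete, here is a toy simulation of a single-issue core with interleaved hardware threads. It is purely illustrative - the fixed stall latency, the one-compute-cycle-then-miss pattern, and the round-robin thread selection are assumptions of the sketch, not a model of any real UltraSPARC pipeline:

```python
# Toy model of interleaved multithreading on a single-issue core.
# Each thread alternates one cycle of compute with a fixed "memory
# stall" (standing in for an uncached load/store). Illustrative only.

STALL = 3          # cycles a "memory access" keeps a thread waiting
WORK_UNITS = 100   # compute cycles each thread must complete

def run(num_threads):
    """Return (total_cycles, utilization) when the core switches to
    any ready thread each cycle instead of idling through stalls."""
    remaining = [WORK_UNITS] * num_threads   # compute cycles left per thread
    stalled_until = [0] * num_threads        # cycle at which each thread is ready
    cycle = 0
    busy = 0
    while any(r > 0 for r in remaining):
        # pick the first thread that is ready and still has work
        for t in range(num_threads):
            if remaining[t] > 0 and stalled_until[t] <= cycle:
                remaining[t] -= 1
                busy += 1
                stalled_until[t] = cycle + 1 + STALL  # issue a "memory op"
                break
        cycle += 1
    return cycle, busy / cycle

for n in (1, 2, 4):
    cycles, util = run(n)
    print(f"{n} thread(s): {cycles} cycles, {util:.0%} core utilization")
```

With one thread the core idles through every stall; with enough threads the stalls are completely hidden and the execution unit stays busy - the same total work finishes in barely more time than four times fewer threads would suggest.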

In a way, we are delaying the need to increase the clock rate or
pipelining abilities of the cores by taking this step.

Now the underlying details of the implementation can be more complex
than this (and they are getting more complex as we release newer CPU
architectures like the UltraSPARC T2 Plus - see the
T5140 Systems Architecture Whitepaper for details), but this
analogy to I/Os and context switches works
well for me to understand why we have chosen this direction.

To continue to throw engineering resources at faster, more complicated
CPU cores seems akin to the idea of the mainframe (the closest
descendant of early computers) - just make it do more of the same type
of workload.

See here for the full collection of UltraSPARC T2 Plus blogs


Comments (2)
  • Neel Thursday, April 10, 2008

    Nice writeup! Like you said, it is very simple (and obvious). Instead of the CPU waiting for the thread to complete IO, let another thread run on the CPU. Similarly, instead of waiting for a cache miss, let another [hardware] thread use the computing resource!

  • Steve Jones Monday, May 19, 2008

    We have very recently been benchmarking some hardware-threaded processors on a very large Siebel system (T5220 and M5000). In getting to grips with some odd-looking system-level stats, it's become obvious that those using traditional CPU utilisation metrics for capacity planning are in for a rude awakening. I'd independently come to the conclusion that the hardware threading support on the SPARC64 VI is analogous to an OS despatching another process when the one currently in execution is blocked by an I/O request. In this case the re-scheduling is done by the core itself and the "I/O" is typically a main-memory fetch.

    The issue that is now concerning me is the use of CPU utilisation measures as reported by (or even seen by) the OS. Hardware threading is a type of CPU virtualisation - in effect a virtual CPU. Unfortunately the OS generally knows and works on the statistics of the vCPU, and not the underlying utilisation of the core resources (itself a complicated game on a fine-grained processor like the T2, not necessarily easily characterised by a simple metric). That affects the way the OS behaves in areas such as prioritisation and scheduling.

    However, of more immediate concern is the measurement of capacity and extrapolation of low and medium CPU utilisation figures on machines with hardware thread support. It has become clear, at least on the M5000 which we've tested more thoroughly, that CPU utilisation at low and medium levels is a very misleading statistic. A system reporting 45% CPU utilisation can be pushed to saturation with an increase in workload of no more than 15-20%. Response times on CPU-intensive threads and reported CPU utilisation can suddenly increase very rapidly indeed as the hidden high level of core utilisation suddenly becomes apparent (effectively queueing by hardware threads for core resources in addition to the normal OS queueing apparent at high CPU utilisation).

    We've yet to get to grips with testing T5220s at such flat-out loads, but we have a new project coming along where throughput is an important issue. Already we have seen projects where extrapolations have been made for hardware requirements on relatively low CPU utilisation levels. I've already taken some steps to get the test teams to consider these issues (at least there is corestat with the T1/T2). However, there is almost no general awareness of this issue among most IT people that I know. It's actually analogous to vCPUs under a hypervisor, but at least there we have the hypervisor itself to get statistics from. With hardware threads in cores we are faced with no common architectural model and, in many cases, a complete lack of measures.
