Sunday Oct 12, 2008

The Seduction of Single-Threaded Performance

The following is a dramatization, used to illustrate some concepts in performance testing and the architecture of computer systems. Artistic license may have been taken with events, people and time-lines. The performance data I have listed, however, is real and current.

I was contacted recently by the Systems Architect of one of our customers. He has been a happy Sun customer for many years, but was a little displeased when he took delivery of a beta test system of one of our latest UltraSPARC servers.

"Not very fast", he said.

"Is that right, how is it not fast?", I inquired eagerly.

"Well, it's a lot slower than one of the LowMarginBrand x86 servers we just bought", he trumpeted indignantly.

"How were you measuring their speed?", I asked, getting wary.

"Ahh, simple - we were compressing a big file. We were careful to not let it be limited by I/O bandwidth or memory capacity, though..."

What then ensues is a discussion about what was being used to test "performance", whether it matches the customer's typical production workload, and further details about architecture and objectives.

Data compression utilities are a classic example of a seemingly mature area in computing. Lots of utilities, lots of different algorithms, a few options in some utilities, reasonable portability between operating systems, but one significant shortcoming - there is no commonly available utility that is multi-threaded.

Let me pretend I am still in this situation of using compression to evaluate system performance, and I want to compare the new Sun SPARC Enterprise T5440 with a couple of current x86 servers. Here is my own first observation about such a test, using a single-threaded compression utility:

[Chart: Single-Threaded Throughput]

Now if you browse down to older blog entries, you will see I have written my own multi-threaded compression utility. It consists of a thread to read data, as many threads to compress or decompress data as demand requires, and one thread to write data. Let me see whether I can fully exploit the performance of the T5440 with Tamp...

Well, this turned out to be not quite the end of the story. I designed my tests with my input file located on a TMPFS (in-memory) filesystem, and with the output being discarded. This left the system focusing on the computation of compression, without being obscured by I/O. This is the same objective the customer had.

What I found on the T5440 was that Tamp would not use more than 12-14 threads for compression - it was limited by the speed at which a single thread could read data from TMPFS.
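That ceiling is simple pipeline arithmetic: a single reader can only feed the compressors as fast as it can stream data. The rates below are illustrative assumptions, not measurements from Tamp or the T5440; they are chosen only to show the shape of the calculation:

```python
# Hypothetical rates -- not measured figures.
read_rate_mb_s = 1000.0    # what one thread can stream out of TMPFS
compress_rate_mb_s = 80.0  # what one compression thread can consume

# A single reader can keep at most this many compressors busy;
# beyond that, extra threads just sit waiting for input.
max_useful_threads = read_rate_mb_s / compress_rate_mb_s
print(max_useful_threads)  # 12.5 -- the same ballpark as the 12-14 observed
```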

So, I chose to use another dimension by which we can scale up work on a server - add more sources of workload. This is represented by multiple "Units of Work" in my chart below.

After completing my experiments I discovered that, as expected, the T5440 may disappoint if we restrict ourselves to a workload that cannot fully utilize the available processing capacity. If we add more work, however, we find it handily surpasses equivalent 4-socket quad-core x86 systems.

[Chart: Multi-Threaded Throughput]

Observing Single-Thread Performance on a T5440

A little side-story, and another illustration of how inadequate a single-threaded workload is at determining the capability of the T5440. Take a look at the following output from vmstat, and answer this question:

Is this system "maxed out"?

(Note: the "us", "sy" and "id" columns list how much CPU time is spent in User, System and Idle modes, respectively)

 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr d0 d1 d2 d3   in   sy   cs us sy id 
 0 0 0 1131540 12203120 1  8  0  0  0  0  0  0  0  0  0 3359 1552 419  0  0 100 
 0 0 0 1131540 12203120 0  0  0  0  0  0  0  0  0  0  0 3364 1558 431  0  0 100 
 0 0 0 1131540 12203120 0  0  0  0  0  0  0  0  0  0  0 3366 1478 420  0  0 99 
 0 0 0 1131540 12203120 0  0  0  0  0  0  0  0  0  0  0 3354 1500 441  0  0 100 
 0 0 0 1131540 12203120 0  0  0  0  0  0  0  0  0  0  0 3366 1549 460  0  0 99 

Well, the answer is yes. It is running a single-threaded process, which is using 100% of one CPU. For the sake of my argument we will say the application is the critical application on the system. It has reached its highest throughput and is therefore "maxed out". You see, when one CPU represents less than 0.5% of the entire CPU capacity of a system, a single saturated CPU will be rounded down to 0%. In the case of the T5440, one CPU is 1/256th of the system, or 0.39%.
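The rounding is easy to verify: vmstat's CPU columns are system-wide integer percentages, so one saturated hardware thread out of 256 simply vanishes. A quick check of the arithmetic (the T5440 presents 4 sockets × 8 cores × 8 threads = 256 CPUs):

```python
ncpu = 256   # CPUs the T5440 presents (4 sockets x 8 cores x 8 threads)
busy = 1     # one single-threaded process saturating one CPU

user_pct = 100.0 * busy / ncpu
print(user_pct)            # 0.390625 -- vmstat's integer "us" column shows 0
assert int(user_pct) == 0  # the saturated CPU is rounded away entirely
```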

Here is a tip for watching a system that might be doing nothing, but then again might be doing something as fast as it can:

$ mpstat 3 | grep -v ' 100$'

This is what you might see:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    2   0   48   204    4    2    0    0    0    0   127    1   1   0  99
 32    0   0    0     2    0    3    0    0    0    0     0    0   8   0  92
 48    0   0    0     6    0    0    5    0    0    0     0  100   0   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    1   0   49   205    5    3    0    0    0    0   117    0   1   0  99
 32    0   0    0     4    0    5    0    0    1    0     0    0  14   0  86
 48    0   0    0     6    0    0    5    0    0    0     0  100   0   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0   48   204    4    2    0    0    0    0   103    0   1   0  99
 32    0   0    0     3    0    4    0    0    0    0     3    0  14   0  86
 48    0   0    0     6    0    0    5    0    0    0     0  100   0   0   0

mpstat uses "usr", "sys", and "idl" to represent CPU consumption. For more on "wt" you can read my older blog.

For more on utilization, see the CPU/Processor page.

To read more about the Sun SPARC Enterprise T5440, which was announced today, go to Allan Packer's blog listing all the T5440 blogs.

Tamp - a Multi-Threaded Compression Utility

Some more details on this:

  • It uses a freely-available Lempel-Ziv-derived algorithm, optimised for compression speed
  • It was compiled using the same compiler and optimization settings for SPARC and x86.
  • It uses a compression block size of 256KB, so files smaller than this will not gain much benefit
  • I was compressing four 1GB database files. They were being reduced in size by a little over 60%.
  • Browse my blog for more details and a download

Monday Apr 21, 2008

Comparing the UltraSPARC T2 Plus to Other Recent SPARC Processors

Update - the UltraSPARC T2 Plus has now been released, and is available in several new Sun servers. Allan Packer has published a new collection of blog entries that provide lots of detail.

Here is my updated table of details comparing a number of current SPARC processors. I cannot guarantee 100% accuracy on this, but I did quite a bit of reading...

| Codename | Panther | Olympus-C | Niagara | Niagara 2 | Victoria Falls |
| process | 90nm | 90nm | 90nm | 65nm | 65nm |
| die size | 335 mm2 | 421 mm2 | 379 mm2 | 342 mm2 | |
| pins | 1368 | | 1933 | 1831 | |
| transistors | 295 M | 540 M | 279 M | 503 M | |
| clock | 1.5 – 2.1 GHz | 2.15 – 2.4 GHz | 1.0 – 1.4 GHz | 1.0 – 1.4 GHz | 1.2 – 1.4 GHz |
| cores | 2 | 2 | 8 | 8 | 8 |
| threads/core | 1 | 2 | 4 | 8 | 8 |
| threads/chip | 2 | 4 | 32 | 64 | 64 |
| FPU : IU | 1 : 1 | 1 : 1 | 1 : 8 | 1 : 1 | 1 : 1 |
| integration | | | 8 × small crypto | 8 × large crypto, PCI-E, 2 × 10GbE | 8 × large crypto, PCI-E, multi-socket coherency |
| virtualization | domains1 | domains1 | hypervisor | hypervisor | hypervisor |
| L1 i$ | 64K/core | 128K/core | 16K/core | 16K/core | 16K/core |
| L1 d$ | 64K/core | 128K/core | 8K/core | 8K/core | 8K/core |
| L2 cache (on-chip) | 2MB, shared, 4-way, 64B lines | 6MB, shared, 10-way, 256B lines | 3MB, shared, 12-way, 64B lines | 4MB, shared, 16-way, 64B lines | 4MB, shared, 16-way, 64B lines |
| L3 cache | 32MB shared, 4-way, tags on-chip, 64B lines | n/a | n/a | n/a | n/a |
| MMU | on-chip | on-chip | on-chip | on-chip | on-chip |
| memory controller | | | on-chip, 4 × DDR2 | on-chip, 4 × FB-DIMM | on-chip, 2 × FB-DIMM |
| Memory Models | TSO | TSO | TSO, limited RMO | TSO, limited RMO | TSO, limited RMO |
| Physical Address Space | 43 bits | 47 bits | 40 bits | 40 bits | 40 bits |
| i-TLB | 16 FA + 512 2-way SA | | 64 FA | 64 FA | 64 FA |
| d-TLB | 16 FA + 512 2-way SA | | 64 FA | 128 FA | 128 FA |
| combined TLB | | 32 FA + 2048 2-way SA | | | |
| Page sizes | 8K, 64K, 512K, 4M, 32M, 256M | 8K, 64K, 512K, 4M, 32M, 256M | 8K, 64K, 4M, 256M | 8K, 64K, 4M, 256M | 8K, 64K, 4M, 256M |
| Memory bandwidth2 (GB/sec) | 9.6 | | 25.6 | 60+ | 32 |


  • 1 - domains are implemented above the processor/chip level
  • 2 - theoretical peak - does not take cache coherency or other limits into account


  • FA - fully-associative
  • FPU - Floating Point Unit
  • i-TLB - Instruction Translation Lookaside Buffer (d means Data)
  • IU - Integer (execution) Unit
  • L1 - Level 1 (similarly for L2, L3)
  • MMU - Memory Management Unit
  • RMO - Relaxed Memory Order
  • SA - set-associative
  • TSO - Total Store Order


Tuesday Apr 08, 2008

What Drove Processor Design Toward Chip Multithreading (CMT)?

The other day I thought of a way of explaining the benefit of CMT (or more specifically, interleaved multithreading - see this article for details) using an analogy. Bear with me as I wax lyrical on computer history...

Way back at the origin of the computer, there was only one process (and only one processor). There was no operating system, so in turn there were no concepts like:

  • scheduling
  • I/O interrupts
  • time-sharing
  • multi-threading

What am I getting at? Well, let me pick out a few of the advances in computing, so I can explain why interleaved multithreading is simply the next logical step.

The first computer operating systems (such as GM-NAA I/O) simply automated some of the tasks that had been performed manually by a computer operator - load a program, load some utility routines the program could use (e.g. I/O routines), record some accounting data at the completion of the job. They did nothing during the execution of the job, but then there was nothing for them to do - no other work could proceed while the processor sat effectively idle, such as when waiting for an I/O to complete.

Then multiprogramming operating systems were developed. Suddenly we had the opportunity to use the otherwise wasted CPU resource while one program was stalled on an I/O. In this case the O.S. would switch in another program. Generically this is known as scheduling, and operating systems developed (and still develop) more sophisticated ways of sharing out the CPU resources in order to achieve the greatest/fairest/best utilization.

At this point we had enshrined in the OS the idea that CPU resource was precious, not plentiful, and there should be features designed into the system to minimize its waste. This would reduce or delay the need for that upgrade to a faster computer as we continued to add new applications and features to existing applications. This is analogous to conserving water to offset the need for new dams & reservoirs.

With CMT, we have now taken this concept into silicon. If we think of a load or store to or from main (uncached) memory as a type of I/O, then thread switching in interleaved multithreading is just like the idea of a voluntary context switch. We are not giving up the CPU for the duration of the "I/O", but we are giving up the execution unit, knowing that if there is another thread that can use it, it will.
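A toy model makes the payoff concrete. Below, each thread computes for one cycle and then stalls for a few cycles on a memory access (the "I/O"); the core's single execution unit issues from whichever thread is ready. The latency figure is arbitrary, chosen only to illustrate the effect, and is not a real CPU's number:

```python
MEM_LATENCY = 3  # stall cycles per memory access; arbitrary illustrative value

def pipeline_utilization(nthreads, cycles=10_000):
    """Fraction of cycles the execution unit does useful work."""
    ready_at = [0] * nthreads           # cycle when each thread can issue again
    busy = 0
    for cycle in range(cycles):
        for t in range(nthreads):       # issue from the first ready thread
            if ready_at[t] <= cycle:
                busy += 1               # one cycle of useful work...
                ready_at[t] = cycle + 1 + MEM_LATENCY  # ...then a memory stall
                break                   # one execution unit: one issue per cycle
    return busy / cycles

print(pipeline_utilization(1))  # 0.25 -- one thread leaves the unit idle 75% of the time
print(pipeline_utilization(4))  # 1.0  -- four interleaved threads hide the stalls
```

This is the same trade the operating system made decades earlier, one level down: keep the expensive resource busy by always having another runnable thread at hand.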

In a way, we are delaying the need to increase the clock rate or pipelining abilities of the cores by taking this step.

Now the underlying details of the implementation can be more complex than this (and they are getting more complex as we release newer CPU architectures like the UltraSPARC T2 Plus - see the T5140 Systems Architecture Whitepaper for details), but this analogy to I/O's and context switches works well for me to understand why we have chosen this direction.

To continue to throw engineering resources at faster, more complicated CPU cores seems to be akin to the idea of the mainframe (the closest descendant to early computers) - just make it do more of the same type of workload.

See here for the full collection of UltraSPARC T2 Plus blogs


Tim Cook's Weblog The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

