
The Seduction of Single-Threaded Performance

The following is a dramatization, used to illustrate some concepts
regarding performance testing and the architecting of computer
systems. Artistic license may have been taken with events, people and
timelines. The performance data I have listed is, however, real and
current.

I got contacted recently by the Systems Architect of latestrage.com.
He has been a happy Sun customer for many years, but was a little
displeased when he took delivery of a beta test system of one of our
latest UltraSPARC servers.

"Not very fast", he said.

"Is that right, how is it not fast?", I inquired eagerly.

"Well, it's a lot slower than one of the LowMarginBrand x86 servers we
just bought", he trumpeted indignantly.

"How were you measuring their speed?", I asked, getting wary.

"Ahh, simple - we were compressing a big file. We were careful to not
let it be limited by I/O bandwidth or memory capacity, though..."

What then ensues is a discussion about what was being used to test
"performance", whether it matches latestrage.com's typical production
workload and further details about architecture and objectives.

Data compression utilities are a classic example of a seemingly mature
area in computing. Lots of utilities, lots of different algorithms, a
few options in some utilities, reasonable portability between operating
systems, but one significant shortcoming - there is no commonly
available utility that is multi-threaded.

Let me pretend I am still in this situation of using compression to
evaluate system performance, and that I want to compare the new Sun
SPARC Enterprise T5440 with a couple of current x86 servers. Here is my
first observation about such a test, using a single-threaded
compression utility:


Single-Threaded Throughput

Now, if you browse down to my older blog entries, you will see I have
written my own multi-threaded compression utility, Tamp. It consists of
one thread to read data, as many threads to compress or decompress data
as demand requires, and one thread to write data. Let me see whether I
can fully exploit the performance of the T5440 with Tamp...
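To make the reader/compressors/writer split concrete, here is a minimal sketch of that pipeline in Python. This is not the Tamp source, and it simplifies "as demand requires" to a fixed worker pool; `zlib` stands in for the actual Lempel-Ziv variant, and the 4-byte length-prefix framing is my own invention for the example.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 256 * 1024   # compress in independent 256 KB blocks, as Tamp does


def read_blocks(src):
    # the "reader" role: hand out fixed-size blocks from the input
    while True:
        block = src.read(BLOCK)
        if not block:
            return
        yield block


def compress_stream(src, dst, nworkers=4):
    # a pool of "compressor" threads; zlib.compress releases the GIL,
    # so the blocks genuinely compress in parallel
    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        # pool.map preserves input order, so this loop is the single
        # "writer": it emits blocks in the order they were read
        for cblock in pool.map(zlib.compress, read_blocks(src)):
            dst.write(len(cblock).to_bytes(4, "big"))  # length-prefix framing
            dst.write(cblock)
```

Because each 256 KB block is compressed independently, a decompressor only needs to walk the length-prefixed frames and inflate each one.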

Well, this turned out not to be quite the end of the story. I designed
my tests with the input file located on a TMPFS (in-memory) filesystem,
and with the output being discarded. This left the system free to focus
on the computation of compression, without being obscured by I/O. This
is the same objective that latestrage.com had.

What I found on the T5440 was that Tamp would not use more than 12-14 threads for
compression - it was limited by the speed at which a single thread could
read data from TMPFS.

So I turned to another dimension along which we can scale up work on a
server - adding more sources of workload. This is represented by
multiple "Units of Work" in my chart below.

After completing my experiments I discovered that, as expected, the T5440
may disappoint if we restrict ourselves to a workload that cannot
fully utilize the available processing capacity. If we add more work,
however, it handily surpasses the equivalent 4-socket quad-core x86
systems.


Multi-Threaded Throughput

Observing Single-Thread Performance on a T5440

A little side-story, and another illustration of how inadequate a
single-threaded workload is for determining the capability of the
T5440. Take a look at the following output from vmstat, and answer this
question:

Is this system "maxed out"?

(Note: the "us", "sy" and "id" columns list how much CPU time is spent in User, System and Idle modes, respectively)



 kthr      memory            page            disk          faults      cpu
 r b w   swap      free   re mf pi po fr de sr d0 d1 d2 d3   in   sy  cs us sy id
 0 0 0 1131540 12203120    1  8  0  0  0  0  0  0  0  0  0 3359 1552 419  0  0 100
 0 0 0 1131540 12203120    0  0  0  0  0  0  0  0  0  0  0 3364 1558 431  0  0 100
 0 0 0 1131540 12203120    0  0  0  0  0  0  0  0  0  0  0 3366 1478 420  0  0  99
 0 0 0 1131540 12203120    0  0  0  0  0  0  0  0  0  0  0 3354 1500 441  0  0 100
 0 0 0 1131540 12203120    0  0  0  0  0  0  0  0  0  0  0 3366 1549 460  0  0  99

Well, the answer is yes. It is running a single-threaded process, which is using 100% of one CPU. For the sake of my argument we will say the application is the critical application on the system. It has reached its highest throughput and is therefore "maxed out". You see, when one CPU represents less than 0.5% of the entire CPU capacity of a system, a single saturated CPU will be rounded down to 0%. In the case of the T5440, one CPU is 1/256th of the system, or 0.39%.
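The rounding effect is easy to verify for yourself. This tiny snippet reproduces the arithmetic for a 256-CPU system with one saturated CPU, using ordinary round-to-nearest as vmstat's integer columns do:

```python
ncpu = 256   # hardware threads in a T5440
busy = 1     # one CPU fully consumed by a single-threaded process

user_pct = round(100 * busy / ncpu)           # 0.39... rounds down to 0
idle_pct = round(100 * (ncpu - busy) / ncpu)  # 99.6... rounds up to 100

print(user_pct, idle_pct)
```

So the system genuinely is saturated for its critical application, while vmstat reports 0% user and 100% idle.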

Here is a tip for watching a system that might be doing nothing, but
then again might be doing something as fast as it can:



$ mpstat 3 | grep -v ' 100$'

This is what you might see:



CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    2   0   48   204    4    2    0    0    0    0   127    1   1   0  99
 32    0   0    0     2    0    3    0    0    0    0     0    0   8   0  92
 48    0   0    0     6    0    0    5    0    0    0     0  100   0   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    1   0   49   205    5    3    0    0    0    0   117    0   1   0  99
 32    0   0    0     4    0    5    0    0    1    0     0    0  14   0  86
 48    0   0    0     6    0    0    5    0    0    0     0  100   0   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0   48   204    4    2    0    0    0    0   103    0   1   0  99
 32    0   0    0     3    0    4    0    0    0    0     3    0  14   0  86
 48    0   0    0     6    0    0    5    0    0    0     0  100   0   0   0

mpstat uses "usr", "sys", and "idl" to represent CPU consumption. For more
on "wt" you can read my older blog.

For more on utilization, see the CPU/Processor page on solarisinternals.com

To read more about the Sun SPARC Enterprise T5440 which is announced today, go to Allan Packer's blog listing all the T5440 blogs.

Tamp - a Multi-Threaded Compression Utility

Some more details on this:

  • It uses a freely-available Lempel-Ziv-derived algorithm, optimised
    for compression speed
  • It was compiled using the same compiler and optimization settings
    for SPARC and x86.
  • It uses a compression block size of 256KB, so files smaller than this
    will not gain much benefit
  • I was compressing four 1GB database files. They were being reduced in
    size by a little over 60%.
  • Browse my blog for more details and a download
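The block-size point above also bounds the available parallelism: a file smaller than one block compresses as a single unit, so extra threads have nothing to do. A hypothetical helper (mine, not Tamp's) makes the arithmetic explicit:

```python
BLOCK = 256 * 1024   # Tamp's compression block size


def block_count(file_size):
    # number of independent 256 KB blocks in a file, i.e. the maximum
    # number of compression threads that could be kept busy on it
    return max(1, -(-file_size // BLOCK))   # ceiling division


print(block_count(100 * 1024))   # a 100 KB file: 1 block, no parallelism
print(block_count(1024 ** 3))    # a 1 GB file: 4096 blocks
```

This is why my test files were 1 GB each - thousands of independent blocks leave plenty of work for every hardware thread.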

Join the discussion

Comments ( 3 )
  • Kevin Hutchinson Monday, October 13, 2008

    From your graph it appears that 2 16-core x86 boxes can do the same work as 1 T5440 server? Now, 16 core x86 boxes aren't so pricey (see http://www.sun.com/servers/x64/x4440/ for example) so I'm hoping you'll tell me I'm wrong and the T5440 does way more work than 2 x 16-core x86 boxes?


  • miked Wednesday, October 15, 2008

    An interesting way to look at the data. However, no matter how you slice it there will be 2x boxes, 2x footprint, 2x maintenance, 2x administration, 2x (or more) power & cooling required, and a *lot* more to virtualize if not using Solaris. Did I miss something?


  • Lasse Reinhold Monday, December 29, 2008

    Hi Kevin,

    QuickLZ was heavily optimized for x86 with naive support for RISC added later.

