Why does 1.6 beat 4.7?

Sun has upgraded the UltraSPARC T2 and UltraSPARC T2 Plus processors to 1.6 GHz. As described in some detail in yesterday's post, new results show SPEC CPU2006 performance improvements vs. previous systems that often exceed the clock speed improvement.  The scaling can be attributed to both memory system improvements and software improvements, such as the Sun Studio 12 Update 1 compiler.

An MHz improvement within a product line is often useful.  If yesterday's chip runs at speed n and today's runs at n * 1.12 then, intuitively, sure, I'll take today's.

Comparing MHz across product lines is often counter-intuitive.  Consider that Sun's new systems provide:

  • up to 68% more throughput than the 4.7 GHz POWER6+ [1], and
  • up to 3x the throughput of the Itanium 9150N [2].

The comparisons are particularly striking when one takes into account the cache size advantage for both the POWER6+ and the Itanium 9150N, and the MHz advantage for the POWER6+:

Processor            GHz   HW cache   Last cache    SPECint_rate_base2006
                           levels     (per chip)
UltraSPARC T2 /      1.6   2          4 MB          1 chip: 89;  2 chips: 171;  4 chips: 338
UltraSPARC T2 Plus
POWER6+              4.7   3          32 MB         best 2-chip result: 102; UltraSPARC T2 Plus
                                                    delivers 68% more integer throughput [1]
Itanium 9150N        1.6   3          24 MB         best 4-chip result: 114; UltraSPARC T2 Plus
                                                    delivers 3x the integer throughput [2]
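
Those ratios follow directly from the table: 171 / 102 ≈ 1.68, i.e. roughly 68% more throughput at two chips, and 338 / 114 ≈ 2.96, i.e. roughly 3x the throughput at four chips.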

These are per-chip results, not per-core or per-thread. Sun's CMT processors are designed for overall system throughput: how much work the system as a whole can get done.

A mystery: With comparatively smaller caches and modest clock rates, why do the Sun CMT processors win?

The performance hole: Memory latency. From the point of view of a CPU chip, the big performance problem is that memory latency is inordinately long compared to chip cycle times.

A hardware designer can attempt to cover up that latency with very large caches, as in the POWER6+ and Itanium, and this works well when running a small number of modest-sized applications. Large caches become less helpful, though, as workloads become more complex.

MHz isn't everything. In fact, MHz hardly counts at all when the problem is memory latency. Suppose the hot part of an application looks like this:

  loop:
       computational instruction
       computational instruction
       computational instruction
       memory access instruction
       branch to loop

For an application that looks like this, the computational instructions may complete in only a few cycles, while the memory access instruction may easily require on the order of 100ns - which, for a 1 GHz chip, is on the order of 100 cycles. If the processor speed is increased by a factor of 4, but memory speed is not, then memory is still 100ns away, and when measured in cycles, it is now 400 cycles distant. The overall loop hardly speeds up at all.
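
To make the loop concrete, here is a minimal C sketch (my illustration, not code from the original post or any SPEC workload; the pool size, iteration count, and constants are arbitrary). Each iteration does a little arithmetic and then one dependent load into a 256 MB pool, so nearly every load misses the caches; raising the clock rate shortens only the arithmetic, while the ~100 ns miss still dominates each trip around the loop.

    #include <stdio.h>
    #include <stdlib.h>

    #define NODES (4 * 1024 * 1024)   /* 4M nodes x 64 bytes = 256 MB of data */

    struct node { struct node *next; long pad[7]; };  /* one 64-byte cache line */

    int main(void)
    {
        struct node *pool = malloc(NODES * sizeof *pool);
        size_t *order = malloc(NODES * sizeof *order);
        if (!pool || !order) return 1;

        /* Link the nodes into one pseudo-random cycle so every load depends on
         * the previous one and the hardware prefetcher cannot hide the misses. */
        for (size_t i = 0; i < NODES; i++) order[i] = i;
        unsigned long r = 88172645463325252UL;
        for (size_t i = NODES - 1; i > 0; i--) {       /* Fisher-Yates shuffle */
            r = r * 6364136223846793005UL + 1442695040888963407UL;
            size_t j = (size_t)(r % (i + 1));
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }
        for (size_t i = 0; i < NODES; i++)
            pool[order[i]].next = &pool[order[(i + 1) % NODES]];
        struct node *p = &pool[order[0]];
        free(order);

        /* The hot loop from the text: three cheap "computational instructions"
         * and one memory access instruction per iteration. */
        long acc = 0;
        for (long i = 0; i < 50000000L; i++) {
            acc += i;            /* computational instruction */
            acc ^= acc >> 3;     /* computational instruction */
            acc *= 3;            /* computational instruction */
            p = p->next;         /* memory access instruction */
        }

        printf("acc=%ld  final node=%ld\n", acc, (long)(p - pool));
        free(pool);
        return 0;
    }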

Lest the reader think I am making this up - consider page 8 of this IBM talk from April 2008 regarding the POWER6:

[Figure: POWER6 cache and memory latencies, from page 8 of the IBM presentation]

The IBM POWER systems have some impressive performance characteristics - if your application is tiny enough to fit in its first or second level cache. But memory latency is not impressive. If your workload requires multiple concurrent threads accessing a large memory space, Sun's CMT approach just might be a better fit.

Operating System Overhead: A context switch from one process to another is mediated by operating system services. The OS saves the context of the currently running process - typically dozens of program registers and other state (such as virtual address space information); decides which process to run next (which may require access to several OS data structures); and loads the context for the new process (registers, virtual address context, etc.). If the system is running many processes, then caches are unlikely to be helpful during this context switch, and thousands of cycles may be spent on main memory accesses.
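
For readers who want to see that cost directly, here is a rough POSIX C sketch (my illustration, not part of the original benchmark work; ROUNDS is an arbitrary value). Two processes ping-pong one byte over a pair of pipes, so every round trip forces at least two OS-mediated context switches.

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define ROUNDS 100000

    int main(void)
    {
        int p2c[2], c2p[2];          /* parent-to-child and child-to-parent pipes */
        char byte = 'x';

        if (pipe(p2c) < 0 || pipe(c2p) < 0) return 1;

        if (fork() == 0) {           /* child: echo every byte straight back */
            for (int i = 0; i < ROUNDS; i++) {
                if (read(p2c[0], &byte, 1) != 1) _exit(1);
                if (write(c2p[1], &byte, 1) != 1) _exit(1);
            }
            _exit(0);
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ROUNDS; i++) {   /* parent: send, then wait for echo */
            if (write(p2c[1], &byte, 1) != 1) return 1;
            if (read(c2p[0], &byte, 1) != 1) return 1;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("~%.0f ns per round trip (>= two context switches)\n", ns / ROUNDS);
        return 0;
    }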

Design for throughput: Sun's CMT approach handles the complexity of real-world applications by allowing up to 64 processes to be simultaneously on-chip. When a long-latency stall occurs, such as an access to main memory, the chip switches to executing instructions on behalf of other, non-stalled threads, thus improving overall system throughput. No operating system intervention is required as resources are shared among the processes on the chip.
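
A simplified way to picture the benefit is the pthreads sketch below (my illustration, not Sun's code; the thread count and table size are arbitrary). Every worker spends most of its time stalled on cache misses into a 256 MB table; a chip that keeps many such threads resident and switches to another one whenever a load stalls keeps its pipelines busy, so aggregate throughput grows with the number of resident threads even though each individual thread is slow.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NTHREADS  8
    #define ENTRIES   (32 * 1024 * 1024)   /* 32M longs = 256 MB, far beyond cache */
    #define ACCESSES  10000000L

    static long *table;

    static void *worker(void *arg)
    {
        /* Pseudo-random indices make nearly every load a cache miss, so this
         * thread is "waiting on memory" for most of its lifetime. */
        unsigned long x = (unsigned long)(long)arg + 1;
        long sum = 0;
        for (long i = 0; i < ACCESSES; i++) {
            x = x * 6364136223846793005UL + 1;   /* cheap index generator */
            sum += table[x % ENTRIES];           /* long-latency memory access */
        }
        return (void *)sum;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];

        table = calloc(ENTRIES, sizeof *table);
        if (!table) return 1;

        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);

        puts("all workers finished");
        free(table);
        return 0;
    }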

[1] http://www.spec.org/cpu2006/results/res2009q2/cpu2006-20090427-07263.html
[2] http://www.spec.org/cpu2006/results/res2009q2/cpu2006-20090522-07485.html

Competitive results retrieved from www.spec.org on 20 July 2009.  Sun's CMT results have been submitted to SPEC.  SPEC, SPECfp, and SPECint are registered trademarks of the Standard Performance Evaluation Corporation.

Comments:

I think for many readers (myself included) a much more important comparison is to the Intel 5500 series (2 sockets, with SPECint_rate_base2006 starting at 188 for the really cheap E5520 CPU - i.e. faster than the T2) and the AMD 84nn (4 sockets - SPECint_rate_base2006 is 326 for the 8439SE, i.e. very close to the T2+), especially given the fact that those CPUs run Solaris AND Linux (and even Windows if necessary).

Posted by Igor on July 23, 2009 at 03:24 AM PDT #

igor> Intel Xeon or AMD needs to be compared as a whole system against the T2. The numbers look interesting, but what about the cryptographic units inside the USparc T2+, plus 10Gb Ethernet performance, bus and memory bandwidth, and the other parameters on which POWER or Itanium is compared with the T2, rather than Xeon or AMD systems?

Posted by gabojm on July 23, 2009 at 04:36 PM PDT #
