Why does 1.6 beat 4.7?
By jhenning on Jul 22, 2009
Sun has upgraded the UltraSPARC T2 and UltraSPARC T2 Plus processors to 1.6 GHz. As described in some detail in yesterday's post, new results show SPEC CPU2006 performance improvements vs. previous systems that often exceed the clock speed improvement. The scaling can be attributed to both memory system improvements and software improvements, such as the Sun Studio 12 Update 1 compiler.
A MHz improvement within a product line is often useful. If yesterday's chip runs at speed n and today's runs at 1.12 × n, then, intuitively, sure, I'll take today's.
Comparing MHz across product lines is often counter-intuitive. Consider that Sun's new systems provide:
- up to 68% more throughput than the 4.7 GHz POWER6+, and
- up to 3x the throughput of the Itanium 9150N.
The comparisons are particularly striking when one takes into account the cache size advantage for both the POWER6+ and the Itanium 9150N, and the MHz advantage for the POWER6+:
| Processor | GHz | HW cache levels | Cache size | Integer throughput results |
|---|---|---|---|---|
| UltraSPARC T2 Plus | 1.6 | 2 | 4 MB | 1 chip: 89; 2 chips: 171; 4 chips: 338 |
| POWER6+ | 4.7 | 3 | 32 MB | Best 2 chip result: 102. UltraSPARC T2 Plus delivers 68% more integer throughput. |
| Itanium 9150N | 1.6 | 3 | 24 MB | Best 4 chip result: 114. UltraSPARC T2 Plus delivers 3x the integer throughput. |
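The headline ratios can be checked from the figures quoted in the table. The short sketch below assumes the comparison is made at equal chip counts (2 chips against the best 2-chip POWER6+ result, 4 chips against the best 4-chip Itanium result); the variable names are illustrative only.

```python
# Integer throughput figures as quoted in the table (chips -> result).
t2_plus = {1: 89, 2: 171, 4: 338}
power6_plus_2chip = 102      # best 2-chip POWER6+ result
itanium_9150n_4chip = 114    # best 4-chip Itanium 9150N result

# Assumption: each comparison is made at an equal chip count.
power_ratio = t2_plus[2] / power6_plus_2chip
itanium_ratio = t2_plus[4] / itanium_9150n_4chip

print(f"vs POWER6+ (2 chips): {power_ratio:.2f}x, "
      f"about {100 * (power_ratio - 1):.0f}% more throughput")
print(f"vs Itanium 9150N (4 chips): {itanium_ratio:.2f}x the throughput")
```

Under that assumption the numbers come out at roughly 1.68x and 2.96x, consistent with the "68% more" and "3x" claims.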
These are per-chip results, not per-core or per-thread. Sun's CMT processors are designed for overall system throughput: how much work can the overall system get done.
A mystery: With comparatively smaller caches and modest clock rates, why do the Sun CMT processors win?
The performance hole: Memory latency. From the point of view of a CPU chip, the big performance problem is that memory latency is inordinately long compared to chip cycle times.
A hardware designer can attempt to cover up that latency with very large caches, as in the POWER6+ and Itanium, and this works well when running a small number of modest-sized applications. Large caches become less helpful, though, as workloads become more complex.
MHz isn't everything. In fact, MHz hardly counts at all when the problem is memory latency. Suppose the hot part of an application looks like this:
loop:
    computational instruction
    computational instruction
    computational instruction
    memory access instruction
    branch to loop
For an application that looks like this, the computational instructions may complete in only a few cycles, while the memory access instruction may easily require on the order of 100 ns, which, for a 1 GHz chip, is on the order of 100 cycles. If the processor speed is increased by a factor of 4, but memory speed is not, then memory is still 100 ns away, and when measured in cycles, it is now 400 cycles distant. The overall loop hardly speeds up at all.
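To put rough numbers on this, here is a minimal sketch of the loop's cost per iteration. It assumes, purely for illustration, that the four non-memory instructions (three computational plus the branch) take one cycle each and that every memory access pays the full 100 ns latency; the function names are hypothetical.

```python
def memory_latency_cycles(clock_ghz, memory_latency_ns=100.0):
    """How many CPU cycles a fixed memory latency costs at a given clock."""
    return memory_latency_ns * clock_ghz

def loop_iteration_ns(clock_ghz, compute_cycles=4, memory_latency_ns=100.0):
    """Time for one loop iteration: compute work plus one memory access.

    Assumption: each non-memory instruction takes exactly one cycle and
    nothing overlaps with the memory access.
    """
    cycle_ns = 1.0 / clock_ghz
    return compute_cycles * cycle_ns + memory_latency_ns

slow = loop_iteration_ns(1.0)   # 1 GHz chip
fast = loop_iteration_ns(4.0)   # same chip at 4x the clock, same memory

print(f"1 GHz: {slow:.0f} ns per iteration, "
      f"memory is {memory_latency_cycles(1.0):.0f} cycles away")
print(f"4 GHz: {fast:.0f} ns per iteration, "
      f"memory is {memory_latency_cycles(4.0):.0f} cycles away")
print(f"Speedup from 4x the clock: {slow / fast:.2f}x")
```

With these assumed numbers, quadrupling the clock shortens each iteration from 104 ns to 101 ns, a speedup of about 3%.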
Lest the reader think I am making this up, consider page 8 of this IBM talk from April 2008 regarding the POWER6.
The IBM POWER systems have some impressive performance characteristics, provided your application is tiny enough to fit in the first- or second-level cache. But memory latency is not impressive. If your workload requires multiple concurrent threads accessing a large memory space, Sun's CMT approach just might be a better fit.
Operating System Overhead: A context switch from one process to another is mediated by operating system services. The OS parks context from the process that is currently running - typically saving dozens of program registers and other context (such as virtual address space information); decides which process to run next (which may require access to several OS data structures); and loads the context for the new process (registers, virtual address context, etc.). If the system is running many processes, then caches are unlikely to be helpful during this context switch, and thousands of cycles may be spent on main memory accesses.
Design for throughput: Sun's CMT approach handles the complexity of real-world applications by allowing up to 64 processes to be simultaneously on-chip. When a long-latency stall occurs, such as an access to main memory, the chip switches to executing instructions on behalf of other, non-stalled threads, thus improving overall system throughput. No operating system intervention is required as resources are shared among the processes on the chip.
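The benefit of switching among threads on a stall can be illustrated with a deliberately idealized model. It is not a description of the actual hardware: it assumes every thread alternates a fixed burst of compute cycles with a fixed memory stall, that switching threads is free, and that memory bandwidth is never the limit. The function name and the cycle counts (4 compute cycles, 160 stall cycles, i.e. 100 ns at 1.6 GHz) are illustrative assumptions.

```python
def pipeline_utilization(threads, compute_cycles, stall_cycles):
    """Fraction of cycles a shared pipeline spends doing useful work.

    Idealized model: each thread is busy for compute_cycles out of every
    (compute_cycles + stall_cycles), and stalls of different threads
    overlap perfectly until the pipeline is fully busy.
    """
    per_thread = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, threads * per_thread)

# Assumed workload: 4 cycles of work, then a 160-cycle memory stall.
for n in (1, 2, 4, 8):
    u = pipeline_utilization(n, compute_cycles=4, stall_cycles=160)
    print(f"{n} thread(s): pipeline busy {u:.1%} of the time")
```

In this model throughput grows linearly with the number of hardware threads until the pipeline saturates, which is the intuition behind hiding memory latency with more threads rather than with a faster clock.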
Competitive results retrieved from www.spec.org 20 July 2009. Sun's CMT results have been submitted to SPEC. SPEC, SPECfp, SPECint are registered trademarks of the Standard Performance Evaluation Corporation.