POWER6 Goes Thud: Part IV

A Look at POWER6's Lagging Architecture

Ever since Sun announced the world's first multi-core microprocessor, most of the high-volume chip-designing world (that world would be Sun, IBM, AMD, and Intel) has realized that the race is on to see who can build a microprocessor with the most cores and the most threads to handle the modern applications that thread-rich, Internet-based computing has spawned. That "most" includes everyone except IBM.

Declaring the end of instruction-level parallelism (ILP) and the advent of the era of thread-level parallelism (TLP) as far back as IDF 2003, Intel has already begun shipping quad-core processors (Sun is the first to ship systems). AMD will follow suit later this year. Sun, of course, set the bar in 2005 with its 8-core, 32-thread UltraSPARC T1 and will be shipping the 8-core, 64-thread "Niagara 2" processor in systems later this year. I'm guessing we'll probably beat the 2-core, 4-thread POWER6 to the volume market by a pretty solid margin, mostly because IBM failed to announce any availability of POWER6 in the volume (or high-end, for that matter) markets.

The reason for this enthusiasm around multithreading is described brilliantly in a whitepaper titled The Landscape of Parallel Computing Research: A View From Berkeley, written by a multidisciplinary group of Berkeley researchers, including the father of RISC, David Patterson. They see microprocessor performance hitting a "brick wall" due to three factors:

  • The Power Wall: "Power is expensive, but transistors are 'free'. That is, we can put more transistors on a chip than we have the power to turn on." Particularly true on a 65nm chip like POWER6 or Niagara 2, unless you do something about it.
  • The Memory Wall: "Load and store is slow, but multiply is fast. Modern microprocessors can take 200 clocks to access Dynamic Random Access Memory (DRAM), but even floating-point multiplies may take only four clock cycles." And increasing the size of the already great big caches traditionally used to mask memory latency isn't giving us a good return on the transistor investment anymore.
  • The ILP Wall: "There are diminishing returns on finding more ILP. ... Increasing parallelism is the primary method of improving processor performance". They dismiss increasing clock frequency as the primary method of improving processor performance as old conventional wisdom.
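The Memory Wall numbers also hint at why extra hardware threads help. As a rough back-of-the-envelope sketch (not taken from the whitepaper), assume each thread alternates between a burst of useful work and a stall on DRAM, and that a core can switch to another ready thread at no cost. The function name `core_utilization` and the one-miss-per-multiply pattern are illustrative assumptions; only the 200-clock and 4-clock figures come from the quote above.

```python
def core_utilization(compute_cycles, stall_cycles, threads):
    """Fraction of cycles a core does useful work in a toy model.

    Assumes every thread repeats: compute_cycles of work, then
    stall_cycles waiting on memory, with zero-cost thread switching.
    """
    per_thread = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, threads * per_thread)


# 200-clock DRAM access and 4-clock multiply are from the quote;
# one miss per multiply is a deliberately pessimistic assumption.
for threads in (1, 2, 4, 8):
    busy = core_utilization(4, 200, threads)
    print(f"{threads} thread(s): core busy {busy:.1%} of the time")
```

In this toy model a single thread leaves the core idle almost all the time, and each added thread fills in a proportional share of the stall until the core saturates. Real workloads miss far less often, so treat the percentages as directional only.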

To illustrate the point, Figure 2 of the whitepaper shows the impact of the old faster-clockrate, bigger-caches mentality that has prevailed in microprocessor design for the last thirty years:

The Brick Wall

The upper right part of the chart shows processor performance lagging since 2002. The green line coincides remarkably closely with the doubling of performance most people incorrectly attribute to Moore's Law. (Moore's Law is a statement about transistor density, not performance; but given that performance had been tracking so closely with transistor density from 1986 to 2002, one could be forgiven for conflating the two -- but not any longer.)

So, higher clock frequencies are no longer the key to performance, and neither is spending your free Moore's Law transistors on bigger caches. So what did IBM do? They more than doubled their clock rate (2.2GHz to 4.7GHz) and quadrupled the size of their L2 on-chip caches (1.92MB on POWER5+, 8MB on POWER6). And what performance speed-up did they get for their efforts?

rPerf Relative to POWER5+

To answer that, let's look at IBM's proprietary rPerf benchmark which, according to IBM, is a benchmark that "simulates some of the system operations such as CPU, cache and memory. However, the model does not simulate disk or network I/O operations." Take a look at the right to see how well doubling the frequency and quadrupling the cache size worked for POWER6 [*] (the green line is, once again, roughly 52% performance increase per year).
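To put the green line in perspective, here is a small sketch of what compounding at roughly 52% per year implies, and how many years of that trend the POWER6 rPerf gain is worth. The function names are made up for illustration, and the two rPerf values are the POWER5+ @ 1.9GHz and POWER6 @ 4.7GHz figures from the footnote at the end of this entry.

```python
import math

TREND_RATE = 0.52  # roughly 52% per year, the green line in the charts


def trend_speedup(years, rate=TREND_RATE):
    """Speed-up expected after `years` of compound growth at `rate`."""
    return (1.0 + rate) ** years


def years_of_trend(speedup, rate=TREND_RATE):
    """How many years of growth at `rate` a given speed-up is worth."""
    return math.log(speedup) / math.log(1.0 + rate)


# rPerf figures from the footnote: POWER5+ @ 1.9GHz vs POWER6 @ 4.7GHz.
power6_gain = 20.13 / 12.27
print(f"POWER6 rPerf gain: {power6_gain:.2f}x")
print(f"Equivalent years on the 52% trend: {years_of_trend(power6_gain):.1f}")
print(f"Trend after 2 years: {trend_speedup(2):.2f}x")
```

This compares only the two endpoints of the rPerf data; it says nothing about how far apart the two chips actually shipped.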

POWER6 is clearly in the category of diminishing performance returns described by the Berkeley whitepaper because it has pinned its hopes on old, unimaginative, and out-of-date techniques that the rest of the industry has largely abandoned. For all IBM's hype around POWER6 being "convention shattering", it's still only two cores per chip and two threads per core, just like POWER5+. Moreover, they sacrificed the out-of-order execution that gave POWER5+ a boost to get the frequency and cache size increases. In most respects, its enhancements are completely evolutionary and not at all revolutionary. And in some respects, they're actually going backwards.

There's more. POWER6 is pretty much all we're going to see from IBM for the next three years at least (plus or minus still more clock speed increases!), since POWER7 won't be around until 2010 at the earliest. And there's still no word from IBM on when the missing entry-level and high-end POWER6 systems will show up.

So how is Sun stacking up against this? Sun has already publicly stated that we will be releasing three new 100% binary compatible SPARC processors over the course of the next eighteen months, each one optimized and targeted to different application workloads. And at the 2007 Sun Analyst Summit, Sun's Vice President of Systems, John Fowler, gave a presentation that included this performance roadmap for SPARC (I've updated the little sunburst milestones and once again, I've drawn in the bright green line representing roughly 52% performance increase per year):

SPARC Processor Performance Increases

We're already devastating IBM with Sun's CoolThreads™ servers in terms of performance, rack space, power, and cooling. By the end of this year, we'll do it again. And as I said in a previous blog entry, when the ROCK systems bring the power of chip multi-threading to the high end, there will be absolutely no reason any customer would want a POWER system unless that particular shade of IBM blue went better with his datacenter décor.

The reason for all this is that years ago, Sun recognized that we were facing exactly the problems described in the Berkeley whitepaper and decided to do something extraordinary. Rather than focus on clock frequency and cache size, like IBM, we decided to question everything that everyone knew about how to make a microprocessor go fast. The result was a big bet on chip multi-threading, or CMT, that's been paying off since 2005. That's why Sun's CoolThreads servers are the fastest-ramping product line in Sun Microsystems history and why customers are so enthusiastically endorsing them.

And you ain't seen nothin' yet. That's what's keeping IBM up at night.



* This chart is based on normalizing the following rPerf numbers from IBM to 12.27: POWER5+@1.9GHz: 12.27; POWER5+@2.1GHz: 13.83; POWER6@3.5GHz: 15.85; POWER6@4.2GHz: 18.38; POWER6@4.7GHz: 20.13.
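For anyone who wants to reproduce the normalization, a minimal sketch follows. It takes the footnote's figures at face value and divides both clock rate and rPerf by the POWER5+ @ 1.9GHz entry; the dictionary layout and the `normalize` helper are just one hypothetical way to organize the data, not how the original chart was produced.

```python
# rPerf figures quoted in the footnote: name -> (clock in GHz, rPerf).
rperf = {
    "POWER5+ @ 1.9GHz": (1.9, 12.27),
    "POWER5+ @ 2.1GHz": (2.1, 13.83),
    "POWER6 @ 3.5GHz": (3.5, 15.85),
    "POWER6 @ 4.2GHz": (4.2, 18.38),
    "POWER6 @ 4.7GHz": (4.7, 20.13),
}
BASELINE = "POWER5+ @ 1.9GHz"


def normalize(table, baseline):
    """Express each entry's clock and rPerf relative to the baseline."""
    base_clock, base_score = table[baseline]
    return {
        name: (clock / base_clock, score / base_score)
        for name, (clock, score) in table.items()
    }


relative = normalize(rperf, BASELINE)
for name, (clock_ratio, perf_ratio) in relative.items():
    print(f"{name:18s} clock x{clock_ratio:.2f}  rPerf x{perf_ratio:.2f}")
```

Printing the clock ratio next to the rPerf ratio makes the gap between frequency gains and measured gains easy to see for each part.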

Comments:

Are you claiming that each Rock core will perform as well as an UltraSPARC IV+? That's quite a claim!

Posted by Kevin on June 07, 2007 at 09:53 PM EDT #

I'm just a bit confused by those 2 graphs at the end, you've got US IIIi down as 2004, but wasn't it released in 2002?

Even so, it looks like you're still claiming a 2x performance increase per year every year for the last 5 years, that's massively above the levels Moore's law aims for. Are you really claiming that Sun has a volume processor in their fabs ready to go today that's 35 times faster at specific real world loads than their volume processor of 2004 was?

If so, fantastic, can we see some benchmarks of it running things like Java application servers and Oracle databases? If it's the phenomenon that you're saying it is, then there's absolutely no point in keeping it quiet with Power6 out of the closet, you know the target you have to beat now after all.

The same goes for your enterprise graph, if you're claiming real world performance improvement of 16 times in 2-3 years from 2006's Ultrasparc IV+ then again it's amazing, a huge revolutionary leap, and I'd like to see some real benchmarks not just a graph with "ROCK 16*" on it please, Power6 and Itanium will never match the improvement you're claiming so you don't need to hide it. Or of course, this could all just be more benchmarketing just like IBM is engaging in with Power6, and we'll find out at the end of this year and next that Niagara 2 and Rock are "just" good incremental improvements over what came before. But Sun would never do that, would they? :)

Posted by Ewan on June 08, 2007 at 04:24 AM EDT #

Ewan,

I do not know where you have been for the last 18 months, but Sun has been producing benchmarks on the UltraSPARC T1 which back up the claims of 15X performance per socket of the US-IIIi.

Despite IBM's just-released 4.7 GHz POWER6, Sun's 18-month-old UltraSPARC T1 remains the world's most powerful microprocessor for threaded commercial workloads.

For example, Sun recently announced a new SPECjbb2005 result for the UltraSPARC T1, at 96,523 SPECjbb2005 bops per processor. That is 11% higher than the POWER6 which only produced 86,497 SPECjbb2005 bops per processor.

If you look at SPECjAppServer2004 benchmarks, you will see several results where the Sun Fire T2000 is used as the database server for the appservers. There are examples running both Oracle and IBM DB2. Compare those results to other vendors' results and you will see some fairly big iron (HP rx8620) needed to support similar results.

To me, this is definitive proof of what John Meyer is saying. An older generation, 90nm, 1.4 GHz, 300 million transistor UltraSPARC outperforms a 65nm, 4.7 GHz, 750 million transistor processor.

And if you follow Sun, you know the next generation Niagara processor is in test now. It should double performance, meaning UltraSPARC T2 will likely more than double the performance per chip of POWER6.

CMT principles win. Period.

Full disclosure: I do not work for Sun Microsystems.

Posted by Mark on June 09, 2007 at 06:40 AM EDT #

That is an impressive SPECjbb2005 result, it doesn't seem to be on the spec website which is why I've never seen it before but I guess that's just an admin issue holding things up.

My worry about a result of 96,523 SPECjbb2005 bops per processor is that for the T1 I think the limit is actually 1 processor with 8 cores, so the 691,975 bops that the 8 processor 16-core Power6 p570 achieved isn't entirely comparable.

For the jAppServer2004 results, I see an excellent result for a 1 processor machine from a T2000 with 8 cores, 801.70 jops, but then there's a problem with jAppServer2004 for database performance testing in that the database isn't the focus of the test, it's an application server benchmark which needs to have a database behind it.

If I needed a relatively small and cost effective Java application server platform, I'd definitely investigate the T1000 and T2000 servers, but it is a pretty narrow requirement and right now it seems like if you have a larger database which needs a lot of processing power without investing in 64 cores worth of Oracle per-cpu processor licensing, Power6 very much seems the solution.

Power6 definitely has issues, the main one being the question of how much the performance falls off on the 3.5GHz and 4.2GHz models, but the 4.7GHz machine right now seems the king of the castle.

Posted by Ewan on June 11, 2007 at 01:49 AM EDT #

About

jmeyer
