POWER6 Goes Thud: Part IV
By jmeyer on Jun 06, 2007
A Look at POWER6's Lagging Architecture
Ever since Sun announced the world's first multi-core microprocessor, most of the high-volume chip-designing world (that world being Sun, IBM, AMD, and Intel) has realized that the race is on to see who can build a microprocessor with the most cores and the most threads to handle the modern applications that thread-rich, Internet-based computing has spawned. That "most" includes everyone except IBM.
Intel, which declared the end of instruction-level parallelism (ILP) and the advent of the era of thread-level parallelism (TLP) as far back as IDF 2003, has already begun shipping quad-core processors (Sun was the first to ship systems with them). AMD will follow suit later this year. Sun, of course, set the bar in 2005 with its 8-core, 32-thread UltraSPARC T1 and will ship the 8-core, 64-thread "Niagara 2" processor in systems later this year. I'm guessing we'll beat the 2-core, 4-thread POWER6 to the volume market by a pretty solid margin, mostly because IBM has yet to announce any POWER6 availability in the volume (or, for that matter, high-end) markets.
The reason for this enthusiasm around multithreading is described brilliantly in a whitepaper titled The Landscape of Parallel Computing Research: A View From Berkeley, written by a multidisciplinary group of Berkeley researchers, including the father of RISC, David Patterson. They see microprocessor performance hitting a "brick wall" due to three factors:
- The Power Wall: "Power is expensive, but transistors are 'free'. That is, we can put more transistors on a chip than we have the power to turn on." Particularly true on a 65nm chip like POWER6 or Niagara 2, unless you do something about it.
- The Memory Wall: "Load and store is slow, but multiply is fast. Modern microprocessors can take 200 clocks to access Dynamic Random Access Memory (DRAM), but even floating-point multiplies may take only four clock cycles." And increasing the size of the already-large caches traditionally used to mask memory latency isn't giving us a good return on the transistor investment anymore.
- The ILP Wall: "There are diminishing returns on finding more ILP. ... Increasing parallelism is the primary method of improving processor performance". They dismiss the old conventional wisdom that increasing clock frequency is the primary way to improve processor performance.
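The memory-wall numbers are worth making concrete. Here's a back-of-the-envelope utilization model using the whitepaper's round figures (the cycle counts are its illustrative numbers, not measurements of any particular chip, and the round-robin thread switching is my simplifying assumption): a single thread that keeps missing cache leaves the pipeline idle almost all the time, while a core that can switch among many hardware threads keeps doing useful work.

```python
# Rough pipeline-utilization model from the whitepaper's round numbers:
# a DRAM access stalls for ~200 cycles; useful work between misses
# (e.g. a floating-point multiply) takes ~4 cycles. Illustrative only.
DRAM_LATENCY = 200   # cycles stalled per cache miss
WORK = 4             # cycles of useful work between misses

def utilization(threads: int) -> float:
    """Fraction of cycles spent on useful work, assuming the core
    switches to another ready thread whenever the current one stalls."""
    busy = threads * WORK
    return min(1.0, busy / (WORK + DRAM_LATENCY))

for n in (1, 4, 8, 32):
    print(f"{n:2d} threads -> {utilization(n):5.1%} pipeline utilization")
```

One miss-bound thread keeps the pipeline busy only about 2% of the time; piling on hardware threads, rather than chasing clock rate, is what recovers the rest.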
To illustrate the point, Figure 2 of the whitepaper shows the impact of the old faster-clockrate, bigger-caches mentality that has prevailed in microprocessor design for the last thirty years:
The upper-right part of the chart shows processor performance flattening out after 2002. The green line coincides remarkably closely with the doubling of performance most people incorrectly attribute to Moore's Law (Moore's Law is a statement about transistor density, not performance; but given how closely performance tracked transistor density from 1986 to 2002, one could be forgiven for conflating the two -- though not any longer).
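The 52%-per-year trend line and the "doubling" people associate with Moore's Law are easy to reconcile numerically. A quick sketch (my own arithmetic, not the whitepaper's) shows that 52% annual growth compounds to a doubling roughly every twenty months, squarely in the 18-to-24-month doubling range usually quoted for transistor density -- which is exactly why the two curves were so easy to conflate until 2002:

```python
import math

ANNUAL_GROWTH = 0.52  # ~52%/year performance growth, 1986-2002

# Years for performance to double at 52%/year growth:
doubling_time = math.log(2) / math.log(1 + ANNUAL_GROWTH)
print(f"doubling time: {doubling_time:.2f} years "
      f"(~{doubling_time * 12:.0f} months)")

# Cumulative speedup after n years on the 52% trend line:
for years in (1, 2, 5):
    print(f"{years} year(s): {(1 + ANNUAL_GROWTH) ** years:.2f}x")
```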
So higher clock frequencies are no longer the key to performance, and neither is spending your free Moore's Law transistors on ever-bigger caches. So what did IBM do? They more than doubled their clock rate (2.2GHz to 4.7GHz) and roughly quadrupled their on-chip L2 caches (1.92MB on POWER5+, 8MB on POWER6). And what performance speed-up did they get for their efforts?
To answer that, let's look at IBM's proprietary rPerf benchmark which, according to IBM, is a benchmark that "simulates some of the system operations such as CPU, cache and memory. However, the model does not simulate disk or network I/O operations." Take a look at the right to see how well doubling the frequency and quadrupling the cache size worked for POWER6 [\*] (the green line is, once again, roughly 52% performance increase per year).
POWER6 falls squarely into the diminishing-returns category described by the Berkeley whitepaper because it has pinned its hopes on the old, unimaginative, out-of-date techniques that the rest of the industry has largely abandoned. For all IBM's hype about POWER6 being "convention shattering," it's still only two cores per chip and two threads per core, just like POWER5+. Moreover, IBM sacrificed the out-of-order execution that gave POWER5+ a boost in order to get the frequency and cache-size increases. In most respects, POWER6's enhancements are evolutionary, not revolutionary. And in some respects, they're actually a step backwards.
There's more. POWER6 is pretty much all we're going to see from IBM for at least the next three years (aside from still more clock-speed increases!), since POWER7 won't arrive until 2010 at the earliest. And there's still no word from IBM on when the missing entry-level and high-end POWER6 systems will show up.
So how is Sun stacking up against this? Sun has already publicly stated that we will be releasing three new 100% binary compatible SPARC processors over the course of the next eighteen months, each one optimized and targeted to different application workloads. And at the 2007 Sun Analyst Summit, Sun's Vice President of Systems, John Fowler, gave a presentation that included this performance roadmap for SPARC (I've updated the little sunburst milestones and once again, I've drawn in the bright green line representing roughly 52% performance increase per year):
We're already devastating IBM with Sun's CoolThreads™ servers in terms of performance, rack space, power, and cooling. By the end of this year, we'll do it again. And as I said in a previous blog entry, when the ROCK systems bring the power of chip multi-threading to the high end, there will be absolutely no reason any customer would want a POWER system unless that particular shade of IBM blue went better with his datacenter décor.
The reason for all this is that years ago, Sun recognized that we were facing exactly those problems that were mentioned in the Berkeley whitepaper and decided to do something extraordinary. Rather than focus on clock frequency and cache size, like IBM, we decided to question everything that everyone knew about how to make a microprocessor go fast. The result was we made a big bet on chip multi-threading, or CMT, that's been paying off since 2005. That's why Sun's CoolThreads servers are the fastest-ramping product line in Sun Microsystems history and why customers are so enthusiastically endorsing them.
And you ain't seen nothin' yet. That's what's keeping IBM up at night.
\* This chart is based on normalizing the following rPerf numbers from IBM to 12.27: POWER5+ (lower clock): 12.27; POWER5+@2.2GHz: 13.83; POWER6@3.5GHz: 15.85; POWER6@4.2GHz: 18.38; POWER6@4.7GHz: 20.13.
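The normalization in that footnote is just division by the 12.27 baseline, and the same numbers make the chart's point directly. A sketch (the rPerf scores are IBM's published figures quoted above; the labels are shortened, and I'm assuming the 13.83 entry is the 2.2GHz POWER5+ cited in the text -- the ratio arithmetic is mine, not IBM's):

```python
# rPerf figures as quoted in the footnote (IBM's published numbers).
BASELINE = 12.27  # the footnote normalizes everything to this score

rperf = {
    "POWER5+ (base)":  12.27,
    "POWER5+ @2.2GHz": 13.83,  # assumed to be the 2.2GHz part
    "POWER6 @3.5GHz":  15.85,
    "POWER6 @4.2GHz":  18.38,
    "POWER6 @4.7GHz":  20.13,
}

for chip, score in rperf.items():
    print(f"{chip:16s} normalized: {score / BASELINE:.2f}x")

# The clock more than doubled (2.2GHz -> 4.7GHz) between the two chips
# compared in the text, but the rPerf gain is far smaller:
clock_ratio = 4.7 / 2.2
rperf_ratio = rperf["POWER6 @4.7GHz"] / rperf["POWER5+ @2.2GHz"]
print(f"clock ratio {clock_ratio:.2f}x vs rPerf ratio {rperf_ratio:.2f}x")
```

A 2.14x clock increase buying only about a 1.46x rPerf increase is the diminishing-returns picture the whole post is arguing.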