The Fallacy of IBM's Power6

IBM is leaking FUD about its processors again. The Power5+, it is said, will be released later this year, ramping to 3GHz. The Power6, according to a "leaked" non-disclosure preso discussed by TheRegister, will sport "very large frequency enhancements". At the end of another article, IBM suggests the Power6 will run at an "ultra-high frequency".

In engineering terms, those kinds of phrases generally imply at least an "order of magnitude" type of increase. That's [3GHz * 10^1], or an increase to 30GHz! But let's view this through a marketing lens and say IBM is only talking about a "binary" order of magnitude [3GHz * 2^1]. That still puts the chip at 6GHz.

And therein lies part of the problem. First, even Intel can't get past 4GHz. In an embarrassing admission, they pulled their plans for a 4GHz Pentium and will concentrate their massive brain trust of chip designers on more intelligent ways to achieve increasing performance. More on that in a minute. Now I know IBM has some pretty impressive semiconductor fab processes and fabrication process engineers. But getting acceptable yields from a 12" wafer with 1/2 billion transistor chips at 6GHz and a 65nm fab process is pure rocket science. They can probably do it, at great shareholder expense. But even if that rocket leaves the atmosphere, they are still aiming in the wrong direction. As Sun, and now Intel, have figured out, modern apps and the realities of DRAM performance (even with large caches) render "ultra-high" clock rates impotent.

I've also got to hand it to IBM's chip designers...Here is an interesting technical overview of the z990 (MainFrame) CPU. The Power6 is targeted as the replacement for the z990, so it'll have to meet the z990 feature bar. The Power6 is rumored to be a common chip for their M/F zSeries and Unix pSeries platforms... (but they've been talking about a common chip for 10 years now, according to Gartner). Here is an excerpt of the z990 description:

"These include millicode, which is the vertical microcode that executes on the processor, and the recovery unit (R-unit), which holds the complete microarchitected state of the processor and is checkpointed after each instruction. If a hardware error is detected, the R-unit is then used to restore the checkpointed state and execute the error-recovery algorithm. Additionally, the z990 processor, like its predecessors, completely duplicates several major functional units for error-detection purposes and uses other error-detection techniques (parity, local duplication, illegal state checking, etc.) in the remainder of the processor to maintain state-of-the-art RAS characteristics. It also contains several mechanisms for completely transferring the microarchitected state to a spare processor in the system in the event of a catastrophic failure if it determines that it can no longer continue operating."

Wow! Still, they are continuing to fund rocket science based on the old "Apollo" blueprints. And that "dog don't hunt" any longer, to mix metaphors. Single thread performance and big SMP designs are still important. Sun leads the world in that area, with the 144 core E25K. And our servers with US-IVs (et al), AMD Opterons, and the engineering collaboration we're doing with Fujitsu should continue that leadership. But extreme clock rates are not the answer going forward.

In the benchmarketing world of TPC-C and SPECrates, where datasets fit nicely inside processor caches, performance appears stellar. But the problem, you see, is that for real applications, especially when micro-partitioning and multiple OS kernels and stacked applications are spread across processors, the L1/L2/L3 caches only contain a fraction of the data and instructions that the apps need to operate. At 6GHz, there is a new clock tick every 0.17 ns (light only travels about 2 inches in that time)!! However, about every 100 instructions or so, the data needed by a typical app might not appear in the processor cache chain. This is called a "cache miss" and it results in a DRAM access (or worse, a trip to disk). Typical DRAM latency is about 150-300ns for large/complex SMP servers. Think about that... a 6GHz CPU will simply twiddle its proverbial thumbs for roughly 1000 clock ticks or more (doing nothing but generating heat) before that DRAM data makes its way back up to the CPU so that work can continue. If this happens every 100 instructions, and we assume about one instruction per cycle, we're at <10% efficiency (100 busy cycles, followed by 1000 idle cycles, repeat). Ouch!! And that ratio just gets worse as the CPU clock rate increases. Sure, big caches can help some, but not nearly enough to overcome this fundamental problem.
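For anyone who wants to check that back-of-the-envelope math, here is a minimal Python sketch. The function name, the one-instruction-per-cycle assumption, and the specific latency values plugged in are mine, chosen only to mirror the rough numbers above; this is not a model of any real chip.

```python
def stall_efficiency(clock_ghz, dram_latency_ns,
                     instructions_between_misses=100, ipc=1.0):
    """Fraction of cycles spent doing useful work on a single-threaded core.

    Assumptions (illustrative only): the core retires `ipc` instructions
    per cycle while running, and sits completely idle for the full DRAM
    latency on every cache miss.
    """
    # A clock of N GHz ticks N times per nanosecond.
    stall_cycles = dram_latency_ns * clock_ghz
    busy_cycles = instructions_between_misses / ipc
    return busy_cycles / (busy_cycles + stall_cycles)


if __name__ == "__main__":
    for ghz in (3, 6):
        for latency_ns in (150, 300):
            eff = stall_efficiency(ghz, latency_ns)
            print(f"{ghz} GHz, {latency_ns} ns DRAM latency: "
                  f"{eff:.1%} of cycles doing work")
```

Under these assumptions, doubling the clock while DRAM latency stays put roughly doubles the number of idle cycles per miss, which is why the efficiency figure drops as the clock rate rises.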

What to do? The answer is to build extremely efficient thread engines that can accept multiple thread contexts from the OS and manage those on chip. And we're not talking 2-way hyper-threading here. Say a single processor can accept dozens of threads from the OS. Say there are 8 cores on that processor so that 8 threads can run concurrently, with the other threads queued up ready to run. When any one of those 8 threads needs to reach down into DRAM for a memory reference (and they will, frequently), one of the H/W queued threads in the chip's run queue will instantly begin to execute on the core vacated by the "stalled" thread that is now patiently waiting for its DRAM retrieval. We've just described a design that can achieve near 100% efficiency even when DRAM latency is taken into account. Ace's Hardware reports that "Niagara has reached first silicon, and is running in Sun's labs".
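To make the idea concrete, here is a toy Python simulation of one such core. Everything in it is an assumption for illustration: the function name, the zero-cost thread switch, the "pick the thread that has been ready longest" policy, and the burst and stall lengths (borrowed from the rough 100-instructions / 1000-cycles example earlier). It is not a model of Niagara's actual pipeline.

```python
def simulate_core(num_threads, run_cycles, stall_cycles, total_cycles=110_000):
    """Return the fraction of cycles one core spends executing.

    Each hardware thread runs for `run_cycles`, then misses cache and
    waits `stall_cycles` for DRAM. The core switches (at zero cost, an
    assumption) to any other thread whose data has already arrived.
    """
    # ready_at[i]: cycle at which thread i's outstanding DRAM access completes
    ready_at = [0] * num_threads
    busy = 0
    cycle = 0
    while cycle < total_cycles:
        runnable = [i for i, t in enumerate(ready_at) if t <= cycle]
        if not runnable:
            # Every thread is waiting on DRAM: the core idles until the
            # earliest access returns.
            cycle = min(ready_at)
            continue
        # Run the thread that has been ready the longest.
        i = min(runnable, key=lambda k: ready_at[k])
        burst = min(run_cycles, total_cycles - cycle)
        busy += burst
        cycle += burst
        ready_at[i] = cycle + stall_cycles
    return busy / total_cycles


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 11, 16):
        util = simulate_core(n, run_cycles=100, stall_cycles=1000)
        print(f"{n:2d} hardware threads per core: {util:.1%} busy")
```

In this toy model the core only stays fully busy once there are enough queued threads to cover the whole DRAM wait (run time times thread count at least equal to run time plus stall time), so the number of hardware thread contexts needed depends heavily on how many cycles a miss costs at a given clock rate.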

I won't comment on the veracity of that report. But if true, we are years ahead of competition. We're orbiting the Sun, and IBM is still sending its design team to the moon.

An analogy - consider an Olympic relay race... There are 8 teams of 4 runners. Each runner sprints for all they are worth around the lap once, and then hands the baton, in flight, to the next runner. We've got 32 "threads" that are constantly tearing up the track at full speed. On the other hand, a 6GHz single threaded core is like a single runner who sprints like a mad man around the track once, and then sits down for 15 minutes to catch his breath. Then does it again. Which model describes the kind of server you'd like running your highly threaded enterprise applications?


In fact, the Power5 processors are SMT-enabled... so at 6GHz (with the Power6), the processor can switch between threads during a cache miss.

Posted by Rayson Ho on January 14, 2005 at 07:16 PM EST #

True... That was why I included the statement that "we're not talking 2-way hyper-threading here". My general comments and analogy are valid. Picture that Olympic race with two runners who each take 15 minute breathers after each lap. Compare that to the 32, of which there are always 8 sprinting at full speed. I'm using "hyper-threading" as a general expression to describe 2-way SMT. Cheers.

Posted by guest on January 14, 2005 at 10:33 PM EST #

Don't get too carried away about IBM's lack of a parallelism story a la Niagara or Azul. Have you looked at Cell?

Posted by James Governor on January 25, 2005 at 04:27 AM EST #

Going by your words "... for real applications ...", meaning things like SAP R/3, Java or Oracle eBusiness Suite (did I mention that Java applications are "real" ones for Java owner SUN?), it seems that the value of a technology should be measured by how well it supports applications at the lowest cost. Let's see the facts: the "hottest" low-powered Sunfire 25K with 144 old Sparc-III CPUs (poor 33M transistor cores covered by a Sparc-IV CHIP dress) performs at 51,070 SAPs, that is 354 SAPs/CPU, and an IBM pSeries 595 with 64 Power5 CPUs (first proof of concept) performs at 100,700 SAPs (did I mention 32 CHIPs?), that is 1,573 SAPs/CPU. It is similar for Java, where the "spectacular" 6900 with 48 Sparc-III cores (once again called Sparc-IV by SUN) performs at 421,773 ops/sec (8,786 ops/CPU) and the IBM pSeries 595 with 64 Power5 cores performs at 2,200,162 (34,377 ops/CPU). Both results show Power5 performing about 4 times as well as Sparc-III !!!!!!!! Now think about the promise of "The Rock": 30 times the performance of the first Sparc-III (750 MHz). Did somebody mention that "The Rock" is a CHIP with 8 cores? So the promise per core is only 3.75 times the performance of the scrappy Sparc-III !!!! And that poor promise is for 2006 (maybe 2007). Could somebody be so foolish as to purchase a Sunfire at twice the cost of a pSeries with half the performance, with the "excellent" promise of being even worse in the future? You at Sun should re-think how you tell these stories to customers. We deserve some respect !!!!

Posted by Jack Cummins on February 01, 2005 at 10:28 AM EST #

You say: "As Sun, and now Intel, have figured out, modern apps and the realities of DRAM performance (even with large caches) render "ultra-high" clock rates impotent." That's true. But why, then, was it IBM who first started to use DDRII in their servers, not Sun? Sorry, but Sun's own hardware designs seem a bit outdated. Why in the world doesn't Sun focus on developing large-scale servers out of commodity CPUs (like Opterons)? Instead, they prefer to waste resources on their own development. PS: Yes, we run Suns here. Several 6800s, a bunch of V-series (Sparc, not Opteron) and lots of older Ultra Enterprises. I don't know why our management prefers Sun to IBM.

Posted by Mike on February 14, 2005 at 07:55 PM EST #

Hi Mike, I'm not a DDR-2 expert... However, according to the reference below, DDR-2 has *worse* latency characteristics than traditional DDR, typically using a latency timing of 4-4-4 @ 533MHz.

Posted by Dave on February 14, 2005 at 09:56 PM EST #

Niagara does well with memory-bound applications where each thread has to go to DRAM often. The Power6 will run circles around Niagara in more compute-bound applications, many of which (surprise, surprise) fit in the caches. We are not even talking about the floating-point performance needed for scientific computing, where Power6 will shine. Niagara is designed for throughput computing and does a better job there, but that is only part of the story.

Posted by GS on February 06, 2006 at 07:10 PM EST #
