The Fallacy of IBM's Power6
By dcb on Jan 14, 2005
IBM is leaking FUD about its processors again. The Power5+, it is said, will ship later this year, ramping to 3GHz. The Power6, according to a "leaked" non-disclosure presentation discussed by The Register, will sport "very large frequency enhancements". At the end of another news.com article, IBM suggests the Power6 will run at an "ultra-high frequency".
In engineering terms, those kinds of phrases generally imply at least an "order of magnitude" type of increase. That's 3GHz × 10^1, or an increase to 30GHz! But let's view this through a marketing lens and say IBM is only talking about a "binary" order of magnitude (3GHz × 2^1). That still puts the chip at 6GHz.
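That arithmetic is trivial, but worth writing down; here's a throwaway Python snippet with the two readings above (the variable names are mine):

```python
# Back-of-the-envelope check of the two "order of magnitude" readings above.
base_ghz = 3.0
decimal_order = base_ghz * 10 ** 1   # engineering reading: a 10x jump
binary_order = base_ghz * 2 ** 1     # generous "marketing" reading: a 2x jump
print(decimal_order, "GHz")  # 30.0 GHz
print(binary_order, "GHz")   # 6.0 GHz
```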
And therein lies part of the problem. First, even Intel can't get past 4GHz. In an embarrassing admission, they pulled their plans for a 4GHz Pentium and will concentrate their massive brain trust of chip designers on more intelligent ways to achieve increased performance. More on that in a minute. Now I know IBM has some pretty impressive semiconductor fab processes and process engineers. But getting acceptable yields from 12" wafers of half-billion-transistor chips at 6GHz on a 65nm fab process is pure rocket science. They can probably do it, at great shareholder expense. But even if that rocket leaves the atmosphere, they are still aiming in the wrong direction. As Sun, and now Intel, have figured out, modern apps and the realities of DRAM performance (even with large caches) render "ultra-high" clock rates impotent.
I've also got to hand it to IBM's chip designers... Here is an interesting technical overview of the z990 (mainframe) CPU. The Power6 is targeted as the replacement for the z990, so it'll have to meet the z990 feature bar. The Power6 is rumored to be a common chip for their mainframe zSeries and Unix pSeries platforms... (but they've been talking about a common chip for 10 years now, according to Gartner). Here is an excerpt from the z990 description:
"These include millicode, which is the vertical microcode that executes on the processor, and the recovery unit (R-unit), which holds the complete microarchitected state of the processor and is checkpointed after each instruction. If a hardware error is detected, the R-unit is then used to restore the checkpointed state and execute the error-recovery algorithm. Additionally, the z990 processor, like its predecessors, completely duplicates several major functional units for error-detection purposes and uses other error-detection techniques (parity, local duplication, illegal state checking, etc.) in the remainder of the processor to maintain state-of-the-art RAS characteristics. It also contains several mechanisms for completely transferring the microarchitected state to a spare processor in the system in the event of a catastrophic failure if it determines that it can no longer continue operating."
Wow! Still, they are continuing to fund rocket science based on the old "Apollo" blueprints. And that "dog don't hunt" any longer, to mix metaphors. Single-thread performance and big SMP designs are still important. Sun leads the world in that area, with the 144-core E25K. And our servers with US-IVs (et al), AMD Opterons, and the engineering collaboration we're doing with Fujitsu should continue that leadership. But extreme clock rates are not the answer going forward.
In the benchmarketing world of TPC-C and SPECrates, where datasets fit nicely inside processor caches, performance appears stellar. The problem, you see, is that for real applications, especially when micro-partitioning and multiple OS kernels and stacked applications are spread across processors, the L1/L2/L3 caches only contain a fraction of the data and instructions the apps need to operate. At 6GHz, there is a new clock tick every 0.17ns (light only travels about 2 inches in that time)! But about every 100 instructions or so, the data a typical app needs won't be found in the processor cache hierarchy. This is called a "cache miss", and it results in a DRAM access (or worse, a trip to disk). Typical DRAM latency is about 150-300ns for large/complex SMP servers. Think about that... a 6GHz CPU will simply twiddle its proverbial thumbs for over 1000 clock ticks (doing nothing but generating heat) before that DRAM data makes its way back up to the CPU so that work can continue. If this happens every 100 instructions, we're at under 10% efficiency (100 busy cycles, followed by 1000 idle cycles, repeat). Ouch!! And that ratio just gets worse as the CPU clock rate increases. Sure, big caches can help some, but not nearly enough to overcome this fundamental problem.
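The stall arithmetic in that paragraph fits in a few lines of Python. The 170ns latency and the one-instruction-per-cycle assumption are mine, picked from the ranges quoted above:

```python
# Hypothetical stall-cycle model for the numbers above: a 6 GHz core
# retiring one instruction per cycle, a cache miss every 100 instructions,
# and ~170 ns DRAM latency (within the 150-300 ns range quoted).
clock_hz = 6e9
cycle_ns = 1e9 / clock_hz                    # ~0.167 ns per clock tick
dram_latency_ns = 170.0
stall_cycles = dram_latency_ns / cycle_ns    # ~1020 idle cycles per miss

instructions_per_miss = 100                  # busy cycles between misses
efficiency = instructions_per_miss / (instructions_per_miss + stall_cycles)
print(f"{stall_cycles:.0f} stall cycles per miss, {efficiency:.1%} efficiency")
```

Run it and you get roughly 1020 stall cycles per miss and about 9% efficiency, which is the "under 10%" figure above.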
What to do? The answer is to build extremely efficient thread engines that can accept multiple thread contexts from the OS and manage them on chip. And we're not talking 2-way hyper-threading here. Say a single processor can accept dozens of threads from the OS. Say there are 8 cores on that processor, so 8 threads can run concurrently, with the other threads queued up ready to run. When any one of those 8 threads needs to reach down into DRAM for a memory reference (and they will, frequently), one of the hardware-queued threads in the chip's run queue instantly begins executing on the core vacated by the "stalled" thread, which is now patiently waiting for its DRAM retrieval. We've just described a design that can approach 100% efficiency even when DRAM latency is taken into account. Ace's Hardware reports that "Niagara has reached first silicon, and is running in Sun's labs".
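Here's a minimal analytic sketch of why queued hardware threads recover that lost efficiency. The steady-state model and the context counts are illustrative assumptions of mine, not actual Niagara parameters:

```python
# Toy model of hiding DRAM latency with hardware thread contexts.
# Reuses the earlier numbers: 100 busy cycles per thread between misses,
# ~1000 stall cycles per miss. All values are illustrative assumptions.
run_cycles = 100      # cycles a thread executes before a cache miss
stall_cycles = 1000   # cycles that miss spends waiting on DRAM

def core_utilization(contexts_per_core: int) -> float:
    # Each thread is runnable for run_cycles out of every
    # (run_cycles + stall_cycles) cycles; the core is busy whenever at
    # least one resident context is runnable, capped at 100%.
    runnable = contexts_per_core * run_cycles / (run_cycles + stall_cycles)
    return min(1.0, runnable)

for contexts in (1, 4, 8, 11):
    print(contexts, f"{core_utilization(contexts):.0%}")
# 1 context  -> ~9%  (the single-threaded case from the previous section)
# 11 contexts -> 100% (enough queued threads to cover every DRAM stall)
```

With a single context the core idles over 90% of the time; pile on roughly a dozen contexts per core and, in this simplified model, there is always a runnable thread waiting to cover each DRAM stall.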
I won't comment on the veracity of that report. But if true, we are years ahead of the competition. We're orbiting the Sun, and IBM is still sending its design team to the moon.
An analogy - consider an Olympic relay race... There are 8 teams of 4 runners. Each runner sprints flat-out around the track once, then hands the baton off, in flight, to the next runner. We've got 32 "threads" constantly tearing up the track at full speed. On the other hand, a 6GHz single-threaded core is like a lone runner who sprints like a madman around the track once, and then sits down for 15 minutes to catch his breath. Then does it again. Which model describes the kind of server you'd like running your highly threaded enterprise applications?