Tuesday Nov 20, 2007

POWER6 Goes Thud: Part VII

Whither the POWER6 product line? After half a year, it's still missing! What went wrong?

Tuesday Aug 28, 2007

POWER6 Goes Thud: Part VI

For a company that invented virtualization, IBM is still trying to catch up to Sun.

POWER6 Goes Thud: Part V

Three months is a long time to be in labor with octuplets and have only one baby to show for it.

Wednesday Jun 06, 2007

POWER6 Goes Thud: Part IV

A Look at POWER6's Lagging Architecture

Ever since Sun announced the world's first multi-core microprocessor, most of the high-volume chip-designing world (that world would be Sun, IBM, AMD, and Intel) has realized that the race is on to see who can build a microprocessor with the most cores and the most threads to handle the modern applications that thread-rich, Internet-based computing has spawned. That "most" includes everyone except IBM.

Declaring the end of instruction-level parallelism (ILP) and the advent of the era of thread-level parallelism (TLP) as far back as IDF 2003, Intel has already begun shipping quad-core processors (Sun is the first to ship systems). AMD will follow suit later this year. Sun, of course, set the bar in 2005 with its 8-core, 32-thread UltraSPARC T1 and will be shipping the 8-core, 64-thread "Niagara 2" processor in systems later this year. I'm guessing we'll probably beat the 2-core, 4-thread POWER6 to the volume market by a pretty solid margin, mostly because IBM failed to announce any availability of POWER6 in the volume (or high-end, for that matter) markets.

The reason for this enthusiasm around multithreading is described brilliantly in a whitepaper titled The Landscape of Parallel Computing Research: A View From Berkeley, written by a multidisciplinary group of Berkeley researchers, including the father of RISC, David Patterson. They see microprocessor performance hitting a "brick wall" due to three factors:

  • The Power Wall: "Power is expensive, but transistors are 'free'. That is, we can put more transistors on a chip than we have the power to turn on." Particularly true on a 65nm chip like POWER6 or Niagara 2, unless you do something about it.
  • The Memory Wall: "Load and store is slow, but multiply is fast. Modern microprocessors can take 200 clocks to access Dynamic Random Access Memory (DRAM), but even floating-point multiplies may take only four clock cycles." And increasing the size of the already great big caches traditionally used to mask memory latency isn't giving us a good return on the transistor investment anymore.
  • The ILP Wall: "There are diminishing returns on finding more ILP. ... Increasing parallelism is the primary method of improving processor performance". They dismiss increasing clock frequency as the primary method of improving processor performance as old conventional wisdom.

To illustrate the point, Figure 2 of the whitepaper shows the impact of the old faster-clockrate, bigger-caches mentality that has prevailed in microprocessor design for the last thirty years:

The Brick Wall

The upper right part of the chart shows processor performance lagging since 2002. The green line coincides remarkably closely with the doubling of performance that most people incorrectly attribute to Moore's Law (Moore's Law is a statement about transistor density, not performance; but given that performance tracked transistor density so closely from 1986 to 2002, one could be forgiven for collapsing the two -- but not any longer).

So, higher clock frequencies are no longer the key to performance, and neither is spending your free Moore's Law transistors on bigger caches. So what did IBM do? They more than doubled their clock rate (2.2GHz to 4.7GHz) and quadrupled the size of their L2 on-chip caches (1.92MB on POWER5+, 8MB on POWER6). And what performance speed-up did they get for their efforts?

rPerf Relative to POWER5+

To answer that, let's look at IBM's proprietary rPerf benchmark, which, according to IBM, "simulates some of the system operations such as CPU, cache and memory. However, the model does not simulate disk or network I/O operations." Take a look at the chart to see how well doubling the frequency and quadrupling the cache size worked for POWER6 [\*] (the green line is, once again, roughly 52% performance increase per year).

POWER6 is clearly in the microprocessor category of diminishing performance returns described by the Berkeley whitepaper because it has pinned its hopes on old, unimaginative, and out of date techniques that the rest of the industry has largely abandoned. For all IBM's hype around POWER6 being "convention shattering", it's still only two cores per chip and two threads per core, just like POWER5+. Moreover, they sacrificed the out-of-order execution that gave POWER5+ a boost to get the frequency and cache size increases. In most respects, its enhancements are completely evolutionary and not at all revolutionary. And in some respects, they're actually going backwards.

There's more. POWER6 is pretty much all we're going to see from IBM for the next three years at least (plus or minus still more clock speed increases!), since POWER7 won't be around until 2010 at the earliest. And there's still no word from IBM on when the missing entry-level and high-end POWER6 systems will show up.

So how is Sun stacking up against this? Sun has already publicly stated that we will be releasing three new 100% binary compatible SPARC processors over the course of the next eighteen months, each one optimized and targeted to different application workloads. And at the 2007 Sun Analyst Summit, Sun's Vice President of Systems, John Fowler, gave a presentation that included this performance roadmap for SPARC (I've updated the little sunburst milestones and once again, I've drawn in the bright green line representing roughly 52% performance increase per year):

SPARC Processor Performance Increases

We're already devastating IBM with Sun's CoolThreads™ servers in terms of performance, rack space, power, and cooling. By the end of this year, we'll do it again. And as I said in a previous blog entry, when the ROCK systems bring the power of chip multi-threading to the high end, there will be absolutely no reason any customer would want a POWER system unless that particular shade of IBM blue went better with his datacenter décor.

The reason for all this is that years ago, Sun recognized that we were facing exactly those problems that were mentioned in the Berkeley whitepaper and decided to do something extraordinary. Rather than focus on clock frequency and cache size, like IBM, we decided to question everything that everyone knew about how to make a microprocessor go fast. The result was we made a big bet on chip multi-threading, or CMT, that's been paying off since 2005. That's why Sun's CoolThreads servers are the fastest-ramping product line in Sun Microsystems history and why customers are so enthusiastically endorsing them.

And you ain't seen nothin' yet. That's what's keeping IBM up at night.

\* This chart is based on normalizing the following rPerf numbers from IBM to 12.27: POWER5+@1.9GHz: 12.27; POWER5+@2.1GHz: 13.83; POWER6@3.5GHz: 15.85; POWER6@4.2GHz: 18.38; POWER6@4.7GHz: 20.13.
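For readers who want to redo the normalization in the footnote, here is a minimal sketch. It uses only the rPerf figures quoted above; the clock-ratio column is my own addition for comparison and is not part of IBM's rPerf methodology.

```python
# rPerf figures as quoted in the footnote (IBM's numbers, cited by the author).
RPERF = [
    ("POWER5+", 1.9, 12.27),
    ("POWER5+", 2.1, 13.83),
    ("POWER6", 3.5, 15.85),
    ("POWER6", 4.2, 18.38),
    ("POWER6", 4.7, 20.13),
]


def normalize(rows, baseline_index=0):
    """Express each rPerf score and clock rate relative to the baseline row."""
    _, base_ghz, base_rperf = rows[baseline_index]
    return [
        (chip, ghz, rperf / base_rperf, ghz / base_ghz)
        for chip, ghz, rperf in rows
    ]


for chip, ghz, rel_perf, rel_clock in normalize(RPERF):
    print(f"{chip} @ {ghz}GHz: rPerf x{rel_perf:.2f}, clock x{rel_clock:.2f}")
```

Relative to the 1.9GHz POWER5+ baseline, the 4.7GHz POWER6 comes out at roughly 1.64 times the rPerf score for roughly 2.47 times the clock rate.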

POWER6 Goes Thud: Part III

I was going to devote another blog entry to exposing the benchmarketing perfidy that IBM pulled in their POWER6 announcement the other day, but I can't possibly top the great analyses performed by the enigmatic, masked BM Seer. Thanks, Seer!

Monday Jun 04, 2007

POWER6 Goes Thud: Part II

A Look at IBM's TPC-C Results

One has to wonder what IBM was thinking when they published results on the fifteen-year-old TPC-C benchmark as part of their drive to impress the world about the performance characteristics of the single POWER6™ system they announced the other day. IBM's reasons couldn't possibly have included helping their customers make intelligent buying decisions by reflecting a modern, real-world workload with a reasonable database architecture. Criticisms of TPC-C over the years are legion, and it's probably not useful to repeat them here. Suffice it to say that Gartner, IDC, and Oracle have weighed in, and this is one of the reasons that the Transaction Processing Performance Council (TPC) announced a new benchmark in March, TPC-E, which goes much further to reflect 21st-century OLTP workloads.

According to TPC-C's description: "TPC-C simulates a complete environment where a population of terminal operators executes transactions against a database. The benchmark is centered around the principal activities (transactions) of an order-entry environment. These transactions include entering and delivering orders, recording payments, checking the status of orders, and monitoring the level of stock at the warehouses." It should be noted that there is no set piece of code to execute: the submitter is allowed to architect the system and write the SQL code however he wishes. And this is where the fun begins.

Let's take a look at IBM's configuration, which you can see for yourself in the full disclosure report (FDR) that they submitted to the TPC.

IBM's TPC-C Hardware Configuration

Pretty impressive database engine, isn't it? But did I mention that according to the FDR (page 27), the total database table size was about 13TB? Let me say that again, so you'll know it's not a misprint: IBM used almost 120 terabytes worth of disks (3,482 of them, to be exact) to store 13 terabytes of table data. When was the last time you asked your boss to give you money for 42 4Gb fibre channel connections to access 13TB of data? Or funds for 3,482 spindles so you could use only 10% of the space on them? Where do you think you'll be able to dig up 3,312 36GB disk drives? Are you even running DB2 in your datacenter?
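The disk arithmetic is easy to check. The sketch below uses only the figures quoted above and counts just the 3,312 36GB drives, since the capacity of the remaining spindles is not given here; treating 1TB as 1,000GB is my assumption.

```python
# Figures quoted in the paragraph above (from IBM's FDR, as cited by the author).
DRIVES_36GB = 3312     # number of 36GB drives
GB_PER_DRIVE = 36
TABLE_DATA_TB = 13     # approximate total database table size

GB_PER_TB = 1000       # assumption: decimal terabytes

# Raw capacity of the 36GB drives alone; the other spindles are ignored
# because their sizes are not stated in the text.
raw_tb = DRIVES_36GB * GB_PER_DRIVE / GB_PER_TB
utilization = TABLE_DATA_TB / raw_tb

print(f"Raw capacity of 36GB drives: {raw_tb:.1f} TB")
print(f"Table data as share of that capacity: {utilization:.1%}")
```

That works out to a little over 119TB of raw disk for 13TB of table data, or about 11% of the space, which is in line with the "almost 120 terabytes" and "only 10%" figures above.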

I should mention that nothing IBM did here was in violation of the rules of the TPC, but it strains the imagination to see how their result could possibly be useful to anyone making OLTP database performance comparisons. This is why Sun hasn't published a TPC-C result since 2001, preferring instead iGEN OLTP or running real applications such as Oracle Apps or SAP.

If you still believe that IBM's TPC-C results actually tell you something useful, then take a look at the following chart, compiled by my colleague Pavel Anni using IBM's previous FDR:

IBM's TPC-C RESULTS

                              System p 570      System p 570           Change
                              (POWER5+)         (POWER6)
Number of CPUs:               8                 8                      0%
Number of Cores:              16                16                     0%
Number of Threads:            32                32                     0%
Amount of RAM, GB:            512               768                    50%
Frequency, GHz:               2.2               4.7                    114%
tpmC:                         1,025,170         1,616,162              58%
$/tpmC:                       4.42              3.54                   -20%
List price, server:           $3,407,192        $2,141,159             -37%
List price, complete:         $7,621,279        $9,419,485             24%
3-Year Maintenance:           $346,608          $567,162               64%
Discount:                     $3,423,619        $4,273,465             25%
Price, including 3-Year TCO:  $4,544,268        $5,713,181             26%
Database Software:            DB2 UDB 8.2       DB2 Enterprise 9       N/A
Availability Date:            May 31, 2006      November 21, 2007      N/A

At least IBM has given you something to compare the performance of the old and new processors, right? Think again. If that were IBM's benign intent, why did they add 50% more memory? Why did they use a different version of DB2? Customers looking to upgrade their POWER5+ systems to POWER6 have no idea how much of the 58% performance improvement was due to better performance of the new processor and how much was due to the larger memory configuration and newer software. And considering a 114% increase in clock rate (and a quadrupling of their level-2 cache sizes), why did they achieve only a 58% performance boost? And where are IBM's power consumption and cooling numbers? TPC-C does not require reporting such data, but I'd be willing to bet that POWER6's much-ballyhooed power reduction features were inactive for this run. Especially since their press release said that customers could choose higher performance or lower power consumption, but not both.
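The percentages in the chart, and the gap between the clock increase and the throughput increase, can be reproduced with a few lines. This is a rough sketch using only the numbers in the chart; "tpmC per GHz" is a back-of-the-envelope ratio of my own, not a TPC metric, and it cannot separate the effects of the extra memory and the newer DB2 release.

```python
def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


# Values from the chart above (old and new System p 570 results).
old = {"ghz": 2.2, "ram_gb": 512, "tpmc": 1_025_170}
new = {"ghz": 4.7, "ram_gb": 768, "tpmc": 1_616_162}

for key in ("ghz", "ram_gb", "tpmc"):
    print(f"{key}: {pct_change(old[key], new[key]):+.0f}%")

# Hypothetical illustration: throughput delivered per GHz of clock rate.
old_per_ghz = old["tpmc"] / old["ghz"]
new_per_ghz = new["tpmc"] / new["ghz"]
print(f"tpmC per GHz: {pct_change(old_per_ghz, new_per_ghz):+.0f}%")
```

On these figures the clock rate rises about 114% while tpmC rises about 58%, so throughput per GHz falls by roughly a quarter, even with 50% more memory in the newer configuration.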

So IBM (1) wasn't interested in running a real-world configuration on a modern-day workload, even though TPC-E is available; and (2) wasn't interested in giving customers a valid apples-to-apples comparison with the old benchmark. So why did they do this?

Giving customers useful information is not what this is about. If you don't believe me, here's what IBM Fellow Bruce Lindsay said (page 73): "Well, the benchmarking business is dirty work. The idea is to get the numbers by hook and by crook. ... [M]uch of what we're doing in the TPC-C realm these days is in a performance range that goes beyond what any user is doing." There's a word for this in the industry: benchmarketing. And IBM is the best in the business.

They understand that customers are hungry for a number, any number, however irrelevant, because it's much harder for a customer to take the time to benchmark his own code on a try-and-buy system or in a benchmark center. I can understand this; I used to be in the exact same situation in a previous job. I know how hard it is to be under the gun to finish a proposal or make a project deadline with no time to do the kind of rigorous performance bake-off that is truly necessary to make an intelligent purchasing decision. But IBM is well aware of your predicament too, has been taking advantage of it for years, and will continue to do so until their customers speak up.

TPC, TPC Benchmark C, TPC-C, tpmC are trademarks of the TPC. Please see www.tpc.org for more details.

Sunday Jun 03, 2007

POWER6 Goes Thud: Part I

Or, How Clark W. Griswold Wound Up With the Wagonqueen Family Truckster

A little over a month ago, Sun announced seven new Sun SPARC Enterprise Servers, along with new virtualization capabilities, new reliability/availability/serviceability features, breathtaking memory and I/O bandwidth, and new world records on seven performance benchmarks. Every one of the new servers -- entry level, mid-range, and high-end -- is shipping to customers today. Including, of course, the ones that have those new capabilities and set those performance records.

Can you imagine if we had issued a press release announcing:

  1. only one of the seven servers, say, the eight-way mid-range M5000
  2. extravagant performance over existing servers, providing as "proof"
  3. "ultra-high frequency" processors but absolutely no data about whether the machines will turn your datacenter into a puddle of molten metal
  4. a promise for a set of software features and virtualization technologies that exist today only in other people's operating systems
  5. absolutely no word on where the missing six servers were or when they would be available

We wouldn't have had the nerve to face our customers the next day. Apparently, IBM doesn't grapple with perception issues the same way we do, because what I just described is exactly what IBM foisted on the public two weeks ago when they gave us an IOU for a new POWER6™ product line.

Wagonqueen Family Truckster

I'm reminded of the scene in National Lampoon's Vacation in which the hapless Clark W. Griswold drives into Lou Glutz Motors to pick up his new Antarctic blue Sports Wagon with the C.B. radio and Rally Fun-Pack only to be told by the slimy salesman, Ed, that it won't be in for another six weeks. When Clark demands that his trade-in be returned immediately so he can take his business elsewhere, he discovers that it has been smashed pancake-thin in the scrapyard metal crusher. Knowing that Clark has nowhere else to turn, Ed convinces him that the metallic pea Wagonqueen Family Truckster (there are dozens of them, unsold, on the lot) is the car he really wants to take his family across country to Walley World.

I predict IBM will be forced to do the same thing by selling customers down-clocked versions of POWER6 for a long time (they announced 3.5GHz, 4.2GHz, and 4.7GHz parts, and benchmarked the 4.7s). If you look closely at the full disclosure report that IBM turned in with its TPC-C results, you will notice that it lists an availability date of November 21, 2007. IBM announced the POWER6-ized System p 570 on May 21, 2007. I'll do the math for you: that's exactly six months to the day in between. Would you like to take a stab at the absolute limit the Transaction Processing Performance Council places on the period of time between announcement of a result and the availability of a system? If you guessed "exactly six months", give yourself a pat on the back. Next question: what price does IBM pay if they don't make that November deadline and have to rescind the result? If you guessed "precisely $0.00", go have a congratulatory beer. But they will have gotten six full months of penalty-free hype, which is, after all, the point of running TPC-C in this day and age in the first place. In the meantime, IBM will try to sell customers a bunch of down-clocked Family Trucksters whose performance on real-world applications can only be guessed at.
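If you'd rather not take my word for the math, here is a small sketch that checks the gap between the two dates quoted above. The helper whole_months_between is a hypothetical name of my own, and the six-month limit is taken from the paragraph above rather than looked up in the TPC rules.

```python
from datetime import date

# Dates quoted in the paragraph above.
announced = date(2007, 5, 21)   # System p 570 (POWER6) announcement
available = date(2007, 11, 21)  # availability date listed in the FDR


def whole_months_between(start, end):
    """Count complete calendar months from start to end."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the final month has not been completed yet
    return months


print(whole_months_between(announced, available), "months")
print((available - announced).days, "days")
```

The two dates fall on the same day of the month, six calendar months (184 days) apart.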

Obviously, I can't say that IBM won't be shipping any 4.7GHz POWER6 systems in the next six months; I'm just saying that you probably won't get one. When a company has little to offer in the way of technology innovation except ratcheting up the processor's clock rate, it pushes the laws of physics into areas where it gets extremely low yields on those dies. These are of course sold at a huge price premium and then allocated only to the best customers, typically under an early-access program. The Griswolds out there will just have to wait to see the performance promised rather extravagantly by IBM's marketing department. And if IBM promises you a 4.7GHz system, how confident do you feel, based on their apparent lack of confidence, that they can deliver? Got a schedule you need to keep on your journey to Walley World?

When Sun's internal engineering aliases were abuzz with IBM's IOU-for-an-announcement, I sent out an e-mail predicting that IBM will follow standard Imperial procedure and dump their garbage before going to light speed. (I know, I'm mixing my movie metaphors. And only one person wrote back saying he got the reference.) But sometimes you just get the feeling that history is repeating itself.

I'll be taking a closer look at IBM's claims around performance, virtualization, AIX, and more in the following posts. In the meantime, don't let anyone try to sell you the Family Truckster ;-)



