No, there isn't a Santa Claus

I really don't like to use this blog for refuting IBM exaggerations about mainframes. That's the world I come from, and I have a lot of respect and nostalgia for it, but I'm too frequently drawn into pointing out distortions in press releases or marketing FUD. I'd really rather spend the infrequent time I have for this blog talking about Sun technology. There's so much great new stuff in Solaris, in our servers and storage products, and in our software stack, that it's just a nuisance to have to refute silly attacks on us.

But, once again into the fray. I receive Mainframe Executive magazine, and the May/June issue's closing column by Jon Toigo of Toigo Partners contained some incorrect statements that I just had to correct. The column said:

  • that LPARs support up to 1,500 virtual environments. The actual maximum is 60.
  • that z/VM and z/Linux make use of IBM's Sysplex, Workload Manager (WLM), and Intelligent Resource Director (IRD). No: those are usable only in a z/OS environment, and z/OS doesn't host virtual machines. That's z/VM's job, and not only does z/VM lack those features (which are frequently claimed to have magic properties!), it also carries substantial costs that affect the cost-per-server figures Toigo cited.
  • that z/Linux can use a feature called DFHSM to reduce disk space needs. No: that, too, is a z/OS-only feature.
  • that VMware systems (why were no others mentioned?) can only support about 20 guests on a high-end server. That's too low by a factor of 4 or 5 (Sun virtualization technologies like Solaris Containers and Logical Domains were not mentioned, alas). Besides, as I mentioned in my Don't Keep Your Users Hostage blog, user counts without reference to service levels are the wrong way to think about capacity.
If you stipulate that another platform can run only 1/4 of the work it can actually run, omit the very substantial costs of the other platform (z), believe grossly exaggerated claims about its capabilities, and fail to mention features of other platforms that provide comparable or superior capabilities that z cannot match (VMotion, anyone?), well, you're going to be a few orders of magnitude off.

I don't mean to pick on Mr. Toigo. I e-mailed him, and he said that he wanted to be accurate, and would contact IBM to verify facts. You can't ask for more than that from a journalist. I don't know if he'll come to see the light regarding the exaggerations I point out in the Ten Percent Solution (he is after all writing for a mainframe publication), but at least we can straighten out errors of indisputable fact - stuff you can look up in vendor manuals.

(This confusion about system features is very common, even among IBMers, because so many people think that "mainframe" implies "z/OS function set", when z/OS is only one of several operating systems that run on z. When you are not running an operating system, you don't get to use its features - for good or bad!).

All this was inspired by IBM's recent claims, which I have refuted at length on this blog. I won't repeat my points in full, because the material is here and here. But IBM makes the absurd claim that customers run database servers at 10% busy, and that through the magic wand of a few buzzwords you can run any collection of workloads at 90% CPU utilization, and that somehow you can only do this using features that exist only on z. All complete rubbish, including mistakes about which IBM products have which features. It's silly that I have to correct IBM employees about IBM technology.

Here's the comment I sent to the blog of IBMer Tony Pearson:


I'm sorry you didn't take the opportunity to challenge my blog, cited as "some might question, dispute or challenge this ten percent". That would have been a good time to expose errors, if they exist, in my refutation of IBM claims.

However, I see errors of fact in your blog:

(1) You say WLM and IRD make it possible to run mainframes at 90% utilization. This is impossible: z/VM and z/Linux do not implement these z/OS-only functions. See, for example http://publib.boulder.ibm.com/infocenter/eserver/v1r1/en_US/index.htm?info/veicinfo/eicartechzseries.htm or http://www-03.ibm.com/servers/eserver/zseries/zos/wlm/

(2) David Boyes' Test Plan Charlie ran no workload other than booting OS images. It cannot be used to extrapolate capacity for doing any actual work. It also used Linux kernel customizations to reduce overhead that you could not use "in real life".

(3) You say you can define z/VM LPARs in a Sysplex. Sysplex is a System z feature only available with z/OS, so what you suggest is impossible. You cannot use Sysplex for coordinating times or recovery with z/VM or z/Linux. z/VM only supports guest Sysplex within a single z/VM instance, and only for z/OS guests.

(4) Actual cost per IFL is $125,000, not $100,000, and that doesn't count the cost for RAM.

You are right in suggesting that you would have to add up actual software and hardware costs of both platforms for a fair comparison. I've done so, and even using IBM's "10% solution on x86, 90% busy on z" argument that I dispute, and the server counts at http://www-03.ibm.com/press/us/en/pressrelease/23592.wss, IBM's z costs about 7 times as much as the distributed solution IBM compares it to.

Your figures disagree with the IBM press announcement, which had claimed that 26 IFLs had sufficient capacity to do the job of 760 x86 cores (which is 380 servers, not 1,500). The page http://www-03.ibm.com/press/us/en/pressrelease/23592.wss used to have a footnote number 3 with the math, which has now been removed. In your analysis, a 64-CPU z10 E64 would be needed. That costs about $26 million, excluding RAM, disks, and software licenses. That is over 14 times more expensive than 1,500 x2100s (the Sun price includes the RAM and pre-installed OS). If the CPUs are configured as IFLs, then they cost $125,000 each, totalling $8 million. With the minimum RAM configuration of 160GB, it still costs 5.38 times as much as the x2100s.
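
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the comparison above. Two hedges: the $6,000/GB RAM price comes from later in this discussion, and the x2100 fleet price is back-solved from the 5.38 ratio rather than quoted from a price list, so treat the output as illustrative only.

    # Checking the cost ratios cited above (illustrative only).
    IFL_PRICE  = 125_000   # per IFL CPU, as cited above
    E64_CPUS   = 64
    RAM_GB     = 160       # minimum RAM configuration
    RAM_PER_GB = 6_000     # assumed $/GB, cited later in this discussion

    z_ifls     = E64_CPUS * IFL_PRICE          # $8.0M for 64 IFLs
    z_with_ram = z_ifls + RAM_GB * RAM_PER_GB  # ~$8.96M with minimum RAM

    x2100_fleet = z_with_ram / 5.38            # back-solved fleet price, ~$1.67M
    print(f"64 IFLs: ${z_ifls/1e6:.2f}M; with 160GB RAM: ${z_with_ram/1e6:.2f}M")
    print(f"implied 1,500-server x2100 fleet: ${x2100_fleet/1e6:.2f}M")
    print(f"z10 E64 at ~$26M is {26e6/x2100_fleet:.1f}x the fleet")  # 'over 14 times'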

I will address several other errors and points of contention in my blog. The most important mistake, though, is the implication that it is hard to consolidate or virtualize servers on x86 (or SPARC) servers at high utilization, or simply to share assets among production, test, and disaster recovery for reduced costs. Nobody need run at 10% busy, or pay a high premium to get higher utilization. If the x86 servers are managed for higher utilization, far fewer will be needed, and the price difference will be even higher.

(end here)

That's the comment I placed on the blog. It will be interesting to see what response it generates, if any. It's very interesting to me that IBM removed the "justification" in footnote number 3 that used to exist on the announcement page I referred to above. Also interesting, in my previous blog I linked to another IBM blog which had the "basis" of their claims. That blog has also had the content I referred to expurgated! Curiouser and curiouser!

There are other mistakes in the blog: There is no rule of thumb saying you can reduce by 15% the capacity needed to run consolidated workloads. I'm interested in learning where that came from. I should mention again that IBM's LSPR figures don't run the RPE benchmark anyway - they run proprietary IBM benchmarks, so all of the IBM projections are specious reasoning, comparing their servers running one workload to other servers running different workloads.

I did enjoy the part where he talks about the poor scalability of System z. That part was accurate. Sun's high end SPARC servers not only are "bigger" in every capacity metric than z10, but they also are much better at vertical scale - we don't suffer from the problem he describes. That's why IBM doesn't produce LSPR figures for systems above 32 CPUs, and truncates results for "Single Image" capacity before that. They just don't scale as well. That's another reason not to believe these projections: they assume they understand the scalability of the application as more CPUs are added! The original IBM press release used fuzzy math based on linear scaling - which System z doesn't achieve (as the IBM blog says).

I guess I should mention that Sun uses NPIV too, and that Sun also has hierarchical storage management capabilities. ZFS, a free feature of the Solaris 10 operating environment, also provides on-disk data compression to reduce disk space needs.

The 90% fallacy

In fact, the whole "90% utilization" premise is completely flawed, regardless of vendor.

Let's reason this through: if you have 90% average utilization, you probably have periods both higher and lower than 90%, unless you are lucky enough to have absolutely static, predictable resource demand. That rarely happens, especially with interactive or on-line systems, which is the kind of workload under discussion.

You do not want to run on-line systems at close to 100% busy, due to queueing effects that would raise response times regardless of platform. Queueing theory applies to everybody!
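
To make the queueing effect concrete, here is a minimal sketch using the textbook M/M/1 response-time formula. Real workloads don't arrive this tidily (more on that in the comments below), so take it as an illustration of the shape of the curve, not a sizing tool.

    # Mean response time for an M/M/1 queue: R = S / (1 - U),
    # where S is service time and U is utilization.
    def mm1_response(service_time: float, utilization: float) -> float:
        assert 0 <= utilization < 1, "at U >= 1 the queue grows without bound"
        return service_time / (1.0 - utilization)

    S = 0.010  # 10 ms of service time per request (illustrative)
    for u in (0.50, 0.70, 0.90, 0.95, 0.99):
        print(f"U = {u:4.0%}: mean response time = {mm1_response(S, u)*1000:6.1f} ms")

At 50% busy the request takes 20 ms; at 90%, 100 ms; at 99%, a full second. Pushing utilization from 90% to 99% buys 10% more throughput at the price of a 10x increase in response time.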

The only way you can sustain very high utilization with a single workload is if that workload has predictable characteristics and runs on a platform with enough capacity to serve it. No workload manager in the world helps you run on systems that don't have enough capacity to meet service level objectives. Workload managers shift resources between workloads to meet service level objectives. When there is only one application, or insufficient capacity to meet service levels, there's nothing a workload manager can do.

On the other hand, when you are consolidating workloads, you can run at extremely high utilization levels only if the different workloads have different service levels and priorities that permit you to starve the lower-priority workloads while giving preferential service to the high-priority ones. If the aggregate capacity requirements of the high-priority workloads are less than the server capacity, and you're willing to starve low-priority workloads (we sometimes call these "cycle soakers", since they soak up whatever capacity is left over), then you can run your systems at or near 100% busy.

There's absolutely no magic to this, and nothing that makes it possible on one platform and impossible on another. As long as you have a resource manager - Solaris Resource Manager on Solaris (there is no technical obstacle to running Solaris systems at high utilization; Sun runs compute farms for circuit design close to 100% busy for months at a time), System Resource Manager on z/VM (same initials!), or Workload Manager (WLM) on z/OS - you can run flat out. (Linux doesn't seem to have anything suitable.) But you also require a workload you can shed or run slower if needed, and a way to throttle it while running more important work. (The real difference here is that mainframe systems have a tradition of being run fully loaded, due to their high acquisition costs, while distributed systems have less economic pressure, and frequently are purchased by individual lines of business who didn't want to share!)

Even this is an over-simplification, as "capacity" consists of many facets (CPU, I/O bandwidth, I/O operations per second, network bandwidth and latency) - many applications, such as the OLTP example touted by IBM, are more likely to be I/O bound than CPU bound. Throwing more CPU capacity at such systems is just a way to go idle sooner, and they naturally run I/O bound no matter what you do!

The moral of the story: don't believe the hype. High utilization isn't the answer to all questions. (The right question is "how do I minimize the cost of computing - acquisition costs, operational costs, energy, real estate, staff, and license costs - while maintaining service levels and meeting business needs?") High utilization is helpful, of course - idle machines are an expense. But you can only get "close to 100%" with fortunate combinations of workload and business priorities.

Finally, there is no magic wand on mainframes that makes it possible to run them at 90% busy for any old workload. And nobody needs to run their production databases at 10% busy on distributed systems. Customers frequently stack production, test, development, QA, and disaster recovery on the same machines to reduce server counts. If you choose to manage to higher utilization, there is no reason to run with the 1,500 servers outlined in Pearson's blog: 125 production machines at 70% busy might reasonably be left alone, but the 125 backup servers could easily be consolidated with the grossly-too-many 1,250 test machines running at 5% busy. Nobody needs to set up so many almost-idle machines for test, development, and QA.
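
To put rough numbers on that last point, here is a sketch of the aggregate demand in the scenario above, treating the 125 backup servers as essentially idle and deliberately ignoring headroom:

    # Aggregate demand behind the "1,500 servers", using the figures above.
    prod = 125 * 0.70    # 87.5 server-equivalents of production work
    test = 1250 * 0.05   # 62.5 server-equivalents across test/dev/QA
    print(f"aggregate demand: {prod + test:.1f} server-equivalents")

About 150 servers' worth of actual work is spread across 1,500 boxes. Ordinary consolidation, not exotic hardware, is what removes the other 1,350.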

Comments:

I can't thank you enough for pointing out the inaccuracies in the Mainframe Executive piece. I feel like I (and everyone else who read IBM's press releases and chatted with their knowledge folk about the virtualization-via-mainframe stuff) might have been had. I have requested formal clarification from IBM (via analyst relations and Tony P.) and will print the straight scoop when received. I have also blogged about it.

Frankly, I have had lots of issues with VMware on servers in our labs and am now testing several alternatives, including Virtual Iron and Citrix. I look forward to testing Sun's wares too. It stuck in my head, when I read IBM's announcement, that the cost to virtualize x86 in a rock-solid multi-processor mainframe environment (where I cut my teeth) was significantly lower than that of doing it on x86 with extents. Adding in my background with mainframes (pre-z/OS), it sounded like a much more resilient and resource-efficient approach as well. I suspect that, for all the marketecture, there is still some truth in these assumptions.

I do appreciate your clarifications and I really hate to disseminate erroneous information. Thank you!

Posted by Jon Toigo on June 15, 2008 at 09:04 AM MST #

Jon, thanks indeed for the kind comments. I really appreciate that and your interest in getting facts straight.

I'll be glad to work with you on this, and will suggest, via private email, experts you can use for corroboration (they already write for the mainframe magazine, and are in favor of z/Linux - I like and respect them anyway :-) - so you don't have to worry about them having anti-IBM bias!) I want everything to be on the up-and-up and based on facts.

Some possible confusion I'd like to correct: IBM isn't claiming they virtualize x86 machines (e.g., run Intel machine code on z). That is technically possible, but would be horrifically slow. Instead, they say they can replace 1,500 x86 servers - but the workloads would run using native z code, not x86 - a claim I've vigorously disputed. Nonetheless, I don't want you to think IBM claimed something they didn't. Fair is fair. I'm also not sure what you mean by "extents" - maybe later you can tell me. Also, the thing about DFHSM is that it hasn't been ported +from+ z/OS to z/Linux; z/OS is the only OS that runs it - that's why it doesn't apply to z/VM and z/Linux.

I'm sorry you had difficulties with VMware - many people have done very well with it. It's also true that crowding too many users onto a machine isn't the right answer, as you say in your blog. See my "Don't keep your users hostage" blog entry. We're in sync on this.

I think you are wise to try multiple virtualization technologies. VMware opened up a market that is rapidly growing. I hope you take a look at Sun xVM Server, a new x86 hypervisor which will be released in a few months, and will be a great new alternative for x86 virtualization. Or look at Sun VirtualBox, a developer virtual machine product which we already provide for free download!

thanks again, Jeff

Posted by Jeff Savit on June 15, 2008 at 11:33 AM MST #

1) You can speed up and slow down a batch workload to keep interactive sessions happy. Without the batch workload, it is not so easy. (Unless you're a timelord?)

2) Departmental servers don't have as much pricing pressure? That's a good one. That helps the small-server economy, and glosses over the likelihood that you can't run any two apps on the same OS image? Lots of copies of the OS don't hurt the OS vendor either.
Instead of properly managing app and library dependencies, just build a VM for each.

Posted by Robert Clark on June 16, 2008 at 05:29 AM MST #

Hi Robert, and thanks for the comment.

To your points:

(1) I agree, limited by the constraint that batch work often has hard "must complete" times, so you can't infinitely defer that, either. But yes: batch is much more tolerant of delay than on-line - pretty much by definition! It can be a valuable source of wiggle-room...

(2) Price pressure is relative: the cost of a small server is small compared to a large one (especially a mainframe!), and often within the purchasing authority of a department manager (also, vendors are subject to a lot of price competition in this space - so you get commodity prices with lower margins). That makes it relatively easy for people to buy them for their own departments, which can lead to server sprawl and low usage due to there being only a single "user". Consolidation and sharing, frequently assisted by virtualization, is the way to reduce server sprawl.

For Sun, and Solaris, there certainly is a darned good likelihood that you can run multiple instances of the same app on the same OS image, even without virtualization. Unix was a multi-user system from its inception! :-) Virtualization does make it easier: Solaris Containers is an excellent way of doing this, providing separate virtual environments in the same Solaris instance. But, you're correct in your comment that virtual machines make it possible for you to host multiple applications on a box without aligning their app and library (or even OS version) dependencies. In a perfect world every dependency would be neatly lined up - but that isn't always possible, and virtual machines are a valuable way to get around that.

thanks again for posting. Jeff

Posted by Jeff Savit on June 16, 2008 at 06:20 AM MST #

ROFL. How many machines in use in your average Windows shop have been added because the vendor-supplied admin interface running on them requires one very specific and obsolete... wait for it.. version of JAVA?

Don't trust a Dentist that gives your children candy?

Posted by Robert Clark on June 16, 2008 at 11:31 AM MST #

Oh, well - Windows is a special case, eh? :-) (And why can't the different apps run with different paths and CLASSPATHs?) That works on Windows, too! But yes, for sure virtual machines are a great way to get around conflicts in the software stack.

I also take your point about potential conflict of interest... it's important to make sure that the vendor's interest is aligned with the customer's. It should not be an adversarial thing where one is working to the detriment of the other.

thanks again for posting, Jeff

Posted by Jeff Savit on June 16, 2008 at 12:07 PM MST #

For some reason my last post came through anonymously. I did not intend for it to. I am Joe Temple. jliitemp@us.ibm.com

Posted by Joe Temple on June 23, 2008 at 07:21 AM MST #

Jeff, your post is rather long and rather than build a point by point discussion too long for a single comment I will put up several comments.

Starting with the moral of the story:

There are several:
• "Use open, standard benchmarks, such as those from SPEC and TPC."

Better to use your own. They have not been hyper tuned and specifically designed for. They have a better chance of representing reality. But be careful not to measure wall clock time on “hello world” or lap tops will beat servers every time.

• "Read and understand what they measure, instead of just accepting them uncritically."

Yes, particularly understand that the industry standard benchmarks run with low enough variability and low thread interaction that it makes sense to turn on a hard affinity scheduler.
Your workload probably does not work this way.

•"Get the price-tag associated with the system used to run the benchmark."

Better to understand your total costs including admin, power, cooling, floorspace, outages, licensing, etc.

• "Relate benchmarks to reality. Nobody buys computers to run Dhrystone."

Only performance engineers run benchmarks for a living.

• "Don't permit games like "assume the other guy's system is barely loaded while ours is maxed out". That distorts price/performance dishonestly."

Understand what your utilization story is by measuring it. Don’t permit games in which hypertuned benchmarks with little or no load variability and low thread interaction represent your virtualized or consolidated workload. Understand the differences in utilization saturation design points in your IT infrastructure and what drives them.

• "Don't compare the brand-new machine to the competitor's 2 year old machine"

Understand what the vintage of your machine population is. When you embark on a consolidation or virtualization project compare alternative consolidated solutions, but understand that the relative capacity of mixed workload solutions is not represented by any of the existing industry standard benchmarks.

• "Insist that your vendors provide open benchmarks and not just make stuff up."

Get underneath benchmarketing and really understand what vendor data is telling you. Relate benchmark results to design characteristics. Characterize your workloads. (Greg Pfister's In Search of Clusters and Neil Gunther's Guerilla Capacity Planning suggest taxonomies for doing so.) Understand how fundamental design attributes are featured or masked by benchmark loads. Understand that ultimately standard benchmarks are “made up” loads that scale well. Learn to derate claims appropriately, by knowing your own situation. (Neil Gunther's Guerilla Capacity Planning suggests a method for doing so.)
• "Be suspicious!"

Be aware of your own biases. Most marketing hype is preaching to the choir. Do not trust “near linear scaling” claims. Measure your situation. Don’t accept the assertion that the lowest hardware price leads to the lowest cost solution. Pay attention to your costs, and don’t mask business priorities with flat service levels. Be aware of your chargeback policies and their effects. Work to adjust when those effects distort true value and costs.

Posted by Joe Temple on June 23, 2008 at 11:48 PM MST #

Jeff said,"There are other mistakes in the blog: There is no rule of thumb saying you can reduce by 15% the capacity needed to run consolidated workloads."

Actually, sometimes it's more than 15%. I would point out here that the fundamental claim about utilization is based on measured data and solid system theory. When you carve work up into N equal pieces to run on distributed capacity, the headroom required to meet the SLA grows by the square root of N. There is considerable variation in the amount of headroom required by a single machine, because workloads have different “coefficients of variability” (standard deviation/mean). It is also true that many individual workloads are spiky; this means that the coefficient of variability is high. As you consolidate more and more work, the average utilization goes up and the coefficient of variability goes down. The variability of load is one of the key factors in deciding whether to consolidate or distribute work. Also, consolidation of work can lead to efficiencies in interconnection, which cuts down on the CPU used to communicate between parts of the solution. The efficiencies come in the form of reduced lock retention time and improved buffer latencies, which save CPU beyond the improved latency of communicating internally as opposed to over an external connection.
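
A minimal simulation makes the effect concrete: the coefficient of variability of the sum of N independent workloads falls roughly as 1/sqrt(N), so the consolidated system needs proportionally less headroom. The distribution and parameters here are illustrative only.

    import random, statistics

    def cov_of_sum(n_workloads, samples=10_000):
        """CoV (stddev/mean) of aggregate demand for n independent workloads."""
        totals = [sum(random.lognormvariate(0, 0.8) for _ in range(n_workloads))
                  for _ in range(samples)]
        return statistics.stdev(totals) / statistics.mean(totals)

    for n in (1, 4, 16, 64):
        print(f"N = {n:3d}: CoV of aggregate ~ {cov_of_sum(n):.2f}")

The CoV drops by about half for every 4x increase in N - the square-root-of-N effect.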

Posted by Joe Temple on June 24, 2008 at 12:57 AM MST #

Jeff said, "I did enjoy the part where he talks about the poor scalability of System z. That part was accurate. Sun's high end SPARC servers not only are "bigger" in every capacity metric than z10, but they also are much better at vertical scale - we don't suffer from the problem he describes."

I think Jeff would have to agree that IBM’s p6 Power 595 “scales well”. Originally z10 was going to use the p6 memory/cache nest (the principal design element for scaling). On the LSPR workloads, that version of the design scaled WORSE than the present design. This is not a matter of the M9000 scaling well and the z10 not. It is a matter of the workloads used to illustrate scaling. For highly parallel queries and cpu intense analytics the large UNIX machines scale very well, particularly when the codes are “NUMA aware”.

Mixed workloads behave differently and have high amounts of thread interaction caused by sharing the hardware. They scale more like the LSPR load. I would refer you to two excellent books: In Search of Clusters by Gregory Pfister, and Guerilla Capacity Planning by Neil Gunther. Pfister points out that workloads fall into regions called parallel hell, parallel nirvana, and parallel purgatory. Careful examination of machine designs and benchmark definitions will show that the industry standard benchmarks fall largely in parallel nirvana and parallel purgatory. Large UNIX machines tend to be designed for these benchmarks and so are particularly well suited to parallel purgatory. Clusters of distributed systems do very well in parallel nirvana. The mainframe resides in parallel hell, as do its primary workloads. The current confusion is where virtualization takes workloads, since there are no good benchmarks for it. My guess is that when the dust settles things like SPECvirt will not “scale well” on any machine. Gunther points out that vendor claims of near linear scaling are not to be trusted and shows a method to “derate” scaling claims. His suggested scaling values for database servers are closer to LSPR-like scaling than to how TPC or SPEC benchmarks scale.

Posted by Joe Temple on June 24, 2008 at 01:28 AM MST #

This format is very difficult for parry and riposte, but let's try. I would like to use different colors, but I can't (AFAIK) put in HTML markup to permit that. So: Joe's stuff verbatim within brackets, with each of his sections starting with a quote of a sentence of mine (which I identify, within quotes) for context. Each stanza is identified by name and employer (this is Jeff speaking):

Joe(IBM): [[[Jeff, your post is rather long and rather than build a point by point discussion too long for a single comment I will put up several comments. Starting with the moral of the story: There are several: • quoting Jeff: "Use open, standard benchmarks, such as those from SPEC and TPC."

Better to use your own. They have not been hyper tuned and specifically designed for. They have a better chance of representing reality. But be careful not to measure wall clock time on “hello world” or lap tops will beat servers every time.]]]

Jeff(Sun): In a perfect world, every customer would have the opportunity to test their applications on a wide variety of hardware platforms to see how they perform. But they don't, and they rely on open standard benchmarks to give them some information about how the platforms would perform. Or, they do have applications they could benchmark, but they're non-portable, or run solely on a single CPU (making all non-uniprocessor results worthless), or otherwise have poor scalability or any of a hundred other problems. Imagine comparing IBM processors based on the speed of somebody writing to tape with a blocksize of 80 bytes! Even if they get a useful result, the next customer doesn't benefit at all and has to start from scratch. It's not trivial to make good benchmarks that aren't flawed in some way. That's why the benchmark organizations exist - to provide benchmarks that characterize performance and give a level playing field for all vendors. IBM, Sun, and others are active in them - our employers must think they have value. Obviously there is "benchmarketing" and misuse of benchmarks. THAT is what I'm railing against. Hence, my following bullet that says "read and understand". But frankly, benchmarks like SPECweb, SPECwebSSL, and SPECjvm, the SPEC fileserver benchmarks, and TPC.org's TPC-E provide representative characterizations of system performance (with sad exceptions like TPC-C, which is broken and obsolete, but which IBM still uses for POWER). A lot of people have worked very hard to make them as good as they are. IBM uses these benchmarks all the time - with the notable exception of System z. That's the point, isn't it? System z works in a monopoly-priced marketplace where it doesn't have to compete on price/performance, as IBM must with its x86 and POWER products. Where else are you going to run CICS, IMS, and JES2? To the second observation about wall clock time on trivial applications: yes, obviously.

Joe(IBM): [[[quoting Jeff: •"Read and understand what they measure, instead of just accepting them uncritically."
Yes, particularly understand that the industry standard benchmarks run with low enough variability and low thread interaction that it makes sense to turn on a hard affinity scheduler. Your workload probably does not work this way.]]]

Jeff(Sun): I'm not sure what's intended by that. Are you claiming that benchmarks should be run against systems without fully loading them, to see what they can achieve at maximum loads? Hmm. Anyway, see below my comments about low variability and low thread count - which apply nicely to IBM's LSPR.

Joe(IBM): [[[quoting Jeff: •"Get the price-tag associated with the system used to run the benchmark." Better to understand your total costs including admin, power, cooling, floorspace, outages, licensing, etc.]]]

Jeff(Sun): That's what I meant.

Joe(IBM): [[[quoting Jeff: • Relate benchmarks to reality. Nobody buys computers to run Dhrystone." Only performance engineers run benchmarks for a living.]]]

Jeff(Sun): Sounds like a dog's life, eh? OTOH, they don't have users...

Joe(IBM): [[[quoting Jeff: •"Don't permit games like "assume the other guy's system is barely loaded while ours is maxed out". That distorts price/performance dishonestly." Understand what your utilization story is by measuring it. Don’t permit games in which hypertuned benchmarks with little or no load variability and low thread interaction represent your virtualized or consolidated workload. Understand the differences in utilization saturation design points in your IT infrastructure and what drives them.]]]

Jeff(Sun): Your comment has nothing to do with what I'm describing. What I'm talking about is the dishonest attempt to make expensive products look competitive by proposing that they be run at 90% utilization, while the opposition is stipulated to be at 10%, and claim magic technology (like WLM, which z/Linux can't use) to permit higher utilization and claim better cost per unit of work on your own kit. That's nothing more than a trick to make mainframes look only 1/9th as expensive as they are. Imagine comparing EPA mileage between two cars by spilling 90% of the gas out of the competitor's tank before starting. As far as "no load variability and low thread interaction", I suggest you take a good look at IBM's LSPR. See http://www-03.ibm.com/servers/eserver/zseries/lspr/lsprwork.html which describes long running batch jobs (NO thread interaction at all) on systems run 100% busy (NO load variability). The IMS, CICS (mostly a single address space, remember), and WAS workloads in LSPR should not be assumed to be different in this regard either. This doesn't make LSPR evil: it is not - it's very useful for comparisons within the same platform family. But consider SPECjAppserver, which has interactions between web container, JSP/servlet, EJB container, database, JMS messaging layer, and transaction management - many in different thread and process contexts. I suggest you reconsider your characterization about thread interaction. Complaints about thread interaction and variability of load are misplaced and misleading.

Joe(IBM): [[[quoting Jeff: •"Don't compare the brand-new machine to the competitor's 2 year old machine" Understand what the vintage of your machine population is. When you embark on a consolidation or virtualization project compare alternative consolidated solutions, but understand that the relative capacity of mixed workload solutions is not represented by any of the existing industry standard benchmarks.]]]

Jeff(Sun): We're talking at cross purposes. What I mean is that one vendor's 2008 product tends to look a lot better than the competition's 2002 box, making invidious comparisons easy. Moore's Law has marched on.

Joe(IBM): [[[quoting Jeff: • "Insist that your vendors provide open benchmarks and not just make stuff up."
Get underneath benchmarketing and really understand what vendor data is telling you. Relate benchmark results to design characteristics. Characterize your workloads. (Greg Pfister's In Search of Clusters and Neil Gunther's Guerilla Capacity Planning suggest taxonomies for doing so.) Understand how fundamental design attributes are featured or masked by benchmark loads. Understand that ultimately standard benchmarks are “made up” loads that scale well. Learn to derate claims appropriately, by knowing your own situation. (Neil Gunther's Guerilla Capacity Planning suggests a method for doing so.)]]]

Jeff(Sun): This is not the "making stuff up" that I was referring to. I was referring to misuse of benchmarks in the z10 announcement, which IBM was required to redact from the announcement web page and the blogs that linked to it. I'm not arguing against synthetic benchmarks that honestly try to mimic reality, I'm arguing against attempts to game the system that I discussed in my "Ten Percent Solution" blog entry.

Joe(IBM): [[[quoting Jeff: • "Be suspicious!" Be aware of your own biases. Most marketing hype is preaching to the choir. Do not trust “near linear scaling” claims. Measure your situation. Don’t accept the assertion that the lowest hardware price leads to the lowest cost solution. Pay attention to your costs, and don’t mask business priorities with flat service levels. Be aware of your chargeback policies and their effects. Work to adjust when those effects distort true value and costs.]]]

Jeff(Sun): With this I cannot disagree. That's exactly what I have been discussing in my blog entries: unsubstantiated claims of "near linear scaling" to permit 1,500 servers to be consolidated onto a single z (well, the trick here is to stipulate that 1,250 of the 1,500 do no work!) or to ignore service levels (see my "Don't keep your users hostage" entry). I'll also add "beware of the 'sunk cost fallacy'": you shouldn't throw more money into using a too-expensive product that has excess capacity because you've already sunk costs there.

Posted by Jeff Savit on June 26, 2008 at 08:41 AM MST #

(Housekeeping: this is for Joe Temple's June 24, 2008 at 10:57 AM EDT entry about reducing CPU by 15%. Since there are only two blocks of text, I can respond without the tedious pseudo-markup I used previously -- Jeff)

I understand the queueing theory. It even is intuitively clear: at a given moment, one node might be saturated while another is idle; on a shared system those wasted CPU cycles can be used. Old news. In this instance it serves the same purpose as the frictionless surfaces we stipulated in freshman physics classes: it describes ideal conditions that permit simplified math (like F=M*A), but might not actually exist in real life. Some 30 years ago I had the privilege of studying systems architecture and performance in classes taught by the visionary Hal Lorin, at the time still at IBM. See http://www.research.ibm.com/journal/sj/351/books.pdf for a description of just one of his books. One of his remarks that sticks in my head: "If I *knew* that the workload had Poisson arrival and exponentially distributed service times, then I wouldn't have to do any of this messy performance work." In other words, simple queueing model concepts are inadequate in the face of workloads that don't arrive randomly at an M/M/1 queue, and are implemented on real systems that can have sharp discontinuities in performance as load increases. In over 20 years as an IBM customer, consolidating VM/CMS, VSE, MVS, and OS/390 workloads from smaller onto larger systems, I never had an IBM representative suggest that I sum the compute requirements of the workloads and reduce them by 15% when sizing a consolidated replacement system. I would have taken that as malpractice if they had.

Remember that we're talking about an imaginary workload, which has not been characterized at all. It has no known standard deviation because it's purely hypothetical. We don't know the variability of load. Saying "15% reduction" is a claim with no basis. We only know that there are 1,500 servers where 125 are 70% utilized (with no further details about what that means) and an additional 1,375 servers do essentially nothing, for the risible consolidation scenario in which the consolidated system uses z/OS workload management features that are non-existent in the proposed z/Linux under z/VM environment. What we're doing in this particular back-and-forth is quibbling over a minor detail, when the major distortion lies elsewhere. To take an uncharacterized workload and suggest you can lop off 15% is inappropriate.

An equally possible scenario is that the 70% average utilization includes peak periods at 100%, and that the systems don't deliver sufficient capacity to meet the peaks (I used to work with mainframe systems that had ample capacity all day long: except at open and close of the New York stock exchanges, at which times they were saturated. No 15% rule would have applied) Consolidation onto fewer, larger, machines would permit higher throughput at peaks. That's an argument in favor of consolidation (which would be done much more easily by simply moving from small 1RU servers to more powerful ones of the same architecture than porting to a different platform, but that's another topic) - but it would result in a net +increase+ in consumed capacity, not a reduction.

The problem is not just with the workload - it's also with the reality of computer systems, which don't work like simple queues of customers at a bank teller. As you rightly pointed out in your previous comment, one should never take claims of linear scalability as givens. Look again at LSPR, where different workloads exhibit different levels of scalability. I'm looking at single-image workloads where 32 z10 CPUs do only 17 to 18 times as much work as a single z10 CPU (and no single-OS results are published over 32 CPUs). In real life, as you well know, there are many places in complex computer systems where you don't get linear scale. On System z, you don't add L2 cache capacity as you add CPUs, so you will experience decreased hit ratios and increased memory latencies if cache working sets exceed the L2 cache per book. NUMA properties become more important if the workload spans a book. This is an area IBM is beginning to address with HiperDispatch (VM has had per-CPU dispatch queues (the Processor Local Dispatch Vector) since VM/XA, but I doubt either it or z/Linux has support equivalent to HiperDispatch.) Solaris is several years ahead of IBM on programming board-level NUMA awareness into its OS, but at least IBM's made a start. On the OS level there is contention for locking data structures on a shared system that doesn't exist when there are multiple OS instances, so more time may be spent in spinlocks instead of doing useful work.

Basing this on CPU is also naive and a mere distraction: the hypothetical x2100s at their minimum configurations have the same RAM as the maximum possible z10 configuration. If configured reasonably, they'll have several times as much RAM as the z10 at a tiny fraction of the cost. Good luck trying to run databases (the hypothetical workload), even low-utilization instances, on the z10 with much less memory than on the original systems. Estimates of a 15% reduction in CPU needs will go out the window when the consolidated system is thrashing. Besides memory, on a very prosaic level, there are disk areas that get increasingly hot as you add load, increasing latencies measured in milliseconds rather than microsystems: the JES2 checkpoint area, VTOCs, catalogs, VSAM clusters, the VM directory area.

Let's recall that this is a z/VM+z/Linux proposal, so let's look at its scalability. Have a look at http://www.vm.ibm.com/perf/reports/zvm/html/24way.html#FIG24WETR which shows decreased ETR and increased CPU per transaction going from 16 to 24 CPUs. Throw the "15% reduction in CPU" out the window. It's simply wrong.

Crikey! A 16 way z990 doing only 1322 web transactions per second, and a 24-way doing under 1,000! Any single contemporary 1RU x86 or SPARC server will dramatically outperform that at a tiny fraction of the cost, floor space, and environmentals! Obviously, the right answer is to consolidate multiple z990s onto 1RU, low-cost Sun servers.

Posted by Jeff Savit on June 28, 2008 at 04:20 AM MST #

Housekeeping (1) typo correction: my prior comment had "microsystems" where "microseconds" was intended. Sorry.
(2) Joe Temple's comment at 6/24/08 8:28:03 AM on the topic of scalability is vertically far above this comment, so I'll use the annoying fake markup for interleaved comment and response as I did a few posts ago. It's annoying to compose, and probably annoying to read, but it's hard to connect the comment and response with this forum (AFAIK). By the way, for anybody who might be reading this besides me and Joe (is anyone?): Joe Temple is an IBM Distinguished Engineer, and in my opinion a person who has earned respect. I strongly disagree with him on a variety of points, and will make my points vigorously (I've also seen some of his other public statements that I do agree with), but that disagreement should not be construed as disrespect for him.

Joe(IBM): "I think Jeff would have to agree that IBM’s p6 Power 595 “scales well”.

Jeff(Sun): I stipulated months ago in my blog that I expect POWER scales better than z. It has to: it competes in non-monopoly conditions, unlike the mainframe (again: where else can you run CICS?). In fact, I even saw an application benchmark in which the supposedly "mid-frame" IBM System i (what was once called the AS/400) beat the pants off IBM's System z (the embarrassing results were withdrawn and can no longer be found...). But I digress. I already made the statement that I expect p > z. On the question of whether it "scales well" in comparison to its competitors, I defer to my colleague John Meyer, who has done great work debunking System p, which has tripled clock rate in the last 3 years with only a 2x performance improvement, and has numerous single points of failure. It's unsurprising that a design for POWER would not work the same for z, as they have substantial architectural differences. I agree with Joe that memory and cache design is a key component of scale, but IMO that's one area in which z sharply falls short. z10 is better than previous designs, which I considered underprovisioned in this regard, but not enough to be competitive. The operating system is another essential aspect, which I'll touch on in a moment.

Oh, in case anyone forgets: it's not just raw performance or scale, it's also price. IBM Z operates under monopoly economics conditions. It is a big cash cow for IBM, with enviable margins. It is several times as expensive as other IBM products (and their competitors, like Sun) in the highly competitive Open space, as well as not scaling as well. While I think you'll do best buying product from Sun :-) just go ahead and price competitive products from IBM or other non-Sun companies. I know of an electronics company shifting from Z to AIX, and a Korean company that already switched from Z to the same number of HP Integrity boxes, in both cases for massive savings. That's TCO, not just purchase price. Ultimately, it's not about tech details. It's about the money. Z is expensive.

Joe(IBM): "This is not a matter of the M9000 scaling well and the z10 not. It is a matter of the workloads used to illustrate scaling. For highly parallel queries and cpu intense analytics the large UNIX machines scale very well, particularly when the codes are “NUMA aware”."

Jeff(Sun): Actually it is. First of all: IBM can run on System z the same supposedly "highly parallel queries and cpu intense analytics" it runs on POWER and x86. There's nothing stopping them from doing so (in fact, I'm sure they have but don't publish the results). If the Z platform scaled as well, IBM would brag about it. Yet there is only silence. As I mentioned before, several of the standard benchmarks are database bound, or have heavy thread interaction, so the workload characterization is inaccurate. Not everything is simplistic SpecInt. IBM sells z/Linux for web, database, and file server environments, yet refuses to provide public benchmarks of z9 or z10 running web, database, or file servers even though universally accepted (including by IBM for other products) benchmarks are available to model those behaviors. There's nothing NUMA-aware in Spec Java benchmark code. And if Oracle or MySQL is NUMA aware on Sun or IBM POWER, but not on IBM Z, then that's a liability of Z, not something to make excuses for.

It's not only the workload that separates an E25K or M9000 from a z9 or z10. There are important architectural differences that let a Sun E25K or M9000 scale better than z: not least of which is adding cache and I/O capacity as you add CPUs, while z's cache and I/O capability doesn't grow as CPUs are added. Most important is Solaris - Sun's crown jewel - which has had years of development effort to make it scale linearly as you add CPU, I/O, and RAM. It came as a shock to me when I joined Sun, but Solaris scales better than z/OS. We can benchmark systems with more than 32 CPUs in a single OS instance. IBM can't. And the hardware scales more, too. Time to reevaluate long-held assumptions about what is "big iron". It's not our fault that IBM's Z lags Sun in scale and NUMA awareness (a reality of all high-end computer systems). z/OS is a latecomer to the world of really large memories and large numbers of CPUs - which is the future for all of us. Solaris is several years ahead in this regard, and has demonstrated it on systems with over 100 CPUs. And z/VM lags badly in this regard (alas! I still have a warm spot for VM). My previous comment quoted an IBM web page showing negative scale at only 20 or so CPUs. That's simply not competitive, and of utmost importance when a z/Linux workload is being discussed, since z/Linux images run under z/VM.

Joe(IBM): "Mixed workloads behave differently and have high amounts of thread interaction caused by sharing the hardware. They scale more like the LSPR load."

Jeff(Sun): I will again point out that many of the LSPR workloads have no thread interaction, and many standard benchmarks (that Sun runs, and IBM runs except on Z) have high thread interaction. Standardized benchmarks from SPEC.org or TPC.org for large DSS, OLTP, Java application server, web, and file server loads also have high thread interaction. In fact, the Java workloads in LSPR (including z/Linux) could easily be replaced by SPEC's. Certainly, there's no excuse for not running open benchmarks when z/Linux is the OS. I can see it for z/OS, where primary workloads are IMS, CICS, and batch, but z/Linux is intended for the same workloads running on UNIX. Run the benchmarks.

Joe(IBM): "industry standard benchmarks fall largely in parallel nirvana and parallel purgatory."

Jeff(Sun): As I mentioned, so do many IBM LSPR benchmarks. The batch job LSPR is nirvana, as COBOL jobs don't interact with one another. The Java benchmarks in LSPR could very easily have been the SPEC Java application server benchmarks, which have considerable inter-thread, non-parallelizable interactions. And really: so what? There are benchmarks that map to realistic workloads, and IBM is trying to sell into the market which those benchmarks characterize. If IBM claims z10 and z/Linux run Java app server, or NFS, or CIFS, or web server well, THEN PROVE IT USING THE SAME BENCHMARK EVERYBODY ELSE DOES (sorry for the shouting) instead of making excuses.

Joe(IBM): "Large UNIX machines tend to be designed for these benchmarks and so are particularly well suited to parallel purgatory."

Jeff(Sun): I would be interested in seeing some fragment of proof that large UNIX machines are well suited to "parallel purgatory" (as if it were somehow bad to be well suited to a wide category of performance!). As for the claim that they are designed for these benchmarks: if the claim is that they are designed to do well on benchmarks that are universally accepted as reasonable proxies for the performance of different workload categories, then I suppose that's true. The question, then, is "why isn't Z designed to run the reasonable proxies for these workloads, when IBM is trying so hard to claim those workloads run well on Z? Why won't IBM prove its claims?" If the claim is that UNIX vendors attempt to game the system by hacking together systems that run benchmarks well but are otherwise useless, then I absolutely deny that. At least, we at Sun don't do that. Otherwise, we'd be running TPC-C.

Joe(IBM): "My guess is that when the dust settles things like SPECvirt will not “scale well” on any machine."

Jeff(Sun): I certainly don't know. That has little to do with application or function oriented benchmarks like SPECnfs, SPECjAppServer, and so forth. Run them.

Joe(IBM): "Gunther points out that vendor claims of near linear scaling are not to be trusted"

Jeff(Sun): A valuable piece of advice. Next time IBM tells you that a z10 scales linearly and can run the workload of 1500 other servers, don't trust them unless they provide empirical evidence. And that's what we do at Sun, by providing public, open benchmarks with actual results and details of the configurations that produced them. And you can look up the price tags too.

Posted by Jeff Savit on June 28, 2008 at 06:58 AM MST #

An update: Tony replied to the comment I posted, after he updated his blog item with a few corrections and redactions. However, he made things worse. I'm posting here the same comment I'm about to post on his blog (linked above at the top of this blog entry).

Tony,

I wasn't going to post again on your blog - I have quite a conversation going on my blog already :-) - but saw my name mentioned several times, and considered that sufficient inducement.

I'm glad you made appropriate changes to your blog, and redacted material as required by the 3rd party in question (footnote 3). Unfortunately, mistakes here still require correction, and you have new ones including a real howler.

(1) WLM and IRD are z/OS capabilities, and don't play here. Regardless of whether you're running z/VM and z/Linux images in LPARs in a single CEC or multiple CECs, neither of them applies. z/VM has a share-based resource manager, but that's quite different, and comparable to those available in Solaris, AIX, VMware, and other platforms. I regret not saving the blog's original wording that claimed WLM and IRD were what made 90% possible. That claim was simply wrong. Now you claim there are "advances in 'Hypervisor' technology" that make this possible. I enjoy when people discard the basis for a claim but keep the claim nonetheless. So, just what are these "advances", specifically, or is this just marketing-speak?

(2) It is only technically correct to say that David Boyes' Test Plan Charlie didn't run "I/O intensive workloads", because it ran no workload at all. One cannot use it to extrapolate any behavior under load. David and I have known one another for years (I'm working with him and his colleagues on a project right now, and have news to blog about shortly, when I'm finished with this distraction), so I am completely familiar with this story. The size of the machine wasn't the sole issue either, as it was also a test of architectural limits. Quite a few of these have been removed in the move from S/390 and VM/ESA to System z and z/VM, though there's still a lot of work to be done to handle truly large systems. For example, even z/VM 5.3 supports only 256GB, which would rule out a single z/VM instance supporting the 1,500 database servers which enjoyed 1.5TB of RAM as originally described.

It's certainly possible to have hundreds or even a few thousand logged on users. Large CMS shops did that 15 years ago. Lots of us did that. But counting users without regard to cost and service level is inappropriate, as IBM's Walt Doherty taught me (and I blogged on).

Now the big one.

(3) In redacting the 3rd party material as required, you replaced it with material of dubious validity, misapplying the well-known "Barton's rule of 4" for relating MHz and MIPS. Barton Robinson, another person I've known for many years (we've presented together at conferences like SHARE and SHARE Europe), is probably the best known VM (and z/Linux) performance expert in the world. Nobody considering z/Linux should try to do it without his products; they're that good and essential. He came up with the rule of thumb that 4 MHz of Intel is roughly equivalent to 1 MIPS of mainframe. Everybody understands that this is just a crude estimate: MIPS are highly variable on a given z based on workload (RR instructions give much higher MIPS rates than SS, for example) or even level of multiprogramming. Intel and AMD models vary dramatically in how much work is done per clock cycle, the above quote was from several years ago for a single-core CPU, and on and on. Useful for doing rough back-of-the-envelope sizing, but you could NEVER call this accurate. To put 4 digits of precision ("6.866" - that's a "rough equivalency"?) on this rough rule of thumb is nonsense, and gives an impression of scientific accuracy that doesn't exist. The ratios could be wildly different.

But that's not the big problem. The big problem is that you got the ratio wrong! The ratio wasn't 4 MIPS to 1 GHz of x86, it was 4 MHz to 1 MIPS (for a single core)! Big oops. It's not "the rule of thumb was that 4 MIPS could do about 1 GHz worth of x86 processor work," as you said; it's (quoting Barton) "if your application has a requirement for a one minute period of 1GHz, then you could assume it would have a 250 MIP requirement for a minute" (at http://www.mail-archive.com/linux-390@vm.marist.edu/msg18587.html). He feels that the ratio may have improved, so that 1 z MIPS now matches 6 x86 MHz. (Let's use that for the sake of argument.) So, you're a few orders of magnitude off! Ow! Were you in a hurry to replace the material you had to cut out and still contrive an equivalent number of MIPS? You should have done a sanity check. If it only took 4 MIPS on z to equate to 1GHz, a single 800 MIPS z CPU would be the equivalent of a 200GHz (sic) processor, and IBM would be basing supercomputers on z processors many times faster than any existing computer!

So your math is all wrong. Here's how it works instead. Even assuming that z10 CPUs are better than earlier z and it takes 6 MHz to equate to 1 MIPS, and even assuming that dual cores don't count (the ratio is for a single core): A 2.8GHz processor would be matched by 467 MIPS on z10. 1500 processors would require 700,000 MIPS. More than 23 fully configured z10s. About $184 million for IFLs alone.
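
Here is the corrected arithmetic as a minimal sketch. One hedge: the ~30,000 MIPS capacity for a fully configured z10 is my inference from the "more than 23 machines" figure, not an IBM number.

    MHZ_PER_MIPS = 6        # generous ratio conceded above (the rule was 4)
    SERVER_MHZ   = 2_800    # one 2.8GHz x86 processor
    N_SERVERS    = 1_500
    IFL_PRICE    = 125_000
    Z10_MIPS     = 30_000   # assumed capacity of a fully configured z10

    mips_per_server = SERVER_MHZ / MHZ_PER_MIPS    # ~467 MIPS
    total_mips      = mips_per_server * N_SERVERS  # 700,000 MIPS
    machines        = total_mips / Z10_MIPS        # ~23.3 z10s
    ifl_cost        = machines * 64 * IFL_PRICE    # ~$187M ("$184M" uses 23 whole machines)

    print(f"{mips_per_server:.0f} MIPS/server -> {total_mips:,.0f} MIPS total")
    print(f"~{machines:.1f} fully configured z10s, ~${ifl_cost/1e6:.0f}M for IFLs alone")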

After such a major hole in your math it hardly seems sporting to continue, but I will:

(4) Subcapacity pricing doesn't apply to IFLs, which are already discounted. It really doesn't have to, because the purpose of knee-capping (slowing down) CPUs for subcapacity is to reduce z/OS software stack license fees. Since an IFL engine is crippled so it can't run z/OS (it's not enhanced to run Linux better or faster), there's no purpose to knee-capping it, so IBM doesn't. That's not unreasonable, but it has nothing to do with IFLs or z/Linux. So, the E64 is 64 CPUs times $125,000, plus $6K per GB of RAM (a minimum of 160GB, and 1,500GB for parity with the +minimum+ configuration it's claimed to replace). That's not counting the z/Linux and z/VM software licenses, DASD, and other substantial costs.

I suppose it's fair for me to mention that your claim that "Moving Oracle workloads from x86 over to mainframe is quite common" is hardly proven by citing a two-year-old press release. Surely if there were motion in that space there would be figures to prove it. A press release is just a press release. (The accidental substitution of Cisco for Oracle is confusing to the reader.)

I'm sorry you couldn't find the other material I referred to (760 x86 cores and 26 z10 IFLs). IBM has altered the web pages I referred to and removed content. Nothing I can help with.

To your closing points: I never suggested that IBM was trying to reverse-engineer AMD processors. I think that was a misconception on Jon's part, reading "emulate AMD" where "provide equivalent capacity" was intended. To your last point about x86 and virtualization (you omitted others, like SPARC and even POWER), I suggest you get some outside information. The distributed marketplace has completely changed, and now virtualization is everywhere. You no longer need a mainframe to do it.

Jeff

Posted by Jeff Savit on June 29, 2008 at 12:21 PM MST #
