Tuesday Apr 15, 2008

Will it run on multi-chip CMT?

Cooltst v3.0 is out, updated to assess workload suitability for single- and multi-chip CMT. When the UltraSPARC T1 was released, the CoolThreads Selection Tool (cooltst) was developed to help gauge how well given workloads might run on the new chip, which traded speed for throughput, allowing cooler, lower power, lower cost computing for many applications. But which applications? A single-threaded application would tap just a tiny fraction of the 8 cores and 32 hardware threads of the UltraSPARC T1 processor.

Much has changed since then. There is much empirical data showing various applications running well on CMT. The UltraSPARC T2 processor was released, increasing CMT power to 64 hardware threads. This processor also added dedicated floating point units per core so that, far from being relegated to a niche web server market, it claimed (and still holds) a high performance computing record.

Now UltraSPARC T2 Plus systems have been released, further extending CMT power to 2 chips, 8 cores per chip, 8 hardware threads per core - 128 virtual CPUs in a 1RU box. Cooltst helps you assess how well your workload may tap that throughput potential. You can read about it and download it starting at sunsource.net.

There's nothing magical about cooltst's heuristics. You can make much the same assessment yourself using ordinary tools like ps (to look at the software threads) and cpustat (to look at instruction characteristics). All the source code is included so you can see what it's doing. On Linux systems a loadable kernel module is included to measure instruction characteristics in place of Solaris' built-in cpustat command. The output of cooltst is tabular data, a narrative description of your workload characteristics, and a bottom line recommendation.

Disclosure Statement:

SPEC and SPEComp are registered trademarks of Standard Performance Evaluation Corporation. Results are current as of 11/11/2007. Complete results may be found at the url referenced above or at http://www.spec.org/omp/results/ompm2001.html

My photo:

Iguazu Falls, on the border of Argentina and Brazil. It's over twice as big as Niagara Falls in terms of water flow, because it covers such a wide area.

Friday Jan 25, 2008

Solaris Application Programming

Darryl Gove's book, Solaris Application Programming, is out. Knowing Darryl, this is a must for anyone interested in building high quality applications with good performance. He covers all the tools, tips, and techniques, of which Solaris offers many. This goes right to the top of my book wish-list. Never mind that as a Sun employee I can get free access to the book electronically. I want paper I can touch and pages I can turn; they will surely be well thumbed through.


Tuesday Dec 11, 2007

SPEC announces SPECpower_ssj2008

Today SPEC announced SPECpower_ssj2008, the first industry standard power performance benchmark. It measures electric power used at various load levels from active idle to 100% of possible throughput. The workload tested is a server side Java workload. The methodology is applicable to many workloads, and I hope in the future we will see more standard benchmarks, and application of these methods to measuring power consumption of customers' own workloads. This benchmark is the result of long hard work by dedicated engineers from many companies, universities, and Lawrence Berkeley National Laboratory. Congratulations!


Wednesday Nov 14, 2007

How fast is fast enough?

Occasionally I'm asked to comment on a cooltst analysis of workload suitability for UltraSPARC T1 and T2 systems, and I can't draw a conclusion because I don't know the customer's intentions. An example: I saw one process using 11.6% of a CPU. If a single thread were consuming that much CPU I'd hesitate to recommend the workload for T1 or T2 processors because I might expect the workload to keep only 9 threads busy (100% / 11.6%). Since a T1 processor has 32 hardware threads and a T2 processor has 64, most of the throughput potential of the system could be wasted.

However, the process that was using 11.6% had 546 threads. If the CPU is spread at all evenly across those threads, then there is ample parallelism to make full use of a T1 or T2 processor. You need to dig down into the LWP details of the cooltst report to see what's going on. In a future release of cooltst, that part of the report will get a little easier to read.
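The back-of-envelope arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the reasoning above, not what cooltst computes; the function name and the assumption that CPU use is spread perfectly evenly across the 546 threads are mine.

```python
def max_busy_threads(busiest_thread_cpu_pct):
    """Rough upper bound on how many hardware threads the workload could
    keep busy before its busiest software thread saturates one CPU."""
    return 100.0 / busiest_thread_cpu_pct


# Case 1: a single thread consumes the whole 11.6% of a CPU.
single_thread = max_busy_threads(11.6)

# Case 2 (assumption): the same 11.6% is spread evenly over 546 threads.
evenly_spread = max_busy_threads(11.6 / 546)

print(f"single thread: about {single_thread:.1f} hardware threads busy")
print(f"546 threads:   about {evenly_spread:.0f} hardware threads busy")
```

In the first case the bound falls well short of the 32 or 64 hardware threads available; in the second it far exceeds them, which is why the per-LWP detail matters.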

But what if a single thread were using 11.6%? Would that automatically disqualify the workload? Not necessarily. In this case the workload was running on a 4 processor 1.06 GHz UltraSPARC-IIIi system. If the customer was meeting his quality of service requirements using only one ninth of one of those relatively slow processors, then a T1 or T2 processor ought to be able to meet or exceed those requirements, regardless of how much excess capacity of the system went unused.

So what is the customer's intention? Does he want dramatically improved response time over his current system, or to maintain current service levels? Will the workload remain the same, or grow? Will it grow by adding more work to a single thread of computation, or will there be more users doing the same thing? Will he be consolidating the workloads of several current systems onto virtual machines on a single Sun SPARC Enterprise T5220? Despite the workload measurements shown by cooltst, there could be ample parallelism to exploit the throughput potential of UltraSPARC T1 and T2.

Sunday Nov 11, 2007

Will that run on an UltraSPARC T2?

Will a workload run well on a Sun SPARC Enterprise T5120, T5220, or Sun Blade T6320? The question has become easier to answer for the UltraSPARC T2 than it was for the UltraSPARC T1, which has limited floating point capability to match the majority of commercial workloads. However, that means that for an UltraSPARC T1 you must be careful to ensure that your workload does not contain excessive floating point instructions.

By contrast, the UltraSPARC T2 has enough floating point power to set the world record for single chip (Search form: enter "1 chip" in the "# CPU" field) SPECompM2001, the industry standard high performance computing benchmark for OpenMP. Still, with the T2 you need to determine whether your workload has sufficient parallelism to make effective use of the T2's 8 cores and 64 hardware threads. Back when we used to worry about using 64 CPUs in a Sun Enterprise 10000, the high performance computing community developed technologies like OpenMP and MPI for portable expression of parallelism.

What's different today is those 64 virtual CPUs fitting into 1 RU of space, so we're looking at a much wider range of applications. Of course if you're consolidating multiple workloads onto a T2 based system using LDOMs or another virtualization method, then you already have those workloads running in parallel.

But what of the case where you want to mostly run a single application? If it's written in Java then it's likely to be multi-threaded, either because the language makes it natural to write the application that way, or because the utility classes called by the application are themselves multi-threaded. But a Java application might still be dominated by one or a few threads, and another application might be very well threaded.

The CoolThreads Selection Tool (cooltst) looks at the workload running on a system, and advises you about its suitability for T1 systems. As it happens, you can also use it to get an idea about the suitability of a workload for T2 systems - with some caveats. First, if you run cooltst on a T2 system it will complain that it doesn't know what processor it's running on; cooltst predates the T2 processor. But then you probably won't want to gauge the suitability of a workload already running on T2, to be migrated to a T2. Second, ignore anything it warns you about floating point content being too high for a T1; that isn't a problem for T2.

You can also just look at the workload yourself using standard Solaris (prstat) or Linux (top) commands, to see how the CPU consumption is spread across application threads. In upcoming blog entries I'll talk more about this simple do-it-yourself analysis.
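As a minimal sketch of that do-it-yourself analysis, suppose you have already collected per-thread CPU percentages from a per-thread view of prstat or top. The Python below summarizes how concentrated the CPU use is. The function name, the 90% coverage threshold, and the sample numbers are all hypothetical, chosen only for illustration.

```python
def summarize_threads(cpu_by_thread, coverage=0.9):
    """Summarize how CPU consumption is spread across software threads.

    Returns the busiest thread's share of the total, and how many of the
    busiest threads are needed to account for `coverage` of the total.
    """
    total = sum(cpu_by_thread)
    if total <= 0:
        return {"top_share": 0.0, "threads_for_coverage": 0}

    ordered = sorted(cpu_by_thread, reverse=True)
    running = 0.0
    needed = 0
    for pct in ordered:
        running += pct
        needed += 1
        if running >= coverage * total:
            break
    return {"top_share": ordered[0] / total, "threads_for_coverage": needed}


# Hypothetical samples: one workload dominated by a single thread,
# and one spread evenly over 20 threads.
dominated = [40.0, 2.0, 1.0, 1.0]
well_threaded = [5.0] * 20

print(summarize_threads(dominated))
print(summarize_threads(well_threaded))
```

A workload where one thread holds most of the CPU and few threads cover the total looks more like a speed-bound job; one where many threads are needed to reach the threshold is a better candidate for a throughput-oriented processor.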

Disclosure Statement:

SPEC and SPEComp are registered trademarks of Standard Performance Evaluation Corporation. Results are current as of 11/11/2007. Complete results may be found at the url referenced above or at http://www.spec.org/omp/results/ompm2001.html


Thursday Aug 16, 2007

speed vs. throughput in the real world

Performance engineers often talk about the difference between speed and throughput with respect to industry standard benchmarks, e.g. SPEC CPU2006. Glenn Fawcett wrote an excellent posting about that difference in the real world. Reading Glenn's scenario it's easy to see how technically able people can easily get it confused as we all adjust to the "CMT era."


Thursday Aug 24, 2006

SPEC releases CPU2006 benchmark suite

Bid farewell to CPU2000, arguably the world's most widely used benchmark suite with more than 6,000 results published to date. Today SPEC announced its replacement, CPU2006. The old suite, formidable in its day, was being overtaken by modern systems. Originally some of the benchmark run times had been well in excess of an hour, so that running three repetitions of each of the 14 floating point benchmarks and each of the 10 integer benchmarks took a considerable time. But some of today's systems run some of the benchmarks in less than 15 seconds.

The new CPU2006 suite has 14 integer benchmarks and 17 floating point benchmarks, covering a wider range of application areas. Some of the best old benchmarks like gcc remain, but with considerably bigger more demanding workloads. And there are many new benchmarks, some of which came from a wide ranging search program that paid "bounties" to authors to contribute code to the suite and work with SPEC to adapt the applications into suitable benchmarks.

SPEC member companies, universities, and individuals spent many hours (and sleepless nights) working on these benchmarks, starting soon after CPU2000 was released. They tested and performed detailed performance analysis to ensure that the benchmarks test a wide range of application areas and coding styles, that they run on any standard conforming system, and that they are not biased towards any particular operating system, compiler, processor pipeline, or cache organization. For every benchmark included in the suite, one or two benchmarks were worked on for months only to drop out along the way. Equally important as the benchmarks are the run rules, and lots of work also went into these to ensure that tests are run under appropriate conditions for fair and accurate comparisons. All this work required that the vendor representatives temporarily set aside their competitive instincts to share comparative performance analyses, to help their competitors port code, and to compromise on technical and run rule issues.

CPU2006 is off to a fast start with already 66 results published from a wide range of vendors. This ensures that competition will heat up quickly and we should see many new results in coming months.

Congratulations to all the members of SPEC's CPU committee!

Wednesday Aug 23, 2006

Computer, tune thyself

Performance engineers are sometimes accused of being obsessed with performance, to the exclusion of common sense. We tweak and tune systems for the maximum possible throughput, and frankly some ordinary users are left behind - people who don't really care how fast it is so long as it's fast enough - people who have their own jobs to do, and for whom the computer is just another tool.

Verlag Heinz Heise publishing (iX Magazin, Germany) has long offered constructive criticism to SPEC like this article on the release of the CPU2000 benchmark suite. Not ones to just heckle the teams from the sideline, iX runs their own benchmark tests publishing them in their own magazine and web site and also at SPEC. This lets them show performance that they consider real world, tuned according to good application development practice but short of what they consider heroic benchmark tuning. For example in 2001 they published this result of 88.3 SPECint_rate_base2000 on our 24 chip 750 MHz Sun Fire 6800. We tuned the same system to achieve 96.1 SPECint_rate_base2000 and 101 SPECint_rate2000 (peak).

So we could show the user how to get 14% more performance from the system. However today you can get 78.8 SPECint_rate_base2000 and 89.8 SPECint_rate2000 from a 4 chip 3 GHz Sun Fire V40z. So who's to blame the user if instead of investing much effort in performance tuning he would rather just ride Moore's Law for a while?
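For readers who want to check the arithmetic, here is a small Python sketch using only the scores quoted above. The helper name is mine, and reading the 14% figure as the tuned peak result against the iX result is my interpretation of the numbers, not something stated in the published reports.

```python
def pct_gain(new_score, old_score):
    """Percentage improvement of new_score over old_score."""
    return (new_score / old_score - 1.0) * 100.0


ix_base = 88.3      # iX result, SPECint_rate_base2000
tuned_base = 96.1   # tuned result, SPECint_rate_base2000
tuned_peak = 101.0  # tuned result, SPECint_rate2000 (peak)

print(f"tuned peak vs. iX: {pct_gain(tuned_peak, ix_base):.1f}%")
print(f"tuned base vs. iX: {pct_gain(tuned_base, ix_base):.1f}%")
```

On these numbers the peak result works out to roughly 14% above the iX figure, and the base result to roughly 9%.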

It's relatively easy for an independent party like Verlag Heinz Heise to say what they mean by real world optimization, compared to the difficulty of the members of SPEC agreeing on what we mean by peak versus baseline optimization. As we described the baseline metric, it "...uses performance compiler flags that a compiler vendor would suggest for a given program knowing only its own language." (See CPU2000 FAQ.) But of course degree of optimization is a continuum, and there will be application developers who do not optimize even as much as SPEC base and there will be those who optimize more than SPEC peak.

One benefit of SPEC's full disclosure reports is that they can serve as examples of good tuning practice to supplement system documentation. SPEC benchmarks may not use tuning that is not documented, supported, and recommended for general use. So a visit to the flags disclosure pages may provide a wealth of good tuning ideas for your own applications.

Tuning for system level benchmarks may be even more complex than compiler options for the CPU benchmarks. For example, in this SPECjAppServer2004 result there are tuning notes for the emulator software, the database software, the driver software, and the operating systems, as well as for the J2EE application server:

JVM Options: -server -Xms3g -Xmx3g -Xmn800m -Xss128k -XX:+AggressiveHeap
-XX:+UseParallelGC -XX:ParallelGCThreads=32 -XX:PermSize=128m
-XX:MaxTenuringThreshold=3 -XX:LargePageSizeInBytes=4m -XX:SurvivorRatio=20
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:-TraceClassUnloading
-XX:+UseParallelOldGC -Dweblogic.DevPollSocketReaders=2
Java process started in FX class using /usr/bin/priocntl -e -c FX


Again the SPEC full disclosure reports may guide the reader in discovering and using various system tuning parameters. Of course with system tuning you'll probably also want to take a course and/or get a good book like Solaris Internals.

For the user without time or inclination to learn system performance tuning, who wants his system to be just fast enough while he waits for Moore's Law to bring next year's more powerful server, I'm afraid the computer industry just hasn't done enough. Directions are positive though. Suppose you discover, for instance, that it's good for your C application to align branch targets based on execution frequency. A Java application may have such an optimization applied automatically by a Java Virtual Machine with dynamic optimization such as HotSpot. Operating systems may with your permission automatically apply system updates including performance updates, such as Microsoft Update (Sorry, the corresponding URL on microsoft.com only works with Explorer.) and Sun Update Connection.

Computer, tune thyself.

Wednesday Dec 14, 2005

To Sun bloggers, about benchmarks

Speaking as a SPEC representative, we love to have people using our benchmarks, but everyone has to follow the rules - bloggers too. Standard benchmarks have reporting and fair use rules that require for instance that performance claims be backed up by fully rule compliant test runs, and in some cases by prior independent results review; and that comparisons among products must include the appropriate information to offer a fair comparison.

Speaking as a Sun engineer, we have some great products and I don't want you to stop talking about them! Just please take care to follow the rules. If you're not sure what they are, ask. Inside Sun go to SAE on SunWeb for information on the benchmarks. Email them. Or email me.

See BMSeer's postings for examples of footnotes saying what he's comparing, the basis for comparison, when he looked at the competitive numbers, and substantiation of the numbers - like this entry about SPECweb2005. You may find the required disclosure inelegant, but bits are free. Put them in. It might just be exactly the information some reader really needs to know.

Thank you.



I am a software engineer in San Diego, president of the Standard Performance Evaluation Corporation (spec.org), formerly a mathematician and a violist.

