Tuesday Apr 15, 2008

Will it run on multi-chip CMT?

Cooltst v3.0 is out, updated to assess workload suitability for single- and multi-chip CMT. When the UltraSPARC T1 was released, the CoolThreads Selection Tool (cooltst) was developed to help gauge how well given workloads might run on the new chip, which traded single-thread speed for throughput, allowing cooler, lower power, lower cost computing for many applications. But which applications? A single threaded application would tap just a tiny fraction of the 8 cores and 32 hardware threads of the UltraSPARC T1 processor.

Much has changed since then. There is much empirical data showing various applications running well on CMT. The UltraSPARC T2 processor was released, increasing CMT power to 64 hardware threads. This processor also added dedicated floating point units per core so that, far from being relegated to a niche web server market, it claimed (and still holds) a high performance computing record.

Now UltraSPARC T2 Plus systems have been released, further extending CMT power to 2 chips, 8 cores per chip, 8 hardware threads per core - 128 virtual CPUs in a 1 RU box. Cooltst helps you assess how well your workload may tap that throughput potential. You can read about it and download it starting at sunsource.net.

There's nothing magical about cooltst's heuristics. You can make much the same assessment yourself using ordinary tools like ps (to look at the software threads) and cpustat (to look at instruction characteristics). All the source code is included so you can see what it's doing. On Linux systems a loadable kernel module is included to measure instruction characteristics in place of Solaris' built-in cpustat command. The output of cooltst is tabular data, a narrative description of your workload characteristics, and a bottom line recommendation.

Disclosure Statement:

SPEC and SPEComp are registered trademarks of Standard Performance Evaluation Corporation. Results are current as of 11/11/2007. Complete results may be found at the url referenced above or at http://www.spec.org/omp/results/ompm2001.html

My photo:

Iguazu Falls, on the border of Argentina and Brazil. It's over twice as big as Niagara Falls in terms of water flow, because it covers such a wide area.

Wednesday Nov 14, 2007

How fast is fast enough?

Occasionally I'm asked to comment on a cooltst analysis of workload suitability for UltraSPARC T1 and T2 systems, and I can't draw a conclusion because I don't know the customer's intentions. An example: I saw one process using 11.6% of a CPU. If a single thread were consuming that much CPU I'd hesitate to recommend the workload for T1 or T2 processors because I might expect the workload to keep only about 9 threads busy (100% / 11.6% is roughly 8.6). Since a T1 processor has 32 hardware threads and a T2 processor has 64, most of the throughput potential of the system could be wasted.

However, the process that was using 11.6% had 546 threads. If the CPU is spread at all evenly across those threads, then there is ample parallelism to make full use of a T1 or T2 processor. You need to dig down into the LWP details of the cooltst report to see what's going on. In a future release of cooltst, that part of the report will get a little easier to read.
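The arithmetic behind that rule of thumb can be sketched in a few lines of Python. This is only an illustration of the 100% / 11.6% reasoning above, not how cooltst computes anything; the function names are made up for the example, and it assumes the busiest software thread is the only limit on scaling.

```python
# Back-of-the-envelope parallelism estimate.
# Hypothetical helper names; not taken from cooltst.

def busy_thread_estimate(busiest_thread_pct):
    """Rough number of threads the workload keeps busy if its busiest
    software thread uses busiest_thread_pct percent of a CPU
    (the 100% / share rule of thumb)."""
    if not 0 < busiest_thread_pct <= 100:
        raise ValueError("share must be in (0, 100]")
    return 100.0 / busiest_thread_pct


def usable_fraction(busiest_thread_pct, hardware_threads):
    """Fraction of the chip's hardware threads that estimate would fill,
    capped at 1.0 (the chip cannot be more than fully used)."""
    estimate = busy_thread_estimate(busiest_thread_pct)
    return min(1.0, estimate / hardware_threads)


if __name__ == "__main__":
    single = 11.6        # all 11.6% consumed by one thread
    spread = 11.6 / 546  # the same 11.6% spread evenly over 546 threads
    for chip, hw_threads in (("T1", 32), ("T2", 64)):
        print(chip,
              f"single thread: {usable_fraction(single, hw_threads):.0%}",
              f"546 threads: {usable_fraction(spread, hw_threads):.0%}")
```

The same 11.6% gives two very different answers depending on how it is distributed across LWPs, which is why the per-thread detail matters.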

But what if a single thread were using 11.6%? Would that automatically disqualify the workload? Not necessarily. In this case the workload was running on a 4 processor 1.06 GHz UltraSPARC-IIIi system. If the customer was meeting his quality of service requirements using only one ninth of one of those relatively slow processors, then a T1 or T2 processor ought to be able to meet or exceed those requirements, regardless of how much excess capacity of the system went unused.

So what is the customer's intention? Does he want dramatically improved response time over his current system, or to maintain current service levels? Will the workload remain the same, or grow? Will it grow by adding more work to a single thread of computation, or will there be more users doing the same thing? Will he be consolidating the workloads of several current systems onto virtual machines on a single Sun SPARC Enterprise T5220? Despite the workload measurements shown by cooltst, there could be ample parallelism to exploit the throughput potential of UltraSPARC T1 and T2.

Sunday Nov 11, 2007

Will that run on an UltraSPARC T2?

Will a workload run well on a Sun SPARC Enterprise T5120, T5220, or Sun Blade T6320? The question has become easier to answer for the UltraSPARC T2 than it was for the UltraSPARC T1. The T1 has limited floating point capability, sized to match the majority of commercial workloads, so for an UltraSPARC T1 you must be careful to ensure that your workload does not contain excessive floating point instructions.

By contrast, the UltraSPARC T2 has enough floating point power to set the world record for single chip SPECompM2001 (Search form: enter "1 chip" in the "# CPU" field), the industry standard high performance computing benchmark for OpenMP. Still, with the T2 you need to determine whether your workload has sufficient parallelism to make effective use of the T2's 8 cores and 64 hardware threads. Back when we used to worry about using 64 CPUs in a Sun Enterprise 10000, the high performance computing community developed technologies like OpenMP and MPI for portable expression of parallelism.

What's different today is that those 64 virtual CPUs fit into 1 RU of space, so we're looking at a much wider range of applications. Of course if you're consolidating multiple workloads onto a T2 based system using LDOMs or another virtualization method, then you already have those workloads running in parallel.

But what of the case where you want to mostly run a single application? If it's written in Java then it's likely to be multi-threaded, either because the language makes it natural to write the application that way, or because the utility classes called by the application are themselves multi-threaded. But a Java application might still be dominated by one or a few threads, and another application might be very well threaded.

The CoolThreads Selection Tool (cooltst) looks at the workload running on a system, and advises you about its suitability for T1 systems. As it happens, you can also use it to get an idea about the suitability of a workload for T2 systems - with some caveats. First, if you run cooltst on a T2 system it will complain that it doesn't know what processor it's running on; cooltst predates the T2 processor. But then you probably won't want to gauge the suitability of a workload already running on T2, to be migrated to a T2. Second, ignore anything it warns you about floating point content being too high for a T1; that isn't a problem for T2.

You can also just look at the workload yourself using standard Solaris (prstat) or Linux (top) commands, to see how the CPU consumption is spread across application threads. In upcoming blog entries I'll talk more about this simple do-it-yourself analysis.
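As a minimal sketch of that do-it-yourself look, the Python below summarizes a per-thread CPU listing. It assumes input shaped like the output of `ps -eLo pid,lwp,pcpu,comm` on a Linux system with procps (one line per thread, with a header row); the sample data and the `summarize` helper are invented for illustration and are not part of cooltst. The %CPU figures ps reports are averages, so treat the result as a rough indication only.

```python
# Summarize how CPU consumption is spread across the threads of each process.
# Assumed input columns: PID LWP %CPU COMMAND (one line per thread).
from collections import defaultdict

# Invented sample data, for illustration only.
SAMPLE = """\
  PID   LWP %CPU COMMAND
  101   101  0.5 appserver
  101   102  0.4 appserver
  101   103  0.6 appserver
  202   202 11.6 batchjob
"""


def summarize(listing):
    """Return {pid: {command, threads, total_pct, top_thread_fraction}}."""
    shares = defaultdict(list)   # pid -> list of per-thread %CPU values
    commands = {}                # pid -> first command name seen
    for line in listing.splitlines()[1:]:   # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue                         # ignore blank or short lines
        pid, _lwp, pcpu, comm = fields
        pid = int(pid)
        shares[pid].append(float(pcpu))
        commands.setdefault(pid, comm.strip())

    summary = {}
    for pid, values in shares.items():
        total = sum(values)
        summary[pid] = {
            "command": commands[pid],
            "threads": len(values),
            "total_pct": total,
            # Near 1.0: one thread dominates; near 1/threads: evenly spread.
            "top_thread_fraction": max(values) / total if total else 0.0,
        }
    return summary


if __name__ == "__main__":
    for pid, info in sorted(summarize(SAMPLE).items()):
        print(pid, info)
```

A process whose busiest thread accounts for nearly all of its CPU is the case to worry about on a CMT system; one with many threads each taking a small share is the well-threaded case.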

Disclosure Statement:

SPEC and SPEComp are registered trademarks of Standard Performance Evaluation Corporation. Results are current as of 11/11/2007. Complete results may be found at the url referenced above or at http://www.spec.org/omp/results/ompm2001.html


Tuesday Oct 09, 2007

Solaris OS on UltraSPARC T2

I take it for granted that the same Solaris interfaces and capabilities are available wherever I go, home, office, or travel, never worrying that behind one screen is an UltraSPARC processor, behind another an AMD Athlon processor, and behind another a SunRay server with who-knows-what in it. As Lily Tomlin said "We don't care, we don't have to care."

But today I salute the people who make sure I don't have to care. Steve Sistare writes about how Solaris OS was optimized to run on the UltraSPARC T2 processor, scheduling for 64 hardware threads, taking best advantage of cache associativity and page mapping, using hardware acceleration, etc. And Ubuntu Linux runs as well.

Wednesday Sep 12, 2007


No, not that Barcelona - the city, home to the Universitat Politecnica de Catalunya, a very strong engineering school. An intern from UPC did some great work for me a few years ago on the relationship of thread level parallelism and instruction level parallelism.

A lot of creative music also comes out of Barcelona, including The Pinker Tones. If you like an eclectic mix of styles you can save yourself the trouble of mixing and just get the Million Color Revolution, including these cuts worth a preview. Every track on the album is good.

  • Karma Hunters (in English) - vote for the instant karma party
  • L'heroes (in French) - I am my grandma's hero
  • Pink Freud (in German) - barbershop quartet
  • Maybe Next Saturday (in English) - robot boy asks her to the dance
  • Sonido Total (in Spanish) - music in space
  • Gone, Go On (in English) - Howdy Doody's love song


Thursday Aug 16, 2007

speed vs. throughput in the real world

Performance engineers often talk about the difference between speed and throughput with respect to industry standard benchmarks, e.g. SPEC CPU2006. Glenn Fawcett wrote an excellent posting about that difference in the real world. Reading Glenn's scenario, it's easy to see how technically able people can get the two confused as we all adjust to the "CMT era."



I am a software engineer in San Diego, president of the Standard Performance Evaluation Corporation (spec.org), formerly a mathematician and a violist.

