The Death of Clock Speed

Sun has just introduced yet another chip multi-threading (CMT) SPARC system, the Sun SPARC Enterprise T5440. To me it's a Batoka (sorry, branding police) because I can't keep the model names straight. In any case, this time we've put 256 hardware threads and up to half a terabyte of memory into a four-RU server. That works out to a thread count of about 36 threads per vertical inch, which doesn't compete with fine Egyptian cotton, but it can beat the crap out of servers with 3X faster processor clocks. If you don't understand why, take the time to grok this fact: clock speed is dead as a useful metric for assessing the value of a system. For years it has been a shorthand proxy for performance, but with multi-core and multi-threaded processors becoming ubiquitous, that has all changed.

Old-brain logic would say that for a particular OpenMP problem size, a faster clock will beat a slower one. But it isn't about clock anymore; it is about parallelism, latency hiding, and efficient workload processing. That is why the T5440 can perform anywhere from 1.5X to almost 2X better on a popular OpenMP benchmark than systems with almost twice the T5440's clock rate. Those 256 threads and 32 separate floating-point units are a huge advantage in parallel benchmarks and in real-world environments in which heavy workloads are the norm, especially those that can benefit from the latency hiding offered by a CMT processor like the UltraSPARC T2 Plus. Check out BM Seer over the next few days for more specific, published benchmark results for the T5440.

Yes, sure: if you have a single-threaded application and don't need to run many instances, then a higher clock rate will help you significantly. But if that is your situation, you have a big problem, because you will no longer see clock rates increasing as they have in the past. You can see this now with commodity CPUs, whose clock rates are rising only slowly while core counts climb, and you can look to our CMT systems to understand the logical endpoint of this trend. It's all going parallel, baby, and it is past time to start thinking about how you are going to deal with that new reality. When you do, you'll see the value of this highly threaded approach for handling real-world workloads. See my earlier post on CMT and HPC for more information, including a pointer to an interesting customer analysis of the value of CMT for HPC workloads.
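What "going parallel" means in practice is restructuring work into independent tasks so that all those hardware threads have something to do. As a generic illustration (not tied to any particular Sun system or benchmark), here is a minimal Java sketch that splits a serial reduction across a thread pool sized to the machine:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSum {
    // Sum an array by splitting it into independent chunks, one task per
    // available hardware thread; each task reduces its own range, and the
    // partial results are combined at the end.
    static long parallelSum(long[] data) throws Exception {
        int nThreads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        int chunk = (data.length + nThreads - 1) / nThreads;
        List<Future<Long>> parts = new ArrayList<>();
        for (int t = 0; t < nThreads; t++) {
            final int lo = t * chunk;
            final int hi = Math.min(data.length, lo + chunk);
            parts.add(pool.submit(() -> {
                long s = 0;
                for (int i = lo; i < hi; i++) s += data[i];
                return s;
            }));
        }
        long total = 0;
        for (Future<Long> f : parts) total += f.get(); // combine partials
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        System.out.println(parallelSum(data)); // prints 499999500000
    }
}
```

On a highly threaded machine, `availableProcessors()` returns the hardware thread count, so the same code automatically spreads across all the strands a CMT processor offers.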

A pile of engineers have been blogging about both the hardware and software aspects of the new system. Check out Allan Packer's stargate blog entry for jump coordinates into the Batoka T5440 blogosphere.


I agree that it's a very macho Batoka. Who comes up with the idea to call it a bunch of numbers anyway? :)

Posted by Kristofer on October 13, 2008 at 11:48 PM EDT #

How do T2s scale from 1 to 2 (and even 4) sockets with lots of containers, each with its own dedicated processor set?

Namely, if I had one container per core and ran some benchmark such as SPECjbb2005 in each container, how close would a 2-socket system come to providing twice the total performance of a 1-socket system?


Posted by Mark Travis on October 21, 2008 at 04:18 PM EDT #


I asked your question of Denis Sheahan, one of our experts on CMT performance. His response:

It should scale well because containers have no performance overhead.

This approach is essentially what we do for our SPECjbb submissions. For instance, on a 4-way T5440 we run with 32 JVMs. We create 31 processor sets, one per core, and bind a JVM to each set. The last JVM runs in the default set.

We did the same on all our SPECjbb submissions. We could have created 32 containers instead and achieved exactly the same performance.

For Siebel we actually did use containers with dedicated resources: one for the web server, one for Siebel, and one for the database.
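For the curious, the processor-set workflow Denis describes looks roughly like this on Solaris. The CPU IDs and the returned set ID below are illustrative, not taken from an actual submission:

```shell
# Rough sketch of binding a JVM to one core's worth of hardware threads
# on Solaris with an UltraSPARC T2 Plus (8 strands per core).

# Create a processor set from the 8 strands of one core (CPUs 8-15 here):
psrset -c 8 9 10 11 12 13 14 15
# psrset reports the new set, e.g. "created processor set 1"

# Run a JVM confined to that set (assuming set ID 1 was returned):
psrset -e 1 java -jar workload.jar
```

Repeating the create-and-execute pair once per core gives the one-JVM-per-core layout described above.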

Posted by Joshua Simons on October 23, 2008 at 06:16 PM EDT #


Josh Simons

