CMT for HPC: Sun Launches UltraSPARC T2 Servers
By Josh Simons on Oct 09, 2007
Today we announced our first servers based on the UltraSPARC T2 (Niagara2) processor. They are officially named the Sun SPARC Enterprise T5120, the Sun SPARC Enterprise T5220, and the Sun Blade T6320. For those who enjoy code names, the rack servers are known internally as "Huron," following in the Great Lakes theme from our UltraSPARC T1-based systems. The blade is called "Glendale." For detailed specifications on these new machines, start here. UltraSHORT summary: 64 threads, eight floating point units, on-chip 10GbE, low power, 1RU or 2RU or blade form factors. And looking interesting for some HPC workloads.
The UltraSPARC T2 is Sun's second-generation CMT (chip multithreaded) processor. The first-generation UltraSPARC T1, which has 32 threads and only one floating point unit, performs well on many throughput-oriented tasks, but isn't suitable as a general-purpose processor for High Performance Computing. Some HPC areas, like life sciences and some parts of the intelligence community, have integer-intensive workloads and can use the UltraSPARC T1 to advantage. For example, see the numerous entries on Lawrence Spracklen's blog.
So, what can we say about the UltraSPARC T2 and its platforms relative to HPC?
As usual, application performance will depend greatly on the specifics of your application, but having seen the results of several benchmarks on the UltraSPARC T2, I can make some observations. First, remember the primary value proposition of these CMT systems is throughput, and not single-thread performance. We use relatively low-performing cores, but give you eight of them on a single chip, each with multiple threads. Therefore your application or workload must benefit from lots of threads and from the CMT's ability to hide memory latency by performing real work while waiting for memory operations to complete.
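To make the latency-hiding argument concrete, here is a deliberately simplified toy model, not a description of the actual T2 pipeline. It assumes each thread alternates between a fixed burst of compute cycles and a fixed memory stall, and that a core can switch to another ready thread at no cost. The function name and the cycle counts are hypothetical, chosen only for illustration; the eight threads per core follows from 64 threads spread over eight cores.

```python
def core_utilization(threads, compute_cycles, stall_cycles):
    """Fraction of cycles a core spends doing useful work in a toy model.

    Each thread computes for `compute_cycles`, then waits `stall_cycles`
    on memory. While one thread waits, another can run, so utilization
    grows with the thread count until the core is saturated.
    """
    per_thread = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, threads * per_thread)


# Hypothetical workload: 20 cycles of work per 100-cycle memory stall.
COMPUTE, STALL = 20, 100

for n in (1, 2, 4, 8):  # 8 = 64 hardware threads / 8 cores
    print(n, "thread(s):", round(core_utilization(n, COMPUTE, STALL), 3))
```

In this model a single thread keeps the core busy only about a sixth of the time, while eight threads are enough to saturate it. Real workloads will differ, which is why the application needs both plenty of threads and enough memory stalls for the hardware to hide.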
I'll leave it to the benchmarking folks to give you the official story on exact results and stick to generalities here. These new systems generate leading performance numbers on a popular floating-point rate (i.e. throughput) benchmark. However, to achieve those numbers we obviously must run enough instances of the benchmark to make use of all of our threads, which increases the memory footprint and therefore the cost of the system. How much that matters to you in real life depends on how your application's memory footprint scales in practice.
Consider, for example, an OpenMP application. Using OpenMP to parallelize an application leaves the memory footprint essentially unchanged and instead varies the number of threads used within the application. As you'd expect, the thread-rich T2-based systems deliver some very interesting OpenMP benchmark results.
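The footprint difference between the two styles comes down to simple arithmetic, sketched below. The function names and the 1 GB working-set figure are hypothetical placeholders, not measured values from any benchmark; only the count of 64 hardware threads comes from the system specs.

```python
def rate_footprint_gb(copies, per_copy_gb):
    """Rate-style run: every hardware thread gets its own full copy
    of the benchmark, so memory grows linearly with the copy count."""
    return copies * per_copy_gb


def openmp_footprint_gb(threads, shared_gb, per_thread_gb=0.0):
    """OpenMP-style run: one process whose threads share the data.
    Only a small per-thread cost (stacks, private variables) scales
    with the thread count; it is assumed negligible by default."""
    return shared_gb + threads * per_thread_gb


HW_THREADS = 64       # hardware threads on one UltraSPARC T2
WORKING_SET_GB = 1.0  # hypothetical working set, for illustration only

print("rate run:  ", rate_footprint_gb(HW_THREADS, WORKING_SET_GB), "GB")
print("OpenMP run:", openmp_footprint_gb(HW_THREADS, WORKING_SET_GB), "GB")
```

Under these assumptions the rate-style run needs 64 times the memory of the OpenMP run to keep the same number of threads busy, which is the cost difference described above.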
Beyond performance issues, let's not lose sight of the fact that these tiny boxes have 64 hardware threads (eight FPUs), making them interesting platforms for HPC developers working on parallel algorithms, possibly even for MPI developers wanting to debug their distributed applications on a single machine. And, of course, you should expect to be able to cluster these machines for building larger HPC systems using either the on-board 10GbE or InfiniBand.
For other Sun blogger perspectives on these new systems, start with Allan Packer's cross-reference entry.