Today we announced our first servers based on the
UltraSPARC T2 (Niagara2) processor. They are
officially named the Sun SPARC Enterprise T5120, the Sun SPARC Enterprise T5220, and the Sun Blade T6320. For those who enjoy
code names, the rack servers are known internally as "Huron," following in the Great Lakes theme from
our UltraSPARC T1-based systems. The blade is called "Glendale." For detailed specifications on these new machines,
start here. UltraSHORT
summary: 64 threads, eight floating-point units, on-chip 10GbE, low power, and 1RU, 2RU, or blade form factors. And they look interesting for some HPC workloads.
The UltraSPARC T2 is Sun's second generation CMT (chip multithreaded) processor. The first-generation
UltraSPARC T1, which has
32 threads and only one floating point unit, performs well on many throughput-oriented tasks, but isn't suitable as a
general-purpose processor for High Performance Computing. Some HPC areas like life sciences and some parts of
the intelligence community have integer-intensive workloads and can use the UltraSPARC T1 to advantage. For
example, see the numerous entries on Lawrence Spracklen's blog.
So, what can we say about the UltraSPARC T2 and its platforms relative to HPC?
As usual, application performance will depend greatly on the specifics of your application, but having
seen the results of several benchmarks on the UltraSPARC T2, I can make some observations. First,
remember that the primary value proposition of these CMT systems is throughput, not single-thread
performance. We use relatively low-performing cores, but give you eight of them on a single chip,
each with multiple hardware threads. Your application or workload must therefore benefit from having
lots of threads and from CMT's ability to hide memory latency by performing real work while waiting
for memory operations to complete.
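By way of illustration, here's a minimal POSIX-threads sketch (mine, not anything from our benchmarking work) of the shape of a throughput-friendly workload: one software thread per hardware strand, each doing independent work. It assumes the OS reports every hardware strand as an online processor, which Solaris does on these systems.

/*
 * Toy throughput workload: one worker per hardware thread.
 * On a T5120/T5220, sysconf() should report 64 online "processors."
 * Compile with a pthreads-capable C compiler, e.g. cc -mt on Sun Studio.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg) {
    long id = (long)arg;
    double sum = 0.0;
    /* Independent compute: while one strand stalls on memory,
     * the core issues instructions from its sibling strands. */
    for (long i = 1; i <= 1000000; i++)
        sum += 1.0 / (double)i;
    printf("thread %ld: sum = %f\n", id, sum);
    return NULL;
}

int main(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN);  /* 64 on these boxes */
    if (n > 256) n = 256;
    pthread_t tids[256];
    for (long i = 0; i < n; i++)
        pthread_create(&tids[i], NULL, worker, (void *)i);
    for (long i = 0; i < n; i++)
        pthread_join(tids[i], NULL);
    return 0;
}

The point isn't the arithmetic; it's that the work decomposes into many independent streams, which is exactly what keeps 64 hardware threads busy.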
I'll leave it to the benchmarking folks to give you the official story on exact results and instead
offer some general observations. First, these new systems generate leading performance numbers on a
popular floating-point rate (i.e., throughput) benchmark. However, to achieve those numbers we obviously
must run enough instances of the benchmark to make use of all of our threads, which multiplies
the memory footprint and therefore the cost of the system: if each copy needed, say, 1 GB, then 64
copies would need 64 GB. How much that matters to you in real life depends on how your application's
memory footprint scales in practice.
Consider, for example, an OpenMP
application. Using OpenMP to parallelize an application leaves
the memory footprint essentially unchanged and instead varies the number of threads used within the
application. As you'd expect, the thread-rich T2-based systems deliver some very interesting OpenMP
results.
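To make that contrast concrete, here's a minimal OpenMP sketch (again mine, and the array sizes are arbitrary): the data is allocated once and shared, so raising OMP_NUM_THREADS from 1 to 64 adds threads without multiplying the footprint.

/*
 * Minimal OpenMP sketch: one shared data set, variable thread count.
 * Compile with Sun Studio:  cc -xopenmp sketch.c
 * Run with, e.g.:           OMP_NUM_THREADS=64 ./a.out
 */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 10000000

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    for (long i = 0; i < N; i++) b[i] = (double)i;

    /* The same ~160 MB of data is shared by however many threads
     * the runtime provides; the footprint stays essentially constant. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * b[i] + 1.0;

    printf("max threads: %d, a[N-1] = %f\n",
           omp_get_max_threads(), a[N - 1]);
    free(a);
    free(b);
    return 0;
}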
Beyond performance issues, let's not lose sight of the fact that these tiny boxes have 64 hardware threads (and eight FPUs), making them interesting platforms for HPC developers
working on parallel algorithms, and possibly even for MPI developers who want to debug their distributed applications
on a single machine (see the sketch below). And, of course, you should expect to be able to cluster these machines to build
larger HPC systems using either the on-board 10GbE or InfiniBand.
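As an illustration of that single-box debugging idea, here's a minimal MPI sketch (mine, not anything shipped with the systems); with 64 hardware threads you could plausibly launch 64 ranks on one machine and watch them all report the same host name.

/*
 * Minimal MPI sketch: many ranks on a single machine for debugging.
 * Example launch (Open MPI / Sun HPC ClusterTools style):
 *   mpirun -np 64 ./a.out
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    /* Every rank reports the same host when the "cluster"
     * is really one 64-thread box. */
    printf("rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}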
For other Sun blogger perspectives on these new systems, start with Allan Packer's cross-reference blog entry.