Monday Oct 13, 2008

The Death of Clock Speed

Sun just introduced yet another chip multi-threading (CMT) SPARC system, the Sun SPARC Enterprise T5440. To me, that's a Batoka (sorry, branding police) because I can't keep the model names straight. In any case, this time we've put 256 hardware threads and up to 1/2 TeraByte of memory into a four-RU server. That works out to about 36 threads per vertical inch (256 threads in seven inches of rack space), which doesn't compete with fine Egyptian cotton, but it can beat the crap out of servers with 3X faster processor clocks. If you don't understand why, then you should take the time to grok this fact: clock speed is dead as a good metric for assessing the value of a system. For years it has been a convenient shorthand for performance, but with multi-core and multi-threaded processors becoming ubiquitous, that has all changed.

Old-brain logic would say that for a particular OpenMP problem size, a faster clock will beat a slower one. But it isn't about clock anymore--it is about parallelism and latency hiding and efficient workload processing. Which is why the T5440 can perform anywhere from 1.5X to almost 2X better on a popular OpenMP benchmark than systems with almost twice the T5440's clock rate. Those 256 threads and 32 separate floating point units are a huge advantage in parallel benchmarks and in real-world environments in which heavy workloads are the norm, especially those that can benefit from the latency-hiding offered by a CMT processor like the UltraSPARC T2 Plus. Check out BM Seer over the next few days for more specific, published benchmark results for the T5440.

Yes, sure. If you have a single-threaded application and don't need to run a lot of instances, then a higher clock rate will help you significantly. But if that is your situation, then you have a big problem, because you will no longer see clock rates increasing the way they have in the past. You can see this already with commodity CPUs: clock rates are creeping up only slowly while multicore designs take over, and you can look to our CMT systems to understand the logical progression of this trend. It's all going parallel, baby, and it is past time to start thinking about how you are going to deal with that new reality. When you do, you'll see the value of this highly threaded approach for handling real-world workloads. See my earlier post on CMT and HPC for more information, including a pointer to an interesting customer analysis of the value of CMT for HPC workloads.

A pile of engineers have been blogging about both the hardware and software aspects of the new system. Check out Allan Packer's stargate blog entry for jump coordinates into the Batoka T5440 blogosphere.

Thursday Apr 10, 2008

Maramba (T5240) Unplugged: A Look Under the Hood

Aside from being interesting machines for HPC, our latest systems are beautiful to behold. At least to an engineer, I suppose. We had a fun internal launch event (free ice cream!) on Sun's Burlington campus yesterday where we had a chance to look inside the boxes and ask questions of the hardware guys who designed these systems. Here is my annotated guide to the layout of a T5240 system.

[annotated T5240 system]

A. Software engineers examining the Sun SPARC Enterprise T5240. Note the uniforms. Sometimes I think the hardware guys are better dressers than the software folks.

B. Front section of the machine. Disks, DVD, etc, go here.

C. Ten hot-swappable fans with room for two more. The system will compensate for a bad fan and continue to run. It also automatically adjusts fan speeds to deliver only as much cooling as is actually needed. For example, when the mezzanine (K) is in place, more cooling is required; when it isn't, why spend the energy spinning the fans faster than necessary?

D. This is a plexiglass air plenum, which is hard to see in the photo because it is clear, though the top part near the fans does have some white-background diagrams on it. The plenum sits at an angle near the top of the photo and then runs horizontally over the memory DIMMs (F). It captures the air coming out of the fans and forces it past the DIMMs for efficient cooling. The plenum is removed when installing the mezzanine.

E. These are the two heat sinks covering the two UltraSPARC T2 Plus processors. The two procs are linked by four point-to-point coherency links, which are also buried under that area of the board.

F. The memory DIMMs. On the motherboard you can see there are eight DIMM slots per socket for a maximum of 64 GB of memory with current FBDIMMs. But see (K) if 64 GB isn't enough memory for you.

G. There are four connectors shown here; the rightmost is filled with an air controller, as the others would be in an operational system. The mezzanine assembly (K) mates to the motherboard here.

H. PCIe bridge chips are here in this area, supporting external IO connectivity.

I. This is the heat sink for Neptune, Sun's multithreaded 10 GbE technology. It supports two 10 GbE links or some number of slower links and has a large number of DMA engines to offer a good impedance match with our multithreaded processors. In the previous-generation UltraSPARC T2, the Neptune technology was integrated onto the processor silicon. In the T2 Plus, we moved the 10 GbE off-chip to make room for the coherency logic that allows us to create a coherent 128-thread system with two sockets. As a systems company, we can make tradeoffs like this at will.

J. This is where the service processor for the system is located, including its associated SDRAM memory. The software guys got all tingly when they saw where all their ILOM code actually runs. ILOM is Sun's unified service processor that provides lights out management (LOM).

K. And, finally, the mezzanine assembly. This plugs into the four sockets described above (G) and allows the T5240 to be expanded to 128 GB...all in a two rack-unit enclosure. The density is incredible.

Kicking Butt with OpenMP: The Power of CMT

[t5240 internal shot]

Yesterday we launched our 3rd generation of multicore / multithreaded SPARC machines and again the systems should turn some heads in the HPC world. Last October, we introduced our first single-socket UltraSPARC T2 based systems with 64 hardware threads and eight floating point units. I blogged about the HPC aspects here. These systems showed great promise for HPC as illustrated in this analysis done by RWTH Aachen.

We have now followed that with two new systems that offer up to 128 hardware threads and 16 floating point units per node. In one or two rack units. With up to 128 GB of memory in the 2U system. These systems, the Sun SPARC Enterprise T5140 and T5240 Servers, both contain two UltraSPARC T2 Plus processors, which maintain coherency across four point-to-point links with a combined bandwidth of 50 GB per second. Further details are available in the Server Architecture whitepaper.

An HPC person might wonder how a system with a measly clock rate of 1.4 GHz and 128 hardware threads spread across two sockets might perform against a more conventional two-socket system running at over 4 GHz. Well, wonder no more. The OpenMP threaded parallelization model is a great way to parallelize codes on a shared memory machine and a good illustration of the value proposition of our throughput computing approach. In short, we kick butt with these new systems on SPEComp and have established a new world record for two-socket systems. Details here.
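
To get a feel for what one of these boxes looks like to a threaded program, here is a minimal OpenMP sketch. It's my own illustration, not benchmark code; any OpenMP-capable compiler should work (with Sun Studio, something like cc -fast -xopenmp).

    /* Minimal OpenMP sketch: ask how much parallelism the OS exposes.
       Solaris presents each hardware strand as a logical CPU, so on
       a T5240 this should report 128 processors. */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        printf("OpenMP sees %d processors\n", omp_get_num_procs());

        #pragma omp parallel
        {
            /* One thread reports the team size for the parallel region. */
            #pragma omp single
            printf("parallel region is using %d threads\n",
                   omp_get_num_threads());
        }
        return 0;
    }

Set OMP_NUM_THREADS to taste, up to 128 on a T5240, and the same binary scales from a single thread to the full machine.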

Oh, and we also just released world record two-socket numbers for both SPECint_rate2006 and SPECfp_rate2006 using these new boxes. Check out the details here.

Tuesday Oct 09, 2007

UltraSPARC T2 Systems Exposed!

A few photos of our new Sun SPARC Enterprise T5x20 and Sun Blade T6320 systems with their covers removed. Seen today at the employee product launch party here in Burlington. Huron (T5x20) first, then Glendale (T6320).

[huron w/o cover]
[glendale w/o cover]

CMT for HPC: Sun Launches UltraSPARC T2 Servers

[ultrasparc t2 chip]

Today we announced our first servers based on the UltraSPARC T2 (Niagara2) processor. They are officially named the Sun SPARC Enterprise T5120, the Sun SPARC Enterprise T5220, and the Sun Blade T6320. For those who enjoy code names, the rack servers are known internally as "Huron," continuing the Great Lakes theme from our UltraSPARC T1-based systems. The blade is called "Glendale." For detailed specifications on these new machines, start here. UltraSHORT summary: 64 threads, eight floating point units, on-chip 10GbE, low power, 1RU or 2RU or blade form factors. And looking interesting for some HPC workloads.

The UltraSPARC T2 is Sun's second generation CMT (chip multithreaded) processor. The first-generation UltraSPARC T1, which has 32 threads and only one floating point unit, performs well on many throughput-oriented tasks, but isn't suitable as a general-purpose processor for High Performance Computing. Some HPC areas like life sciences and some parts of the intelligence community have integer-intensive workloads and can use the UltraSPARC T1 to advantage. For example, see the numerous entries on Lawrence Spracklen's blog.

So, what can we say about the UltraSPARC T2 and its platforms relative to HPC?

As usual, performance will depend greatly on the specifics of your application, but having seen the results of several benchmarks on the UltraSPARC T2, I can make some observations. First, remember that the primary value proposition of these CMT systems is throughput, not single-thread performance. We use relatively low-performing cores, but give you eight of them on a single chip, each with multiple threads. Your application or workload must therefore benefit from lots of threads and from CMT's ability to hide memory latency by performing real work while waiting for memory operations to complete.

I'll leave it to the benchmarking folks to give you the official story on exact results and instead make some general observations. First, these new systems generate leading performance numbers on a popular floating-point rate (i.e. throughput) benchmark. However, to achieve those numbers we obviously must run enough instances of the benchmark to make use of all of our threads, which increases the memory footprint and therefore the cost of the system. How much that matters to you in real life depends on how your application's memory footprint scales in practice.

Consider, for example, an OpenMP application. Using OpenMP to parallelize an application leaves the memory footprint essentially unchanged and instead varies the number of threads used within the application. As you'd expect, the thread-rich T2-based systems deliver some very interesting OpenMP benchmark results.
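
To make the footprint point concrete, here is a small sketch of my own (again, not benchmark code). A rate-style throughput run launches N independent copies of a program and so allocates N arrays; the OpenMP version below shares one array among all of its threads, so memory use stays flat no matter how many threads you request.

    /* Footprint contrast: this OpenMP version shares ONE array among
       all threads, so memory use does not grow with OMP_NUM_THREADS. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (1 << 24)   /* 16M doubles: one shared ~128 MB array */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double sum = 0.0;
        long i;

        if (a == NULL)
            return 1;

        for (i = 0; i < N; i++)
            a[i] = 1.0;

        /* Threads divide the iterations over the same array; adding
           threads adds parallelism, not data. */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %.0f using %d threads\n", sum, omp_get_max_threads());
        free(a);
        return 0;
    }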

Beyond performance issues, let's not lose sight of the fact that these tiny boxes have 64 hardware threads (eight FPUs), making them interesting platforms for HPC developers working on parallel algorithms, possibly even for MPI developers wanting to debug their distributed applications on a single machine. And, of course, you should expect to be able to cluster these machines for building larger HPC systems using either the on-board 10GbE or InfiniBand.
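
For instance, here is the sort of toy MPI program you could shake out on one of these boxes before ever touching a cluster. This is a hypothetical sketch of mine, assuming an installed MPI such as Sun HPC ClusterTools or Open MPI: compile with mpicc and launch all the ranks on one machine with something like mpirun -np 64.

    /* Toy MPI ring: rank 0 sends a token around all ranks and back.
       With 64 hardware threads in one box, all 64 ranks can run on a
       single machine while you debug message flow. Needs >= 2 ranks. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, token = 42;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size > 1) {
            if (rank == 0) {
                MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                printf("token traveled around %d ranks\n", size);
            } else {
                MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0,
                         MPI_COMM_WORLD);
            }
        }

        MPI_Finalize();
        return 0;
    }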

For other Sun blogger perspectives on these new systems, start with Allan Packer's cross-reference entry.

Wednesday Aug 08, 2007

UltraSPARC T2: Bigger Basket, More Eggs

With its 64 threads, the UltraSPARC T2 processor can handle big consolidation workloads. But it is fair to ask whether you should be comfortable putting more eggs (applications) into that bigger basket. We think you should be, in part because of all the work we've done around Fault Management for T2 platforms.

Scott Davenport has led the team that designed and developed the FMA implementation for T2 systems. He has an excellent blog entry that describes exactly what the team has done to diagnose CPU, memory, IO, crypto, and networking issues. Definitely worth a read.

UltraSPARC T2: Be My Guest, Guest, Guest...

I mentioned in my last blog entry that our new UltraSPARC T2 processor (code name Niagara 2) would make a nice consolidation platform with its 64 threads and with Sun's SPARC virtualization product (LDOMs) that allows multiple OS instances (Solaris or Linux) to be run simultaneously on a single system.

Lest you dismiss this as random bluster, check out Ash's blog and watch his flash demo. Picture a roughly pizza-box sized computer running 64 separate operating systems.

Pretty slick!

Tuesday Aug 07, 2007

The UltraSPARC T2 Processor: More of Everything, Please

Sun officially announced the UltraSPARC T2 processor today. Technical specifications, datasheets, etc. are available here.

The question is, who should care?

Fortunately, this is a question easily answered. :-)

Here is my unordered list of who I believe should pay attention to this announcement:

    Customers who like the T1, but need more horsepower. The T2 has 64 threads on a single chip, up from the T1's 32 threads. Couple a T2-based system with our SPARC virtualization technology (LDOMs) and you'll have quite a nice consolidation platform.

    Customers who like the idea of the T1, but who have workloads with floating point requirements. The T1 has one floating point unit on the chip to serve all 32 threads. The T2 has EIGHT floating point units--one per core. I expect some HPC customers with throughput requirements will find the UltraSPARC T2 very interesting. Note the SPEC estimates cited in the announcement materials (estimated SPECint_rate2006: 78.3; estimated SPECfp_rate2006: 62.3 [*]). Lots more performance data here.

    Customers who have significant networking requirements in addition to their throughput computing needs. The T2 comes with integrated, on-chip 10 GbE.

    Anyone who needs beefy crypto performance. Yep, our chip guys managed to cram per-core dedicated cryptography functions onto the chip as well.

    Educators and entrepreneurs who will be interested in using the UltraSPARC T2 design as the basis of their work. We expect to release the T2's design under open source, much as we've done already with the UltraSPARC T1. We've kickstarted some innovation with T1 and I expect to see even more interest with T2.

    And, last, anyone who enjoys saying things like this (you know who you are):

      Woo-eee, look at this 1 Gbyte flash drive I just bought for $10!

      I remember when we bought our first Fujitsu Eagle 1 Gbyte hard disk in the late 1980s. It cost $10K, fit in a 19" rack, and needed two people to lift it!

    Soon (before the end of this year) you'll be able to buy a T2-based system and say something like this:

      Woo-eee, look at this 1RU (1 rack-unit = 1.75 inches) server with 64 hardware threads, integrated 10 GbE networking, and onboard crypto I just bought!

      I remember when we bought that 64-CPU Starfire system back in the mid-1990s. It was six feet high, 40 inches wide and weighed about 1800 lbs!

There's actually a serious point buried in that last bit of silliness, but I'll leave that for a future blog entry.

[*] All Sun UltraSPARC T2 SPEC CPU metrics quoted are from full “reportable” runs, but are nevertheless designated as “estimates” because they were measured on preproduction systems. SPEC, SPECint, and SPECfp are registered trademarks of the Standard Performance Evaluation Corporation. Sun UltraSPARC T2 1.4GHz (1 chip, 8 cores, 64 threads): 78.3 est. SPECint_rate2006, 62.3 est. SPECfp_rate2006.


Josh Simons

