Monday Oct 13, 2008

The Death of Clock Speed


Sun just introduced yet another chip multi-threading (CMT) SPARC system, the Sun SPARC Enterprise T5440. To me, that's a Batoka (sorry, branding police) because I can't keep the model names straight. In any case, this time we've put 256 hardware threads and up to half a terabyte of memory into a four-RU server. That works out to about 36 threads per vertical inch, which doesn't compete with fine Egyptian cotton, but it can beat the crap out of servers with 3X faster processor clocks. If you don't understand why, then you should take the time to grok this fact: clock speed is dead as a good metric for assessing the value of a system. For years it has been a shorthand proxy for performance, but with multi-core and multi-threaded processors becoming ubiquitous, that has all changed.

Old-brain logic would say that for a particular OpenMP problem size, a faster clock will beat a slower one. But it isn't about clock anymore--it is about parallelism and latency hiding and efficient workload processing. Which is why the T5440 can perform anywhere from 1.5X to almost 2X better on a popular OpenMP benchmark than systems with almost twice the clock rate of the T5440. Those 256 threads and 32 separate floating point units are a huge advantage in parallel benchmarks and in real-world environments in which heavy workloads are the norm, especially those that can benefit from the latency-hiding offered by a CMT processor like the UltraSPARC T2 Plus. Check out BM Seer over the next few days for more specific, published benchmark results for the T5440.
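
To make the latency-hiding point concrete, here is a minimal OpenMP sketch of my own (illustrative only, not one of the published benchmarks): a memory-bound sum over an array far larger than any cache. While one thread stalls on a load, a CMT core keeps issuing instructions from its sibling threads. The file name and build lines are hypothetical.

    /*
     * sum.c -- illustrative OpenMP sketch of a memory-bound reduction.
     * Hypothetical build lines:
     *   cc  -xopenmp -fast sum.c     (Sun Studio)
     *   gcc -fopenmp -O2   sum.c
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void)
    {
        const long n = 1L << 26;                  /* ~64M doubles, deliberately larger than cache */
        double *a = malloc((size_t)n * sizeof *a);
        if (a == NULL) return 1;

        for (long i = 0; i < n; i++)              /* fill the array */
            a[i] = (double)(i % 7);

        double sum = 0.0;
        double t0 = omp_get_wtime();

        /* Each thread sums a chunk of the array; the reduction combines
         * the per-thread partial sums when the loop finishes. While one
         * thread waits on a cache miss, the core runs its siblings. */
        #pragma omp parallel for reduction(+:sum) schedule(static)
        for (long i = 0; i < n; i++)
            sum += a[i];

        double t1 = omp_get_wtime();
        printf("threads=%d  sum=%g  time=%.3f s\n",
               omp_get_max_threads(), sum, t1 - t0);
        free(a);
        return 0;
    }

Run it with OMP_NUM_THREADS=1 and again with all the hardware threads on the box, and the clock rate of any single thread quickly stops being the interesting number.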

Yes, sure. If you have a single-threaded application and don't need to run a lot of instances, then a higher clock rate will help you significantly. But if that is your situation, then you have a big problem, since you will no longer see clock rates increasing the way they have in the past. You are surely starting to see this now with commodity CPUs slowly increasing clock rates and embracing multicore, but you can look to our CMT systems to understand the logical progression of this trend. It's all going parallel, baby, and it is past time to start thinking about how you are going to deal with that new reality. When you do, you'll see the value of this highly threaded approach for handling real-world workloads. See my earlier post on CMT and HPC for more information, including a pointer to an interesting customer analysis of the value of CMT for HPC workloads.

A pile of engineers have been blogging about both the hardware and software aspects of the new system. Check out Allan Packer's stargate blog entry for jump coordinates into the Batoka T5440 blogosphere.


Thursday Apr 10, 2008

Maramba (T5240) Unplugged: A Look Under the Hood

Aside from being interesting machines for HPC, our latest systems are beautiful to behold. At least to an engineer, I suppose. We had a fun internal launch event (free ice cream!) on Sun's Burlington campus yesterday where we had a chance to look inside the boxes and ask questions of the hardware guys who designed these systems. Here is my annotated guide to the layout of a T5240 system.

[annotated T5240 system]

A. Software engineers examining the Sun SPARC Enterprise T5240. Note the uniforms. Sometimes I think the hardware guys are better dressers than the software folks.

B. Front section of the machine. Disks, DVD, etc, go here.

C. Ten hot-swappable fans with room for two more. The system will compensate for a bad fan and continue to run. It also automatically adjusts fan speeds to deliver only as much cooling as the configuration and load demand. For example, when the mezzanine (K) is in place, more cooling is needed; when it isn't, why spend the energy spinning the fans faster than you need?

D. This is a plexiglass air plenum, which is hard to see in the photo because it is clear, although the top part near the fans does have some white-background diagrams on it. The plenum sits at an angle near the top of the photo and then runs horizontally over the memory DIMMs (F). It captures the air coming out of the fans and forces it past the DIMMs for efficient cooling. The plenum is removed when installing the mezzanine.

E. These are the two heat sinks covering the two UltraSPARC T2 Plus processors. The two processors are linked by four point-to-point coherency links, which are also buried under that area of the board.

F. The memory DIMMs. On the motherboard you can see there are eight DIMM slots per socket for a maximum of 64 GB of memory with current FBDIMMs. But see (K) if 64 GB isn't enough memory for you.

G. There are four connectors shown here; the rightmost connector is filled with an air controller, as the others would be in an operational system. The mezzanine assembly (K) mates to the motherboard here.

H. PCIe bridge chips are here in this area, supporting external IO connectivity.

I. This is the heat sink for Neptune, Sun's multithreaded 10 GbE technology. It supports two 10 GbE links or some number of slower links, and it has a large number of DMA engines to offer a good impedance match with our multithreaded processors. In the last-generation T2 processor, the Neptune technology was integrated onto the T2 silicon. In the T2 Plus, we moved the 10 GbE off-chip to make room for the coherency logic that allows us to create a coherent 128-thread system with two sockets. As a systems company, we can make tradeoffs like this at will.

J. This is where the service processor for the system is located, including its associated SDRAM memory. The software guys got all tingly when they saw where all their ILOM code actually runs. ILOM is Sun's unified service processor that provides lights-out management (LOM).

K. And, finally, the mezzanine assembly. This plugs into the four sockets described above (G) and allows the T5240 to be expanded to 128 GB...all in a two rack-unit enclosure. The density is incredible.


Kicking Butt with OpenMP: The Power of CMT

[t5240 internal shot]

Yesterday we launched our 3rd generation of multicore / multithreaded SPARC machines and again the systems should turn some heads in the HPC world. Last October, we introduced our first single-socket UltraSPARC T2 based systems with 64 hardware threads and eight floating point units. I blogged about the HPC aspects here. These systems showed great promise for HPC as illustrated in this analysis done by RWTH Aachen.

We have now followed that with two new systems that offer up to 128 hardware threads and 16 floating point units per node. In one or two rack units. With up to 128 GB of memory in the 2U system. These systems, the Sun SPARC Enterprise T5140 and T5240 Servers, both contain two UltraSPARC T2 Plus processors, which maintain coherency across four point-to-point links with a combined bandwidth of 50 GB per second. Further details are available in the Server Architecture whitepaper.

An HPC person might wonder how a system with a measly clock rate of 1.4 GHz and 128 hardware threads spread across two sockets might perform against a more conventional two-socket system running at over 4 GHz. Well, wonder no more. The OpenMP threaded parallelization model is a great way to parallelize codes on shared memory machines and a good illustration of the value proposition of our throughput computing approach. In short, we kick butt with these new systems on SPEComp and have established a new world record for two-socket systems. Details here.
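
For readers who haven't used it, OpenMP is just compiler directives applied to ordinary loops. The hypothetical daxpy-style sketch below (my own, not taken from SPEComp) shows how one binary spreads its iteration space across however many hardware threads you give it.

    /*
     * daxpy.c -- illustrative OpenMP sketch, not part of any published benchmark.
     * The directives split the loop across the team of threads, so the same
     * binary scales from a laptop up to 128 threads on a T5240.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void)
    {
        const long n = 1L << 24;
        double *x = malloc((size_t)n * sizeof *x);
        double *y = malloc((size_t)n * sizeof *y);
        if (x == NULL || y == NULL) return 1;

        for (long i = 0; i < n; i++) {
            x[i] = 1.0;
            y[i] = 2.0;
        }

        int nthreads = 0;

        #pragma omp parallel
        {
            #pragma omp single
            nthreads = omp_get_num_threads();     /* record the team size once */

            /* Each thread gets a contiguous slice of the iteration space. */
            #pragma omp for
            for (long i = 0; i < n; i++)
                y[i] = 3.0 * x[i] + y[i];
        }

        printf("ran with %d OpenMP threads, y[0] = %g\n", nthreads, y[0]);
        free(x);
        free(y);
        return 0;
    }

The thread count is chosen at run time--something like OMP_NUM_THREADS=128 ./daxpy on a T5240--with no source changes, which is what makes these highly threaded boxes so easy to feed.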

Oh, and we also just released world record two-socket numbers for both SPECint_rate2006 and SPECfp_rate2006 using these new boxes. Check out the details here.
