Monday Oct 13, 2008

The Death of Clock Speed


Sun just introduced yet another chip multi-threading (CMT) SPARC system, the Sun SPARC Enterprise T5440. To me, that's a Batoka (sorry, branding police) because I can't keep the model names straight. In any case, this time we've put 256 hardware threads and up to 1/2 TeraByte of memory into a four-RU server. That works out to a thread-count of about 36 threads per vertical inch, which doesn't compete with fine Egyptian cotton, but it can beat the crap out of servers with 3X faster processor clocks. If you don't understand why, then you should take the time to grok this fact. Clock speed is dead as a good metric for assessing the value of a system. For years it has been a shorthand proxy for performance, but with multi-core and multi-threaded processors becoming ubiquitous that has all changed.

Old-brain logic would say that for a particular OpenMP problem size, a faster clock will beat a slower one. But it isn't about clock anymore--it is about parallelism and latency hiding and efficient workload processing. Which is why the T5440 can perform anywhere from 1.5X to almost 2X better on a popular OpenMP benchmark against systems with almost twice the clock rate of the T5440. Those 256 threads and 32 separate floating point units are a huge advantage in parallel benchmarks and in real-world environments in which heavy workloads are the norm, especially those that can benefit from the latency-hiding offered by a CMT processor like the UltraSPARC T2 Plus: when one hardware thread stalls on a memory reference, the core simply switches to another ready thread, so the pipelines and floating point units stay busy instead of idling. Check out BM Seer over the next few days for more specific, published benchmark results for the T5440.

Yes, sure. If you have a single-threaded application and don't need to run a lot of instances, then the higher clock rate will help you significantly. But if that is your situation, then you have a big problem since you will no longer see clock rates increasing as they have in the past. You are surely starting to see this now as commodity CPUs slowly increase clock rates while embracing multicore, but you can look to our CMT systems to understand the logical progression of this trend. It's all going parallel, baby, and it is past time to start thinking about how you are going to deal with that new reality. When you do, you'll see the value of this highly threaded approach for handling real-world workloads. See my earlier post on CMT and HPC for more information, including a pointer to an interesting customer analysis of the value of CMT for HPC workloads.

A pile of engineers have been blogging about both the hardware and software aspects of the new system. Check out Allan Packer's stargate blog entry for jump coordinates into the Batoka T5440 blogosphere.


Monday Apr 21, 2008

HPC User Forum: Interconnect Panel

Last week I attended the IDC HPC User Forum meeting in Norfolk, Virginia. The meeting brings together HPC users and vendors for two days of presentation and discussion about all things HPC.

The theme of this meeting (26th in the series) was CFD -- Computational Fluid Dynamics. In addition to numerous talks on various aspects of CFD, there were a series of short vendor talks and four panel sessions with both industry and user members. I sat on both the interconnect and operating system panels and prepared a very short slide set for each session. This blog entry covers the interconnect session.

Because the organizers opted against panel member presentations at the last minute, these slides were not actually shown at the conference, though they will be included in the conference proceedings. I use them here to highlight some of the main points I made during the panel discussion.

[interconnect panel slide #1]

I participated on the interconnect panel as a stand-in for David Caplan who was unable to attend the conference due to a late-breaking conflict. Thanks to David for supplying a slide deck from which I extracted a few slides for my own use. Everything else is my fault. :-)

Fellow members of the interconnect panel included Ron Brightwell from Sandia National Lab, Patrick Geoffray from Myricom, Christian Bell from QLogic, Gilad Shainer from Mellanox, Larry Stewart from SiCortex, Richard Walsh from IDC, and (I believe--we were at separate tables and it was hard to see) Nick Nystrom from PSC and Pete Wyckoff from OSC.

The framing questions suggested by Richard Walsh prior to the meeting were used to guide the panel's discussion. The five areas were:

  • competing technologies -- five year look ahead
  • unifying the interconnect software stack
  • interconnects and parallel programming models
  • interconnect transceiver and media trends
  • performance analysis tools

I had little to say about media and transceivers, though we all seemed to agree that optical was the future direction for interconnects. There was some discussion earlier in the conference about future optical connectors and approaches (Luxtera and Hitachi Cable both gave interesting presentations) and this was not covered further in the panel.

As a whole, the panel did not have much useful to say about performance analysis. It was clear from comments that the users want a toolset that allows them to understand at the application level how well they are using their underlying interconnect fabric for computational and storage transfers. While much low-level data can be extracted from current systems, from the HCAs, and from fabric switches, there seemed to be general agreement that the vendor community has not created any facility that would allow these data to be aggregated and displayed in a manner that would be intuitive and useful to the application programmer or end user.
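
To make that point concrete: even today, a Linux node running the OpenFabrics stack exposes per-port HCA traffic counters through sysfs, and a few lines of C are enough to read them. The sketch below is purely illustrative -- the device name is made up, and the InfiniBand data counters tick in units of four octets. Getting at the raw numbers is the easy part; correlating thousands of them with what an application or file system is actually doing is the part nobody has delivered.

    #include <stdio.h>

    /* Hypothetical HCA and port; a real system might show mlx4_0, qib0, etc. */
    #define CNT_DIR "/sys/class/infiniband/mlx4_0/ports/1/counters/"

    static unsigned long long read_counter(const char *name)
    {
        char path[256];
        unsigned long long val = 0;
        FILE *f;

        snprintf(path, sizeof(path), CNT_DIR "%s", name);
        f = fopen(path, "r");
        if (f != NULL) {
            if (fscanf(f, "%llu", &val) != 1)
                val = 0;
            fclose(f);
        }
        return val;
    }

    int main(void)
    {
        /* port_xmit_data and port_rcv_data count 4-octet words, so
           multiply by 4 to report bytes. */
        printf("transmitted bytes: %llu\n", read_counter("port_xmit_data") * 4);
        printf("received bytes:    %llu\n", read_counter("port_rcv_data") * 4);
        return 0;
    }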

[interconnect panel slide #2]

Moving on to my first slide, there was little disagreement that both Ethernet and InfiniBand will continue as important HPC interconnects. It was widely agreed that 10 GbE in particular will gain broader acceptance as it becomes less expensive, and several attendees suggested that the tipping point arrives when vendors integrate 10 GbE on their system motherboards rather than relying on separate plug-in network interface cards. As you will see on a later slide, the fact that Sun is already integrating 10 GbE on board, and in some cases on-chip, in current systems was part of one of my themes during this panel.

InfiniBand (IB) has both a significant latency and bandwidth advantage over 10 GbE and it has rapidly gained favor within the HPC community. The fact that we are seeing penetration of IB into Financial Services is especially interesting because that community (which we do consider "HPC" from a computational perspective) is beginning to adopt IB not as an enabler of scalable distributed computing (e.g. MPI), but rather as an enabler of the higher messaging rates needed for distributed trading applications. In addition, we see an increased interest in InfiniBand for Oracle RAC performance using the RDS protocol.

The HPC community benefits any time its requirements or its underlying technologies begin to see significant adoption in the wider enterprise computing space. As the much larger enterprise market gets on the bandwagon, the ecosystem grows and strengthens as vendors invest more and as more enterprise customers embrace related products.

While I declared myself a fan of iWARP (RDMA over Ethernet) as a way of broadening the appeal of Ethernet for latency-sensitive applications and for reducing the load on host CPUs, there was a bit of controversy in this area among panel members. The issue arose when the QLogic and Myricom representatives expressed disapproval of some comments made by the Mellanox representative related to OpenFabrics. The latter described the OpenFabrics effort as interconnect independent, while the former were more of a mind that iWARP, which had been united with the OpenIB effort to form what is now called OpenFabrics, was not well integrated in that certain API semantics were different between IB and Ethernet. I need to learn more from Ted Kim, our resident iWARP expert.

Speaking of OpenFabrics, a few points. First, Sun is a member of the OpenFabrics Alliance, and as a vendor of Linux-based HPC solutions we will support the OFED stack. Second, Sun fully supports InfiniBand on Solaris as an essential part of our HPC on Solaris program. We do this by implementing the important OpenFabrics APIs on Solaris and even sharing code for some of the upper-level protocols.

There was also some additional disagreement among the same three panel members regarding the OpenFabrics APIs and the semantics of InfiniBand generally. Both the QLogic and Myricom representatives felt strongly that extra cruft had accrued in both standards that is unnecessary for the efficient support of HPC applications. We did not have enough time to delve into the specifics of their disagreement, but I would certainly like to hear more at some point.

[interconnect panel slide #3]
[interconnect panel slide #4]

The above two slides (courtesy of David Caplan) illustrate several points. First, they show that attention to seemingly small details can yield very large benefits at scale. In this case, the observation that we could run three separate 4x InfiniBand connections within a single 12x InfiniBand cable allowed us to build the 500 TFLOP Ranger system at TACC in Austin around what is essentially one very large central InfiniBand switch (in actuality, two). The system used 1/6 the cables required for a conventional installation of this size and many fewer switching elements. Second, they show that to build successful, effective HPC solutions one must view the problem at a system level. In this case, Sun's Constellation system architecture leverages the cabling (and connector) innovation and combines it with both the ultra-dense switch product and an ultra-dense blade chassis with 48 four-socket blades to create a nicely-packaged solution that scales into the petaflop range.

[interconnect panel slide #5]

I took this photo the day we announced our two latest CMT systems, the T5240 and T5140. I included it in the deck for several reasons. First, it allowed me to make the point that Sun has very aggressively embraced 10 GbE, putting it on the motherboard in these systems. Second, it is another example of system-level thinking in that we have carefully impedance-matched these 10 GbE capabilities to our multi-core and multi-threaded processors by supporting a large number of independent DMA channels in the 10 GbE hardware. In other words, as processors continue to get more parallel (more threads, more cores) one must keep everything in balance. Which is why you see so many DMA channels here and why we greatly increased the off-chip memory bandwidth (about 50 GB/s) and coherency bandwidth (about the same) of these UltraSPARC T2plus based systems over that available in commodity systems. And it isn't just about the hardware: our new Crossbow networking stack in OpenSolaris allows application workloads to benefit from the underlying throughput and parallelism available in all this new hardware. At the end of the day, it's about the system.

[interconnect panel slide #6]

This last slide again emphasizes the value of being able to optimize at a system level. In this case, we integrated dual 10 GbE interfaces on-chip in our UltraSPARC T2-based systems because it made sense to tightly couple the networking resources with computation for best latency and performance. You might be confused if you've noticed I said earlier that our very latest CMT systems have on-board 10 GbE, but not on-chip 10 GbE. This illustrates yet another advantage held by true system vendors -- the ability to shift optimization points at will to add value for customers. In this case, we opted to shift the 10 GbE interfaces off-chip in the UltraSPARC T2plus to make room for the 50 GB/s of coherency logic we added to allow two T2plus sockets to run together as a single, large 128-thread, 16-FPU system.

Another system vendor advantage is illustrated by our involvement in the DARPA-funded UNIC program, which stands for Ultraperformance Nanophotonic Intrachip Communications. This project seeks to create macro-chip assemblies that tie multiple physical pieces of silicon together using extremely high bandwidth and low latency optical connections. Think of the macro-chip approach as a way of creating very large scale silicon with the economics of small-scale silicon. The project is exploring the physical-level technologies that might support one or two orders of magnitude of bandwidth increase, well into the multi-terabyte-per-second range. From my HPC perspective, it remains to be seen what protocol would run across these links: coherent links lead to macro-chips; non-coherent or perhaps semi-coherent links lead to some very interesting clustering capabilities with some truly revolutionary interconnect characteristics. How we use this technology once it has been developed remains to be seen, but the possibilities are all extremely interesting.

To wrap up this overly-long entry, I'll close with the observation that one of the framing questions used for the panel session was stated backwards. In essence, the question asked what programming model changes for HPC would be engendered by evolution in the interconnect space. As I pointed out during the session, the semantics of the interconnect should be dictated by the requirements of the programming models and not the other way 'round. This led to a good, but frustrating, discussion about future programming models for HPC.

It was a good topic because it is an important topic that needs to be discussed. While there was broad agreement that MPI as a programming model will be used for HPC for a very long time, it is also clear that new models should be examined. New models are needed to 1) help new entrants into HPC, as well as others, more easily take advantage of multi-core and multi-threaded processors, 2) create scalable, high-performance applications, and 3) deal with the unavoidable hardware failures in underlying cluster hardware. It is, for example, ironic in the extreme that the HPC community has embraced one of the most brittle programming models available (i.e. MPI) and is attempting to use it to scale applications into the petascale range and beyond. There may be lessons to be learned from the enterprise computing community, which has long embedded reliability in its middleware layers to allow e-commerce sites, etc., to ride through failures with no service interruptions. More recently, it has been doing the same for increasingly large-scale computational and storage infrastructures. On the software front, Google's MapReduce, the open-source Hadoop project, Pig, etc. have all embraced resiliency as an important characteristic of their scalable application architectures. While I am not suggesting that these particular approaches be adopted for HPC with its very different algorithmic requirements, I do believe there are lessons to be learned. At some point soon, I believe resiliency will become more important to HPC than peak performance. This was discussed in more detail during the operating system panel.
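
To make the brittleness point concrete, here is a minimal C sketch -- illustrative only, not from any real application -- of roughly what the MPI standard itself gives you for handling failure. You can replace the default abort-the-whole-job error handler with MPI_ERRORS_RETURN and check return codes, but once a peer has died the standard says very little about what the surviving ranks can safely do next, which is why most codes fall back to checkpointing and restarting the entire job.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, rc = MPI_SUCCESS;
        double buf = 42.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* The default error handler is MPI_ERRORS_ARE_FATAL: any failure
           aborts every rank in the job. Asking for MPI_ERRORS_RETURN at
           least lets us see the error code... */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        if (rank == 0)
            rc = MPI_Send(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            rc = MPI_Recv(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                          MPI_STATUS_IGNORE);

        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "rank %d: communication failed: %s\n", rank, msg);
            /* ...but there is no portable way to rebuild the communicator
               around a dead peer, so recovery is left to the application. */
        }

        MPI_Finalize();
        return 0;
    }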

It was also a frustrating topic because as a vendor I would like some indication from the community as to promising directions to pursue. As with all vendors, we have limited resources and would prefer to place our bets in fewer places with increased chances for success. Are the PGAS languages (e.g. UPC, CaF) the right approach? Are there others? Is there only one answer, or are the needs of the commercial HPC community different from, for example, those of the highest-end of the HPC computing spectrum? I was accused at the meeting of saying the vendors weren't going to do anything to improve the situation. Instead, my point was that we prefer to make data-driven decisions where possible, but the community isn't really giving us the data. In the meantime, UPC does have some uptake with the intelligence community and perhaps more broadly. And Cray, IBM, and Sun have all defined new languages for HPC that aim to deliver good, scalable performance with higher productivity than existing languages. Where this will go, I do not know. But go it must...

My next entry will describe the Operating System Panel.


Thursday Apr 10, 2008

Maramba (T5240) Unplugged: A Look Under the Hood

Aside from being interesting machines for HPC, our latest systems are beautiful to behold. At least to an engineer, I suppose. We had a fun internal launch event (free ice cream!) on Sun's Burlington campus yesterday where we had a chance to look inside the boxes and ask questions of the hardware guys who designed these systems. Here is my annotated guide to the layout of a T5240 system.

[annotated T5240 system]

A. Software engineers examining the Sun SPARC Enterprise T5240. Note the uniforms. Sometimes I think the hardware guys are better dressers than the software folks.

B. Front section of the machine. Disks, DVD, etc, go here.

C. Ten hot-swappable fans with room for two more. The system will compensate for a bad fan and continue to run. It also automatically adjusts the fan speeds to deliver the amount of cooling needed at any given moment. For example, when the mezzanine (K) is in place, more cooling is needed; if it isn't, why spend the energy spinning the fans faster than necessary?

D. This is a plexiglass air plenum, which is hard to see in the photo because it is clear, though the top part near the fans does have some white-background diagrams on it. The plenum sits at an angle near the top of the photo and then runs horizontally over the memory DIMMs (F). It captures the air coming out of the fans and forces it past the DIMMs for efficient cooling. The plenum is removed when installing the mezzanine.

E. These are the two heat sinks covering the two UltraSPARC T2plus processors. The two procs are linked with four point-to-point coherency links which are also buried under that area of the board.

F. The memory DIMMs. On the motherboard you can see there are eight DIMM slots per socket for a maximum of 64 GB of memory with current FBDIMMs. But see (K) if 64 GB isn't enough memory for you.

G. There are four connectors shown here, with the rightmost connector filled with an air controller, as the others would be in an operational system. The mezzanine assembly (K) mates to the motherboard here.

H. PCIe bridge chips are here in this area, supporting external IO connectivity.

I. This is the heat sink for Neptune, Sun's multithreaded 10 GbE technology. It supports two 10 GbE links or some number of slower links and has a large number of DMA engines to offer a good impedance match with our multithreaded processors. On the last generation T2 processors, the Neptune technology was integrated onto the T2 silicon. In T2plus, we moved the 10 GbE off-chip to make room for the coherency logic that allows us to create a coherent 128-thread system with two sockets. As a systems company, we can make tradeoffs like this at will.

J. This is where the service processor for the system is located, including its associated SDRAM memory. The software guys got all tingly when they saw where all their ILOM code actually runs. ILOM is Sun's unified service processor that provides lights out management (LOM).

K. And, finally, the mezzanine assembly. This plugs into the four connectors described above (G) and allows the T5240 to be expanded to 128 GB...all in a two rack-unit enclosure. The density is incredible.


Kicking Butt with OpenMP: The Power of CMT

[t5240 internal shot]

Yesterday we launched our 3rd generation of multicore / multithreaded SPARC machines and again the systems should turn some heads in the HPC world. Last October, we introduced our first single-socket UltraSPARC T2 based systems with 64 hardware threads and eight floating point units. I blogged about the HPC aspects here. These systems showed great promise for HPC as illustrated in this analysis done by RWTH Aachen.

We have now followed that with two new systems that offer up to 128 hardware threads and 16 floating point units per node. In one or two rack units. With up to 128 GB of memory in the 2U system. These systems, the Sun SPARC Enterprise T5140 and T5240 Servers, both contain two UltraSPARC T2plus processors, which maintain coherency across four point-to-point links with a combined bandwidth of 50 GB per second. Further details are available in the Server Architecture whitepaper.

An HPC person might wonder how a system with a measly clock rate of 1.4 GHz and 128 hardware threads spread across two sockets might perform against a more conventional two-socket system running at over 4 GHz. Well, wonder no more. The OpenMP threaded parallelization model is a great way to parallelize codes on a shared memory machine and a good illustration of the value proposition of our throughput computing approach. In short, we kick butt with these new systems on SPEComp and have established a new world record for two-socket systems. Details here.
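
For anyone who hasn't played with OpenMP, here is a minimal sketch of the kind of loop-level parallelism these machines are built to soak up. It is purely illustrative -- the dot-product loop and problem size are mine, not something taken from the SPEComp suite:

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 50000000   /* illustrative problem size */

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double sum = 0.0;
        int i;

        for (i = 0; i < N; i++) {   /* serial initialization */
            a[i] = 0.5 * i;
            b[i] = 2.0 * i;
        }

        /* The iterations are divided among however many threads the
           runtime is given (e.g. OMP_NUM_THREADS=128 on a T5240).
           While one hardware thread waits on memory, others keep the
           cores and floating point units busy. */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++)
            sum += a[i] * b[i];

        printf("threads=%d dot=%g\n", omp_get_max_threads(), sum);
        free(a);
        free(b);
        return 0;
    }

With Sun Studio the compile line would be something like cc -fast -xopenmp dot.c; with gcc, gcc -O2 -fopenmp dot.c. Then set OMP_NUM_THREADS to however many hardware threads you want to throw at it and watch it scale.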

Oh, and we also just released world record two-socket numbers for both SPECint_rate2006 and SPECfp_rate2006 using these new boxes. Check out the details here.
