HPC User Forum: Interconnect Panel
By Josh Simons on Apr 21, 2008
Last week I attended the IDC HPC User Forum meeting in Norfolk, Virginia. The meeting brings together HPC users and vendors for two days of presentation and discussion about all things HPC.
The theme of this meeting (26th in the series) was CFD -- Computational Fluid Dynamics. In addition to numerous talks on various aspects of CFD, there was a series of short vendor talks and four panel sessions with both industry and user members. I sat on both the interconnect and operating system panels and prepared a very short slide set for each session. This blog entry covers the interconnect session.
Because the organizers opted against panel member presentations at the last minute, these slides were not actually shown at the conference, though they will be included in the conference proceedings. I use them here to highlight some of the main points I made during the panel discussion.
I participated on the interconnect panel as a stand-in for David Caplan who was unable to attend the conference due to a late-breaking conflict. Thanks to David for supplying a slide deck from which I extracted a few slides for my own use. Everything else is my fault.
Fellow members of the interconnect panel included Ron Brightwell from Sandia National Lab, Patrick Geoffray from Myricom, Christian Bell from QLogic, Gilad Shainer from Mellanox, Larry Stewart from SiCortex, Richard Walsh from IDC, and (I believe--we were at separate tables and it was hard to see) Nick Nystrom from PSC and Pete Wyckoff from OSC.
The framing questions suggested by Richard Walsh prior to the meeting were used to guide the panel's discussion. The five areas were:
- competing technologies -- five year look ahead
- unifying the interconnect software stack
- interconnects and parallel programming models
- interconnect transceiver and media trends
- performance analysis tools
I had little to say about media and transceivers, though we all seemed to agree that optical was the future direction for interconnects. There was some discussion earlier in the conference about future optical connectors and approaches (Luxtera and Hitachi Cable both gave interesting presentations) and this was not covered further in the panel.
As a whole, the panel did not have much useful to say about performance analysis. It was clear from comments that the users want a toolset that allows them to understand at the application level how well they are using their underlying interconnect fabric for computational and storage transfers. While much low-level data can be extracted from current systems, from the HCAs, and from fabric switches, there seemed to be general agreement that the vendor community has not created any facility that would allow these data to be aggregated and displayed in a manner that would be intuitive and useful to the application programmer or end user.
Moving on to my first slide, there was little disagreement that both Ethernet and InfiniBand will continue as important HPC interconnects. It was widely agreed that 10 GbE in particular will gain broader acceptance as it becomes less expensive, and several attendees suggested that this will begin to happen when vendors integrate 10 GbE on their system motherboards rather than relying on separate plug-in network interface cards. As you will see on a later slide, the fact that Sun is already integrating 10 GbE on board, and in some cases on-chip, in current systems was part of one of my themes during this panel.
InfiniBand (IB) has both a significant latency and bandwidth advantage over 10 GbE and it has rapidly gained favor within the HPC community. The fact that we are seeing penetration of IB into Financial Services is especially interesting because that community (which we do consider "HPC" from a computational perspective) is beginning to adopt IB not as an enabler of scalable distributed computing (e.g. MPI), but rather as an enabler of the higher messaging rates needed for distributed trading applications. In addition, we see an increased interest in InfiniBand for Oracle RAC performance using the RDS protocol.
The HPC community benefits any time its requirements or its underlying technologies begin to see significant adoption in the wider enterprise computing space. As the much larger enterprise market gets on the bandwagon, the ecosystem grows and strengthens as vendors invest more and as more enterprise customers embrace related products.
While I declared myself a fan of iWARP (RDMA over Ethernet) as a way of broadening the appeal of Ethernet for latency-sensitive applications and for reducing the load on host CPUs, there was a bit of controversy in this area among panel members. The issue arose when the QLogic and Myricom representatives expressed disapproval of some comments made by the Mellanox representative related to OpenFabrics. The latter described the OpenFabrics effort as interconnect independent, while the former were more of a mind that iWARP, which had been united with the OpenIB effort to form what is now called OpenFabrics, was not well integrated in that certain API semantics were different between IB and Ethernet. I need to learn more from Ted Kim, our resident iWARP expert.
Speaking of OpenFabrics, a few points. First, Sun is a member of the OpenFabrics Alliance and as offerors of Linux-based HPC solutions we will support the OFED stack. Second, Sun does fully support InfiniBand on Solaris as an essential part of our HPC on Solaris program. We do this by implementing the important OpenFabrics APIs on Solaris and even sharing code for some of the upper-level protocols.
Third, there was some additional disagreement among the same three panel members regarding the OpenFabrics APIs and the semantics of InfiniBand generally. Both the QLogic and Myricom representatives felt strongly that extra cruft had accrued in both standards that is unnecessary for the efficient support of HPC applications. We did not have enough time to delve into the specifics of their disagreement, but I would certainly like to hear more at some point.
The above two slides (courtesy of David Caplan) illustrate several points. First, they show that attention to seemingly small details can yield very large benefits at a large scale. In this case, the observation that we could run three separate 4x InfiniBand connections within one single 12x InfiniBand cable allowed us to build the 500 TFLOP Ranger system at TACC in Austin with essentially one very large central InfiniBand switch (in actuality, two). The system used 1/6 the cables required for a conventional installation of this size and many fewer switching elements. And, second, to build successful, effective HPC solutions one must view the problem at a system level. In this case, Sun's Constellation system architecture leverages the cabling (and connector) innovation and combines it with both the ultra-dense switch product and an ultra-dense blade chassis with 48 four-socket blades to create a nicely-packaged solution that scales into the petaflop range.
I took this photo the day we announced our two latest CMT systems, the T5240 and T5140. I included it in the deck for several reasons. First, it allowed me to make the point that Sun has very aggressively embraced 10 GbE, putting it on the motherboard in these systems. Second, it is another example of system-level thinking in that we have carefully impedance-matched these 10 GbE capabilities to our multi-core and multi-threaded processors by supporting a large number of independent DMA channels in the 10 GbE hardware. In other words, as processors continue to get more parallel (more threads, more cores) one must keep everything in balance. Which is why you see so many DMA channels here and why we greatly increased the off-chip memory bandwidth (about 50 GB/s) and coherency bandwidth (about the same) of these UltraSPARC T2plus based systems over that available in commodity systems. And it isn't just about the hardware: our new Crossbow networking stack in OpenSolaris allows application workloads to benefit from the underlying throughput and parallelism available in all this new hardware. At the end of the day, it's about the system.
This last slide again emphasizes the value of being able to optimize at a system level. In this case, we've been able to integrate dual 10 GbE interfaces on-chip in our UltraSPARC T2-based systems because it made sense to tightly couple the networking resources with computation for best latency and performance. You might be confused if you noticed I said earlier that our very latest CMT systems have on-board 10 GbE, but not on-chip 10 GbE. This illustrates yet another advantage held by true system vendors -- the ability to shift optimization points at will to add value for customers. In this case, we opted to shift the 10 GbE interfaces off-chip in the UltraSPARC T2plus to make room for the 50 GB/s of coherency logic we added to allow two T2plus sockets to be run together as a single, large 128-thread, 16 FPU system.
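For readers wondering where the 128-thread, 16 FPU figure comes from, here is a minimal sketch of the arithmetic. The per-socket numbers (8 cores per socket, 8 hardware threads per core, one FPU per core) are my assumptions about the UltraSPARC T2plus and are not stated in this entry; the function name is purely illustrative.

```python
# Assumed per-socket figures for the UltraSPARC T2plus (not taken from the post).
CORES_PER_SOCKET = 8
THREADS_PER_CORE = 8
FPUS_PER_CORE = 1


def system_totals(sockets: int) -> dict:
    """Return total cores, hardware threads, and FPUs for a given socket count."""
    cores = sockets * CORES_PER_SOCKET
    return {
        "cores": cores,
        "threads": cores * THREADS_PER_CORE,
        "fpus": cores * FPUS_PER_CORE,
    }


if __name__ == "__main__":
    # Two coherently linked sockets, as in the two-socket systems described above.
    print(system_totals(2))
```

Under those assumptions, two sockets give the thread and FPU counts quoted above.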
Another system vendor advantage is illustrated by our involvement in the DARPA-funded program UNIC, which stands for Ultraperformance Nanophotonic Intrachip Communications. This project seeks to create macro-chip assemblies that tie multiple physical pieces of silicon together using extremely high bandwidth and low latency optical connections. Think of the macro-chip approach as a way of creating very large scale silicon with the economics of small-scale silicon. The project is exploring the physical-level technologies that might support one or two orders of magnitude bandwidth increase well into the multi-terabyte range. From my HPC perspective, it remains to be seen what protocol would run across these links: coherent links lead to macro-chips; non-coherent or perhaps semi-coherent links lead to some very interesting clustering capabilities with some truly revolutionary interconnect characteristics. How we use this technology once it has been developed is an open question, but the possibilities are all extremely interesting.
To wrap up this overly-long entry, I'll close with the observation that one of the framing questions used for the panel session was stated backwards. In essence, the question asked what programming model changes for HPC would be engendered by evolution in the interconnect space. As I pointed out during the session, the semantics of the interconnect should be dictated by the requirements of the programming models and not the other way 'round. This led to a good, but frustrating, discussion about future programming models for HPC.
It was a good topic because it is an important topic that needs to be discussed. While there was broad agreement that MPI as a programming model will be used for HPC for a very long time, it is also clear that new models should be examined. New models are needed to 1) help new entrants into HPC, as well as others, more easily take advantage of multi-core and multi-threaded processors, 2) create scalable, high-performance applications, and 3) deal with the unavoidable hardware failures in underlying cluster hardware. It is, for example, ironic in the extreme that the HPC community has embraced one of the most brittle programming models available (i.e. MPI) and is attempting to use it to scale applications into the petascale range and beyond. There may be lessons to be learned from the enterprise computing community, which has long embedded reliability in its middleware layers to allow e-commerce sites, etc., to ride through failures with no service interruptions. More recently, that community has been doing so on increasingly large-scale computational and storage infrastructures. On the software front, Google's MapReduce, the open-source Hadoop project, Pig, etc. have all embraced resiliency as an important characteristic of their scalable application architectures. While I am not suggesting that these particular approaches be adopted for HPC with its very different algorithmic requirements, I do believe there are lessons to be learned. At some point soon I believe resiliency will become more important to HPC than peak performance. This was discussed in more detail during the operating system panel.
It was also a frustrating topic because as a vendor I would like some indication from the community as to promising directions to pursue. As with all vendors, we have limited resources and would prefer to place our bets in fewer places with increased chances for success. Are the PGAS languages (e.g. UPC, CaF) the right approach? Are there others? Is there only one answer, or are the needs of the commercial HPC community different from, for example, those of the highest-end of the HPC computing spectrum? I was accused at the meeting of saying the vendors weren't going to do anything to improve the situation. Instead, my point was that we prefer to make data-driven decisions where possible, but the community isn't really giving us the data. In the meantime, UPC does have some uptake with the intelligence community and perhaps more broadly. And Cray, IBM, and Sun have all defined new languages for HPC that aim to deliver good, scalable performance with higher productivity than existing languages. Where this will go, I do not know. But go it must...
My next entry will describe the Operating System Panel.