Last week I attended the IDC HPC User Forum meeting in Norfolk, Virginia. The meeting brings
together HPC users and vendors for two days of presentations and discussion about all things HPC.
The theme of this meeting (26th in the series) was CFD -- Computational Fluid Dynamics. In addition
to numerous talks on various aspects of CFD, there were a series of short vendor talks and four
panel sessions with both industry and user members. I sat on both the interconnect and operating
system panels and prepared a very short slide set for each session. This blog entry covers the
interconnect panel. Because the organizers opted against panel member presentations at the last minute, these slides were not
actually shown at the conference, though they will be included in the conference proceedings.
I use them here to highlight some of the main points I made during the panel discussion.
I participated on the interconnect panel as a stand-in for David Caplan who was
unable to attend the conference due to a late-breaking conflict. Thanks to David for supplying
a slide deck from which I extracted a few slides for my own use. Everything else is my own.
Fellow members of the interconnect panel included Ron Brightwell from Sandia National Lab,
Patrick Geoffray from Myricom, Christian
Bell from QLogic, Gilad Shainer from Mellanox,
Larry Stewart from SiCortex, Richard Walsh from
IDC, (I believe--we were at separate tables and it was hard to see) Nick Nystrom from
PSC, and Pete
Wyckoff from OSC.
The framing questions suggested by Richard Walsh prior to the meeting were used to guide the
panel's discussion. The five areas were:
- competing technologies -- five year look ahead
- unifying the interconnect software stack
- interconnects and parallel programming models
- interconnect transceiver and media trends
- performance analysis tools
I had little to say about media and transceivers, though we all seemed to agree that optical was the
future direction for interconnects. There was some discussion earlier in the conference about future
optical connectors and approaches (Luxtera and Hitachi Cable both
gave interesting presentations) and this was not
covered further in the panel.
As a whole, the panel did not have much of use to say about performance analysis. It
was clear from comments that the users want a toolset that allows them to understand at the application level how
well they are using their underlying interconnect fabric for computational and storage transfers. While much low-level
data can be extracted from current systems, from the HCAs, and from fabric switches, there seemed to be general
agreement that the vendor community has not created any facility that would allow these data to be aggregated and
displayed in a manner that would be intuitive and useful to the application programmer or end user.
Moving on to my first slide, there was little disagreement that both Ethernet and
InfiniBand will continue as
important HPC interconnects. It was widely agreed that 10 GbE in particular will gain broader
acceptance as it becomes less expensive, and several attendees suggested that tipping point will
come when vendors integrate 10 GbE on their system motherboards rather than relying on
separate plug-in network interface cards. As you will see on a later slide, the fact that Sun is already integrating 10 GbE on board, and
in some cases on-chip, in current systems was part of one of my themes during this panel.
InfiniBand (IB) has both a significant latency and bandwidth advantage over 10 GbE and it has rapidly
gained favor within the HPC community. The fact that we are seeing penetration of IB into Financial
Services is especially interesting because that community (which we do consider "HPC" from a
computational perspective) is beginning to adopt IB not as an enabler of scalable distributed
computing (e.g. MPI), but rather as an enabler of the higher messaging rates needed for distributed
trading applications. In addition, we see an increased interest in InfiniBand for Oracle RAC
performance using the RDS protocol.
The HPC community benefits any time its requirements or its underlying
technologies begin to see significant adoption in the wider enterprise computing space. As the
much larger enterprise market gets on the bandwagon, the ecosystem grows and strengthens
as vendors invest more and as more enterprise customers embrace related products.
While I declared myself a fan of iWARP (RDMA over
Ethernet) as a way of broadening the appeal of Ethernet for latency-sensitive applications
and for reducing the load on host CPUs, there was a bit of controversy in this area among
panel members. The issue arose when the QLogic and Myricom representatives expressed
disapproval of some comments made by the Mellanox representative related to OpenFabrics.
The latter described the OpenFabrics effort as interconnect independent, while the former
were more of a mind that iWARP, which had been united with the OpenIB effort to form what
is now called OpenFabrics, was not well integrated in that certain API semantics were
different between IB and Ethernet. I need to learn more from Ted Kim, our
resident iWARP expert.
Speaking of OpenFabrics, a few points. First, Sun is a member of the OpenFabrics Alliance and as
offerors of Linux-based HPC solutions we will support the OFED stack. Second, Sun does fully
support InfiniBand on Solaris as an essential part of our HPC on Solaris program. We do this by implementing
the important OpenFabrics APIs on Solaris and even sharing code for some of the upper-level protocols.
Third, there was some additional disagreement among the same three panel members regarding
the OpenFabrics APIs and the semantics of InfiniBand generally. Both the QLogic and Myricom
representatives felt strongly that extra cruft had accrued in both standards that is unnecessary
for the efficient support of HPC applications. We did not have enough time to delve into the
specifics of their disagreement, but I would certainly like to hear more at some point.
The above two slides (courtesy of David Caplan) illustrate several points. First, they show that
attention to seemingly small details can yield very large benefits at a large scale. In this case, the observation
that we could run three separate 4x InfiniBand connections within one single 12x InfiniBand cable allowed
us to build the 500 TFLOP Ranger system at TACC in Austin with essentially a single very large
central InfiniBand switch (though, in actuality, two). The system used 1/6 the cables required for a conventional
installation of this size and many fewer switching elements. And, second, to build successful, effective HPC
solutions one must view the problem at a system level. In this case, Sun's Constellation system architecture
leverages the cabling (and connector) innovation and combines it with both the ultra-dense switch
product as well as an ultra-dense blade chassis with 48 four-socket blades to create a nicely-packaged
solution that scales into the petaflop range.
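The cable savings above can be sketched with some back-of-the-envelope arithmetic. Note that the node count and the two-cables-per-port assumption for a conventional fat tree below are my own illustrative assumptions, not figures from the talk:

```python
# Back-of-the-envelope cable counting for a Ranger-class cluster.
# The node count and topology assumptions are illustrative guesses.
nodes = 3936  # hypothetical node count

# Assumed conventional two-tier fat tree: each node needs one 4x
# cable to a leaf switch plus one 4x cable from leaf to spine.
conventional_cables = nodes * 2

# Constellation-style design: three 4x links share a single 12x
# cable that runs straight from the blade chassis to the central
# switch, eliminating the leaf-to-spine cable tier entirely.
constellation_cables = nodes // 3

reduction = conventional_cables / constellation_cables
print(reduction)  # 6.0 -- i.e., 1/6 the cables
```

Under these assumptions, collapsing the leaf tier into the central switch contributes a factor of two, and the 12x cabling trick contributes the remaining factor of three.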
I took this photo the day we announced our two latest CMT systems (the T5140 among them). I included
it in the deck for several reasons. First, it allowed me to make the point that Sun has very aggressively
embraced 10 GbE, putting it on the motherboard in these systems. Second, it is another
example of system-level thinking in that we have carefully impedance-matched these 10 GbE capabilities
to our multi-core and multi-threaded processors by supporting a large number of independent DMA
channels in the 10 GbE hardware. In other words, as processors continue to get more parallel (more threads,
more cores) one must keep everything in balance. Which is why you see so many DMA channels here
and why we greatly increased the off-chip memory bandwidth (about 50 GB/s) and coherency bandwidth
(about the same) of these UltraSPARC T2plus-based systems over that available in commodity systems. And it isn't just about
the hardware: our new Crossbow networking stack in
OpenSolaris allows application workloads to benefit from
the underlying throughput and parallelism available in all this new hardware. At the end of the day, it's about the system.
This last slide again emphasizes the value of being able to optimize at a system level. In this case,
we were able to integrate dual 10 GbE interfaces on-chip in our UltraSPARC T2-based
systems because it made sense to tightly couple the networking resources with computation for
best latency and performance. You might be confused if you've noticed I said earlier that our very latest CMT
systems have on-board 10 GbE, but not on-chip 10 GbE. This illustrates yet another advantage held
by true system vendors -- the ability to shift optimization points at will to add value for customers.
In this case, we opted to shift the 10 GbE interfaces off-chip in the UltraSPARC T2plus to make room
for the 50 GB/s of coherency logic we added to allow two T2plus sockets to be run together as
a single, large 128-thread, 16 FPU system.
Another system vendor advantage is illustrated by our involvement in the DARPA-funded program
UNIC, which stands for Ultraperformance Nanophotonic Intrachip Communications. This project seeks to create macro-chip assemblies that tie multiple physical pieces of silicon together using extremely high
bandwidth and low latency optical connections. Think of the macro-chip approach as a way of creating very large scale
silicon with the economics of small-scale silicon. The project is exploring the physical-level technologies that might support
one or two orders of magnitude bandwidth increase, well into the multi-terabyte-per-second range. From my HPC perspective, it remains to be
seen what protocol would run across these links: coherent links lead to macro-chips; non-coherent or perhaps semi-coherent links
lead to some very interesting clustering capabilities with some truly revolutionary interconnect characteristics. How we use
this technology once it has been developed remains to be seen, but the possibilities are all extremely interesting.
To wrap up this overly-long entry, I'll close with the observation that one of the framing questions used for
the panel session was stated backwards. In essence, the question asked what programming model changes for HPC
would be engendered by evolution in the interconnect space. As I pointed out during the session, the semantics
of the interconnect should be dictated by the requirements of the programming models and not the other way 'round.
This led to a good, but frustrating, discussion about future programming models for HPC.
It was a good topic because it is an important topic that needs to be discussed. While there was broad agreement
that MPI as a programming model will be used for HPC for a very long time, it is also clear that new models should be
examined. New models are needed to 1) help new entrants into HPC, as well as others, more easily take advantage of
multi-core and multi-threaded processors, 2) create scalable, high-performance applications, and 3) deal with the
unavoidable hardware failures in underlying cluster hardware. It is, for example, ironic in the extreme that the HPC
community has embraced one of the most brittle programming models available (i.e. MPI) and is attempting to use it to
scale applications into the petascale range and beyond. There may be lessons to be learned
from the enterprise computing community which has long embedded reliability in their middleware layers to
allow e-commerce sites, etc., to ride through failures with no service interruptions. More recently, they are
doing the same for increasingly large-scale computational and storage infrastructures. On the software front, Google's
MapReduce, the open-source Hadoop project, Pig, etc. have all embraced resiliency as an important characteristic
of their scalable application architectures. While I am not suggesting that these particular approaches be adopted for HPC with its very
different algorithmic requirements, I do believe there are lessons to be learned. At some point soon I believe resiliency will become more important to HPC than peak performance. This was discussed in more detail during the operating system panel.
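To make the resiliency contrast concrete, here is a toy sketch in plain Python (not any real framework's API; the function names and failure model are invented for illustration) of the MapReduce-style idea: failed tasks are retried individually rather than aborting the whole job, whereas a typical MPI job dies outright if any rank fails.

```python
def resilient_map(func, tasks, max_retries=3):
    """Apply func to each task, retrying failed tasks instead of
    aborting the whole job -- the MapReduce-style resiliency idea."""
    results = []
    for task in tasks:
        for attempt in range(max_retries):
            try:
                results.append(func(task))
                break  # task succeeded; move on
            except RuntimeError:
                continue  # reschedule the task (here: just retry)
        else:
            raise RuntimeError("task %r failed %d times" % (task, max_retries))
    return results

# Deterministic stand-in for a flaky compute node: every task's
# first attempt fails, the second succeeds.
attempts = {}
def flaky_square(x):
    attempts[x] = attempts.get(x, 0) + 1
    if attempts[x] == 1:
        raise RuntimeError("simulated node failure")
    return x * x

print(resilient_map(flaky_square, [1, 2, 3, 4]))  # [1, 4, 9, 16]
```

The point is not the ten lines of retry logic but where the logic lives: in this model, failure handling is part of the runtime's contract with the application, rather than something each application must bolt on afterward.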
It was also a frustrating topic because as a vendor I would like some indication from the community as to promising
directions to pursue. As with all vendors, we have limited resources and would prefer to place our bets in fewer
places with increased chances for success. Are the PGAS languages (e.g. UPC, CaF) the right approach? Are
there others? Is there only one answer, or are the needs of the commercial HPC community different from, for
example, those of the highest-end of the HPC computing spectrum? I was accused at the meeting of saying
the vendors weren't going to do anything to improve the situation. Instead, my point was that we prefer to
make data-driven decisions where possible, but the community isn't really giving us the data. In the meantime,
UPC does have some uptake with the intelligence community and perhaps more broadly. And Cray, IBM, and
Sun have all defined new languages for HPC that aim to deliver good, scalable performance with higher
productivity than existing languages. Where this will go, I do not know. But go it must...
My next entry will describe the Operating System Panel.