Friday Sep 04, 2009

HPC Virtual Conference: No Travel Budget? No Problem!

Sun is holding a virtual HPC conference on September 17th featuring Andy Bechtolsheim as keynote speaker. Andy will be talking about the challenges around creating Exaflop systems by 2020, after which he will participate in a chat session with attendees. In fact, each of the conference speakers (see agenda) will chat with attendees after their presentations.

There will also be two sets of exhibits to "visit" to find information on HPC solutions for specific industries or to get information on specific HPC technologies. Industries covered include MCAE, EDA, Government/Education/Research, Life Sciences, and Digital Media. There will be technology exhibits on storage software and hardware, integrated software stack for HPC, compute and networking hardware, and HPC services.

This is a free event. Register here.

Tuesday Apr 14, 2009

You Say Nehalem, I Say Nehali

Let me say this first: NEHALEM, NEHALEM, NEHALEM. You can call it the Intel Xeon Series 5500 processor if you like, but the HPC community has been whispering about Nehalem and rubbing its collective hands together in anticipation of Nehalem for what seems like years now. So, let's talk about Nehalem.

Actually, let's not. I'm guessing that between the collective might of the Intel PR machine, the traditional press, and the blogosphere you are already well-steeped in the details of this new Intel processor and why it excites the HPC community. Rather than talk about the processor per se, let's talk instead about the more typical HPC scenario: not about a single Nehalem, but rather piles of Nehali and how best to deploy them in an HPC cluster configuration.

Our HPC clustering approach is based on the Sun Constellation architecture, which we've deployed at TACC and other large-scale HPC sites around the world. These existing systems house compute nodes in the Sun Blade 6000 System chassis which holds four blade shelves, each with twelve blade systems for a total of 48 blades per chassis. Constellation also includes matching InfiniBand infrastructure, including InfiniBand Network Express Modules (NEMs) and a range of InfiniBand switches that can be used to build petascale compute clusters. You can see several of the 82 TACC Ranger Constellation chassis in this photo, interspersed with (black) inline cooling units.

As part of our continued focus on HPC customer requirements, we've done something interesting with our new Nehalem-based Vayu blade (officially, the Sun Blade X6275 Server Module): each blade houses two separate nodes. Here is a photo of the Vayu blade:

Each of the nodes is a diskless, two-socket Nehalem system with 12 DDR3 DIMM slots per node (up to 96 GB per node) and on-board QDR InfiniBand. It's actually not quite correct to call this node diskless because it does include two Sun Flash Module slots (one per node) that each provide up to 24 GB of FLASH storage through a SATA interface. I am sure our HPC customers will use what amounts to an ultra-fast disk for some interesting applications.

Using Vayu blades, each Sun Constellation chassis can now support a total of 2 nodes/blade * 12 blades/shelf * 4 shelves/chassis = 96 nodes with a peak floating-point performance of about 9 TFLOPs. While a chassis can support up to 96 GB/node * 96 nodes = 9.2 TB of memory, there are some subtleties involved in optimizing and configuring memory for Nehalem systems, so I recommend reading John Nerl's blog entry for a detailed discussion of this topic.
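For anyone who wants to check the chassis arithmetic, here is a minimal Python sketch. The node and memory figures come straight from the paragraph above. The clock rate (2.93 GHz) and the four double-precision floating-point operations per core per cycle are my assumptions for illustration; they are not stated in this entry, though they do reproduce the "about 9 TFLOPs" figure.

```python
# Chassis-level totals for a Constellation chassis filled with Vayu blades.
NODES_PER_BLADE = 2
BLADES_PER_SHELF = 12
SHELVES_PER_CHASSIS = 4
MAX_GB_PER_NODE = 96
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 4

# Assumptions for illustration only (not stated in this entry).
ASSUMED_GHZ = 2.93
ASSUMED_FLOPS_PER_CYCLE = 4


def chassis_totals():
    """Return (nodes, memory in TB, peak TFLOPs) for one full chassis."""
    nodes = NODES_PER_BLADE * BLADES_PER_SHELF * SHELVES_PER_CHASSIS
    memory_tb = nodes * MAX_GB_PER_NODE / 1000  # decimal terabytes
    cores = nodes * SOCKETS_PER_NODE * CORES_PER_SOCKET
    peak_tflops = cores * ASSUMED_GHZ * ASSUMED_FLOPS_PER_CYCLE / 1000
    return nodes, memory_tb, peak_tflops


nodes, memory_tb, peak_tflops = chassis_totals()
print(f"{nodes} nodes, {memory_tb:.1f} TB, {peak_tflops:.1f} TFLOPs peak")
```

Change the two assumed constants to match whatever processor bin you actually deploy; the node and memory totals do not depend on them.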

For a quick visual tour of Vayu, see the annotated photo below. The major components are: A = Nehalem 4-core processor; B = Memory DIMMs; C = Mellanox ConnectX QDR InfiniBand ASIC; D = Tylersburg I/O chipset; E = Baseboard Management Controller (BMC) / Service Processor (one per node); F = Sun Flash Modules (SATA, 24 GB per node). The connector at the top right supports two PCIe2 interfaces, two InfiniBand interfaces, and two GbE interfaces. The GbE logic is hiding under the BMC daughter board.

To accommodate this high-density blade approach we've developed a new QDR InfiniBand NEM (officially called the SunBlade 6048 QDR IB Switched NEM), which is shown below. This Network Express Module plugs into a blade shelf's midplane and forms the first level of InfiniBand fabric in a cluster configuration. Specifically, the two on-board 36-port Mellanox QDR switch chips act as leaf switches from which larger configurations can be built. Of the 72 total switch ports available, 24 are used for the Vayu blades and 9 from each switch chip are used to interconnect the two switches, leaving a total of 72 - 24 - 2*9 = 30 QDR links available for off-shelf connections. These links leave the NEM through 10 physical connectors, each of which carries three QDR 4X links over a single cable. As we discussed when the first version of Constellation was released, aggregating 4X links into 12X cables results in significant customer benefits related to reliability, density, and complexity. In any case, these cables can be connected to Constellation switches to form larger, tree-based fabrics. Or they can be used in switchless configurations to build torus-based topologies by using the ten cables to carry X, Y, and Z traffic between shelves and across chassis, as mentioned, for example, here. The NEM also provides GbE connectivity for each of the 24 nodes in the blade shelf.
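The NEM port budget is easy to verify. This small Python sketch uses only the numbers given in the paragraph above; the function and constant names are mine.

```python
# Port budget for the QDR InfiniBand NEM described above.
SWITCH_CHIPS = 2
PORTS_PER_CHIP = 36
NODES_PER_SHELF = 24            # one QDR port per Vayu node
INTERSWITCH_PORTS_PER_CHIP = 9  # ports each chip spends on the other chip
LINKS_PER_CABLE = 3             # three 4X links per 12X cable


def nem_port_budget():
    """Return (total ports, off-shelf links, connectors, leftover links)."""
    total = SWITCH_CHIPS * PORTS_PER_CHIP
    interswitch = SWITCH_CHIPS * INTERSWITCH_PORTS_PER_CHIP
    external_links = total - NODES_PER_SHELF - interswitch
    connectors, leftover = divmod(external_links, LINKS_PER_CABLE)
    return total, external_links, connectors, leftover


print(nem_port_budget())
```

The zero remainder in the last position confirms that the 30 off-shelf links divide evenly into the ten physical connectors.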

Looking back at the TACC photo, we can now double the compute density shown there with our newest Constellation-based systems using the new Vayu blade. Oh, and by the way, we can also remove those inline cooling units and pack those chassis side-by-side for another significant increase in compute density. I'll leave how we accomplish that last bit for a future blog entry.

Friday Jan 09, 2009

HPC and Virtualization: Oak Ridge Trip Report

Just before Sun's Winter Break, I attended a meeting at Oak Ridge National Laboratory in Tennessee with Stephen Scott, Geoffroy Vallee, Christian Engelmann, Thomas Naughton, and Anand Tikotekar, all of the Systems Research Team (SRT) at ORNL. Attending from Sun were Tim Marsland, Greg Lavender, Rebecca Arney, and myself. The topic was HPC and virtualization, an area the SRT has been exploring for some time and one I've been keen on as well as it has become clear v12n has much to offer the HPC community. This is my trip report.

I arrived at Logan Airport in Boston early enough on Monday to catch an earlier flight to Dulles, narrowly avoiding the five-hour delay that eventually afflicted my original flight. The flight from Boston to Knoxville via Dulles went smoothly and I arrived without difficulty to a rainy and chilly Tennessee evening. I was thrilled to have made it through Dulles without incident since more often than not I have some kind of travel difficulty when my trips pass through IAD (more on that later.) The 25 mile drive to the Oak Ridge DoubleTree was uneventful.

Oak Ridge is still very much a Lab town from what I could see, much like Los Alamos, but certainly less isolated. Movie reviews in the Oak Ridge Observer are rated with atoms rather than stars. Stephen Scott, who leads the Systems Research Team (SRT) at ORNL, mentioned that the plot plan for his house is stamped "Top Secret -- Manhattan Project" because the plan shows the degree difference between "ORNL North" and "True North", an artifact of the time when period maps of the area deliberately skewed the position of Oak Ridge to lessen the chance that a map could be used to successfully bomb ORNL targets from the air during the war.

We spent all day Tuesday with Stephen and most of the Systems Research Team. Tim talked about what Sun is doing with xVM and our overall virtualization strategy and ended with a set of questions that we spent some time discussing. Greg then talked in detail about both Crossbow and InfiniBand, specifically with respect to aspects related to virtualization. We spent the rest of the day hearing about some of the work on resiliency and virtualization being done by the team. See the end of this blog entry for pointers to some of the SRT papers as well as other HPC/virtualization papers I have found to be interesting.

Resiliency isn't something the HPC community has traditionally cared much about. Nodes were thin and cheap. If a node crashed, restart the job, replace the node, use checkpoint-restart if you can. Move on; life on the edge is hard. But the world is changing. Nodes are getting fatter again--more cores, more memory, more IO. Big SMPs in tiny packages with totally different economics from traditional large SMPs. Suddenly there is enough persistent state on a node that people start to care how long their nodes stay up. Capabilities like Fault Management start to look really interesting, especially if you are a commercial HPC customer using HPC in production.

In addition, clusters are getting larger. Much larger, even with fatter nodes. Which means more frequent hardware failures. Bad news for MPI, the world's most brittle programming model. Certainly, some more modern programming models would be welcome, but in the meantime what can be done to keep these jobs running longer in the presence of continual hardware failures? This is one promise of virtualization. And one reason why a big lab like ORNL is looking seriously at virtualization technologies for HPC.

Live migration -- the ability to shift running OS instances from one node to another -- is particularly interesting from a resiliency perspective. Linking live migration to a capable fault management facility (see, for example, what Sun has been doing in this area) could allow jobs to avoid interruption due to an impending node failure. Research by the SRT (see the Proactive Fault Tolerance paper, below) and others has shown this is a viable approach for single-node jobs and also for increasing the survivability of MPI applications in the presence of node failures. Admittedly, the current prototype depends on Xen TCP tricks to handle MPI traffic interruption and continuation, but with sufficient work to virtualize the InfiniBand fabric, this technique could be extended to that realm as well. In addition, the use of an RDMA-enabled interconnect can itself greatly increase the speed of live migration as is demonstrated in the last paper listed in the reference section below.

We discussed other benefits of virtualization. Among them, the use of multiple virtual machines per physical node to simulate a much larger cluster for demonstrating an application's basic scaling capabilities in advance of being allowed access to a real, full-scale (and expensive) compute resource. Such pre-testing becomes very important in situations in which large user populations are vying for access to relatively scarce, large-scale, centralized research resources.

Geoffroy also spoke about "adapting systems to applications, not applications to systems" by which he meant that virtualization allows an application user to bundle their application into a virtual machine instance with any other required software, regardless of the "supported" software environment available on a site's compute resource. Being able to run applications using either old versions of operating systems or perhaps operating systems with which a site's administrative staff has no experience does truly allow the application provider to adapt the system to their application without placing an additional administrative burden on a site's operational staff. Of course, this does push the burden of creating a correct configuration onto the application provider, but the freedom and flexibility should be welcomed by those who need it. Those who don't could presumably bundle their application into a "standard" guest OS instance. This is completely analogous to the use and customization of Amazon Machine Images (AMIs) on the Amazon Elastic Compute Cloud (EC2) infrastructure.

Observability was another simpatico area of discussion. DTrace has taken low-cost, fine-grained observability to new heights (new depths, actually). Similarly, SRT is looking at how one might add dynamic instrumentation at the hypervisor level to offer a clearer view of where overhead is occurring within a virtualized environment to promote user understanding and also offer a debugging capability for developers.

A few final tidbits to capture before closing. Several other research efforts are looking at HPC and virtualization, among them V3VEE (University of New Mexico and Northwestern University) and XtreemOS (a bit of a different approach to virtualization for HPC and Grids). SRT is also working on a virtualized version of OSCAR called OSCAR-V.

The Dulles Vortex of Bad Travel was more successful on my way home. My flight from Knoxville was delayed with an unexplained mechanical problem that could not be fixed in Knoxville, requiring a new plane to be flown from St. Louis. I arrived very late into Dulles, about 10 minutes before my connection to Boston was due to leave from the other end of the terminal. I ran to the gate, arriving two minutes before the flight was scheduled to depart, and it was already gone--no sign of the gate agents or the plane. I spent the night at an airport hotel and flew home first thing the next morning. Dulles had struck again--this was at least the third time I've had problems like this when passing through IAD. I have colleagues who refuse to travel with me through this airport. With good reason, apparently.

Reading list:

Proactive Fault Tolerance for HPC with Xen Virtualization, Nagarajan, Mueller, Engelmann, Scott

The Impact of Paravirtualized Memory Hierarchy on Linear Algebra Computational Kernels and Software, Youseff, Seymour, You, Dongarra, Wolski

Performance Implications of Virtualizing Multicore Cluster Machines, Ranadive, Kesavan, Gavrilovska, Schwan

High Performance Virtual Machine Migration with RDMA over Modern Interconnects, Huang, Gao, Liu, Panda

Thursday Dec 18, 2008

Fresh Bits: InfiniBand Updates for Solaris 10

Fresh InfiniBand bits for Solaris 10 Update 6 have just been announced by the IB Engineering Team:

The Sun InfiniBand Team is pleased to announce the availability of the Solaris InfiniBand Updates 2.1. This comprises updates to the previously available Solaris InfiniBand Updates 2. InfiniBand Updates 2 has been removed from the current download pages. (Previous versions of InfiniBand Updates need to be carefully matched to the OS Update versions that they apply to.)

The primary deliverable of Solaris InfiniBand Updates 2.1 is a set of updates of the Solaris driver supporting HCAs based on Mellanox's 4th generation silicon, ConnectX. These updates include the fixes that have been added to the driver since its original delivery, and functionality in this driver is equivalent to what was delivered as part of OpenSolaris 2008.11. In addition, there continues to be a cxflash utility that allows Solaris users to update firmware on the ConnectX HCAs. This utility is only to be used for ConnectX HCAs.

Other updates include:

  • uDAPL InfiniBand service provider library for Solaris (compatible with Sun HPC ClusterTools MPI)
  • Tavor and Arbel/memfree drivers that are compatible with new interfaces in the uDAPL library
  • Documentation (README and man pages)
  • A renamed flash utility for Tavor, Arbel memfull, Arbel memfree, and Sinai-based HCAs. Instead of "fwflash", this utility is renamed "ihflash" to avoid possible namespace conflicts with a general firmware flashing utility in Solaris

All are compatible with Solaris 10 10/08 (Solaris 10, Update 6), for both SPARC and X86.

You can download the package from the "Sun Downloads" A-Z page by visiting and scrolling down or searching for the link for "Solaris InfiniBand (IB) Updates 2.1" or alternatively use this link.

Please read the README before installing the updates. This contains both installation instructions and other information you will need to know before running this product.

Please note again that this Update package is for use on Solaris 10 10/08 (Solaris 10, Update 6) only. A version of the Hermon driver has also been integrated into Update 7 and will be available with that Update's release.

Congratulations to the Solaris IB Hermon project team and the extended IB team for their efforts in making this product available!

Wednesday Nov 19, 2008

Sun Supercomputing: Red Sky at Night, Sandia's Delight

Yesterday we officially announced that Sun will be supplying Sandia National Laboratories its next generation clustered supercomputer, named Red Sky. Douglas Doerfler from the Scalable Architectures Department at Sandia spoke at the Sun HPC Consortium Meeting here in Austin and gave an overview of the system to assembled customers and Sun employees. As Douglas noted, this was the world premiere Red Sky presentation.

The system is slated to replace Thunderbird and other aging cluster resources at Sandia. It is a Sun Constellation system using the Sun Blade 6000 blade architecture, but with some differences. First, the system will use a new diskless two-node Intel blade to double the density of the overall system. The initial system will deliver 160 TFLOPs peak performance in a partially populated configuration with expansion available to 300 TFLOPs.

Second, the interconnect topology is a 3D torus rather than a fat-tree. The torus will support Sandia's secure red/black switching requirement with a middle "swing" section that can be moved to either the red or black side of the machine as needed with the required air gap.
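For readers less familiar with torus fabrics, here is a minimal Python sketch of what "3D torus" means in practice: every position has six neighbors, and links wrap around at the edges. The 4 x 4 x 4 dimensions in the usage lines are purely hypothetical and say nothing about Red Sky's actual layout.

```python
def torus_neighbors(coord, dims):
    """Return the six neighbors of coord in a 3D torus of size dims."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            # The modulo provides the wraparound link at each edge.
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


def torus_hops(a, b, dims):
    """Minimum hop count between a and b, going either way around each ring."""
    return sum(min((x - y) % d, (y - x) % d) for x, y, d in zip(a, b, dims))


dims = (4, 4, 4)  # hypothetical dimensions, for illustration only
print(torus_neighbors((0, 0, 3), dims))
print(torus_hops((0, 0, 0), (3, 0, 2), dims))
```

Note that a dimension of size 2 yields duplicate neighbors on that axis, since stepping forward and backward reach the same position.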

Primary software components include CentOS, Open MPI, OpenSM, and LASH for deadlock-free routing across the torus. The filesystem will be based on Lustre. oneSIS will be used for diskless cluster management, including booting over InfiniBand.

Monday Jul 14, 2008

Solaris InfiniBand: A Big Day!

Yesterday, the Sun InfiniBand engineering team released Solaris 10 driver support for ConnectX (a.k.a. Hermon), the latest generation of InfiniBand silicon from Mellanox. This is important news both for Solaris HPC customers and for those enterprise customers interested in the best bandwidth and latencies available for applications like Oracle RAC. Congratulations to the team!

In addition to the driver, the update also includes a new flash updating tool for ConnectX, a uDAPL update, and several additional components, all of which are described in the documentation.

The specific ConnectX-related Sun part numbers supported by this release are: X4217A-Z HCA card, X4216A-Z EM, and X5196A-Z, the 24 Port NEM for the SunBlade 6048 family of servers. It also supports third-party cards based on the following Mellanox chips: MT25408, MT25418, and MT25428.

The release, called "Solaris InfiniBand Updates 2" is available for free download here.

Thursday Jun 19, 2008

Inside NanoMagnum, the Sun Datacenter Switch 3x24

Here is a quick look under the covers of the new Sun Datacenter Switch 3x24, the new InfiniBand switch just announced by Sun at ISC 2008 in Dresden. First some photos and then an explanation of how this switch is used as a Sun Constellation System component to build clusters with up to 288 nodes.

First, the photos:

Nano Magnum's three heat sinks sit atop Mellanox 24-port InfiniBand 4x switch chips. The purple object is an air plenum that guides air past the sinks from the rear of the unit.
Looking down on the Nano, you can see the three heat sinks that cover the switch chips and the InfiniBand connectors along the bottom of the photo. The unit has two rows of twelve connectors with the bottom row somewhat visible under the top row in this photo.
The Nano Magnum is in the foreground. The unit sitting on top of Nano's rear deck for display purposes is an InfiniBand NEM. See text for more information.

You might assume NanoMagnum is either a simple 24-port InfiniBand switch or, if you know that each connector actually carries three separate InfiniBand 4X connections, a simple 72-port switch. In fact, it is neither. NanoMagnum is a core switch and none of the three InfiniBand switch chips is connected to the others. Since it isn't intuitive how a box containing three unconnected switch chips can be used to create single, fully-connected clusters, let's look in detail at how this is done. I've created two diagrams that I hope will make the wiring configurations clear.

Before getting into cluster details, I should explain that a NEM, or Network Express Module, is an assembly that plugs into the back of each of the four shelves in a Sun Blade 6048 chassis. In the case of an InfiniBand NEM, it contains the InfiniBand HCA logic needed for each blade as well as two InfiniBand leaf switch elements that are used to tie the shelves into cluster configurations. You can see a photo of a NEM above.

The first diagram (below) illustrates how any blade in a shelf can reach any blade in any other shelf connected to a NanoMagnum switch. There are a few important points to note. First, all three switch chips in the NanoMagnum are connected to every switch port, which means that regardless of which switch chip your signal enters, it can be routed to any other port in the switch. Second, you will notice that only one switch chip in the NEM is being used. The second is used only for creating redundant configurations and the cool thing about that is that from an incremental cost perspective, one need only buy additional cables and additional switches--the leaf switching elements are already included in the configuration.

If the above convinced you that any blade can reach any other blade connected to the same switch, the rest is easy. The diagram below shows the components and connections needed to build a 288-node Sun Constellation System using four NanoMagnums.

Clusters of smaller size can be built in a similar way, as can clusters that are over-subscribed (i.e. not non-blocking.)
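The sizing logic can be sketched in a few lines of Python. This is my reading of the wiring, not an official configuration tool. It assumes twelve blades per shelf, one 24-port leaf switch chip in use per NEM, and one 4X uplink from each leaf to every core switch chip (so one 12X cable from each shelf to each NanoMagnum). The leaf port count in particular is an assumption.

```python
CHIPS_PER_NANOMAGNUM = 3
PORTS_PER_CORE_CHIP = 24
BLADES_PER_SHELF = 12
ASSUMED_LEAF_PORTS = 24  # assumption: port count of the NEM leaf switch chip


def cluster_size(num_switches):
    """Size a cluster in which every leaf has one uplink to every core chip."""
    core_chips = num_switches * CHIPS_PER_NANOMAGNUM
    uplinks_per_leaf = core_chips
    if uplinks_per_leaf + BLADES_PER_SHELF > ASSUMED_LEAF_PORTS:
        raise ValueError("leaf switch does not have enough ports")
    shelves = PORTS_PER_CORE_CHIP  # each core chip port serves one leaf
    return {
        "shelves": shelves,
        "nodes": shelves * BLADES_PER_SHELF,
        "cables_per_shelf": num_switches,  # one 12X cable per NanoMagnum
        "oversubscription": BLADES_PER_SHELF / uplinks_per_leaf,
    }


print(cluster_size(4))  # the non-blocking case discussed above
print(cluster_size(2))  # fewer switches: same node count, over-subscribed
```

Under these assumptions, four switches give each leaf twelve uplinks for its twelve blades, which is what makes the 288-node configuration non-blocking. Dropping to two switches halves the uplinks and yields a 2:1 over-subscribed fabric.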

Wednesday Jun 18, 2008

Sun Announces NanoMagnum at ISC 2008 in Dresden

Last night in Dresden at ISC 2008 Sun announced NanoMagnum, the latest addition to the Sun Constellation System architecture. Nano, more properly called the Sun Datacenter Switch 3x24, is the world's densest DDR InfiniBand core switch, with three 24-port IB switches encased within a 1 rack-unit form factor. Nano complements its big brother Magnum, the 3456-port InfiniBand switch used at TACC and elsewhere, and allows the Constellation architecture to scale down by supporting smaller clusters of C48 (Sun Blade 6048) chassis. Nano also uses the same ultra-dense cabling developed for the rest of the Constellation family components, with three 4X DDR connections carried in one 12X cable for reduced cabling complexity.

Here are some photos I took of the unveiling on the show floor at ISC.

Bjorn Andersson, Director of HPC, and Marc Hamilton, VP System Practices Americas, prepare to unveil NanoMagnum at ISC 2008
Ta da. The Sun Datacenter Switch 3x24, sitting in front of a Sun Blade 6048 chassis and on top of a scale model of TACC Ranger, the world's largest Sun Constellation system
A closeup of the new switch
Bjorn uncorks a magnum of champagne to celebrate
Lisa Robinson Schoeller (Blade Product Line Manager) and Bjorn prepare to spread good cheer
The official toast

Monday Apr 21, 2008

HPC User Forum: Interconnect Panel

Last week I attended the IDC HPC User Forum meeting in Norfolk, Virginia. The meeting brings together HPC users and vendors for two days of presentation and discussion about all things HPC.

The theme of this meeting (26th in the series) was CFD -- Computational Fluid Dynamics. In addition to numerous talks on various aspects of CFD, there was a series of short vendor talks and four panel sessions with both industry and user members. I sat on both the interconnect and operating system panels and prepared a very short slide set for each session. This blog entry covers the interconnect session.

Because the organizers opted against panel member presentations at the last minute, these slides were not actually shown at the conference, though they will be included in the conference proceedings. I use them here to highlight some of the main points I made during the panel discussion.

[interconnect panel slide #1]

I participated on the interconnect panel as a stand-in for David Caplan who was unable to attend the conference due to a late-breaking conflict. Thanks to David for supplying a slide deck from which I extracted a few slides for my own use. Everything else is my fault. :-)

Fellow members of the interconnect panel included Ron Brightwell from Sandia National Lab, Patrick Geoffray from Myricom, Christian Bell from QLogic, Gilad Shainer from Mellanox, Larry Stewart from SiCortex, Richard Walsh from IDC, and (I believe--we were at separate tables and it was hard to see) Nick Nystrom from PSC, and Pete Wyckoff from OSC.

The framing questions suggested by Richard Walsh prior to the meeting were used to guide the panel's discussion. The five areas were:

  • competing technologies -- five year look ahead
  • unifying the interconnect software stack
  • interconnects and parallel programming models
  • interconnect transceiver and media trends
  • performance analysis tools

I had little to say about media and transceivers, though we all seemed to agree that optical was the future direction for interconnects. There was some discussion earlier in the conference about future optical connectors and approaches (Luxtera and Hitachi Cable both gave interesting presentations) and this was not covered further in the panel.

As a whole, the panel did not have much useful to say about performance analysis. It was clear from comments that the users want a toolset that allows them to understand at the application level how well they are using their underlying interconnect fabric for computational and storage transfers. While much low-level data can be extracted from current systems, from the HCAs, and from fabric switches, there seemed to be general agreement that the vendor community has not created any facility that would allow these data to be aggregated and displayed in a manner that would be intuitive and useful to the application programmer or end user.

[interconnect panel slide #2]

Moving on to my first slide, there was little disagreement that both Ethernet and InfiniBand will continue as important HPC interconnects. It was widely agreed that 10 GbE in particular will gain broader acceptance as it becomes less expensive, and several attendees suggested that point begins to occur when vendors integrate 10 GbE on their system motherboards rather than relying on separate plug-in network interface cards. As you will see on a later slide, the fact that Sun is already integrating 10 GbE on board, and in some cases on-chip, in current systems was part of one of my themes during this panel.

InfiniBand (IB) has both a significant latency and bandwidth advantage over 10 GbE and it has rapidly gained favor within the HPC community. The fact that we are seeing penetration of IB into Financial Services is especially interesting because that community (which we do consider "HPC" from a computational perspective) is beginning to adopt IB not as an enabler of scalable distributed computing (e.g. MPI), but rather as an enabler of the higher messaging rates needed for distributed trading applications. In addition, we see an increased interest in InfiniBand for Oracle RAC performance using the RDS protocol.

The HPC community benefits any time its requirements or its underlying technologies begin to see significant adoption in the wider enterprise computing space. As the much larger enterprise market gets on the bandwagon, the ecosystem grows and strengthens as vendors invest more and as more enterprise customers embrace related products.

While I declared myself a fan of iWARP (RDMA over Ethernet) as a way of broadening the appeal of Ethernet for latency-sensitive applications and for reducing the load on host CPUs, there was a bit of controversy in this area among panel members. The issue arose when the QLogic and Myricom representatives expressed disapproval of some comments made by the Mellanox representative related to OpenFabrics. The latter described the OpenFabrics effort as interconnect independent, while the former were more of a mind that iWARP, which had been united with the OpenIB effort to form what is now called OpenFabrics, was not well integrated in that certain API semantics were different between IB and Ethernet. I need to learn more from Ted Kim, our resident iWARP expert.

Speaking of OpenFabrics, a few points. First, Sun is a member of the OpenFabrics Alliance and as offerors of Linux-based HPC solutions we will support the OFED stack. Second, Sun does fully support InfiniBand on Solaris as an essential part of our HPC on Solaris program. We do this by implementing the important OpenFabrics APIs on Solaris and even sharing code for some of the upper-level protocols.

Third, there was some additional disagreement among the same three panel members regarding the OpenFabrics APIs and the semantics of InfiniBand generally. Both the QLogic and Myricom representatives felt strongly that extra cruft had accrued in both standards that is unnecessary for the efficient support of HPC applications. We did not have enough time to delve into the specifics of their disagreement, but I would certainly like to hear more at some point.

[interconnect panel slide #3]
[interconnect panel slide #4]

The above two slides (courtesy of David Caplan) illustrate several points. First, they show that attention to seemingly small details can yield very large benefits at a large scale. In this case, the observation that we could run three separate 4x InfiniBand connections within one single 12x InfiniBand cable allowed us to build the 500 TFLOP Ranger system at TACC in Austin with essentially one very large central InfiniBand switch (in actuality, two). The system used 1/6 the cables required for a conventional installation of this size and many fewer switching elements. And, second, to build successful, effective HPC solutions one must view the problem at a system level. In this case, Sun's Constellation system architecture leverages the cabling (and connector) innovation and combines it with both the ultra-dense switch product as well as an ultra-dense blade chassis with 48 four-socket blades to create a nicely-packaged solution that scales into the petaflop range.

[interconnect panel slide #5]

I took this photo the day we announced our two latest CMT systems, the T5240 and T5140. I included it in the deck for several reasons. First, it allowed me to make the point that Sun has very aggressively embraced 10 GbE, putting it on the motherboard in these systems. Second, it is another example of system-level thinking in that we have carefully impedance-matched these 10 GbE capabilities to our multi-core and multi-threaded processors by supporting a large number of independent DMA channels in the 10 GbE hardware. In other words, as processors continue to get more parallel (more threads, more cores) one must keep everything in balance. Which is why you see so many DMA channels here and why we greatly increased the off-chip memory bandwidth (about 50 GB/s) and coherency bandwidth (about the same) of these UltraSPARC T2plus based systems over that available in commodity systems. And it isn't just about the hardware: our new Crossbow networking stack in OpenSolaris allows application workloads to benefit from the underlying throughput and parallelism available in all this new hardware. At the end of the day, it's about the system.

[interconnect panel slide #6]

This last slide again emphasizes the value of being able to optimize at a system level. In this case, we've been able to integrate dual 10 GbE interfaces on-chip in our UltraSPARC T2-based systems because it made sense to tightly couple the networking resources with computation for best latency and performance. You might be confused if you've noticed I said earlier that our very latest CMT systems have on-board 10 GbE, but not on-chip 10 GbE. This illustrates yet another advantage held by true system vendors -- the ability to shift optimization points at will to add value for customers. In this case, we opted to shift the 10 GbE interfaces off-chip in the UltraSPARC T2plus to make room for the 50 GB/s of coherency logic we added to allow two T2plus sockets to be run together as a single, large 128-thread, 16 FPU system.

Another system vendor advantage is illustrated by our involvement in the DARPA-funded program UNIC, which stands for Ultraperformance Nanophotonic Intrachip Communications. This project seeks to create macro-chip assemblies that tie multiple physical pieces of silicon together using extremely high bandwidth and low latency optical connections. Think of the macro-chip approach as a way of creating very large scale silicon with the economics of small-scale silicon. The project is exploring the physical-level technologies that might support one or two orders of magnitude bandwidth increase well into the multi-terabyte range. From my HPC perspective, it remains to be seen what protocol would run across these links: coherent links lead to macro-chips; non-coherent or perhaps semi-coherent links lead to some very interesting clustering capabilities with some truly revolutionary interconnect characteristics. How we use this technology once it has been developed remains to be seen, but the possibilities are all extremely interesting.

To wrap up this overly long entry, I'll close with the observation that one of the framing questions used for the panel session was stated backwards. In essence, the question asked what programming model changes for HPC would be engendered by evolution in the interconnect space. As I pointed out during the session, the semantics of the interconnect should be dictated by the requirements of the programming models and not the other way 'round. This led to a good, but frustrating, discussion about future programming models for HPC.

It was a good topic because it is an important topic that needs to be discussed. While there was broad agreement that MPI as a programming model will be used for HPC for a very long time, it is also clear that new models should be examined. New models are needed to 1) help new entrants into HPC, as well as others, more easily take advantage of multi-core and multi-threaded processors, 2) create scalable, high-performance applications, and 3) deal with the unavoidable hardware failures in underlying cluster hardware. It is, for example, ironic in the extreme that the HPC community has embraced one of the most brittle programming models available (i.e. MPI) and is attempting to use it to scale applications into the petascale range and beyond. There may be lessons to be learned from the enterprise computing community, which has long embedded reliability in its middleware layers to allow e-commerce sites, etc., to ride through failures with no service interruptions. More recently, it has been doing so on increasingly large-scale computational and storage infrastructures. On the software front, Google's MapReduce, the open-source Hadoop project, Pig, etc. have all embraced resiliency as an important characteristic of their scalable application architectures. While I am not suggesting that these particular approaches be adopted for HPC with its very different algorithmic requirements, I do believe there are lessons to be learned. At some point soon I believe resiliency will become more important to HPC than peak performance. This was discussed in more detail during the operating system panel.

It was also a frustrating topic because as a vendor I would like some indication from the community as to promising directions to pursue. As with all vendors, we have limited resources and would prefer to place our bets in fewer places with increased chances for success. Are the PGAS languages (e.g. UPC, CaF) the right approach? Are there others? Is there only one answer, or are the needs of the commercial HPC community different from, for example, those of the highest-end of the HPC computing spectrum? I was accused at the meeting of saying the vendors weren't going to do anything to improve the situation. Instead, my point was that we prefer to make data-driven decisions where possible, but the community isn't really giving us the data. In the meantime, UPC does have some uptake with the intelligence community and perhaps more broadly. And Cray, IBM, and Sun have all defined new languages for HPC that aim to deliver good, scalable performance with higher productivity than existing languages. Where this will go, I do not know. But go it must...

My next entry will describe the Operating System Panel.


Josh Simons

