Friday Sep 04, 2009

HPC Virtual Conference: No Travel Budget? No Problem!

Sun is holding a virtual HPC conference on September 17th featuring Andy Bechtolsheim as keynote speaker. Andy will be talking about the challenges around creating Exaflop systems by 2020, after which he will participate in a chat session with attendees. In fact, each of the conference speakers (see agenda) will chat with attendees after their presentations.

There will also be two sets of exhibits to "visit" to find information on HPC solutions for specific industries or to get information on specific HPC technologies. Industries covered include MCAE, EDA, Government/Education/Research, Life Sciences, and Digital Media. There will be technology exhibits on storage software and hardware, integrated software stack for HPC, compute and networking hardware, and HPC services.

This is a free event. Register here.

Friday May 29, 2009

Rur Valley Journal: What's Up at Jülich?

What's up at Jülich? The latest 200+ TeraFLOPs of Sun-supplied HPC compute power is now up and running!

The JuRoPa (Jülich Research on PetaFLOP Architectures) system at the Jülich Research Center in Jülich, Germany has just come online this week. A substantial part of the system is built with the Sun Constellation System architecture, which marries highly dense blade systems with an efficient, high-performance QDR InfiniBand fabric in an HPC cluster configuration.

We delivered 23 cabinets filled with a total of 1104 Sun Blade X6275 servers, or 2208 nodes. Each of these nodes is a dual-socket Nehalem-EP system running at 2.93 GHz. The systems are connected with quad data-rate (QDR) 4X InfiniBand using a total of six of our latest 648-port QDR switches. As usual, we use 12X InfiniBand cables to route three 4X connections, thereby greatly reducing the number of cables and connectors, and increasing the reliability of the fabric. For more detail on the Nehalem-EP blades and other components used in this system, see this blog entry.
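As a rough sanity check, the node count and the "200+ TeraFLOPs" headline can be reproduced with a few lines of Python. The 48 blades per cabinet and four cores per socket come from the Constellation and Vayu descriptions elsewhere on this blog. The figure of 4 double-precision flops per core per cycle is an assumption about Nehalem's peak rate, and all names are illustrative.

```python
# Back-of-envelope check of the JuRoPa figures quoted in the post.
CABINETS = 23
BLADES_PER_CABINET = 48   # 4 shelves x 12 blades per Constellation chassis
NODES_PER_BLADE = 2       # each X6275 blade houses two nodes
SOCKETS_PER_NODE = 2      # dual-socket Nehalem-EP
CORES_PER_SOCKET = 4      # quad-core part
CLOCK_GHZ = 2.93
FLOPS_PER_CYCLE = 4       # assumption: peak double-precision flops per core per cycle

blades = CABINETS * BLADES_PER_CABINET
nodes = blades * NODES_PER_BLADE
cores = nodes * SOCKETS_PER_NODE * CORES_PER_SOCKET
peak_tflops = cores * FLOPS_PER_CYCLE * CLOCK_GHZ / 1000.0

print(f"{blades} blades, {nodes} nodes, {cores} cores")
print(f"peak: about {peak_tflops:.0f} TFLOPs")
```

Under those assumptions the blade and node counts match the numbers above, and the peak lands a little over 200 TFLOPs.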

I've annotated one of the official photos below. Marc Hamilton has many more photos on his blog, including some cool "underground shots" at Jülich.

Tuesday Apr 14, 2009

You Say Nehalem, I Say Nehali

Let me say this first: NEHALEM, NEHALEM, NEHALEM. You can call it the Intel Xeon Series 5500 processor if you like, but the HPC community has been whispering about Nehalem and rubbing its collective hands together in anticipation of Nehalem for what seems like years now. So, let's talk about Nehalem.

Actually, let's not. I'm guessing that between the collective might of the Intel PR machine, the traditional press, and the blogosphere you are already well-steeped in the details of this new Intel processor and why it excites the HPC community. Rather than talk about the processor per se, let's talk instead about the more typical HPC scenario: not about a single Nehalem, but rather piles of Nehali and how best to deploy them in an HPC cluster configuration.

Our HPC clustering approach is based on the Sun Constellation architecture, which we've deployed at TACC and other large-scale HPC sites around the world. These existing systems house compute nodes in the Sun Blade 6000 System chassis, which holds four blade shelves, each with twelve blade systems, for a total of 48 blades per chassis. Constellation also includes matching InfiniBand infrastructure, including InfiniBand Network Express Modules (NEMs) and a range of InfiniBand switches that can be used to build petascale compute clusters. You can see several of the 82 TACC Ranger Constellation chassis in this photo, interspersed with (black) inline cooling units.

As part of our continued focus on HPC customer requirements, we've done something interesting with our new Nehalem-based Vayu blade (officially, the Sun Blade X6275 Server Module): each blade houses two separate nodes. Here is a photo of the Vayu blade:

Each of the nodes is a diskless, two-socket Nehalem system with 12 DDR3 DIMM slots per node (up to 96 GB per node) and on-board QDR InfiniBand. It's actually not quite correct to call these nodes diskless because the blade does include two Sun Flash Module slots (one per node), each of which provides up to 24 GB of flash storage through a SATA interface. I am sure our HPC customers will use what amounts to an ultra-fast disk for some interesting applications.

Using Vayu blades, each Sun Constellation chassis can now support a total of 2 nodes/blade × 12 blades/shelf × 4 shelves/chassis = 96 nodes with a peak floating-point performance of about 9 TFLOPs. While a chassis can support up to 96 GB/node × 96 nodes = 9.2 TB of memory, there are some subtleties involved in optimizing and configuring memory for Nehalem systems, so I recommend reading John Nerl's blog entry for a detailed discussion of this topic.
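The chassis arithmetic above can be written out as a small helper. This is only a sketch: the 2.93 GHz clock is borrowed from the Jülich entry rather than stated here, the 4 flops per core per cycle is an assumed peak rate for Nehalem, and the function and parameter names are made up for illustration.

```python
def chassis_summary(clock_ghz, nodes_per_blade=2, blades_per_shelf=12,
                    shelves_per_chassis=4, sockets_per_node=2,
                    cores_per_socket=4, flops_per_cycle=4,
                    max_gb_per_node=96):
    """Return node count, peak TFLOPs and maximum memory for one chassis.

    flops_per_cycle is an assumption (peak double-precision rate per core).
    """
    nodes = nodes_per_blade * blades_per_shelf * shelves_per_chassis
    cores = nodes * sockets_per_node * cores_per_socket
    return {
        "nodes": nodes,
        "peak_tflops": cores * flops_per_cycle * clock_ghz / 1000.0,
        "max_memory_gb": nodes * max_gb_per_node,
    }


summary = chassis_summary(clock_ghz=2.93)  # clock speed is an assumption here
print(summary)
```

With these inputs the helper gives 96 nodes, roughly 9 TFLOPs peak, and 9216 GB (about 9.2 TB) of memory. Changing `clock_ghz` shows how sensitive the peak figure is to the processor bin chosen.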

For a quick visual tour of Vayu, see the annotated photo below. The major components are: A = Nehalem 4-core processor; B = Memory DIMMs; C = Mellanox ConnectX QDR InfiniBand ASIC; D = Tylersburg I/O chipset; E = Baseboard Management Controller (BMC) / Service Processor (one per node); F = Sun Flash Modules (SATA, 24 GB per node). The connector at the top right supports two PCIe2 interfaces, two InfiniBand interfaces, and two GbE interfaces. The GbE logic is hiding under the BMC daughter board.

To accommodate this high-density blade approach we've developed a new QDR InfiniBand NEM (officially called the Sun Blade 6048 QDR IB Switched NEM), which is shown below. This Network Express Module plugs into a blade shelf's midplane and forms the first level of InfiniBand fabric in a cluster configuration. Specifically, the two on-board 36-port Mellanox QDR switch chips act as leaf switches from which larger configurations can be built. Of the 72 total switch ports available, 24 are used for the Vayu blades and 9 from each switch chip are used to interconnect the two switches, leaving a total of 72 - 24 - 2 × 9 = 30 QDR links available for off-shelf connections. These links leave the NEM through 10 physical connectors, each of which carries three QDR 4X links over a single cable. As we discussed when the first version of Constellation was released, aggregating 4X links into 12X cables results in significant customer benefits related to reliability, density, and complexity. In any case, these cables can be connected to Constellation switches to form larger, tree-based fabrics. Or they can be used in switchless configurations to build torus-based topologies by using the ten cables to carry X, Y, and Z traffic between shelves and across chassis, as mentioned, for example, here. The NEM also provides GbE connectivity for each of the 24 nodes in the blade shelf.
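The port budget for the NEM can be tallied in a few lines. The numbers are the ones quoted above; the function name and its parameters are hypothetical and exist only to make the bookkeeping explicit.

```python
def nem_port_budget(chips=2, ports_per_chip=36, nodes_per_shelf=24,
                    interchip_ports_per_chip=9, links_per_cable=3):
    """Tally how the NEM's switch ports are spent and what is left for uplinks."""
    total_ports = chips * ports_per_chip
    # Each chip gives up the same number of ports to reach its partner chip.
    interchip_ports = chips * interchip_ports_per_chip
    uplinks = total_ports - nodes_per_shelf - interchip_ports
    # Three 4X links share one 12X cable.
    cables, leftover_links = divmod(uplinks, links_per_cable)
    return {
        "total_ports": total_ports,
        "uplinks": uplinks,
        "cables": cables,
        "leftover_links": leftover_links,
    }


budget = nem_port_budget()
print(budget)
```

The defaults yield 30 off-shelf links packed into exactly 10 cables with nothing left over, which is why the NEM has ten physical connectors.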

Looking back at the TACC photo, we can now double the compute density shown there with our newest Constellation-based systems using the new Vayu blade. Oh, and by the way, we can also remove those inline cooling units and pack those chassis side-by-side for another significant increase in compute density. I'll leave how we accomplish that last bit for a future blog entry.

Friday Apr 03, 2009

HPC in Second Life (and Second Life in HPC)

We held an HPC panel session yesterday in Second Life for Sun employees interested in learning more about HPC. Our speakers were Cheryl Martin, Director of HPC Marketing; Peter Bojanic, Director for Lustre; Mike Vildibill, Director of Sun's Strategic Engagement Team (SET); and myself. We covered several aspects of HPC: what it is, why it is important, and how Sun views it from a business perspective. We also talked about some of the hardware and software technologies and products that are key enablers for HPC: Constellation, Lustre, MPI, etc.

As we were all in-world at the time, I thought it would be interesting to ponder whether Second Life itself could be described as "HPC" and whether we were in fact holding the HPC meeting within an HPC application. Having viewed this excellent SL Architecture talk given by Ian (Wilkes) Linden, VP of Systems Engineering at Linden Lab, I conclude that SL is definitely an HPC application. Consider the following information taken from Ian's presentation.

As you can see, the geography of SL has been exploding in size over the last 5-6 years. As of Dec 2008 that geography is simulated using more than 15K instances of the SL simulator process that, in addition to computing the physics of SL, also run an average of 30 million simultaneous server-side scripts to create additional aspects of the SL user experience. And look at the size of their dataset: 100TB is very respectable from an HPC perspective. And a billion files! Many HPC sites are worrying about what will happen when they get to that level of scale while Linden Lab is already dealing with it. I was surprised they aren't using Lustre, since I assume their storage needs are exploding as well. But I digress.

The SL simulator described above would be familiar to any HPC programmer. It's a big C++ code. The problem space (the geography of SL) has been decomposed into 256m X 256m chunks that are each assigned to one instance of the simulator. Each simulator process runs on its own CPU core and "adjacent" simulator instances exchange edge data to ensure consistency across sub-domain boundaries. And it's a high-level physics simulation. Smells like HPC to me.
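For readers who haven't seen this kind of domain decomposition, here is a minimal sketch of the bookkeeping involved: mapping a world position to the 256 m chunk that owns it, and listing the adjacent chunks a simulator would exchange edge data with. This is not Linden Lab's code. The function names, the rectangular grid, and the choice of edge-sharing (rather than diagonal) neighbors are all assumptions made for illustration.

```python
REGION_SIZE_M = 256  # edge length of one chunk, from the talk


def region_of(x_m, y_m):
    """Return the (column, row) index of the chunk containing a world position."""
    return (int(x_m // REGION_SIZE_M), int(y_m // REGION_SIZE_M))


def neighbors(region, grid_w, grid_h):
    """Return the edge-sharing neighbors of a chunk inside a grid_w x grid_h grid.

    Assumption: only chunks sharing an edge exchange boundary data.
    """
    rx, ry = region
    candidates = [(rx - 1, ry), (rx + 1, ry), (rx, ry - 1), (rx, ry + 1)]
    return [(x, y) for x, y in candidates if 0 <= x < grid_w and 0 <= y < grid_h]


owner = region_of(300.0, 10.0)
print(owner, neighbors(owner, grid_w=4, grid_h=4))
```

In a real code each simulator instance would send its boundary strip to the processes owning the neighboring chunks every time step, which is the same halo-exchange pattern used in many grid-based HPC applications.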

Wednesday Nov 19, 2008

Sun Supercomputing: Red Sky at Night, Sandia's Delight

Yesterday we officially announced that Sun will be supplying Sandia National Laboratories its next generation clustered supercomputer, named Red Sky. Douglas Doerfler from the Scalable Architectures Department at Sandia spoke at the Sun HPC Consortium Meeting here in Austin and gave an overview of the system to assembled customers and Sun employees. As Douglas noted, this was the world premiere Red Sky presentation.

The system is slated to replace Thunderbird and other aging cluster resources at Sandia. It is a Sun Constellation system using the Sun Blade 6000 blade architecture, but with some differences. First, the system will use a new diskless two-node Intel blade to double the density of the overall system. The initial system will deliver 160 TFLOPs peak performance in a partially populated configuration with expansion available to 300 TFLOPs.

Second, the interconnect topology is a 3D torus rather than a fat-tree. The torus will support Sandia's secure red/black switching requirement with a middle "swing" section that can be moved to either the red or black side of the machine as needed with the required air gap.

Primary software components include CentOS, Open MPI, OpenSM, and Lash for deadlock-free routing across the torus. The filesystem will be based on Lustre. oneSIS will be used for diskless cluster management, including booting over InfiniBand.

Thursday Jun 19, 2008

Inside NanoMagnum, the Sun Datacenter Switch 3x24

Here is a quick look under the covers of the new Sun Datacenter Switch 3x24, the new InfiniBand switch just announced by Sun at ISC 2008 in Dresden. First some photos and then an explanation of how this switch is used as a Sun Constellation System component to build clusters with up to 288 nodes.

First, the photos:

NanoMagnum's three heat sinks sit atop Mellanox 24-port InfiniBand 4X switch chips. The purple object is an air plenum that guides air past the sinks from the rear of the unit.
Looking down on the Nano, you can see the three heat sinks that cover the switch chips and the InfiniBand connectors along the bottom of the photo. The unit has two rows of twelve connectors with the bottom row somewhat visible under the top row in this photo.
The NanoMagnum is in the foreground. The unit sitting on top of Nano's rear deck for display purposes is an InfiniBand NEM. See text for more information.

You might assume NanoMagnum is either a simple 24-port InfiniBand switch or, if you know that each connector actually carries three separate InfiniBand 4X connections, a simple 72-port switch. In fact, it is neither. NanoMagnum is a core switch and none of the three InfiniBand switch chips is connected to the others. Since it isn't intuitive how a box containing three unconnected switch chips can be used to create single, fully-connected clusters, let's look in detail at how this is done. I've created two diagrams that I hope will make the wiring configurations clear.

Before getting into cluster details, I should explain that a NEM, or Network Express Module, is an assembly that plugs into the back of each of the four shelves in a Sun Blade 6048 chassis. In the case of an InfiniBand NEM, it contains the InfiniBand HCA logic needed for each blade as well as two InfiniBand leaf switch elements that are used to tie the shelves into cluster configurations. You can see a photo of a NEM above.

The first diagram (below) illustrates how any blade in a shelf can reach any blade in any other shelf connected to a NanoMagnum switch. There are a few important points to note. First, all three switch chips in the NanoMagnum are connected to every switch port, which means that regardless of which switch chip your signal enters, it can be routed to any other port in the switch. Second, you will notice that only one switch chip in the NEM is being used. The second is used only for creating redundant configurations. The cool thing about that is that, from an incremental cost perspective, one need only buy additional cables and additional switches; the leaf switching elements are already included in the configuration.

If the above convinced you that any blade can reach any other blade connected to the same switch, the rest is easy. The diagram below shows the components and connections needed to build a 288-node Sun Constellation System using four NanoMagnums.

Clusters of smaller size can be built in a similar way, as can clusters that are over-subscribed (i.e., not non-blocking).
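Since the wiring diagrams may not be visible here, the following sketch shows one way the numbers can work out. It assumes each shelf's NEM runs one 12X cable to every NanoMagnum and that each shelf holds twelve single-node blades, as in the earlier Constellation systems. That wiring is a reading of the description above, not a confirmed configuration, and the names are illustrative.

```python
CHIPS_PER_SWITCH = 3
PORTS_PER_CHIP = 24
CONNECTORS_PER_SWITCH = 24   # two rows of twelve 12X connectors
LINKS_PER_CABLE = 3          # one 4X link to each switch chip


def size_cluster(switches, nodes_per_shelf=12):
    """Estimate cluster size when every shelf has one cable to every switch."""
    shelves = CONNECTORS_PER_SWITCH          # one connector per shelf on each switch
    uplinks_per_shelf = switches * LINKS_PER_CABLE
    core_ports = switches * CHIPS_PER_SWITCH * PORTS_PER_CHIP
    # Every 4X uplink must land on exactly one core switch port.
    assert shelves * uplinks_per_shelf == core_ports
    return {
        "nodes": shelves * nodes_per_shelf,
        "uplinks_per_shelf": uplinks_per_shelf,
        "oversubscription": nodes_per_shelf / uplinks_per_shelf,
    }


full = size_cluster(switches=4)   # assumed non-blocking case
half = size_cluster(switches=2)   # assumed over-subscribed case
print(full)
print(half)
```

Under these assumptions four switches give 288 nodes with one uplink per node, while halving the switch count keeps the node count but leaves two nodes sharing each uplink.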

Wednesday Jun 18, 2008

Sun Announces Hercules at ISC 2008 in Dresden

Last night in Dresden at the International Supercomputing Conference (ISC 2008), Sun unveiled Hercules, our newest Sun Constellation System blade module. Officially named the Sun Blade X6450 Server Module, Hercules is a four-socket, quad-core blade with Xeon 7000 series processors (Tigerton) that fits into the Sun Blade 6048 Chassis, the computational heart of Sun's Constellation System architecture for HPC. According to Lisa Robinson Schoeller, Blade Product Line Manager, the most notable features of Hercules are its 50% increase in DIMM slots per socket (six instead of the usual four), the achievable compute density at the chassis level (71% increase over IBM and 50% increase over HP), and the fact that Hercules is diskless (though it does also support a 16 GB on-board CF card that could be used for local booting). A single Constellation chassis full of these puppies delivers over 7 TeraFLOPs of peak floating-point performance.

Lisa Schoeller and Bjorn Andersson, Director for HPC, showing off Hercules, Sun's latest Intel-based Constellation blade system

Sun Announces NanoMagnum at ISC 2008 in Dresden

Last night in Dresden at ISC 2008 Sun announced NanoMagnum, the latest addition to the Sun Constellation System architecture. Nano, more properly called the Sun Datacenter Switch 3x24, is the world's densest DDR InfiniBand core switch, with three 24-port IB switches encased within a 1 rack-unit form factor. Nano complements its big brother Magnum, the 3456-port InfiniBand switch used at TACC and elsewhere, and allows the Constellation architecture to scale down by supporting smaller clusters of C48 (Sun Blade 6048) chassis. Nano also uses the same ultra-dense cabling developed for the rest of the Constellation family components, with three 4X DDR connections carried in one 12X cable for reduced cabling complexity.

Here are some photos I took of the unveiling on the show floor at ISC.

Bjorn Andersson, Director of HPC, and Marc Hamilton, VP System Practices Americas, prepare to unveil NanoMagnum at ISC 2008
Ta da. The Sun Datacenter Switch 3x24, sitting in front of a Sun Blade 6048 chassis and on top of a scale model of TACC Ranger, the world's largest Sun Constellation system
A closeup of the new switch
Bjorn uncorks a magnum of champagne to celebrate
Lisa Robinson Schoeller (Blade Product Line Manager) and Bjorn prepare to spread good cheer
The official toast

Josh Simons

