Let me say this first: NEHALEM, NEHALEM, NEHALEM. You can call it the
Intel Xeon Series 5500
processor if you like, but the HPC community has been whispering about Nehalem and rubbing
its collective hands together in anticipation of Nehalem for what seems like years now. So,
let's talk about Nehalem.
Actually, let's not. I'm guessing that between the collective might of the Intel PR machine, the
traditional press, and the blogosphere you are already well-steeped in the details of this new
Intel processor and why it excites the HPC community. Rather than talk about the processor
per se, let's talk instead about the more typical HPC scenario: not about a single Nehalem,
but rather piles of Nehali and how best to deploy them in an HPC cluster configuration.
Our HPC clustering approach is based on the Sun Constellation architecture, which we've
deployed at TACC and other large-scale HPC sites around the world. These existing systems house compute nodes
in the Sun Blade 6000 System chassis which holds four blade shelves, each with
twelve blade systems for a total of 48 blades per chassis. Constellation also includes
matching InfiniBand infrastructure, including InfiniBand Network Express Modules (NEMs)
and a range of InfiniBand switches that can be used to build petascale compute clusters. You can
see several of the 82 TACC Ranger Constellation chassis in this photo, interspersed with (black) inline cooling units.
As part of our continued focus on HPC customer requirements, we've done something interesting with our
new Nehalem-based Vayu blade (officially, the Sun Blade X6275 Server Module): each blade houses two
separate nodes. Here is a photo of the Vayu blade:
Each of the nodes is a diskless, two-socket Nehalem system
with 12 DDR3 DIMM slots per node (up to 96 GB per node) and on-board
QDR InfiniBand. It's
actually not quite correct to call these nodes diskless because the blade includes
two Sun Flash Module slots (one per node) that each provide up to 24 GB of
flash storage through a SATA interface. I am sure our HPC customers will
use what amounts to an ultra-fast disk for some interesting applications.
Using Vayu blades, each Sun Constellation chassis can now support a total of 2 nodes/blade \* 12 blades/shelf \* 4 shelves/chassis = 96 nodes with a peak floating-point performance of about 9 TFLOPS. While a chassis can support up to 96 GB/node \* 96 nodes = 9.2 TB of memory, there are some subtleties involved in optimizing and configuring memory for Nehalem systems, so I recommend
reading John Nerl's blog
entry for a detailed discussion of this topic.
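
The chassis arithmetic above can be sketched in a few lines of Python. The node counts, socket and core counts, and the 96 GB per-node maximum come from this post; the 2.93 GHz clock rate and the four double-precision flops per core per cycle are assumptions chosen for illustration, since the post quotes only the roughly 9 TFLOPS result.

```python
# Back-of-the-envelope capacity of one Constellation chassis full of Vayu blades.
# Counts and the 96 GB figure come from the post; the clock rate and the
# flops-per-cycle value are ASSUMPTIONS used only to illustrate the estimate.

NODES_PER_BLADE = 2
BLADES_PER_SHELF = 12
SHELVES_PER_CHASSIS = 4
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 4
MAX_MEMORY_GB_PER_NODE = 96

ASSUMED_CLOCK_GHZ = 2.93      # assumption: not stated in the post
ASSUMED_FLOPS_PER_CYCLE = 4   # assumption: double-precision flops per core per cycle


def nodes_per_chassis() -> int:
    """Number of compute nodes in a fully populated chassis."""
    return NODES_PER_BLADE * BLADES_PER_SHELF * SHELVES_PER_CHASSIS


def max_memory_tb_per_chassis() -> float:
    """Maximum memory per chassis in decimal terabytes (1 TB = 1000 GB)."""
    return nodes_per_chassis() * MAX_MEMORY_GB_PER_NODE / 1000


def peak_tflops_per_chassis(
    clock_ghz: float = ASSUMED_CLOCK_GHZ,
    flops_per_cycle: int = ASSUMED_FLOPS_PER_CYCLE,
) -> float:
    """Theoretical peak for the chassis under the assumed clock and flops/cycle."""
    cores = nodes_per_chassis() * SOCKETS_PER_NODE * CORES_PER_SOCKET
    return cores * flops_per_cycle * clock_ghz / 1000  # GFLOPS -> TFLOPS


if __name__ == "__main__":
    print(f"nodes per chassis:  {nodes_per_chassis()}")
    print(f"max memory (TB):    {max_memory_tb_per_chassis():.1f}")
    print(f"peak TFLOPS (est.): {peak_tflops_per_chassis():.1f}")
```

Changing the two assumed constants shows how sensitive the peak figure is to processor speed; the node and memory totals depend only on the counts given in the post.
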
For a quick visual tour of Vayu, see the annotated photo below. The major components are: A = Nehalem 4-core processor; B = Memory DIMMs; C = Mellanox
ConnectX QDR InfiniBand ASIC; D = Tylersburg I/O chipset; E = Baseboard Management Controller (BMC) / Service Processor (one per
node); F = Sun Flash Modules (SATA, 24 GB per node). The connector at the top right supports two
PCIe2 interfaces, two InfiniBand interfaces, and two GbE interfaces. The GbE logic is hiding under the
BMC daughter board.
To accommodate this high-density blade approach we've developed a new QDR InfiniBand NEM
(officially called the Sun Blade 6048 QDR IB Switched NEM), which is shown
below. This Network Express Module plugs into a blade shelf's midplane and forms
the first level of InfiniBand fabric in a cluster configuration. Specifically, the two on-board
36-port Mellanox QDR switch chips act as leaf switches from which larger configurations can be built.
Of the 72 total switch ports available, 24 are used for the Vayu blades and 9 from each switch chip are used
to interconnect the two switches, leaving a total of 72 - 24 - 2\*9 = 30 QDR links available for off-shelf
connections. These links leave the NEM through 10 physical connectors, each of which carries three QDR 4X
links over a single cable. As we discussed when the first version of Constellation was released, aggregating
4X links into 12X cables results in significant customer benefits related to reliability, density, and complexity. In
any case, these cables can be connected to Constellation switches to form larger, tree-based
fabrics. Or they can be used in switchless configurations to build torus-based topologies by using the ten
cables to carry X, Y, and Z traffic between shelves and across chassis, as mentioned, for example,
here. The NEM also provides
GbE connectivity for each of the 24 nodes in the blade shelf.
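
The NEM port budget described above can be checked with a short Python sketch. All port counts come from this post; the per-chip breakdown assumes the 24 node ports are split evenly across the two switch chips, which the post does not state explicitly.

```python
# Port budget for one QDR InfiniBand NEM, using the counts given in the post.

PORTS_PER_SWITCH_CHIP = 36
SWITCH_CHIPS_PER_NEM = 2
NODES_PER_SHELF = 24              # 12 Vayu blades x 2 nodes per blade
INTERSWITCH_PORTS_PER_CHIP = 9
LINKS_PER_CABLE = 3               # three 4X links aggregated into one 12X cable


def total_ports() -> int:
    """All switch ports on the NEM across both chips."""
    return PORTS_PER_SWITCH_CHIP * SWITCH_CHIPS_PER_NEM


def offshelf_links() -> int:
    """QDR links left over after node and inter-switch connections."""
    interswitch = INTERSWITCH_PORTS_PER_CHIP * SWITCH_CHIPS_PER_NEM
    return total_ports() - NODES_PER_SHELF - interswitch


def offshelf_cables() -> int:
    """Physical 12X connectors needed to carry the off-shelf links."""
    links = offshelf_links()
    if links % LINKS_PER_CABLE != 0:
        raise ValueError("off-shelf links do not fill a whole number of cables")
    return links // LINKS_PER_CABLE


def offshelf_links_per_chip() -> int:
    """Per-chip uplinks, ASSUMING node ports are split evenly across the chips."""
    nodes_per_chip = NODES_PER_SHELF // SWITCH_CHIPS_PER_NEM
    return PORTS_PER_SWITCH_CHIP - nodes_per_chip - INTERSWITCH_PORTS_PER_CHIP


if __name__ == "__main__":
    print(f"total switch ports:  {total_ports()}")
    print(f"off-shelf QDR links: {offshelf_links()}")
    print(f"12X cables:          {offshelf_cables()}")
    print(f"uplinks per chip:    {offshelf_links_per_chip()}")
```

The result matches the ten physical connectors mentioned in the text: 30 off-shelf links at three links per cable.
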
Looking back at the TACC photo, we can now double the compute density shown
there with our newest
Constellation-based systems using the new Vayu blade. Oh, and by the
way, we can also remove those inline cooling units and pack those
chassis side-by-side for another significant increase in compute
density. I'll leave how we accomplish that last bit for a future blog