Friday Feb 12, 2010

Mirror, Mirror

Today is my last day at Sun, so I write to say goodbye to all the friends and colleagues whom I will dearly miss.

I believe Oracle plans to maintain all previous blog content, but just in case I have made a copy of this entire blog, which is called, fittingly, the Mirror of the Navel of Narcissus. Yes, even more references to self-indulgence. :-)

Follow the link to find out what I'm doing next!

Tuesday Apr 14, 2009

You Say Nehalem, I Say Nehali


Let me say this first: NEHALEM, NEHALEM, NEHALEM. You can call it the Intel Xeon Series 5500 processor if you like, but the HPC community has been whispering about Nehalem and rubbing its collective hands together in anticipation of Nehalem for what seems like years now. So, let's talk about Nehalem.

Actually, let's not. I'm guessing that between the collective might of the Intel PR machine, the traditional press, and the blogosphere you are already well-steeped in the details of this new Intel processor and why it excites the HPC community. Rather than talk about the processor per se, let's talk instead about the more typical HPC scenario: not about a single Nehalem, but rather about piles of Nehali and how best to deploy them in an HPC cluster configuration.

Our HPC clustering approach is based on the Sun Constellation architecture, which we've deployed at TACC and other large-scale HPC sites around the world. These existing systems house compute nodes in the Sun Blade 6048 chassis, which holds four blade shelves, each with twelve blades, for a total of 48 blades per chassis. Constellation also includes matching InfiniBand infrastructure, including InfiniBand Network Express Modules (NEMs) and a range of InfiniBand switches that can be used to build petascale compute clusters. You can see several of the 82 TACC Ranger Constellation chassis in this photo, interspersed with (black) inline cooling units.

As part of our continued focus on HPC customer requirements, we've done something interesting with our new Nehalem-based Vayu blade (officially, the Sun Blade X6275 Server Module): each blade houses two separate nodes. Here is a photo of the Vayu blade:

Each of the nodes is a diskless, two-socket Nehalem system with 12 DDR3 DIMM slots per node (up to 96 GB per node) and on-board QDR InfiniBand. Actually, it's not quite correct to call these nodes diskless, because each blade includes two Sun Flash Module slots (one per node) that each provide up to 24 GB of flash storage through a SATA interface. I am sure our HPC customers will use what amounts to an ultra-fast disk for some interesting applications.

Using Vayu blades, each Sun Constellation chassis can now support a total of 2 nodes/blade × 12 blades/shelf × 4 shelves/chassis = 96 nodes with a peak floating-point performance of about 9 TFLOPS. While a chassis can support up to 96 GB/node × 96 nodes = 9.2 TB of memory, there are some subtleties involved in optimizing and configuring memory for Nehalem systems, so I recommend reading John Nerl's blog entry for a detailed discussion of this topic.
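
If you like to check such numbers, here is a back-of-the-envelope sketch in C. The node and shelf counts come straight from the text above; the clock rate and FLOPs-per-cycle figures are my own assumptions (a 2.93 GHz quad-core Nehalem issuing 4 double-precision FLOPs per core per cycle), so adjust them for whatever SKU actually ships in your blades.

```c
/*
 * Back-of-the-envelope chassis capacity for a Vayu-based Constellation chassis.
 * Clock speed and FLOPs/cycle are assumptions, not vendor specs.
 */
#include <stdio.h>

int main(void)
{
    const int nodes_per_blade     = 2;
    const int blades_per_shelf    = 12;
    const int shelves_per_chassis = 4;
    const double gb_per_node      = 96.0;

    /* Assumed per-node numbers -- adjust for the actual processor installed. */
    const int    sockets_per_node = 2;
    const int    cores_per_socket = 4;
    const double ghz              = 2.93;  /* assumed clock rate */
    const double flops_per_cycle  = 4.0;   /* assumed DP FLOPs per core per cycle */

    int nodes = nodes_per_blade * blades_per_shelf * shelves_per_chassis;
    double tflops = nodes * sockets_per_node * cores_per_socket
                    * flops_per_cycle * ghz / 1000.0;
    double memory_tb = nodes * gb_per_node / 1000.0;

    printf("nodes per chassis : %d\n", nodes);       /* 96   */
    printf("peak TFLOPS       : %.1f\n", tflops);    /* ~9.0 */
    printf("memory (TB)       : %.1f\n", memory_tb); /* ~9.2 */
    return 0;
}
```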

For a quick visual tour of Vayu, see the annotated photo below. The major components are: A = quad-core Nehalem processor; B = memory DIMMs; C = Mellanox ConnectX QDR InfiniBand ASIC; D = Tylersburg I/O chipset; E = Base Management Controller (BMC) / Service Processor (one per node); F = Sun Flash Modules (SATA, 24 GB per node). The connector at the top right supports two PCIe2 interfaces, two InfiniBand interfaces, and two GbE interfaces. The GbE logic is hiding under the BMC daughter board.

To accommodate this high-density blade approach, we've developed a new QDR InfiniBand NEM (officially called the Sun Blade 6048 QDR IB Switched NEM), which is shown below. This Network Express Module plugs into a blade shelf's midplane and forms the first level of InfiniBand fabric in a cluster configuration. Specifically, the two on-board 36-port Mellanox QDR switch chips act as leaf switches from which larger configurations can be built. Of the 72 total switch ports available, 24 are used for the Vayu blades and 9 from each switch chip are used to interconnect the two switches, leaving a total of 72 - 24 - 2×9 = 30 QDR links available for off-shelf connections. These links leave the NEM through 10 physical connectors, each of which carries three QDR 4X links over a single cable. As we discussed when the first version of Constellation was released, aggregating 4X links into 12X cables results in significant customer benefits related to reliability, density, and complexity.

In any case, these cables can be connected to Constellation switches to form larger, tree-based fabrics. Or they can be used in switchless configurations to build torus-based topologies, using the ten cables to carry X, Y, and Z traffic between shelves and across chassis, as mentioned, for example, here. The NEM also provides GbE connectivity for each of the 24 nodes in the blade shelf.
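
For those who like to check the arithmetic, here is a quick port-accounting sketch in C that uses only the numbers quoted above (two 36-port switch chips, 24 blade links, nine inter-switch links per chip, and three 4X links per 12X cable).

```c
/* Port accounting for the shelf-level QDR NEM described above;
 * the constants come straight from the text. */
#include <stdio.h>

int main(void)
{
    const int switch_chips       = 2;
    const int ports_per_chip     = 36;
    const int blade_node_ports   = 24;  /* one QDR link per node on the shelf  */
    const int inter_switch_links = 9;   /* per chip, between the two chips     */
    const int links_per_cable    = 3;   /* three 4X links per 12X connector    */

    int total_ports = switch_chips * ports_per_chip;          /* 72 */
    int off_shelf   = total_ports - blade_node_ports
                      - switch_chips * inter_switch_links;    /* 30 */
    int cables      = off_shelf / links_per_cable;            /* 10 */

    printf("total switch ports : %d\n", total_ports);
    printf("off-shelf QDR links: %d\n", off_shelf);
    printf("12X cables         : %d\n", cables);
    return 0;
}
```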

Looking back at the TACC photo, we can now double the compute density shown there with our newest Constellation-based systems using the new Vayu blade. Oh, and by the way, we can also remove those inline cooling units and pack those chassis side-by-side for another significant increase in compute density. I'll leave how we accomplish that last bit for a future blog entry.

Monday Oct 13, 2008

The Death of Clock Speed


Sun just introduced yet another chip multi-threading (CMT) SPARC system, the Sun SPARC Enterprise T5440. To me, that's a Batoka (sorry, branding police) because I can't keep the model names straight. In any case, this time we've put 256 hardware threads and up to half a terabyte of memory into a four-RU server. That works out to a thread count of about 36 threads per vertical inch, which doesn't compete with fine Egyptian cotton, but it can beat the crap out of servers with 3X faster processor clocks. If you don't understand why, then you should take the time to grok this fact: clock speed is dead as a good metric for assessing the value of a system. For years it has been a shorthand proxy for performance, but with multi-core and multi-threaded processors becoming ubiquitous, that has all changed.
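
In case you're wondering where those numbers come from, here's the back-of-the-envelope version in C: four UltraSPARC T2 Plus sockets, each with 8 cores running 8 threads per core, packed into a 4 RU enclosure at the standard 1.75 inches per rack unit.

```c
/* Where the 256 threads and the "36 threads per vertical inch" come from. */
#include <stdio.h>

int main(void)
{
    const int sockets          = 4;    /* the T5440 is a four-socket box */
    const int cores_per_socket = 8;    /* UltraSPARC T2 Plus             */
    const int threads_per_core = 8;
    const double rack_units    = 4.0;
    const double inches_per_ru = 1.75;

    int threads    = sockets * cores_per_socket * threads_per_core;  /* 256  */
    double density = threads / (rack_units * inches_per_ru);         /* ~36.6 */

    printf("hardware threads         : %d\n", threads);
    printf("threads per vertical inch: %.1f\n", density);
    return 0;
}
```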

Old-brain logic would say that for a particular OpenMP problem size, a faster clock will beat a slower one. But it isn't about clock anymore; it is about parallelism and latency hiding and efficient workload processing. Which is why the T5440 can perform anywhere from 1.5X to almost 2X better on a popular OpenMP benchmark than systems with almost twice the clock rate of the T5440. Those 256 threads and 32 separate floating-point units are a huge advantage in parallel benchmarks and in real-world environments in which heavy workloads are the norm, especially those that can benefit from the latency hiding offered by a CMT processor like the UltraSPARC T2 Plus. Check out BM Seer over the next few days for more specific, published benchmark results for the T5440.
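
To make the latency-hiding point a bit more concrete, here is a minimal OpenMP sketch of the kind of memory-bound kernel where thread count matters far more than clock rate. To be clear, this is not the benchmark referred to above, just an illustration: every additional hardware thread gives the processor another outstanding memory request to work on while its siblings stall on DRAM.

```c
/* A minimal OpenMP sketch of a memory-bound reduction -- an illustration,
 * not the benchmark mentioned in the text. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1L << 24)   /* 16M elements per array */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double sum = 0.0;

    if (a == NULL || b == NULL)
        return 1;

    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    double t0 = omp_get_wtime();

    /* Memory-latency-bound dot product: each extra hardware thread gives the
     * memory system another outstanding request to overlap with DRAM latency. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += a[i] * b[i];

    double t1 = omp_get_wtime();

    printf("threads=%d  sum=%.1f  time=%.3fs\n",
           omp_get_max_threads(), sum, t1 - t0);

    free(a);
    free(b);
    return 0;
}
```

Build it with your compiler's OpenMP flag (for example, cc -xopenmp or gcc -fopenmp) and compare runs with different OMP_NUM_THREADS settings.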

Yes, sure. If you have a single-threaded application and don't need to run a lot of instances, then the higher clock rate will help you significantly. But if that is your situation, then you have a big problem, since you will no longer see clock rates increase as they have in the past. You are surely starting to see this now, with commodity CPUs' slowly increasing clock rates and their embrace of multicore, but you can look to our CMT systems to understand the logical progression of this trend. It's all going parallel, baby, and it is past time to start thinking about how you are going to deal with that new reality. When you do, you'll see the value of this highly threaded approach for handling real-world workloads. You can see my earlier post on CMT and HPC for more information, including a pointer to an interesting customer analysis of the value of CMT for HPC workloads.

A pile of engineers have been blogging about both the hardware and software aspects of the new system. Check out Allan Packer's stargate blog entry for jump coordinates into the Batoka T5440 blogosphere.


About

Josh Simons
