Friday Sep 04, 2009

HPC Virtual Conference: No Travel Budget? No Problem!

Sun is holding a virtual HPC conference on September 17th featuring Andy Bechtolsheim as keynote speaker. Andy will be talking about the challenges around creating Exaflop systems by 2020, after which he will participate in a chat session with attendees. In fact, each of the conference speakers (see agenda) will chat with attendees after their presentations.

There will also be two sets of exhibits to "visit" to find information on HPC solutions for specific industries or to get information on specific HPC technologies. Industries covered include MCAE, EDA, Government/Education/Research, Life Sciences, and Digital Media. There will be technology exhibits on storage software and hardware, integrated software stack for HPC, compute and networking hardware, and HPC services.

This is a free event. Register here.

Tuesday Apr 14, 2009

You Say Nehalem, I Say Nehali

Let me say this first: NEHALEM, NEHALEM, NEHALEM. You can call it the Intel Xeon Series 5500 processor if you like, but the HPC community has been whispering about Nehalem and rubbing its collective hands together in anticipation of Nehalem for what seems like years now. So, let's talk about Nehalem.

Actually, let's not. I'm guessing that between the collective might of the Intel PR machine, the traditional press, and the blogosphere you are already well-steeped in the details of this new Intel processor and why it excites the HPC community. Rather than talk about the processor per se, let's talk instead about the more typical HPC scenario: not a single Nehalem, but rather piles of Nehali and how best to deploy them in an HPC cluster configuration.

Our HPC clustering approach is based on the Sun Constellation architecture, which we've deployed at TACC and other large-scale HPC sites around the world. These existing systems house compute nodes in the Sun Blade 6048 chassis, which holds four blade shelves, each with twelve blade systems, for a total of 48 blades per chassis. Constellation also includes matching InfiniBand infrastructure, including InfiniBand Network Express Modules (NEMs) and a range of InfiniBand switches that can be used to build petascale compute clusters. You can see several of the 82 TACC Ranger Constellation chassis in this photo, interspersed with (black) inline cooling units.

As part of our continued focus on HPC customer requirements, we've done something interesting with our new Nehalem-based Vayu blade (officially, the Sun Blade X6275 Server Module): each blade houses two separate nodes. Here is a photo of the Vayu blade:

Each of the nodes is a diskless, two-socket Nehalem system with 12 DDR3 DIMM slots per node (up to 96 GB per node) and on-board QDR InfiniBand. It's actually not quite correct to call these nodes diskless because the blade does include two Sun Flash Module slots (one per node) that each provide up to 24 GB of flash storage through a SATA interface. I am sure our HPC customers will use what amounts to an ultra-fast disk for some interesting applications.

Using Vayu blades, each Sun Constellation chassis can now support a total of 2 nodes/blade * 12 blades/shelf * 4 shelves/chassis = 96 nodes with a peak floating-point performance of about 9 TFLOPS. While a chassis can support up to 96 GB/node * 96 nodes = 9.2 TB of memory, there are some subtleties involved in optimizing and configuring memory for Nehalem systems, so I recommend reading John Nerl's blog entry for a detailed discussion of this topic.
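The chassis arithmetic above can be checked with a few lines of Python. This is only a rough sketch: the geometry and memory limits come from the text, but the 2.93 GHz clock and the figure of 4 double-precision floating-point operations per core per cycle are assumptions chosen for illustration (the post does not say which Nehalem part was used to arrive at "about 9 TFLOPS").

```python
# Chassis geometry and per-node limits stated in the post.
NODES_PER_BLADE = 2
BLADES_PER_SHELF = 12
SHELVES_PER_CHASSIS = 4
MAX_GB_PER_NODE = 96
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 4

# Assumptions, NOT stated in the post: core clock and flops per cycle.
ASSUMED_CLOCK_GHZ = 2.93
ASSUMED_FLOPS_PER_CYCLE = 4


def nodes_per_chassis() -> int:
    """Nodes in one fully populated chassis."""
    return NODES_PER_BLADE * BLADES_PER_SHELF * SHELVES_PER_CHASSIS


def max_memory_gb() -> int:
    """Maximum memory per chassis, in GB."""
    return MAX_GB_PER_NODE * nodes_per_chassis()


def peak_tflops() -> float:
    """Theoretical peak per chassis under the assumed clock and flops/cycle."""
    gflops_per_node = (SOCKETS_PER_NODE * CORES_PER_SOCKET
                       * ASSUMED_CLOCK_GHZ * ASSUMED_FLOPS_PER_CYCLE)
    return gflops_per_node * nodes_per_chassis() / 1000.0


if __name__ == "__main__":
    print("nodes per chassis:", nodes_per_chassis())
    print("max memory (GB):", max_memory_gb())
    print("peak TFLOPS (assumed clock):", round(peak_tflops(), 2))
```

With a different clock speed the peak figure scales linearly, so the constant is easy to adjust for another processor bin.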

For a quick visual tour of Vayu, see the annotated photo below. The major components are: A = Nehalem 4-core processor; B = Memory DIMMs; C = Mellanox ConnectX QDR InfiniBand ASIC; D = Tylersburg I/O chipset; E = Baseboard Management Controller (BMC) / Service Processor (one per node); F = Sun Flash Modules (SATA, 24 GB per node). The connector at the top right supports two PCIe2 interfaces, two InfiniBand interfaces, and two GbE interfaces. The GbE logic is hiding under the BMC daughter board.

To accommodate this high-density blade approach we've developed a new QDR InfiniBand NEM (officially called the Sun Blade 6048 QDR IB Switched NEM), which is shown below. This Network Express Module plugs into a blade shelf's midplane and forms the first level of InfiniBand fabric in a cluster configuration. Specifically, the two on-board 36-port Mellanox QDR switch chips act as leaf switches from which larger configurations can be built. Of the 72 total switch ports available, 24 are used for the nodes on the Vayu blades and 9 from each switch chip are used to interconnect the two switches, leaving a total of 72 - 24 - 2 * 9 = 30 QDR links available for off-shelf connections. These links leave the NEM through 10 physical connectors, each of which carries three QDR 4X links over a single cable. As we discussed when the first version of Constellation was released, aggregating 4X links into 12X cables results in significant customer benefits related to reliability, density, and complexity. In any case, these cables can be connected to Constellation switches to form larger, tree-based fabrics. Or they can be used in switchless configurations to build torus-based topologies by using the ten cables to carry X, Y, and Z traffic between shelves and across chassis, as mentioned, for example, here. The NEM also provides GbE connectivity for each of the 24 nodes in the blade shelf.
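The port budget for the NEM can be written out as a short Python sketch. All of the numbers come from the paragraph above; the constant and function names are just illustrative.

```python
# Port counts for one QDR InfiniBand NEM, as given in the text.
SWITCH_CHIPS = 2
PORTS_PER_CHIP = 36
NODE_PORTS = 24                  # one port per node in the blade shelf
INTERSWITCH_PORTS_PER_CHIP = 9   # ports each chip uses to reach the other
CONNECTORS = 10
LINKS_PER_CABLE = 3              # three 4X links aggregated into one 12X cable


def total_ports() -> int:
    """All switch ports on the NEM."""
    return SWITCH_CHIPS * PORTS_PER_CHIP


def off_shelf_links() -> int:
    """Ports left over after node and inter-switch connections."""
    used = NODE_PORTS + SWITCH_CHIPS * INTERSWITCH_PORTS_PER_CHIP
    return total_ports() - used


if __name__ == "__main__":
    print("total ports:", total_ports())
    print("off-shelf links:", off_shelf_links())
    # The leftover links should exactly fill the physical connectors.
    print("links carried by cables:", CONNECTORS * LINKS_PER_CABLE)
```

The last line is a consistency check: 30 leftover links match 10 connectors carrying three links each.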

Looking back at the TACC photo, we can now double the compute density shown there with our newest Constellation-based systems using the new Vayu blade. Oh, and by the way, we can also remove those inline cooling units and pack those chassis side-by-side for another significant increase in compute density. I'll leave how we accomplish that last bit for a future blog entry.

Saturday Oct 25, 2008

Digital Photographers Beware: Flash memory is not as reliable as you may think

Using flash memory cards for long-term photo storage is a really bad idea. I didn't realize just how bad until I heard a recent talk on flash technology, which included reliability statistics for the two main varieties of flash, MLC and SLC.

Before sharing the numbers, you need to know that flash memory reliability is generally measured by two quantities: endurance and retention. Endurance measures the number of write cycles each sector of the memory can handle before serious or fatal errors begin to occur. Retention measures how long a flash memory part can be expected to hold its data reliably before attempts to read the data will fail. Of course this is all statistical, but nonetheless the numbers will give you a rough idea of the lifetimes of these memory devices.

Multi-Level Cell (MLC) is the consumer-grade version of flash memory. It has an endurance of about 1000 write cycles. Think about how often you fill and erase your memory cards to decide if this number bothers you. If you are like me, you tend to keep and use memory cards for a long time, which means 1000 cycles is not a very comforting number. As an aside, I've noticed that my Canon G9 lets me erase a memory card with either a standard formatting operation or what is called "low-level formatting." Since the low-level operation takes considerably longer than the standard version, I suspect it erases every sector on the card, which imposes unnecessary wear, i.e., more cycles are consumed from the 1000-cycle budget for these memories. From now on I'll be using the standard formatting option, which I suspect only affects metadata blocks. If I recall correctly, my Canon 10D offers just one formatting option. It is very fast, so I am guessing it does not erase every data block on the card.

Let's now talk about data retention.

Some photographers I know have decided to use flash memory for long-term photo storage since cards are getting so cheap and the form-factor is small and convenient. Really, really bad idea. The retention statistic for MLC parts is a mere 3-4 years. And, worse, retention and endurance are not independent: flash memory that has been used for many cycles can have a significantly reduced retention time.

The characteristics of Single-Level Cell (SLC) are considerably better, but you still probably should not use these devices for long-term storage. While endurance for SLC is about 100,000 cycles, data retention is about 10 years. A decade is good, but not for archival purposes. And don't count on 10 years if you have heavily used the card, in which case retention time can be considerably shorter.
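To put the endurance numbers in perspective, here is a back-of-the-envelope Python sketch. It assumes each fill-and-erase of the card costs every sector exactly one write cycle, and it ignores wear leveling, write amplification, and the interaction between wear and retention mentioned above, so treat the results as rough orders of magnitude only. The function name and the usage rate of two fills per week are made up for illustration.

```python
# Approximate endurance figures quoted in the post (write cycles per sector).
MLC_ENDURANCE_CYCLES = 1_000
SLC_ENDURANCE_CYCLES = 100_000

WEEKS_PER_YEAR = 52


def years_until_worn_out(endurance_cycles: int, fills_per_week: float) -> float:
    """Years of use before the cycle budget is spent.

    Simplifying assumption: one fill-and-erase consumes one cycle on
    every sector of the card.
    """
    if fills_per_week <= 0:
        raise ValueError("fills_per_week must be positive")
    return endurance_cycles / (fills_per_week * WEEKS_PER_YEAR)


if __name__ == "__main__":
    fills = 2  # hypothetical usage: the card is filled and erased twice a week
    print("MLC years:", round(years_until_worn_out(MLC_ENDURANCE_CYCLES, fills), 1))
    print("SLC years:", round(years_until_worn_out(SLC_ENDURANCE_CYCLES, fills), 1))
```

Under these assumptions an MLC card used twice a week spends its budget in under a decade, while an SLC card would outlast any camera it is used in; retention, not endurance, becomes the limiting factor for SLC.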

While clearly SLC is more appealing than MLC, it is about 4X more expensive than MLC and about half as dense. I am hoping that the so-called "professional" flash cards sold as digital film are built using SLC, but I have found no way to determine that from vendor websites. If anyone has any concrete information on the products made by the major flash memory card vendors, please share.

Photography aside, I think it is safe to assume those ubiquitous USB dongles that are given away at conferences and other events are made with MLC. Keep that in mind when deciding what you store on these devices and how long you need it to be recoverable.

For an interesting discussion of retention and endurance along with several examples of how to determine suitability for use in several non-photographic circumstances, see this document titled Practical Guide to Endurance and Data Retention [PDF].

And to find out who is working to create ultra-reliable SLC flash memory parts, read this short article.


Josh Simons

