Friday Sep 04, 2009

HPC Virtual Conference: No Travel Budget? No Problem!

Sun is holding a virtual HPC conference on September 17th featuring Andy Bechtolsheim as keynote speaker. Andy will be talking about the challenges around creating Exaflop systems by 2020, after which he will participate in a chat session with attendees. In fact, each of the conference speakers (see agenda) will chat with attendees after their presentations.

There will also be two sets of exhibits to "visit" to find information on HPC solutions for specific industries or to get information on specific HPC technologies. Industries covered include MCAE, EDA, Government/Education/Research, Life Sciences, and Digital Media. There will be technology exhibits on storage software and hardware, integrated software stack for HPC, compute and networking hardware, and HPC services.

This is a free event. Register here.

Monday Jul 13, 2009

Performance Facts, Performance Wisdom

I was genuinely excited to see that members of Sun's Strategic Applications Engineering team have started a group blog about performance called BestPerf. These folks are the real deal -- they are responsible for generating all of the official benchmark results published by Sun -- and they collectively have a deep background in all things related to performance. I like the blog because while they do cover specific benchmark results in detail, they also share best practices and include broader discussions about achieving high performance as well. There is a lot of useful material for anyone seeking a better understanding of performance.

Here are some recent entries that caught my eye.

Find out how a Sun Constellation system running SLES 10 beat IBM BlueGene/L on a NAMD Molecular Dynamics benchmark here.

See how the Solaris DTrace facility can be used to perform detailed IO analyses here.

Detailed Fluent cluster benchmark results using the Sun Fire x2270 and SLES 10? Go here.

How to use Solaris containers, processor sets, and scheduling classes to improve application performance? Go here.

Friday Apr 03, 2009

HPC in Second Life (and Second Life in HPC)

We held an HPC panel session yesterday in Second Life for Sun employees interested in learning more about HPC. Our speakers were Cheryl Martin, Director of HPC Marketing; Peter Bojanic, Director for Lustre; Mike Vildibill, Director of Sun's Strategic Engagement Team (SET); and myself. We covered several aspects of HPC: what it is, why it is important, and how Sun views it from a business perspective. We also talked about some of the hardware and software technologies and products that are key enablers for HPC: Constellation, Lustre, MPI, etc.

As we were all in-world at the time, I thought it would be interesting to ponder whether Second Life itself could be described as "HPC" and whether we were in fact holding the HPC meeting within an HPC application. Having viewed this excellent SL Architecture talk given by Ian (Wilkes) Linden, VP of Systems Engineering at Linden Lab, I conclude that SL is definitely an HPC application. Consider the following information taken from Ian's presentation.

As you can see, the geography of SL has been exploding in size over the last 5-6 years. As of Dec 2008 that geography is simulated using more than 15K instances of the SL simulator process that in addition to computing the physics of SL also run an average of 30 million simultaneous server-side scripts to create additional aspects of the SL user experience. And look at the size of their dataset: 100TB is very respectable from an HPC perspective. And a billion files! Many HPC sites are worrying what will happen when they get to that level of scale while Linden Lab is already dealing with it. I was surprised they aren't using Lustre, since I assume their storage needs are exploding as well. But I digress.

The SL simulator described above would be familiar to any HPC programmer. It's a big C++ code. The problem space (the geography of SL) has been decomposed into 256m X 256m chunks that are each assigned to once instance of the simulator. Each simulator process runs on its own CPU core and "adjacent" simulator instances exchange edge data to ensure consistency across sub-domain boundaries. And it's a high-level physics simulation. Smells like HPC to me.

Thursday Nov 13, 2008

Big News for HPC Developers: More Free Stuff

'Tis the Season. Supercomputing season, that is. Every November the HPC community--users, researchers, and vendors--attend the world's biggest conference on HPC: Supercomputing. This year SC08 is being held in Austin Texas, to which I'll be flying in a few short hours.

As part of the seasonal rituals vendors often announce new products, showcase new technologies and generally strut their stuff at the show and even before the show in some cases. Sun is no exception as you will see if you visit our booth at the show and if you take note of two announcements we made today that should be seen as a Big Deal to HPC developers. The first concerns MPI and the second our Sun Studio developer tools.

The first announcement extends Sun's support of Open MPI to Linux with the release of ClusterTools 8.1. This is huge news for anyone looking for a pre-built and extensively tested version of Open MPI for RHEL 4 or 5, SLES 9 or 10, OpenSolaris, or Solaris 10. Support contracts are available for a fee if you need one, but you can download the CT 8.1 bits here for free and use them to your heart's content, no strings attached.

Here are some of the major features supported in ClusterTools 8.1:

  • Support for Linux (RHEL 4&5, SLES 9&10), Solaris 10, OpenSolaris
  • Support for Sun Studio compilers on Solaris and Linux, plus the GNU/gcc toolchain on Linux
  • MPI profiling support with Sun Studio Analyzer (see SSX 11.2008), plus support for VampirTrace and MPI PERUSE
  • InfiniBand multi-rail support
  • Mellanox ConnectX Infiniband support
  • DTrace provider support on Solaris
  • Enhanced performance and scalability, including processor affinity support
  • Support for InfiniBand, GbE, 10GbE, and Myrinet interconnects
  • Plug-ins for Sun Grid Engine (SGE) and Portable Batch System (PBS)
  • Full MPI-2 standard compliance, including MPI I/O and one sided communication

The second event was the release of Sun Studio Express 11/08, which among other enhancements adds complete support for the new OpenMP 3.0 specification, including tasking. If you are questing for ways to extract parallelism from your code to take advantage of multicore processors, you should be looking seriously at OpenMP. And you should do it with the Sun Studio suite, our free compilers and tools which really kick butt on OpenMP performance. You can download everything--the compilers, the debugger, the performance analyzer (including new MPI performance analysis support) and other tools for free from here. Solaris 10, OpenSolaris, and Linux (RHEL 5/SuSE 10/Ubuntu 8.04/CentOS 5.1) are all supported. That includes an extremely high-quality (and free) Fortran compiler among other goodies. (Is it sad that us HPC types still get a little giddy about Fortran? What can I say...)

The full list of capabilities in this Express release are too numerous to list here, so check out this feature list or visit the wiki.

Saturday Apr 26, 2008

HPC User Forum: Operating System Panel

As I mentioned in an earlier entry, I participated in the HPC interconnect panel discussion at IDC's HPC User Forum meeting in Norfolk, Virginia last week. I also sat on the Operating System panel, the subject of this blog entry.

Because the organizers opted against panel member presentations, the following slides were not actually shown at the conference, though they will be included in the conference proceedings. I use them here to highlight some of the main points I made during the panel discussion.

[os panel slide 1]

My fellow panelists during this session were Kenneth Rozendal from IBM, John Hesterberg from SGI, Benoit Marchand from eXludus, Ron Brightwell from Sandia National Laboratory, John Vert from Microsoft, Ramesh Joginpaulli from AMD, and Richard Walsh from IDC.

The framing topic areas suggested by Richard Walsh prior to the conference were used to guide the discussion:

  • Ensuring scheduling efficiency on fat nodes
  • Managing cache and bandwidth resources on multi-core chips
  • Linux and windows: strengths, weaknesses and alternatives
  • OS scalability and resiliency requirements of petascale systems

As you'll see, we covered a wider array of topics during the course of the panel session.

[os panel slide 2]

Beowulf clusters have been popular with the HPC community since about 1998. The idea arose in part as a reaction against expensive, "fat" SMP systems and proprietary, expensive software. Typical Beowulf clusters were built of "thin" nodes (typically one or two single-CPU sockets), commodity ethernet, and an open source software stack customized for HPC environments.

With multi-core and multi-threaded processors now becoming the norm, nodes are maintaining their svelte one or two rack unit form factors, but become much beefier internally. As an extreme example, consider Sun's new SPARC Enterprise T5140 server, which crams 128 hardware threads, 16 FPUs, 64 GB of memory, and close to 600 GBs of storage into a single rack unit (1.75") form factor. Or the two rack-unit version (the T5240) that doubles the memory to 128 GB and supports up to almost 2.4 TB of local disk storage in the chassis. I call nodes like these Sparta nodes because they are slim and trim...and very powerful. Intel and AMD's embracing of multicore ensures that future systems will generally become more Spartan over time.

Clusters need an interconnect. While traditional Beowulf clusters have used commodity Ethernet, they have often done so at the expense of performance for distributed applications that have significant bandwidth and/or latency requirements. As was discussed in the interconnect panel session at the HPC User Forum, InfiniBand (IB) is now making significant inroads into HPC at attractive price points and will continue to do so. Ethernet will also continue to play a role, but commodity 1 GbE is not at all in the same league with IB with respect to either bandwidth or latency. And InfiniBand currently enjoys a significant price advantage over 10 GbE, which does offer (at least currently) comparable bandwidths to IB, though without a latency solution. The use of IB in Sparta clusters allows the nodes to be more tightly coupled in the sense that a broader range of distributed applications will perform well on these systems due to the increased bandwidth and much lower latencies achievable with InfiniBand OS bypass capabilities.

This trend towards beefier nodes will have profound effects on HPC operating system requirements. Or said in a different way, this trend (and others discussed below) will alter the view of the HPC community towards operating systems. The traditional HPC view of an OS is one of "software that gets in the way of my application." In this new world, while we must still pay attention to OS overhead and deliver good application performance, the role of the OS will expand and deliver significant value for HPC.

[os panel slide 3]

The above is a photo I shot of the T5240 I described earlier. This is the 2RU server that recently set a new two-socket SPEComp record as well as a SPECcpu record. Details on the benchmarks are here. If you'd like a quick walkthrough of this system's physical layout, check out my annotated version of the above photo here.

[os panel slide 4]

The industry shift towards multicore processors has created concern within the HPC community and more broadly as well. There are several challenges to be addressed if the value and power of these processors are to be realized.

The increased number of CPUs and hardware threads within these systems will require careful attention be paid to operating system scalability to ensure that application performance does not suffer due to inefficiencies in the underlying OS. Vendors like Sun, IBM, SGI, etc, who have worked on OS scaling issues for many years have experience in this area, but there will doubtless be continuing scalability and performance challenges as these more closely coupled hardware complexes become available with ever large memory configurations and with ever faster IO subsystems.

There was some disagreement within the panel session over the ramifications to application architectures of these beefier nodes when they are used as part of an HPC cluster. Will users continue to run one MPI process per CPU or thread, or will a fewer number of MPI processes be used per node with each process then consuming additional on-node parallelism via OpenMP or some other threading model? I am of the opinion that the mixed/hybrid style (combined MPI and threads) will be necessary for scaling to very large-size clusters because at some point scaling MPI will become problematic. In addition, regardless of the cluster size under consideration, using MPI within a node is not very efficient. MPI libraries can be optimized in how they use shared memory segments for transferring message data between MPI processes, but any data transfers are much less efficient than using a threading model which takes full advantage of the fact that all of the memory on a node is immediately accessible to all of the threads within one address space.

The tradeoff is that this shift from pure MPI programming to hybrid programming does require application changes and the mixed model can be more difficult since it requires thinking about two levels of parallelism. If this shift to multi-core and multi-threaded processors were not such a fundamental sea change, I would agree that recoding would not be worthwhile. However, I do view this as profound a shift as that which caused the HPC community to move to distributed programming models with PVM and MPI and to recode their applications at that time.

Another challenge is that of efficient use of available memory bandwidth, both between sockets and between sockets and memory. As more computational power is crammed into a socket, it becomes more important for 1) processor and system designers to increase available memory bandwidth, 2) operating system designers to provide efficient and effective capabilities to allow applications to make effective use of bandwidth, and 3) for tool vendors to provide visibility into application memory utilization to help application programmers optimized their use of the memory subsystem. In many case, memory performance will become the gating factor on performance rather than CPU.

As the compute, memory, and IO capacities of these beefier nodes continues to grow, the resiliency of these nodes will become a more important factor. With more applications and more state within a node, downtime will be less acceptable within the HPC community. This will be especially true in commercial HPC environments where many ISV applications are commonly used and where these applications may often be able to run within a single beefy node. In such circumstances, OS capabilities like proactive fault management which identifies nascent problems and takes action to avoid system interruption become much more important to HPC customers. An interesting development, since capabilities like fault management have traditionally been developed for the enterprise computing market.

The last item--interconnect pressures--is fairly obvious. As nodes get beefier and perform more work, they put a larger demand on the cluster interconnect, both for compute-related communication and for storage data transfers. InfiniBand, with its aggressive bandwidth roadmap, and an ability to construct multi-rail fabrics, will play an important role in maintaining system balance. Well-crafted low level software (OS, IB stack, MPI) will be needed to handle the larger loads at scale.

[os panel slide 5]

Beyond the challenges of multi-core and multi-threading, there are opportunities. For all but the highest end customers, beefier nodes will allow node counts to grow more slowly, allowing datacenter footprints to grow more slowly, decreasing the rate of complexity growth and scaling issues.

Much more important, however, is that with more hardware resources per node it will now be possible to dedicate some small to moderate amount of processing power to handling OS tasks while minimizing the impact of that processing on application performance. The ability to fence off application workload from OS processing using concepts like processor sets, processor bindings, and the ability to turn off device interrupt processing, etc. should allow applications to be run with reduced jitter while still supporting a full, standard OS environment on each node, and not having to resort to microkernel or other approaches to deliver low jitter. Using standard OSes (possible stripped down by removing or disabling unneeded functions) is very important for several reasons.

As mentioned earlier, OS capabilities are becoming more important, not less. I've mentioned fault management and scalability. In a few slides we'll talk about power management, virtualization and other capabilities that will be needed for successful HPC installations in the future. Attempt to build all of that into a microkernel and you end up with a kernel. You might as well start with all of the learning and innovation that has accrued to standard OSes and minimize and improve where necessary rather than building one-off or few-off custom software environments.

I worry about how well-served the very high end of the HPC market will be in the future. It isn't a large market or one that is growing like other segments of HPC. While it is a segment that solves incredibly important problems, it is also quite a difficult market for vendors to satisfy. The systems are huge, the software scaling issues tremendously difficult, and, frankly, both the volume and the margins are low. This segment has argued for many years that meeting their scaling and other requirements guarantees that a vendor will be well-positioned to satisfy any other HPC customer's requirements. It is essentially the "scale down" argument. But that argument is in jeopardy to the extent that the high-end community embraces a different approach than is needed for the bulk of the HPC market. Commercial HPC customers want and need a full instance of Solaris or Linux on their compute nodes because they have both throughput and capability problems to run and because they run lots of ISV applications. They don't want a microkernel or some other funky software environment.

I absolutely understand and respect the software work being done at our national labs and elsewhere to take advantage of the large-scale systems they have deployed. But this does not stop me from worrying about the ramifications of the high-end delaminating from the rest of the HPC market.

[os panel slide 6]

The HPC community is accustomed to being on the leading/bleeding edge, creating new technologies that eventually flow into the larger IT markets. InfiniBand is one such example that is still in process. Parallel computing techniques may be another as the need for parallelization begins to be felt far beyond HPC due to the tailing off of clock speed increases and the emergence of multi-core CPUs.

Virtualization is an example of the opposite. This is a trend that is taking the enterprise computing markets by storm. The HPC community to date has not been interested in a technology that in their view simply adds another layer of "stuff" between the hardware and their applications, reducing performance. I would argue that virtualization is coming and will be ubiquitous and the HPC community needs to actively engaged to 1) influence virtualization technology to align it with HPC needs, and 2) find ways in which virtualization can be used to advantage within the HPC community rather than simply being victimized by it. It is coming, so let's embrace it.

The two immediate challenges are both performance related: base performance of applications in a virtualized environment and virtualizing the InfiniBand (or Ethernet) interconnect while still being able to deliver high performance on distributed applications, including both compute and storage. The first issue may not be a large one since most HPC codes are compute intensive and such code should run at nearly full speed in a virtualized environment. And early research on virtualized IB, for example by DK Panda's Network-based Computing Laboratory at OSU, has shown promising results. In addition, the PCI-IOV (IO Virtualization) standard will add hardware support for PCI virtualization that should help achieve high performance.

What about the potential benefits of virtualization for HPC? I can think of several possibilities:

  • Coupling live migration [PDF] with fault management to dynamically shift a running guest OS instance off of a failing node, thereby avoiding application interruptions.
  • Using the clean interface between hypervisor and Guest OS instances to perform checkpointing of a guest OS instance (or many instances in the case of an MPI job) rather than attempting to checkpoint individual processes within an OS instance. The HPC community has tried for many years to create the latter capability, but there are always limitations. Perhaps we can do a more complete job by working at this lower level.
  • Virtualization can enable higher utilization of available resources (those beefy nodes again) while maintaining a security and failure barrier between applications and users. This is ideal in academic or other service environments in which multi-tenancy is an issue, for example in cases where academic users and industry partners with privacy or security concerns share the same compute resources.
  • Virtualization can also be used to decrease the administrative burden on system administration staff and allow it to be more responsive to the needs of its user population. For example, a Solaris or Linux-based HPC installation could easily allow virtualized Windows-based ISV applications to be run dynamically on its systems without having to permanently maintain Windows systems in the environment.

They key point is that virtualization is coming and we as a community should find the best ways of using the technology to our advantage. The above are some ideas--I'd like to hear others.

[os panel slide 7]

The power and cooling challenges are straightforward; the solutions are not. We must deliver effective power management capabilities for compute, storage, and interconnect that support high performance, but also deliver significant improvements over current practice. To do this effectively requires work across the entire hardware and software stack. Processor and system design. Operating system design. System management framework design. And it will require a very comprehensive review of coding practices at all levels of the stack. Polling is not eco-efficient, for example.

Power management issues are yet another reason why operating systems become more important to HPC as capabilities developed primarily for the much larger enterprise computing markets gain relevance for HPC customers.

[os panel slide 8]

Here I sketch a little of Sun's approach with respect to OSes for HPC. First, we will offer both Linux and Solaris-based HPC solutions, including a full stack of HPC software on top of the base operating systems. We recognize quite clearly the position currently held by Linux in HPC and see no reason why we should not be a preferred provider of such systems. At the same time, we believe there is a strong value proposition for Solaris in HPC and that we can deliver performance along an array of increasingly relevant enterprise-derived capabilities that will benefit the HPC community. We also realize it is incumbent upon us to prove this point to you and we intend to do so.

I will finish by commenting on one bullet on this final slide. For the other products and technologies, I will defer to future blog posts. The item I want to end with is Project Indiana and OpenSolaris due to its relevance to HPC customers.

In 2005, Sun created an open source and open development effort based on the source code for the Solaris Operating System. Called OpenSolaris, the community now numbers well over 75,000 members and it continues to grow.

Project Indiana is an OpenSolaris project whose goal is to produce OpenSolaris binary distributions that will be made freely available to anyone with optional for-fee support available from Sun. An important part of this project is a modernization effort that moves Solaris to a network-based package management system, updates many of the open-source utilities that are included in the distro, and adds open-source programs and utilities that are commonly expected to be present. To those familiar with Linux, the OpenSolaris user experience should become much more familiar as these changes roll out. In my view, this was a necessary and long-overdue step towards lowering the barrier for Linux (and other) users, enabling them to more easily step into the Solaris environment and benefit from the many innovations we've introduced (see slide for some examples.)

In addition to OpenSolaris binary distros, you will see other derivative distros appearing. In particular, we are working to define an OpenSolaris-based distro that will include a full HPC software stack and will address both developers and deployers. This effort has been running within Sun for awhile and will soon transition to an OpenSolaris project so we can more easily solicit community involvement in this effort. This Solaris HPC distro is meant to complement similar work being done by a Linux-focused engineering team within our expanded Lustre group, which is also doing its work in the open and also encourages community involvement.

There was some grumbling at the HPC User Forum about the general Linux community and its lack of focus or interest in HPC. While clearly there have been some successes (for example, some of the Linux scaling work done by SGI), there is frustration. One specific example mentioned was the difficulty in getting InfiniBand support into the Linux kernel. My comment on that? We don't need to ask Linus' permission to put HPC-enabling features into Solaris. In fact, with our CEO making it very clear that HPC is one of the top three strategic focus areas for Sun [PDF, 2MB], we welcome HPC community involvement in OpenSolaris. It's free, it's open, and we want Solaris to be your operating system of choice for HPC.


Josh Simons


« June 2016