HPC User Forum: Operating System Panel
By Josh Simons on Apr 26, 2008
As I mentioned in an earlier entry, I participated in the HPC interconnect panel discussion at IDC's HPC User Forum meeting in Norfolk, Virginia last week. I also sat on the Operating System panel, the subject of this blog entry.
Because the organizers opted against panel member presentations, the following slides were not actually shown at the conference, though they will be included in the conference proceedings. I use them here to highlight some of the main points I made during the panel discussion.
My fellow panelists during this session were Kenneth Rozendal from IBM, John Hesterberg from SGI, Benoit Marchand from eXludus, Ron Brightwell from Sandia National Laboratories, John Vert from Microsoft, Ramesh Joginpaulli from AMD, and Richard Walsh from IDC.
The framing topic areas suggested by Richard Walsh prior to the conference were used to guide the discussion:
- Ensuring scheduling efficiency on fat nodes
- Managing cache and bandwidth resources on multi-core chips
- Linux and Windows: strengths, weaknesses and alternatives
- OS scalability and resiliency requirements of petascale systems
As you'll see, we covered a wider array of topics during the course of the panel session.
Beowulf clusters have been popular with the HPC community since about 1998. The idea arose in part as a reaction against expensive, "fat" SMP systems and proprietary, expensive software. Typical Beowulf clusters were built of "thin" nodes (typically one or two single-CPU sockets), commodity Ethernet, and an open source software stack customized for HPC environments.
With multi-core and multi-threaded processors now becoming the norm, nodes are maintaining their svelte one or two rack unit form factors, but are becoming much beefier internally. As an extreme example, consider Sun's new SPARC Enterprise T5140 server, which crams 128 hardware threads, 16 FPUs, 64 GB of memory, and close to 600 GB of storage into a single rack unit (1.75") form factor. Or the two rack-unit version (the T5240) that doubles the memory to 128 GB and supports up to almost 2.4 TB of local disk storage in the chassis. I call nodes like these Sparta nodes because they are slim and trim...and very powerful. Intel's and AMD's embrace of multicore ensures that future systems will generally become more Spartan over time.
Clusters need an interconnect. While traditional Beowulf clusters have used commodity Ethernet, they have often done so at the expense of performance for distributed applications that have significant bandwidth and/or latency requirements. As was discussed in the interconnect panel session at the HPC User Forum, InfiniBand (IB) is now making significant inroads into HPC at attractive price points and will continue to do so. Ethernet will also continue to play a role, but commodity 1 GbE is not at all in the same league with IB with respect to either bandwidth or latency. And InfiniBand currently enjoys a significant price advantage over 10 GbE, which does offer (at least currently) comparable bandwidths to IB, though without a latency solution. The use of IB in Sparta clusters allows the nodes to be more tightly coupled in the sense that a broader range of distributed applications will perform well on these systems due to the increased bandwidth and much lower latencies achievable with InfiniBand OS bypass capabilities.
This trend towards beefier nodes will have profound effects on HPC operating system requirements. Or said in a different way, this trend (and others discussed below) will alter the view of the HPC community towards operating systems. The traditional HPC view of an OS is one of "software that gets in the way of my application." In this new world, while we must still pay attention to OS overhead and deliver good application performance, the role of the OS will expand and deliver significant value for HPC.
The above is a photo I shot of the T5240 I described earlier. This is the 2RU server that recently set a new two-socket SPEComp record as well as a SPECcpu record. Details on the benchmarks are here. If you'd like a quick walkthrough of this system's physical layout, check out my annotated version of the above photo here.
The industry shift towards multicore processors has created concern within the HPC community and more broadly as well. There are several challenges to be addressed if the value and power of these processors are to be realized.
The increased number of CPUs and hardware threads within these systems will require that careful attention be paid to operating system scalability to ensure that application performance does not suffer due to inefficiencies in the underlying OS. Vendors like Sun, IBM, SGI, etc., who have worked on OS scaling issues for many years, have experience in this area, but there will doubtless be continuing scalability and performance challenges as these more closely coupled hardware complexes become available with ever larger memory configurations and with ever faster IO subsystems.
There was some disagreement within the panel session over the ramifications to application architectures of these beefier nodes when they are used as part of an HPC cluster. Will users continue to run one MPI process per CPU or thread, or will fewer MPI processes be used per node, with each process then consuming additional on-node parallelism via OpenMP or some other threading model? I am of the opinion that the mixed/hybrid style (combined MPI and threads) will be necessary for scaling to very large clusters because at some point scaling MPI will become problematic. In addition, regardless of the cluster size under consideration, using MPI within a node is not very efficient. MPI libraries can be optimized in how they use shared memory segments for transferring message data between MPI processes, but any data transfers are much less efficient than using a threading model which takes full advantage of the fact that all of the memory on a node is immediately accessible to all of the threads within one address space.
The tradeoff is that this shift from pure MPI programming to hybrid programming does require application changes and the mixed model can be more difficult since it requires thinking about two levels of parallelism. If this shift to multi-core and multi-threaded processors were not such a fundamental sea change, I would agree that recoding would not be worthwhile. However, I do view this as profound a shift as that which caused the HPC community to move to distributed programming models with PVM and MPI and to recode their applications at that time.
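To make the "two levels of parallelism" point concrete, here is a minimal sketch of the hybrid decomposition described above. It is only an illustration under stated assumptions: the MPI ranks are simulated with an ordinary loop rather than real MPI processes, a Python thread pool stands in for OpenMP threads (so it shows the structure, not the speedup), and the function names `split`, `rank_sum`, and `hybrid_sum` are hypothetical names invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split(start, stop, parts):
    """Split [start, stop) into `parts` contiguous, near-equal ranges."""
    size, extra = divmod(stop - start, parts)
    ranges = []
    lo = start
    for i in range(parts):
        # The first `extra` ranges each take one leftover element.
        hi = lo + size + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def rank_sum(data, lo, hi, threads_per_rank):
    """Thread level: every thread reads the same `data` object directly.

    Nothing is copied or sent; the threads share one address space.
    """
    def partial(bounds):
        a, b = bounds
        return sum(data[i] for i in range(a, b))

    with ThreadPoolExecutor(max_workers=threads_per_rank) as pool:
        return sum(pool.map(partial, split(lo, hi, threads_per_rank)))


def hybrid_sum(data, ranks, threads_per_rank):
    """Process level: each simulated rank owns one block of the index space."""
    partials = [
        rank_sum(data, lo, hi, threads_per_rank)
        for lo, hi in split(0, len(data), ranks)
    ]
    # In a real MPI code this last step would be a reduction across ranks.
    return sum(partials)


if __name__ == "__main__":
    values = list(range(1000))
    print(hybrid_sum(values, ranks=4, threads_per_rank=8))
```

The outer split is the part that would require messages between nodes; the inner split needs none, which is the efficiency argument for using threads within a node.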
Another challenge is that of efficient use of available memory bandwidth, both between sockets and between sockets and memory. As more computational power is crammed into a socket, it becomes more important for 1) processor and system designers to increase available memory bandwidth, 2) operating system designers to provide efficient and effective capabilities to allow applications to make effective use of bandwidth, and 3) tool vendors to provide visibility into application memory utilization to help application programmers optimize their use of the memory subsystem. In many cases, memory performance will become the gating factor on performance rather than CPU.
As the compute, memory, and IO capacities of these beefier nodes continue to grow, the resiliency of these nodes will become a more important factor. With more applications and more state within a node, downtime will be less acceptable within the HPC community. This will be especially true in commercial HPC environments where many ISV applications are commonly used and where these applications may often be able to run within a single beefy node. In such circumstances, OS capabilities like proactive fault management, which identifies nascent problems and takes action to avoid system interruption, become much more important to HPC customers. This is an interesting development, since capabilities like fault management have traditionally been developed for the enterprise computing market.
The last item--interconnect pressures--is fairly obvious. As nodes get beefier and perform more work, they put a larger demand on the cluster interconnect, both for compute-related communication and for storage data transfers. InfiniBand, with its aggressive bandwidth roadmap, and an ability to construct multi-rail fabrics, will play an important role in maintaining system balance. Well-crafted low level software (OS, IB stack, MPI) will be needed to handle the larger loads at scale.
Beyond the challenges of multi-core and multi-threading, there are opportunities. For all but the highest end customers, beefier nodes will allow node counts to grow more slowly, allowing datacenter footprints to grow more slowly, decreasing the rate of complexity growth and scaling issues.
Much more important, however, is that with more hardware resources per node it will now be possible to dedicate some small to moderate amount of processing power to handling OS tasks while minimizing the impact of that processing on application performance. The ability to fence off application workload from OS processing using concepts like processor sets, processor bindings, and the ability to turn off device interrupt processing, etc. should allow applications to be run with reduced jitter while still supporting a full, standard OS environment on each node, and not having to resort to microkernel or other approaches to deliver low jitter. Using standard OSes (possibly stripped down by removing or disabling unneeded functions) is very important for several reasons.
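As a rough illustration of the fencing idea, the sketch below sets aside one CPU for OS work and binds the calling process to the rest. It is not Solaris processor sets: it assumes a Linux system, where Python exposes `os.sched_getaffinity` and `os.sched_setaffinity`, and it does nothing on platforms without them. The helper names `partition_cpus` and `fence_application` are hypothetical. Note too that binding the application is only half of the fence; steering interrupts and OS daemons away from the application CPUs is a separate administrative step not shown here.

```python
import os


def partition_cpus(cpus, reserved_for_os=1):
    """Split a collection of CPU ids into (os_cpus, app_cpus).

    The lowest-numbered CPUs are reserved for OS work in this sketch;
    that choice is arbitrary.
    """
    ordered = sorted(cpus)
    if not 0 < reserved_for_os < len(ordered):
        raise ValueError("need at least one CPU on each side of the fence")
    return set(ordered[:reserved_for_os]), set(ordered[reserved_for_os:])


def fence_application(reserved_for_os=1):
    """Bind the calling process to the application CPUs where supported.

    Returns the set of CPUs the process is now bound to, or None if the
    platform lacks the affinity calls or has too few CPUs to split.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # affinity calls are not available on this platform
    available = os.sched_getaffinity(0)  # 0 means the calling process
    if len(available) <= reserved_for_os:
        return None
    _, app_cpus = partition_cpus(available, reserved_for_os)
    os.sched_setaffinity(0, app_cpus)
    return app_cpus


if __name__ == "__main__":
    print(partition_cpus({0, 1, 2, 3}))
```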
As mentioned earlier, OS capabilities are becoming more important, not less. I've mentioned fault management and scalability. In a few slides we'll talk about power management, virtualization and other capabilities that will be needed for successful HPC installations in the future. Attempt to build all of that into a microkernel and you end up with a kernel. You might as well start with all of the learning and innovation that has accrued to standard OSes and minimize and improve where necessary rather than building one-off or few-off custom software environments.
I worry about how well-served the very high end of the HPC market will be in the future. It isn't a large market or one that is growing like other segments of HPC. While it is a segment that solves incredibly important problems, it is also quite a difficult market for vendors to satisfy. The systems are huge, the software scaling issues tremendously difficult, and, frankly, both the volume and the margins are low. This segment has argued for many years that meeting their scaling and other requirements guarantees that a vendor will be well-positioned to satisfy any other HPC customer's requirements. It is essentially the "scale down" argument. But that argument is in jeopardy to the extent that the high-end community embraces a different approach than is needed for the bulk of the HPC market. Commercial HPC customers want and need a full instance of Solaris or Linux on their compute nodes because they have both throughput and capability problems to run and because they run lots of ISV applications. They don't want a microkernel or some other funky software environment.
I absolutely understand and respect the software work being done at our national labs and elsewhere to take advantage of the large-scale systems they have deployed. But this does not stop me from worrying about the ramifications of the high-end delaminating from the rest of the HPC market.
The HPC community is accustomed to being on the leading/bleeding edge, creating new technologies that eventually flow into the larger IT markets. InfiniBand is one such example that is still in process. Parallel computing techniques may be another as the need for parallelization begins to be felt far beyond HPC due to the tailing off of clock speed increases and the emergence of multi-core CPUs.
Virtualization is an example of the opposite. This is a trend that is taking the enterprise computing markets by storm. The HPC community to date has not been interested in a technology that in their view simply adds another layer of "stuff" between the hardware and their applications, reducing performance. I would argue that virtualization is coming and will be ubiquitous and the HPC community needs to actively engage to 1) influence virtualization technology to align it with HPC needs, and 2) find ways in which virtualization can be used to advantage within the HPC community rather than simply being victimized by it. It is coming, so let's embrace it.
The two immediate challenges are both performance related: base performance of applications in a virtualized environment and virtualizing the InfiniBand (or Ethernet) interconnect while still being able to deliver high performance on distributed applications, including both compute and storage. The first issue may not be a large one since most HPC codes are compute intensive and such code should run at nearly full speed in a virtualized environment. And early research on virtualized IB, for example by DK Panda's Network-based Computing Laboratory at OSU, has shown promising results. In addition, the PCI-IOV (IO Virtualization) standard will add hardware support for PCI virtualization that should help achieve high performance.
What about the potential benefits of virtualization for HPC? I can think of several possibilities:
- Coupling live migration with fault management to dynamically shift a running guest OS instance off of a failing node, thereby avoiding application interruptions.
- Using the clean interface between hypervisor and Guest OS instances to perform checkpointing of a guest OS instance (or many instances in the case of an MPI job) rather than attempting to checkpoint individual processes within an OS instance. The HPC community has tried for many years to create the latter capability, but there are always limitations. Perhaps we can do a more complete job by working at this lower level.
- Virtualization can enable higher utilization of available resources (those beefy nodes again) while maintaining a security and failure barrier between applications and users. This is ideal in academic or other service environments in which multi-tenancy is an issue, for example in cases where academic users and industry partners with privacy or security concerns share the same compute resources.
- Virtualization can also be used to decrease the administrative burden on system administration staff and allow it to be more responsive to the needs of its user population. For example, a Solaris or Linux-based HPC installation could easily allow virtualized Windows-based ISV applications to be run dynamically on its systems without having to permanently maintain Windows systems in the environment.
The key point is that virtualization is coming and we as a community should find the best ways of using the technology to our advantage. The above are some ideas--I'd like to hear others.
The power and cooling challenges are straightforward; the solutions are not. We must deliver effective power management capabilities for compute, storage, and interconnect that support high performance, but also deliver significant improvements over current practice. To do this effectively requires work across the entire hardware and software stack. Processor and system design. Operating system design. System management framework design. And it will require a very comprehensive review of coding practices at all levels of the stack. Polling is not eco-efficient, for example.
Power management issues are yet another reason why operating systems become more important to HPC as capabilities developed primarily for the much larger enterprise computing markets gain relevance for HPC customers.
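The remark above that polling is not eco-efficient can be illustrated with a small sketch. It compares a loop that repeatedly wakes up to check a flag with a single blocking wait on the same flag. The names `wait_by_polling`, `wait_by_blocking`, and `run`, and the chosen delay and polling interval, are assumptions made for this example; the wakeup counts are a crude stand-in for the work that keeps a CPU out of its low-power idle states, not a power measurement.

```python
import threading
import time


def wait_by_polling(flag, interval=0.001):
    """Wake up repeatedly to check the flag; return the number of wakeups."""
    wakeups = 0
    while not flag.is_set():
        wakeups += 1
        time.sleep(interval)  # every expiry wakes the thread up again
    return wakeups


def wait_by_blocking(flag):
    """Sleep until the flag is set; the thread is woken exactly once."""
    flag.wait()
    return 1


def run(waiter, delay=0.05):
    """Set the flag after `delay` seconds and report the waiter's wakeups."""
    flag = threading.Event()
    timer = threading.Timer(delay, flag.set)
    timer.start()
    wakeups = waiter(flag)
    timer.join()
    return wakeups


if __name__ == "__main__":
    print("polling wakeups: ", run(wait_by_polling))
    print("blocking wakeups:", run(wait_by_blocking))
```

The same reasoning applies at every level of the stack: code that blocks on an event lets the hardware idle, while code that polls for it does not.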
Here I sketch a little of Sun's approach with respect to OSes for HPC. First, we will offer both Linux and Solaris-based HPC solutions, including a full stack of HPC software on top of the base operating systems. We recognize quite clearly the position currently held by Linux in HPC and see no reason why we should not be a preferred provider of such systems. At the same time, we believe there is a strong value proposition for Solaris in HPC and that we can deliver performance along with an array of increasingly relevant enterprise-derived capabilities that will benefit the HPC community. We also realize it is incumbent upon us to prove this point to you and we intend to do so.
I will finish by commenting on one bullet on this final slide. For the other products and technologies, I will defer to future blog posts. The item I want to end with is Project Indiana and OpenSolaris due to its relevance to HPC customers.
In 2005, Sun created an open source and open development effort based on the source code for the Solaris Operating System. Called OpenSolaris, the community now numbers well over 75,000 members and it continues to grow.
Project Indiana is an OpenSolaris project whose goal is to produce OpenSolaris binary distributions that will be made freely available to anyone, with optional for-fee support available from Sun. An important part of this project is a modernization effort that moves Solaris to a network-based package management system, updates many of the open-source utilities that are included in the distro, and adds open-source programs and utilities that are commonly expected to be present. To those familiar with Linux, the OpenSolaris user experience should become much more familiar as these changes roll out. In my view, this was a necessary and long-overdue step towards lowering the barrier for Linux (and other) users, enabling them to more easily step into the Solaris environment and benefit from the many innovations we've introduced (see slide for some examples).
In addition to OpenSolaris binary distros, you will see other derivative distros appearing. In particular, we are working to define an OpenSolaris-based distro that will include a full HPC software stack and will address both developers and deployers. This effort has been running within Sun for a while and will soon transition to an OpenSolaris project so we can more easily solicit community involvement in this effort. This Solaris HPC distro is meant to complement similar work being done by a Linux-focused engineering team within our expanded Lustre group, which is also doing its work in the open and also encourages community involvement.
There was some grumbling at the HPC User Forum about the general Linux community and its lack of focus or interest in HPC. While clearly there have been some successes (for example, some of the Linux scaling work done by SGI), there is frustration. One specific example mentioned was the difficulty in getting InfiniBand support into the Linux kernel. My comment on that? We don't need to ask Linus' permission to put HPC-enabling features into Solaris. In fact, with our CEO making it very clear that HPC is one of the top three strategic focus areas for Sun, we welcome HPC community involvement in OpenSolaris. It's free, it's open, and we want Solaris to be your operating system of choice for HPC.