As I mentioned in an earlier entry, I participated in the HPC interconnect panel discussion at
IDC's HPC User Forum meeting in Norfolk, Virginia
last week. I also sat on the Operating System panel, the subject of this blog entry.
Because the organizers opted against panel member presentations, the following slides were not
actually shown at the conference, though they will be included in the conference proceedings.
I use them here to highlight some of the main points I made during the panel discussion.
My fellow panelists during this session were Kenneth Rozendal from IBM,
John Hesterberg from SGI, Benoit Marchand from eXludus,
Ron Brightwell from Sandia National Laboratories,
John Vert from Microsoft,
Ramesh Joginpaulli from AMD, and
Richard Walsh from IDC.
The framing topic areas suggested by Richard Walsh prior to the conference were used
to guide the discussion:
- Ensuring scheduling efficiency on fat nodes
- Managing cache and bandwidth resources on multi-core chips
- Linux and Windows: strengths, weaknesses, and alternatives
- OS scalability and resiliency requirements of petascale systems
As you'll see, we covered a wider array of topics during the course of the panel session.
Beowulf clusters have been popular with the HPC community
since about 1998. The idea arose in part as a reaction against expensive, "fat" SMP systems and expensive, proprietary software.
Typical Beowulf clusters were built from "thin" nodes (typically one or two single-CPU sockets), commodity Ethernet, and an
open source software stack customized for HPC environments.
With multi-core and multi-threaded processors now becoming
the norm, nodes are maintaining their svelte one or two rack-unit form factors while becoming much beefier internally.
As an extreme example, consider Sun's new SPARC Enterprise T5140 server, which crams 128 hardware threads, 16 FPUs,
64 GB of memory, and close to 600 GB of storage into a single rack unit (1.75") form factor.
Or the two-rack-unit version (the T5240), which doubles the memory to 128 GB and supports almost 2.4 TB of local disk storage in the chassis. I call nodes like these Sparta nodes because they
are slim and trim...and very powerful. Intel's and AMD's embrace of multicore ensures that future systems will
generally become more Spartan over time.
Clusters need an interconnect. While traditional Beowulf clusters have used commodity Ethernet, they have often done
so at the expense of performance for distributed applications that have significant bandwidth and/or latency requirements.
As was discussed in the interconnect panel session at the HPC User Forum, InfiniBand (IB) is now making significant inroads
into HPC at attractive price points and will continue to do so. Ethernet will also continue to play a role, but commodity 1 GbE is not in the same league as IB
with respect to either bandwidth or latency. InfiniBand also currently enjoys a significant price advantage over 10 GbE,
which does offer (at least currently) bandwidth comparable to IB, though without a latency solution. The use of IB in Sparta clusters
allows the nodes to be more tightly coupled: thanks to the increased bandwidth and the much lower latencies achievable with InfiniBand's OS-bypass capabilities, a broader range of distributed applications will perform well on these systems.
This trend towards beefier nodes will have profound effects on HPC operating system requirements. Or said in a different
way, this trend (and others discussed below) will alter the view of the HPC community towards operating systems. The
traditional HPC view of an OS is one of "software that gets in the way of my application." In this new world, while we must
still pay attention to OS overhead and deliver good application performance, the role of the OS will expand and deliver
significant value for HPC.
The above is a photo I shot of the T5240 I described earlier. This is the 2RU server that recently set a
new two-socket SPEComp record as well as a SPECcpu record. Details on the benchmarks are here. If you'd like a quick walkthrough
of this system's physical layout, check out my annotated version of the above photo here.
The industry shift towards multicore processors has created concern within the HPC community and
more broadly as well. There are several challenges to be addressed if the value and power of
these processors are to be realized.
The increased number of CPUs and hardware threads within these systems will require that careful attention be
paid to operating system scalability to ensure that application performance does not suffer due to inefficiencies
in the underlying OS. Vendors like Sun, IBM, and SGI, who have worked on OS scaling issues for many years,
have experience in this area, but there will doubtless be continuing scalability and performance challenges
as these more closely coupled hardware complexes become available with ever larger memory configurations
and ever faster IO subsystems.
There was some disagreement within the panel session over the ramifications to application architectures
of these beefier nodes when they are used as part of an HPC cluster. Will users continue to run one MPI
process per CPU or thread, or will fewer MPI processes be used per node, with each process
then consuming additional on-node parallelism via OpenMP or some other threading model? I am of the
opinion that the mixed/hybrid style (combined MPI and threads) will be necessary for scaling to very
large-size clusters because at some point scaling MPI will become problematic. In addition, regardless of
the cluster size under consideration, using MPI within a node is not very efficient. MPI libraries can be
optimized in how they use shared memory segments for transferring message data between MPI
processes, but such transfers are still much less efficient than a threading model that takes
full advantage of the fact that all of the memory on a node is immediately accessible to all of the threads within
one address space.
The tradeoff is that this shift from pure MPI programming to hybrid programming does require application
changes and the mixed model can be more difficult since it requires thinking about two levels of parallelism.
If this shift to multi-core and multi-threaded processors were not such a fundamental sea change, I would agree that
recoding would not be worthwhile. However, I do view this as profound a shift as that which caused the HPC community
to move to distributed programming models with PVM and MPI and to recode their applications at that time.
Another challenge is that of efficient use of available memory bandwidth, both between sockets and between
sockets and memory. As more computational power is crammed into a socket, it becomes more important for
1) processor and system designers to increase available memory bandwidth, 2) operating system designers
to provide capabilities that allow applications to make effective use of that bandwidth, and
3) tool vendors to provide visibility into application memory utilization to help application programmers
optimize their use of the memory subsystem. In many cases, memory performance rather than CPU speed will
become the gating factor.
As the compute, memory, and IO capacities of these beefier nodes continue to grow, the resiliency of these
nodes will become a more important factor. With more applications and more state within a node,
downtime will be less acceptable within the HPC community. This will be especially true in commercial HPC
environments where many ISV applications are commonly used and where these applications may often
be able to run within a single beefy node. In such circumstances, OS capabilities like proactive fault
management which identifies nascent problems and takes action to avoid system interruption become
much more important to HPC customers. This is an interesting development, since capabilities like fault management have
traditionally been developed for the enterprise computing market.
The last item--interconnect pressures--is fairly obvious. As nodes get beefier and perform more
work, they put a larger demand on the cluster interconnect, both for compute-related communication
and for storage data transfers. InfiniBand, with its aggressive bandwidth roadmap, and an ability to
construct multi-rail fabrics, will play an important role in maintaining system balance. Well-crafted
low level software (OS, IB stack, MPI) will be needed to handle the larger loads at scale.
Beyond the challenges of multi-core and multi-threading, there are opportunities. For all but
the highest end customers, beefier nodes will allow node counts to grow more slowly, allowing
datacenter footprints to grow more slowly, decreasing the rate of complexity growth
and scaling issues.
Much more important, however, is that with more hardware resources per node it will now be possible
to dedicate some small to moderate amount of processing power to handling OS tasks while minimizing
the impact of that processing on application performance. The ability to fence off application workloads
from OS processing (using mechanisms like processor sets, processor bindings, and the disabling of
device interrupt processing) should allow applications to run with reduced jitter while still
supporting a full, standard OS environment on each node, and not having to resort to microkernel
or other approaches to deliver low jitter. Using standard OSes (possibly stripped down by removing
or disabling unneeded functions) is very important for several reasons.
As mentioned earlier, OS capabilities are becoming more important, not less. I've mentioned
fault management and scalability. In a few slides we'll talk about power management, virtualization and
other capabilities that will be needed for successful HPC installations in the future. Attempt to
build all of that into a microkernel and you end up with a kernel. You might as well start with all
of the learning and innovation that has accrued to standard OSes and minimize and
improve where necessary rather than building one-off or few-off custom software.
I worry about how well-served the very high end of the HPC market will be in the future. It isn't a large market or one
that is growing like other segments of HPC. While it is a segment that solves incredibly important
problems, it is also quite a difficult market for vendors to satisfy. The systems are huge, the software
scaling issues tremendously difficult, and, frankly, both the volume and the margins are low. This
segment has argued for many years that meeting their scaling and other requirements guarantees
that a vendor will be well-positioned to satisfy any other HPC customer's requirements. It is essentially the "scale
down" argument. But that argument is in jeopardy to the extent that the high-end community
embraces a different approach than is needed for the bulk of the HPC market. Commercial HPC
customers want and need a full instance of Solaris or Linux on their compute nodes because
they have both throughput and capability problems to run and because they run lots of ISV
applications. They don't want a microkernel or some other funky software environment.
I absolutely understand and respect the software work being done at our national labs and elsewhere
to take advantage of the large-scale systems they have deployed. But this does not stop me from
worrying about the ramifications of the high-end delaminating from the rest of the HPC market.
The HPC community is accustomed to being on the leading/bleeding edge, creating new technologies
that eventually flow into the larger IT markets. InfiniBand is one such example that is still in process.
Parallel computing techniques may be another as the need for parallelization begins to be felt
far beyond HPC due to the tailing off of clock speed increases and the emergence of multi-core CPUs.
Virtualization is an example of the opposite. This is a trend that is taking the enterprise computing markets
by storm. The HPC community to date has not been interested in a technology that in their view simply adds another
layer of "stuff" between the hardware and their applications, reducing performance. I would argue that virtualization
is coming and will be ubiquitous, and the HPC community needs to be actively engaged to 1) influence virtualization
technology to align it with HPC needs, and 2) find ways in which virtualization can be used to advantage within
the HPC community rather than simply being victimized by it. It is coming, so let's embrace it.
The two immediate challenges are both performance related: base performance of applications in a
virtualized environment and virtualizing the InfiniBand (or Ethernet) interconnect while still being
able to deliver high performance on distributed applications, including both compute and storage. The first issue may not be a large one
since most HPC codes are compute intensive and such code should run at nearly full speed in a virtualized
environment. And early
research on virtualized IB, for example by DK Panda's Network-based Computing Laboratory at OSU, has shown promising results.
In addition, the PCI-IOV (IO Virtualization) standard will add hardware support for PCI virtualization
that should help achieve high performance.
What about the potential benefits of virtualization for HPC? I can think of several possibilities:
- Coupling live migration [PDF] with fault management to dynamically shift a running guest OS instance off of
a failing node, thereby avoiding application interruptions.
- Using the clean interface between hypervisor and Guest OS instances to perform checkpointing
of a guest OS instance (or many instances in the case of an MPI job) rather than attempting to checkpoint
individual processes within an OS instance. The HPC community has tried for many years to create the
latter capability, but there are always limitations. Perhaps we can do a more complete job by working
at this lower level.
- Virtualization can enable higher utilization of available resources (those beefy nodes again) while
maintaining a security and failure barrier between applications and users. This is ideal in
academic or other service environments in which multi-tenancy is an issue, for example in cases
where academic users and industry partners with privacy or security concerns share the same systems.
- Virtualization can also be used to decrease the administrative burden on system administration staff
and allow it to be more responsive to the needs of its user population. For example, a Solaris or Linux-based
HPC installation could easily allow virtualized Windows-based ISV applications to be run dynamically
on its systems without having to permanently maintain Windows systems in the environment.
The key point is that virtualization is coming and we as a community should find the best ways of
using the technology to our advantage. The above are some ideas--I'd like to hear others.
The power and cooling challenges are straightforward; the solutions are not. We must deliver effective power management
capabilities for compute, storage, and interconnect that support high performance, but also deliver
significant improvements over current practice. To do this effectively requires work across the entire
hardware and software stack. Processor and system design. Operating system design. System management
framework design. And it will require a very comprehensive review of coding practices at all levels of
the stack. Polling is not eco-efficient, for example.
Power management issues are yet another reason why operating systems become more important to
HPC as capabilities developed primarily for the much larger enterprise computing markets gain relevance
for HPC customers.
Here I sketch a little of Sun's approach with respect to OSes for HPC. First, we will offer both Linux and
Solaris-based HPC solutions, including a full stack of HPC software on top of the base operating systems.
We recognize quite clearly the position currently held by Linux in HPC
and see no reason why we should not be a preferred provider of such systems. At the same time, we
believe there is a strong value proposition for Solaris in HPC and that we can deliver performance along
with an array of increasingly relevant enterprise-derived capabilities that will benefit the HPC community.
We also realize it is incumbent upon us to prove this point to you and we intend to do so.
I will finish by commenting on one bullet on this final slide. For the other products and technologies, I will
defer to future blog posts. The item I want to end with is Project Indiana and OpenSolaris, due to their relevance
to HPC customers.
In 2005, Sun created an open source and open development effort based on the source code for the Solaris
Operating System. Called OpenSolaris, the community now numbers
well over 75,000 members and continues to grow.
Project Indiana is an OpenSolaris project whose goal is to produce OpenSolaris binary distributions that will
be made freely available to anyone with optional for-fee support available from Sun. An important part of
this project is a modernization effort that moves Solaris to a network-based package management system,
updates many of the open-source utilities that are included in the distro, and adds open-source programs
and utilities that are commonly expected to be present. To those familiar with Linux, the OpenSolaris user
experience should become much more familiar as these changes roll out. In my view, this was a necessary
and long-overdue step towards lowering the barrier for Linux (and other) users, enabling them to more
easily step into the Solaris environment and benefit from the many innovations we've introduced (see
slide for some examples.)
In addition to OpenSolaris binary distros, you will see other derivative distros appearing. In particular, we are
working to define an OpenSolaris-based distro that will include a full HPC software stack and will address
both developers and deployers. This effort has been running within Sun for a while and will soon transition to an OpenSolaris project so that we can more easily solicit community involvement in this
effort. This Solaris HPC distro is meant to complement similar work being done by a Linux-focused
engineering team within our expanded Lustre group, which is also doing its work in the open and also
encourages community involvement.
There was some grumbling at the HPC User Forum about the general Linux community and its lack of focus or
interest in HPC. While clearly there have been some successes (for example, some of the Linux scaling
work done by SGI), there is frustration. One specific example mentioned was the difficulty in getting InfiniBand support
into the Linux kernel. My comment on that? We don't need to ask Linus' permission to put HPC-enabling
features into Solaris. In fact, with our CEO making it very clear that HPC is one of the top three strategic
focus areas for Sun [PDF, 2MB], we welcome HPC community involvement in OpenSolaris. It's free, it's open, and
we want Solaris to be your operating system of choice for HPC.