Professor Satoshi Matsuoka from the
Tokyo Institute of Technology
gave a really excellent talk this afternoon
about using GPUs for HPC at the
HPC Consortium Meeting here in Austin.
As you may know, the Tokyo Institute of Technology is the home of TSUBAME, the largest supercomputer
in Asia. It is an InfiniBand cluster of 648 Sun Fire x4600 compute nodes, many with Clearspeed accelerator cards installed.
The desire is to continue to scale TSUBAME into a petascale computing resource over time. However,
power is a huge problem at the site. The machine is responsible for roughly 10% of the overall power
consumption of the Institute and therefore they cannot expect their power budget to grow over time.
The primary question, then, is how to add significant compute capacity to the machine while working
within a constant power budget.
It was clear from their analysis that conventional CPUs would not allow them to reach their
performance goals while also satisfying the no-growth power constraint. GPUs (graphics
processing units like those made by nVidia) looked appealing in that they claim extremely
high floating point capabilities and deliver this performance at a much better performance/watt
ratio than conventional CPUs. The question, though, is whether GPUs can be used to significantly
accelerate important classes of HPC computations or whether they are perhaps too specialized
to be considered for inclusion in a general-purpose compute resource like TSUBAME. Professor
Matsuoka's talk focused on this question.
The talk approached the question by presenting performance speed-up results for a selection
of important HPC applications or computations based on algorithmic work done by Prof. Matsuoka
and other researchers at the Institute. These studies were done in part because GPU vendors do
a very poor job of describing exactly what GPUs are good for and what problems are perhaps
not handled well by GPUs. By assessing the capabilities over a range of problem areas, it
was hoped that conclusions could be drawn about the general utility of the GPU approach.
The first problem examined was a 3D protein docking analysis that
performs an all-to-all analysis of 1K proteins to 1K proteins. Based
on their estimates, a single protein-protein interaction analysis
requires about 200 TeraOps while the full 1000x1000 problem
requires about 200 ExaOps. In order to maximally exploit GPUs
for this problem, a new 3D FFT algorithm was developed that
in the end delivered excellent performance and 4x better
performance/watt than IBM's BG/L system, which itself is
much more efficient than a more conventional cluster approach.
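As a quick sanity check on those two operation counts, here is a minimal sketch of the arithmetic. The figures are the ones quoted in the talk; the variable names are mine and purely illustrative.

```python
# Operation counts as quoted in the talk; names are illustrative only.
TERA = 10**12
EXA = 10**18

ops_per_pair = 200 * TERA     # one protein-protein docking analysis
proteins_per_side = 1000      # "1K proteins to 1K proteins"

# An all-to-all comparison touches every (protein, protein) pair.
pair_count = proteins_per_side * proteins_per_side
total_ops = ops_per_pair * pair_count

print(f"pairs: {pair_count:,}")
print(f"total: {total_ops // EXA} ExaOps")
```

The factor of one million between the single-pair and full-problem figures is simply the number of pairs in the 1000x1000 comparison.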
In addition, other algorithmic work delivered speedups of 45X over single conventional
CPUs for CFD, which is typically limited by available memory
bandwidth. Likewise, a liquid phase-separation computation
delivered a speedup of 160X over a conventional processor.
Comparing single-node CPU performance with a single-node GPU approach
showed that GPUs do appear to be
able to deliver interesting performance and performance/watt for an
array of useful problem types, so long as new algorithms can be
created to exploit the specific capabilities of these GPUs. The next question was whether
these results could be extended to multi-GPU and cluster environments.
To test this, the team worked with the RIKEN Himeno CFD benchmark,
which is considered the worst memory bandwidth-limited code one
will ever see. It is actually worse than any real application one would
ever encounter. If this could be parallelized and used with GPUs
to advantage, then other less difficult codes should also benefit
from the GPU approach.
For this purpose, the code was parallelized to run on multiple GPUs
per node, with MPI as the communication mechanism between nodes.
Results showed about a 50X performance improvement over a conventional
CPU cluster on a small-sized problem.
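To give a feel for what this kind of parallelization involves, here is a minimal sketch of domain decomposition with halo exchange. It is not the team's code and rests on several assumptions of mine: a 1-D Jacobi relaxation stands in for Himeno's 3-D stencil, each "slab" stands in for the portion of the grid owned by one GPU or MPI rank, and copying list elements in exchange_halos stands in for the MPI messages. All function names are hypothetical.

```python
def jacobi_step(u):
    """One Jacobi sweep on a 1-D grid; the two end values are held fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = 0.5 * (u[i - 1] + u[i + 1])
    return new


def split_with_halos(u, parts):
    """Split the interior points into slabs, each padded with one halo cell per side."""
    size = (len(u) - 2) // parts      # assumes the interior divides evenly
    slabs = []
    for p in range(parts):
        lo = 1 + p * size             # first grid index owned by slab p
        slabs.append(u[lo - 1: lo + size + 1])
    return slabs


def exchange_halos(slabs):
    """Copy each slab's edge cells into its neighbours' halos (stand-in for MPI)."""
    for left, right in zip(slabs, slabs[1:]):
        left[-1] = right[1]           # right's first owned cell
        right[0] = left[-2]           # left's last owned cell


def run_decomposed(u, parts, steps):
    """Relax the grid slab by slab, refreshing halos before every sweep."""
    slabs = split_with_halos(u, parts)
    for _ in range(steps):
        exchange_halos(slabs)
        slabs = [jacobi_step(s) for s in slabs]
    # Stitch the owned (non-halo) cells back together between the fixed ends.
    result = [u[0]]
    for s in slabs:
        result.extend(s[1:-1])
    result.append(u[-1])
    return result


grid = [0.0] * 13 + [1.0]             # 12 interior points, boundaries 0 and 1
whole = grid
for _ in range(50):
    whole = jacobi_step(whole)
print("decomposed run matches single-domain run:",
      run_decomposed(grid, parts=4, steps=50) == whole)
```

Each sweep needs only one cell from each neighbour, so the communication per step is small relative to the work on the slab. Even so, the exchange has to happen before every sweep, which is part of why a bandwidth-limited code like this is a demanding test.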
A multi-GPU parallel sparse solver was also created, which showed
a 25X-35X improvement over conventional CPUs. This was accomplished
with double precision implemented via mixed-precision techniques.
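The talk summary above does not say which mixed-precision technique was used. One common approach is iterative refinement: do the expensive solve in single precision, then correct the answer using residuals computed in double precision. The sketch below illustrates that idea only, under assumptions of mine: a tiny dense 3x3 system rather than a sparse one, single precision emulated on the CPU by rounding through Python's struct module, and hypothetical function names throughout.

```python
import struct


def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def solve_single(A, b):
    """Gaussian elimination without pivoting, rounding every result to float32."""
    n = len(b)
    M = [[f32(v) for v in row] + [f32(rhs)] for row, rhs in zip(A, b)]
    for k in range(n):
        for i in range(k + 1, n):
            factor = f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = f32(M[i][j] - f32(factor * M[k][j]))
    x = [0.0] * n
    for i in reversed(range(n)):      # back substitution
        s = M[i][n]
        for j in range(i + 1, n):
            s = f32(s - f32(M[i][j] * x[j]))
        x[i] = f32(s / M[i][i])
    return x


def residual(A, x, b):
    """r = b - A x, accumulated in double precision."""
    return [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]


def refine(A, b, sweeps=5):
    """Single-precision solves plus double-precision residual corrections."""
    x = solve_single(A, b)
    for _ in range(sweeps):
        r = residual(A, x, b)         # double precision
        dx = solve_single(A, r)       # single precision
        x = [xi + di for xi, di in zip(x, dx)]
    return x


# A small, diagonally dominant test system (illustrative values).
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0]
for name, x in (("single only", solve_single(A, b)), ("refined", refine(A, b))):
    worst = max(abs(r) for r in residual(A, x, b))
    print(f"{name:12s} max residual = {worst:.2e}")
```

The appeal for GPUs of that era is that the bulk of the arithmetic runs in the fast single-precision path, while the cheap residual step restores double-precision accuracy for well-conditioned problems.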
While all of these results seemed promising, could such a GPU approach
be deployed at scale in a very large cluster rather than just within a single
node or across a modest-sized cluster? The Institute decided to find out
by teaming with nVidia and Sun to enhance TSUBAME by adding Tesla
GPUs to some (in fact, most) nodes.
Installing the Tesla cards into the system went very smoothly and resulted
in three classes of nodes: those with both Clearspeed and Tesla installed,
those with only Tesla installed, and those Opteron nodes with neither
kind of accelerator installed.
Could this funky array of heterogeneous nodes be harnessed
to deliver an interesting LINPACK number? It turns out that it could, with much
work and in spite of the fact that there was limited bandwidth in the upper links
of the InfiniBand fabric and that they had limited PCI-X/PCIe bandwidth
available in the nodes (I believe due to the number and types of slots
available in the x4600 and the number of required devices in some of
the TSUBAME compute nodes).
As a result of the LINPACK work (which could have used more time; it was deadline-limited),
the addition of GPU capability in TSUBAME allowed its LINPACK number to be raised
from 67.7 TFLOPs, which was reported in June, to a new high of 77.48 TFLOPs,
an impressive increase of roughly 14%.
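For anyone who wants the size of that jump in relative terms, here is the arithmetic as a small sketch. The two TFLOPs figures are the ones quoted above; the variable names are mine.

```python
before_tflops = 67.7     # LINPACK result reported in June, as quoted
after_tflops = 77.48     # LINPACK result with the Tesla cards added

gain = after_tflops - before_tflops
percent = 100 * gain / before_tflops

print(f"gain: {gain:.2f} TFLOPs ({percent:.1f}%)")
```

This compares the two reported LINPACK results only. It says nothing about the fraction of theoretical peak achieved.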
With the Tesla cards installed, TSUBAME can now be viewed as a 900 TFLOPs (single
precision) or 170 TFLOPs (double precision) machine, one that has either 10K
cores or 300K SIMD cores if one counts the components embedded within each accelerator.
The conclusion is pretty clearly that GPUs can be used to significant advantage
on an interesting range of HPC problem types, though it is worth noting that
clever new algorithms may need to be developed to map these
problems efficiently onto GPU compute resources.