Schulz (Associate Director, HPC, Texas Advanced Computing
Center) gave an update on Ranger, including current usage
statistics as well as some of the interesting technical issues
they've confronted since bringing the system online last year.
Karl started with a system overview, which I will skip in favor
of pointing to an earlier Ranger blog entry that describes the
configuration in detail. Note, however, that Ranger is now running
with 2.3 GHz Barcelona processors.
As of November 2008, Ranger has more than 1500 allocated users who
represent more than 400 individual research projects. Over 300K jobs
have been run so far on the system, consuming a total of 220 million CPU hours.
When TACC brought their 900-terabyte Lustre filesystem online, they wondered how long it would take to fill. It took six months.
Just six months to generate 900 terabytes of data. Not surprising, I guess, when you hear that users generate between 5 and 20 terabytes
of data per day on Ranger.
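A quick sanity check on those figures: at 5 TB per day it takes about 180 days to accumulate 900 TB, and at 20 TB per day only about 45, so six months is consistent with the slower end of that range.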
Now that they've turned on their file purging policy, files currently
reside on the filesystem for about 30
days before they are purged, which is quite good as supercomputing centers go.
Here are some of the problems Karl described.
OS jitter. For those not familiar, this phrase refers to a
degradation seen by very large MPI jobs that is caused by a lack of
natural synchronization between participating nodes due to unrelated
performance perturbations on individual nodes. Essentially some nodes
fall slightly behind, which slows down MPI synchronization operations,
which can in turn have a large effect on overall application performance.
The worse the loss
of synchronization, the longer certain MPI operations take to complete,
and the larger the overall impact on application performance.
A user reported severe performance problems with a somewhat unusual
application that performed about 100K MPI_Allreduce operations with
a small amount of intervening computation between each Allreduce.
When running on 8K cores, there was a very large performance difference
between runs using 15 processes per node and runs using 16 processes
per node: the 16-process-per-node runs showed drastically lower performance.
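To make the pattern concrete, here's a minimal sketch (my own illustration, not the user's actual code) of the kind of loop that exposes this sensitivity: a long series of small MPI_Allreduce calls with only a little computation between them, so any rank that arrives late at a collective delays every other rank.

    /* allreduce_loop.c -- illustrative sketch of the access pattern
     * described above: ~100K small Allreduce operations with a little
     * work in between. Any node that falls behind (say, because a local
     * daemon wakes up) delays every other rank at the next collective. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iters = 100000;             /* on the order of the reported run */
        double local = rank + 1.0, global = 0.0;

        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            local = local * 1.0000001 + 0.5;  /* small intervening computation */
            MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("%d allreduces in %.3f s\n", iters, t1 - t0);

        MPI_Finalize();
        return 0;
    }

A job that shows a large 15-versus-16-processes-per-node gap on a loop like this is a strong hint that something on the node, rather than the MPI library, is stealing cycles.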
As it turned out, the MPI implementation was not at fault. Instead,
the issue was traced primarily to two causes: first, an IPMI daemon
that was running on each node, and second, another daemon being
used to gather fine-grained health-monitoring information to feed
into Sun Grid Engine. Once the IPMI daemon was disabled and
some performance optimization work was done on the health daemon,
the 15- and 16-process runs showed almost identical run times.
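One common way to go hunting for this kind of noise (again, a sketch of the general technique, not TACC's actual diagnostic) is a fixed-work benchmark: time the same small chunk of computation over and over on a single node, and look for iterations that take noticeably longer than the minimum. Those outliers are typically the OS or a daemon stealing the core.

    /* fwq.c -- fixed-work sketch for spotting OS noise on a node.
     * Time an identical quantum of computation many times; samples that
     * take much longer than the minimum indicate interference from
     * daemons, interrupts, or other OS activity. */
    #include <stdio.h>
    #include <time.h>

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        const int samples = 10000;
        double min = 1e9, max = 0.0;
        volatile double x = 1.0;

        for (int s = 0; s < samples; s++) {
            double t0 = now_sec();
            for (int i = 0; i < 200000; i++)   /* fixed quantum of work */
                x = x * 1.0000001 + 0.5;
            double dt = now_sec() - t0;
            if (dt < min) min = dt;
            if (dt > max) max = dt;
        }

        printf("min %.6f s  max %.6f s  (max/min = %.2fx)\n", min, max, max / min);
        return 0;
    }

On a quiet node the max/min ratio stays close to 1; a daemon that wakes up periodically shows up as occasional samples several times the minimum.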
Karl also showed an example of how NUMA effects at scale can cause
significant performance issues. In particular, it isn't sufficient
to deal with processor affinity without also paying attention to
memory affinity. Off-socket memory access can kill application
performance in some cases, as in the CFD case shown during the