By Josh Simons on Nov 16, 2008
Karl Schulz, Associate Director for HPC at the Texas Advanced Computing Center, gave an update on Ranger, covering current usage statistics as well as some of the interesting technical issues the center has confronted since bringing the system online last year.
Karl started with a system overview, which I will skip in favor of pointing to an earlier Ranger blog entry that describes the configuration in detail. Note, however, that Ranger is now running with 2.3 GHz Barcelona processors. As of November 2008, Ranger has more than 1500 allocated users representing more than 400 individual research projects. Over 300K jobs have been run on the system so far, consuming a total of 220 million CPU hours.
When TACC brought their 900 TeraByte Lustre filesystem online, they wondered how long it would take to fill it. It took six months. Just six months to generate 900 TeraBytes of data. Not surprising, I guess, when you hear that users generate between 5 and 20 TeraBytes of data per day on Ranger. Now that they've turned on their file purging policy, files currently reside on the filesystem for about 30 days before they are purged, which is quite good as supercomputing centers go.
Here are some of the problems Karl described.
OS jitter. For those not familiar, this phrase refers to a sometimes-significant performance degradation seen by very large MPI jobs, caused by a lack of natural synchronization between participating nodes due to unrelated performance perturbations on individual nodes. Essentially, some nodes fall slightly behind, which slows down MPI synchronization operations, which can in turn have a large effect on overall application performance: the worse the loss of synchronization, the longer certain MPI operations take to complete, and the larger the overall impact.
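To see why tiny per-node perturbations matter at scale, here is a toy simulation (my own sketch, not from the talk; all numbers are illustrative): each iteration of a bulk-synchronous job ends in a collective, so its duration is the maximum over all ranks' work times, and even a small per-node probability of an OS interruption means some rank is almost always delayed when thousands of ranks participate.

```python
import random

def run_job(ranks, iters, work=1.0, jitter_prob=0.0, jitter_cost=50.0, seed=0):
    """Simulate a bulk-synchronous job. Each iteration ends in a
    collective, so its duration is the MAX over all ranks' work times.
    jitter_prob is the per-rank, per-iteration chance of an OS
    interruption (e.g. a daemon waking up) costing jitter_cost extra.
    All units are arbitrary 'work units', not measured data."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(iters):
        slowest = 0.0
        for _ in range(ranks):
            t = work
            if rng.random() < jitter_prob:
                t += jitter_cost  # this rank got preempted this iteration
            slowest = max(slowest, t)
        total += slowest  # everyone waits for the slowest rank
    return total

clean = run_job(ranks=4096, iters=500)                     # no jitter
noisy = run_job(ranks=4096, iters=500, jitter_prob=0.001)  # 0.1% per node
```

With 4096 ranks, a mere 0.1% per-node interruption probability makes it near-certain (about 1 - 0.999^4096 ≈ 98%) that at least one rank is delayed in any given iteration, so the jittered run is many times slower than the clean one even though each individual node is barely perturbed.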
A user reported severe performance problems with a somewhat unusual application that performed about 100K MPI_Allreduce operations with a small amount of intervening computation between each one. When running on 8K cores, a very large performance difference was seen between 15 processes per node and 16 processes per node: the 16-process-per-node runs were drastically slower.
As it turned out, the MPI implementation was not at fault. Instead, the issue was traced primarily to two causes: first, an IPMI daemon running on each node, and second, another daemon used to gather fine-grained health-monitoring information to be fed into Sun Grid Engine. Once the IPMI daemon was disabled and some performance optimization work was done on the health daemon, the 15- and 16-process runs showed almost identical run times.
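A back-of-the-envelope model (my own sketch, with made-up numbers, not measurements from the talk) suggests why leaving one core free masks the problem: with a rank on every core, the daemon's CPU time preempts whichever rank shares its core, and every collective then waits on that rank; with 15 ranks per node, the daemon runs on the idle core and delays nobody.

```python
def per_collective_time(ranks_per_node, cores=16, work=1.0, daemon=0.2):
    """Toy model of one node's contribution to a collective.
    If every core hosts a rank, the daemon preempts one of them and
    the whole collective waits that much longer; with a spare core,
    the daemon's work is absorbed there. Units are arbitrary."""
    if ranks_per_node >= cores:
        return work + daemon  # some rank lost 'daemon' units of CPU
    return work               # idle core soaked up the daemon

# The application did ~100K collectives, so the per-collective
# penalty is multiplied 100,000 times over the run.
full_node = 100_000 * per_collective_time(16)
spare_core = 100_000 * per_collective_time(15)
```

The model also shows why fixing the daemons (shrinking `daemon` toward zero) brings the 15- and 16-process runs back into line, as TACC observed.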
Karl also showed an example of how NUMA effects at scale can cause significant performance issues. In particular, it isn't sufficient to deal with processor affinity without also paying attention to memory affinity. Off-socket memory access can kill application performance in some cases, as in the CFD case shown during the talk.
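As an illustration of the processor-versus-memory affinity point (my own sketch, not TACC's setup): on Linux, the default "first-touch" policy places a page on the NUMA node of the CPU that first writes it, so pinning a process and only then allocating and touching its memory yields memory affinity as a side effect of processor affinity. HPC launchers typically do this via `numactl` or MPI binding options; the minimal in-process version looks like this (Linux-only, uses `os.sched_setaffinity`).

```python
import os

def pin_to_cpu(cpu):
    """Pin the calling process to a single CPU so the scheduler cannot
    migrate it to another socket. Returns the resulting affinity mask.
    Linux-only: os.sched_setaffinity is not available everywhere."""
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)

mask = pin_to_cpu(0)

# Allocate AFTER pinning: under first-touch, these zero-filled pages
# are placed on the NUMA node local to CPU 0, so later accesses stay
# on-socket instead of crossing the interconnect.
data = bytearray(1 << 20)
```

If memory is instead allocated and touched before the process settles on its final CPU (or if pages migrate is never requested), the pinned process can spend its life reading off-socket memory, which is exactly the kind of at-scale performance loss the CFD example illustrated.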