X6275 Cluster Demonstrates Performance and Scalability on WRF 2.5km CONUS Dataset
By Paul Kinney on Oct 09, 2009
Significance of Results
Results are presented for the Weather Research and Forecasting (WRF) code running on twelve Sun Blade X6275 server modules, housed in the Sun Blade 6048 chassis, using the 2.5 km CONUS benchmark dataset.
- The Sun Blade X6275 cluster achieved 373 GFLOP/s on the CONUS 2.5-km dataset.
- The results demonstrate a 91% speedup efficiency, or 11x speedup, from 1 to 12 blades.
- The current results were run with turbo mode on.
Performance Landscape
Performance is expressed in terms of "simulation speedup," which is the ratio of the simulated time per iteration (the model time step) to the average wall clock time required to compute it. A larger number implies better performance.
WRF: Weather Research and Forecasting, CONUS 2.5-km Dataset

| Blades | Nodes | Chips | Cores | Sim. Speedup (Turbo On) | Sim. Speedup (Turbo Off) | GFLOP/s (Turbo On) | GFLOP/s (Turbo Off) | Speedup vs. 1 Blade (Turbo On) | Speedup vs. 1 Blade (Turbo Off) | Turbo Gain |
|--------|-------|-------|-------|-------------------------|--------------------------|--------------------|---------------------|--------------------------------|---------------------------------|------------|
| 12     | 24    | 48    | 192   | 13.58                   | 12.93                    | 373.0              | 355.1               | 11.0 / 91%                     | 10.4 / 87%                      | +6%        |
| 8      | 16    | 32    | 128   |                         |                          |                    |                     | 7.5 / 93%                      |                                 |            |
| 6      | 12    | 24    | 96    | 7.03                    | 6.60                     | 193.1              | 181.3               | 5.7 / 94%                      | 5.3 / 89%                       | +7%        |
| 4      | 8     | 16    | 64    |                         |                          |                    |                     | 3.8 / 96%                      |                                 |            |
| 2      | 4     | 8     | 32    |                         |                          |                    |                     | 2.0 / 98%                      |                                 |            |
| 1      | 2     | 4     | 16    | 1.24                    | 1.24                     | 34.1               | 34.1                | 1.0 / 100%                     | 1.0 / 100%                      | +0%        |
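The figures in the table above are internally consistent and can be cross-checked from the definitions in this post; a minimal sketch (the 15 s time step and 412 GFLOP/step cost are the values stated later in this article):

```python
# Cross-check the reported benchmark figures from the post's definitions.
# Simulation speedup = model time step (15 s) / average wall clock time per step.
SIM_STEP_S = 15.0        # simulation time advanced per iteration
GFLOP_PER_STEP = 412.0   # documented computational cost per time step

def gflops(sim_speedup):
    """Sustained GFLOP/s implied by a given simulation speedup."""
    wall_clock_per_step = SIM_STEP_S / sim_speedup
    return GFLOP_PER_STEP / wall_clock_per_step

# 12 blades, turbo on: simulation speedup 13.58
print(round(gflops(13.58), 1))   # ~373.0 GFLOP/s, as reported
# Scaling vs. 1 blade (simulation speedup 1.24):
print(round(13.58 / 1.24, 1))    # ~11.0x, i.e. ~91% efficiency on 12 blades
```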
Results and Configuration Summary
Sun Blade 6048 Modular System
12 x Sun Blade X6275 Server Modules, each with
4 x 2.93 GHz Intel QC X5570 processors
24 GB memory (6 x 4 GB)
HT disabled in BIOS
Turbo mode enabled in BIOS
OS: SUSE Linux Enterprise Server 10 SP 2
Compiler: PGI 7.2-5
MPI Library: Scali MPI v5.6.4
Benchmark: WRF
Support Library: netCDF 3.6.3
The Weather Research and Forecasting (WRF) Model is a next-generation mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs. WRF is designed to be a flexible, state-of-the-art atmospheric simulation system that is portable and efficient on available parallel computing platforms. It features multiple dynamical cores, a 3-dimensional variational (3DVAR) data assimilation system, and a software architecture allowing for computational parallelism and system extensibility.
- 1501x1201x35 cell volume
- 6hr, 2.5km resolution dataset from June 4, 2005
- Benchmark is the final 3hr simulation for hrs 3-6 starting from a provided restart file; the benchmark may also be performed (but seldom reported) for the full 6hrs starting from a cold start
- One iteration per 15 sec of simulation time, with a computational cost of ~412 GFLOP per time step
- Single domain, large-size 2.5 km Continental US (CONUS-2.5K)
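For a rough sense of scale, the grid dimensions and per-step cost above imply the following per-cell cost (simple arithmetic on the figures in this post, not a number reported by the benchmark):

```python
# Rough per-cell cost implied by the benchmark description above.
cells = 1501 * 1201 * 35          # ~63.1 million grid cells
gflop_per_step = 412.0            # documented cost of one 15 s time step

flop_per_cell = gflop_per_step * 1e9 / cells
print(f"{cells / 1e6:.1f} M cells, ~{flop_per_cell:.0f} FLOP per cell per step")
```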
Key Points and Best Practices
- Processes were bound to processors in round-robin fashion.
- Model simulation time is 15 seconds per iteration as defined in input job deck. An achieved speedup of 2.67X means that each model iteration of 15s of simulation time was executed in 5.6s of real wallclock time (on average).
- Computational requirements are 412 GFLOP per simulation time step, as measured empirically and documented on the UCAR web site for this data model.
- Model was run as a single MPI job.
- Benchmark was built and run as a pure-MPI variant. With larger process counts, building and running WRF as a hybrid OpenMP/MPI variant may be more efficient.
- Input and output datasets (netCDF format) can be very large for some WRF data models, and run times will generally benefit from a scalable filesystem. Performance with very large datasets (>5 GB) can benefit from enabling WRF quilting of I/O across designated processors/servers. The master process (rank 0) performs most of the I/O unless quilting specifies otherwise, with all processes potentially generating some I/O.
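I/O quilting is configured in WRF's namelist.input via the &namelist_quilt block; a minimal sketch (the task counts below are illustrative, not the values used in this benchmark):

```
&namelist_quilt
 nio_tasks_per_group = 2,   ! MPI tasks dedicated to I/O in each group (illustrative)
 nio_groups          = 1,   ! number of I/O server groups (illustrative)
/
```

Setting nio_tasks_per_group = 0 (the default) disables quilting, in which case rank 0 handles most of the output.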