Tuesday Apr 06, 2010

WRF Benchmark: X6275 Beats Power6

Significance of Results

Oracle's Sun Blade X6275 cluster is 28% faster than the IBM POWER6 cluster on Weather Research and Forecasting (WRF) continental United Status (CONUS) benchmark datasets. The Sun Blade X6275 cluster used a Quad Data Rate (QDR) InfiniBand connection along with Intel compilers and MPI.

  • On the 12 km CONUS data set, the Sun Blade X6275 cluster was 28% faster than the IBM POWER6 cluster at 512 cores.

  • The Sun Blade X6275 cluster with 768 cores (one full Sun Blade 6048 chassis) was 47% faster than 1024 cores of the IBM POWER6 cluster (multiple racks).

  • On the 2.5 km CONUS data set, the Sun Blade X6275 cluster was 21% faster than the IBM POWER6 cluster at 512 cores.

  • The Sun Blade X6275 cluster with 768 cores (on full Sun Blade 6048 chassis) outperforms the IBM Power6 cluster with 1024 cores by 28% on the 2.5 km CONUS dataset.

Performance Landscape

The performance in GFLOPS is shown below on multiple datasets.

Weather Research and Forecasting
CONUS 12 KM Dataset
Cores Performance in GFLOPS
Sun
X6275
Intel
Whitebox
IBM
POWER6
Cray
XT5
SGI TACC
Ranger
Blue
Gene/P
8 17.5 19.8 17.0
10.2

16 38.7 37.5 33.6 21.4 20.1 10.8
32 71.6 73.3 66.5 40.4 39.8 21.2 5.9
64 132.5 131.4 117.2 75.2 77.0 37.8
128 235.8 232.8 209.1 137.4 114.0 74.5 20.4
192 323.6





256 405.2 415.1 363.1 243.2 197.9 121.0 37.4
384 556.6





512 691.9 696.7 542.2 392.2 375.2 193.9 65.6
768 912.0






1024

618.5 634.1 605.9 271.7 108.5
1700



840.1


2048





175.6

All cores used on each node which participates in each run.

Sun X6275 - 2.93 GHz X5570, InfiniBand
Intel Whitebox - 2.8 GHz GHz X5560, InfiniBand
IBM POWER6 - IBM Power 575, 4.7 GHz POWER6, InfiniBand, 3 frames
Cray XT5 - 2.7 GHz AMD Opteron (Shanghai), Cray SeaStar 2.1
SGI - best of a variety of results
TACC Ranger - 2.3 GHz AMD Opteron (Barcelona), InfiniBand
Blue Gene/P - 850 MHz PowerPC 450, 3D-Torus (proprietary)

Weather Research and Forecasting
CONUS 2.5 KM Dataset
Cores Performance in GFLOPS
Sun
X6275
SGI
8200EX
Blue
Gene/L
IBM
POWER6
Cray
XT5
Intel
Whitebox
TACC
Ranger
16 35.2






32 69.6

64.3


64 140.2

130.9
147.8 24.5
128 278.9 89.9
242.5 152.1 290.6 87.7
192 400.5





256 514.8 179.6 8.3 431.3 306.3 535.0 145.3
384 735.1





512 973.5 339.9 16.5 804.4 566.2 1019.9 311.0
768 1367.7





1024
721.5 124.8 1067.3 1075.9 1911.4 413.4
2048
1389.5 241.2
1849.7 3251.1
2600




4320.6
3072
1918.7 350.5
2651.3

4096
2543.5 453.2
3288.7

6144
3057.3 642.3
4280.1

8192
3569.7 820.4
5140.4

18432

1238.0



Sun X6275 - 2.93 GHz X5570, InfiniBand
SGI 8200EX - 3.0 GHz E5472, InfiniBand
Blue Gene/L - 700 MHz PowerPC 440, 3D-Torus (proprietary)
IBM POWER6 - IBM Power 575, 4.7 GHz POWER6, InfiniBand, 3 frames
Cray XT5 - 2.4 GHz AMD Opteron (Shanghai), Cray SeaStar 2.1
Intel Whitebox - 2.8 GHz GHz X5560, InfiniBand
TACC Ranger - 2.3 GHz AMD Opteron (Barcelona), InfiniBand

Results and Configuration Summary

Hardware Configuration:

48 x Sun Blade X6275 server modules, 2 nodes per blade, each node with
2 Intel Xeon X5570 2.93 GHz processors, turbo enabled, ht disabled
24 GB memory
QDR InfiniBand

Software Configuration:

SUSE Linux Enterprise Server 10 SP2
Intel Compilers 11.1.059
Intel MPI 3.2.2
WRF 3.0.1.1
WRF 3.1.1
netCDF 4.0.1

Benchmark Description

The Weather Research and Forecasting (WRF) Model is a next-generation mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs. WRF is designed to be a flexible, state-of-the-art atmospheric simulation system that is portable and efficient on available parallel computing platforms. It features multiple dynamical cores, a 3-dimensional variational (3DVAR) data assimilation system, and a software architecture allowing for computational parallelism and system extensibility.

There are two fixed-size benchmark cases.

Single domain, medium size 12KM Continental US (CONUS-12K)

  • 425x300x35 cell volume
  • 48hr, 12km resolution dataset from Oct 24, 2001
  • Benchmark is a 3hr simulation for hrs 25-27 starting from a provided restart file
  • Iterations output at every 72 sec of simulation time, with the computation cost of each time step ~30 GFLOP

Single domain, large size 2.5KM Continental US (CONUS-2.5K)

  • 1501x1201x35 cell volume
  • 6hr, 2.5km resolution dataset from June 4, 2005
  • Benchmark is the final 3hr simulation for hrs 3-6 starting from a provided restart file; the benchmark may also be performed (but seldom reported) for the full 6hrs starting from a cold start
  • Iterations output at every 15 sec of simulation time, with the computation cost of each time step ~412 GFLOP

See Also

Disclosure Statement

WRF, see http://www.mmm.ucar.edu/wrf/WG2/bench/, results as of 3/8/2010.

Friday Oct 09, 2009

X6275 Cluster Demonstrates Performance and Scalability on WRF 2.5km CONUS Dataset

Significance of Results

Results are presented for the Weather Research and Forecasting (WRF) code running on twelve Sun Blade X6275 server modules, housed in the Sun Blade 6048 chassis, using the 2.5 km CONUS benchmark dataset.

  • The Sun Blade X6275 cluster was able to achieve 373 GFLOP/s on the CONUS 2.5-KM Dataset.
  • The results demonstrate an 91% speedup efficiency, or 11x speedup, from 1 to 12 blades.
  • The current results results were run with turbo on.

Performance Landscape

Performance is expressed in terms "simulation speedup" which is the ratio of the simulated time step per iteration to the average wall clock time required to compute it. A larger number implies better performance.

The current results were run with turbo mode on.

WRF 3.0.1.1: Weather Research and Forecasting CONUS 2.5-KM Dataset
#
Blade
#
Node
#
Proc
#
Core
Performance
(Simulation Speedup)
Computation Rate
GFLOP/sec
Speedup/Efficiency
(vs. 1 blade)
Turbo On
Relative Perf
Turbo On Turbo Off Turbo On Turbo Off Turbo On Turbo Off
12 24 48 192 13.58 12.93 373.0 355.1 11.0 / 91% 10.4 / 87% +6%
 8  16  32  128  9.27
254.6
 7.5 / 93% 

 6 12 24  96  7.03  6.60 193.1 181.3  5.7 / 94%  5.3 / 89% +7%
 4  8  16  64  4.74
130.2
 3.8 / 96% 

 2  4  8  32  2.44
67.0
 2.0 / 98% 

 1  2  4  16  1.24  1.24 34.1 34.1 1.0 / 100% 1.0 / 100% +0%

Results and Configuration Summary

Hardware Configuration:

    Sun Blade 6048 Modular System
      12 x Sun Blade X6275 Server Modules, each with
        4 x 2.93 GHz Intel QC X5570 processors
        24 GB (6 x 4GB)
        QDR InfiniBand
        HT disabled in BIOS
        Turbo mode enabled in BIOS

Software Configuration:

    OS: SUSE Linux Enterprise Server 10 SP 2
    Compiler: PGI 7.2-5
    MPI Library: Scali MPI v5.6.4
    Benchmark: WRF 3.0.1.1
    Support Library: netCDF 3.6.3

Benchmark Description

The Weather Research and Forecasting (WRF) Model is a next-generation mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs. WRF is designed to be a flexible, state-of-the-art atmospheric simulation system that is portable and efficient on available parallel computing platforms. It features multiple dynamical cores, a 3-dimensional variational (3DVAR) data assimilation system, and a software architecture allowing for computational parallelism and system extensibility.

Dataset used:

    Single domain, large size 2.5KM Continental US (CONUS-2.5K)

    • 1501x1201x35 cell volume
    • 6hr, 2.5km resolution dataset from June 4, 2005
    • Benchmark is the final 3hr simulation for hrs 3-6 starting from a provided restart file; the benchmark may also be performed (but seldom reported) for the full 6hrs starting from a cold start
    • Iterations output at every 15 sec of simulation time, with the computation cost of each time step ~412 GFLOP

Key Points and Best Practices

  • Processes were bound to processors in round-robin fashion.
  • Model simulation time is 15 seconds per iteration as defined in input job deck. An achieved speedup of 2.67X means that each model iteration of 15s of simulation time was executed in 5.6s of real wallclock time (on average).
  • Computational requirements are 412 GFLOP per simulation time step as (measured empirically and) documented on the UCAR web site for this data model.
  • Model was run as single MPI job.
  • Benchmark was built and run as a pure-MPI variant. With larger process counts building and running WRF as a hybrid OpenMP/MPI variant may be more efficient.
  • Input and output (netCDF format) datasets can be very large for some WRF data models and run times will generally benefit by using a scalable filesystem. Performance with very large datasets (>5GB) can benefit by enabling WRF quilting of I/O across designated processors/servers. The master thread (or rank-0) performs most of the I/O (unless quilting specifies otherwise), with all processes potentially generating some I/O.

See Also

Disclosure Statement

WRF, CONUS-2.5K, see http://www.mmm.ucar.edu/wrf/WG2/bench/, results as of 9/21/2009.
About

BestPerf is the source of Oracle performance expertise. In this blog, Oracle's Strategic Applications Engineering group explores Oracle's performance results and shares best practices learned from working on Enterprise-wide Applications.

Index Pages
Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today