Friday Nov 20, 2009

Sun Blade 6048 and Sun Blade X6275 NAMD Molecular Dynamics Benchmark beats IBM BlueGene/L

Significance of Results

A Sun Blade 6048 chassis with 48 Sun Blade X6275 server modules ran benchmarks using the NAMD molecular dynamics applications software. NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. NAMD is driven by major trends in computing and structural biology and received a 2002 Gordon Bell Award.

  • The cluster of 32 Sun Blade X6275 server modules was 9.2x faster than the 512 processor configuration of the IBM BlueGene/L.

  • The cluster of 48 Sun Blade X6275 server modules exhibited excellent scalability for NAMD molecular dynamics simulation, up to 37.8x speedup for 48 blades relative to 1 blade.

  • For the largest molecule considered, the cluster of 48 Sun Blade X6275 server modules achieved a throughput of 0.028 seconds per simulation step.

Molecular dynamics simulation is important to biological and materials science research. Molecular dynamics is used to determine the low-energy conformations or shapes of a molecule. These conformations are presumed to be the biologically active conformations.

Performance Landscape

The NAMD Performance Benchmarks web page plots the performance of NAMD when the ApoA1 benchmark is executed on a variety of clusters. The performance is expressed in terms of the time in seconds required to execute one step of the molecular dynamics simulation, multiplied by the number of "processors" on which NAMD executes in parallel. The following table compares the performance of the Sun Blade X6275 cluster to several of the clusters for which performance is reported on the web page. In this table, the performance is expressed in terms of the time in seconds required to execute one step of the molecular dynamics simulation. A smaller number implies better performance.

Cluster Name and Interconnect     Throughput (seconds per step)
                                  128 Cores    256 Cores    512 Cores
Sun Blade X6275 InfiniBand        0.014        0.0073       0.0048
Cambridge Xeon/3.0 InfiniPath     0.016        0.0088       0.0056
NCSA Xeon/2.33 InfiniBand         0.019        0.010        0.008
AMD Opteron/2.2 InfiniPath        0.025        0.015        0.008
IBM HPCx PWR4/1.7 Federation      0.039        0.021        0.013
SDSC IBM BlueGene/L MPI           0.108        0.061        0.044

The following tables report results for NAMD molecular dynamics using a cluster of Sun Blade X6275 server modules. The performance of the cluster is expressed in terms of the time in seconds that is required to execute one step of the molecular dynamics simulation. A smaller number implies better performance.

                 STMV molecule (1)              f1 ATPase molecule (2)         ApoA1 molecule (3)
Blades  Cores    Thruput      spdup  effi'cy    Thruput      spdup  effi'cy    Thruput      spdup  effi'cy
                 (secs/step)                    (secs/step)                    (secs/step)
48      768      0.0277       37.8   79%        0.0075       35.2   73%        0.0039       22.2   46%
36      576      0.0324       32.3   90%        0.0096       27.4   76%        0.0045       19.3   54%
32      512      0.0368       28.4   89%        0.0104       25.3   79%        0.0048       18.1   57%
24      384      0.0481       21.8   91%        0.0136       19.3   80%        0.0066       13.2   55%
16      256      0.0715       14.6   91%        0.0204       12.9   81%        0.0073       11.9   74%
12      192      0.0875       12.0   100%       0.0271       9.7    81%        0.0096       9.1    76%
8       128      0.1292       8.1    101%       0.0337       7.8    98%        0.0139       6.3    79%
4       64       0.2726       3.8    95%        0.0666       4.0    100%       0.0224       3.9    98%
1       16       1.0466       1.0    100%       0.2631       1.0    100%       0.0872       1.0    100%

spdup - speedup versus 1 blade result
effi'cy - speedup efficiency versus 1 blade result

(1) Satellite Tobacco Mosaic Virus (STMV) molecule, 1,066,628 atoms, 12 Angstrom cutoff, Langevin dynamics, 500 time steps
(2) f1 ATPase molecule, 327,506 atoms, 11 Angstrom cutoff, particle mesh Ewald dynamics, 500 time steps
(3) ApoA1 molecule, 92,224 atoms, 12 Angstrom cutoff, particle mesh Ewald dynamics, 500 time steps
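
The speedup and efficiency columns can be reproduced from the seconds-per-step figures. The following is a minimal sketch, assuming speedup is the 1-blade time divided by the N-blade time and efficiency is speedup divided by the blade count; the function name is illustrative and not part of NAMD.

```python
def speedup_and_efficiency(base_secs_per_step, secs_per_step, blades, base_blades=1):
    """Return (speedup, efficiency) relative to the baseline blade count.

    Assumption: speedup = baseline time / measured time, and
    efficiency = speedup / (blades / base_blades).
    """
    speedup = base_secs_per_step / secs_per_step
    efficiency = speedup / (blades / base_blades)
    return speedup, efficiency


# STMV figures from the table above: 1 blade versus 48 blades.
s, e = speedup_and_efficiency(1.0466, 0.0277, blades=48)
print(f"speedup {s:.1f}x, efficiency {e:.0%}")  # speedup 37.8x, efficiency 79%
```

Values slightly above 100% (such as the 8-blade STMV result) arise from rounding and from effects such as improved cache residency as the per-blade working set shrinks.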

Results and Configuration Summary

Hardware Configuration

    48 x Sun Blade X6275, each with
      2 x (2 x 2.93 GHz Intel QC Xeon X5570 (Nehalem) processors)
      2 x (24 GB memory)
      Hyper-Threading (HT) off, Turbo Mode on

Software Configuration

    SUSE Linux Enterprise Server 10 SP2 kernel version 2.6.16.60-0.31_lustre.1.8.0.1-smp
    OpenMPI 1.3.2
    gcc 4.1.2 (1/15/2007), gfortran 4.1.2 (1/15/2007)

Benchmark Description

Molecular dynamics simulation is widely used in biological and materials science research. NAMD is a public-domain molecular dynamics software application for which a variety of molecular input directories are available. Three of these directories define:
  • the Satellite Tobacco Mosaic Virus (STMV) that comprises 1,066,628 atoms
  • the f1 ATPase enzyme that comprises 327,506 atoms
  • the ApoA1 molecule that comprises 92,224 atoms
Each input directory also specifies the type of molecular dynamics simulation to be performed, for example, Langevin dynamics with a 12 Angstrom cutoff for 500 time steps, or particle mesh Ewald dynamics with an 11 Angstrom cutoff for 500 time steps.

Key Points and Best Practices

Models with large numbers of atoms scale better than models with small numbers of atoms.

The Intel QC X5570 processors include a turbo boost feature coupled with a speed-step option in the CPU section of the Advanced BIOS settings. Under specific circumstances, this can provide CPU overclocking, which increases the processor frequency from 2.93GHz to 3.33GHz. This feature was enabled when generating the results reported here.

See Also

Disclosure Statement

NAMD, see http://www.ks.uiuc.edu/Research/namd/performance.html for more information, results as of 11/17/2009.

Monday Nov 02, 2009

Sun Blade X6275 Cluster Beats SGI Running Fluent Benchmarks

A Sun Blade 6048 Modular System with 8 Sun Blade X6275 Server Modules, configured with a QDR InfiniBand cluster interconnect, delivered outstanding performance running the FLUENT 12 benchmark test suite. For the 6 benchmark tests considered, Sun consistently delivered the best or near-best results per node, up to the number of nodes available for these runs.

  • The Sun Blade X6275 cluster delivered the best results for the truck_poly_14m tests for all rank counts tested.
  • For this large truck_poly_14m test case, the Sun Blade X6275 cluster beat the best SGI results by as much as 19%.

  • Of the 54 test cases presented here, the Sun Blade X6275 cluster delivered the best results in 87% of the tests, 47 of the 54 cases.

Performance Landscape


FLUENT 12 Benchmark Test Suite
  Results are "Ratings" (bigger is better)
  Rating = number of sequential runs of the test case possible in 1 day = 86,400 / (Total Elapsed Run Time in Seconds)

Benchmark Test Case Ratings
System  Nodes  Ranks  eddy_417k  turbo_500k  aircraft_2m  sedan_4m  truck_14m  truck_poly_14m

Sun Blade X6275 16 128 6496.2 19307.3 8408.8 6341.3 1060.1 984.1
Best Intel 16 128 5236.4 (3) 15638.0 (7) 7981.5 (1) 6582.9 (1) 1005.8 (1) 933.0 (1)
Best SGI 16 128 7578.9 (5) 14706.4 (6) 6789.8 (4) 6249.5 (5) 1044.7 (4) 926.0 (4)

Sun Blade X6275 8 64 5308.8 26790.7 5574.2 5074.9 547.2 525.2
Best Intel 8 64 5016.0 (1) 25226.3 (1) 5220.5 (1) 4614.2 (1) 513.4 (1) 490.9 (1)
Best SGI 8 64 5142.9 (4) 23834.5 (4) 4614.2 (4) 4352.6 (4) 529.4 (4) 479.2 (4)

Sun Blade X6275 4 32 3066.5 13768.9 3066.5 2602.4 289.0 270.3
Best Intel 4 32 2856.2 (1) 13041.5 (1) 2837.4 (1) 2465.0 (1) 266.4 (1) 251.2 (1)
Best SGI 4 32 3083.0 (4) 13190.8 (4) 2588.8 (5) 2445.9 (5) 266.6 (4) 246.5 (4)

Sun Blade X6275 2 16 1714.3 7545.9 1519.1 1345.8 144.4 141.8
Best Intel 2 16 1585.3 (1) 7125.8 (1) 1428.1 (1) 1278.6 (1) 134.7 (1) 132.5 (1)
Best SGI 2 16 1708.4 (4) 7384.6 (4) 1507.9 (4) 1264.1 (5) 128.8 (4) 133.5 (4)

Sun Blade X6275 1 8 931.8 4061.1 827.2 681.5 73.0 73.8
Best Intel 1 8 920.1 (2) 3900.7 (2) 784.9 (2) 644.9 (1) 70.2 (2) 70.9 (2)
Best SGI 1 8 953.1 (4) 4032.7 (4) 843.3 (4) 651.0 (4) 71.4 (4) 72.0 (4)

Sun Blade X6275 1 4 550.4 2425.3 533.6 423.0 41.6 41.6
Best Intel 1 4 515.7 (1) 2244.2 (1) 490.8 (1) 392.2 (1) 37.8 (1) 38.4 (1)
Best SGI 1 4 561.6 (4) 2416.8 (4) 526.9 (4) 412.6 (4) 40.9 (4) 40.8 (4)

Sun Blade X6275 1 2 299.6 1328.2 293.9 232.1 21.3 21.6
Best Intel 1 2 274.3 (1) 1201.7 (1) 266.1 (1) 214.2 (1) 18.9 (1) 19.6 (1)
Best SGI 1 2 294.2 (4) 1302.7 (4) 289.0 (4) 226.4 (4) 20.5 (4) 21.2 (4)

Sun Blade X6275 1 1 154.7 682.6 149.1 114.8 9.7 10.1
Best Intel 1 1 143.5 (1) 631.1 (1) 137.4 (1) 106.2 (1) 8.8 (1) 9.0 (1)
Best SGI 1 1 153.3 (4) 677.5 (4) 147.3 (4) 111.2 (4) 10.3 (4) 9.5 (4)

Sun Blade X6275 1 serial 155.6 676.6 156.9 110.0 9.4 10.3
Best Intel 1 serial 146.6 (2) 650.0 (2) 150.2 (2) 105.6 (2) 8.8 (2) 9.7 (2)

    Sun Blade X6275, X5570 QC 2.93 GHz, QDR SMT on / Turbo mode on

    (1) Intel Whitebox (X5560 QC 2.80 GHz, RHEL5, IB)
    (2) Intel Whitebox (X5570 QC 2.93 GHz, RHEL5)
    (3) Intel Whitebox (X5482 QC 3.20 GHz, RHEL5, IB)
    (4) SGI Altix ICE_8200IP95 (X5570 2.93 GHz +turbo, SLES10, IB)
    (5) SGI Altix_ICE_8200IP95 (X5570 2.93 GHz, SLES10, IB)
    (6) SGI Altix_ICE_8200EX (Intel64 QC 3.00 GHz, Linux, IB)
    (7) Qlogic Cluster (X5472 QC 3.00 GHz, RHEL5.2, IB Truescale)
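
Because a Rating is defined as 86,400 divided by the elapsed run time, it can be converted back to wall-clock seconds per run. A minimal sketch of that arithmetic follows; the function names are illustrative and are not part of FLUENT.

```python
SECONDS_PER_DAY = 86_400


def fluent_rating(elapsed_seconds):
    """Number of back-to-back runs that fit in one day."""
    return SECONDS_PER_DAY / elapsed_seconds


def elapsed_from_rating(rating):
    """Invert the rating to recover the elapsed time of one run, in seconds."""
    return SECONDS_PER_DAY / rating


# truck_poly_14m on 16 nodes / 128 ranks (Sun Blade X6275 row above).
print(f"{elapsed_from_rating(984.1):.1f} s per run")  # 87.8 s per run
```

Since the rating is inversely proportional to elapsed time, a 10% higher rating corresponds to roughly a 9% shorter run.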

Results and Configuration Summary

Hardware Configuration:

    8 x Sun Blade X6275 Server Module ( Dual-Node Blade, 16 nodes ) each node with
      2 x 2.93GHz Intel X5570 QC processors
      24 GB (6 x 4GB, 1333 MHz DDR3 dimms)
      On-board QDR InfiniBand Host Channel Adapters, QNEM

Software Configuration:

    OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
    Interconnect Software: OFED ver 1.4.1
    Shared File System: Lustre ver 1.8.0.1
    Application: FLUENT V12.0.16
    Benchmark: FLUENT 12 Benchmark Test Suite

Benchmark Description

The benchmark tests are representative of the large CFD models that users typically run, and are intended for execution in distributed memory processor (DMP) mode over a cluster of multi-processor platforms.

Key Points and Best Practices

Observations About the Results

The Sun Blade X6275 cluster delivered excellent performance, especially shining with the larger models.

These processors include a turbo boost feature coupled with a speedstep option in the CPU section of the Advanced BIOS settings. Under specific circumstances, this can provide CPU up-clocking, temporarily increasing the processor frequency from 2.93GHz to 3.2GHz.

Memory placement is a very significant factor with Nehalem processors. Current Nehalem platforms have two sockets. Each socket has three memory channels and each channel has 3 bays for DIMMs. For example, if one DIMM is placed in the 1st bay of each of the 3 channels, the DIMM speed will be 1333 MHz with the X5570s. Altering the DIMM arrangement to an unbalanced configuration, say by adding just one more DIMM into the 2nd bay of one channel, will cause the DIMM frequency to drop from 1333 MHz to 1067 MHz.

About the FLUENT 12 Benchmark Test Suite

The FLUENT application performs computational fluid dynamic analysis on a variety of different types of flow and allows for chemically reacting species. Analyses can be transient dynamic and can be linear or nonlinear.

  • CFD models tend to be very large where grid refinement is required to capture with accuracy conditions in the boundary layer region adjacent to the body over which flow is occurring. Fine grids are also required to determine accurate turbulence conditions. As such, these models can run for many hours or even days, even when using a large number of processors.
  • CFD models typically scale very well and are very suited for execution on clusters. The FLUENT 12 benchmark test cases scale well.
  • The memory requirements for the test cases in the FLUENT 12 benchmark test suite range from a few hundred megabytes to about 25 GB. As the job is distributed over multiple nodes the memory requirements per node correspondingly are reduced.
  • The benchmark test cases for the FLUENT module do not have a substantial I/O component. However, performance will be enhanced very substantially by using high performance interconnects such as InfiniBand for inter-node cluster message passing. This nodal message passing data can be stored locally on each node or on a shared file system.
  • As a result of the large amount of inter-node message passing, performance can be further enhanced by more than a 3x factor, as indicated here, by implementing the Lustre-based shared file I/O system.

See Also

FLUENT 12.0 Benchmark:
http://www.fluent.com/software/fluent/fl6bench/fl6bench_6.4.x/

Disclosure Statement

All information on the Fluent website is Copyrighted 1995-2009 by Fluent Inc. Results from http://www.fluent.com/software/fluent/fl6bench/ as of October 20, 2009 and this presentation.

Saturday Oct 24, 2009

Sun C48 & Lustre fast for Seismic Reverse Time Migration using Sun X6275

Significance of Results

A Sun Blade 6048 Modular System with 12 Sun Blade X6275 server modules was clustered with QDR InfiniBand and used a Lustre File System over QDR InfiniBand to show performance improvements over an NFS file system for reading in Velocity, Epsilon, and Delta Slices and imaging 800 samples of various grid sizes using the Reverse Time Migration.

  • The Initialization Time for populating the processing grids demonstrates significant advantages of Lustre over NFS:
    • 2486x1151x1231 : 20x improvement
    • 1243x1151x1231 : 20x improvement
    • 125x1151x1231 : 11x improvement
  • The Total Application Performance shows the Interconnect and I/O advantages of using QDR InfiniBand Lustre for the large grid sizes:
    • 2486x1151x1231 : 2x improvement - processed in less than 19 minutes
    • 1243x1151x1231 : 2x improvement - processed in less than 10 minutes

  • The Computational Kernel Scalability Efficiency for the 3 grid sizes:
    • 125x1151x1231 : 97% (1-8 nodes)
    • 1243x1151x1231 : 102% (8-24 nodes)
    • 2486x1151x1231 : 100% (12-24 nodes)

  • The Total Application Scalability Efficiency for the large grid sizes:
    • 1243x1151x1231 : 72% (8-24 nodes)
    • 2486x1151x1231 : 71% (12-24 nodes)

  • On the Intel X5570 processor, enabling HyperThreading and running 16 OpenMP threads per node gives approximately a 10% performance improvement over running 8 threads per node.

Performance Landscape

This first table presents the initialization time, comparing different numbers of processors along with different problem sizes. The results are presented in seconds and show the advantage that the Lustre file system running over QDR InfiniBand provided when compared to a simple NFS file system.


Initialization Time Performance Comparison
Reverse Time Migration - SMP Threads and MPI Mode

               125 x 1151 x 1231         1243 x 1151 x 1231        2486 x 1151 x 1231
               800 Samples               800 Samples               800 Samples
Nodes  Procs   Lustre Time  NFS Time     Lustre Time  NFS Time     Lustre Time  NFS Time
               (sec)        (sec)        (sec)        (sec)        (sec)        (sec)
24     48      1.59         18.90        8.90         181.78       15.63        362.48
20     40      1.60         18.90        8.93         181.49       16.91        358.81
16     32      1.58         18.59        8.97         181.58       17.39        353.72
12     24      1.54         18.61        9.35         182.31       22.50        364.25
8      16      1.40         18.60        10.02        183.79
4      8       1.57         18.80
2      4       2.54         19.31
1      2       4.54         20.34

This next table presents the total application run time, comparing different numbers of processors along with different problem sizes. It shows that for larger problems, using the Lustre file system running over QDR InfiniBand provided a big performance advantage when compared to a simple NFS file system.


Total Application Performance Comparison
Reverse Time Migration - SMP Threads and MPI Mode

               125 x 1151 x 1231         1243 x 1151 x 1231        2486 x 1151 x 1231
               800 Samples               800 Samples               800 Samples
Nodes  Procs   Lustre Time  NFS Time     Lustre Time  NFS Time     Lustre Time  NFS Time
               (sec)        (sec)        (sec)        (sec)        (sec)        (sec)
24     48      251.48       273.79       553.75       1125.02      1107.66      2310.25
20     40      232.00       253.63       658.54       971.65       1143.47      2062.80
16     32      227.91       209.66       826.37       1003.81      1309.32      2348.60
12     24      217.77       234.61       884.27       1027.23      1579.95      3877.88
8      16      223.38       203.14       1200.71      1362.42
4      8       341.14       272.68
2      4       605.62       625.25
1      2       892.40       841.94


The following table presents the run time and speedup of just the computational kernel for different processor counts for the three different problem sizes considered. The scaling results are based upon the smallest number of nodes run and that number is used as the baseline reference point.


Computational Kernel Performance & Scalability
Reverse Time Migration - SMP Threads and MPI Mode

               125 x 1151 x 1231        1243 x 1151 x 1231       2486 x 1151 x 1231
               800 Samples              800 Samples              800 Samples
Nodes  Procs   X6275 Time  Speedup:     X6275 Time  Speedup:     X6275 Time  Speedup:
               (sec)       1-node       (sec)       1-node       (sec)       1-node
24     48      35.38       13.7         210.82      24.5         427.40      24.0
20     40      35.02       13.8         255.27      20.2         517.03      19.8
16     32      41.76       11.6         317.96      16.2         646.22      15.8
12     24      49.53       9.8          422.17      12.2         853.37      12.0*
8      16      62.34       7.8          645.27      8.0*
4      8       124.66      3.9
2      4       238.80      2.0
1      2       484.89      1.0


The last table presents the speedup of the total application for different processor counts for the three different problem sizes presented. The scaling results are based upon the smallest number of nodes run and that number is used as the baseline reference point.


Total Application Scalability Comparison
Reverse Time Migration - SMP Threads and MPI Mode

               125 x 1151 x 1231    1243 x 1151 x 1231    2486 x 1151 x 1231
               800 Samples          800 Samples           800 Samples
Nodes  Procs   Lustre Speedup:      Lustre Speedup:       Lustre Speedup:
               1-node               1-node                1-node
24     48      3.6                  17.3                  17.1
20     40      3.8                  14.6                  16.6
16     32      4.0                  11.6                  14.5
12     24      4.1                  10.9                  12.0*
8      16      4.0                  8.0*
4      8       2.6
2      4       1.5
1      2       1.0

Note: HyperThreading is enabled and running 16 threads per Node.
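
For the two larger grids the smallest run used 8 or 12 nodes, so the speedups are expressed in 1-node units by treating the baseline run as perfectly scaled. The sketch below shows that arithmetic under this assumption; the function name is illustrative.

```python
def scaled_speedup(base_nodes, base_time, time):
    """Speedup in 1-node units.

    Assumption: the baseline run on `base_nodes` nodes is credited with a
    speedup of exactly `base_nodes`, as the starred table entries suggest.
    """
    return base_nodes * base_time / time


# 1243 x 1151 x 1231 kernel: baseline is 8 nodes at 645.27 s; 24 nodes took 210.82 s.
s = scaled_speedup(8, 645.27, 210.82)
print(f"speedup {s:.1f}, efficiency {s / 24:.0%}")  # speedup 24.5, efficiency 102%
```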

Results and Configuration Summary

Hardware Configuration:
    Sun Blade 6048 Modular System with
      12 x Sun Blade X6275 Server Modules, each with
        4 x 2.93 GHz Intel Xeon QC X5570 processors
        12 x 4 GB memory at 1333 MHz
        2 x 24 GB Internal Flash
    QDR InfiniBand Lustre 1.8.0.1 File System
    GBit NFS file system

Software Configuration:

    OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
    MPI: Scali MPI Connect 5.6.6-59413
    Compiler: Sun Studio 12 C++, Fortran, OpenMP

Benchmark Description

The Reverse Time Migration (RTM) is currently the most popular seismic processing algorithm because of its ability to produce quality images of complex substructures. It can accurately image steep dips that can not be imaged correctly with traditional Kirchhoff 3D or frequency domain algorithms. The Wave Equation Migration (WEM) can image steep dips but does not produce the image quality that can be achieved by the RTM. However, the increased computational complexity of the RTM over the WEM introduces new areas for performance optimization. The current trend in seismic processing is to perform iterative migrations on wide azimuth marine data surveys using the Reverse Time Migration.

This Reverse Time Migration code reads in processing parameters that define the grid dimensions, number of threads, number of processors, imaging condition, and various other parameters. The master node calculates the memory requirements to determine if there is sufficient memory to process the migration "in-core". The domain decomposition across all the nodes is determined by dividing the first grid dimension by the number of nodes. Each node then reads in its section of the Velocity Slices, Delta Slices, and Epsilon Slices using MPI IO reads. The three source and receiver wavefield state vectors are created: previous, current, and next state. The processing steps through the input trace data, reading both the receiver and source data for each of the 800 time steps. It uses forward propagation for the source wavefield and backward propagation in time to cross correlate the receiver wavefield. The computational kernel consists of a 13 point stencil to process a subgrid within the memory of each node using OpenMP parallelism. Afterwards, conditioning and absorption are applied and boundary data is communicated to neighboring nodes as each time step is processed. The final image is written out using MPI IO.
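
The domain decomposition described above can be sketched as a split of the first grid dimension into contiguous slabs, one per node. This is an illustrative sketch only: the function name is hypothetical, and the handling of a remainder (giving the first few nodes one extra plane) is an assumption, since the description does not say how uneven divisions are treated.

```python
def decompose_first_dimension(nx, num_nodes):
    """Split indices 0..nx-1 of the first grid dimension into contiguous slabs.

    Returns one half-open (start, stop) range per node. When nx is not a
    multiple of num_nodes, the first `nx % num_nodes` nodes get one extra
    plane (an assumption made for this sketch).
    """
    base, extra = divmod(nx, num_nodes)
    slabs = []
    start = 0
    for rank in range(num_nodes):
        size = base + (1 if rank < extra else 0)
        slabs.append((start, start + size))
        start += size
    return slabs


# Largest grid (2486 x 1151 x 1231) across 24 nodes.
slabs = decompose_first_dimension(2486, 24)
print(slabs[0], slabs[-1])  # (0, 104) (2383, 2486)
```

Each node would then read only the planes in its slab from the Velocity, Delta, and Epsilon volumes, plus whatever halo planes the stencil needs from its neighbors.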

Total memory requirements for each grid size:

    125x1151x1231: 7.5GB
    1243x1151x1231: 78GB
    2486x1151x1231: 156GB

For this phase of benchmarking, the focus was to optimize the data initialization. In the next phase of benchmarking, the trace data reading will be optimized so that each node reads in only its section of interest. In this benchmark, the trace data reading skews the Total Application Performance as the number of nodes increases. The next phase will also include further node optimization with OpenMP. The I/O description for this benchmark phase on each grid size:

    125x1151x1231:
      Initialization MPI Read: 3 x 709MB = 2.1GB / number of nodes
      Trace Data Read per Node: 2 x 800 x 576KB = 920MB * number of nodes
      Final Output Image MPI Write: 709MB / number of nodes
    1243x1151x1231: 78GB
      Initialization MPI Read: 3 x 7.1GB = 21.3GB / number of nodes
      Trace Data Read per Node: 2 x 800 x 5.7MB = 9.2GB * number of nodes
      Final Output Image MPI Write: 7.1GB / number of nodes
    2486x1151x1231: 156GB
      Initialization MPI Read: 3 x 14.2GB = 42.6GB / number of nodes
      Trace Data Read per Node: 2 x 800 x 11.4MB = 18.4GB * number of nodes
      Final Output Image MPI Write: 14.2GB / number of nodes
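
The I/O figures above are consistent with one 4-byte value per grid point for each volume and one 4-byte value per point of the first two dimensions for each trace time step. The sketch below reproduces the arithmetic under those assumptions, which are inferred from the quoted sizes rather than stated by the benchmark; the function name is illustrative.

```python
BYTES_PER_SAMPLE = 4  # assumption: single-precision values


def io_volumes(nx, ny, nz, num_nodes, time_steps=800):
    """Estimate per-node I/O in bytes for one migration run."""
    volume = nx * ny * nz * BYTES_PER_SAMPLE   # one full slice volume
    trace_step = nx * ny * BYTES_PER_SAMPLE    # one trace record per time step
    return {
        # Velocity, Epsilon, and Delta volumes, split across the nodes.
        "init_read_per_node": 3 * volume / num_nodes,
        # Source and receiver traces; every node currently reads all of them.
        "trace_read_per_node": 2 * time_steps * trace_step,
        # The output image is one volume, split across the nodes.
        "output_write_per_node": volume / num_nodes,
    }


v = io_volumes(125, 1151, 1231, num_nodes=1)
print(v["init_read_per_node"] / 1e9)   # about 2.1 (GB)
print(v["trace_read_per_node"] / 1e6)  # about 920 (MB)
```

Because the trace read does not shrink with the node count, its aggregate volume grows linearly as nodes are added, which is the skew in Total Application Performance noted above.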

Key Points and Best Practices

  • Additional evaluations were performed to compare GBit NFS, InfiniBand NFS, and InfiniBand Lustre for the Reverse Time Migration Initialization. InfiniBand NFS was 6x faster than GBit NFS and InfiniBand Lustre was 3x faster than InfiniBand NFS using the same disk configurations. On 12 nodes for grid size 2486x1151x1231 the initialization time was 22.50 seconds for IB Lustre, 61.03 seconds for IB NFS, and 364.25 seconds for GBit NFS.
  • The Reverse Time Migration computational performance scales nicely as a function of the grid size being processed. This is consistent with the IBM published results for this application.
  • The Total Application performance results are not typically reported in benchmark studies for this application. The IBM report specifically states that the execution times do not include I/O times and non-recurring allocation or initialization delays. Examining the total application performance reveals that the workload is no longer dominated by the partial differential equation (PDE) solver, as IBM suggests, but is constrained by the I/O for grid initialization, reading in the traces, saving/restoring wave state data, and writing out the final image. Aggressive optimization of the PDE solver has little effect on the overall throughput of this application. It is clearly more important to optimize the I/O. The trend in seismic processing, as stated at the 2008 Society of Exploration Geophysicists (SEG) conference, is to run the reverse time migration iteratively on wide azimuth data. Thus, optimizing the I/O and application throughput is imperative to meet this trend. SSD and Flash technologies in conjunction with Sun's Lustre file system can reduce this I/O bottleneck and pave the path for the future in seismic processing.
  • Minimal tuning effort was applied to achieve the results presented. Sun's HPC software stack, which includes the Sun Studio compiler, was used to build the 70,000 lines of C++ and Fortran source into the application executable. The only compiler option used was "-fast". No assembly level optimizations, like those performed by IBM to use SIMD registers (SSE registers), were performed in this benchmark. Similarly, no explicit cache blocking, loop unrolling, or memory bandwidth optimizations were conducted. The idea was to demonstrate the performance that a customer can expect from their existing applications without extensive, platform specific optimizations.

See Also

Disclosure Statement

Reverse Time Migration, Results as of 10/23/2009. For more info http://www.caam.rice.edu/tech_reports/2006/TR06-18.pdf

Sun F5100 and Seismic Reverse Time Migration with faster Optimal Checkpointing

A prominent Seismic Processing algorithm, Reverse Time Migration with Optimal Checkpointing, in SMP "THREADS" Mode, was tested using a Sun Fire X4270 server configured with four high performance 15K SAS hard disk drives (HDDs) and a Sun Storage F5100 Flash Array. This benchmark compares I/O devices for checkpointing wave state information while processing a production seismic migration.

  • Sun Storage F5100 Flash Array is 2.2x faster than high-performance 15K RPM disks.

  • Multithreading the checkpointing using the Sun Studio C++ Compiler OpenMP implementation gives a 12.8x performance improvement over the original single threaded version.

These results show that the new trend in seismic processing, running iterative Reverse Time Migrations and migration playback, is a reality. This is made possible through the use of Sun FlashFire technology to provide good checkpointing speeds without additional disk cache memory. The application can take advantage of all the memory within a node without regard to the checkpoint cache buffers required for performance to HDDs. Similarly, larger problem sizes can be solved without increasing the memory footprint of each computational node.

Performance Landscape


Reverse Time Migration Optimal Checkpointing - SMP Threads Mode
Grid Size - 800 x 1151 x 1231 with 800 Samples - 60GB of memory

            HDD                                 F5100
Number      Put Time  Get Time  Total Time      Put Time  Get Time  Total Time      F5100
Checkpts    (secs)    (secs)    (secs)          (secs)    (secs)    (secs)          Speedup
80          660.8     25.8      686.6           277.4     40.2      317.6           2.2x
400         1615.6    382.3     1997.9          989.5     269.7     1259.2          1.6x


Reverse Time Migration Optimal Checkpointing - SMP Threads Mode
Grid Size - 125 x 1151 x 1231 with 800 Samples - 9GB of memory

            HDD                                 F5100
Number      Put Time  Get Time  Total Time      Put Time  Get Time  Total Time      F5100
Checkpts    (secs)    (secs)    (secs)          (secs)    (secs)    (secs)          Speedup
80          10.2      0.2       10.4            8.0       0.2       8.2             1.3x
400         52.3      0.4       52.7            45.2      0.3       45.5            1.2x
800         102.6     0.7       103.3           91.8      0.6       92.4            1.1x


Reverse Time Migration Optimal Checkpointing
Single Thread vs Multithreaded I/O Performance
Grid Size - 125 x 1151 x 1231 with 800 Samples - 9GB of memory

Number      Single Thread F5100    Multithreaded F5100    Multithread
Checkpts    Total Time (secs)      Total Time (secs)      Speedup
80          105.3                  8.2                    12.8x
400         482.9                  45.5                   10.6x
800         963.5                  92.4                   10.4x

Note: Hyperthreading and Turbo Mode enabled while running 16 threads per node.

Results and Configuration Summary

Hardware Configuration:

    Sun Fire X4270 Server
      2 x 2.93 GHz Quad-core Intel Xeon X5570 processors
      72 GB memory
      4 x 73 GB 15K SAS drives
        File system striped across 4 15K RPM high-performance SAS HD RAID0
      Sun Storage F5100 Flash Array with local/internal r/w buff 4096
        20 x 24 GB flash modules

Software Configuration:

    OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
    Compiler: Sun Studio 12 C++, Fortran, OpenMP

Benchmark Description

The Reverse Time Migration (RTM) is currently the most popular seismic processing algorithm because of its ability to produce quality images of complex substructures. It can accurately image steep dips that cannot be imaged correctly with traditional Kirchhoff 3D or frequency domain algorithms. The Wave Equation Migration (WEM) can image steep dips but does not produce the image quality that can be achieved by the RTM. However, the increased computational complexity of the RTM over the WEM introduces new areas for performance optimization. The current trend in seismic processing is to perform iterative migrations on wide azimuth marine data surveys using the Reverse Time Migration.

The Reverse Time Migration with Optimal Checkpointing was introduced so large migrations could be performed within minimal memory configurations of x86 cluster nodes. The idea is to only have three wavestate vectors in memory for each of the source and receiver wavefields instead of holding the entire wavefields in memory for the duration of processing. With the Sun Flash F5100, this can be done with little performance penalty to the full migration time. Another advantage of checkpointing is to provide the ability to playback migrations and facilitate iterative migrations.

  • The stored snapshot data can be reprocessed with different filtering, image conditioning, or a variety of other parameters.
  • Fine-grained snapshotting can help the processing of more complex subsurface data.
  • A geoscientist can "playback" a migration from the saved snapshots to visually validate migration accuracy or pick areas of interest for additional processing.

The Reverse Time Migration with Optimal Checkpointing is an algorithm designed by Griewank (Griewank, 1992; Blanch et al., 1998; Griewank, 2000; Griewank and Walther, 2000; Akcelik et al., 2003).

  • The application takes snapshots of wavefield state data for some interval of the total number of samples.
  • This adjoint state method performs crosscorrelation of the source and receiver wavefields at each level.
  • Forward recursion is used for the source wavefield and backward recursion for the receiver wavefield.
  • For relatively small seismic migrations, all of the forward processed state information can be saved and restored with minimal impact on the total processing time.
  • Effectively, the computational complexity increases while the memory requirements decrease by a logarithmic factor of the number of snapshots.
  • Griewank's algorithm helps define the optimal tradeoff between computational performance and the number of memory buffers (memory requirements) to support this cross correlation.
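
The snapshot-and-recompute idea in the list above can be illustrated with a toy example. The sketch below uses uniform-interval checkpointing, which is simpler than Griewank's optimal schedule and is not the benchmark's code: snapshots are saved every `interval` steps during the forward pass, and the reverse pass restores a snapshot and recomputes one segment at a time. The `step` function and all names are hypothetical stand-ins for the real wavefield propagator.

```python
def forward_with_checkpoints(state, step, num_steps, interval):
    """Advance `state` for num_steps, saving a snapshot every `interval` steps."""
    checkpoints = {0: state}
    for t in range(1, num_steps + 1):
        state = step(state)
        if t % interval == 0:
            checkpoints[t] = state
    return state, checkpoints


def reverse_states(checkpoints, step, num_steps, interval):
    """Yield (t, state_t) for t = num_steps down to 0.

    Only one segment of at most `interval` states is held in memory at a
    time; each segment is recomputed forward from its snapshot.
    """
    t_end = num_steps
    while t_end >= 0:
        t_start = (t_end // interval) * interval
        segment = [checkpoints[t_start]]          # restore the snapshot
        for _ in range(t_end - t_start):          # recompute forward
            segment.append(step(segment[-1]))
        for offset in range(len(segment) - 1, -1, -1):
            yield t_start + offset, segment[offset]
        t_end = t_start - 1


# Toy "propagator": any deterministic update rule works for the illustration.
step = lambda s: 2 * s + 1
final, cps = forward_with_checkpoints(1, step, num_steps=10, interval=4)
replayed = list(reverse_states(cps, step, num_steps=10, interval=4))
print(sorted(cps))   # [0, 4, 8]
print(replayed[0])   # (10, 2047)
```

In a real migration each replayed source state would be cross correlated with the receiver wavefield at the same time level as the receiver is propagated backward; the sketch shows only the save, restore, and recompute pattern that generates the checkpoint I/O measured here.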

For the purposes of this benchmark, this implementation of the Reverse Time Migration with Optimal Checkpointing does not fully implement the optimal memory buffer scheme proposed by Griewank. The intent is to compare various I/O alternatives for saving wave state data for each node in a compute cluster.

This benchmark measures the time to perform the wave state saves and restores while simultaneously processing the wave state data.

Key Points and Best Practices

  • Multithreading the checkpointing using Sun Studio OpenMP and running 16 I/O threads with hyperthreading enabled gives a performance advantage over single threaded I/O to the Sun Storage F5100 flash array. The Sun Storage F5100 flash array can process concurrent I/O requests from multiple threads very efficiently.
  • Allocating the majority of a node's available memory to the Reverse Time Migration algorithm and leaving little memory for I/O caching favors the Sun Storage F5100 flash array over direct attached high performance disk drives. This performance advantage decreases as the number of snapshots increases. The reason for this is that increasing the number of snapshots decreases the memory requirement for the application.
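
The benchmark multithreaded its checkpoint I/O with OpenMP in C++. As a rough analogue only, the Python sketch below splits one checkpoint buffer into chunks and issues the writes concurrently from a thread pool, using positional writes so the threads never share a file offset. The function name and the 16-thread default are illustrative, and `os.pwrite` is available on POSIX systems only.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def write_checkpoint_parallel(path, data, num_threads=16):
    """Write `data` (bytes-like) to `path` using concurrent positional writes."""
    chunk = -(-len(data) // num_threads)  # ceiling division
    view = memoryview(data)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        def write_chunk(i):
            offset = i * chunk
            part = view[offset:offset + chunk]
            written = 0
            # pwrite may write fewer bytes than requested, so loop until done.
            while written < len(part):
                written += os.pwrite(fd, part[written:], offset + written)

        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            # Consuming the iterator re-raises any exception from a worker.
            list(pool.map(write_chunk, range(num_threads)))
    finally:
        os.close(fd)
```

A device that services many outstanding requests efficiently, as described for the flash array above, is the case where issuing the chunks concurrently rather than one after another would be expected to help.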

See Also

Disclosure Statement

Reverse Time Migration with Optimal Checkpointing, Results as of 10/23/2009. For more info http://www.caam.rice.edu/tech_reports/2006/TR06-18.pdf

Friday Oct 23, 2009

Wiki on performance best practices

A fantastic source of technical Best Practices is at
http://wikis.sun.com/display/Performance/Home

This wiki hosts the combined wisdom of many performance engineers from across Sun. It has information about Hardware, Software, ZFS, Oracle and various other performance topics. This wiki attempts to categorize and present information so it is easy to find and use. It is just getting started, but please let us know if there are any topics which would be useful.

Tuesday Oct 13, 2009

CP2K Life Sciences, Ab-initio Dynamics - Sun Blade 6048 Chassis with Sun Blade X6275 - Scalability and Throughput with Quad Data Rate InfiniBand

Significance of Results

Clusters of Sun Blade X6275 and X6270 server modules were used to run benchmarks using the CP2K ab-initio dynamics applications software.

  • For the X6270 cluster with Dual Data Rate (DDR) InfiniBand the rate of increase of scalability slows dramatically at 16 nodes, whereas for the X6275 cluster with QDR InfiniBand the scalability continues to 72 nodes.
  • For 64 nodes, the speed of the Sun Blade X6275 cluster with QDR InfiniBand was 2.7X that of a Sun Blade X6270 cluster with DDR InfiniBand.

Ab-initio dynamics simulation is important to materials science research.  Dynamics simulation is used to determine the trajectories of atoms or molecules over time.

Performance Landscape

The CP2K Bulk Water Benchmarks web page plots the performance of CP2K ab-initio dynamics benchmarks that have from 32 to 512 water molecules for a cluster that comprises two 2.66GHz Xeon E5430 quad core CPUs per node and that uses Dual Data Rate InfiniBand.

The following table reports the execution time for the 512 water molecule benchmark when executed on the Sun Blade X6275 cluster having Quad Data Rate InfiniBand and on the Sun Blade X6270 cluster having Dual Data Rate InfiniBand. Each node of either Sun Blade cluster comprises two 2.93GHz Intel Xeon X5570 quad core CPUs. In the following table, the performance is expressed in terms of the "wall clock" time in seconds required to execute ten steps of the ab-initio dynamics simulation for 512 water molecules. A smaller number implies better performance.

Number      X6275 QDR InfiniBand      X6270 DDR InfiniBand
of Nodes    (seconds for 10 steps)    (seconds for 10 steps)
96          -                         1184.36
72          564.16                    -
64          598.41                    1591.35
32          706.82                    1436.49
24          950.02                    1752.20
16          1227.73                   2119.50
12          1440.16                   1739.26
8           1876.95                   2120.73
4           3408.39                   3705.44
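
The relative speed of the two clusters at each node count can be recomputed from the table above. The following is a minimal sketch; the dictionary and function names are illustrative and not part of CP2K or of the benchmark harness, and only node counts measured on both clusters are included.

```python
# Wall-clock seconds for 10 steps (512 water molecules), copied from the
# table above. Only node counts measured on both clusters are listed.
qdr_x6275 = {64: 598.41, 32: 706.82, 24: 950.02, 16: 1227.73,
             12: 1440.16, 8: 1876.95, 4: 3408.39}
ddr_x6270 = {64: 1591.35, 32: 1436.49, 24: 1752.20, 16: 2119.50,
             12: 1739.26, 8: 2120.73, 4: 3705.44}


def relative_speed(nodes):
    """Return how many times faster the QDR cluster ran than the DDR cluster."""
    return ddr_x6270[nodes] / qdr_x6275[nodes]


for nodes in sorted(qdr_x6275):
    print(f"{nodes:3d} nodes: X6275 QDR is {relative_speed(nodes):.2f}x "
          f"the speed of X6270 DDR")
```

At 64 nodes this ratio rounds to the 2.7X quoted above. Turbo Mode was on for the X6275 runs and off for the X6270 runs, so the ratio reflects more than the interconnect alone.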

Results and Configuration Summary

Hardware Configuration:

    Sun Blade[tm] 6048 Modular System with 3 shelves, each shelf with
      12 x Sun Blade X6275, each blade with
        2 x (2 x 2.93 GHz Intel QC Xeon X5570 processors)
        2 x (24 GB memory)
        Hyper-Threading (HT) off, Turbo Mode on
    QDR InfiniBand
    96 x Sun Blade X6270, each blade with
      2 x 2.93 GHz Intel QC Xeon X5570 processors
      1 x (24 GB memory)
      Hyper-Threading (HT) off, Turbo Mode off
    DDR InfiniBand
Software Configuration:
    SUSE Linux Enterprise Server 10 SP2 kernel version 2.6.16.60-0.31_lustre.1.8.0.1-smp
    OpenMPI 1.3.2
    Sun Studio 12 f90 compiler, ScaLAPACK, BLACS and Performance Libraries
    FFTW (Fastest Fourier Transform in the West) 3.2.1

Benchmark Description

CP2K is a parallel ab-initio dynamics code that is designed to perform atomistic and molecular simulations of solid state, liquid, molecular and biological systems. It provides a general framework for different methods such as density functional theory (DFT) using a mixed Gaussian and plane waves approach (GPW), and classical pair and many-body potentials.

Ab-initio dynamics simulation is widely used in materials science research. CP2K is a public-domain ab-initio dynamics software application.

Key Points and Best Practices

  • QDR InfiniBand scales better than DDR InfiniBand.
  • The Intel QC X5570 processors include a turbo boost feature coupled with a speed-step option in the CPU section of the Advanced BIOS settings. Under specific circumstances, this can provide CPU overclocking which increases the processor frequency from 2.93GHz to 3.2GHz. This feature was enabled for the X6275 and disabled for the X6270 when generating the results reported here.

See Also

Disclosure Statement

CP2K, see http://cp2k.berlios.de/ for more information, results as of 10/13/2009.

Halliburton ProMAX Oil & Gas Application Fast on Sun 6048/X6275 Cluster

Significance of Results

The ProMAX family of seismic data processing tools is the most widely used Oil and Gas Industry seismic processing application. ProMAX is used for multiple applications, from field processing and quality control, to interpretive project-oriented reprocessing at oil companies and production processing at service companies. ProMAX is integrated with Halliburton's OpenWorks Geoscience Oracle Database to access prestack seismic data and populate the database with seismic images. This shows the powerful combination of scientific computing merged with commercial database technology.

A cluster of 48 Sun Blade X6275 server modules in a Sun Blade 6048 Modular System was configured with QDR InfiniBand and a Lustre File System to demonstrate performance on ProMAX.

  • The 3D Prestack Kirchhoff Time Migration showed excellent scalability while utilizing the QDR InfiniBand Lustre Filesystem.

    • 70808 Traces : 144x improvement going from 1 to 72 nodes
    • 283232 Traces : 98x improvement going from 1 to 96 nodes

  • The super linear scalability is attributed in part to data caching effects.

  • Recompiling the source code with the Intel 11.1 compilers improved the performance of the current production release of the ProMAX 3D Kirchhoff Time Migration by a factor of up to 1.7x.

High Performance ProMAX allows Halliburton's GeoProbe interpretation application to perform migrations "on the fly" while pulling additional mapping data, well logs, and reservoir data from the OpenWorks Oracle Database.

  • Improves velocity modeling throughput for performing iterative Kirchhoff Migrations

  • Sun Grid Engine can be used to optimize the throughput of multiple migrations and maximize the return on investment of a Sun Blade 6048 Modular System.

Enabling hyperthreading and running 16 threads per node can benefit current and potential ProMAX users running on a Sun Blade X6275 configuration. In the tests run with the code rebuild, hyperthreading outperformed non-hyperthreading by as much as 27%.

Performance Landscape

Note: Results are all run with 16 Threads per Node with HyperThreading Enabled.


ProMAX 3D Prestack Kirchhoff Time Migration
SMP Threads and PVM Mode
Execution Time (sec)

                    70808 Traces            283232 Traces
Nodes   Procs    Current    Code         Current    Code
                 Release    Rebuild      Release    Rebuild
96      192      -          -            18         13
72      144      3          2            22         16
48      96       4          3            38         23
24      48       11         7            76         48
16      32       23         14           117        72
12      24       37         23           165        100
8       16       62         35           258        150
4       8        129        78           514        343
1       2        486        288          2022       1278


ProMAX 3D Prestack Kirchhoff Time Migration
Scalability
Speedup relative to 1 Node

                    70808 Traces            283232 Traces
Nodes   Procs    Current    Code         Current    Code
                 Release    Rebuild      Release    Rebuild
96      192      -          -            112        98
72      144      162        144          92         80
48      96       121        96           53         55
24      48       44         41           26         26
16      32       21         20           17         18
12      24       13         12           12         13
8       16       8          8            8          8
4       8        4          4            4          4
1       2        1          1            1          1
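
The speedup figures above are the 1-node execution time divided by the N-node execution time. Dividing that speedup by the node count gives a parallel efficiency, and a value above 1.0 marks super linear scaling. The following is a minimal sketch using the Code Rebuild times for the 283232-trace case; the names are illustrative and not part of ProMAX.

```python
# Code Rebuild execution times in seconds for the 283232-trace case,
# keyed by node count, copied from the execution-time table.
rebuild_283232 = {1: 1278, 4: 343, 8: 150, 12: 100, 16: 72,
                  24: 48, 48: 23, 72: 16, 96: 13}


def speedup(times, nodes):
    """Speedup relative to the 1-node run."""
    return times[1] / times[nodes]


def efficiency(times, nodes):
    """Speedup per node; values above 1.0 indicate super linear scaling."""
    return speedup(times, nodes) / nodes


for nodes in sorted(rebuild_283232):
    s = speedup(rebuild_283232, nodes)
    e = efficiency(rebuild_283232, nodes)
    note = " (super linear)" if e > 1.0 else ""
    print(f"{nodes:3d} nodes: speedup {s:6.1f}, efficiency {e:4.2f}{note}")
```

Because the published times are whole seconds, the ratios for the shortest runs carry a large rounding uncertainty and should be read as approximate.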



ProMAX 3D Prestack Kirchhoff Time Migration
283232 Traces
Hyperthreading Performance Comparison
Perf (sec)

                    Current Release              Code Rebuild
Nodes   Procs    8 Threads    16 Threads      8 Threads    16 Threads
                 per Node     per Node        per Node     per Node
24      48       95           76              59           48
16      32       144          117             91           72
12      24       196          165             122          100
8       16       310          258             190          150
4       8        629          514             390          343
1       2        2518         2022            1554         1278
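
The hyperthreading benefit quoted earlier (as much as 27% with the code rebuild) follows from the ratio of the 8-thread time to the 16-thread time. The following is a minimal sketch that recomputes it from the Code Rebuild columns above; the names are illustrative.

```python
# Code Rebuild times in seconds for 283232 traces, keyed by node count:
# (8 threads per node, 16 threads per node), copied from the table above.
code_rebuild = {24: (59, 48), 16: (91, 72), 12: (122, 100),
                8: (190, 150), 4: (390, 343), 1: (1554, 1278)}


def ht_gain_percent(t8, t16):
    """Percent by which the 16-thread run outperforms the 8-thread run."""
    return (t8 / t16 - 1.0) * 100.0


gains = {nodes: ht_gain_percent(t8, t16)
         for nodes, (t8, t16) in code_rebuild.items()}
best_nodes = max(gains, key=gains.get)

for nodes in sorted(gains):
    print(f"{nodes:3d} nodes: hyperthreading gain {gains[nodes]:.1f}%")
print(f"largest gain at {best_nodes} nodes")
```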

Results and Configuration Summary

Hardware Configuration:
    Sun Blade 6048 Modular System with
      48 x Sun Blade X6275 (Vayu) Server Modules, each with
        4 x 2.93 GHz Intel Xeon QC X5570 processors
        12 x 4 GB memory at 1333 MHz
        2 x 24 GB Internal Flash
    QDR InfiniBand
    Lustre 1.8.0.1 File System

Software Configuration:

    OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
    PVM: Parallel Virtual Machine
    Resource Management: Sun Grid Engine
    Compiler: GNU C++ 4.1.2, Intel 11.1 Compilers
    Database: OpenWorks Database requires Oracle 10g Enterprise Edition
    Additional Libraries: pthreads 2.4, Java 1.6.0_01, BLAS, Stanford Exploration Project Libraries

Benchmark Description

This benchmark compares the current production release of ProMAX built with the GNU C++ and Fortran compilers to builds with the Intel Fortran and C++ compilers. Two different problem sizes were evaluated with the ProMAX 3D Prestack Kirchhoff Time Migration:
  • 70808 traces with 8 msec sample interval and trace length of 4992 msec
  • 283232 traces with 8 msec sample interval and trace length of 4992 msec

The ProMAX processing parameters used for this benchmark:
  • Input data set = Shots
    Minimum output inline = 65
    Maximum output inline = 85
    Inline output sampling interval = 1
    Minimum output xline = 1
    Maximum output xline = 200
    Xline output sampling interval = 1
    Antialias inline spacing = 15
    Antialias xline spacing = 15
    Stretch Mute Aperature Limit with Maximum Stretch = 15
    Image Gather Type = Full Offset Image Traces
    No Block Moveout
    Number of Alias Bands = 10
    3D Amplitude Phase Correction
    No compression
    Maximum Number of Cache Blocks = 500000

The compiler flags used for the various builds:
  • The Current Production Release Code was built with GNU Fortran and C++ flags.
    -O3 -m64 -march=x86-64 -mieee-fp -mfpmath=sse -msse2 -fforce-addr -fno-inline-functions

  • The application was rebuilt with the Intel Fortran and C++ flags.
    -xSSE4.2 -O3 -ipo -no-prec-div -static -m64 -ftz -fast-transcendentals -fp-speculation=fast

Key Points and Best Practices

Super linear scalability of the 70808-trace case at the larger node counts can be attributed to the dataset decomposition fitting in cache, which is shared by multiple threads per core.

See Also

Disclosure Statement

The following are trademarks or registered trademarks of Halliburton/Landmark Graphics: ProMAX, GeoProbe, OpenWorks.

Monday Oct 12, 2009

MCAE ABAQUS faster on Sun F5100 and Sun X4270 - Single Node World Record

The Sun Storage F5100 Flash Array can substantially improve performance over internal hard disk drives as shown by the I/O intensive ABAQUS MCAE application Standard benchmark tests on a Sun Fire X4270 server.

The I/O intensive ABAQUS "Standard" benchmark test cases were run on a single Sun Fire X4270 server. Data is presented for runs at both 8 and 16 thread counts.

The ABAQUS "Standard" module is an MCAE application based on the finite element method of analysis (FEA). This computer based numerical method inherently involves a substantial I/O component. The purpose was to evaluate the performance of the Sun Storage F5100 Flash Array relative to high performance 15K RPM internal striped HDDs.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "S4b" test case by 14%.

  • The Sun Fire X4270 server coupled with a Sun Storage F5100 Flash Array established the world record performance on a single node for the four test cases S2A, S4B, S4D and S6.

Performance Landscape

ABAQUS "Standard" Benchmark Test S4B: Advantage of Sun Storage F5100

Results are total elapsed run times in seconds

Threads   4x15K RPM 72 GB SAS HDD   Sun F5100        Sun F5100
          striped HW RAID0          r/w buff 4096    Performance
                                    striped          Advantage
8         1504                      1318             14%
16        1811                      1649             10%
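
The "Performance Advantage" column is consistent with the ratio of the HDD elapsed time to the F5100 elapsed time, expressed as a percentage above 1. The following is a minimal sketch under that assumption; the function name is illustrative and the formula is inferred from the table rather than stated in the benchmark report.

```python
def flash_advantage_percent(hdd_seconds, flash_seconds):
    """Percent by which the flash run is faster than the HDD run (assumed formula)."""
    return (hdd_seconds / flash_seconds - 1.0) * 100.0


# S4B elapsed times in seconds, keyed by thread count: (HDD RAID0, F5100).
s4b = {8: (1504, 1318), 16: (1811, 1649)}

for threads, (hdd, flash) in sorted(s4b.items()):
    print(f"{threads:2d} threads: F5100 advantage "
          f"{flash_advantage_percent(hdd, flash):.0f}%")
```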

ABAQUS Standard Server Benchmark Subset: Single Node Record Performance

Results are total elapsed run times in seconds

Platform Cores S2a S4b S4d S6
X4270 w/F5100 8 302 1192 779 1237
HP BL460c G6 8 324 1309 843 1322
X4270 w/F5100 4 552 1970 1181 1706
HP BL460c G6 4 561 2062 1234 1812

Results and Configuration Summary

Hardware Configuration:
    Sun Fire X4270
      2 x 2.93 GHz QC Intel Xeon X5570 processors
      Hyperthreading enabled
      24 GB memory
      4 x 72 GB 15K RPM striped (HW RAID0) SAS disks
    Sun Storage F5100 Flash Array
      20 x 24 GB flash modules
      Intel controller

Software Configuration:

    O/S: 64-bit SUSE Linux Enterprise Server 10 SP 2
    Application: ABAQUS V6.9-1 Standard Module
    Benchmark: ABAQUS Standard Benchmark Test Suite

Benchmark Description

Abaqus/Standard Benchmark Problems

These problems provide an estimate of the performance that can be expected when running Abaqus/Standard or similar commercially available MCAE (FEA) codes like ANSYS and MSC/Nastran on different computers. The jobs are representative of those typically analyzed by Abaqus/Standard and other MCAE applications. These analyses include linear statics, nonlinear statics, and natural frequency extraction.

Please go here for a more complete description of the tests.

Key Points and Best Practices

  • The memory requirements for the test cases in the ABAQUS Standard benchmark test suite are rather substantial, with some of the test cases requiring slightly over 20GB of memory. There are two memory limits: a minimum, beyond which out-of-core "memory" is used at the cost of more CPU time, and a maximum that minimizes I/O operations. These memory limits are given in the ABAQUS output and can be established before a full execution by making a preliminary run in diagnostic mode.
  • Based on the maximum physical memory on a platform, the user can stipulate the maximum portion of this memory that can be allocated to the ABAQUS job. This is done in the "abaqus_v6.env" file that resides either in the subdirectory from which the job was launched or in the abaqus "site" subdirectory under the home installation directory.
  • Sometimes when running multiple cores on a single node, it is preferable from a performance standpoint to run in "smp" shared memory mode. This is specified using the "THREADS" option on the "mp_mode" line in the abaqus_v6.env file, as opposed to the "MPI" option on this line. The test case considered here illustrates this point.
  • The test cases for the ABAQUS standard module all have a substantial I/O component where 15% to 25% of the total run times are associated with I/O activity (primarily scratch files). Performance will be enhanced by using the fastest available drives and striping together more than one of them, or by using a high performance disk storage system with high performance interconnects. On Linux, excess memory can be used to cache and accelerate I/O.

See Also

Disclosure Statement

The following are trademarks or registered trademarks of Abaqus, Inc. or its subsidiaries in the United States and/or other countries: Abaqus, Abaqus/Standard, Abaqus/Explicit. All information on the ABAQUS website is Copyrighted 2004-2009 by Dassault Systemes. Results from http://www.simulia.com/support/v69/v69_performance.php as of October 12, 2009.

MCAE ANSYS faster on Sun F5100 and Sun X4270

Significance of Results

The Sun Storage F5100 Flash Array can greatly improve performance over internal hard disk drives as shown by the I/O intensive ANSYS MCAE application BMD benchmark tests on a Sun Fire X4270 server.

Select ANSYS 12 BMD benchmarks were run on a single Sun Fire X4270 server. These I/O intensive test cases were run to compare the performance of conventional high performance disk to Sun FlashFire technology.

The ANSYS 12.0 module is an MCAE application based on the finite element method of analysis (FEA). This computer based numerical method inherently involves a substantial I/O component. The purpose was to evaluate the performance of the Sun Storage F5100 Flash Array relative to high performance 15K RPM internal striped HDDs.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "BMD-4" test case by 67% in the 8-core/8-thread server configuration.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "BMD-7" test case by 18% in the 8-core/16-thread server configuration.

Performance Landscape

ANSYS 12 "BMD" Test Suite on Single X4270 (24GB mem.) - SMP Mode

Results are total elapsed run times in seconds

Test Case   SMP   4x15K RPM 72 GB SAS HDD   Sun F5100        Sun F5100
                  striped HW RAID0          r/w buff 4096    Performance
                                            striped          Advantage
bmd-4       8     523                       314              67%
bmd-7       16    357                       303              18%

Results and Configuration Summary

Hardware Configuration:
    Sun Fire X4270
      2 x 2.93 GHz QC Intel Xeon X5570 processors
      Hyperthreading enabled
      24 GB memory
      4 x 72 GB 15K RPM striped (HW RAID0) SAS disks
    Sun Storage F5100 Flash Array
      20 x 24 GB flash modules
      Intel controller

Software Configuration:

    O/S: 64-bit SUSE Linux Enterprise Server 10 SP 2
    Application: ANSYS Multiphysics 12.0
    Benchmark: ANSYS 12 "BMD" Benchmark Test Suite

Benchmark Description

ANSYS is a general purpose engineering analysis MCAE application that is based on the Finite Element Method. It performs both structural (stress) analysis and thermal analysis. These analyses may be either static or transient dynamic and can be linear or nonlinear as far as material behavior or deformations are concerned. Ansys provides a number of benchmark tests which exercise the capabilities of the software.

Please go here for a more complete description of the tests.

Key Points and Best Practices

Performance Considerations

The performance of Ansys (IO-intensive MCAE application) can be increased by reducing the IO demands of the application by increasing server memory or by using SSDs to increase the bandwidth and reduce the latency. The most I/O intensive case in the ANSYS distributed "BMD" test suite is BMD-4 particularly at the (maximum) 8 core level for a single node.


  • Ansys now takes full advantage of inexpensive RAID0 disk arrays and delivers sustained I/O rates.

  • Large memory can cache file accesses but often the size of ANSYS files grows much larger than the available physical memory so that system file caching is not able to hide the I/O cost.
  • For fast ANSYS runs the recommended configuration is a RAID 0 setup using 4 or more disks and a fast RAID controller. These fast I/O configurations are inexpensive to put together for systems and can achieve I/O rates in excess of 200 MB/sec.
  • SSD drives have much lower seek times, use less power, and tend to be about 2X faster than the fastest rotating disks for sustained throughput. The observed speed of a RAID 0 configuration of SSD drives for ANSYS simulations has been nearly as fast as I/O that is cached by large memory systems. SSD drives then may be the most affordable way to extend the capacity of a system to jobs that are too large to run in-core without incurring the performance penalty usually associated with I/O demands.

More About The ANSYS BMD "Distributed" Benchmarks

In the most recent release of the ANSYS benchmarks there are now two test suites: The SMP "BM" suite designed to run on a single node with multi processors and the DMP "BMD" suite intended to run on multi node clusters but which can also run on a single node in SMP mode as in this study.

  • The test cases from both ANSYS test suites all have a substantial I/O component where 15% to 20% of the total run times are associated with I/O activity (primarily scratch files). Performance will be enhanced by using the fastest available drives and striping together more than one of them, or by using a high performance disk storage system with high performance interconnects. When running with the SX64 build, employing ZFS may be worthwhile.
  • The ANSYS test cases do not scale very well (BMD better than BM), at best up to 8 cores.
  • The memory requirements for the test cases in the ANSYS BMD suite are greater than for the standard benchmark test suite. The requirements for the standard suite are modest, at less than 3GB.

See Also

MCAE, SSD, HPC, ANSYS, Linux, SuSE, Performance, X64, Intel

Disclosure Statement

The following are trademarks or registered trademarks of ANSYS, Inc., ANSYS Multiphysics TM. All information on the ANSYS website is Copyrighted by ANSYS, Inc. Results from http://www.ansys.com/services/ss-intel-bench120.htm as of October 12, 2009.

MCAE MSC/NASTRAN faster on Sun F5100 and Fire X4270

Significance of Results

The Sun Storage F5100 Flash Array can double performance over internal hard disk drives as shown by the I/O intensive MSC/Nastran MCAE application MDR3 benchmark tests on a Sun Fire X4270 server.

The MD Nastran MDR3 benchmarks were run on a single Sun Fire X4270 server. The I/O intensive test cases were run at different core levels from one up to the maximum of 8 available cores in SMP mode.

The MSC/Nastran MD 2008 R3 module is an MCAE application based on the finite element method of analysis (FEA). This computer based numerical method inherently involves a substantial I/O component. The purpose was to evaluate the performance of the Sun Storage F5100 Flash Array relative to high performance 15K RPM internal striped HDDs.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "xx0cmd2" test case by 107% in the 8-core server configuration.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "xl0tdf1" test case by 85% in the 8-core server configuration.

The MD Nastran MDR3 test suite was designed to include some very I/O intensive test cases, although some are not very scalable. These cases are called "xx0wmd0" and "xx0xst0". Both were run and results are presented using a single core server configuration.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "xx0xst0" test case by 33% in the single-core server configuration.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "xx0wmd0" test case by 20% in the single-core server configuration.

Performance Landscape

MD Nastran MDR3 Benchmark Tests

Results in seconds

Test Case   DMP   4x15K RPM 72 GB SAS HDD   Sun F5100        Sun F5100
                  striped HW RAID0          r/w buff 4096    Performance
                                            striped          Advantage
xx0cmd2     8     959                       463              107%
xl0tdf1     8     1104                      596              85%
xx0xst0     1     1307                      980              33%
xx0wmd0     1     20250                     16806            20%

Results and Configuration Summary

Hardware Configuration:
    Sun Fire X4270
      2 x 2.93 GHz QC Intel Xeon X5570 processors
      24 GB memory
      4 x 72 GB 15K RPM striped (HW RAID0) SAS disks
    Sun Storage F5100 Flash Array
      20 x 24 GB flash modules
      Intel controller

Software Configuration:

    O/S: 64-bit SUSE Linux Enterprise Server 10 SP 2
    Application: MSC/NASTRAN MD 2008 R3
    Benchmark: MDR3 Benchmark Test Suite
    HP MPI: 02.03.00.00 [7585] Linux x86-64

Benchmark Description

The benchmark tests are representative of typical MSC/Nastran applications including both SMP and DMP runs involving linear statics, nonlinear statics, and natural frequency extraction.

The MD (Multi Discipline) Nastran 2008 application performs both structural (stress) analysis and thermal analysis. These analyses may be either static or transient dynamic and can be linear or nonlinear as far as material behavior and/or deformations are concerned. The new release includes the MARC module for general purpose nonlinear analyses and the Dytran module that employs an explicit solver to analyze crash and high velocity impact conditions.

Please go here for a more complete description of the tests.

Key Points and Best Practices

  • Based on the maximum physical memory on a platform the user can stipulate the maximum portion of this memory that can be allocated to the Nastran job. This is done on the command line with the mem= option. On Linux based systems where the platform has a large amount of memory and where the model does not have large scratch I/O requirements the memory can be allocated to a tmpfs scratch space file system. On Solaris X64 systems advantage can be taken of ZFS for higher I/O performance.

  • The MD Nastran MDR3 test cases do not scale very well: a few do not scale at all, and the rest scale up to 8 cores at best.

  • The test cases for the MSC/Nastran module all have a substantial I/O component where 15% to 25% of the total run times are associated with I/O activity (primarily scratch files). The required scratch file size ranges from less than 1 GB on up to about 140 GB. Performance will be enhanced by using the fastest available drives and striping together more than one of them or using a high performance disk storage system, further enhanced as indicated here by implementing the Lustre based I/O system. High performance interconnects such as InfiniBand for inter node cluster message passing as well as I/O transfer from the storage system can also enhance performance substantially.

See Also

Disclosure Statement

MSC.Software is a registered trademark of MSC. All information on the MSC.Software website is copyrighted. MD Nastran MDR3 results from http://www.mscsoftware.com and this report as of October 12, 2009.

Friday Oct 09, 2009

X6275 Cluster Demonstrates Performance and Scalability on WRF 2.5km CONUS Dataset

Significance of Results

Results are presented for the Weather Research and Forecasting (WRF) code running on twelve Sun Blade X6275 server modules, housed in the Sun Blade 6048 chassis, using the 2.5 km CONUS benchmark dataset.

  • The Sun Blade X6275 cluster was able to achieve 373 GFLOP/s on the CONUS 2.5-KM Dataset.
  • The results demonstrate a 91% speedup efficiency, or 11x speedup, from 1 to 12 blades.
  • The current results were run with turbo on.

Performance Landscape

Performance is expressed in terms of "simulation speedup", which is the ratio of the simulated time step per iteration to the average wall clock time required to compute it. A larger number implies better performance.

The current results were run with turbo mode on.

WRF 3.0.1.1: Weather Research and Forecasting CONUS 2.5-KM Dataset

                                  Performance             Computation Rate        Speedup/Efficiency          Turbo On
                                  (Simulation Speedup)    (GFLOP/sec)             (vs. 1 blade)               Relative
# Blade  # Node  # Proc  # Core   Turbo On   Turbo Off    Turbo On   Turbo Off    Turbo On      Turbo Off     Perf
12       24      48      192      13.58      12.93        373.0      355.1        11.0 / 91%    10.4 / 87%    +6%
8        16      32      128      9.27       -            254.6      -            7.5 / 93%     -             -
6        12      24      96       7.03       6.60         193.1      181.3        5.7 / 94%     5.3 / 89%     +7%
4        8       16      64       4.74       -            130.2      -            3.8 / 96%     -             -
2        4       8       32       2.44       -            67.0       -            2.0 / 98%     -             -
1        2       4       16       1.24       1.24         34.1       34.1         1.0 / 100%    1.0 / 100%    +0%

Results and Configuration Summary

Hardware Configuration:

    Sun Blade 6048 Modular System
      12 x Sun Blade X6275 Server Modules, each with
        4 x 2.93 GHz Intel QC X5570 processors
        24 GB (6 x 4GB)
        QDR InfiniBand
        HT disabled in BIOS
        Turbo mode enabled in BIOS

Software Configuration:

    OS: SUSE Linux Enterprise Server 10 SP 2
    Compiler: PGI 7.2-5
    MPI Library: Scali MPI v5.6.4
    Benchmark: WRF 3.0.1.1
    Support Library: netCDF 3.6.3

Benchmark Description

The Weather Research and Forecasting (WRF) Model is a next-generation mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs. WRF is designed to be a flexible, state-of-the-art atmospheric simulation system that is portable and efficient on available parallel computing platforms. It features multiple dynamical cores, a 3-dimensional variational (3DVAR) data assimilation system, and a software architecture allowing for computational parallelism and system extensibility.

Dataset used:

    Single domain, large size 2.5KM Continental US (CONUS-2.5K)

    • 1501x1201x35 cell volume
    • 6hr, 2.5km resolution dataset from June 4, 2005
    • Benchmark is the final 3hr simulation for hrs 3-6 starting from a provided restart file; the benchmark may also be performed (but seldom reported) for the full 6hrs starting from a cold start
    • Iterations output at every 15 sec of simulation time, with the computation cost of each time step ~412 GFLOP

Key Points and Best Practices

  • Processes were bound to processors in round-robin fashion.
  • Model simulation time is 15 seconds per iteration as defined in input job deck. An achieved speedup of 2.67X means that each model iteration of 15s of simulation time was executed in 5.6s of real wallclock time (on average).
  • Computational requirements are 412 GFLOP per simulation time step as (measured empirically and) documented on the UCAR web site for this data model.
  • Model was run as single MPI job.
  • Benchmark was built and run as a pure-MPI variant. With larger process counts building and running WRF as a hybrid OpenMP/MPI variant may be more efficient.
  • Input and output (netCDF format) datasets can be very large for some WRF data models and run times will generally benefit by using a scalable filesystem. Performance with very large datasets (>5GB) can benefit by enabling WRF quilting of I/O across designated processors/servers. The master thread (or rank-0) performs most of the I/O (unless quilting specifies otherwise), with all processes potentially generating some I/O.
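
The relationships among simulation speedup, wall clock time per step, computation rate, and scaling efficiency described above can be sketched as follows. This is a minimal illustration using the 15 second time step and 412 GFLOP per step stated for this dataset; the constant and function names are illustrative and not part of WRF.

```python
STEP_SIM_SECONDS = 15.0  # simulated seconds per iteration (from the job deck)
STEP_GFLOP = 412.0       # computational cost of one time step


def wall_seconds_per_step(sim_speedup):
    """Average wall clock seconds needed to compute one 15 s time step."""
    return STEP_SIM_SECONDS / sim_speedup


def gflops(sim_speedup):
    """Computation rate in GFLOP/sec implied by a simulation speedup."""
    return STEP_GFLOP / wall_seconds_per_step(sim_speedup)


def scaling(sim_speedup, one_blade_speedup, blades):
    """Return (speedup vs. 1 blade, parallel efficiency)."""
    s = sim_speedup / one_blade_speedup
    return s, s / blades


print(f"2.67X  -> {wall_seconds_per_step(2.67):.1f} s of wall clock per step")
print(f"13.58X -> {gflops(13.58):.1f} GFLOP/sec")
s, e = scaling(13.58, 1.24, 12)
print(f"12 blades: {s:.1f}x speedup, {e:.0%} efficiency")
```

With the turbo-on figures from the Performance Landscape table (13.58 on 12 blades, 1.24 on 1 blade), this reproduces the reported 373 GFLOP/sec, 11x speedup, and 91% efficiency.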

See Also

Disclosure Statement

WRF, CONUS-2.5K, see http://www.mmm.ucar.edu/wrf/WG2/bench/, results as of 9/21/2009.

Monday Jul 06, 2009

Sun Blade 6048 Chassis with Sun Blade X6275: RADIOSS Benchmark Results

Significance of Results

The Sun Blade X6275 cluster, equipped with 2.93 GHz Intel QC X5570 processors and QDR InfiniBand interconnect, delivered the best performance at 32, 64 and 128 cores for the RADIOSS Neon_1M and Taurus_Frontal benchmarks.

  • Using half the nodes (16), the Sun Blade X6275 cluster was 3% faster than the 32-node SGI cluster running the Neon_1M test case.
  • In the 128-core configuration, the Sun Blade X6275 cluster was 49% faster than the SGI cluster running the Neon_1M test case.
  • In the 128-core configuration, the Sun Blade X6275 cluster was 16% faster than the top SGI cluster running the Taurus_Frontal test case.
  • At both the 32- and 64-core levels the Sun Blade X6275 cluster was 60% faster running the Neon_1M test case.
  • At both the 32- and 64-core levels the Sun Blade X6275 cluster was 4% faster running the Taurus_Frontal test case.

Performance Landscape


RADIOSS Public Benchmark Test Suite
  Results are Total Elapsed Run Times (sec.)

System                                             Cores   TAURUS_FRONTAL   NEON_1M   NEON_300K
                                                           1.8M             1.06M     277K

SGI Altix ICE 8200 IP95 2.93GHz, 32 nodes, DDR     256     3559             1672      310

Sun Blade X6275 2.93GHz, 16 nodes, QDR             128     4397             1627      361
SGI Altix ICE 8200 IP95 2.93GHz, 16 nodes, DDR     128     5033             2422      360

Sun Blade X6275 2.93GHz, 8 nodes, QDR              64      5934             2526      587
SGI Altix ICE 8200 IP95 2.93GHz, 8 nodes, DDR      64      6181             4088      584

Sun Blade X6275 2.93GHz, 4 nodes, QDR              32      9764             4720      1035
SGI Altix ICE 8200 IP95 2.93GHz, 4 nodes, DDR      32      10120            7574      1017

Results and Configuration Summary

Hardware Configuration:
    8 x Sun Blade X6275
    2x2.93 GHz Intel QC X5570 processors, turbo enabled (per half blade)
    24 GB (6 x 4GB 1333 MHz DDR3 dimms)
    InfiniBand QDR interconnects

Software Configuration:

    OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
    Application: RADIOSS V9.0 SP 1
    Benchmark: RADIOSS Public Benchmark Test Suite

Benchmark Description

Altair has provided a suite of benchmarks to demonstrate the performance of RADIOSS. The initial set of benchmarks provides four automotive crash models. Future updates will add in marine and aerospace applications, as well as including automotive NVH applications. The benchmarks use real data, requiring double precision computations and the parith feature (Parallel arithmetic algorithm) to obtain exactly the same results whatever the number of processors used.

Please go here for a more complete description of the tests.

Key Points and Best Practices

The Intel QC X5570 processors include a turbo boost feature coupled with a speed-step option in the CPU section of the Advanced BIOS settings. Under specific circumstances, this can provide CPU overclocking which increases the processor frequency from 2.93GHz to 3.2GHz. This feature was enabled when generating the results reported here.

Node to Node MPI ping-pong tests show a bandwidth of 3000 MB/sec on the Sun Blade X6275 cluster using QDR. The same tests performed on a Sun Fire X2270 cluster and equipped with DDR interconnect produced a bandwidth of 1500 MB/sec. On another recent Intel based Sun Fire X2250 cluster (3.4 GHz DC E5272 processors) also equipped with DDR interconnects, the bandwidth was 1250 MB/sec. This same Sun Fire X2250 cluster equipped with SDR IB interconnect produced an MPI ping-pong bandwidth of 975 MB/sec.

See Also

Current RADIOSS Benchmark Results:
http://www.altairhyperworks.com/Benchmark.aspx

Disclosure Statement

All information on the Altair website is Copyright 2009 Altair Engineering, Inc. All Rights Reserved. Results from http://www.altairhyperworks.com/Benchmark.aspx

Tuesday Jun 30, 2009

Sun Blade 6048 and Sun Blade X6275 NAMD Molecular Dynamics Benchmark beats IBM BlueGene/L

Significance of Results

A Sun Blade 6048 chassis with 12 Sun Blade X6275 server modules ran benchmarks using the NAMD molecular dynamics applications software. NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. NAMD was developed by the Theoretical and Computational Biophysics Group in the Beckman Institute for Advanced Science and Technology at the University of Illinois at Urbana-Champaign. NAMD is driven by major trends in computing and structural biology and received a 2002 Gordon Bell Award.
  • The cluster of 12 Sun Blade X6275 server modules was 6.2x faster than the 256-processor configuration of the IBM BlueGene/L.
  • The cluster of 12 Sun Blade X6275 server modules exhibited excellent scalability for NAMD molecular dynamics simulation, up to 10.6x speedup for 12 blades relative to 1 blade.
  • For the largest molecule considered, the cluster of 12 Sun Blade X6275 server modules achieved a throughput of 0.094 seconds per simulation step.
Molecular dynamics simulation is important to biological and materials science research. Molecular dynamics is used to determine the low energy conformations or shapes of a molecule. These conformations are presumed to be the biologically active conformations.

Performance Landscape

The NAMD Performance Benchmarks web page plots the performance of NAMD when the ApoA1 benchmark is executed on a variety of clusters. The performance is expressed in terms of the time in seconds required to execute one step of the molecular dynamics simulation, multiplied by the number of "processors" on which NAMD executes in parallel. The following table compares the performance of NAMD version 2.6 when executed on the Sun Blade X6275 cluster to the performance of NAMD as reported for several of the clusters on the web page. In this table, the performance is expressed in terms of the time in seconds required to execute one step of the molecular dynamics simulation; it is not multiplied by the number of "processors". A smaller number implies better performance.
Cluster Name and Interconnect    Throughput (seconds per step)
                                 128 Cores   192 Cores   256 Cores
Sun Blade X6275 InfiniBand       0.013       0.010       -
Cambridge Xeon/3.0 InfiniPath    0.016       -           0.0088
NCSA Xeon/2.33 InfiniBand        0.019       -           0.010
AMD Opteron/2.2 InfiniPath       0.025       -           0.015
IBM HPCx PWR4/1.7 Federation     0.039       -           0.021
SDSC IBM BlueGene/L MPI          0.108       -           0.062
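
The relationship between the two ways of expressing performance can be sketched as follows. This is a minimal illustration, not part of the benchmark: the function names are hypothetical, and the inputs are the seconds-per-step values listed above for the Sun Blade X6275 cluster and the SDSC IBM BlueGene/L.

```python
# Seconds per step, copied from the table above.
sun_x6275 = {128: 0.013, 192: 0.010}
bluegene_l = {128: 0.108, 256: 0.062}

def processor_seconds_per_step(seconds_per_step, processors):
    """Step time multiplied by processor count, the quantity plotted on the web page."""
    return seconds_per_step * processors

def times_faster(reference_seconds, candidate_seconds):
    """How many times faster the candidate is than the reference."""
    return reference_seconds / candidate_seconds

# Web-page style metric for the 128-core Sun result.
print(processor_seconds_per_step(sun_x6275[128], 128))

# 192 Sun cores versus the 256-processor BlueGene/L configuration.
print(times_faster(bluegene_l[256], sun_x6275[192]))
```

The second value corresponds to the 6.2x comparison quoted under Significance of Results.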

The following tables report results for NAMD molecular dynamics using a cluster of Sun Blade X6275 server modules. The performance of the cluster is expressed in terms of the time in seconds that is required to execute one step of the molecular dynamics simulation. A smaller number implies better performance.

Blades  Cores  STMV molecule (1)            f1 ATPase molecule (2)       ApoA1 molecule (3)
               Thruput      spdup  effi'cy  Thruput      spdup  effi'cy  Thruput      spdup  effi'cy
               (secs/step)                  (secs/step)                  (secs/step)
12      192    0.0941       10.6   88%      0.0270       9.1    76%      0.0102       8.1    68%
8       128    0.1322       7.5    94%      0.0317       7.7    97%      0.0131       6.3    79%
4       64     0.2656       3.7    94%      0.0610       4.0    101%     0.0204       4.1    102%
1       16     0.9952       1.0    100%     0.2454       1.0    100%     0.0829       1.0    100%

spdup - speedup versus 1 blade result
effi'cy - speedup efficiency versus 1 blade result

(1) Satellite Tobacco Mosaic Virus (STMV) molecule, 1,066,628 atoms, 12 Angstrom cutoff, Langevin dynamics, 500 time steps
(2) f1 ATPase molecule, 327,506 atoms, 11 Angstrom cutoff, particle mesh Ewald dynamics, 500 time steps
(3) ApoA1 molecule, 92,224 atoms, 12 Angstrom cutoff, particle mesh Ewald dynamics, 500 time steps
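
The spdup and effi'cy columns can be recomputed from the throughput values. The sketch below is an illustration only, with hypothetical function names; it assumes speedup is the 1-blade time divided by the N-blade time and efficiency is that speedup divided by N, and it uses the STMV throughput figures from the table.

```python
# Seconds per step for the STMV molecule, keyed by number of blades.
stmv_seconds_per_step = {1: 0.9952, 4: 0.2656, 8: 0.1322, 12: 0.0941}

def speedup(times, blades):
    """Speedup of `blades` blades relative to the 1-blade run."""
    return times[1] / times[blades]

def efficiency(times, blades):
    """Measured speedup as a fraction of the ideal (linear) speedup."""
    return speedup(times, blades) / blades

for blades in sorted(stmv_seconds_per_step, reverse=True):
    s = speedup(stmv_seconds_per_step, blades)
    e = efficiency(stmv_seconds_per_step, blades)
    print(f"{blades:2d} blades: speedup {s:4.1f}, efficiency {e:4.0%}")
```

Swapping in the f1 ATPase or ApoA1 throughput values gives the corresponding columns for those molecules.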

Results and Configuration Summary

Hardware Configuration

  • Sun Blade[tm] 6048 Modular System with one shelf configured with
    • 12 x Sun Blade X6275, each with
      • 2 x (2 x 2.93 GHz Intel QC Xeon X5570 processors)
      • 2 x (24 GB memory)
      • Hyper-Threading (HT) off, Turbo Mode on

Software Configuration

  • SUSE Linux Enterprise Server 10 SP2 kernel version 2.6.16.60-0.31_lustre.1.8.0.1-smp
  • Scali MPI 5.6.6
  • gcc 4.1.2 (1/15/2007), gfortran 4.1.2 (1/15/2007)

Key Points and Best Practices

  • Models with large numbers of atoms scale better than models with small numbers of atoms.

About the Sun Blade X6275

The Intel QC X5570 processors include a turbo boost feature coupled with a speed-step option in the CPU section of the Advanced BIOS settings. Under specific circumstances, this can provide CPU overclocking, which increases the processor frequency from 2.93GHz to 3.2GHz. This feature was enabled when generating the results reported here.

Benchmark Description

Molecular dynamics simulation is widely used in biological and materials science research. NAMD is a public-domain molecular dynamics software application for which a variety of molecular input directories are available. Three of these directories define:
  • the Satellite Tobacco Mosaic Virus (STMV) that comprises 1,066,628 atoms
  • the f1 ATPase enzyme that comprises 327,506 atoms
  • the ApoA1 enzyme that comprises 92,224 atoms
Each input directory also specifies the type of molecular dynamics simulation to be performed, for example, Langevin dynamics with a 12 Angstrom cutoff for 500 time steps, or particle mesh Ewald dynamics with an 11 Angstrom cutoff for 500 time steps.

See Also

Disclosure Statement

NAMD, see http://www.ks.uiuc.edu/Research/namd/performance.html for more information, results as of 6/26/2009.

Friday Jun 26, 2009

Sun Fire X2270 Cluster Fluent Benchmark Results

Significance of Results

A Sun Fire X2270 cluster equipped with 2.93 GHz QC Intel X5570 processors and DDR InfiniBand interconnects delivered outstanding performance running the FLUENT benchmark test suite.

  • The Sun Fire X2270 cluster running on 64 cores delivered the best performance for the 3 largest test cases. On the "truck" workload, Sun was 14% faster than the SGI Altix ICE 8200.
  • The Sun Fire X2270 cluster running on 32 cores delivered the best performance for 5 of the 6 test cases.
  • The Sun Fire X2270 cluster running on 16 cores beat all comers in all 6 test cases.

Performance Landscape


New FLUENT Benchmark Test Suite
  Results are "Ratings" (bigger is better)
  Rating = No. of sequential runs of test case possible in 1 day = 86,400/(Total Elapsed Run Time in Seconds)
  Results ordered by truck_poly column

Benchmark Test Case (columns after cores)
System (1) cores eddy_417k turbo_500k aircraft_2m sedan_4m truck_14m truck_poly_14m

Sun Fire X2270, 8 node 64 4645.2 23671.2 3445.7 4909.1 566.9 494.8
Intel Endeavor, 8 node 64 5016.0 25226.3 5220.5 4614.2 513.4 490.9
SGI Altix ICE 8200 IP95, 8 node 64 5142.9 23834.5 4614.2 4352.6 496.8 479.2

Sun Fire X2270, 4-node 32 2971.6 13824.0 3074.7 2644.2 291.8 271.8
Intel Endeavor, 4-node 32 2856.2 13041.5 2837.4 2465.0 266.4 251.2
SGI Altix ICE 8200 IP95, 4-node 32 3083.0 13190.8 2563.8 2405.0 266.6 246.5
Sun Fire X2250, 8-node 32 2095.8 9600.0 1844.2 1394.1 203.2 196.8

Sun Fire X2270, 2-node 16 1726.3 7595.6 1520.5 1363.3 145.5 141.8
SGI Altix ICE 8200 IP95, 2-node 16 1708.4 7384.6 1507.9 1211.8 128.8 133.5
Intel Endeavor, 2-node 16 1585.3 7125.8 1428.1 1278.6 134.7 132.5
Sun Fire X2250, 4-node 16 1404.9 6249.5 1324.6 996.3 127.7 129.2

Sun Fire X2270, 1-node 8 945.8 4129.0 883.0 682.5 73.5 72.4
SGI Altix ICE 8200 IP95, 1-node 8 953.1 4032.7 843.3 651.0 71.4 72.0
Sun Fire X2250, 2-node 8 824.2 3248.1 711.4 517.9 66.1 67.9

SGI Altix ICE 8200 IP95, 1-node 4 561.6 2416.8 526.9 412.6 40.9 40.8
Sun Fire X2270, 1-node 4 541.5 2346.2 515.7 409.3 40.8 40.2
Sun Fire X2250, 1-node 4 449.2 1691.6 389.0 271.8 33.6 34.9

Sun Fire X2270, 1-node 2 292.8 1282.4 283.4 223.1 20.9 21.2
SGI Altix ICE 8200 IP95, 1-node 2 294.2 1302.7 289.0 226.4 20.5 21.2
Sun Fire X2250, 1-node 2 224.4 881.0 197.9 134.4 16.3 17.6

Sun Fire X2270, 1-node 1 150.7 658.3 143.2 110.1 10.2 10.6
SGI Altix ICE 8200 IP95, 1-node 1 153.3 677.5 147.3 111.2 10.3 9.5
Sun Fire X2250, 1-node 1 115.4 458.2 100.1 66.6 8.0 9.0

Sun Fire X2270, 1-node serial 151.4 656.7 151.3 107.1 9.3 10.1
Intel Endeavor, 1-node serial 146.6 650.0 150.2 105.6 8.8 9.7
Sun Fire X2250, 1-node serial 115.2 461.7 101.0 65.0 7.2 9.0

(1) SGI Altix ICE 8200, X5570 QC 2.93GHz, DDR
Intel Endeavor, X5560 QC 2.8GHz, DDR
Sun Fire X2250, X5272 DC 3.4GHz, DDR IB
Sun Fire X2270, X5570 QC 2.93GHz, DDR
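
The Rating definition given above the table can be turned into a small calculation. This is a sketch with hypothetical function names, not FLUENT tooling; it converts between elapsed time and Rating and reproduces the percentage comparison for the 64-core "truck" results of the Sun Fire X2270 and SGI Altix ICE 8200 clusters.

```python
SECONDS_PER_DAY = 86_400

def rating(elapsed_seconds):
    """Number of sequential runs of a test case that fit in one day."""
    return SECONDS_PER_DAY / elapsed_seconds

def elapsed_seconds(rating_value):
    """Total elapsed run time implied by a Rating."""
    return SECONDS_PER_DAY / rating_value

def percent_faster(rating_a, rating_b):
    """How much faster system A is than system B, in percent."""
    return (rating_a / rating_b - 1.0) * 100.0

# 64-core "truck" Ratings from the table.
sun_truck_64 = 566.9
sgi_truck_64 = 496.8

print(f"Sun elapsed time: {elapsed_seconds(sun_truck_64):.1f} s")
print(f"Sun vs SGI: {percent_faster(sun_truck_64, sgi_truck_64):.0f}% faster")
```

Because a Rating is inversely proportional to elapsed time, a ratio of Ratings is the same as the inverse ratio of run times.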

Results and Configuration Summary

Hardware Configuration

    8 x Sun Fire X2270 (each with)
    2 x 2.93GHz Intel X5570 QC processors (Nehalem)
    1333 MHz DDR3 dimms
    Infiniband (Voltaire) DDR interconnects & DDR switch, IB

Software Configuration

    OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
    Interconnect software: Voltaire OFED GridStack-5.1.3.1_5
    Application: FLUENT Beta V12.0.15
    Benchmark: FLUENT "6.3" Benchmark Test Suite

Benchmark Description

The benchmark tests are representative of typical large user CFD models intended for execution in distributed memory processor (DMP) mode over a cluster of multi-processor platforms.

Please go here for a more complete description of the tests.

Key Points and Best Practices

Observations About the Results

The Sun Fire X2270 cluster delivered excellent performance, especially shining with the larger problems (truck and truck_poly).

These processors include a turbo boost feature coupled with a speedstep option in the CPU section of the Advanced BIOS settings. This, under specific circumstances, can provide a cpu upclocking, temporarily increasing the processor frequency from 2.93GHz to 3.2GHz.

Memory placement is a very significant factor with Nehalem processors. Current Nehalem platforms have two sockets. Each socket has three memory channels, and each channel has 3 bays for DIMMs. For example, if one DIMM is placed in the 1st bay of each of the 3 channels, the DIMM speed will be 1333 MHz with the X5570s. Altering the DIMM arrangement to an off-balance configuration, say by adding just one more DIMM into the 2nd bay of one channel, will cause the DIMM frequency to drop from 1333 MHz to 1067 MHz.

About the FLUENT "6.3" Benchmark Test Suite

The FLUENT application performs computational fluid dynamics analysis on a variety of different types of flow and allows for chemically reacting species. Analyses can be transient dynamic and can be linear or nonlinear.

  • CFD models tend to be very large where grid refinement is required to capture accurately the conditions in the boundary layer region adjacent to the body over which flow is occurring. Fine grids are also required to determine accurate turbulence conditions. As such, these models can run for many hours or even days, even when using a large number of processors.
  • CFD models typically scale very well and are well suited for execution on clusters. The FLUENT "6.3" benchmark test cases scale well, particularly up to 64 cores.
  • The memory requirements for the test cases in the new FLUENT "6.3" benchmark test suite range from a few hundred megabytes to about 25 GB. As the job is distributed over multiple nodes, the memory requirements per node are correspondingly reduced.
  • The benchmark test cases for the FLUENT module do not have a substantial I/O component. However, performance will be enhanced very substantially by using high performance interconnects such as InfiniBand for inter-node cluster message passing. This nodal message passing data can be stored locally on each node or on a shared file system.
  • Because of the large amount of inter-node message passing, performance can be further enhanced, by more than a factor of 3 as indicated here, by implementing the Lustre-based shared file I/O system.

See Also

Current FLUENT "12.0 Beta" Benchmark:
http://www.fluent.com/software/fluent/fl6bench/fl6bench_6.4.x/

Disclosure Statement

All information on the Fluent website is Copyrighted 1995-2009 by Fluent Inc. Results from http://www.fluent.com/software/fluent/fl6bench/ as of June 9, 2009 and this presentation.

Tuesday Jun 23, 2009

New CPU2006 Records: 3x better integer throughput, 9x better fp throughput

Significance of Results

A Sun Constellation system, composed of 48 Sun Blade X6440 server modules in a Sun Blade 6048 chassis, running OpenSolaris 2008.11 and using the Sun Studio 12 Update 1 compiler delivered World Record SPEC CPU2006 rate results.

On the SPECint_rate_base2006 benchmark, Sun delivered 4.7 times more performance than the IBM Power 595 (5GHz POWER6); this IBM system requires a slightly larger cabinet than the Sun Blade 6048 chassis (details below).

On the SPECfp_rate_base2006 benchmark, Sun delivered 3.9 times more performance than the largest IBM Power 595 (5GHz POWER6); this IBM system requires a slightly larger cabinet than the Sun Blade 6048 chassis (details below).

  • The Sun Constellation System equipped with AMD Opteron QC 8384 2.7 GHz processors, running OpenSolaris 2008.11 and using the Sun Studio 12 update 1 compiler, delivered the World Record SPECint_rate_base2006 score of 8840.
  • This SPECint_rate_base2006 score beat the previous record holding score by over three times.
  • The Sun Constellation System equipped with AMD Opteron QC 8384 2.7 GHz processors, running OpenSolaris 2008.11 and using the Sun Studio 12 update 1 compiler, delivered the fastest x86 SPECfp_rate_base2006 score of 6500.
  • This SPECfp_rate_base2006 score beat the previous x86 record holding score by nine times.

Performance Landscape

SPEC CPU2006 Performance Charts - bigger is better, selected results, please see www.spec.org for complete results.

SPECint_rate2006

System                              Processors                         Performance Results Notes (1)
                                    Type           GHz   Chips  Cores  Peak      Base
Sun Blade 6048                      Opteron 8384   2.7   192    768    -         8840      New Record
SGI Altix 4700 Density System       Itanium 9150M  1.66  128    256    3354      2893      Previous Best
SGI Altix 4700 Bandwidth System     Itanium2 9040  1.6   128    256    2971      2715
Fujitsu/Sun SPARC Enterprise M9000  SPARC64 VII    2.52  64     256    2290      2090
IBM Power 595                       POWER6         5.0   32     64     2160      1870      Best POWER6
(1) Results as of 23 June 2009 from www.spec.org.

SPECfp_rate2006

System                              Processors                         Performance Results Notes (2)
                                    Type           GHz   Chips  Cores  Peak      Base
SGI Altix 4700 Density System       Itanium 9140M  1.66  512    1024   -         10580
Sun Blade 6048                      Opteron 8384   2.7   192    768    -         6500      New x86 Record
SGI Altix 4700 Bandwidth System     Itanium2 9040  1.6   128    256    3507      3419
IBM Power 595                       POWER6         5.0   32     64     2184      1681      Best POWER6
Fujitsu/Sun SPARC Enterprise M9000  SPARC64 VII    2.52  64     256    2005      1861
SGI Altix 4700 Bandwidth System     Itanium 9150M  1.66  128    256    1947      1832
SGI Altix ICE 8200EX                Intel X5570    2.93  8      32     742       723

(2) Results as of 23 June 2009 from www.spec.org.


Results and Configuration Summary

Hardware Configuration:
    1 x Sun Blade 6048
      48 x Sun Blade X6440, each with
        4 x 2.7 GHz QC AMD Opteron 8384 processors
        32 GB, (8 x 4GB)

Software Configuration:

    O/S: OpenSolaris 2008.11
    Compiler: Sun Studio 12 Update 1
    Other SW: MicroQuill SmartHeap Library 9.01 x64
    Benchmark: SPEC CPU2006 V1.1

Key Points and Best Practices

The Sun Blade 6048 chassis is able to contain a variety of server modules. In this case, the Sun Blade X6440 was used to provide this capacity solution. This single rack delivered results which have not been seen in this form factor.

Running this many jobs requires a reasonably good file server for the directory where the benchmark is run. A Sun Fire X4540 server provided the required disk space, which the blades accessed over NFS.

Sun has shown 4.7x greater SPECint_rate_base2006 and 3.9x greater SPECfp_rate_base2006 in a slightly smaller cabinet. IBM specifications are at: http://www-03.ibm.com/systems/power/hardware/595/specs.html. One frame (slimline doors): 79.3"H x 30.5"W x 58.5"D weight: 3,376 lb. One frame (acoustic doors): 79.3"H x 30.5"W x 71.1"D weight: 3,422 lb. The Sun Blade 6048 specifications are at: http://www.sun.com/servers/blades/6048chassis/specs.xml One Sun Blade 6048: 81.6"H x 23.9"W x 40.3"D weight: 2,300 lb (fully configured). 
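
The multiples quoted in this post are simple ratios of base scores. The sketch below recomputes them from the scores in the tables and the disclosure statement. One assumption is labeled in the code: the 723 score of the SGI Altix ICE 8200EX is taken to be the previous x86 SPECfp_rate_base2006 record, which the tables suggest but do not state outright. The dictionary keys and function name are illustrative only.

```python
# Base scores taken from the tables in this post.
int_rate_base = {"Sun Blade 6048": 8840, "IBM Power 595": 1870, "previous record": 2893}
fp_rate_base = {
    "Sun Blade 6048": 6500,
    "IBM Power 595": 1681,
    "previous x86 record": 723,  # assumption: the SGI Altix ICE 8200EX result
}

def ratio(scores, system, reference):
    """Score of `system` expressed as a multiple of `reference`."""
    return scores[system] / scores[reference]

print(f"int vs Power 595:       {ratio(int_rate_base, 'Sun Blade 6048', 'IBM Power 595'):.1f}x")
print(f"fp  vs Power 595:       {ratio(fp_rate_base, 'Sun Blade 6048', 'IBM Power 595'):.1f}x")
print(f"int vs previous record: {ratio(int_rate_base, 'Sun Blade 6048', 'previous record'):.1f}x")
print(f"fp  vs previous x86:    {ratio(fp_rate_base, 'Sun Blade 6048', 'previous x86 record'):.1f}x")
```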

Disclosure Statement:

SPEC, SPECint, SPECfp reg tm of Standard Performance Evaluation Corporation. Results from www.spec.org as of 6/22/2009 and this report. Sun Blade 6048 chassis with Sun Blade X6440 server modules (48 nodes with 4 chips, 16 cores, 16 threads each, OpenSolaris 2008.11, Studio 12 update 1) - 8840 SPECint_rate_base2006, 6500 SPECfp_rate_base2006; IBM p595, 1870 SPECint_rate_base2006, 1681 SPECfp_rate_base2006.

See Also

About

BestPerf is the source of Oracle performance expertise. In this blog, Oracle's Strategic Applications Engineering group explores Oracle's performance results and shares best practices learned from working on Enterprise-wide Applications.
