Monday Oct 12, 2009

MCAE ABAQUS faster on Sun F5100 and Sun X4270 - Single Node World Record

The Sun Storage F5100 Flash Array can substantially improve performance over internal hard disk drives, as shown by the I/O-intensive ABAQUS "Standard" MCAE benchmark tests run on a Sun Fire X4270 server.

The I/O-intensive ABAQUS "Standard" benchmark test cases were run on a single Sun Fire X4270 server. Data is presented for runs at both 8 and 16 thread counts.

The ABAQUS "Standard" module is an MCAE application based on the finite element method of analysis (FEA). This computer-based numerical method inherently involves a substantial I/O component. The purpose of the tests was to evaluate the performance of the Sun Storage F5100 Flash Array relative to high performance 15K RPM internal striped HDDs.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "S4b" test case by 14%.

  • The Sun Fire X4270 server coupled with a Sun Storage F5100 Flash Array established the world record performance on a single node for the four test cases S2A, S4B, S4D and S6.

Performance Landscape

ABAQUS "Standard" Benchmark Test S4B: Advantage of Sun Storage F5100

Results are total elapsed run times in seconds

Threads   4x15K RPM          Sun F5100        Sun F5100
          72 GB SAS HDD      r/w buff 4096    Performance
          striped HW RAID0   striped          Advantage
8         1504               1318             14%
16        1811               1649             10%
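
The "Performance Advantage" column can be reproduced from the two elapsed-time columns. The sketch below assumes the advantage is the ratio of HDD time to F5100 time, minus one; that formula is not stated in the post, but it matches the published 14% and 10% figures. The function name is illustrative only.

```python
def advantage_percent(baseline_seconds, improved_seconds):
    """Percent by which the improved run is faster than the baseline run.

    Assumed formula: (baseline / improved - 1) * 100.
    """
    return (baseline_seconds / improved_seconds - 1.0) * 100.0


# Elapsed times in seconds from the S4B table: threads -> (HDD RAID0, F5100)
s4b_runs = {8: (1504, 1318), 16: (1811, 1649)}

for threads, (hdd_seconds, flash_seconds) in s4b_runs.items():
    gain = advantage_percent(hdd_seconds, flash_seconds)
    print(f"{threads} threads: F5100 advantage {gain:.0f}%")
```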

ABAQUS Standard Server Benchmark Subset: Single Node Record Performance

Results are total elapsed run times in seconds

Platform         Cores   S2a   S4b    S4d    S6
X4270 w/F5100    8       302   1192   779    1237
HP BL460c G6     8       324   1309   843    1322
X4270 w/F5100    4       552   1970   1181   1706
HP BL460c G6     4       561   2062   1234   1812

Results and Configuration Summary

Hardware Configuration:
    Sun Fire X4270
      2 x 2.93 GHz QC Intel Xeon X5570 processors
      Hyperthreading enabled
      24 GB memory
      4 x 72 GB 15K RPM striped (HW RAID0) SAS disks
    Sun Storage F5100 Flash Array
      20 x 24 GB flash modules
      Intel controller

Software Configuration:

    O/S: 64-bit SUSE Linux Enterprise Server 10 SP 2
    Application: ABAQUS V6.9-1 Standard Module
    Benchmark: ABAQUS Standard Benchmark Test Suite

Benchmark Description

Abaqus/Standard Benchmark Problems

These problems provide an estimate of the performance that can be expected when running Abaqus/Standard or similar commercially available MCAE (FEA) codes like ANSYS and MSC/Nastran on different computers. The jobs are representative of those typically analyzed by Abaqus/Standard and other MCAE applications. These analyses include linear statics, nonlinear statics, and natural frequency extraction.

Please go here for a more complete description of the tests.

Key Points and Best Practices

  • The memory requirements for the test cases in the ABAQUS Standard benchmark test suite are rather substantial, with some of the test cases requiring slightly over 20GB of memory. There are two memory limits. One is a minimum; out-of-core "memory" is used when this limit is exceeded, which costs more CPU time. The other is a maximum memory limit that minimizes I/O operations. Both limits are given in the ABAQUS output and can be established with a preliminary diagnostic mode run before making a full execution.
  • Based on the maximum physical memory on a platform the user can stipulate the maximum portion of this memory that can be allocated to the ABAQUS job. This is done in the "abaqus_v6.env" file that either resides in the subdirectory from where the job was launched or in the abaqus "site" subdirectory under the home installation directory.
  • Sometimes when running multiple cores on a single node, it is preferable from a performance standpoint to run in "smp" shared memory mode. This is specified using the "THREADS" option on the "mp_mode" line in the abaqus_v6.env file, as opposed to the "MPI" option on this line. The test case considered here illustrates this point.
  • The test cases for the ABAQUS Standard module all have a substantial I/O component, where 15% to 25% of the total run time is associated with I/O activity (primarily scratch files). Performance will be enhanced by using the fastest available drives and striping more than one of them together, or by using a high performance disk storage system with high performance interconnects. On Linux, excess memory can be used to cache and accelerate I/O.
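
The two abaqus_v6.env settings described above (the memory cap and the choice between THREADS and MPI) can be sketched as a small generator. This is an illustration under stated assumptions, not a tested configuration. The setting names `memory` and `mp_mode`, and the "mb" unit string, are written from memory and should be checked against the documentation for the installed ABAQUS release. The 24 GB figure comes from the hardware configuration above; the 0.9 fraction and the function name are hypothetical.

```python
def render_abaqus_env(physical_gb, fraction, shared_memory=True):
    """Return illustrative abaqus_v6.env text.

    Assumptions: the setting names `memory` and `mp_mode` and the
    "mb" unit string are not confirmed by the post; verify them
    against the ABAQUS documentation before use.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Portion of physical memory the ABAQUS job may allocate, in megabytes.
    memory_mb = int(physical_gb * 1024 * fraction)
    # THREADS selects shared memory ("smp") mode; MPI selects message passing.
    mode = "THREADS" if shared_memory else "MPI"
    return f'memory = "{memory_mb} mb"\nmp_mode = {mode}\n'


# 24 GB node from the hardware configuration; 0.9 is an arbitrary example.
print(render_abaqus_env(24, 0.9))
```

The generated text would be placed in the abaqus_v6.env file in the job launch directory or in the "site" subdirectory, as described in the bullets above.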


Disclosure Statement

The following are trademarks or registered trademarks of Abaqus, Inc. or its subsidiaries in the United States and/or other countries: Abaqus, Abaqus/Standard, Abaqus/Explicit. All information on the ABAQUS website is Copyrighted 2004-2009 by Dassault Systemes. Results from http://www.simulia.com/support/v69/v69_performance.php as of October 12, 2009.

Friday Jun 26, 2009

Sun Fire X2270 Cluster Fluent Benchmark Results

Significance of Results

A Sun Fire X2270 cluster equipped with 2.93 GHz QC Intel X5570 processors and DDR Infiniband interconnects delivered outstanding performance running the FLUENT benchmark test suite.

  • The Sun Fire X2270 cluster running at 64-cores delivered the best performance for the 3 largest test cases. On the "truck" workload Sun was 14% faster than SGI Altix ICE 8200.
  • The Sun Fire X2270 cluster running at 32-cores delivered the best performance for 5 of the 6 test cases.
  • The Sun Fire X2270 cluster running at 16-cores beat all comers in all 6 test cases.

Performance Landscape


New FLUENT Benchmark Test Suite
  Results are "Ratings" (bigger is better)
  Rating = number of sequential runs of the test case possible in 1 day = 86,400 / (Total Elapsed Run Time in Seconds)
  Results ordered by truck_poly column

                                          Benchmark Test Case
System (1)                        cores   eddy     turbo     aircraft  sedan    truck   truck_poly
                                          417k     500k      2m        4m       14m     14m

Sun Fire X2270, 8-node            64      4645.2   23671.2   3445.7    4909.1   566.9   494.8
Intel Endeavor, 8-node            64      5016.0   25226.3   5220.5    4614.2   513.4   490.9
SGI Altix ICE 8200 IP95, 8-node   64      5142.9   23834.5   4614.2    4352.6   496.8   479.2

Sun Fire X2270, 4-node            32      2971.6   13824.0   3074.7    2644.2   291.8   271.8
Intel Endeavor, 4-node            32      2856.2   13041.5   2837.4    2465.0   266.4   251.2
SGI Altix ICE 8200 IP95, 4-node   32      3083.0   13190.8   2563.8    2405.0   266.6   246.5
Sun Fire X2250, 8-node            32      2095.8   9600.0    1844.2    1394.1   203.2   196.8

Sun Fire X2270, 2-node            16      1726.3   7595.6    1520.5    1363.3   145.5   141.8
SGI Altix ICE 8200 IP95, 2-node   16      1708.4   7384.6    1507.9    1211.8   128.8   133.5
Intel Endeavor, 2-node            16      1585.3   7125.8    1428.1    1278.6   134.7   132.5
Sun Fire X2250, 4-node            16      1404.9   6249.5    1324.6    996.3    127.7   129.2

Sun Fire X2270, 1-node            8       945.8    4129.0    883.0     682.5    73.5    72.4
SGI Altix ICE 8200 IP95, 1-node   8       953.1    4032.7    843.3     651.0    71.4    72.0
Sun Fire X2250, 2-node            8       824.2    3248.1    711.4     517.9    66.1    67.9

SGI Altix ICE 8200 IP95, 1-node   4       561.6    2416.8    526.9     412.6    40.9    40.8
Sun Fire X2270, 1-node            4       541.5    2346.2    515.7     409.3    40.8    40.2
Sun Fire X2250, 1-node            4       449.2    1691.6    389.0     271.8    33.6    34.9

Sun Fire X2270, 1-node            2       292.8    1282.4    283.4     223.1    20.9    21.2
SGI Altix ICE 8200 IP95, 1-node   2       294.2    1302.7    289.0     226.4    20.5    21.2
Sun Fire X2250, 1-node            2       224.4    881.0     197.9     134.4    16.3    17.6

Sun Fire X2270, 1-node            1       150.7    658.3     143.2     110.1    10.2    10.6
SGI Altix ICE 8200 IP95, 1-node   1       153.3    677.5     147.3     111.2    10.3    9.5
Sun Fire X2250, 1-node            1       115.4    458.2     100.1     66.6     8.0     9.0

Sun Fire X2270, 1-node            serial  151.4    656.7     151.3     107.1    9.3     10.1
Intel Endeavor, 1-node            serial  146.6    650.0     150.2     105.6    8.8     9.7
Sun Fire X2250, 1-node            serial  115.2    461.7     101.0     65.0     7.2     9.0

(1) SGI Altix ICE 8200, X5570 QC 2.93GHz, DDR
    Intel Endeavor, X5560 QC 2.8GHz, DDR
    Sun Fire X2250, X5272 DC 3.4GHz, DDR IB
    Sun Fire X2270, X5570 QC 2.93GHz, DDR
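
The Rating definition given with the table can be inverted to recover elapsed run times, and two ratings can be compared directly. The sketch below uses the 64-core "truck" ratings from the table; the function names are illustrative and not part of the FLUENT benchmark tooling.

```python
SECONDS_PER_DAY = 86_400


def rating(elapsed_seconds):
    """Sequential runs possible in one day for a given elapsed time."""
    return SECONDS_PER_DAY / elapsed_seconds


def elapsed_seconds(rating_value):
    """Elapsed run time in seconds implied by a published rating."""
    return SECONDS_PER_DAY / rating_value


# 64-core "truck" ratings taken from the table
sun_x2270 = 566.9
sgi_ice_8200 = 496.8

print(f"Sun X2270 truck run time: {elapsed_seconds(sun_x2270):.1f} s")
print(f"SGI ICE 8200 truck run time: {elapsed_seconds(sgi_ice_8200):.1f} s")
print(f"Sun advantage: {(sun_x2270 / sgi_ice_8200 - 1) * 100:.0f}%")
```

Because the rating is inversely proportional to elapsed time, the ratio of two ratings equals the inverse ratio of the run times, which is how the 14% figure for the "truck" workload follows from the table.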

Results and Configuration Summary

Hardware Configuration

    8 x Sun Fire X2270 (each with)
    2 x 2.93GHz Intel X5570 QC processors (Nehalem)
    1333 MHz DDR3 DIMMs
    Infiniband (Voltaire) DDR interconnects and DDR IB switch

Software Configuration

    OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
    Interconnect software: Voltaire OFED GridStack-5.1.3.1_5
    Application: FLUENT Beta V12.0.15
    Benchmark: FLUENT "6.3" Benchmark Test Suite

Benchmark Description

The benchmark tests are representative of typical large user CFD models intended for execution in distributed memory processor (DMP) mode over a cluster of multi-processor platforms.

Please go here for a more complete description of the tests.

Key Points and Best Practices

Observations About the Results

The Sun Fire X2270 cluster delivered excellent performance, especially shining with the larger problems (truck and truck_poly).

These processors include a turbo boost feature coupled with a speedstep option in the CPU section of the Advanced BIOS settings. Under specific circumstances, this can temporarily raise the processor frequency from 2.93GHz to 3.2GHz.

Memory placement is a very significant factor with Nehalem processors. Current Nehalem platforms have two sockets. Each socket has three memory channels and each channel has 3 bays for DIMMs. For example, if one DIMM is placed in the 1st bay of each of the 3 channels, the DIMM speed will be 1333 MHz with the X5570s. Altering the DIMM arrangement to an unbalanced configuration, say by adding just one more DIMM into the 2nd bay of one channel, will cause the DIMM frequency to drop from 1333 MHz to 1067 MHz.

About the FLUENT "6.3" Benchmark Test Suite

The FLUENT application performs computational fluid dynamics (CFD) analysis on a variety of different types of flow and allows for chemically reacting species. Analyses can be transient dynamic and can be linear or nonlinear.

  • CFD models tend to be very large where grid refinement is required to accurately capture conditions in the boundary layer region adjacent to the body over which flow is occurring. Fine grids are also required to determine accurate turbulence conditions. As such, these models can run for many hours or even days, even when using a large number of processors.
  • CFD models typically scale very well and are well suited for execution on clusters. The FLUENT "6.3" benchmark test cases scale well, particularly up to 64 cores.
  • The memory requirements for the test cases in the new FLUENT "6.3" benchmark test suite range from a few hundred megabytes to about 25 GB. As the job is distributed over multiple nodes, the memory requirements per node are correspondingly reduced.
  • The benchmark test cases for the FLUENT module do not have a substantial I/O component. However, performance will be enhanced very substantially by using high performance interconnects such as Infiniband for inter-node cluster message passing. This nodal message passing data can be stored locally on each node or on a shared file system.
  • As a result of the large amount of inter-node message passing, performance can be further enhanced by more than a 3x factor, as indicated here, by implementing the Lustre based shared file I/O system.
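
The scaling claim in the bullets above can be checked against the Performance Landscape table. The sketch below takes the Sun Fire X2270 "truck" ratings at each core count and computes speedup and parallel efficiency relative to the 1-core rating. Using the 1-core result as the baseline (rather than the serial result) is an assumption made for this illustration, and the function name is hypothetical.

```python
def speedup_and_efficiency(base_rating, scaled_rating, cores):
    """Speedup over the 1-core rating and efficiency per core.

    Ratings are proportional to 1 / elapsed time, so the ratio of two
    ratings is the speedup.
    """
    speedup = scaled_rating / base_rating
    return speedup, speedup / cores


# Sun Fire X2270 "truck" ratings from the table: cores -> rating
truck_ratings = {1: 10.2, 8: 73.5, 16: 145.5, 32: 291.8, 64: 566.9}

base = truck_ratings[1]
for cores, value in truck_ratings.items():
    speedup, efficiency = speedup_and_efficiency(base, value, cores)
    print(f"{cores:>2} cores: speedup {speedup:5.1f}, efficiency {efficiency:.0%}")
```

On these figures the efficiency stays in the high 80% range out to 64 cores, which is consistent with the statement that the test cases scale well up to that size.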

See Also

Current FLUENT "12.0 Beta" Benchmark:
http://www.fluent.com/software/fluent/fl6bench/fl6bench_6.4.x/

Disclosure Statement

All information on the Fluent website is Copyrighted 1995-2009 by Fluent Inc. Results from http://www.fluent.com/software/fluent/fl6bench/ as of June 9, 2009 and this presentation.

About

BestPerf is the source of Oracle performance expertise. In this blog, Oracle's Strategic Applications Engineering group explores Oracle's performance results and shares best practices learned from working on Enterprise-wide Applications.
