Thursday Sep 15, 2011

Sun Fire X4800 M2 Servers (now known as Sun Server X2-8) Produce World Record on SAP SD-Parallel Benchmark

Oracle delivered a world record result on the SAP enhancement package 4 for SAP ERP 6.0 (Unicode) Sales and Distribution - Parallel (SD Parallel) Benchmark, achieving 180,000 users as of 10/03/2011 using eight of Oracle's Sun Fire X4800 M2 servers (now known as Sun Server X2-8), Oracle Solaris 10 and Oracle Database 11g Real Application Clusters (RAC) software.

  • The eight Sun Fire X4800 M2 servers delivered a world record result of 180,000 users on the SAP SD Parallel Benchmark.

  • The eight Sun Fire X4800 M2 server SD Parallel result of 180,000 users delivered 43% more performance compared to the IBM Power 795 server SD two-tier result of 126,063 users.

Performance Landscape

Selected SAP Sales and Distribution (SD) benchmark results are presented in decreasing order of performance. All results were obtained using SAP enhancement package 4 for SAP ERP 6.0 (Unicode).

System                                        OS                 Database        Users    SAPS       Type      Cert #
Eight Sun Fire X4800 M2, each with            Oracle Solaris 10  Oracle 11g RAC  180,000  1,016,380  Parallel  2011037
  8 x Intel Xeon E7-8870 @ 2.4 GHz, 512 GB
Six Sun Fire X4800 M2, each with              Oracle Solaris 10  Oracle 11g RAC  137,904  765,470    Parallel  2011038
  8 x Intel Xeon E7-8870 @ 2.4 GHz, 512 GB
IBM Power 795, with                           AIX 7.1            DB2 9.7         126,063  688,630    Two-Tier  2010046
  32 x POWER7 @ 4.0 GHz, 4096 GB
Four Sun Fire X4800 M2, each with             Oracle Solaris 10  Oracle 11g RAC  94,736   546,050    Parallel  2011039
  8 x Intel Xeon E7-8870 @ 2.4 GHz, 512 GB
Two Sun Fire X4800 M2, each with              Oracle Solaris 10  Oracle 11g RAC  49,860   274,080    Parallel  2011040
  8 x Intel Xeon E7-8870 @ 2.4 GHz, 512 GB
Four Sun Fire X4470, each with                Oracle Solaris 10  Oracle 11g RAC  40,000   221,020    Parallel  2010039
  4 x Intel Xeon X7560 @ 2.26 GHz, 256 GB

Complete benchmark results and descriptions can be found at the SAP standard applications benchmark website.
SD benchmark results are listed on the Two-Tier and Three-Tier pages; SD Parallel benchmark results are listed on the SD Parallel page.

Configuration and Results Summary

Hardware Configuration:

8 x Sun Fire X4800 M2 servers, each with
8 x Intel Xeon E7-8870 @ 2.4 GHz (8 processors, 80 cores, 160 threads)
512 GB memory

Software Configuration:

SAP enhancement package 4 for SAP ERP 6.0
Oracle Database 11g Real Application Clusters (RAC)
Oracle Solaris 10

Results Summary:

Number of SAP SD benchmark users: 180,000
Average dialog response time: 0.63 seconds
Throughput:
    Fully processed order line items per hour: 20,327,670
    Dialog steps per hour: 60,983,000
    SAPS: 1,016,380
Average database request time (dialog/update): 0.010 sec / 0.055 sec
SAP Certification: 2011037

Benchmark Description

The SAP Standard Application Sales and Distribution - Parallel (SD Parallel) Benchmark is a two-tier ERP business test that is indicative of full business workloads of complete order processing and invoice processing and demonstrates the ability to run both the application and database software on a single system. The SAP Standard Application SD Benchmark represents the critical tasks performed in real-world ERP business environments.

The SD Parallel Benchmark consists of the same transactions and user interaction steps as the two-tier and three-tier SD Benchmark. This means that the SD Parallel Benchmark runs the same business processes as the SD Benchmark. The difference between the benchmarks is the technical data distribution. Additionally, the benchmark requires the benchmark users to be distributed equally across all database nodes for the benchmark clients used (round-robin method). Following this rule, all database nodes work on data of all clients. This avoids unrealistic configurations such as having only one client per database node.
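
For example, with four database nodes, the users of each benchmark client are assigned round-robin: user 1 to node 1, user 2 to node 2, user 3 to node 3, user 4 to node 4, user 5 back to node 1, and so on, so that every database node works on data from every client.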

The SAP Benchmark Council agreed to give the parallel benchmark a different name so that the difference can be easily recognized by any interested parties - customers, prospects, and analysts. The naming convention is SD Parallel for Sales & Distribution - Parallel.

SAP is one of the premier world-wide ERP application providers, and maintains a suite of benchmark tests to demonstrate the performance of competitive systems on the various SAP products.

See Also

Disclosure Statement

SAP enhancement package 4 for SAP ERP 6.0 (Unicode) Sales and Distribution Benchmark, results as of 10/03/2011.

SD Parallel, 8 x Sun Fire X4800 M2 (each 8 processors, 80 cores, 160 threads) 180,000 SAP SD Users, Oracle Solaris 10, Oracle 11g Real Application Clusters (RAC), Certification Number 2011037.
SD Parallel, 6 x Sun Fire X4800 M2 (each 8 processors, 80 cores, 160 threads) 137,904 SAP SD Users, Oracle Solaris 10, Oracle 11g Real Application Clusters (RAC), Certification Number 2011038.
SD Parallel, 4 x Sun Fire X4470 (each 4 processors, 32 cores, 64 threads) 40,000 SAP SD Users, Oracle Solaris 10, Oracle 11g Real Application Clusters (RAC), Certification Number 2010039.
SD Two-Tier, IBM Power 795 (32 processors, 256 cores, 1024 threads) 126,063 SAP SD Users, AIX 7.1, DB2 9.7, Certification Number 2010046.

SAP and R/3 are registered trademarks of SAP AG in Germany and other countries. More information may be found at www.sap.com/benchmark.

Wednesday Dec 08, 2010

Sun Blade X6275 M2 Cluster with Sun Storage 7410 Performance Running Seismic Processing Reverse Time Migration

This Oil & Gas benchmark highlights both the computational performance improvements of the Sun Blade X6275 M2 server module over the previous generation server module and the linear scalability achievable for total application throughput, using a Sun Storage 7410 system to deliver almost 2 GB/sec of effective write performance.

Oracle's Sun Storage 7410 system attached via 10 Gigabit Ethernet to a cluster of Oracle's Sun Blade X6275 M2 server modules was used to demonstrate the performance of a 3D VTI Reverse Time Migration application, a heavily used geophysical imaging and modeling application for Oil & Gas Exploration. The total application throughput scaling and computational kernel performance improvements are presented for imaging two production sized grids using 800 input samples.

  • The Sun Blade X6275 M2 server module showed up to a 40% performance improvement over the previous generation server module with super-linear scalability to 16 nodes for the 9-Point Stencil used in this Reverse Time Migration computational kernel.

  • The balanced combination of Oracle's Sun Storage 7410 system over 10 GbE to the Sun Blade X6275 M2 server module cluster showed linear scalability for the total application throughput, including the I/O and MPI communication, to produce a final 3-D seismic depth imaged cube for interpretation.

  • The final image write time from the Sun Blade X6275 M2 server module nodes to Oracle's Sun Storage 7410 system achieved 10GbE line speed of 1.25 GBytes/second or better write performance. The effects of I/O buffer caching on the Sun Blade X6275 M2 server module nodes and 34 GByte write optimized cache on the Sun Storage 7410 system gave up to 1.8 GBytes/second effective write performance.

Performance Landscape

Server Generational Performance Improvements

Performance improvements for the Reverse Time Migration computational kernel using a Sun Blade X6275 M2 cluster are compared to the previous generation Sun Blade X6275 cluster. Hyper-threading was enabled for both configurations allowing 24 OpenMP threads for the Sun Blade X6275 M2 server module nodes and 16 for the Sun Blade X6275 server module nodes.

Sun Blade X6275 M2 Performance Improvements

                Grid 1243 x 1151 x 1231                      Grid 2486 x 1151 x 1231
Number    X6275 Kernel   X6275 M2 Kernel   X6275 M2    X6275 Kernel   X6275 M2 Kernel   X6275 M2
of Nodes  Time (sec)     Time (sec)        Speedup     Time (sec)     Time (sec)        Speedup
16        306            242               1.3         728            576               1.3
14        355            271               1.3         814            679               1.2
12        435            346               1.3         945            797               1.2
10        541            390               1.4         1156           890               1.3
8         726            555               1.3         1511           1193              1.3

Application Scaling

Performance and scaling results of the total application, including I/O, for the reverse time migration demonstration application are presented. Results were obtained using a Sun Blade X6275 M2 server cluster with a Sun Storage 7410 system for the file server. The servers were running with hyperthreading enabled, allowing for 24 OpenMP threads per server node.

Application Scaling Across Multiple Nodes

                Grid 1243 x 1151 x 1231                        Grid 2486 x 1151 x 1231
Number    Total       Kernel      Total     Kernel       Total       Kernel      Total     Kernel
of Nodes  Time (sec)  Time (sec)  Speedup   Speedup      Time (sec)  Time (sec)  Speedup   Speedup
16        501         242         2.1\*     2.3\*        1060        576         2.0       2.1\*
14        583         271         1.8       2.0          1219        679         1.7       1.8
12        681         346         1.6       1.6          1420        797         1.5       1.5
10        807         390         1.3       1.4          1688        890         1.2       1.3
8         1058        555         1.0       1.0          2085        1193        1.0       1.0

\* Super-linear scaling due to the compute kernel fitting better into available cache for larger node counts

Image File Effective Write Performance

The performance of writing the final 3D image from a Sun Blade X6275 M2 server cluster over 10 Gigabit Ethernet to a Sun Storage 7410 system is presented. One core per node was allocated for MPI I/O, allowing 22 OpenMP compute threads per node with hyperthreading enabled. Performance analytics captured on the Sun Storage 7410 system indicate effective use of its 34 Gigabyte write optimized cache.

Image File Effective Write Performance

                Grid 1243 x 1151 x 1231           Grid 2486 x 1151 x 1231
Number    Write       Write Performance     Write       Write Performance
of Nodes  Time (sec)  (GB/sec)              Time (sec)  (GB/sec)
16        4.8         1.5                   10.2        1.4
14        5.0         1.4                   10.2        1.4
12        4.0         1.8                   11.3        1.3
10        4.3         1.6                   9.1         1.6
8         4.6         1.5                   9.7         1.5

Note: Write performance better than 1.3 GB/sec is related to I/O buffer caching on the server nodes.

Configuration Summary

Hardware Configuration:

8 x Sun Blade X6275 M2 server modules (two nodes per module, 16 nodes total), each node with
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory (12 x 4 GB at 1333 MHz)
1 x QDR InfiniBand Host Channel Adapter

Sun Datacenter InfiniBand Switch IB-36
Sun Network 10 GbE Switch 72p

Sun Storage 7410 system connected via 10 Gigabit Ethernet
4 x 17 GB STEC ZeusIOPs SSD mirrored - 34 GB
40 x 750 GB 7500 RPM Seagate SATA disks mirrored - 14.4 TB
No L2ARC Readzilla Cache

Software Configuration:

Oracle Enterprise Linux Server release 5.5
Oracle Message Passing Toolkit 8.2.1c (for MPI)
Oracle Solaris Studio 12.2 C++, Fortran, OpenMP

Benchmark Description

This Vertical Transverse Isotropy (VTI) Anisotropic Reverse Time Depth Migration (RTM) application measures the total time it takes to image 800 samples of various production size grids and write the final image to disk for the next work flow step involving 3-D seismic volume interpretation. In doing so, it reports the compute, interprocessor communication, and I/O performance of the individual functions that comprise the total solution. Unlike most references for the Reverse Time Migration, which focus solely on the performance of the 3D stencil compute kernel, this demonstration code additionally reports the total throughput involved in processing large data sets with a full 3D Anisotropic RTM application. It provides valuable insight into configuration and sizing for specific seismic processing requirements. The performance effects of new processors, interconnects, I/O subsystems, and software technologies can be evaluated while solving a real Exploration business problem.

This benchmark study uses the "in-core" implementation of this demonstration code, where each node reads in only the trace, velocity, and conditioning data to be processed by that node plus a 4 element array pad (based on spatial order 8) shared with its neighbors to the left and right during the initialization phase. It maintains previous, current, and next wavefield state information for each of the source, receiver, and anisotropic wavefields in memory. The second two grid dimensions used in this benchmark are specifically chosen to be prime numbers to exaggerate the effects of data alignment. Algorithm adaptations for processing higher orders in space and alternative "out-of-core" solutions using SSDs for wave state checkpointing are implemented in this demonstration application to better understand the effects of problem size scaling. Care is taken to handle absorption boundary conditioning and a variety of imaging conditions appropriately.

RTM Application Structure:

Read Processing Parameter File, Determine Domain Decomposition, Initialize Data Structures, and Allocate Memory.

Read Velocity, Epsilon, and Delta Data Based on Domain Decomposition and create source, receiver, & anisotropic previous, current, and next wave states.

First Loop over Time Steps

Compute 3D Stencil for Source Wavefield (a,s) - 8th order in space, 2nd order in time
Propagate over Time to Create s(t,z,y,x) & a(t,z,y,x)
Inject Estimated Source Wavelet
Apply Absorption Boundary Conditioning (a)
Update Wavefield States and Pointers
Write Snapshot of Wavefield (out-of-core) or Push Wavefield onto Stack (in-core)
Communicate Boundary Information

Second Loop over Time Steps
Compute 3D Stencil for Receiver Wavefield (a,r) - 8th order in space, 2nd order in time
Propagate over Time to Create r(t,z,y,x) & a(t,z,y,x)
Read Receiver Trace and Inject Receiver Wavelet
Apply Absorption Boundary Conditioning (a)
Update Wavefield States and Pointers
Communicate Boundary Information
Read in Source Wavefield Snapshot (out-of-core) or Pop Off of Stack (in-core)
Cross-correlate Source and Receiver Wavefields
Update image using image conditioning parameters

Write 3D Depth Image i(z,x,y) = Sum over time steps s(t,z,x,y) \* r(t,z,x,y) or other imaging conditions.
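
To make the stencil step above concrete, here is a minimal C/OpenMP sketch of an 8th-order-in-space, 2nd-order-in-time update for a scalar wavefield on a uniform grid. It is an illustrative stand-in, not the demonstration code: the VTI terms, absorption boundaries, MPI halo exchange, and imaging logic listed above are omitted, and all names and sizes in it are hypothetical.

    #include <stdio.h>
    #include <stdlib.h>

    /* 8th-order central-difference coefficients for the second derivative */
    static const float C[5] = { -205.0f/72.0f, 8.0f/5.0f, -1.0f/5.0f,
                                8.0f/315.0f, -1.0f/560.0f };

    /* One time step: next = 2*cur - prev + v^2 dt^2 * laplacian(cur).
       Grid is nx*ny*nz floats, stored z-fastest; halo exchange omitted. */
    static void update_wavefield(const float *prev, const float *cur, float *next,
                                 const float *vel, long nx, long ny, long nz,
                                 float dt, float dx)
    {
        const float r = (dt * dt) / (dx * dx);
    #pragma omp parallel for collapse(2)
        for (long ix = 4; ix < nx - 4; ix++)
            for (long iy = 4; iy < ny - 4; iy++)
                for (long iz = 4; iz < nz - 4; iz++) {
                    long i = (ix * ny + iy) * nz + iz;
                    float lap = 3.0f * C[0] * cur[i];   /* center point, 3 axes */
                    for (int k = 1; k <= 4; k++)        /* 4 points per side = 8th order */
                        lap += C[k] * (cur[i + k]           + cur[i - k]             /* z */
                                     + cur[i + k * nz]      + cur[i - k * nz]        /* y */
                                     + cur[i + k * ny * nz] + cur[i - k * ny * nz]); /* x */
                    next[i] = 2.0f * cur[i] - prev[i] + vel[i] * vel[i] * r * lap;
                }
    }

    int main(void)
    {
        const long n = 32, size = n * n * n;            /* toy grid */
        float *prev = calloc(size, sizeof(float)), *cur = calloc(size, sizeof(float));
        float *next = calloc(size, sizeof(float)), *vel = malloc(size * sizeof(float));
        for (long i = 0; i < size; i++) vel[i] = 3000.0f;  /* constant velocity */
        cur[(16 * n + 16) * n + 16] = 1.0f;                /* point source */
        update_wavefield(prev, cur, next, vel, n, n, n, 0.001f, 10.0f);
        printf("next at source: %g\n", next[(16 * n + 16) * n + 16]);
        free(prev); free(cur); free(next); free(vel);
        return 0;
    }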

Key Points and Best Practices

This demonstration application represents a full Reverse Time Migration solution. Many references to the RTM application tend to focus on the compute kernel and ignore the complexity that the input, communication, and output bring to the task.

Image File MPI Write Performance Tuning

Changing the Image File Write from MPI non-blocking to MPI blocking and setting Oracle Message Passing Toolkit MPI environment variables revealed an 18x improvement in write performance to the Sun Storage 7410 system going from:

    86.8 to 4.8 seconds for the 1243 x 1151 x 1231 grid size
    183.1 to 10.2 seconds for the 2486 x 1151 x 1231 grid size

The Swat Sun Storage 7410 analytics data capture indicated an initial write performance of about 100 MB/sec with the MPI non-blocking implementation. After modifying to MPI blocking writes, Swat showed between 1.3 and 1.8 GB/sec with up to 13000 write ops/sec to write the final output image. The Swat results are consistent with the actual measured performance and provide valuable insight into the Reverse Time Migration application I/O performance.

The reason for this vast improvement has to do with whether the MPI file mode is sequential or not (MPI_MODE_SEQUENTIAL, O_SYNC, O_DSYNC). The MPI non-blocking routines, MPI_File_iwrite_at and MPI_Wait, typically used for overlapping I/O and computation, do not support sequential file access mode. Therefore, the application could not take full performance advantage of the Sun Storage 7410 system write optimized cache. In contrast, the MPI blocking routine, MPI_File_write_at, defaults to MPI sequential mode, and the performance advantages of the write optimized cache are realized. Since writing the final image occurs at the end of RTM execution, there is no need to overlap the I/O with computation.
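
A minimal MPI-IO sketch of this blocking pattern is shown below. It is illustrative only; the file name, slab size, and offsets are hypothetical and this is not the application's actual writer.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Hypothetical slab: in the real application this would be this
           rank's portion of the final depth image. */
        const MPI_Offset count = 1 << 20;                  /* floats per rank */
        float *slab = calloc((size_t)count, sizeof(float));

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "final_image.bin",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Blocking write at this rank's byte offset. Unlike the non-blocking
           MPI_File_iwrite_at / MPI_Wait pair, this streams the data in a
           sequential-friendly way that the write optimized cache can absorb. */
        MPI_File_write_at(fh, rank * count * (MPI_Offset)sizeof(float),
                          slab, (int)count, MPI_FLOAT, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(slab);
        MPI_Finalize();
        return 0;
    }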

Additional environment settings used:

    setenv SUNW_MP_PROCBIND true
    setenv MPI_SPIN 1
    setenv MPI_PROC_BIND 1

Adjusting the Level of Multithreading for Performance

The level of multithreading (8, 10, 12, 22, or 24) for the various components of the RTM should be adjustable based on the type of computation taking place. It is best to use the OpenMP num_threads clause to adjust the level of multithreading for each particular work task, and to use numactl to specify how the threads are allocated to cores in accordance with the OpenMP parallelism level, as sketched below.
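
A toy example of this per-phase control follows. The phase names and thread counts are hypothetical, assuming a 24-hardware-thread node; launching the binary under numactl (for example, numactl --cpunodebind=0,1 ./a.out) then constrains how those threads are placed.

    #include <omp.h>
    #include <stdio.h>

    /* Lighter setup/I/O phase: cap the team at 8 threads. */
    static void io_phase(void)
    {
        #pragma omp parallel num_threads(8)
        {
            /* ... per-thread I/O staging work ... */
        }
    }

    /* Compute phase: use 22 of the 24 hardware threads, leaving headroom
       for the support daemons mentioned above. */
    static void stencil_phase(void)
    {
        double sum = 0.0;
        #pragma omp parallel for num_threads(22) reduction(+:sum)
        for (int i = 0; i < 1000000; i++)
            sum += (double)i * 1e-6;   /* stand-in for stencil work */
        printf("compute phase result: %f\n", sum);
    }

    int main(void)
    {
        printf("hardware threads visible to OpenMP: %d\n", omp_get_max_threads());
        io_phase();
        stencil_phase();
        return 0;
    }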

See Also

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 12/07/2010.

Sun Blade X6275 M2 Delivers Best Fluent (MCAE Application) Performance on Tested Configurations

This Manufacturing Engineering benchmark highlights the performance advantage the Sun Blade X6275 M2 server module offers over IBM, Cray, and SGI solutions as shown by the ANSYS FLUENT fluid dynamics application.

A cluster of eight of Oracle's Sun Blade X6275 M2 server modules delivered outstanding performance running the FLUENT 12 benchmark test suite.

  • The Sun Blade X6275 M2 server module cluster delivered the best results in all 36 of the test configurations run, outperforming the best posted results by as much as 42%.
  • The Sun Blade X6275 M2 server module demonstrated up to 76% performance improvement over the previous generation Sun Blade X6275 server module.

Performance Landscape

In the following tables, results are "Ratings" (bigger is better).
Rating = No. of sequential runs of test case possible in 1 day: 86,400/(Total Elapsed Run Time in Seconds)
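
For example, a test case with a total elapsed run time of 400 seconds earns a rating of 86,400 / 400 = 216.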

The following table compares results on the basis of core count, irrespective of processor generation. This means that in some cases, i.e., for the 32-core and 64-core configurations, systems with the Intel Xeon X5670 six-core processors did not utilize quite all of the cores available for the specified processor count.


FLUENT 12 Benchmark Test Suite

Competitive Comparisons

Ratings by benchmark test case (a dash indicates no posted result for that case):

System              Processors  Cores   eddy      turbo     aircraft  sedan     truck     truck_poly
                                        417k      500k      2m        4m        14m       14m
Sun Blade X6275 M2      16        96    9340.5    39272.7   8307.7    8533.3    903.8     786.9
Best Posted             24        96    -         -         -         7562.4    797.0     712.9
Best Posted             16        96    7337.6    33553.4   6533.1    5989.6    739.1     683.5

Sun Blade X6275 M2      11        64    6306.6    27212.6   5592.2    5158.2    568.8     518.9
Best Posted             16        64    5556.3    26381.7   5494.4    4902.1    566.6     518.6

Sun Blade X6275 M2       8        48    4620.3    19093.9   4080.3    3251.2    376.0     359.4
Best Posted              8        48    4494.1    18989.0   3990.8    3185.3    372.7     354.5

Sun Blade X6275 M2       6        32    4061.1    15091.7   3275.8    3013.1    299.5     267.8
Best Posted              8        32    3404.9    14832.6   3211.9    2630.1    286.7     266.7

Sun Blade X6275 M2       4        24    2751.6    10441.1   2161.4    1907.3    188.2     182.5
Best Posted              6        24    1458.2    9626.7    1820.9    1747.2    185.1     180.8
Best Posted              4        24    2565.7    10164.7   2109.9    1608.2    187.1     180.8

Sun Blade X6275 M2       2        12    1429.9    5358.1    1097.5    813.2     95.9      95.9
Best Posted              2        12    1338.0    5308.8    1073.3    808.6     92.9      94.4



The following table compares results on the basis of processor count showing inter-generational processor performance improvement.


FLUENT 12 Benchmark Test Suite

Intergenerational Comparisons

Ratings by benchmark test case:

System              Processors  Cores   eddy      turbo     aircraft  sedan     truck     truck_poly
                                        417k      500k      2m        4m        14m       14m
Sun Blade X6275 M2      16        96    9340.5    39272.7   8307.7    8533.3    903.8     786.9
Sun Blade X6275         16        64    5308.8    26790.7   5574.2    5074.9    547.2     525.2
X6275 M2 : X6275        16              1.76      1.47      1.49      1.68      1.65      1.50

Sun Blade X6275 M2       8        48    4620.3    19093.9   4080.3    3251.2    376.0     359.4
Sun Blade X6275          8        32    3066.5    13768.9   3066.5    2602.4    289.0     270.3
X6275 M2 : X6275         8              1.51      1.39      1.33      1.25      1.30      1.33

Sun Blade X6275 M2       4        24    2751.6    10441.1   2161.4    1907.3    188.2     182.5
Sun Blade X6275          4        16    1714.3    7545.9    1519.1    1345.8    144.4     141.8
X6275 M2 : X6275         4              1.61      1.38      1.42      1.42      1.30      1.29

Sun Blade X6275 M2       2        12    1429.9    5358.1    1097.5    813.2     95.9      95.9
Sun Blade X6275          2        8     931.8     4061.1    827.2     681.5     73.0      73.8
X6275 M2 : X6275         2              1.53      1.32      1.33      1.19      1.31      1.30

Configuration Summary

Hardware Configuration:

8 x Sun Blade X6275 M2 server modules, each with
4 Intel Xeon X5670 2.93 GHz processors, turbo enabled
96 GB memory 1333 MHz
2 x 24 GB SATA-based Sun Flash Modules
2 x QDR InfiniBand Host Channel Adapter
Sun Datacenter InfiniBand Switch IB-36

Software Configuration:

Oracle Enterprise Linux Enterprise Server 5.5
ANSYS FLUENT V12.1.2
ANSYS FLUENT Benchmark Test Suite

Benchmark Description

The following description is from the ANSYS FLUENT website:

The FLUENT benchmark suite comprises a set of test cases covering a large range of mesh sizes, physical models and solvers representing typical industry usage. The cases range in size from a few hundred thousand cells to more than 100 million cells. Both the segregated and coupled implicit solvers are included, as well as hexahedral, mixed and polyhedral cell cases. This broad coverage is expected to demonstrate the breadth of FLUENT performance on a variety of hardware platforms and test cases.

The performance of a CFD code will depend on several factors, including size and topology of the mesh, physical models, numerics and parallelization, compilers and optimization, in addition to performance characteristics of the hardware where the simulation is performed. The principal objective of this benchmark suite is to provide comprehensive and fair comparative information of the performance of FLUENT on available hardware platforms.

About the ANSYS FLUENT 12 Benchmark Test Suite

    CFD models tend to be very large where grid refinement is required to capture with accuracy conditions in the boundary layer region adjacent to the body over which flow is occurring. Fine grids are also required to determine accurate turbulence conditions. As such, these models can run for many hours or even days, even when using a large number of processors.

Key Points and Best Practices

  • ANSYS FLUENT has not yet been certified by the vendor on Oracle Enterprise Linux (OEL). However, the ANSYS FLUENT benchmark tests have been run successfully on Oracle hardware running OEL as is (i.e. with NO changes or modifications).
  • The performance improvement of the Sun Blade X6275 M2 server module over the previous generation Sun Blade X6275 server module was due to two main factors: the increased core count per processor (6 vs. 4), and the more optimal, iterative dataset partitioning scheme used for the Sun Blade X6275 M2 server module.

See Also

Disclosure Statement

All information on the FLUENT website (http://www.fluent.com) is Copyrighted 1995-2010 by ANSYS Inc. Results as of December 06, 2010.

Tuesday Dec 07, 2010

Sun Blade X6275 M2 Server Module with Intel X5670 Processors SPEC CPU2006 Results

Results are presented for Oracle's Sun Blade X6275 M2 server module running the SPEC CPU2006 benchmark suite.
  • The dual-node Sun Blade X6275 M2 server module, equipped with two Intel Xeon X5670 2.93 GHz processors per node and running the Oracle Enterprise Linux 5.5 operating system, delivered the best SPECint_rate2006 and SPECfp_rate2006 benchmark results among all systems using Intel Xeon processor 5000 sequence CPUs.

  • With a SPECint_rate2006 benchmark result of 679, the Sun Blade X6275 M2 server module, with two compute nodes per blade, delivers maximum performance for space constrained environments.

  • Comparing Oracle's dual-node blade to HP's dual-node blade server, based on their single node performance, the Sun Blade X6275 M2 server module SPECfp_rate2006 score of 241 outperforms the best published HP ProLiant BL2X220c G5 server score by 3.2x.

  • A single node of a Sun Blade X6275 M2 server module using 2.93 GHz Intel Xeon X5670 processors delivered 37% improvement in SPECint_rate2006 benchmark results and 22% improvement in SPECfp_rate2006 benchmark results compared to the previous generation Sun Blade X6275 server module.

  • Both nodes of a Sun Blade X6275 M2 server module using 2.93 GHz Intel Xeon X5670 processors delivered 59% improvement on the SPECint_rate2006 benchmark and 40% improvement on the SPECfp_rate2006 benchmark compared to the previous generation Sun Blade X6275 server module.

Performance Landscape

SPEC CPU2006 results comparing blade systems using Intel Xeon processor 5000 sequence based CPUs.

System                               SPECint_rate2006    SPECfp_rate2006
                                     base     peak       base     peak
Sun Blade X6275 M2 2-nodes, X5670    651      679        465      474
Sun Blade X6275 M2 1-node, X5670     326      348        234      241
IBM BladeCenter HS22V, X5680         352      377        246      254
Dell PowerEdge M710, X5680           355      380        247      256
HP BL460c G7, X5670                  324      347        233      241
HP BL2X220c G5 1-node, E5450         106      132        67.4     74.8

SPEC CPU2006 results generational comparison between the Sun Blade X6275 M2 server module and the Sun Blade X6275 server module.

System                               SPECint_rate2006    SPECfp_rate2006
                                     base     peak       base     peak
Sun Blade X6275 M2 2-nodes, X5670    651      679        465      474
Sun Blade X6275 2-nodes, X5570       410      478        332      355
Sun Blade X6275 M2 1-node, X5670     326      348        234      241
Sun Blade X6275 1-node, X5570        238      253        191      197

Results in the above tables are from www.spec.org and this report as of 12/6/2010.

Configuration Summary and Results

Hardware Configuration:

Sun Blade X6275 M2 server module, 2 nodes, each node with
2 x 2.93 GHz Intel Xeon X5670 processors, turbo enabled
96 GB memory (12 x 8 GB DDR3-1333 DIMMs)
Sun Storage 7410 System via NFS

Software Configuration:

Oracle Enterprise Linux Server release 5.5, kernel 2.6.18-194.el5
Intel 11.1 compilers
MicroQuill SmartHeap Library V8.1
SPEC CPU2006 V1.1

Results Summary:

Sun Blade X6275 M2, both nodes 651 SPECint_rate_base2006 679 SPECint_rate2006
Sun Blade X6275 M2, both nodes 465 SPECfp_rate_base2006 474 SPECfp_rate2006
Sun Blade X6275 M2, one node 326 SPECint_rate_base2006 348 SPECint_rate2006
Sun Blade X6275 M2, one node 234 SPECfp_rate_base2006 241 SPECfp_rate2006
Sun Blade X6275 M2, one node 36.2 SPECint_base2006 39.0 SPECint2006

Benchmark Description

SPEC CPU2006 is SPEC's most popular benchmark, with over 14000 results published in the years since it was introduced. It measures:

  • "Speed" - single copy performance of chip, memory, compiler
  • "Rate" - multiple copy (throughput)

The rate metrics are used for the throughput-oriented systems described on this page. These metrics include:

  • SPECint_rate2006: throughput for 12 integer benchmarks derived from real applications such as perl, gcc, XML processing, and pathfinding
  • SPECfp_rate2006: throughput for 17 floating point benchmarks derived from real applications, including chemistry, physics, genetics, and weather.

There are base variants of both the above metrics that require more conservative compilation. In particular, all benchmarks of a particular programming language must use the same compilation flags.

See Also

Disclosure Statement

SPEC and the benchmark names SPECint and SPECfp are registered trademarks of the Standard Performance Evaluation Corporation. Results are from the report and www.spec.org as of December 6, 2010.

Tuesday Oct 26, 2010

3D VTI Reverse Time Migration Scalability On Sun Fire X2270-M2 Cluster with Sun Storage 7210

This Oil & Gas benchmark shows the Sun Storage 7210 system delivers almost 2 GB/sec bandwidth and realizes near-linear scaling performance on a cluster of 16 Sun Fire X2270 M2 servers.

Oracle's Sun Storage 7210 system attached via QDR InfiniBand to a cluster of sixteen of Oracle's Sun Fire X2270 M2 servers was used to demonstrate the performance of a Reverse Time Migration application, an important application in the Oil & Gas industry. The total application throughput and computational kernel scaling are presented for two production sized grids using 800 samples.

  • Both the Reverse Time Migration I/O and combined computation show near-linear scaling from 8 to 16 nodes on the Sun Storage 7210 system connected via QDR InfiniBand to a Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231: 2.0x improvement
      2486 x 1151 x 1231: 1.7x improvement
  • The computational kernel of the Reverse Time Migration has linear to super-linear scaling from 8 to 16 nodes in Oracle's Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231 : 2.2x improvement
      2486 x 1151 x 1231 : 2.0x improvement
  • Intel Hyper-Threading provides additional performance benefits to both the Reverse Time Migration I/O and computation when going from 12 to 24 OpenMP threads on the Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231: 8% - computational kernel; 2% - total application throughput
      2486 x 1151 x 1231: 12% - computational kernel; 6% - total application throughput
  • The Sun Storage 7210 system delivers the Velocity, Epsilon, and Delta data to the Reverse Time Migration at a steady rate even when timing includes memory initialization and data object creation:

      1243 x 1151 x 1231: 1.4 to 1.6 GBytes/sec
      2486 x 1151 x 1231: 1.2 to 1.3 GBytes/sec

    One can see that when doubling the size of the problem, the additional complexity of overlapping I/O and multiple node file contention only produces a small reduction in read performance.

Performance Landscape

Application Scaling

Performance and scaling results of the total application, including I/O, for the reverse time migration demonstration application are presented. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server. The servers were running with hyperthreading enabled, allowing for 24 OpenMP threads per server.

Application Scaling Across Multiple Nodes

                Grid 1243 x 1151 x 1231                        Grid 2486 x 1151 x 1231
Number    Total       Kernel      Total     Kernel       Total       Kernel      Total     Kernel
of Nodes  Time (sec)  Time (sec)  Speedup   Speedup      Time (sec)  Time (sec)  Speedup   Speedup
16        504         259         2.0       2.2\*        1024        551         1.7       2.0
14        565         279         1.8       2.0          1191        677         1.5       1.6
12        662         343         1.6       1.6          1426        817         1.2       1.4
10        784         394         1.3       1.4          1501        856         1.2       1.3
8         1024        560         1.0       1.0          1745        1108        1.0       1.0

\* Super-linear scaling due to the compute kernel fitting better into available cache

Application Scaling – Hyper-Threading Study

The effects of hyperthreading are presented when running the reverse time migration demonstration application. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server.

Hyper-Threading Comparison – 12 versus 24 OpenMP Threads

                     Grid 1243 x 1151 x 1231                         Grid 2486 x 1151 x 1231
Number    Threads   Total       Kernel      Total HT   Kernel HT    Total       Kernel      Total HT   Kernel HT
of Nodes  per Node  Time (sec)  Time (sec)  Speedup    Speedup      Time (sec)  Time (sec)  Speedup    Speedup
16        24        504         259         1.02       1.08         1024        551         1.06       1.12
16        12        515         279         1.00       1.00         1088        616         1.00       1.00

Read Performance

Read performance is presented for the velocity, epsilon and delta files running the reverse time migration demonstration application. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server. The servers were running with hyperthreading enabled, allowing for 24 OpenMP threads per server.

Velocity, Epsilon, and Delta File Read and Memory Initialization Performance

                         Grid 1243 x 1151 x 1231                           Grid 2486 x 1151 x 1231
Number    Overlap        Time   Time Relative  Total GBytes  Read Rate    Time   Time Relative  Total GBytes  Read Rate
of Nodes  MBytes Read    (sec)  to 8-node      Read          GB/s         (sec)  to 8-node      Read          GB/s
16        2040           16.7   1.1            23.2          1.4          36.8   1.1            44.3          1.2
8         951            14.8   1.0            22.1          1.6          33.0   1.0            43.2          1.3

Configuration Summary

Hardware Configuration:

16 x Sun Fire X2270 M2 servers, each with
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory (12 x 4 GB at 1333 MHz)

Sun Storage 7210 system connected via QDR InfiniBand
2 x 18 GB SATA SSD (logzilla)
40 x 1 TB 7200 RPM SATA disks

Software Configuration:

SUSE Linux Enterprise Server SLES 10 SP 2
Oracle Message Passing Toolkit 8.2.1 (for MPI)
Sun Studio 12 Update 1 C++, Fortran, OpenMP

Benchmark Description

This Reverse Time Migration (RTM) demonstration application measures the total time it takes to image 800 samples of various production size grids and write the final image to disk. In this version, each node reads in only the trace, velocity, and conditioning data to be processed by that node plus a four element inline 3-D array pad (spatial order of eight) shared with its neighbors to the left and right during the initialization phase. It represents a full RTM application including the data input, computation, communication, and final output image to be used by the next work flow step involving 3D volumetric seismic interpretation.

Key Points and Best Practices

This demonstration application represents a full Reverse Time Migration solution. Many references to the RTM application tend to focus on the compute kernel and ignore the complexity that the input, communication, and output bring to the task.

I/O Characterization without Optimal Checkpointing

Velocity, Epsilon, and Delta Files - Grid Reading

The additional amount of overlapping reads to share velocity, epsilon, and delta edge data with neighbors can be calculated using the following equation:

    (number_nodes - 1) x (order_in_space) x (y_dimension) x (z_dimension) x (4 bytes) x (3 files)

For this particular benchmark study, the additional 3-D pad overlap for the 16 and 8 node cases is:

    16 nodes: 15 x 8 x 1151 x 1231 x 4 x 3 = 2.04 GB extra
    8 nodes: 7 x 8 x 1151 x 1231 x 4 x 3 = 0.95 GB extra

For the first of the two test cases, the total size of the three files used for the 1243 x 1151 x 1231 case is

    1243 x 1151 x 1231 x 4 bytes = 7.05 GB per file x 3 files = 21.13 GB

With the additional 3-D pad, the total amount of data read is:

    16 nodes: 2.04 GB + 21.13 GB = 23.2 GB
    8 nodes: 0.95 GB + 21.13 GB = 22.1 GB

For the second of the two test cases, the total size of the three files used for the 2486 x 1151 x 1231 case is

    2486 x 1151 x 1231 x 4 bytes = 14.09 GB per file x 3 files = 42.27 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: 2.04 GB + 42.27 GB = 44.3 GB
    8 nodes: 0.95 GB + 42.27 GB = 43.2 GB

Note that the amount of overlapping data read increases, not only by the number of nodes, but as the y dimension and/or the z dimension increases.
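
The same arithmetic is easy to parameterize; below is a small C helper (illustrative only, not part of the demonstration code) that reproduces the pad figures above. For trace data (next section), the slice term becomes y_dimension x number_of_time_slices and a single file is read.

    #include <stdio.h>

    /* pad bytes = (nodes - 1) x order_in_space x ny x nz x 4 bytes x files */
    static double pad_gb(long nodes, long order, long ny, long nz, long files)
    {
        return (double)(nodes - 1) * order * ny * nz * 4.0 * files / 1e9;
    }

    int main(void)
    {
        /* Velocity, epsilon, and delta grids share a 1151 x 1231 slice and
           3 files are read; matches the 2.04 GB / 0.95 GB figures above. */
        printf("16 nodes: %.2f GB extra\n", pad_gb(16, 8, 1151, 1231, 3));
        printf(" 8 nodes: %.2f GB extra\n", pad_gb(8, 8, 1151, 1231, 3));
        return 0;
    }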

Trace Reading

The additional amount of overlapping reads to share trace edge data with neighbors can be calculated using the following equation:

    (number_nodes - 1) x (order_in_space) x (y_dimension) x (4 bytes) x (number_of_time_slices)

For this particular benchmark study, the additional overlap for the 16 and 8 node cases is:

    16 nodes: 15 x 8 x 1151 x 4 x 800 = 442 MB extra
    8 nodes: 7 x 8 x 1151 x 4 x 800 = 206 MB extra

For the first case the size of the trace data file used for the 1243 x 1151 x 1231 case is

    1243 x 1151 x 4 bytes x 800 = 4.578 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: 0.442 GB + 4.578 GB = 5.0 GB
    8 nodes: 0.206 GB + 4.578 GB = 4.8 GB

For the second case the size of the trace data file used for the 2486 x 1151 x 1231 case is

    2486 x 1151 x 4 bytes x 800 = 9.156 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: 0.442 GB + 9.156 GB = 9.6 GB
    8 nodes: 0.206 GB + 9.156 GB = 9.4 GB

As the number of nodes is increased, the overlap causes more disk lock contention.

Writing Final Output Image

1243 x 1151 x 1231 - 7.1 GB per file:

    16 nodes: 78 x 1151 x 1231 x 4 = 442 MB/node (7.1 GB total)
    8 nodes: 156 x 1151 x 1231 x 4 = 884 MB/node (7.1 GB total)

2486 x 1151 x 1231 - 14.1 GB per file:

    16 nodes: 156 x 1151 x 1231 x 4 = 930 MB/node (14.1 GB total)
    8 nodes: 311 x 1151 x 1231 x 4 = 1808 MB/node (14.1 GB total)

Resource Allocation

It is best to allocate one node as the Oracle Grid Engine resource scheduler and MPI master host. This is especially true when running with 24 OpenMP threads in hyperthreading mode to avoid oversubscribing a node that is cooperating in delivering the solution.

See Also

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 10/20/2010.

Thursday Sep 30, 2010

Consolidation of 30 x86 Servers onto One SPARC T3-2

One of Oracle's SPARC T3-2 servers was able to consolidate the database workloads off of thirty older x86 servers in a secure virtualized environment.

  • The thirty x86 servers required 6.7 times more power than the consolidated workload on the SPARC T3-2 server.

  • The x86 configuration used 10 times more rack space than the consolidated workload on the SPARC T3-2 server.

  • In addition to power & space considerations, there are also administrative cost savings resulting from having to manage just one server, as opposed to thirty servers.

  • Gartner says, "They need to realize that removing a single x86 server from a data center will result in savings of more than $400 a year in energy costs alone".

  • The total transaction throughput for the SPARC T3-2 server (132,000) was almost the same as the aggregate throughput achieved by the thirty x86 servers (138,000), with each x86 server running at 10% utilization.

  • The average transaction response time on the SPARC T3-2 server (24 ms) was just a little higher than the average transaction response time on the Intel servers (19.5 ms).

Performance Landscape

System                     Oracle     Average      Transactions/  Average       Watts/    OS
                           Instances  System       min/system     Response      System
                                      Utilization                 Time (ms)
Sun Fire X4250                 1         10%           4,600        19.5          320     Linux
(2 x 3.0 GHz Xeon)
SPARC T3-2                    30         80%         132,000        24.0         1400\*   Solaris
(1 x 1.65 GHz SPARC T3)

\* Power consumption includes storage and peripheral devices.

Notes:
Total throughput for the 30 Intel systems = 30 x 4,600 = 138,000 transactions/min.
Total power for the 30 Intel systems = 30 x 320 = 9,600 watts.

Results and Configuration Summary

x86 Server Configuration:

30 x Sun Fire X4250 servers, each with
2 x Intel 3.0 GHz E5450 processors
16 GB memory
6 x internal 146 GB 15K SAS disks
RedHat Linux 5.3
Oracle Database 11g Release 2

SPARC T3 Server Configuration:

1 x SPARC T3-2 server
2 x 1.65 GHz SPARC T3 processors
256 GB memory
2 x 300 GB 10K RPM internal SAS disks
1 x Sun Storage F5100 Flash Array storage
1 x Sun Fire X4270 server as COMSTAR target
Oracle Solaris 10 9/10
Oracle Database 11g Release 2

Benchmark Description

This demonstration was designed to show the benefits of virtualization when upgrading from older x86 systems to one of Oracle's T-series servers. A 30:1 consolidation was shown, moving from thirty x86 Linux servers to a single T-series server running Oracle Solaris in a secure virtualized environment. After the consolidation, there was still 20% headroom on the SPARC T3-2 server for additional growth in the workload.

The 200 scale iGen OLTP workload was used to test the consolidation. Each x86 system was loaded with iGen clients up to a level of 10% CPU utilization, a load level typically found in many data centers.

Thirty Oracle Solaris zones (containers) were created on the SPARC T3-2 server, with each zone configured identically to the Oracle configuration on the x86 server. The throughput in each zone was ramped up to the same level as on the baseline Intel server.

The overall CPU utilization on the SPARC T3-2 server, together with the average iGen transaction response times were then measured along with the power consumption.

Key Points and Best Practices

  • Each Oracle Solaris container was assigned to a processor set consisting of eight virtual CPUs. This use of processor sets was critical to obtaining the reported performance number. Without processor sets, performance was reduced to about one-half of the reported number.

  • Once the first container was completely configured (with Oracle 11g and iGen installed), the remaining containers were created by a simple cloning procedure, which took no more than a few minutes for each container.

  • Setting up a standalone x86 server with Linux, Oracle and iGen is a far more time-consuming task than setting up additional containers once the first container has been created.

See Also

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 9/20/2010.

Monday Sep 27, 2010

Sun Fire X2270 M2 Super-Linear Scaling of Hadoop Terasort and CloudBurst Benchmarks

A 16-node cluster of Oracle's Sun Fire X2270 M2 servers showed super-linear scaling of two Hadoop benchmarks. Performance was measured using the Terasort benchmark with a 100 GB data set. In addition, performance was measured using CloudBurst, which maps next generation "short read" sequence data onto the human and other genomes.

  • On the Terasort workload, a 16-node Sun Fire X2270 M2 cluster sorted the 100GB data set at a rate of 433.3 MB/s finishing in 236.3 seconds.

  • The 16-node Sun Fire X2270 M2 cluster was 9.3x faster on a per node basis than the 2010 winner of the Terasort benchmark competition (www.sortbenchmark.org) which used a 3,452-node Xeon cluster to sort 100 TB of input data in 173 minutes. Both systems used Hadoop, Terasort and 2-socket x86 servers. Allowances have to be made for the differences in problem complexity.

  • The Terasort benchmark showed super-linear scaling on the Sun Fire X2270 M2 cluster (total of 32 Intel 2.93GHz Xeons).

  • Using CloudBurst on a workload of the human genome and the SRR001113 short read data set, a 16-node Sun Fire X2270 M2 cluster finished mapping the short reads onto the human genome in 34.2 minutes.

  • On a per node basis, a 2-node Sun Fire X2270 M2 cluster was 1.7x faster than a 12-node Xeon cluster that processed the human genome and the SRR001113 short read data set in approximately 60,000 seconds (see figure 3 of this journal article). Both systems used Hadoop, CloudBurst and x86 servers.

Performance Landscape

Terasort
100 GB input data set
Performance is "real" execution time reported by /usr/bin/time in seconds (smaller is better)

Number     Seconds    Scaling    Linear
of Nodes                         Scaling
16         236.3      25.4       16
8          466.3      12.9       8
4          927.2      6.5        4
2          2140.8     2.8        2
1          6010.2     1.0        1

CloudBurst
SRR001113 short read data set mapped onto the hs_ref_GRCh37 human genome
Performance is "total running" time reported by CloudBurst in seconds (smaller is better)

Number     Seconds    Scaling    Linear
of Nodes                         Scaling
16         2054.9     8.4        8
8          3615.8     4.7        4
4          7895.7     2.2        2
2          17155.1    1.0        1

Results and Configuration Summary

Hardware Configuration:

16 x Sun Fire X2270 M2 servers, each with
2 x Intel Xeon X5670 2.93 GHz processors, turbo enabled
96 GB memory (1066 MHz)
1 x 1 TB 7200 RPM 3.5-in. SATA HDD
2 x 10/100/1000 Ethernet ports

Software Configuration:

Oracle Solaris 10 10/09
Java Platform, Standard Edition, JDK 6 Update 20 Performance Release
Hadoop v0.20.2

Benchmark Description

The Apache Hadoop middleware, originally developed at Yahoo, is an open-source implementation of Google's MapReduce. MapReduce permits the programmer to write serial code that the framework schedules for parallel execution. MapReduce has been applied to a wide variety of problems, including image processing, sorting, database merging and genomics.

Hadoop uses the Hadoop Distributed File System (HDFS), which distributes data across the local disks of a cluster such that each node in the cluster accesses its local disk to the greatest extent possible.

Results for two different Hadoop benchmarks are reported above:

  • Terasort is an I/O intensive benchmark that was originally developed by Jim Gray. By having many Hadoop data nodes, it is possible to achieve high I/O capacity. For purposes of benchmarking, the Teragen program was used to create an input data set that comprised 100 GB.

  • CloudBurst is a genome read-mapping benchmark that was developed by Michael Schatz, previously of the University of Maryland and presently of Cold Spring Harbor Laboratory. CloudBurst maps what is known as DNA short read data onto a reference genome. For purposes of benchmarking, the SRR001113 short read data set is mapped onto the hs_ref_GRCh37 sequence data for all chromosomes of the human genome. Specifically, the hs_ref_GRCh37 FASTA files for chromosomes 1, 2, ... 21, 22, X and Y were concatenated in that order to obtain one large FASTA file that represented all chromosomes of the human genome. For purposes of benchmarking, any DNA fragment from the SRR001113 short read data set that contained more than three mismatches was ignored.

See Also

Disclosure Statement

Hadoop, see http://hadoop.apache.org/ for more information. Results as of 9/20/2010.

Tuesday Sep 21, 2010

ProMAX Performance and Throughput on Sun Fire X2270 and Sun Storage 7410

Halliburton/Landmark's ProMAX 3D Prestack Kirchhoff Time Migration's single job scalability and multiple job throughput using various scheduling methods are evaluated on a cluster of Oracle's Sun Fire X2270 servers attached via QDR InfiniBand to Oracle's Sun Storage 7410 system.

Two resource scheduling methods, compact and distributed, are compared while increasing the system load with additional concurrent ProMAX jobs.

  • A single ProMAX job has near linear scaling of 5.5x on 6 nodes of a Sun Fire X2270 cluster.

  • A single ProMAX job has near linear scaling of 7.5x on a Sun Fire X2270 server when running from 1 to 8 threads.

  • ProMAX can take advantage of Oracle's Sun Storage 7410 system in place of dedicated local disks: there was no significant difference in run time observed when running up to 8 concurrent 16-thread jobs.

  • The 8-thread ProMAX job throughput using the distributed scheduling method is equivalent to or slightly faster than the compact scheme for 1 to 4 concurrent jobs.

  • The 16-thread ProMAX job throughput using the distributed scheduling method is up to 8% faster when compared to the compact scheme on an 8-node Sun Fire X2270 cluster.

The multiple job throughput characterization revealed in this benchmark study is key to pre-configuring Oracle Grid Engine resource scheduling for ProMAX on a Sun Fire X2270 cluster and provides valuable insight for server consolidation.

Performance Landscape

Single Job Scaling

Single job performance on a single node is near linear up to the number of cores in the node, i.e., 2 Intel Xeon X5570s with 4 cores each. With hyperthreading (2 active threads per core) enabled, more ProMAX threads are used, increasing the load on the CPU's memory architecture and causing the reduced speedups.
ProMAX single job performance on the 6-node cluster shows near linear speedup node to node.
Single Job 6-Node Scalability
Hyperthreading Enabled - 16 Threads/Node Maximum

Number     Threads     Speedup to    Speedup to
of Nodes   per Node    1 Thread      1 Node
6          16          54.2          5.5
4          16          36.2          3.6
3          16          26.1          2.6
2          16          17.6          1.8
1          16          10.0          1.0
1          14          9.2
1          12          8.6
1          10          7.2\*
1          8           7.5
1          6           5.9
1          4           3.9
1          3           3.0
1          2           2.0
1          1           1.0

\* 2 threads contend with two master node daemons

Multiple Job Throughput Scaling, Compact Scheduling

With the Sun Storage 7410 system, performance of 8 concurrent jobs on the cluster using compact scheduling is equivalent to a single job.

Multiple Job Throughput Scalability
Hyperthreading Enabled - 16 Threads/Node Maximum

Number     Nodes      Threads per     Performance    Total    Percent of
of Jobs    per Job    Node per Job    Relative to    Nodes    Cluster Used
                                      1 Job
1          1          16              1.00           1        13
2          1          16              1.00           2        25
4          1          16              1.00           4        50
8          1          16              1.00           8        100

Multiple 8-Thread Job Throughput Scaling, Compact vs. Distributed Scheduling

These results compare various distributed-method resource scheduling levels against 1, 2, and 4 concurrent-job compact-method baselines.

Multiple 8-Thread Job Scheduling
HyperThreading Enabled - Use 8 Threads/Node Maximum

Number     Nodes      Threads per     Performance    Total    Total Threads    Percent of PVM Master
of Jobs    per Job    Node per Job    Relative to    Nodes    per Node Used    8 Threads Used
                                      1 Job
1          1          8               1.00           1        8                100
1          4          2               1.01           4        2                25
1          8          1               1.01           8        1                13

2          1          8               1.00           2        8                100
2          4          2               1.01           4        4                50
2          8          1               1.01           8        2                25

4          1          8               1.00           4        8                100
4          4          2               1.00           4        8                100
4          8          1               1.01           8        4                100

Multiple 16-Thread Job Throughput Scaling, Compact vs. Distributed Scheduling

The results are reported relative to the performance of 1, 2, 4, and 8 concurrent 2-node, 8-thread jobs.

Multiple 16-Thread Job Scheduling
HyperThreading Enabled - 16 Threads/Node Available

Number     Nodes      Threads per     Performance    Total    Total Threads    Percent of PVM Master
of Jobs    per Job    Node per Job    Relative to    Nodes    per Node Used    16 Threads Used
                                      1 Job
1          1          16              0.66           1        16               100\*
1          2          8               1.00           2        8                50
1          4          4               1.03           4        4                25
1          8          2               1.06           8        2                13

2          1          16              0.70           2        16               100\*
2          2          8               1.00           4        8                50
2          4          4               1.07           8        4                25
2          8          2               1.08           8        4                25

4          1          16              0.74           4        16               100\*
4          4          4               0.74           4        16               100\*
4          2          8               1.00           8        8                50
4          4          4               1.05           8        8                50
4          8          2               1.04           8        8                50

8          1          16              1.00           8        16               100\*
8          4          4               1.00           8        16               100\*
8          8          2               1.00           8        16               100\*
\* master PVM host; running 20 to 21 total threads (over-subscribed)

Results and Configuration Summary

Hardware Configuration:

8 x Sun Fire X2270 servers, each with
2 x 2.93 GHz Intel Xeon X5570 processors
48 GB memory at 1333 MHz
1 x 500 GB SATA
Sun Storage 7410 system
4 x 2.3 GHz AMD Opteron 8356 processors
128 GB memory
2 x internal 233 GB SAS drives (466 GB total)
2 x internal 93 GB read optimized SSDs (186 GB total)
1 x external Sun Storage J4400 array with 22 x 1 TB SATA drives and 2 x 18 GB write optimized SSDs
11 TB mirrored data and mirrored write optimized SSD

Software Configuration:

SUSE Linux Enterprise Server 10 SP 2
Parallel Virtual Machine 3.3.11
Oracle Grid Engine
Intel 11.1 Compilers
OpenWorks Database requires Oracle 10g Enterprise Edition
Libraries: pthreads 2.4, Java 1.6.0_01, BLAS, Stanford Exploration Project Libraries

Benchmark Description

The ProMAX family of seismic data processing tools is the most widely used Oil and Gas Industry seismic processing application. ProMAX is used for multiple applications, from field processing and quality control, to interpretive project-oriented reprocessing at oil companies and production processing at service companies. ProMAX is integrated with Halliburton's OpenWorks Geoscience Oracle Database to index prestack seismic data and populate the database with processed seismic.

This benchmark evaluates single job scalability and multiple job throughput of the ProMAX 3D Prestack Kirchhoff Time Migration while processing the Halliburton benchmark data set containing 70,808 traces with 8 msec sample interval and trace length of 4992 msec. Alternative thread scheduling methods are compared for optimizing single and multiple job throughput. The compact scheme schedules the threads of a single job on as few nodes as possible, whereas the distributed scheme schedules the threads across as many nodes as possible. The effects of load on the Sun Storage 7410 system are measured. This information provides valuable insight into determining the Oracle Grid Engine resource management policies.
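
For example, an 8-thread job under the compact scheme runs all eight of its threads on one node, while under the distributed scheme on an 8-node cluster it places one thread on each node.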

Hyperthreading is enabled for all of the tests. It should be noted that every node is running a PVM daemon and ProMAX license server daemon. On the master PVM daemon node, there are three additional ProMAX daemons running.

The first test measures single job scalability across a 6-node cluster with an additional node serving as the master PVM host. The speedup relative to a single node, single thread is reported.

The second test measures multiple job scalability running 1 to 8 concurrent 16-thread jobs using the Sun Storage 7410 system. The performance is reported relative to a single job.

The third test compares 8-thread multiple job throughput using different job scheduling methods on a cluster. The compact method involves putting all 8 threads for a job on the same node. The distributed method involves spreading the 8 threads of a job across multiple nodes. The results compare various distributed-method resource scheduling levels against 1, 2, and 4 concurrent-job compact-method baselines.

The fourth test is similar to the second test except running 16-thread ProMAX jobs. The results are reported relative to the performance of 1, 2, 4, and 8 concurrent 2-node, 8-thread jobs.

The ProMAX processing parameters used for this benchmark:

Minimum output inline = 65
Maximum output inline = 85
Inline output sampling interval = 1
Minimum output xline = 1
Maximum output xline = 200 (fold)
Xline output sampling interval = 1
Antialias inline spacing = 15
Antialias xline spacing = 15
Stretch Mute Aperture Limit with Maximum Stretch = 15
Image Gather Type = Full Offset Image Traces
No Block Moveout
Number of Alias Bands = 10
3D Amplitude Phase Correction
No compression
Maximum Number of Cache Blocks = 500000

Key Points and Best Practices

  • The application was rebuilt with the Intel 11.1 Fortran and C++ compilers with these flags.

    -xSSE4.2 -O3 -ipo -no-prec-div -static -m64 -ftz -fast-transcendentals -fp-speculation=fast
  • There are additional execution threads associated with a ProMAX node. There are two threads that run on each node: the license server and PVM daemon. There are at least three additional daemon threads that run on the PVM master server: the ProMAX interface GUI, the ProMAX job execution - SuperExec, and the PVM console and control. It is best to allocate one node as the master PVM server to handle the additional 5+ threads. Otherwise, hyperthreading can be enabled and the master PVM host can support up to 8 ProMAX job threads.

  • When hyperthreading is enabled on one of the non-master PVM hosts, there is a 7% penalty going from 8 to 10 threads; however, 12 threads are 11 percent faster than 8. This can be attributed to the two additional support threads that run when hyperthreading initiates.

  • Single job performance on a single node is near linear up to the number of cores in the node, i.e., 2 Intel Xeon X5570s with 4 cores each. With hyperthreading (2 active threads per core) enabled, more ProMAX threads are used, increasing the load on the CPU's memory architecture and causing the reduced speedups.

    Users need to be aware of these performance differences and how they affect their production environment.

See Also

Disclosure Statement

The following are trademarks or registered trademarks of Halliburton/Landmark Graphics: ProMAX. Results as of 9/20/2010.

Monday Sep 20, 2010

Schlumberger's ECLIPSE 300 Performance Throughput On Sun Fire X2270 Cluster with Sun Storage 7410

Oracle's Sun Storage 7410 system, attached via QDR InfiniBand to a cluster of eight of Oracle's Sun Fire X2270 servers, was used to evaluate multiple job throughput of Schlumberger's Linux-64 ECLIPSE 300 compositional reservoir simulator processing their standard 2 Million Cell benchmark model with 8 rank parallelism (MM8 job).

  • The Sun Storage 7410 system showed little difference in performance (2%) compared to running the MM8 job with dedicated local disk.

  • When running 8 concurrent jobs on 8 different nodes, all to the Sun Storage 7410 system, performance saw little degradation (5%) compared to a single MM8 job running on dedicated local disk.

Experiments were run changing how the cluster was utilized in scheduling jobs. Rather than running with the default compact mode, tests were run distributing the single job among the various nodes. Performance improvements were measured when changing from the default compact scheduling scheme (1 job to 1 node) to a distributed scheduling scheme (1 job to multiple nodes).

  • When running at 75% of the cluster capacity, distributed scheduling outperformed the compact scheduling by up to 34%. Even when running at 100% of the cluster capacity, the distributed scheduling is still slightly faster than compact scheduling.

  • When combining workloads, using the distributed scheduling allowed two MM8 jobs to finish 19% faster than the reference time and a concurrent PSTM workload to finish 2% faster.

The Oracle Solaris Studio Performance Analyzer and Sun Storage 7410 system analytics were used to identify a 3D Prestack Kirchhoff Time Migration (PSTM) as a potential candidate for consolidating with ECLIPSE. Both scheduling schemes are compared while running various job mixes of these two applications using the Sun Storage 7410 system for I/O.

These experiments showed a potential opportunity for consolidating applications using Oracle Grid Engine resource scheduling and Oracle Virtual Machine templates.

Performance Landscape

Results are presented below on a variety of experiments run using the 2009.2 ECLIPSE 300 2 Million Cell Performance Benchmark (MM8). The compute nodes are a cluster of Sun Fire X2270 servers connected with QDR InfiniBand. First, some definitions used in the tables below:

Local HDD: Each job runs on a single node to its dedicated direct attached storage.
NFSoIB: One node hosts its local disk for NFS mounting to other nodes over InfiniBand.
IB 7410: Sun Storage 7410 system over QDR InfiniBand.
Compact Scheduling: All 8 MM8 MPI processes run on a single node.
Distributed Scheduling: Allocate the 8 MM8 MPI processes across all available nodes.

First Test

The first test compares the performance of a single MM8 test on a single node using local storage to running a number of jobs across the cluster and showing the effect of different storage solutions.

Compact Scheduling
Multiple Job Throughput Results Relative to Single Job
2009.2 ECLIPSE 300 MM8 2 Million Cell Performance Benchmark

Cluster Load | Number of MM8 Jobs | Local HDD | NFSoIB | IB 7410
(storage columns show throughput relative to a single job on local HDD)
13% | 1 | 1.00 | 1.00* | 0.98
25% | 2 | 0.98 | 0.97 | 0.98
50% | 4 | 0.98 | 0.96 | 0.97
75% | 6 | 0.98 | 0.95 | 0.95
100% | 8 | 0.98 | 0.95 | 0.95

* Performance measured on the node hosting the local disk that is NFS-shared with the other nodes in the cluster.

Second Test

This test uses the Sun Storage 7410 system and compares a single MM8 job run with compact scheduling against multiple jobs run with compact scheduling and multiple jobs run with distributed scheduling. The tests are run on an 8-node cluster, so each distributed job has only 1 MPI process per node.

Comparing Compact and Distributed Scheduling
Multiple Job Throughput Results Relative to Single Job
2009.2 ECLIPSE 300 MM8 2 Million Cell Performance Benchmark

Cluster Load | Number of MM8 Jobs | Compact Scheduling | Distributed Scheduling*
13% | 1 | 1.00 | 1.34
25% | 2 | 1.00 | 1.32
50% | 4 | 0.99 | 1.25
75% | 6 | 0.97 | 1.10
100% | 8 | 0.97 | 0.98

* Each distributed job has 1 MPI process per node.

Third Test

This test repeats the comparison of compact and distributed scheduling using the Sun Storage 7410 system, but uses only 4 nodes, so each distributed job has two MPI processes per node.

Comparing Compact and Distributed Scheduling on 4 Nodes
Multiple Job Throughput Results Relative to Single Job
2009.2 ECLIPSE 300 MM8 2 Million Cell Performance Benchmark

Cluster Load | Number of MM8 Jobs | Compact Scheduling | Distributed Scheduling*
25% | 1 | 1.00 | 1.39
50% | 2 | 1.00 | 1.28
100% | 4 | 1.00 | 1.00

* Each distributed job has two MPI processes per node.

Fourth Test

The last test involves running two different applications on the 4-node cluster. It compares performance with the cluster fully loaded while changing how the applications are scheduled, either compact or distributed. The comparisons are made against each individual application running with the compact strategy (as few nodes as possible). It shows that appropriately mixing jobs can give better overall job performance than running just one kind of application on the cluster.

Multiple Job, Multiple Application Throughput Results
Comparing Scheduling Strategies
2009.2 ECLIPSE 300 MM8 2 Million Cell and 3D Kirchhoff Time Migration (PSTM)

PSTM Jobs | MM8 Jobs | ECLIPSE Compact (1 node x 8 processes/job) | ECLIPSE Distributed (4 nodes x 2 processes/job) | PSTM Distributed (4 nodes x 4 processes/job) | PSTM Compact (2 nodes x 8 processes/job) | Cluster Load
0 | 1 | 1.00 | 1.40 | - | - | 25%
0 | 2 | 1.00 | 1.27 | - | - | 50%
0 | 4 | 0.99 | 0.98 | - | - | 100%
1 | 2 | - | 1.19 | 1.02 | - | 100%
2 | 0 | - | - | 1.07 | 0.96 | 100%
1 | 0 | - | - | 1.08 | 1.00 | 50%

Results and Configuration Summary

Hardware Configuration:

8 x Sun Fire X2270 servers, each with
2 x 2.93 GHz Intel Xeon X5570 processors
24 GB memory (6 x 4 GB memory at 1333 MHz)
1 x 500 GB SATA
Sun Storage 7410 system, 24 TB total, QDR InfiniBand
4 x 2.3 GHz AMD Opteron 8356 processors
128 GB memory
2 x internal 233 GB SAS drives (466 GB total)
2 x internal 93 GB read-optimized SSDs (186 GB total)
1 x Sun Storage J4400 array with 22 x 1 TB SATA drives and 2 x 18 GB write-optimized SSDs
20 TB RAID-Z2 (double parity) data and 2-way striped write-optimized SSD, or
11 TB mirrored data and mirrored write-optimized SSD
QDR InfiniBand Switch

Software Configuration:

SUSE Linux Enterprise Server 10 SP 2
Scali MPI Connect 5.6.6
GNU C 4.1.2 compiler
2009.2 ECLIPSE 300
ECLIPSE license daemon flexlm v11.3.0.0
3D Kirchhoff Time Migration

Benchmark Description

The benchmark is a home-grown study in resource usage options when running the Schlumberger ECLIPSE 300 Compositional reservoir simulator with 8 rank parallelism (MM8) to process Schlumberger's standard 2 Million Cell benchmark model. Schlumberger pre-built executables were used to process a 260x327x73 (2 Million Cell) sub-grid with 6,206,460 total grid cells and model 7 different compositional components within a reservoir. No source code modifications or executable rebuilds were conducted.

The ECLIPSE 300 MM8 job uses 8 MPI processes. It can run within a single node (compact) or across multiple nodes of a cluster (distributed). By using the MM8 job, it is possible to compare the performance of running each job on a separate node with local disk against using a shared network-attached storage solution. The benchmark tests study the effect of increasing the number of MM8 jobs in a throughput model.

The first test compares the performance of running 1, 2, 4, 6 and 8 jobs on a cluster of 8 nodes using local disk, NFSoIB disk, and the Sun Storage 7410 system connected via InfiniBand. Results are compared against the time it takes to run 1 job with local disk. This test shows what performance impact there is when loading down a cluster.

The second test compares different methods of scheduling jobs on a cluster. The compact method involves putting all 8 MPI processes for a job on the same node. The distributed method involves using 1 MPI process per node. The results compare the performance against 1 job on one node.

The third test is similar to the second test, but uses only 4 nodes in the cluster, so when running distributed, there are 2 MPI processes per node.

The fourth test compares the compact and distributed scheduling methods on 4 nodes while running 2 MM8 jobs and one 16-way parallel 3D Prestack Kirchhoff Time Migration (PSTM) job.

Key Points and Best Practices

  • ECLIPSE is very sensitive to memory bandwidth and needs to be run with memory speeds of 1333 MHz or greater. In order to maintain 1333 MHz memory, the maximum memory configuration for the processors used in this benchmark is 24 GB. BIOS upgrades now allow 1333 MHz memory for up to 48 GB of memory. Additional nodes can be used to handle data sets that require more memory than is available per node. Allocating at least 20% of memory per node for I/O caching helps application performance.

  • If allocating an 8-way parallel job (MM8) to a single node, it is best to use an ECLIPSE license for that particular node to avoid any additional network overhead of sharing a global license with all the nodes in a cluster.

  • Understanding the ECLIPSE MM8 I/O access patterns is essential to optimizing a shared storage solution. The analytics available on the Sun Storage 7410 system provide valuable I/O characterization information even without source code access. A single MM8 job run shows an initial read and write load related to reading the input grid, parsing Petrel ASCII input parameter files, and creating an initial solution grid and runtime specifications. This is followed by a very long-running simulation that writes data and restart files and generates reports to the 7410. Due to the small-block nature of the I/O, the mirrored configuration of the 7410 outperformed the RAID-Z2 configuration.

    A single MM8 job reads, processes, and writes approximately 240 MB of grid and property data in the first 36 seconds of execution. The actual read and write of the grid data, which is intermixed with this first stage of processing, is done at a rate of 240 MB/sec to the 7410 for each of the two operations.

    Then it calculates and reports the well connections, writing an average of roughly 260 KB/second at 32 operations/second (nominally 32 x 8 KB writes/second). However, the actual size of each I/O operation varies between 2 and 100 KB, and there are peaks every 20 seconds. The write cache averages 8 accesses/second at approximately 61 KB/second (nominally 8 x 8 KB writes/second). As the number of concurrent jobs increases, the interconnect traffic and random I/O operations per second to the 7410 increase. A quick sanity check of this arithmetic appears after this list.

  • Startup time for multiple MM8 jobs is reduced on shared file systems if each job uses separate input files.
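
As a quick sanity check of the write-rate arithmetic reported above (the small gaps between the nominal and observed figures reflect measurement averaging):

    # Nominal rates implied by the observed operation sizes and counts (Python).
    print(32 * 8)  # 256 KB/s for well-connection reporting, ~ the observed 260 KB/s
    print(8 * 8)   # 64 KB/s for the write cache, ~ the observed 61 KB/s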

See Also

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 9/20/2010.

Sun Fire X4470 4 Node Cluster Delivers World Record SAP SD-Parallel Benchmark Result

Oracle delivered an SAP enhancement package 4 for SAP ERP 6.0 Sales and Distribution – Parallel (SD-Parallel) Benchmark world record result using four of Oracle's Sun Fire X4470 servers, Oracle Solaris 10 and Oracle 11g Real Application Clusters (RAC) software.

  • The Sun Fire X4470 servers delivered 8% more performance compared to the IBM Power 780 server running the SAP enhancement package 4 for SAP ERP 6.0 Sales and Distribution benchmark.

  • The Sun Fire X4470 servers result of 40,000 users delivered 2.2 times the performance of the HP ProLiant DL980 G7 result of 18,180 users.

  • The Sun Fire X4470 servers result of 40,000 users delivered 2.5 times the performance of the Fujitsu PRIMEQUEST 1800E result of 16,000 users.

This result shows that a complete software and hardware solution from Oracle using Oracle RAC, Oracle Solaris and Sun servers provides a superior performing solution.

Performance Landscape

Selected SAP Sales and Distribution benchmark results are presented in decreasing order of performance. All benchmarks used SAP enhancement package 4 for SAP ERP 6.0 (Unicode) except the result marked with an asterisk (*), which was achieved with SAP ERP 6.0.

System OS
Database
Users SAPS Type Date
Four Sun Fire X4470
4xIntel Xeon X7560 @2.26GHz
256 GB
Solaris 10
Oracle 11g Real Application Clusters
40,000 221,014 Parallel 20-Sep-10
Five IBM System p 570 (\*)
8xPOWER6 @4.7GHz
128 GB
AIX 5L Version 5.3
Oracle 10g Real Application Clusters
37,040 187,450 Parallel "non-Unicode" 25-Mar-08
IBM Power 780
8xPOWER7 @3.8GHz
1 TB
AIX 6.1
DB2 9.7
37,000 202,180 2-Tier 7-Apr-10
Two Sun Fire X4470
4xIntel Xeon X7560 @2.26GHz
256 GB
Solaris 10
Oracle 11g Real Application Clusters
21,000 115,300 Parallel 28-Jun-10
HP DL980 G7
8xIntel Xeon X7560 @2.26GHz
512 GB
Win Server 2008 R2 DE
SQL Server 2008
18,180 99,320 2-Tier 21-Jun-10
Fujitsu PRIMEQUEST 1800E
8xIntel Xeon X7560 @2.26GHz
512 GB
Win Server 2008 R2 DE
SQL Server 2008
16,000 87,550 2-Tier 30-Mar-10
Four Sun Blade X6270
2xIntel Xeon X5570 @2.93GHz
48 GB
Solaris 10
Oracle 10g Real Application Clusters
13,718 75,762 Parallel 12-Oct-09
HP DL580 G7
4xIntel Xeon X7560 @2.26GHz
256 GB
Win Server 2008 R2 DE
SQL Server 2008
10,445 57,020 2-Tier 21-Jun-10
Two Sun Blade X6270
2xIntel Xeon X5570 @2.93GHz
48 GB
Solaris 10
Oracle 10g Real Application Clusters
7,220 39,420 Parallel 12-Oct-09
One Sun Blade X6270
2xIntel Xeon X5570 @2.93GHz
48 GB
Solaris 10
Oracle 10g Real Application Clusters
3,800 20,750 Parallel 12-Oct-09

Complete benchmark results and a description can be found at the SAP benchmark website http://www.sap.com/solutions/benchmark/sd.epx.

Results and Configuration Summary

Hardware Configuration:

4 x Sun Fire X4470 servers, each with
4 x Intel Xeon X7560 2.26 GHz (4 chips, 32 cores, 64 threads)
256 GB memory

Software Configuration:

Oracle 11g Real Application Clusters (RAC)
Oracle Solaris 10

Results Summary:

Number of SAP SD benchmark users:
40,000
Average dialog response time:
0.86 seconds
Throughput:

Dialog steps/hour:
13,261,000

SAPS:
221,020
SAP Certification:
2010039

Benchmark Description

SAP is one of the premier world-wide ERP application providers and maintains a suite of benchmark tests to demonstrate the performance of competitive systems running the various SAP products.

The SAP Standard Application SD Benchmark represents the critical tasks performed in real-world ERP business environments. The SAP Standard Application Sales and Distribution - Parallel (SD-Parallel) Benchmark is a two-tier ERP business test that is indicative of full business workloads of complete order processing and invoice processing and demonstrates the ability to run both the application and database software on a single system.

The SD-Parallel Benchmark consists of the same transactions and user interaction steps as the SD Benchmark. This means that the SD-Parallel Benchmark runs the same business processes as the SD Benchmark. The difference between the benchmarks is the technical data distribution.

The additional rule for parallel and distributed databases is one must equally distribute the benchmark users across all database nodes for the used benchmark clients (round-robin method). Following this rule, all database nodes work on data of all clients. This avoids unrealistic configurations such as having only one client per database node.
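
As an illustration of the round-robin rule (a sketch of the distribution principle only, not SAP's actual benchmark tooling):

    # Assign benchmark users to database nodes round-robin, so every node
    # works on data of all clients. User and node counts are illustrative.
    def assign_users(num_users, num_db_nodes):
        return {user: user % num_db_nodes for user in range(num_users)}

    print(assign_users(num_users=8, num_db_nodes=4))
    # {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 3}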

The SAP Benchmark Council agreed to give the parallel benchmark a different name so that the difference can be easily recognized by any interested parties - customers, prospects, and analysts. The naming convention is SD-Parallel for Sales & Distribution - Parallel.

In January 2009, a new version of the SAP enhancement package 4 for SAP ERP 6.0 (Unicode) Sales and Distribution (SD) Benchmark was released. This release has higher CPU requirements and so yields 25-50% fewer users compared with the previous (non-Unicode) Standard Sales and Distribution (SD) Benchmark. Between 10% and 30% of this greater load is due to the extra overhead of processing the larger character strings required by Unicode encoding.

Unicode is a computing standard that allows for the representation and manipulation of text expressed in most of the world's writing systems. Before the Unicode requirement, this benchmark used ASCII characters, each of which is just 1 byte. The new version of the benchmark requires Unicode characters, and the application layer (where ~90% of the cycles in this benchmark are spent) uses a new encoding, UTF-16, which uses 2 bytes to encode most characters (including all ASCII characters) and 4 bytes for some others. This requires computers to do more computation and use more bandwidth and storage for most character strings.
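
The byte-count difference is easy to demonstrate. This illustrative snippet (ours; the string is made up) compares ASCII and UTF-16 encodings of the same text:

    # ASCII stores 1 byte per character; UTF-16 uses 2 bytes for most
    # characters (including all ASCII) and 4 bytes for some others.
    s = "ORDER-12345"                  # hypothetical order key
    print(len(s.encode("ascii")))      # 11 bytes
    print(len(s.encode("utf-16-le")))  # 22 bytes -- twice the data to move and store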

See Also

Disclosure Statement

SAP enhancement package 4 for SAP ERP 6.0 (Unicode) Sales and Distribution Benchmark, results as of 9/19/2010. For more details, see http://www.sap.com/benchmark. SD-Parallel, Four Sun Fire X4470 (each 4 processors, 32 cores, 64 threads) 40,000 SAP SD Users, Cert# 2010039. SD-Parallel, Two Sun Fire X4470 (each 4 processors, 32 cores, 64 threads) 21,000 SAP SD Users, Cert# 2010029. SD 2-Tier, HP ProLiant DL980 G7 (8 processors, 64 cores, 128 threads) 18,180 SAP SD Users, Cert# 2010028. SD 2-Tier, Fujitsu PRIMEQUEST 1800E (8 processors, 64 cores, 128 threads) 16,000 SAP SD Users, Cert# 2010010. SD-Parallel, Four Sun Blade X6270 (each 2 processors, 8 cores, 16 threads) 13,718 SAP SD Users, Cert# 2009041. SD 2-Tier, HP ProLiant DL580 G7 (4 processors, 32 cores, 64 threads) 10,490 SAP SD Users, Cert# 2010032. SD 2-Tier, IBM System x3850 X5 (4 processors, 32 cores, 64 threads) 10,450 SAP SD Users, Cert# 2010012. SD 2-Tier, Fujitsu PRIMERGY RX600 S5 (4 processors, 32 cores, 64 threads) 9,560 SAP SD Users, Cert# 2010017. SD-Parallel, Two Sun Blade X6270 (each 2 processors, 8 cores, 16 threads) 7,220 SAP SD Users, Cert# 2009040. SD-Parallel, Sun Blade X6270 (2 processors, 8 cores, 16 threads) 3,800 SAP SD Users, Cert# 2009039. SD 2-Tier, Sun Fire X4270 (2 processors, 8 cores, 16 threads) 3,800 SAP SD Users, Cert# 2009033.

SAP ERP 6.0 (Unicode) Sales and Distribution Benchmark, results as of 9/19/2010. SD-Parallel, Five IBM System p 570 (each 8 processors, 16 cores, 32 threads) 37,040 SAP SD Users, Cert# 2008013.

Tuesday Jun 29, 2010

Sun Fire X2270 M2 Achieves Leading Single Node Results on ANSYS FLUENT Benchmark

Oracle's Sun Fire X2270 M2 server produced leading single node performance results running the ANSYS FLUENT benchmark cases as compared to the best single node results currently posted at the ANSYS FLUENT website. ANSYS FLUENT is a prominent MCAE application used for computational fluid dynamics (CFD).

  • The Sun Fire X2270 M2 server outperformed all single node systems in 5 of 6 test cases at the 12 core level, beating systems from Cray and SGI.
  • For the truck_14m test, the Sun Fire X2270 M2 server outperformed all single node systems at all posted core counts, beating systems from SGI, Cray and HP. When considering performance on a single node, the truck_14m model is most representative of customer CFD model sizes in the test suite.
  • The Sun Fire X2270 M2 server with 12 cores performed up to 1.3 times faster than the previous generation Sun Fire X2270 server with 8 cores.

Performance Landscape

Results are presented for six of the seven ANSYS FLUENT benchmark tests. The seventh test is not a practical test for a single system. Results are ratings, where bigger is better. A rating is the number of jobs that could be run in a single day (86,400 / run time). Competitive results are from the ANSYS FLUENT benchmark website as of 25 June 2010.
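
To make the rating arithmetic concrete, here is a short illustrative computation (the 911-second run time is back-calculated from the published truck_14m rating, not a measured value):

    # Rating = number of jobs that could be run in a single day (Python).
    def rating(run_time_seconds):
        return 86400.0 / run_time_seconds

    print(round(rating(911.0), 1))  # ~94.8, matching the truck_14m rating below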

Single System Performance

ANSYS FLUENT Benchmark Tests
Results are Ratings, Bigger is Better
System | eddy_417k | turbo_500k | aircraft_2m | sedan_4m | truck_14m | truck_poly_14m
Sun Fire X2270 M2 | 1129.4 | 5391.6 | 1105.9 | 814.1 | 94.8 | 96.4
SGI Altix 8400EX | 1338.0 | 5308.8 | 1073.3 | 796.3 | - | -
SGI Altix XE1300C | 1299.2 | 5284.4 | 1071.3 | 801.3 | 90.2 | -
Cray CX1 | 1060.8 | 5127.6 | 1069.6 | 808.6 | 86.1 | 87.5

Scaling of Benchmark Test truck_14m

ANSYS FLUENT truck_14m Model
Results are Ratings, Bigger is Better
System | 12 cores | 8 cores | 4 cores | 2 cores | 1 core
Sun Fire X2270 M2 | 94.8 | 73.6 | 41.4 | 21.0 | 10.4
SGI Altix XE1300C | 90.2 | 60.9 | 41.1 | 20.7 | 9.0
Cray CX1 (X5570) | - | 71.7 | 33.2 | 18.9 | 8.1
HP BL460 G6 (X5570) | - | 70.3 | 38.6 | 19.6 | 9.2

Comparing System Generations, Sun Fire X2270 M2 to Sun Fire X2270

ANSYS FLUENT Benchmark Tests
Results are Ratings, Bigger is Better
System | eddy_417k | turbo_500k | aircraft_2m | sedan_4m | truck_14m | truck_poly_14m
Sun Fire X2270 M2 | 1129.4 | 5374.8 | 1103.8 | 814.1 | 94.8 | 96.4
Sun Fire X2270 | 981.5 | 4163.9 | 862.7 | 691.2 | 73.6 | 73.3
Ratio | 1.15 | 1.29 | 1.28 | 1.18 | 1.29 | 1.32

Results and Configuration Summary

Hardware Configuration:

Sun Fire X2270 M2
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory
1 x 500 GB 7200 rpm SATA internal HDD

Sun Fire X2270
2 x 2.93 GHz Intel Xeon X5570 processors
48 GB memory
2 x 24 GB internal striped SSDs

Software Configuration:

64-bit SUSE Linux Enterprise Server 10 SP 3 (SP 2 for X2270)
ANSYS FLUENT V12.1.2
ANSYS FLUENT Benchmark Test Suite

Benchmark Description

The following description is from the ANSYS FLUENT website:

The FLUENT benchmark suite comprises a set of test cases covering a large range of mesh sizes, physical models and solvers representing typical industry usage. The cases range in size from a few hundred thousand cells to more than 100 million cells. Both the segregated and coupled implicit solvers are included, as well as hexahedral, mixed and polyhedral cell cases. This broad coverage is expected to demonstrate the breadth of FLUENT performance on a variety of hardware platforms and test cases.

The performance of a CFD code will depend on several factors, including size and topology of the mesh, physical models, numerics and parallelization, compilers and optimization, in addition to performance characteristics of the hardware where the simulation is performed. The principal objective of this benchmark suite is to provide comprehensive and fair comparative information of the performance of FLUENT on available hardware platforms.

About the ANSYS FLUENT 12 Benchmark Test Suite

    CFD models tend to be very large, since grid refinement is required to accurately capture conditions in the boundary layer region adjacent to the body over which flow occurs. Fine grids are also required to determine turbulence conditions accurately. As such, these models can run for many hours or even days, even when using a large number of processors.

See Also

Disclosure Statement

All information on the FLUENT website (http://www.fluent.com) is Copyrighted 1995-2010 by ANSYS Inc. Results as of June 25, 2010.

Sun Fire X2270 M2 Demonstrates Outstanding Single Node Performance on MSC.Nastran Benchmarks

Oracle's Sun Fire X2270 M2 server results showed outstanding performance running the MCAE application MSC.Nastran as shown by the MD Nastran MDR3 serial and parallel test cases.

Performance Landscape

Complete information about the serial results presented below can be found on the MSC Nastran website.


MD Nastran MDR3 Serial Test Results
Results are total elapsed run time in seconds

Platform | xl0imf1 | xx0xst0 | xl1fn40 | vl0sst1
Sun Fire X2270 M2 | 999 | 704 | 2337 | 115
Sun Blade X6275 | 1107 | 798 | 2285 | 120
Intel Nehalem | 1235 | 971 | 2453 | 123
Intel Nehalem w/ SSD | 1484 | 767 | 2456 | 120
IBM:P6 570 (I8) | - | 1510 | 4612 | 132
IBM:P6 570 (I4) | 1016 | 1618 | 5534 | 147

Complete information about the parallel results presented below can be found on the MSC Nastran website.


MD Nastran MDR3 Parallel Test Results
Results are total elapsed run time in seconds

Platform | xx0cmd2 Serial | xx0cmd2 DMP=2 | xx0cmd2 DMP=4 | xx0cmd2 DMP=8 | md0mdf1 Serial | md0mdf1 DMP=2 | md0mdf1 DMP=4
Sun Blade X6275 | 840 | 532 | 391 | 279 | 880 | 422 | 223
Sun Fire X2270 M2 | 847 | 558 | 371 | 297 | 889 | 462 | 232
Intel Nehalem w/ 4 SSD | 887 | 639 | 405 | - | 902 | 479 | 235
Intel Nehalem | 915 | 561 | 408 | - | 922 | 470 | 251
IBM:P6 570 (I8) | 920 | 574 | 392 | 322 | - | - | -
IBM:P6 570 (I4) | 959 | 616 | 419 | 343 | 911 | 469 | 242

Results and Configuration Summary

Hardware Configuration:

Sun Fire X2270 M2
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory
4 x 24 GB SSDs (striped)

Software Configuration:

64-bit SUSE Linux Enterprise Server 10 SP 3
MSC Software MD 2008 R3
MD Nastran MDR3 benchmark test suite

Benchmark Description

The benchmark tests are representative of typical MSC.Nastran applications including both serial and parallel (DMP) runs involving linear statics, nonlinear statics, and natural frequency extraction as well as others. MD Nastran is an integrated simulation system with a broad set of multidiscipline analysis capabilities.

Key Points and Best Practices

  • The test cases for the MSC.Nastran module all have a substantial I/O component where 15% to 25% of the total run times are associated with I/O activity (primarily scratch files). The required scratch file size ranges from less than 1 GB on up to about 140 GB. To obtain best performance, it is important to have a high performance storage system when running MD Nastran.

  • To improve performance, it is possible to use the MD Nastran feature that sets the maximum amount of memory the application will use. This lets a user control where temporary files are held, including in-memory file systems like tmpfs; a sketch follows below.
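
A sketch of that setup follows. The memory= and sdirectory= submission keywords reflect common MD Nastran command-line usage but are our assumption here; verify them against your installation's documentation.

    # Hypothetical wrapper: run an MD Nastran job with a memory cap and
    # scratch files relocated to an in-memory (tmpfs) file system.
    import subprocess

    def run_nastran(deck, mem="8gb", scratch="/dev/shm/nastran_scratch"):
        # 'memory=' caps application memory; 'sdirectory=' relocates scratch
        # files -- keyword names are assumptions, not verified against MD 2008 R3.
        cmd = ["nastran", deck, "memory=%s" % mem, "sdirectory=%s" % scratch]
        return subprocess.run(cmd, check=True)

    # run_nastran("xx0cmd2.dat")  # deck name illustrative (cf. tables above)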

See Also

Disclosure Statement

MSC.Software is a registered trademark of MSC. All information on the MSC.Software website is copyrighted. MD Nastran MDR3 results from http://www.mscsoftware.com and this report as of June 28, 2010.

Sun Fire X2270 M2 Sets World Record on SPEC OMP2001 Benchmark

Oracle's Sun Fire X2270 M2 server, running Oracle Solaris 10 10/09 with the Oracle Solaris Studio 12 Update 1 compiler, produced the top x86 SPECompM2001 result for all 2-socket servers.

  • The Sun Fire X2270 M2 server with two Intel Xeon X5670 processors running 24 OpenMP threads achieved a SPEC OMP2001 result of 55,178 SPECompM2001.

  • The Sun Fire X2270 M2 server beat the Cisco B200 M2 system even though the Cisco system used the faster Intel Xeon X5680 (3.33 GHz) chips.

Performance Landscape

SPEC OMP2001 Performance Charts - bigger is better, selected results, please see www.spec.org for complete results. All results as of 06/28/10.

In the tables below
"Base" = SPECompMbase2001 and "Peak" = SPECompMpeak2001

SPEC OMPM2001 results

System | Cores/Chips | Processor | GHz | Base Threads | Peak | Base
Sun Fire X2270 M2 | 12/2 | Xeon X5670 | 2.93 | 24 | 55178 | 49548
Cisco B200 M2 | 12/2 | Xeon X5680 | 3.33 | 24 | 55072 | 52314
Intel SR1600UR | 12/2 | Xeon X5680 | 3.33 | 24 | 54249 | 51510
Intel SR1600UR | 12/2 | Xeon X5670 | 2.93 | 24 | 53313 | 50283

Results and Configuration Summary

Hardware Configuration:

Sun Fire X2270 M2
2 x 2.93 GHz Intel Xeon X5670
24 GB

Software Configuration:

Oracle Solaris 10 10/09
Oracle Solaris Studio 12 Update 1
SPEC OMP2001 suite v3.2

Benchmark Description

The SPEC OMPM2001 Benchmark Suite was released in June 2001 and tests HPC performance using OpenMP for parallelism.

  • 11 programs (3 in C and 8 in Fortran) parallelized using OpenMP API
Goals of the suite:
  • Targeted to mid-range (4-32 processor) parallel systems
  • Run rules, tools and reporting similar to SPEC CPU2006
  • Programs representative of HPC and Scientific Applications

The SPEC OMPL2001 Benchmark Suite was released in June 2001 and tests HPC performance using OpenMP for parallelism.

  • 9 programs (2 in C and 7 in Fortran) parallelized using OpenMP API
Goals of the suite:
  • Targeted to larger parallel systems
  • Run rules, tools and reporting similar to SPEC CPU2006
  • Programs representative of HPC and Scientific Applications

There are "base" variants of both the above metrics that require more conservative compilation, such as using the same flags for all benchmarks.

See Also

Disclosure Statement

SPEC, SPEComp reg tm of Standard Performance Evaluation Corporation. Results from www.spec.org as of 28 June 2010 and this report. Sun Fire X2270 M2 (2 chips, 12 cores, 24 OpenMP threads) 55,178 SPECompM2001.

Sun Fire X4170 M2 Sets World Record on SPEC CPU2006 Benchmark

Oracle's Sun Fire X4170 M2 server equipped with two Intel Xeon X5670 2.93 GHz processors and running the Oracle Solaris 10 operating system delivered a world record score of 53.5 SPECfp_base2006.

  • The Sun Fire X4170 M2 server using the Oracle Solaris Studio Express 06/10 compiler delivered a world record result of 53.5 SPECfp_base2006.

  • The Sun Fire X4170 M2 server delivered 20% better performance on the SPECfp_base2006 benchmark compared to the IBM Power 780 POWER7-based system.

  • The Sun Fire X4170 M2 server beat systems from Supermicro (X8DTU-LN4F+), Dell (R710), IBM (x3650 M3) and Bull (R460 F2) on SPECfp_base2006.

Performance Landscape

SPEC CPU2006 Performance Charts - bigger is better, selected results, please see www.spec.org for complete results. All results as of 06/28/10.

In the tables below
"Base" = SPECint_base2006, SPECfp_base2006, SPECint_rate_base2006 or SPECfp_rate_base2006
"Peak" = SPECint2006, SPECfp2006, SPECint_rate2006 or SPECfp_rate2006

SPECfp2006 results

System | Cores/Chips | Processor | GHz | Peak | Base
Sun Fire X4170 M2 | 12/2 | Xeon X5670 | 2.93 | 57.6 | 53.5
Sun Fire X2270 M2 | 12/2 | Xeon X5670 | 2.93 | 58.6 | 49.9
Supermicro X8DTU-LN4F+ | 8/2 | Xeon X5677 | 3.46 | 48.8 | 45.9
IBM x3650 M3 | 8/2 | Xeon X5677 | 3.46 | 48.9 | 45.8
Bull R460 F2 | 8/2 | Xeon X5677 | 3.46 | 49.3 | 45.8
Dell R710 | 8/2 | Xeon X5677 | 3.46 | 49.3 | 45.8
Dell R710 | 12/2 | Xeon X5680 | 3.33 | 48.5 | 45.0
IBM 780 | 16/2 | POWER7 | 3.94 | 71.5 | 44.5
Dell R710 | 12/2 | Xeon X5670 | 2.93 | 45.8 | 42.5

SPECint_rate2006 results

System | Cores/Chips | Processor | GHz | Base Copies | Peak | Base
Dell R815 | 24/2 | Opteron 6176 | 2.3 | 24 | 401 | 314
Fujitsu BX922 S2 | 12/2 | Xeon X5680 | 3.33 | 24 | 381 | 354
Dell R710 | 12/2 | Xeon X5680 | 3.33 | 24 | 379 | 355
Sun Blade X6270 M2 | 12/2 | Xeon X5680 | 3.33 | 24 | 369 | 337
Sun Fire X4170 M2 | 12/2 | Xeon X5670 | 2.93 | 24 | 353 | 316
Sun Fire X2270 M2 (S10) | 12/2 | Xeon X5670 | 2.93 | 24 | 346 | 311
Sun Fire X2270 M2 (OEL) | 12/2 | Xeon X5670 | 2.93 | 24 | 342 | 320

SPECfp_rate2006 results

System | Cores/Chips | Processor | GHz | Base Copies | Peak | Base
Dell R815 | 24/2 | Opteron 6176 | 2.3 | 24 | 323 | 295
Dell R710 | 12/2 | Xeon X5680 | 3.33 | 24 | 256 | 248
Fujitsu BX922 S2 | 12/2 | Xeon X5680 | 3.33 | 24 | 256 | 248
Sun Blade X6270 M2 | 12/2 | Xeon X5680 | 3.33 | 24 | 255 | 247
Sun Fire X4170 M2 | 12/2 | Xeon X5670 | 2.93 | 24 | 245 | 234
Sun Fire X2270 M2 (S10) | 12/2 | Xeon X5670 | 2.93 | 24 | 240 | 231
Sun Fire X2270 M2 (OEL) | 12/2 | Xeon X5670 | 2.93 | 24 | 235 | 226

Results and Configuration Summary

Hardware Configuration:

Sun Fire X4170 M2
2 x 2.93 GHz Intel Xeon X5670
48 GB
Sun Fire X2270 M2
2 x 2.93 GHz Intel Xeon X5670
48 GB
Sun Blade X6270 M2
2 x 3.33 GHz Intel Xeon X5680
48 GB

Software Configuration:

Oracle Solaris 10 10/09
Oracle Solaris Studio Express 6/10
SPEC CPU2006 suite v1.1
MicroQuill SmartHeap Library v8.1

Benchmark Description

SPEC CPU2006 is SPEC's most popular benchmark, with over 8000 results published in the three years since it was introduced. It measures:

  • "Speed" - single copy performance of chip, memory, compiler
  • "Rate" - multiple copy (throughput)

The rate metrics are used for the throughput-oriented systems described on this page. These metrics include:

  • SPECint_rate2006: throughput for 12 integer benchmarks derived from real applications such as perl, gcc, XML processing, and pathfinding
  • SPECfp_rate2006: throughput for 17 floating point benchmarks derived from real applications, including chemistry, physics, genetics, and weather.

There are base variants of both the above metrics that require more conservative compilation. In particular, all benchmarks of a particular programming language must use the same compilation flags.

See Also

Disclosure Statement

SPEC, SPECint, SPECfp reg tm of Standard Performance Evaluation Corporation. Results from www.spec.org as of 24 June 2010 and this report. Sun Fire X4170 M2 53.5 SPECfp_base2006.

Sun Blade X6270 M2 Sets World Record on SPECjbb2005 Benchmark

Oracle's Sun Blade X6270 M2 server, equipped with two 3.33 GHz Intel Xeon X5680 processors, obtained a result of 931637 SPECjbb2005 bops, 465819 SPECjbb2005 bops/JVM on the SPECjbb2005 benchmark. This is a world record result for 2-socket servers with Intel Xeon 5600 series x86 processors.

  • This result was obtained on the Sun Blade X6270 M2 using Microsoft Windows 2008 R2 and Java HotSpot(TM) 64-Bit Server VM on Windows, version 1.6.0_21 Performance Release.

Performance Landscape

SPECjbb2005 Performance Chart (ordered by performance)

bops: SPECjbb2005 Business Operations per Second (bigger is better)

System | Processor | JVM | SPECjbb2005 bops | SPECjbb2005 bops/JVM
Sun Blade X6270 M2 | Intel X5680 3.33 GHz | Java HotSpot 1.6.0_21 | 931637 | 465819
Cisco UCS B200 M2 | Intel X5680 3.33 GHz | IBM J9 VM 2.4 | 931076 | 155179
Fujitsu TX300 S6 | Intel X5680 3.33 GHz | IBM J9 VM 2.4 | 928393 | 154732
IBM x3500 M3 | Intel X5680 3.33 GHz | IBM J9 VM 2.4 | 916251 | 152709

Complete benchmark results may be found at the SPEC benchmark website http://www.spec.org.

Results and Configuration Summary

Hardware Configuration:

Sun Blade X6270 M2
2 x Intel Xeon X5680 3.33 GHz processors
48 GB of memory

Software Configuration:

Windows Server 2008 R2 Enterprise Edition
Java HotSpot(TM) 64-Bit Server VM on Windows, version 1.6.0_21 Performance Release

Benchmark Description

SPECjbb2005 (Java Business Benchmark) measures the performance of a Java implemented application tier (server-side Java). The benchmark is based on the order processing in a wholesale supplier application. The performance of the user tier and the database tier are not measured in this test. The metrics given are number of SPECjbb2005 bops (Business Operations per Second) and SPECjbb2005 bops/JVM (bops per JVM instance).

Key Points and Best Practices

  • Enhancements to the JVM had a major impact on performance.

See Also

Disclosure Statement

SPEC, SPECjbb reg tm of Standard Performance Evaluation Corporation. Results as of 6/28/2010 on www.spec.org. Sun Blade X6270 M2(2 chips, 12 cores) 931637 SPECjbb2005 bops, 465819 SPECjbb2005 bops/JVM submitted for review. Cisco UCS B200 M2(2 chips, 12 cores) 931076 SPECjbb2005 bops, 155179 SPECjbb2005 bops/JVM. Fujitsu TX300 S6(2 chips, 12 cores) 928393 SPECjbb2005 bops, 154732 SPECjbb2005 bops/JVM. IBM x3500 M3(2 chips, 12 cores) 916251 SPECjbb2005 bops, 152709 SPECjbb2005 bops/JVM.

