Monday Sep 20, 2010

Sun Fire X4470 4 Node Cluster Delivers World Record SAP SD-Parallel Benchmark Result

Oracle delivered a world record result on the SAP enhancement package 4 for SAP ERP 6.0 Sales and Distribution – Parallel (SD-Parallel) standard application benchmark using four of Oracle's Sun Fire X4470 servers, Oracle Solaris 10 and Oracle 11g Real Application Clusters (RAC) software.

  • The Sun Fire X4470 servers delivered 8% more performance compared to the IBM Power 780 server running the SAP enhancement package 4 for SAP ERP 6.0 Sales and Distribution benchmark.

  • The Sun Fire X4470 servers result of 40,000 users delivered 2.2 times the performance of the HP ProLiant DL980 G7 result of 18,180 users.

  • The Sun Fire X4470 servers result of 40,000 users delivered 2.5 times the performance of the Fujitsu PRIMEQUEST 1800E result of 16,000 users.

This result shows that a complete software and hardware solution from Oracle using Oracle RAC, Oracle Solaris and Sun servers provides a superior performing solution.

Performance Landscape

Selected SAP Sales and Distribution benchmark results are presented below in decreasing order of performance. All results used SAP enhancement package 4 for SAP ERP 6.0 (Unicode) except the result marked with an asterisk (*), which was achieved with SAP ERP 6.0 (non-Unicode).

System                                                     OS / Database                           Users   SAPS     Type      Date
Four Sun Fire X4470 (4x Xeon X7560 @2.26GHz, 256 GB)       Solaris 10, Oracle 11g RAC              40,000  221,014  Parallel  20-Sep-10
Five IBM System p 570 (8x POWER6 @4.7GHz, 128 GB) (*)      AIX 5L V5.3, Oracle 10g RAC             37,040  187,450  Parallel  25-Mar-08
IBM Power 780 (8x POWER7 @3.8GHz, 1 TB)                    AIX 6.1, DB2 9.7                        37,000  202,180  2-Tier    7-Apr-10
Two Sun Fire X4470 (4x Xeon X7560 @2.26GHz, 256 GB)        Solaris 10, Oracle 11g RAC              21,000  115,300  Parallel  28-Jun-10
HP DL980 G7 (8x Xeon X7560 @2.26GHz, 512 GB)               Win Server 2008 R2 DE, SQL Server 2008  18,180  99,320   2-Tier    21-Jun-10
Fujitsu PRIMEQUEST 1800E (8x Xeon X7560 @2.26GHz, 512 GB)  Win Server 2008 R2 DE, SQL Server 2008  16,000  87,550   2-Tier    30-Mar-10
Four Sun Blade X6270 (2x Xeon X5570 @2.93GHz, 48 GB)       Solaris 10, Oracle 10g RAC              13,718  75,762   Parallel  12-Oct-09
HP DL580 G7 (4x Xeon X7560 @2.26GHz, 256 GB)               Win Server 2008 R2 DE, SQL Server 2008  10,445  57,020   2-Tier    21-Jun-10
Two Sun Blade X6270 (2x Xeon X5570 @2.93GHz, 48 GB)        Solaris 10, Oracle 10g RAC              7,220   39,420   Parallel  12-Oct-09
One Sun Blade X6270 (2x Xeon X5570 @2.93GHz, 48 GB)        Solaris 10, Oracle 10g RAC              3,800   20,750   Parallel  12-Oct-09

Complete benchmark results and a description can be found at the SAP benchmark website http://www.sap.com/solutions/benchmark/sd.epx.

Results and Configuration Summary

Hardware Configuration:

4 x Sun Fire X4470 servers, each with
4 x Intel Xeon X7560 2.26 GHz (4 chips, 32 cores, 64 threads)
256 GB memory

Software Configuration:

Oracle 11g Real Application Clusters (RAC)
Oracle Solaris 10

Results Summary:

Number of SAP SD benchmark users:
40,000
Average dialog response time:
0.86 seconds
Throughput:

Dialog steps/hour:
13,261,000

SAPS:
221,020
SAP Certification:
2010039
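The SAPS figure follows directly from the measured throughput: by SAP's definition, 100 SAPS corresponds to 6,000 dialog steps per hour, so SAPS is simply dialog steps per hour divided by 60. A quick sketch of that conversion:

```python
def saps_from_dialog_steps(dialog_steps_per_hour: float) -> float:
    """Convert SD benchmark throughput to SAPS.

    By SAP's definition, 100 SAPS = 6,000 dialog steps per hour,
    i.e. SAPS = dialog steps per hour / 60.
    """
    return dialog_steps_per_hour / 60.0

# The 4-node X4470 result measured 13,261,000 dialog steps/hour,
# which is consistent with the ~221,000 SAPS reported above.
print(round(saps_from_dialog_steps(13_261_000)))
```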

Benchmark Description

SAP is one of the premier world-wide ERP application providers and maintains a suite of benchmark tests to demonstrate the performance of competitive systems running the various SAP products.

The SAP Standard Application SD Benchmark represents the critical tasks performed in real-world ERP business environments. The SAP Standard Application Sales and Distribution - Parallel (SD-Parallel) Benchmark is a two-tier ERP business test that is indicative of full business workloads of complete order processing and invoice processing and demonstrates the ability to run both the application and database software on a single system.

The SD-Parallel Benchmark consists of the same transactions and user interaction steps as the SD Benchmark. This means that the SD-Parallel Benchmark runs the same business processes as the SD Benchmark. The difference between the benchmarks is the technical data distribution.

An additional rule for parallel and distributed databases is that the benchmark users must be equally distributed across all database nodes for the benchmark clients used (round-robin method). Following this rule, all database nodes work on data of all clients, which avoids unrealistic configurations such as having only one client per database node.
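The round-robin distribution rule can be sketched as follows; the node names and user counts are illustrative, not taken from the benchmark configuration:

```python
def assign_users_round_robin(num_users: int, db_nodes: list[str]) -> dict[str, list[int]]:
    """Distribute benchmark users across database nodes round-robin,
    so every node ends up serving users from all benchmark clients."""
    assignment: dict[str, list[int]] = {node: [] for node in db_nodes}
    for user in range(num_users):
        assignment[db_nodes[user % len(db_nodes)]].append(user)
    return assignment

# Example: a 40,000-user run on a hypothetical 4-node RAC cluster
nodes = ["node1", "node2", "node3", "node4"]
per_node = {n: len(u) for n, u in assign_users_round_robin(40_000, nodes).items()}
print(per_node)  # 10,000 users land on each node
```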

The SAP Benchmark Council agreed to give the parallel benchmark a different name so that the difference can be easily recognized by any interested parties - customers, prospects, and analysts. The naming convention is SD-Parallel for Sales & Distribution - Parallel.

In January 2009, a new version of the SAP enhancement package 4 for SAP ERP 6.0 (Unicode) Sales and Distribution (SD) Benchmark was released. This release has higher CPU requirements and so yields 25% to 50% fewer users compared to the previous (non-Unicode) Standard Sales and Distribution (SD) Benchmark. Between 10% and 30% of this greater load is due to the extra overhead of processing the larger character strings that Unicode encoding requires.

Unicode is a computing standard that allows for the representation and manipulation of text expressed in most of the world's writing systems. Before the Unicode requirement, this benchmark used ASCII characters, each of which occupies just 1 byte. The new version of the benchmark requires Unicode, and the application layer (where ~90% of the cycles in this benchmark are spent) uses the UTF-16 encoding, which encodes most characters (including all ASCII characters) in 2 bytes and some others in 4 bytes. This requires computers to do more computation and use more bandwidth and storage for most character strings.
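The size difference is easy to demonstrate. A short sketch comparing ASCII and UTF-16 encoded sizes for a few characters:

```python
# Encoded size in bytes for single characters.
# "utf-16-le" is used so no byte-order mark is counted.
samples = ["A", "é", "€", "\U0001F600"]  # ASCII, Latin-1, BMP symbol, supplementary plane
for ch in samples:
    print(f"U+{ord(ch):04X}: {len(ch.encode('utf-16-le'))} bytes in UTF-16")

# An ASCII character is 1 byte in ASCII but 2 bytes in UTF-16:
assert len("A".encode("ascii")) == 1
assert len("A".encode("utf-16-le")) == 2
# Characters outside the Basic Multilingual Plane need a 4-byte surrogate pair:
assert len("\U0001F600".encode("utf-16-le")) == 4
```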

See Also

Disclosure Statement

SAP enhancement package 4 for SAP ERP 6.0 (Unicode) Sales and Distribution Benchmark, results as of 9/19/2010. For more details, see http://www.sap.com/benchmark. SD-Parallel, Four Sun Fire X4470 (each 4 processors, 32 cores, 64 threads) 40,000 SAP SD Users, Cert# 2010039. SD-Parallel, Two Sun Fire X4470 (each 4 processors, 32 cores, 64 threads) 21,000 SAP SD Users, Cert# 2010029. SD 2-Tier, HP ProLiant DL980 G7 (8 processors, 64 cores, 128 threads) 18,180 SAP SD Users, Cert# 2010028. SD 2-Tier, Fujitsu PRIMEQUEST 1800E (8 processors, 64 cores, 128 threads) 16,000 SAP SD Users, Cert# 2010010. SD-Parallel, Four Sun Blade X6270 (each 2 processors, 8 cores, 16 threads) 13,718 SAP SD Users, Cert# 2009041. SD 2-Tier, HP ProLiant DL580 G7 (4 processors, 32 cores, 64 threads) 10,490 SAP SD Users, Cert# 2010032. SD 2-Tier, IBM System x3850 X5 (4 processors, 32 cores, 64 threads) 10,450 SAP SD Users, Cert# 2010012. SD 2-Tier, Fujitsu PRIMERGY RX600 S5 (4 processors, 32 cores, 64 threads) 9,560 SAP SD Users, Cert# 2010017. SD-Parallel, Two Sun Blade X6270 (each 2 processors, 8 cores, 16 threads) 7,220 SAP SD Users, Cert# 2009040. SD-Parallel, Sun Blade X6270 (2 processors, 8 cores, 16 threads) 3,800 SAP SD Users, Cert# 2009039. SD 2-Tier, Sun Fire X4270 (2 processors, 8 cores, 16 threads) 3,800 SAP SD Users, Cert# 2009033.

SAP ERP 6.0 (Unicode) Sales and Distribution Benchmark, results as of 9/19/2010. SD-Parallel, Five IBM System p 570 (each 8 processors, 16 cores, 32 threads) 37,040 SAP SD Users, Cert# 2008013.

Tuesday Jun 29, 2010

Sun Fire X2270 M2 Achieves Leading Single Node Results on ANSYS FLUENT Benchmark

Oracle's Sun Fire X2270 M2 server produced leading single node performance results running the ANSYS FLUENT benchmark cases as compared to the best single node results currently posted at the ANSYS FLUENT website. ANSYS FLUENT is a prominent MCAE application used for computational fluid dynamics (CFD).

  • The Sun Fire X2270 M2 server outperformed all single node systems in 5 of 6 test cases at the 12 core level, beating systems from Cray and SGI.
  • For the truck_14m test, the Sun Fire X2270 M2 server outperformed all single node systems at all posted core counts, beating systems from SGI, Cray and HP. When considering performance on a single node, the truck_14m model is most representative of customer CFD model sizes in the test suite.
  • The Sun Fire X2270 M2 server with 12 cores performed up to 1.3 times faster than the previous generation Sun Fire X2270 server with 8 cores.

Performance Landscape

Results are presented for six of the seven ANSYS FLUENT benchmark tests. The seventh test is not a practical test for a single system. Results are ratings, where bigger is better. A rating is the number of jobs that could be run in a single day (86,400 / run time). Competitive results are from the ANSYS FLUENT benchmark website as of 25 June 2010.
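The rating arithmetic described above is simple enough to sketch directly; the example run time is illustrative:

```python
def fluent_rating(elapsed_seconds: float) -> float:
    """ANSYS FLUENT benchmark rating: the number of jobs that could be
    run in a 24-hour day (86,400 seconds / elapsed run time)."""
    return 86400.0 / elapsed_seconds

# Example: a rating near the X2270 M2's truck_14m result of 94.8
# corresponds to a run time of roughly 86400 / 94.8 ≈ 911 seconds per job.
print(round(fluent_rating(911.4), 1))
```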

Single System Performance

ANSYS FLUENT Benchmark Tests
Results are Ratings, Bigger is Better
System             eddy_417k  turbo_500k  aircraft_2m  sedan_4m  truck_14m  truck_poly_14m
Sun Fire X2270 M2  1129.4     5391.6      1105.9       814.1     94.8       96.4
SGI Altix 8400EX   1338.0     5308.8      1073.3       796.3     -          -
SGI Altix XE1300C  1299.2     5284.4      1071.3       801.3     90.2       -
Cray CX1           1060.8     5127.6      1069.6       808.6     86.1       87.5

Scaling of Benchmark Test truck_14m

ANSYS FLUENT truck_14m Model
Results are Ratings, Bigger is Better
                     Cores Used
System               12     8     4     2     1
Sun Fire X2270 M2    94.8   73.6  41.4  21.0  10.4
SGI Altix XE1300C    90.2   60.9  41.1  20.7  9.0
Cray CX1 (X5570)     -      71.7  33.2  18.9  8.1
HP BL460 G6 (X5570)  -      70.3  38.6  19.6  9.2

Comparing System Generations, Sun Fire X2270 M2 to Sun Fire X2270

ANSYS FLUENT Benchmark Tests
Results are Ratings, Bigger is Better
System             eddy_417k  turbo_500k  aircraft_2m  sedan_4m  truck_14m  truck_poly_14m
Sun Fire X2270 M2  1129.4     5374.8      1103.8       814.1     94.8       96.4
Sun Fire X2270     981.5      4163.9      862.7        691.2     73.6       73.3
Ratio              1.15       1.29        1.28         1.18      1.29       1.32

Results and Configuration Summary

Hardware Configuration:

Sun Fire X2270 M2
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory
1 x 500 GB 7200 rpm SATA internal HDD

Sun Fire X2270
2 x 2.93 GHz Intel Xeon X5570 processors
48 GB memory
2 x 24 GB internal striped SSDs

Software Configuration:

64-bit SUSE Linux Enterprise Server 10 SP 3 (SP 2 for X2270)
ANSYS FLUENT V12.1.2
ANSYS FLUENT Benchmark Test Suite

Benchmark Description

The following description is from the ANSYS FLUENT website:

The FLUENT benchmark suite comprises a set of test cases covering a large range of mesh sizes, physical models and solvers representing typical industry usage. The cases range in size from a few hundred thousand cells to more than 100 million cells. Both the segregated and coupled implicit solvers are included, as well as hexahedral, mixed and polyhedral cell cases. This broad coverage is expected to demonstrate the breadth of FLUENT performance on a variety of hardware platforms and test cases.

The performance of a CFD code will depend on several factors, including size and topology of the mesh, physical models, numerics and parallelization, compilers and optimization, in addition to performance characteristics of the hardware where the simulation is performed. The principal objective of this benchmark suite is to provide comprehensive and fair comparative information of the performance of FLUENT on available hardware platforms.

About the ANSYS FLUENT 12 Benchmark Test Suite

    CFD models tend to be very large because grid refinement is required to accurately capture conditions in the boundary-layer region adjacent to the body over which flow occurs. Fine grids are also required to resolve turbulence accurately. As a result, these models can run for many hours or even days, even when using a large number of processors.

See Also

Disclosure Statement

All information on the FLUENT website (http://www.fluent.com) is Copyrighted 1995-2010 by ANSYS Inc. Results as of June 25, 2010.

Sun Fire X2270 M2 Demonstrates Outstanding Single Node Performance on MSC.Nastran Benchmarks

Oracle's Sun Fire X2270 M2 server results showed outstanding performance running the MCAE application MSC.Nastran as shown by the MD Nastran MDR3 serial and parallel test cases.

Performance Landscape

Complete information about the serial results presented below can be found on the MSC Nastran website.


MD Nastran MDR3 Serial Test Results
Results are total elapsed run time in seconds

Platform              xl0imf1  xx0xst0  xl1fn40  vl0sst1
Sun Fire X2270 M2     999      704      2337     115
Sun Blade X6275       1107     798      2285     120
Intel Nehalem         1235     971      2453     123
Intel Nehalem w/ SSD  1484     767      2456     120
IBM:P6 570 (I8)       -        1510     4612     132
IBM:P6 570 (I4)       1016     1618     5534     147

Complete information about the parallel results presented below can be found on the MSC Nastran website.


MD Nastran MDR3 Parallel Test Results
Results are total elapsed run time in seconds

                        xx0cmd2                        md0mdf1
Platform                Serial  DMP=2  DMP=4  DMP=8    Serial  DMP=2  DMP=4
Sun Blade X6275         840     532    391    279      880     422    223
Sun Fire X2270 M2       847     558    371    297      889     462    232
Intel Nehalem w/ 4 SSD  887     639    405    -        902     479    235
Intel Nehalem           915     561    408    -        922     470    251
IBM:P6 570 (I8)         920     574    392    322      -       -      -
IBM:P6 570 (I4)         959     616    419    343      911     469    242

Results and Configuration Summary

Hardware Configuration:

Sun Fire X2270 M2
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory
4 x 24 GB SSDs (striped)

Software Configuration:

64-bit SUSE Linux Enterprise Server 10 SP 3
MSC Software MD 2008 R3
MD Nastran MDR3 benchmark test suite

Benchmark Description

The benchmark tests are representative of typical MSC.Nastran applications including both serial and parallel (DMP) runs involving linear statics, nonlinear statics, and natural frequency extraction as well as others. MD Nastran is an integrated simulation system with a broad set of multidiscipline analysis capabilities.

Key Points and Best Practices

  • The test cases for the MSC.Nastran module all have a substantial I/O component where 15% to 25% of the total run times are associated with I/O activity (primarily scratch files). The required scratch file size ranges from less than 1 GB on up to about 140 GB. To obtain best performance, it is important to have a high performance storage system when running MD Nastran.

  • To improve performance, it is possible to make use of the MD Nastran feature which sets the maximum amount of memory the application will use. This allows a user to configure where temporary files are held, including in memory file systems like tmpfs.
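As a hypothetical illustration of that approach on Linux, the scratch directory could be staged on a memory-backed tmpfs when the scratch set fits in RAM. The mount point, size, and solver invocation below are illustrative assumptions, not the configuration used in these runs:

```shell
# Create a RAM-backed scratch area (requires root; the size must cover the
# job's scratch requirement while leaving headroom for the solver itself).
mkdir -p /mnt/nastran_scratch
mount -t tmpfs -o size=32g tmpfs /mnt/nastran_scratch

# Then point the solver's scratch files at it, e.g. (illustrative):
# nastran job.dat sdirectory=/mnt/nastran_scratch
```

For the largest test cases (scratch sets up to ~140 GB), a fast striped SSD volume, as used in the configuration above, remains the practical choice.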

See Also

Disclosure Statement

MSC.Software is a registered trademark of MSC. All information on the MSC.Software website is copyrighted. MD Nastran MDR3 results from http://www.mscsoftware.com and this report as of June 28, 2010.

Sun Fire X2270 M2 Sets World Record on SPEC OMP2001 Benchmark

Oracle's Sun Fire X2270 M2 server, running Oracle Solaris 10 10/09 with the Oracle Solaris Studio 12 Update 1 compiler, produced the top x86 SPECompM2001 result for all 2-socket servers.

  • The Sun Fire X2270 M2 server with two Intel Xeon X5670 processors running 24 OpenMP threads achieved a SPEC OMP2001 result of 55,178 SPECompM2001.

  • The Sun Fire X2270 M2 server beat the Cisco B200 M2 system even though the Cisco system used the faster Intel Xeon X5680 (3.33GHz) chips.

Performance Landscape

SPEC OMP2001 Performance Charts - bigger is better, selected results, please see www.spec.org for complete results. All results as of 06/28/10.

In the tables below
"Base" = SPECompMbase2001 and "Peak" = SPECompMpeak2001

SPEC OMPM2001 results

System             Cores/Chips  Type        GHz   Base Threads  Peak   Base
Sun Fire X2270 M2  12/2         Xeon X5670  2.93  24            55178  49548
Cisco B200 M2      12/2         Xeon X5680  3.33  24            55072  52314
Intel SR1600UR     12/2         Xeon X5680  3.33  24            54249  51510
Intel SR1600UR     12/2         Xeon X5670  2.93  24            53313  50283

Results and Configuration Summary

Hardware Configuration:

Sun Fire X2270 M2
2 x 2.93 GHz Intel Xeon X5670
24 GB

Software Configuration:

Oracle Solaris 10 10/09
Oracle Solaris Studio 12 Update 1
SPEC OMP2001 suite v3.2

Benchmark Description

The SPEC OMPM2001 Benchmark Suite was released in June 2001 and tests HPC performance using OpenMP for parallelism.

  • 11 programs (3 in C and 8 in Fortran) parallelized using OpenMP API
Goals of the suite:
  • Targeted to mid-range (4-32 processor) parallel systems
  • Run rules, tools and reporting similar to SPEC CPU2006
  • Programs representative of HPC and Scientific Applications

The SPEC OMPL2001 Benchmark Suite was released in June 2001 and tests HPC performance using OpenMP for parallelism.

  • 9 programs (2 in C and 7 in Fortran) parallelized using OpenMP API
Goals of the suite:
  • Targeted to larger parallel systems
  • Run rules, tools and reporting similar to SPEC CPU2006
  • Programs representative of HPC and Scientific Applications

There are "base" variants of both the above metrics that require more conservative compilation, such as using the same flags for all benchmarks.

See Also

Disclosure Statement

SPEC, SPEComp reg tm of Standard Performance Evaluation Corporation. Results from www.spec.org as of 28 June 2010 and this report. Sun Fire X2270 M2 (2 chips, 12 cores, 24 OpenMP threads) 55,178 SPECompM2001.

Monday Jun 28, 2010

Sun Fire X4470 Sets World Records on SPEC OMP2001 Benchmarks

Oracle's Sun Fire X4470 server, with four Intel Xeon X7560 processors capable of running OpenMP applications with 64 compute threads, delivered outstanding performance on both the medium and large suites of the industry-standard SPEC OMP2001 benchmark.

  • The Sun Fire X4470 server running the Oracle Solaris 10 10/09 operating system with Oracle Solaris Studio 12 Update 1 compiler software, produced the top x86 result on SPECompM2001.

  • The Sun Fire X4470 server running the Oracle Solaris 10 10/09 operating system with Oracle Solaris Studio 12 Update 1 compiler software, produced the top x86 result on SPECompL2001.

  • The Sun Fire X4470 server beat the SPECompM2001 score of the IBM Power 750 Express (POWER7, 3.55 GHz) by 14%, while using half the number of OpenMP threads.
  • The Sun Fire X4470 server with four Intel Xeon 7560 processors, running 64 OpenMP threads, achieved SPEC OMP2001 results of 118,264 SPECompM2001 and 642,479 SPECompL2001.

  • The Sun Fire X4470 server produced better SPECompL2001 results than Cisco (UCS C460 M1) and Intel (QSSC-S4R) even though they all used the same number of Intel Xeon X7560 processors.

  • The Sun Fire X4470 server produced better SPECompM2001 results than Cisco (UCS C460 M1), SGI (Altix UV 10) and Intel (QSSC-S4R) even though they all used the same number of Intel Xeon X7560 processors.

Performance Landscape

SPEC OMP2001 Performance Charts - bigger is better, selected results, please see www.spec.org for complete results. All results as of 06/28/10.

In the tables below
"Base" = SPECompLbase2001 or SPECompMbase2001
"Peak" = SPECompLpeak2001 or SPECompMpeak2001

SPEC OMPL2001 results

System                   Cores/Chips  Type         GHz   Base Threads  Peak    Base
Sun Fire X4470           32/4         Xeon X7560   2.26  64            642479  615790
Cisco UCS C460 M1        32/4         Xeon X7560   2.26  64            628126  607818
Intel QSSC-S4R           32/4         Xeon X7560   2.26  64            610386  591375
Sun/Fujitsu SPARC M8000  64/16        SPARC64 VII  2.52  64            581807  532576

SPEC OMPM2001 results

System                   Cores/Chips  Type         GHz   Base Threads  Peak    Base
Sun Fire X4470           32/4         Xeon X7560   2.26  64            118264  95650
Cisco UCS C460 M1        32/4         Xeon X7560   2.26  64            109077  100258
SGI Altix UV 10          32/4         Xeon X7560   2.26  64            107248  96797
Intel QSSC-S4R           32/4         Xeon X7560   2.26  64            106369  98288
Sun/Fujitsu SPARC M8000  64/16        SPARC64 VII  2.52  64            104714  75418
IBM Power 750 Express    32/4         POWER7       3.55  128           104175  92957

Results and Configuration Summary

Hardware Configuration:

Sun Fire X4470
4 x 2.26 GHz Intel Xeon X7560
256 GB

Software Configuration:

Oracle Solaris 10 10/09
Oracle Solaris Studio 12 Update 1
SPEC OMP2001 suite v3.2

Benchmark Description

The SPEC OMPM2001 Benchmark Suite was released in June 2001 and tests HPC performance using OpenMP for parallelism.

  • 11 programs (3 in C and 8 in Fortran) parallelized using OpenMP API
Goals of the suite:
  • Targeted to mid-range (4-32 processor) parallel systems
  • Run rules, tools and reporting similar to SPEC CPU2006
  • Programs representative of HPC and Scientific Applications

The SPEC OMPL2001 Benchmark Suite was released in June 2001 and tests HPC performance using OpenMP for parallelism.

  • 9 programs (2 in C and 7 in Fortran) parallelized using OpenMP API
Goals of the suite:
  • Targeted to larger parallel systems
  • Run rules, tools and reporting similar to SPEC CPU2006
  • Programs representative of HPC and Scientific Applications

There are "base" variants of both the above metrics that require more conservative compilation, such as using the same flags for all benchmarks.

See Also

Disclosure Statement

SPEC, SPEComp reg tm of Standard Performance Evaluation Corporation. Results from www.spec.org as of 28 June 2010 and this report. Sun Fire X4470 (4 chips, 32 cores, 64 OpenMP threads) 642,479 SPECompL2001, 118,264 SPECompM2001.

Sun Fire X4470 2-Node Configuration Sets World Record for SAP SD-Parallel Benchmark

Using two of Oracle's Sun Fire X4470 servers to run the SAP Enhancement Package 4 for SAP ERP 6.0 (Unicode) Sales and Distribution – Parallel (SD-Parallel) standard application benchmark, Oracle delivered a world record result. This was run using Oracle Solaris 10 and Oracle 11g Real Application Clusters (RAC) software.

  • The Sun Fire X4470 servers result of 21,000 users delivered more than twice the performance of the IBM System x3850 X5 system result of 10,450 users.

  • The Sun Fire X4470 servers result of 21,000 users beat the HP ProLiant DL980 G7 system result of 18,180 users. Both solutions used 8 Intel Xeon X7560 processors.

  • The Sun Fire X4470 servers result of 21,000 users beat the Fujitsu PRIMEQUEST 1800E system result of 16,000 users. Both solutions used 8 Intel Xeon X7560 processors.

  • This result shows how a complete software and hardware solution from Oracle, using Oracle RAC and Oracle Solaris along with Oracle's Sun servers, can provide a superior-performing solution when compared to the competition.

Performance Landscape

SAP Enhancement Package 4 for SAP ERP 6.0 (Unicode) Sales and Distribution Benchmark, select results presented in decreasing performance order. Both Parallel and 2-Tier solution results are listed in the table.

System                                                     OS / Database                           Users   SAPS     Type      Date
Two Sun Fire X4470 (4x Xeon X7560 @2.26GHz, 256 GB)        Solaris 10, Oracle 11g RAC              21,000  115,300  Parallel  28-Jun-10
HP DL980 G7 (8x Xeon X7560 @2.26GHz, 512 GB)               Win Server 2008 R2 DE, SQL Server 2008  18,180  99,320   2-Tier    21-Jun-10
Fujitsu PRIMEQUEST 1800E (8x Xeon X7560 @2.26GHz, 512 GB)  Win Server 2008 R2 DE, SQL Server 2008  16,000  87,550   2-Tier    30-Mar-10
Four Sun Blade X6270 (2x Xeon X5570 @2.93GHz, 48 GB)       Solaris 10, Oracle 10g RAC              13,718  75,762   Parallel  12-Oct-09
IBM System x3850 X5 (4x Xeon X7560 @2.26GHz, 256 GB)       Win Server 2008 EE, DB2 9.7             10,450  57,120   2-Tier    30-Mar-10
HP DL580 G7 (4x Xeon X7560 @2.26GHz, 256 GB)               Win Server 2008 R2 DE, SQL Server 2008  10,445  57,020   2-Tier    21-Jun-10
Fujitsu PRIMERGY RX600 S5 (4x Xeon X7560 @2.26GHz, 512 GB) Win Server 2008 R2 DE, SQL Server 2008  9,560   52,300   2-Tier    06-May-10
Two Sun Blade X6270 (2x Xeon X5570 @2.93GHz, 48 GB)        Solaris 10, Oracle 10g RAC              7,220   39,420   Parallel  12-Oct-09
One Sun Blade X6270 (2x Xeon X5570 @2.93GHz, 48 GB)        Solaris 10, Oracle 10g RAC              3,800   20,750   Parallel  12-Oct-09
Sun Fire X4270 (2x Xeon X5570 @2.93GHz, 48 GB)             Solaris 10, Oracle 10g                  3,800   21,000   2-Tier    21-Aug-09

Complete benchmark results may be found at the SAP benchmark website http://www.sap.com/solutions/benchmark/sd.epx.

Results and Configuration Summary

Hardware Configuration:

2 x Sun Fire X4470 servers, each with
4 x Intel Xeon X7560 2.26 GHz (4 chips, 32 cores, 64 threads)
256 GB memory

Software Configuration:

Oracle 11g Real Application Clusters (RAC)
Oracle Solaris 10

Results Summary:

Number of SAP SD benchmark users:
21,000
Average dialog response time:
0.93 seconds
Throughput:

Dialog steps/hour:
6,918,000

SAPS:
115,300
SAP Certification:
2010029

Benchmark Description

The SAP Standard Application Sales and Distribution - Parallel (SD-Parallel) Benchmark is a two-tier ERP business test that is indicative of full business workloads of complete order processing and invoice processing, and demonstrates the ability to run both the application and database software on a single system. The SAP Standard Application SD Benchmark represents the critical tasks performed in real-world ERP business environments.

The SD-Parallel Benchmark consists of the same transactions and user interaction steps as the SD Benchmark. This means that the SD-Parallel Benchmark runs the same business processes as the SD Benchmark. The difference between the benchmarks is the technical data distribution.

An additional rule for parallel and distributed databases is that the benchmark users must be equally distributed across all database nodes for the benchmark clients used (round-robin method). Following this rule, all database nodes work on data of all clients, which avoids unrealistic configurations such as having only one client per database node.

The SAP Benchmark Council agreed to give the parallel benchmark a different name so that the difference can be easily recognized by any interested parties - customers, prospects, and analysts. The naming convention is SD-Parallel for Sales & Distribution - Parallel.

SAP is one of the premier world-wide ERP application providers, and maintains a suite of benchmark tests to demonstrate the performance of competitive systems on the various SAP products.

See Also

Disclosure Statement

SAP Enhancement Package 4 for SAP ERP 6.0 (Unicode) Sales and Distribution Benchmark, results as of 6/22/2010. For more details, see http://www.sap.com/benchmark. SD-Parallel, Two Sun Fire X4470 (each 4 processors, 32 cores, 64 threads) 21,000 SAP SD Users, Cert# 2010029. SD 2-Tier, HP ProLiant DL980 G7 (8 processors, 64 cores, 128 threads) 18,180 SAP SD Users, Cert# 2010028. SD 2-Tier, Fujitsu PRIMEQUEST 1800E (8 processors, 64 cores, 128 threads) 16,000 SAP SD Users, Cert# 2010010. SD-Parallel, Four Sun Blade X6270 (each 2 processors, 8 cores, 16 threads) 13,718 SAP SD Users, Cert# 2009041. SD 2-Tier, IBM System x3850 X5 (4 processors, 32 cores, 64 threads) 10,450 SAP SD Users, Cert# 2010012. SD 2-Tier, Fujitsu PRIMERGY RX600 S5 (4 processors, 32 cores, 64 threads) 9,560 SAP SD Users, Cert# 2010017. SD-Parallel, Two Sun Blade X6270 (each 2 processors, 8 cores, 16 threads) 7,220 SAP SD Users, Cert# 2009040. SD-Parallel, Sun Blade X6270 (2 processors, 8 cores, 16 threads) 3,800 SAP SD Users, Cert# 2009039. SD 2-Tier, Sun Fire X4270 (2 processors, 8 cores, 16 threads) 3,800 SAP SD Users, Cert# 2009033.

Tuesday Apr 06, 2010

WRF Benchmark: X6275 Beats Power6

Significance of Results

Oracle's Sun Blade X6275 cluster is 28% faster than the IBM POWER6 cluster on Weather Research and Forecasting (WRF) continental United States (CONUS) benchmark datasets. The Sun Blade X6275 cluster used a Quad Data Rate (QDR) InfiniBand interconnect along with Intel compilers and MPI.

  • On the 12 km CONUS data set, the Sun Blade X6275 cluster was 28% faster than the IBM POWER6 cluster at 512 cores.

  • The Sun Blade X6275 cluster with 768 cores (one full Sun Blade 6048 chassis) was 47% faster than 1024 cores of the IBM POWER6 cluster (multiple racks).

  • On the 2.5 km CONUS data set, the Sun Blade X6275 cluster was 21% faster than the IBM POWER6 cluster at 512 cores.

  • The Sun Blade X6275 cluster with 768 cores (one full Sun Blade 6048 chassis) outperformed the IBM POWER6 cluster with 1024 cores by 28% on the 2.5 km CONUS dataset.

Performance Landscape

The performance in GFLOPS is shown below on multiple datasets.

Weather Research and Forecasting
CONUS 12 KM Dataset
Performance in GFLOPS; bigger is better; "-" indicates no result reported

Cores   Sun     Intel     IBM     Cray    SGI     TACC    Blue
        X6275   Whitebox  POWER6  XT5             Ranger  Gene/P
8       17.5    19.8      17.0    -       10.2    -       -
16      38.7    37.5      33.6    21.4    20.1    10.8    -
32      71.6    73.3      66.5    40.4    39.8    21.2    5.9
64      132.5   131.4     117.2   75.2    77.0    37.8    -
128     235.8   232.8     209.1   137.4   114.0   74.5    20.4
192     323.6   -         -       -       -       -       -
256     405.2   415.1     363.1   243.2   197.9   121.0   37.4
384     556.6   -         -       -       -       -       -
512     691.9   696.7     542.2   392.2   375.2   193.9   65.6
768     912.0   -         -       -       -       -       -
1024    -       -         618.5   634.1   605.9   271.7   108.5
1700    -       -         -       840.1   -       -       -
2048    -       -         -       -       -       -       175.6

All cores were used on each node participating in a run.

Sun X6275 - 2.93 GHz X5570, InfiniBand
Intel Whitebox - 2.8 GHz X5560, InfiniBand
IBM POWER6 - IBM Power 575, 4.7 GHz POWER6, InfiniBand, 3 frames
Cray XT5 - 2.7 GHz AMD Opteron (Shanghai), Cray SeaStar 2.1
SGI - best of a variety of results
TACC Ranger - 2.3 GHz AMD Opteron (Barcelona), InfiniBand
Blue Gene/P - 850 MHz PowerPC 450, 3D-Torus (proprietary)

Weather Research and Forecasting
CONUS 2.5 KM Dataset
Performance in GFLOPS; bigger is better; "-" indicates no result reported

Cores   Sun      SGI     Blue    IBM     Cray    Intel     TACC
        X6275    8200EX  Gene/L  POWER6  XT5     Whitebox  Ranger
16      35.2     -       -       -       -       -         -
32      69.6     -       -       64.3    -       -         -
64      140.2    -       -       130.9   -       147.8     24.5
128     278.9    89.9    -       242.5   152.1   290.6     87.7
192     400.5    -       -       -       -       -         -
256     514.8    179.6   8.3     431.3   306.3   535.0     145.3
384     735.1    -       -       -       -       -         -
512     973.5    339.9   16.5    804.4   566.2   1019.9    311.0
768     1367.7   -       -       -       -       -         -
1024    -        721.5   124.8   1067.3  1075.9  1911.4    413.4
2048    -        1389.5  241.2   1849.7  3251.1  -         -
2600    -        -       -       -       4320.6  -         -
3072    -        1918.7  350.5   2651.3  -       -         -
4096    -        2543.5  453.2   3288.7  -       -         -
6144    -        3057.3  642.3   4280.1  -       -         -
8192    -        3569.7  820.4   5140.4  -       -         -
18432   -        -       1238.0  -       -       -         -



Sun X6275 - 2.93 GHz X5570, InfiniBand
SGI 8200EX - 3.0 GHz E5472, InfiniBand
Blue Gene/L - 700 MHz PowerPC 440, 3D-Torus (proprietary)
IBM POWER6 - IBM Power 575, 4.7 GHz POWER6, InfiniBand, 3 frames
Cray XT5 - 2.4 GHz AMD Opteron (Shanghai), Cray SeaStar 2.1
Intel Whitebox - 2.8 GHz X5560, InfiniBand
TACC Ranger - 2.3 GHz AMD Opteron (Barcelona), InfiniBand

Results and Configuration Summary

Hardware Configuration:

48 x Sun Blade X6275 server modules, 2 nodes per blade, each node with
2 x Intel Xeon X5570 2.93 GHz processors, Turbo Boost enabled, HyperThreading disabled
24 GB memory
QDR InfiniBand

Software Configuration:

SUSE Linux Enterprise Server 10 SP2
Intel Compilers 11.1.059
Intel MPI 3.2.2
WRF 3.0.1.1
WRF 3.1.1
netCDF 4.0.1

Benchmark Description

The Weather Research and Forecasting (WRF) Model is a next-generation mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs. WRF is designed to be a flexible, state-of-the-art atmospheric simulation system that is portable and efficient on available parallel computing platforms. It features multiple dynamical cores, a 3-dimensional variational (3DVAR) data assimilation system, and a software architecture allowing for computational parallelism and system extensibility.

There are two fixed-size benchmark cases.

Single domain, medium size 12KM Continental US (CONUS-12K)

  • 425x300x35 cell volume
  • 48hr, 12km resolution dataset from Oct 24, 2001
  • Benchmark is a 3hr simulation for hrs 25-27 starting from a provided restart file
  • Iterations output at every 72 sec of simulation time, with the computation cost of each time step ~30 GFLOP

Single domain, large size 2.5KM Continental US (CONUS-2.5K)

  • 1501x1201x35 cell volume
  • 6hr, 2.5km resolution dataset from June 4, 2005
  • Benchmark is the final 3hr simulation for hrs 3-6 starting from a provided restart file; the benchmark may also be performed (but seldom reported) for the full 6hrs starting from a cold start
  • Iterations output at every 15 sec of simulation time, with the computation cost of each time step ~412 GFLOP
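Given those fixed per-step costs, the sustained GFLOPS figures in the tables above follow from how many model time steps complete per second of wall-clock time. A small sketch (the example numbers are illustrative, not measured results):

```python
def wrf_sustained_gflops(sim_hours: float, step_seconds: float,
                         gflop_per_step: float, elapsed_seconds: float) -> float:
    """Sustained GFLOPS: (number of model time steps in the simulated
    window * cost per step) divided by wall-clock run time."""
    steps = sim_hours * 3600.0 / step_seconds
    return steps * gflop_per_step / elapsed_seconds

# CONUS-12K: a 3-hour simulation at a 72 s time step is 150 steps of
# ~30 GFLOP each; a run finishing in 10 wall-clock seconds would
# therefore sustain ~450 GFLOPS.
print(wrf_sustained_gflops(3, 72, 30, 10))  # 450.0
```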

See Also

Disclosure Statement

WRF, see http://www.mmm.ucar.edu/wrf/WG2/bench/, results as of 3/8/2010.

Monday Jan 25, 2010

Sun/Solaris Leadership in SAP SD Benchmarks and HP claims

Comments on significant SAP SD two-tier results and HP's misleading claims:

HP is making claims of "leadership" in the Two-Tier SAP SD benchmark by carefully fencing those claims by operating system (Linux and Windows) and conveniently omitting the actual leading results from Sun on Solaris.

HP's claims: ftp://ftp.hp.com/pub/c-products/servers/benchmarks/HP_ProLiant_785_585_385_SAP_perf_brief_121009.pdf

It is worthwhile to take a closer look at the results and the real leadership of Sun and Solaris in this SAP benchmark. All the SAP SD Two-Tier results discussed here can be seen at http://www.sap.com/solutions/benchmark/sd2tier.epx. All results here use the latest version, SAP enhancement package 4 for SAP ERP 6.0 (Unicode).

Here is a summary of the HP claims and the counterpoints showing Sun and Solaris real leadership in performance and scalability.

HP claims the #1 position for 8-processor, 4-processor and 2-processor servers as follows:

  • 8-processor: yes, but only on Windows (8,280 SAP SD Benchmark users) and on Linux (8,022 SAP SD Benchmark users) with the HP ProLiant DL785 G6 (8x 2.8GHz Opteron 8439 SE).
  • A formally correct statement; however, HP fails to mention Sun's actual #1 8-processor overall record result, by a wide margin, using Solaris: 10,000 SAP SD Benchmark users on a Sun Fire X4640 with eight 2.6GHz Opteron 8435 processors, leading by more than 20% at a lower clock speed!
  • 4-processor: yes, but only on Linux (4,370 SAP SD Benchmark users, using the HP ProLiant DL585 G6, 4x 2.8GHz AMD Opteron 8439 SE).
  • Again a formally correct statement; however, HP fails to mention that Sun holds the 4-processor overall record result using Solaris: 4,720 SAP SD Benchmark users, obtained on a Sun SPARC Enterprise T5440 with four 1.6GHz UltraSPARC T2 Plus processors.
  • 2-processor: similarly, HP claims the #1 and #2 rankings, but only on Linux (3,171 SAP SD Benchmark users, HP ProLiant DL380 G6, 2x 2.93GHz Xeon X5570; and 2,315 SAP SD Benchmark users, HP ProLiant DL385 G6, 2x 2.6GHz Opteron 2435).
  • Again, HP omits the fact that Sun holds the #1 2-processor overall record result on Solaris: 3,800 SAP SD Benchmark users, obtained on a Sun Fire X4270 with 2x 2.93GHz Xeon X5570. Sun leads those results by 20% and 64%, respectively!

The only conclusion is that Sun servers running Solaris have real leadership in the SAP SD Two-Tier Benchmark. This fact is confirmed not only for 2- to 8-processor servers but also at the high end, where the Sun M9000 with Solaris leads with the overall world record for this benchmark, showing real record performance and top vertical scalability.

More details on the World Record Sun M9000 SAP SD 2-Tier results at BestPerf blog: http://blogs.sun.com/BestPerf/entry/sun_m9000_fastest_sap_2

SAP Benchmark Disclosure statement

Two-tier SAP Sales and Distribution (SD) standard SAP enhancement package 4 for SAP ERP 6.0 (Unicode) application benchmarks as of Jan 22, 2010: 

Cert# Benchmark Server Users SAPS Procs Cores Thrds CPU CPU MHz Mem (MB) Operating System RDBMS Release
2009046 Sun SPARC M9000 32000 175600 64 256 512 SPARC64 VII 2880 1179648 Solaris 10 Oracle 10g
2009049 Sun Fire X4640 10000 55070 8 48 48 AMD Opt 8435 2600 262144 Solaris 10 Oracle 10g
2009052 HP ProL DL785 G6 8022 43800 8 48 48 AMD Opt 8439SE 2800 131072 SuSE Linux ES10 MaxDB 7.8
2009035 HP ProL DL785 G6 8280 45350 8 48 48 AMD Opt 8439SE 2800 131072 Windows 2008-EE SQL Server 2008
2009026 Sun SPARC T5440 4720 25830 4 32 256 UltraSPARC T2Plus 1600 262144 Solaris 10 Oracle 10g
2009025 HP ProL DL585 G6 4665 25530 4 24 24 AMD Opt 8439SE 2800 65536 Windows 2008-EE SQL Server 2008
2009051 HP ProL DL585 G6 4370 23850 4 24 24 AMD Opt 8439SE 2800 65536 SuSE Linux ES10 MaxDB 7.8
2009033 Sun Fire X4270 3800 21000 2 8 16 Intel Xeon X5570 2930 49152 Solaris 10 Oracle 10g
2009004 HP ProL DL380 G6 3300 18030 2 8 16 Intel Xeon X5570 2930 49152 Windows 2008-EE SQL Server 2008
2009006 HP ProL DL380 G6 3171 17380 2 8 16 Intel Xeon X5570 2930 49152 SuSE Linux ES10 MaxDB 7.8
2009050 HP ProL DL385 G6 2315 12650 2 12 12 AMD Opt 2435 2600 49152 SuSE Linux ES10 MaxDB 7.8

SAP, R/3, reg TM of SAP AG in Germany and other countries. More info www.sap.com/benchmark

Friday Nov 20, 2009

Sun Blade 6048 and Sun Blade X6275 NAMD Molecular Dynamics Benchmark beats IBM BlueGene/L

Significance of Results

A Sun Blade 6048 chassis with 48 Sun Blade X6275 server modules ran benchmarks using the NAMD molecular dynamics applications software. NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. NAMD is driven by major trends in computing and structural biology and received a 2002 Gordon Bell Award.

  • The cluster of 32 Sun Blade X6275 server modules was 9.2x faster than the 512 processor configuration of the IBM BlueGene/L.

  • The cluster of 48 Sun Blade X6275 server modules exhibited excellent scalability for NAMD molecular dynamics simulation, up to 37.8x speedup for 48 blades relative to 1 blade.

  • For the largest molecule considered, the cluster of 48 Sun Blade X6275 server modules achieved a throughput of 0.028 seconds per simulation step.

Molecular dynamics simulation is important to biological and materials science research. Molecular dynamics is used to determine the low-energy conformations, or shapes, of a molecule. These conformations are presumed to be the biologically active conformations.

Performance Landscape

The NAMD Performance Benchmarks web page plots the performance of NAMD when the ApoA1 benchmark is executed on a variety of clusters. The performance is expressed in terms of the time in seconds required to execute one step of the molecular dynamics simulation, multiplied by the number of "processors" on which NAMD executes in parallel. The following table compares the performance of the Sun Blade X6275 cluster to several of the clusters for which performance is reported on the web page. In this table, the performance is expressed in terms of the time in seconds required to execute one step of the molecular dynamics simulation. A smaller number implies better performance.

Cluster Name and Interconnect    Throughput (seconds per step)
                                 128 Cores    256 Cores    512 Cores
Sun Blade X6275 InfiniBand       0.014        0.0073       0.0048
Cambridge Xeon/3.0 InfiniPath    0.016        0.0088       0.0056
NCSA Xeon/2.33 InfiniBand        0.019        0.010        0.008
AMD Opteron/2.2 InfiniPath       0.025        0.015        0.008
IBM HPCx PWR4/1.7 Federation     0.039        0.021        0.013
SDSC IBM BlueGene/L MPI          0.108        0.061        0.044

The following tables report results for NAMD molecular dynamics using a cluster of Sun Blade X6275 server modules. The performance of the cluster is expressed in terms of the time in seconds that is required to execute one step of the molecular dynamics simulation. A smaller number implies better performance.

Blades  Cores   STMV molecule (1)         f1 ATPase molecule (2)    ApoA1 molecule (3)
                Thruput   spdup  effi'cy  Thruput   spdup  effi'cy  Thruput   spdup  effi'cy
                (secs/step)               (secs/step)               (secs/step)
48      768     0.0277    37.8   79%      0.0075    35.2   73%      0.0039    22.2   46%
36      576     0.0324    32.3   90%      0.0096    27.4   76%      0.0045    19.3   54%
32      512     0.0368    28.4   89%      0.0104    25.3   79%      0.0048    18.1   57%
24      384     0.0481    21.8   91%      0.0136    19.3   80%      0.0066    13.2   55%
16      256     0.0715    14.6   91%      0.0204    12.9   81%      0.0073    11.9   74%
12      192     0.0875    12.0   100%     0.0271    9.7    81%      0.0096    9.1    76%
8       128     0.1292    8.1    101%     0.0337    7.8    98%      0.0139    6.3    79%
4       64      0.2726    3.8    95%      0.0666    4.0    100%     0.0224    3.9    98%
1       16      1.0466    1.0    100%     0.2631    1.0    100%     0.0872    1.0    100%

spdup - speedup versus 1 blade result
effi'cy - speedup efficiency versus 1 blade result

(1) Satellite Tobacco Mosaic Virus (STMV) molecule, 1,066,628 atoms, 12 Angstrom cutoff, Langevin dynamics, 500 time steps
(2) f1 ATPase molecule, 327,506 atoms, 11 Angstrom cutoff, particle mesh Ewald dynamics, 500 time steps
(3) ApoA1 molecule, 92,224 atoms, 12 Angstrom cutoff, particle mesh Ewald dynamics, 500 time steps
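The speedup and efficiency columns above can be reproduced directly from the per-step times; a minimal sketch:

```python
def speedup(t_1blade, t_nblades):
    """Speedup relative to the 1-blade time (times in secs/step)."""
    return t_1blade / t_nblades

def efficiency(t_1blade, t_nblades, blades):
    """Speedup efficiency: fraction of ideal linear scaling achieved."""
    return speedup(t_1blade, t_nblades) / blades

# STMV on 48 blades: 1.0466 s/step on 1 blade vs 0.0277 s/step on 48 blades
print(round(speedup(1.0466, 0.0277), 1))            # ~37.8x
print(round(100 * efficiency(1.0466, 0.0277, 48)))  # ~79%
```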

Results and Configuration Summary

Hardware Configuration

    48 x Sun Blade X6275, each with
      2 x (2 x 2.93 GHz Intel QC Xeon X5570 (Nehalem) processors)
      2 x (24 GB memory)
      Hyper-Threading (HT) off, Turbo Mode on

Software Configuration

    SUSE Linux Enterprise Server 10 SP2 kernel version 2.6.16.60-0.31_lustre.1.8.0.1-smp
    OpenMPI 1.3.2
    gcc 4.1.2 (1/15/2007), gfortran 4.1.2 (1/15/2007)

Benchmark Description

Molecular dynamics simulation is widely used in biological and materials science research. NAMD is a public-domain molecular dynamics software application for which a variety of molecular input directories are available. Three of these directories define:
  • the Satellite Tobacco Mosaic Virus (STMV) that comprises 1,066,628 atoms
  • the f1 ATPase enzyme that comprises 327,506 atoms
  • the ApoA1 enzyme that comprises 92,224 atoms
Each input directory also specifies the type of molecular dynamics simulation to be performed, for example, Langevin dynamics with a 12 Angstrom cutoff for 500 time steps, or particle mesh Ewald dynamics with an 11 Angstrom cutoff for 500 time steps.

Key Points and Best Practices

Models with large numbers of atoms scale better than models with small numbers of atoms.

The Intel QC X5570 processors include a turbo boost feature coupled with a speed-step option in the CPU section of the Advanced BIOS settings. Under specific circumstances, this can provide CPU overclocking, increasing the processor frequency from 2.93GHz to 3.33GHz. This feature was enabled when generating the results reported here.

See Also

Disclosure Statement

NAMD, see http://www.ks.uiuc.edu/Research/namd/performance.html for more information, results as of 11/17/2009.

Monday Nov 02, 2009

Sun Ultra 27 Delivers Leading Single Frame Buffer SPECviewperf 10 Results

A Sun Ultra 27 workstation configured with an nVidia FX5800 graphics card delivered outstanding performance running the SPECviewperf® 10 benchmark.

  • When compared with other workstations running a single graphics card (i.e. not running two or more cards in SLI mode), the Sun Ultra 27 workstation places first in 6 of 8 subtests and second in the remaining two subtests.

  • The calculated geometric mean shows that the Sun Ultra 27 workstation is 11% faster than its competitors' workstations.

  • The optimum point for price/performance is the nVidia FX1800 graphics card.

Results have been published on the SPEC web site at http://www.spec.org/gwpg/gpc.data/vp10/summary.html.
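The geometric mean behind the 11% claim is straightforward to compute from the eight subtest scores; a sketch using the Sun Ultra 27 FX5800 scores from the first Performance Landscape table (competitor scores can be compared the same way):

```python
from math import prod

def geomean(scores):
    """Geometric mean of the eight SPECviewperf 10 subtest scores."""
    return prod(scores) ** (1.0 / len(scores))

# Sun Ultra 27 / FX5800 scores (frames/sec) across the eight viewsets
ultra27 = [59.34, 68.81, 58.07, 246.09, 68.96, 152.01, 42.02, 36.04]
print(round(geomean(ultra27), 1))
```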

Performance Landscape

Performance of the Sun Ultra 27 versus the competition. Bigger is better for each of the eight tests. The comparison is based upon the performance of the Sun Ultra 27 workstation. Performance is measured in frames per second.


                              3DSMAX        CATIA         ENSIGHT       MAYA
                              Perf     %    Perf     %    Perf     %    Perf      %
Sun Ultra 27 FX5800           59.34         68.81         58.07         246.09
HP xw4600 ATI FireGL V7700    49.71   19    48.05   43    57.11    2    268.62   -8
HP xw4600 FX4800              52.26   14    63.26   12    53.79    8    226.82    7
Fujitsu Celsius M470 FX3800   53.67   11    65.25    7    52.19   10    227.37    7

                              PROENGINEER   SOLIDWORKS    TEAMCENTER    UGS
                              Perf     %    Perf     %    Perf     %    Perf      %
Sun Ultra 27 FX5800           68.96         152.01        42.02         36.04
HP xw4600 ATI FireGL V7700    47.25   32    109.71   28   40.18    4    56.65   -57
HP xw4600 FX4800              61.15   11    131.31   14   28.42   32    33.43     7
Fujitsu Celsius M470 FX3800   64.39    7    139.2     8   29.02   31    33.27     8

Comparison of various frame buffers on the Sun Ultra 27 running SPECviewperf 10. Performance is reported for each test along with the difference in performance as compared to the FX5800 frame buffer. The runs in the table below were made with 3.2GHz W3570 processors.


3DSMAX CATIA ENSIGHT MAYA PROENGR SOLIDWRKS TEAMCNTR UGS
Perf % Perf % Perf % Perf % Perf % Perf % Perf % Perf %
FX5800 57.07
67.84
58.63
219.4
68.05
152.3
40.85
34.73
FX3800 57.17 0 66.57 2
54.91 7
206.4 6 66.48 2 146.3 4 38.48 6 33.12 5
FX1800 56.73 1
64.33 6
52.05 13 189.3 16 64.67 5 135.2 13 34.18 20
30.46 14
FX380 45.90 24 55.81 22 34.93 68 120.3 82 46.09 48 64.11 138 17.00 140 13.88 150

Results and Configuration Summary

Hardware Configuration:

    Sun Ultra 27 Workstation
    1 x 3.33 GHz Intel Xeon (tm) W3580
    2GB (1 x 2GB PC10600 1333MHz)
    1 x 500GB SATA
    nVidia Quadro FX380, FX1800, FX3800 & FX5800
    $7,529.00 (includes Microsoft Windows and monitor)

Software Configuration:

    OS: Microsoft Windows Vista Ultimate, 32-bit
    Benchmark: SPECviewperf 10

Benchmark Description

SPECviewperf measures 3D graphics rendering performance of systems running under OpenGL. SPECviewperf is a synthetic benchmark designed to be a predictor of application performance and a measure of graphics subsystem performance. It is a measure of graphics subsystem performance (primarily graphics bus, driver and graphics hardware) and its impact on the system without the full overhead of an application. SPECviewperf reports performance in frames per second.

Please go here for a more complete description of the tests.

Key Points and Best Practices

SPECviewperf measures the 3D rendering performance of systems running under OpenGL.

The SPECopcSM project group's SPECviewperf 10 is totally new performance evaluation software. In addition to features found in previous versions, it now provides the ability to compare performance of systems running in higher-quality graphics modes that use full-scene anti-aliasing, and measures how effectively graphics subsystems scale when running multithreaded graphics content. Since the SPECviewperf source and binaries have been upgraded to support changes, no comparisons should be made between past results and current results for viewsets running under SPECviewperf 10.

SPECviewperf 10 requires OpenGL 1.5 and a minimum of 1GB of system memory. It currently supports Windows 32/64.

See Also

Disclosure Statement

SPEC® and the benchmark name SPECviewperf® are registered trademarks of the Standard Performance Evaluation Corporation. Competitive benchmark results stated above reflect results published on www.spec.org as of Oct 18, 2009. For the latest SPECviewperf benchmark results, visit www.spec.org/gwpg.

Saturday Oct 24, 2009

Sun C48 & Lustre fast for Seismic Reverse Time Migration using Sun X6275

Significance of Results

A Sun Blade 6048 Modular System with 12 Sun Blade X6275 server modules was clustered together with QDR InfiniBand and used a Lustre file system, also over QDR InfiniBand, to show performance improvements over an NFS file system when reading in Velocity, Epsilon, and Delta slices and imaging 800 samples at various grid sizes with Reverse Time Migration.

  • The Initialization Time for populating the processing grids demonstrates significant advantages of Lustre over NFS:
    • 2486x1151x1231 : 20x improvement
    • 1243x1151x1231 : 20x improvement
    • 125x1151x1231 : 11x improvement
  • The Total Application Performance shows the Interconnect and I/O advantages of using QDR InfiniBand Lustre for the large grid sizes:
    • 2486x1151x1231 : 2x improvement - processed in less than 19 minutes
    • 1243x1151x1231 : 2x improvement - processed in less than 10 minutes

  • The Computational Kernel Scalability Efficiency for the 3 grid sizes:
    • 125x1151x1231 : 97% (1-8 nodes)
    • 1243x1151x1231 : 102% (8-24 nodes)
    • 2486x1151x1231 : 100% (12-24 nodes)

  • The Total Application Scalability Efficiency for the large grid sizes:
    • 1243x1151x1231 : 72% (8-24 nodes)
    • 2486x1151x1231 : 71% (12-24 nodes)

  • On the Intel Xeon X5570 processor, enabling HyperThreading and running 16 OpenMP threads per node gives approximately a 10% performance improvement over running 8 threads per node.

Performance Landscape

This first table presents the initialization time, comparing different numbers of processors and different problem sizes. The results are presented in seconds and show the advantage the Lustre file system running over QDR InfiniBand provides when compared to a simple NFS file system.


Initialization Time Performance Comparison
Reverse Time Migration - SMP Threads and MPI Mode

                125 x 1151 x 1231     1243 x 1151 x 1231    2486 x 1151 x 1231
                800 Samples           800 Samples           800 Samples
Nodes   Procs   Lustre      NFS       Lustre      NFS       Lustre      NFS
                (sec)       (sec)     (sec)       (sec)     (sec)       (sec)
24      48      1.59        18.90     8.90        181.78    15.63       362.48
20      40      1.60        18.90     8.93        181.49    16.91       358.81
16      32      1.58        18.59     8.97        181.58    17.39       353.72
12      24      1.54        18.61     9.35        182.31    22.50       364.25
8       16      1.40        18.60     10.02       183.79    -           -
4       8       1.57        18.80     -           -         -           -
2       4       2.54        19.31     -           -         -           -
1       2       4.54        20.34     -           -         -           -

This next table presents the total application run time, comparing different numbers of processors and different problem sizes. It shows that for larger problems, the Lustre file system running over QDR InfiniBand provided a significant performance advantage when compared to a simple NFS file system.


Total Application Performance Comparison
Reverse Time Migration - SMP Threads and MPI Mode

                125 x 1151 x 1231     1243 x 1151 x 1231    2486 x 1151 x 1231
                800 Samples           800 Samples           800 Samples
Nodes   Procs   Lustre      NFS       Lustre      NFS       Lustre      NFS
                (sec)       (sec)     (sec)       (sec)     (sec)       (sec)
24      48      251.48      273.79    553.75      1125.02   1107.66     2310.25
20      40      232.00      253.63    658.54      971.65    1143.47     2062.80
16      32      227.91      209.66    826.37      1003.81   1309.32     2348.60
12      24      217.77      234.61    884.27      1027.23   1579.95     3877.88
8       16      223.38      203.14    1200.71     1362.42   -           -
4       8       341.14      272.68    -           -         -           -
2       4       605.62      625.25    -           -         -           -
1       2       892.40      841.94    -           -         -           -

The following table presents the run time and speedup of just the computational kernel for different processor counts for the three different problem sizes considered. The scaling results are based upon the smallest number of nodes run and that number is used as the baseline reference point.


Computational Kernel Performance & Scalability
Reverse Time Migration - SMP Threads and MPI Mode

                125 x 1151 x 1231     1243 x 1151 x 1231    2486 x 1151 x 1231
                800 Samples           800 Samples           800 Samples
Nodes   Procs   X6275 Time  Speedup   X6275 Time  Speedup   X6275 Time  Speedup
                (sec)                 (sec)                 (sec)
24      48      35.38       13.7      210.82      24.5      427.40      24.0
20      40      35.02       13.8      255.27      20.2      517.03      19.8
16      32      41.76       11.6      317.96      16.2      646.22      15.8
12      24      49.53       9.8       422.17      12.2      853.37      12.0*
8       16      62.34       7.8       645.27      8.0*      -           -
4       8       124.66      3.9       -           -         -           -
2       4       238.80      2.0       -           -         -           -
1       2       484.89      1.0       -           -         -           -

* smallest node count run for that problem size; speedup is normalized to this baseline.

The last table presents the speedup of the total application for different processor counts for the three different problem sizes presented. The scaling results are based upon the smallest number of nodes run and that number is used as the baseline reference point.


Total Application Scalability Comparison
Reverse Time Migration - SMP Threads and MPI Mode

                125 x 1151 x 1231     1243 x 1151 x 1231    2486 x 1151 x 1231
                800 Samples           800 Samples           800 Samples
Nodes   Procs   Lustre Speedup        Lustre Speedup        Lustre Speedup
24      48      3.6                   17.3                  17.1
20      40      3.8                   14.6                  16.6
16      32      4.0                   11.6                  14.5
12      24      4.1                   10.9                  12.0*
8       16      4.0                   8.0*                  -
4       8       2.6                   -                     -
2       4       1.5                   -                     -
1       2       1.0                   -                     -

* smallest node count run for that problem size; speedup is normalized to this baseline.

Note: HyperThreading is enabled, running 16 threads per node.

Results and Configuration Summary

Hardware Configuration:
    Sun Blade 6048 Modular System with
      12 x Sun Blade x6275 Server Modules, each with
        4 x 2.93 GHz Intel Xeon QC X5570 processors
        12 x 4 GB memory at 1333 MHz
        2 x 24 GB Internal Flash
    QDR InfiniBand Lustre 1.8.0.1 File System
    GBit NFS file system

Software Configuration:

    OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
    MPI: Scali MPI Connect 5.6.6-59413
    Compiler: Sun Studio 12 C++, Fortran, OpenMP

Benchmark Description

The Reverse Time Migration (RTM) is currently the most popular seismic processing algorithm because of its ability to produce quality images of complex substructures. It can accurately image steep dips that can not be imaged correctly with traditional Kirchhoff 3D or frequency domain algorithms. The Wave Equation Migration (WEM) can image steep dips but does not produce the image quality that can be achieved by the RTM. However, the increased computational complexity of the RTM over the WEM introduces new areas for performance optimization. The current trend in seismic processing is to perform iterative migrations on wide azimuth marine data surveys using the Reverse Time Migration.

This Reverse Time Migration code reads in processing parameters that define the grid dimensions, number of threads, number of processors, imaging condition, and various other parameters. The master node calculates the memory requirements to determine if there is sufficient memory to process the migration "in-core". The domain decomposition across all the nodes is determined by dividing the first grid dimension by the number of nodes. Each node then reads in its section of the Velocity Slices, Delta Slices, and Epsilon Slices using MPI IO reads. Three source and receiver wavefield state vectors are created: previous, current, and next state. The processing steps through the input trace data, reading both the receiver and source data for each of the 800 time steps. It uses forward propagation for the source wavefield and backward propagation in time to cross-correlate the receiver wavefield. The computational kernel consists of a 13-point stencil that processes a subgrid within the memory of each node using OpenMP parallelism. Afterwards, conditioning and absorption are applied and boundary data is communicated to neighboring nodes as each time step is processed. The final image is written out using MPI IO.
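As a concrete illustration of such a kernel, here is a minimal NumPy sketch of one wavefield time step with a 13-point Laplacian stencil (center point plus two neighbors in each direction along the three axes). The fourth-order finite-difference coefficients and the periodic boundaries from np.roll are illustrative assumptions; the production code parallelizes over subgrids with OpenMP and applies absorbing boundary conditions instead.

```python
import numpy as np

# Standard 4th-order central-difference Laplacian coefficients
# (an assumption; the production code's exact coefficients are not given).
C0, C1, C2 = -15.0 / 2.0, 4.0 / 3.0, -1.0 / 12.0

def step_13pt(prev, curr, vel2_dt2):
    """Advance one wavefield time step: apply the 13-point Laplacian, then
    the leapfrog update next = 2*curr - prev + v^2 * dt^2 * Laplacian.
    np.roll gives toy periodic boundaries for brevity."""
    lap = C0 * curr
    for axis in range(3):
        lap = lap + C1 * (np.roll(curr, 1, axis) + np.roll(curr, -1, axis))
        lap = lap + C2 * (np.roll(curr, 2, axis) + np.roll(curr, -2, axis))
    return 2.0 * curr - prev + vel2_dt2 * lap
```

A constant field is a quick sanity check: its Laplacian is zero, so the update leaves it unchanged.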

Total memory requirements for each grid size:

    125x1151x1231: 7.5GB
    1243x1151x1231: 78GB
    2486x1151x1231: 156GB
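Dividing those totals by the cell counts implies roughly eleven single-precision values per grid cell (the six wavefield state vectors plus the three earth-model volumes and working buffers). A sketch of the estimate; floats_per_cell=11 is an inference from the stated totals, not a figure from the source:

```python
def rtm_memory_gb(nx, ny, nz, floats_per_cell=11):
    """Rough in-core memory estimate for an RTM grid, in decimal GB.
    floats_per_cell=11 is inferred from the stated totals (assumption)."""
    return nx * ny * nz * floats_per_cell * 4 / 1e9

# Close to the stated totals of 7.5, 78 and 156 GB:
for dims in [(125, 1151, 1231), (1243, 1151, 1231), (2486, 1151, 1231)]:
    print(dims, round(rtm_memory_gb(*dims), 1))
```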

For this phase of benchmarking, the focus was to optimize the data initialization. In the next phase of benchmarking, the trace data reading will be optimized so that each node reads in only its section of interest. In this benchmark the trace data reading skews the Total Application Performance as the number of nodes increases. This will be optimized in the next phase of benchmarking, as well as further node optimization with OpenMP. The I/O description for this benchmark phase for each grid size:

    125x1151x1231:
      Initialization MPI Read: 3 x 709MB = 2.1GB / number of nodes
      Trace Data Read per Node: 2 x 800 x 576KB = 920MB x number of nodes
      Final Output Image MPI Write: 709MB / number of nodes
    1243x1151x1231:
      Initialization MPI Read: 3 x 7.1GB = 21.3GB / number of nodes
      Trace Data Read per Node: 2 x 800 x 5.7MB = 9.2GB x number of nodes
      Final Output Image MPI Write: 7.1GB / number of nodes
    2486x1151x1231:
      Initialization MPI Read: 3 x 14.2GB = 42.6GB / number of nodes
      Trace Data Read per Node: 2 x 800 x 11.4MB = 18.4GB x number of nodes
      Final Output Image MPI Write: 14.2GB / number of nodes
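The aggregate figures above follow from the per-volume and per-time-step sizes; a sketch (decimal units, as in the text):

```python
def init_read(volume_gb):
    """Startup read in GB: the velocity, epsilon and delta volumes
    (MPI IO), split across the nodes."""
    return 3 * volume_gb

def trace_read_per_node(per_step_mb, steps=800):
    """Per-node trace input in GB: source + receiver data at each of
    the 800 time steps."""
    return 2 * steps * per_step_mb / 1000

# 1243 x 1151 x 1231 grid: one volume is 7.1 GB, one trace read 5.7 MB
print(round(init_read(7.1), 1))            # 21.3 GB total, divided by node count
print(round(trace_read_per_node(5.7), 2))  # 9.12 GB read by *each* node
```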

Key Points and Best Practices

  • Additional evaluations were performed to compare GBit NFS, InfiniBand NFS, and InfiniBand Lustre for the Reverse Time Migration initialization. InfiniBand NFS was 6x faster than GBit NFS, and InfiniBand Lustre was 3x faster than InfiniBand NFS using the same disk configurations. On 12 nodes for grid size 2486x1151x1231, the initialization time was 22.50 seconds for IB Lustre, 61.03 seconds for IB NFS, and 364.25 seconds for GBit NFS.
  • The Reverse Time Migration computational performance scales nicely as a function of the grid size being processed. This is consistent with the IBM published results for this application.
  • The Total Application performance results are not typically reported in benchmark studies for this application. The IBM report specifically states that the execution times do not include I/O times and non-recurring allocation or initialization delays. Examining the total application performance reveals that the workload is no longer dominated by the partial differential equation (PDE) solver, as IBM suggests, but is constrained by the I/O for grid initialization, reading in the traces, saving/restoring wave state data, and writing out the final image. Aggressive optimization of the PDE solver has little effect on the overall throughput of this application. It is clearly more important to optimize the I/O. The trend in seismic processing, as stated at the 2008 Society of Exploration Geophysicists (SEG) conference, is to run the reverse time migration iteratively on wide azimuth data. Thus, optimizing the I/O and application throughput is imperative to meet this trend. SSD and Flash technologies in conjunction with Sun's Lustre file system can reduce this I/O bottleneck and pave the path for the future in seismic processing.
  • Minimal tuning effort was applied to achieve the results presented. Sun's HPC software stack, which includes the Sun Studio compiler, was used to build the 70,000 lines of C++ and Fortran source into the application executable. The only compiler option used was "-fast". No assembly-level optimizations, like those performed by IBM to use SIMD (SSE) registers, were performed in this benchmark. Similarly, no explicit cache blocking, loop unrolling, or memory bandwidth optimizations were conducted. The idea was to demonstrate the performance that a customer can expect from their existing applications without extensive, platform-specific optimizations.

See Also

Disclosure Statement

Reverse Time Migration, Results as of 10/23/2009. For more info http://www.caam.rice.edu/tech_reports/2006/TR06-18.pdf

Sun F5100 and Seismic Reverse Time Migration with faster Optimal Checkpointing

A prominent seismic processing algorithm, Reverse Time Migration with Optimal Checkpointing, in SMP "THREADS" mode, was tested using a Sun Fire X4270 server configured with four high-performance 15K SAS hard disk drives (HDDs) and a Sun Storage F5100 Flash Array. This benchmark compares I/O devices for checkpointing wave state information while processing a production seismic migration.

  • Sun Storage F5100 Flash Array is 2.2x faster than high-performance 15K RPM disks.

  • Multithreading the checkpointing using the Sun Studio C++ Compiler OpenMP implementation gives a 12.8x performance improvement over the original single threaded version.

These results show that the new trend in seismic processing, running iterative Reverse Time Migrations with migration playback, is now a reality. This is made possible through the use of Sun FlashFire technology, which provides good checkpointing speeds without additional disk cache memory. The application can take advantage of all the memory within a node without reserving the checkpoint cache buffers that HDDs require for performance. Similarly, larger problem sizes can be solved without increasing the memory footprint of each computational node.

Performance Landscape


Reverse Time Migration Optimal Checkpointing - SMP Threads Mode
Grid Size - 800 x 1151 x 1231 with 800 Samples - 60GB of memory

            HDD                            F5100
Number      Put      Get      Total       Put      Get      Total      F5100
Checkpts    (sec)    (sec)    (sec)       (sec)    (sec)    (sec)      Speedup
80          660.8    25.8     686.6       277.4    40.2     317.6      2.2x
400         1615.6   382.3    1997.9      989.5    269.7    1259.2     1.6x


Reverse Time Migration Optimal Checkpointing - SMP Threads Mode
Grid Size - 125 x 1151 x 1231 with 800 Samples - 9GB of memory

            HDD                            F5100
Number      Put      Get      Total       Put      Get      Total      F5100
Checkpts    (sec)    (sec)    (sec)       (sec)    (sec)    (sec)      Speedup
80          10.2     0.2      10.4        8.0      0.2      8.2        1.3x
400         52.3     0.4      52.7        45.2     0.3      45.5       1.2x
800         102.6    0.7      103.3       91.8     0.6      92.4       1.1x


Reverse Time Migration Optimal Checkpointing
Single Thread vs Multithreaded I/O Performance
Grid Size - 125 x 1151 x 1231 with 800 Samples - 9GB of memory

Number      Single Thread F5100    Multithreaded F5100    Multithread
Checkpts    Total Time (secs)      Total Time (secs)      Speedup
80          105.3                  8.2                    12.8x
400         482.9                  45.5                   10.6x
800         963.5                  92.4                   10.4x

Note: Hyperthreading and Turbo Mode enabled while running 16 threads per node.

Results and Configuration Summary

Hardware Configuration:

    Sun Fire X4270 Server
      2 x 2.93 GHz Quad-core Intel Xeon X5570 processors
      72 GB memory
      4 x 73 GB 15K SAS drives
        File system striped (RAID 0) across the four 15K RPM high-performance SAS HDDs
      Sun Storage F5100 Flash Array with local/internal r/w buff 4096
        20 x 24 GB flash modules

Software Configuration:

    OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
    Compiler: Sun Studio 12 C++, Fortran, OpenMP

Benchmark Description

The Reverse Time Migration (RTM) is currently the most popular seismic processing algorithm because of its ability to produce quality images of complex substructures. It can accurately image steep dips that cannot be imaged correctly with traditional Kirchhoff 3D or frequency domain algorithms. The Wave Equation Migration (WEM) can image steep dips but does not produce the image quality that can be achieved by the RTM. However, the increased computational complexity of the RTM over the WEM introduces new areas for performance optimization. The current trend in seismic processing is to perform iterative migrations on wide azimuth marine data surveys using the Reverse Time Migration.

The Reverse Time Migration with Optimal Checkpointing was introduced so large migrations could be performed within the minimal memory configurations of x86 cluster nodes. The idea is to keep only three wave state vectors in memory for each of the source and receiver wavefields, instead of holding the entire wavefields in memory for the duration of processing. With the Sun Storage F5100 Flash Array, this can be done with little performance penalty on the full migration time. Another advantage of checkpointing is the ability to play back migrations and facilitate iterative migrations.

  • The stored snapshot data can be reprocessed with different filtering, image conditioning, or a variety of other parameters.
  • Fine grain snapshoting can help the processing of more complex subsurface data.
  • A Geoscientist can "playback" a migration from the saved snapshots to visually validate migration accuracy or pick areas of interest for additional processing.

The Reverse Time Migration with Optimal Checkpointing is an algorithm designed by Griewank (Griewank, 1992; Blanch et al., 1998; Griewank, 2000; Griewank and Walther, 2000; Akcelik et al., 2003).

  • The application takes snapshots of wavefield state data for some interval of the total number of samples.
  • This adjoint state method performs cross-correlation of the source and receiver wavefields at each level.
  • Forward recursion is used for the source wavefield and backward recursion for the receiver wavefield.
  • For relatively small seismic migrations, all of the forward processed state information can be saved and restored with minimal impact on the total processing time.
  • Effectively, the computational complexity increases while the memory requirements decrease by a logarithmic factor of the number of snapshots.
  • Griewank's algorithm helps define the most optimal tradeoff between computational performance and the number of memory buffers (memory requirements) to support this cross correlation.

For the purposes of this benchmark, this implementation of the Reverse Time Migration with Optimal Checkpointing does not fully implement the optimal memory buffer scheme proposed by Griewank. The intent is to compare various I/O alternatives for saving wave state data for each node in a compute cluster.

This benchmark measures the time to perform the wave state saves and restores while simultaneously processing the wave state data.

Key Points and Best Practices

  • Multithreading the checkpointing using Sun Studio OpenMP and running 16 I/O threads with hyperthreading enabled gives a performance advantage over single threaded I/O to the Sun Storage F5100 flash array. The Sun Storage F5100 flash array can process concurrent I/O requests from multiple threads very efficiently.
  • Allocating the majority of a node's available memory to the Reverse Time Migration algorithm and leaving little memory for I/O caching favors the Sun Storage F5100 flash array over direct attached high performance disk drives. This performance advantage decreases as the number of snapshots increases, because increasing the number of snapshots decreases the memory requirement of the application.

See Also

Disclosure Statement

Reverse Time Migration with Optimal Checkpointing, Results as of 10/23/2009. For more info http://www.caam.rice.edu/tech_reports/2006/TR06-18.pdf

Tuesday Oct 13, 2009

CP2K Life Sciences, Ab-initio Dynamics - Sun Blade 6048 Chassis with Sun Blade X6275 - Scalability and Throughput with Quad Data Rate InfiniBand

Significance of Results

Clusters of Sun Blade X6275 and X6270 server modules were used to run benchmarks using the CP2K ab-initio dynamics applications software.

  • For the X6270 cluster with Dual Data Rate (DDR) InfiniBand the rate of increase of scalability slows dramatically at 16 nodes, whereas for the X6275 cluster with QDR InfiniBand the scalability continues to 72 nodes.
  • For 64 nodes, the speed of the Sun Blade X6275 cluster with QDR InfiniBand was 2.7X that of a Sun Blade X6270 cluster with DDR InfiniBand.

Ab-initio dynamics simulation is important to materials science research.  Dynamics simulation is used to determine the trajectories of atoms or molecules over time.

Performance Landscape

The CP2K Bulk Water Benchmarks web page plots the performance of CP2K ab-initio dynamics benchmarks that have from 32 to 512 water molecules for a cluster that comprises two 2.66GHz Xeon E5430 quad core CPUs per node and that uses Dual Data Rate InfiniBand.

The following table reports the execution time for the 512 water molecule benchmark when executed on the Sun Blade X6275 cluster having Quad Data Rate InfiniBand and on the Sun Blade X6270 cluster having Dual Data Rate InfiniBand. Each node of either Sun Blade cluster comprises two 2.93GHz Intel Xeon X5570 quad core CPUs. In the following table, the performance is expressed in terms of the "wall clock" time in seconds required to execute ten steps of the ab-initio dynamics simulation for 512 water molecules. A smaller number implies better performance.

Number     X6275 QDR InfiniBand     X6270 DDR InfiniBand
of Nodes   (seconds for 10 steps)   (seconds for 10 steps)
  96                  -                   1184.36
  72                564.16                   -
  64                598.41                1591.35
  32                706.82                1436.49
  24                950.02                1752.20
  16               1227.73                2119.50
  12               1440.16                1739.26
   8               1876.95                2120.73
   4               3408.39                3705.44
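The headline ratios quoted above can be recomputed directly from this table with a few lines of Python:

```python
# Wall-clock times (seconds for 10 steps of the 512-water-molecule benchmark),
# taken from the table above
qdr = {4: 3408.39, 8: 1876.95, 16: 1227.73, 32: 706.82, 64: 598.41}
ddr = {4: 3705.44, 8: 2120.73, 16: 2119.50, 32: 1436.49, 64: 1591.35}

# QDR advantage at 64 nodes (the 2.7X figure cited above)
advantage_64 = ddr[64] / qdr[64]

# Speedup from 4 to 64 nodes within each cluster: QDR keeps scaling,
# DDR has largely flattened out
qdr_speedup = qdr[4] / qdr[64]
ddr_speedup = ddr[4] / ddr[64]
print(round(advantage_64, 1), round(qdr_speedup, 1), round(ddr_speedup, 1))
# → 2.7 5.7 2.3
```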

Results and Configuration Summary

Hardware Configuration:

    Sun Blade[tm] 6048 Modular System with 3 shelves, each shelf with
      12 x Sun Blade X6275, each blade with
        2 x (2 x 2.93 GHz Intel QC Xeon X5570 processors)
        2 x (24 GB memory)
        Hyper-Threading (HT) off, Turbo Mode on
    QDR InfiniBand
    96 x Sun Blade X6270, each blade with
      2 x 2.93 GHz Intel QC Xeon X5570 processors
      1 x (24 GB memory)
      Hyper-Threading (HT) off, Turbo Mode off
    DDR InfiniBand
Software Configuration:
    SUSE Linux Enterprise Server 10 SP2 kernel version 2.6.16.60-0.31_lustre.1.8.0.1-smp
    OpenMPI 1.3.2
    Sun Studio 12 f90 compiler, ScaLAPACK, BLACS and Performance Libraries
    FFTW (Fastest Fourier Transform in the West) 3.2.1

Benchmark Description

CP2K is a parallel ab-initio dynamics code that is designed to perform atomistic and molecular simulations of solid state, liquid, molecular and biological systems. It provides a general framework for different methods, such as density functional theory (DFT) using a mixed Gaussian and plane waves approach (GPW), and classical pair and many-body potentials.

Ab-initio dynamics simulation is widely used in materials science research. CP2K is a public-domain ab-initio dynamics software application.

Key Points and Best Practices

  • QDR InfiniBand scales better than DDR InfiniBand.
  • The Intel QC X5570 processors include a turbo boost feature coupled with a speed-step option in the CPU section of the Advanced BIOS settings. Under specific circumstances, this can overclock the CPU, increasing the processor frequency from 2.93GHz to 3.2GHz. This feature was enabled for the X6275 and disabled for the X6270 when generating the results reported here.

See Also

Disclosure Statement

CP2K, see http://cp2k.berlios.de/ for more information, results as of 10/13/2009.

SAP 2-tier SD-Parallel on Sun Blade X6270 1-node, 2-node and 4-node

Significance of Results

  • Four Sun Blade X6270 (2 processors, 8 cores, 16 threads), running SAP ERP application Release 6.0 Enhancement Pack 4 (Unicode) with Oracle Database on top of Solaris 10 OS delivered the highest eight-processor result on the two-tier SAP SD-Parallel Standard Application Benchmark, as of Oct 12th, 2009.

  • Four Sun Blade X6270 servers with Intel Xeon X5570 processors achieved a 1.9x performance improvement over two Sun Blade X6270 servers with the same processors.

  • Two Sun Blade X6270 (2 processors, 8 cores, 16 threads), running SAP ERP application Release 6.0 Enhancement Pack 4 (Unicode) with Oracle Database on top of Solaris 10 OS delivered the highest four-processor result on the two-tier SAP SD-Parallel Standard Application Benchmark, as of Oct 12th, 2009.

  • Two Sun Blade X6270 servers with Intel Xeon X5570 processors achieved a 1.9x performance improvement over a single 2-processor Sun Blade X6270 system.

  • A one-node Sun Blade X6270 server with Intel Xeon X5570 processors running Oracle RAC delivered the same result as a Sun Fire X4270 server with Intel Xeon X5570 processors running Oracle, showing no performance difference between Oracle 10g and Oracle 10g RAC.

  • This benchmark highlights the near-linear scaling of Oracle 10g Real Application Clusters running on Sun Microsystems hardware in an SAP environment.

  • In January 2009, a new version, the Two-tier SAP ERP 6.0 Enhancement Pack 4 (Unicode) Standard Sales and Distribution (SD) Benchmark, was released. This new release has higher CPU requirements and so yields 25-50% fewer users compared to the previous Two-tier SAP ERP 6.0 (non-Unicode) Standard Sales and Distribution (SD) Benchmark. 10-30% of this is due to the extra overhead of processing the larger character strings required by Unicode encoding. See this SAP Note for more details.

  • Unicode is a computing standard that allows for the representation and manipulation of text expressed in most of the world's writing systems. Before the Unicode requirement, this benchmark used ASCII characters meaning each was just 1 byte. The new version of the benchmark requires Unicode characters and the Application layer (where ~90% of the cycles in this benchmark are spent) uses a new encoding, UTF-16, which uses 2 bytes to encode most characters (including all ASCII characters) and 4 bytes for some others. This requires computers to do more computation and use more bandwidth and storage for most character strings. Refer to the above SAP Note for more details.
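The storage cost of the Unicode switch is easy to see in a couple of lines of Python: encoding the same ASCII text as UTF-16 doubles its size.

```python
# ASCII vs UTF-16 storage for the same text: UTF-16 uses 2 bytes for every
# ASCII character, doubling the size of typical Latin-script strings
s = "Sales and Distribution"
ascii_bytes = len(s.encode("ascii"))      # 1 byte per character
utf16_bytes = len(s.encode("utf-16-le"))  # 2 bytes per character here
print(ascii_bytes, utf16_bytes)           # → 22 44
```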

Performance Landscape

SAP SD-Parallel 2-Tier Performance Table (in decreasing performance order).

SAP ERP 6.0 Enhancement Pack 4 (Unicode) Results
(New version of the benchmark as of January 2009)

System                         OS                       Users    SAP ERP/ECC    SAPS     Date
                               Database                          Release
---------------------------------------------------------------------------------------------
Four Sun Blade X6270           Solaris 10               13,718   2009           75,762   12-Oct-09
2xIntel Xeon X5570 @2.93GHz    Oracle 10g Real                   6.0 EP4
48 GB                          Application Clusters              (Unicode)
Two Sun Blade X6270            Solaris 10                7,220   2009           39,420   12-Oct-09
2xIntel Xeon X5570 @2.93GHz    Oracle 10g Real                   6.0 EP4
48 GB                          Application Clusters              (Unicode)
One Sun Blade X6270            Solaris 10                3,800   2009           20,750   12-Oct-09
2xIntel Xeon X5570 @2.93GHz    Oracle 10g Real                   6.0 EP4
48 GB                          Application Clusters              (Unicode)
Sun Fire X4270                 Solaris 10                3,800   2009           21,000   21-Aug-09
2xIntel Xeon X5570 @2.93GHz    Oracle 10g                        6.0 EP4
48 GB                                                            (Unicode)

Complete benchmark results may be found at the SAP benchmark website http://www.sap.com/benchmark.
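The near-linear scaling claimed above can be verified from the table with a quick Python check of the user counts:

```python
# SAP SD-Parallel benchmark users by node count, from the table above
users = {1: 3800, 2: 7220, 4: 13718}

scale_1_to_2 = users[2] / users[1]   # the 1.9x one-node to two-node gain
scale_2_to_4 = users[4] / users[2]   # the 1.9x two-node to four-node gain
print(round(scale_1_to_2, 2), round(scale_2_to_4, 2))  # → 1.9 1.9
```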

Results and Configuration Summary

Four Sun Blade X6270 Servers, each with two Intel Xeon X5570 2.93 GHz (2 processors, 8 cores, 16 threads)

    Number of SAP SD benchmark users:
    13,718
    Average dialog response time:
    0.86 seconds
    Throughput:

    Dialog steps/hour:
    4,545,729

    SAPS:
    75,762
    SAP Certification:
    2009041

Two Sun Blade X6270 Servers, each with two Intel Xeon X5570 2.93 GHz (2 processors, 8 cores, 16 threads)

    Number of SAP SD benchmark users:
    7,220
    Average dialog response time:
    0.99 seconds
    Throughput:

    Dialog steps/hour:
    2,365,000

    SAPS:
    39,420
    SAP Certification:
    2009040

One Sun Blade X6270 Server, with two Intel Xeon X5570 2.93 GHz (2 processors, 8 cores, 16 threads)

    Number of SAP SD benchmark users:
    3,800
    Average dialog response time:
    0.99 seconds
    Throughput:

    Dialog steps/hour:
    1,245,000

    SAPS:
    20,750
    SAP Certification:
    2009039
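Note that SAPS and dialog-step throughput are tied by definition: 100 SAPS corresponds to 6,000 dialog steps per hour, so SAPS = dialog steps per hour ÷ 60. The figures above follow this relation to within the rounding of the published numbers, as a quick Python check shows:

```python
# (dialog steps/hour, SAPS) pairs from the three result summaries above
runs = {
    "four-node": (4_545_729, 75_762),
    "two-node": (2_365_000, 39_420),
    "one-node": (1_245_000, 20_750),
}

# 100 SAPS = 6,000 dialog steps/hour, i.e. SAPS = steps_per_hour / 60;
# the published figures match to within a few SAPS of rounding
for name, (steps_per_hour, saps) in runs.items():
    print(name, round(steps_per_hour / 60), saps)
```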

Software:

    Oracle 10g Real Application Clusters
    Solaris 10 OS

Benchmark Description

The SAP Standard Application Sales and Distribution - Parallel (SD-Parallel) Benchmark is a two-tier ERP business test that is indicative of full business workloads of complete order processing and invoice processing, and demonstrates the ability to run both the application and database software on a single system. The SAP Standard Application SD Benchmark represents the critical tasks performed in real-world ERP business environments.

SD Versus SD-Parallel

The SD-Parallel Benchmark consists of the same transactions and user interaction steps as the SD Benchmark. This means that the SD-Parallel Benchmark runs the same business processes as the SD Benchmark. The difference between the benchmarks is the technical data distribution.

An Additional Rule for Parallel and Distributed Databases

The additional rule is: equally distribute the benchmark users across all database nodes for the used benchmark clients (round-robin method). Following this rule, all database nodes work on data of all clients. This avoids unrealistic configurations such as having only one client per database node.

The SAP Benchmark Council agreed to give the parallel benchmark a different name so that the difference can be easily recognized by any interested parties - customers, prospects, and analysts. The naming convention is SD-Parallel for Sales & Distribution - Parallel.

SAP is one of the premier world-wide ERP application providers, and maintains a suite of benchmark tests to demonstrate the performance of competitive systems on the various SAP products.
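The round-robin rule can be sketched in a few lines of Python (an illustrative model, not benchmark tooling): users are dealt out across the database nodes in turn, so every node ends up serving a mix of clients.

```python
def distribute_users(num_users, num_nodes):
    # Deal benchmark users out to database nodes in turn (round-robin),
    # so no database node is tied to a single client's data
    nodes = [[] for _ in range(num_nodes)]
    for user in range(num_users):
        nodes[user % num_nodes].append(user)
    return nodes

nodes = distribute_users(12, 4)
print([len(n) for n in nodes])  # → [3, 3, 3, 3]
```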

Disclosure Statement

SAP SD benchmark based on SAP enhancement package 4 for SAP ERP 6.0 (Unicode) application benchmark as of Oct 12th, 2009: Four Sun Blade X6270 (each 2 processors, 8 cores, 16 threads) 13,718 SAP SD Users, 2x 2.93 GHz Intel Xeon x5570, each 48 GB memory, running two-tier SAP Sales and Distribution Parallel (SD-Parallel) standard SAP SD benchmark with Oracle 10g Real Application Clusters and Solaris 10, Cert# 2009041. Two Sun Blade X6270 (each 2 processors, 8 cores, 16 threads) 7,220 SAP SD Users, 2x 2.93 GHz Intel Xeon x5570, each 48 GB memory, running two-tier SAP Sales and Distribution Parallel (SD-Parallel) standard SAP SD benchmark with Oracle 10g Real Application Clusters and Solaris 10, Cert# 2009040. Sun Blade X6270 (2 processors, 8 cores, 16 threads) 3,800 SAP SD Users, 2x 2.93 GHz Intel Xeon x5570, 48 GB memory, running two-tier SAP Sales and Distribution Parallel (SD-Parallel) standard SAP SD benchmark with Oracle 10g Real Application Clusters and Solaris 10, Cert# 2009039. Sun Fire X4270 (2 processors, 8 cores, 16 threads) 3,800 SAP SD Users, 2x 2.93 GHz Intel Xeon x5570, 48 GB memory, running two-tier SAP Sales and Distribution (SD) standard SAP SD benchmark with Oracle 10g and Solaris 10, Cert# 2009033.

SAP, R/3, reg TM of SAP AG in Germany and other countries. More info www.sap.com/benchmark

Monday Oct 12, 2009

MCAE ABAQUS faster on Sun F5100 and Sun X4270 - Single Node World Record

The Sun Storage F5100 Flash Array can substantially improve performance over internal hard disk drives as shown by the I/O intensive ABAQUS MCAE application Standard benchmark tests on a Sun Fire X4270 server.

The I/O intensive ABAQUS "Standard" benchmark test cases were run on a single Sun Fire X4270 server. Data is presented for runs at both 8 and 16 thread counts.

The ABAQUS "Standard" module is an MCAE application based on the finite element analysis (FEA) method. This computer-based numerical method inherently involves a substantial I/O component. The purpose was to evaluate the performance of the Sun Storage F5100 Flash Array relative to high performance 15K RPM internal striped HDDs.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "S4b" test case by 14%.

  • The Sun Fire X4270 server coupled with a Sun Storage F5100 Flash Array established the world record performance on a single node for the four test cases S2a, S4b, S4d and S6.

Performance Landscape

ABAQUS "Standard" Benchmark Test S4B: Advantage of Sun Storage F5100

Results are total elapsed run times in seconds

Threads   4x15K RPM 72 GB SAS HDD   Sun F5100       Sun F5100
          striped HW RAID0          r/w buff 4096   Performance
                                    striped         Advantage
   8              1504                  1318            14%
  16              1811                  1649            10%
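The advantage percentages in the table follow directly from the elapsed times (Python):

```python
# Total elapsed run times in seconds, from the S4B table above
hdd = {8: 1504, 16: 1811}
f5100 = {8: 1318, 16: 1649}

# Percentage advantage of the flash array over the striped 15K RPM HDDs
for threads in (8, 16):
    adv = (hdd[threads] - f5100[threads]) / f5100[threads] * 100
    print(threads, round(adv))  # → 8 14, then 16 10
```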

ABAQUS Standard Server Benchmark Subset: Single Node Record Performance

Results are total elapsed run times in seconds

Platform Cores S2a S4b S4d S6
X4270 w/F5100 8 302 1192 779 1237
HP BL460c G6 8 324 1309 843 1322
X4270 w/F5100 4 552 1970 1181 1706
HP BL460c G6 4 561 2062 1234 1812

Results and Configuration Summary

Hardware Configuration:
    Sun Fire X4270
      2 x 2.93 GHz QC Intel Xeon X5570 processors
      Hyperthreading enabled
      24 GB memory
      4 x 72 GB 15K RPM striped (HW RAID0) SAS disks
    Sun Storage F5100 Flash Array
      20 x 24 GB flash modules
      Intel controller

Software Configuration:

    O/S: 64-bit SUSE Linux Enterprise Server 10 SP 2
    Application: ABAQUS V6.9-1 Standard Module
    Benchmark: ABAQUS Standard Benchmark Test Suite

Benchmark Description

Abaqus/Standard Benchmark Problems

These problems provide an estimate of the performance that can be expected when running Abaqus/Standard or similar commercially available MCAE (FEA) codes like ANSYS and MSC/Nastran on different computers. The jobs are representative of those typically analyzed by Abaqus/Standard and other MCAE applications. These analyses include linear statics, nonlinear statics, and natural frequency extraction.

Please go here for a more complete description of the tests.

Key Points and Best Practices

  • The memory requirements for the test cases in the ABAQUS Standard benchmark test suite are substantial, with some of the test cases requiring slightly over 20 GB of memory. There are two memory limits: a minimum, below which time-consuming out-of-core "memory" (disk) is used, and a maximum memory limit that minimizes I/O operations. These memory limits are reported in the ABAQUS output and can be established before a full execution by a preliminary diagnostic-mode run.
  • Based on the maximum physical memory on a platform the user can stipulate the maximum portion of this memory that can be allocated to the ABAQUS job. This is done in the "abaqus_v6.env" file that either resides in the subdirectory from where the job was launched or in the abaqus "site" subdirectory under the home installation directory.
  • Sometimes when running multiple cores on a single node, it is preferable from a performance standpoint to run in "smp" shared memory mode. This is specified using the "THREADS" option on the "mpi_mode" line in the abaqus_v6.env file, as opposed to the "MPI" option on this line. The test case considered here illustrates this point.
  • The test cases for the ABAQUS standard module all have a substantial I/O component, where 15% to 25% of the total run times are associated with I/O activity (primarily scratch files). Performance will be enhanced by using the fastest available drives and striping together more than one of them, or by using a high performance disk storage system with high performance interconnects. On Linux operating systems, excess memory can be used to cache and accelerate I/O.
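As a concrete illustration of the last two points, an abaqus_v6.env fragment might look like the sketch below. Only the mpi_mode/THREADS setting is named in the text above; the memory parameter name is an assumption that varies by ABAQUS release, so take the names and values from your installation's documentation and a diagnostic-run output.

```python
# abaqus_v6.env -- illustrative sketch only; the file is read as a Python
# fragment at job launch, from the job directory or the abaqus "site" directory
mpi_mode = THREADS          # shared-memory mode for multi-core single-node runs
standard_memory = "20gb"    # assumed parameter name: upper memory limit that
                            # minimizes I/O (value from a diagnostic run)
```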

See Also

Disclosure Statement

The following are trademarks or registered trademarks of Abaqus, Inc. or its subsidiaries in the United States and/or other countries: Abaqus, Abaqus/Standard, Abaqus/Explicit. All information on the ABAQUS website is Copyrighted 2004-2009 by Dassault Systemes. Results from http://www.simulia.com/support/v69/v69_performance.php as of October 12, 2009.

About

BestPerf is the source of Oracle performance expertise. In this blog, Oracle's Strategic Applications Engineering group explores Oracle's performance results and shares best practices learned from working on Enterprise-wide Applications.
