Wednesday Dec 08, 2010

Sun Blade X6275 M2 Delivers Best Fluent (MCAE Application) Performance on Tested Configurations

This Manufacturing Engineering benchmark highlights the performance advantage the Sun Blade X6275 M2 server module offers over IBM, Cray, and SGI solutions as shown by the ANSYS FLUENT fluid dynamics application.

A cluster of eight of Oracle's Sun Blade X6275 M2 server modules delivered outstanding performance running the FLUENT 12 benchmark test suite.

  • The Sun Blade X6275 M2 server module cluster delivered the best results in all 36 of the test configurations run, outperforming the best posted results by as much as 42%.
  • The Sun Blade X6275 M2 server module demonstrated up to 76% performance improvement over the previous generation Sun Blade X6275 server module.

Performance Landscape

In the following tables, results are "Ratings" (bigger is better).
Rating = No. of sequential runs of test case possible in 1 day: 86,400/(Total Elapsed Run Time in Seconds)
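
As a minimal sketch of the rating formula above (the function names are illustrative, not part of the FLUENT benchmark kit), the conversion between elapsed time and rating can be written as:

```python
SECONDS_PER_DAY = 86_400  # numerator used in the rating definition above


def fluent_rating(elapsed_seconds: float) -> float:
    """Number of sequential runs of a test case that fit in one day."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return SECONDS_PER_DAY / elapsed_seconds


def elapsed_from_rating(rating: float) -> float:
    """Invert the rating to recover the elapsed run time in seconds."""
    if rating <= 0:
        raise ValueError("rating must be positive")
    return SECONDS_PER_DAY / rating


# A rating of 9340.5 corresponds to a run of roughly 9.25 seconds.
print(round(elapsed_from_rating(9340.5), 2))
```

Because the rating is inversely proportional to elapsed time, a ratio of two ratings equals the inverse ratio of the run times.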

The following table compares results on the basis of core count, irrespective of processor generation. This means that in some cases, i.e., for the 32-core and 64-core configurations, systems with the Intel Xeon X5670 six-core processors did not utilize quite all of the cores available for the specified processor count.


FLUENT 12 Benchmark Test Suite

Competitive Comparisons

Benchmark Test Case Ratings
System Processors Cores eddy_417k turbo_500k aircraft_2m sedan_4m truck_14m truck_poly_14m

Sun Blade X6275 M2 16 96 9340.5 39272.7 8307.7 8533.3 903.8 786.9
Best Posted 24 96

7562.4
797.0 712.9
Best Posted 16 96 7337.6 33553.4 6533.1 5989.6 739.1 683.5

Sun Blade X6275 M2 11 64 6306.6 27212.6 5592.2 5158.2 568.8 518.9
Best Posted 16 64 5556.3 26381.7 5494.4 4902.1 566.6 518.6

Sun Blade X6275 M2 8 48 4620.3 19093.9 4080.3 3251.2 376.0 359.4
Best Posted 8 48 4494.1 18989.0 3990.8 3185.3 372.7 354.5

Sun Blade X6275 M2 6 32 4061.1 15091.7 3275.8 3013.1 299.5 267.8
Best Posted 8 32 3404.9 14832.6 3211.9 2630.1 286.7 266.7

Sun Blade X6275 M2 4 24 2751.6 10441.1 2161.4 1907.3 188.2 182.5
Best Posted 6 24 1458.2 9626.7 1820.9 1747.2 185.1 180.8
Best Posted 4 24 2565.7 10164.7 2109.9 1608.2 187.1 180.8

Sun Blade X6275 M2 2 12 1429.9 5358.1 1097.5 813.2 95.9 95.9
Best Posted 2 12 1338.0 5308.8 1073.3 808.6 92.9 94.4



The following table compares results on the basis of processor count showing inter-generational processor performance improvement.


FLUENT 12 Benchmark Test Suite

Intergenerational Comparisons

Benchmark Test Case Ratings
System Processors Cores eddy_417k turbo_500k aircraft_2m sedan_4m truck_14m truck_poly_14m

Sun Blade X6275 M2 16 96 9340.5 39272.7 8307.7 8533.3 903.8 786.9
Sun Blade X6275 16 64 5308.8 26790.7 5574.2 5074.9 547.2 525.2
X6275 M2 : X6275 16 1.76 1.47 1.49 1.68 1.65 1.50

Sun Blade X6275 M2 8 48 4620.3 19093.9 4080.3 3251.2 376.0 359.4
Sun Blade X6275 8 32 3066.5 13768.9 3066.5 2602.4 289.0 270.3
X6275 M2 : X6275 8 1.51 1.39 1.33 1.25 1.30 1.33

Sun Blade X6275 M2 4 24 2751.6 10441.1 2161.4 1907.3 188.2 182.5
Sun Blade X6275 4 16 1714.3 7545.9 1519.1 1345.8 144.4 141.8
X6275 M2 : X6275 4 1.61 1.38 1.42 1.42 1.30 1.29

Sun Blade X6275 M2 2 12 1429.9 5358.1 1097.5 813.2 95.9 95.9
Sun Blade X6275 2 8 931.8 4061.1 827.2 681.5 73.0 73.8
X6275 M2 : X6275 2 1.53 1.32 1.33 1.19 1.31 1.30
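
The "X6275 M2 : X6275" rows are per-test-case ratios of the two ratings at the same processor count. A short sketch that recomputes them for the 16-processor and 2-processor rows (the dictionary layout and names are illustrative only):

```python
# Ratings copied from the intergenerational table, in the order
# eddy, turbo, aircraft, sedan, truck, truck_poly.
RATINGS = {
    16: {
        "X6275 M2": [9340.5, 39272.7, 8307.7, 8533.3, 903.8, 786.9],
        "X6275": [5308.8, 26790.7, 5574.2, 5074.9, 547.2, 525.2],
    },
    2: {
        "X6275 M2": [1429.9, 5358.1, 1097.5, 813.2, 95.9, 95.9],
        "X6275": [931.8, 4061.1, 827.2, 681.5, 73.0, 73.8],
    },
}


def generation_ratios(new, old):
    """Ratio of new-generation to old-generation ratings, to two decimals."""
    return [round(n / o, 2) for n, o in zip(new, old)]


for processors, row in RATINGS.items():
    print(processors, generation_ratios(row["X6275 M2"], row["X6275"]))
```

The largest ratio, 1.76 for the eddy case at 16 processors, is the source of the "up to 76%" improvement quoted in the summary.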

Configuration Summary

Hardware Configuration:

8 x Sun Blade X6275 M2 server modules, each with
4 Intel Xeon X5670 2.93 GHz processors, turbo enabled
96 GB memory 1333 MHz
2 x 24 GB SATA-based Sun Flash Modules
2 x QDR InfiniBand Host Channel Adapter
Sun Datacenter InfiniBand Switch IB-36

Software Configuration:

Oracle Enterprise Linux 5.5
ANSYS FLUENT V12.1.2
ANSYS FLUENT Benchmark Test Suite

Benchmark Description

The following description is from the ANSYS FLUENT website:

The FLUENT benchmarks suite comprises of a set of test cases covering a large range of mesh sizes, physical models and solvers representing typical industry usage. The cases range in size from a few 100 thousand cells to more than 100 million cells. Both the segregated and coupled implicit solvers are included, as well as hexahedral, mixed and polyhedral cell cases. This broad coverage is expected to demonstrate the breadth of FLUENT performance on a variety of hardware platforms and test cases.

The performance of a CFD code will depend on several factors, including size and topology of the mesh, physical models, numerics and parallelization, compilers and optimization, in addition to performance characteristics of the hardware where the simulation is performed. The principal objective of this benchmark suite is to provide comprehensive and fair comparative information of the performance of FLUENT on available hardware platforms.

About the ANSYS FLUENT 12 Benchmark Test Suite

    CFD models tend to be very large where grid refinement is required to accurately capture conditions in the boundary layer region adjacent to the body over which flow is occurring. Fine grids are also required to determine accurate turbulence conditions. As such, these models can run for many hours or even days, even when using a large number of processors.

Key Points and Best Practices

  • ANSYS FLUENT has not yet been certified by the vendor on Oracle Enterprise Linux (OEL). However, the ANSYS FLUENT benchmark tests have been run successfully on Oracle hardware running OEL as is (i.e. with NO changes or modifications).
  • The performance improvement of the Sun Blade X6275 M2 server module over the previous generation Sun Blade X6275 server module was due to two main factors: the increased core count per processor (6 vs. 4), and the more optimal, iterative dataset partitioning scheme used for the Sun Blade X6275 M2 server module.

Disclosure Statement

All information on the FLUENT website (http://www.fluent.com) is Copyrighted 1995-2010 by ANSYS Inc. Results as of December 06, 2010.

Thursday Dec 02, 2010

World Record TPC-C Result on Oracle's SPARC Supercluster with T3-4 Servers

Oracle demonstrated the world's fastest database performance: 27 of Oracle's SPARC T3-4 servers, 138 Sun Storage F5100 Flash Array storage systems, and Oracle Database 11g Release 2 Enterprise Edition with Real Application Clusters (RAC) and Partitioning delivered a world-record TPC-C benchmark result.

  • The SPARC T3-4 server cluster delivered a world record TPC-C benchmark result of 30,249,688 tpmC at $1.01/tpmC (USD) using Oracle Database 11g Release 2 on a configuration available 6/1/2011.

  • The SPARC T3-4 server cluster is 2.9x faster than the performance of the IBM Power 780 (POWER7 3.86 GHz) cluster with IBM DB2 9.7 database and has 27% better price/performance on the TPC-C benchmark. Almost identical price discount levels were applied by Oracle and IBM.

  • The Oracle solution has three times better performance than the IBM configuration and only used twice the power during the run of the TPC-C benchmark.  (Based upon IBM's own claims of energy usage from their August 17, 2010 press release.)

  • The Oracle solution delivered 2.9x the performance in only 71% of the space compared to the IBM TPC-C benchmark result.

  • The SPARC T3-4 server with Sun Storage F5100 Flash Array storage solution demonstrates 3.2x faster response time than IBM Power 780 (POWER7 3.86 GHz) result on the TPC-C benchmark.

  • Oracle used a single-image database, whereas IBM used 96 separate database partitions on their 3-node cluster. It is interesting to note that IBM used 32 database images on each server instead of running each server as a simple SMP.

  • IBM did not use DB2 Enterprise Database, but instead IBM used "DB2 InfoSphere Warehouse 9.7" which is a data warehouse and data management product and not their flagship OLTP product.

  • The multi-node SPARC T3-4 server cluster is 7.4x faster than the HP Superdome (1.6 GHz Itanium2) solution and has 66% better price/performance on the TPC-C benchmark.

  • The Oracle solution utilized Oracle's Sun FlashFire technology to deliver this result. The Sun Storage F5100 Flash Array storage system was used for database storage.

  • Oracle Database 11g Enterprise Edition Release 2 with Real Application Clusters and Partitioning scales and effectively uses all of the nodes in this configuration to produce the world record TPC-C benchmark performance.

  • This result showed Oracle's integrated hardware and software stacks provide industry leading performance.

Performance Landscape

TPC-C results (sorted by tpmC, bigger is better)

System tpmC Price/tpmC Avail Database Cluster Racks
27 x SPARC T3-4 30,249,688 1.01 USD 6/1/2011 Oracle 11g RAC Y 15
3 x IBM Power 780 10,366,254 1.38 USD 10/13/10 DB2 9.7 Y 10
HP Integrity Superdome 4,092,799 2.93 USD 08/06/07 Oracle 10g R2 N 46

Avail - Availability date
Racks - Clients, servers, storage, infrastructure
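
The throughput and price/performance comparisons quoted in the summary can be rechecked from this table. The sketch below assumes that "better price/performance" means the relative reduction in $/tpmC; the source does not state the formula, but this definition reproduces the quoted 27% and 66% figures. All names are illustrative.

```python
# (tpmC, USD per tpmC) taken from the TPC-C results table above.
RESULTS = {
    "27 x SPARC T3-4": (30_249_688, 1.01),
    "3 x IBM Power 780": (10_366_254, 1.38),
    "HP Integrity Superdome": (4_092_799, 2.93),
}


def throughput_ratio(system: str, baseline: str) -> float:
    """How many times more tpmC `system` delivers than `baseline`."""
    return RESULTS[system][0] / RESULTS[baseline][0]


def price_perf_advantage(system: str, baseline: str) -> float:
    """Relative reduction in $/tpmC versus `baseline` (assumed definition)."""
    ours, theirs = RESULTS[system][1], RESULTS[baseline][1]
    return (theirs - ours) / theirs


for rival in ("3 x IBM Power 780", "HP Integrity Superdome"):
    speed = throughput_ratio("27 x SPARC T3-4", rival)
    saving = price_perf_advantage("27 x SPARC T3-4", rival)
    print(f"{rival}: {speed:.1f}x faster, {saving:.0%} better price/performance")
```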

Oracle and IBM TPC-C Response times

System                                 tpmC          New Order 90th%        New Order Average
                                                     Response Time (sec)    Response Time (sec)
27 x SPARC T3-4                        30,249,688    0.750                  0.352
3 x IBM Power 780                      10,366,254    2.1                    1.137
Response Time Ratio - Oracle Better    2.9x          2.8x                   3.2x

Oracle uses Average New Order Response time for comparison between Oracle and IBM.

Graphs of Oracle's and IBM's response times for New-Order can be found in the full disclosure reports on the TPC-C Official Result Page of TPC's website.

Configuration Summary and Results

Hardware Configuration:

15 racks used to hold

Servers
27 x SPARC T3-4 servers, each with
4 x 1.65 GHz SPARC T3 processors
512 GB memory
3 x 300 GB 10K RPM 2.5" SAS disks

Data Storage
69 x Sun Fire X4270 M2 servers configured as COMSTAR heads, each with
1 x 2.93 GHz Intel Xeon X5670 processor
8 GB memory
9 x 2 TB 7.2K RPM 3.5" SAS disks
2 x Sun Storage F5100 Flash Array storage (1.92 TB each)
1 x Brocade DCX switch

Redo Storage
28 x Sun Fire X4270 M2 servers configured as COMSTAR heads, each with
1 x 2.93 GHz Intel Xeon X5670 processor
8 GB memory
11 x 2 TB 7.2K RPM 3.5" SAS disks
2 x Brocade 5300 switches

Clients
81 x Sun Fire X4170 M2 servers, each with
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory
2 x 146 GB 10K RPM 2.5" SAS disks

Software Configuration:

Oracle Solaris 10 9/10 (for SPARC T3-4 and Sun Fire X4170 M2)
Oracle Solaris 11 Express (COMSTAR for Sun Fire X4270 M2)
Oracle Database 11g Release 2 Enterprise Edition with Real Application Clusters and Partitioning
Oracle iPlanet Web Server 7.0 U5
Tuxedo CFS-R Tier 1

Results:

System 27 x SPARC T3-4
tpmC 30,249,688
Price/tpmC 1.01 USD
Avail 6/1/2011
Database Oracle Database 11g RAC
Cluster yes
Racks 15
New Order Ave Response 0.352 seconds

Benchmark Description

TPC-C is an OLTP system benchmark. It simulates a complete environment where a population of terminal operators executes transactions against a database. The benchmark is centered around the principal activities (transactions) of an order-entry environment. These transactions include entering and delivering orders, recording payments, checking the status of orders, and monitoring the level of stock at the warehouses.

Key Points and Best Practices

  • Oracle Database 11g Release 2 Enterprise Edition with Real Application Clusters and Partitioning scales easily to this high level of performance.

  • Sun Storage F5100 Flash Array storage provides high performance, very low latency, and very high storage density.

  • COMSTAR (Common Multiprotocol SCSI Target), new in Oracle Solaris 11 Express, is the software framework that enables a Solaris host to serve as a SCSI Target platform. COMSTAR uses a modular approach to break the huge task of handling all the different pieces in a SCSI target subsystem into independent functional modules, which are glued together by the SCSI Target Mode Framework (STMF). The modules implementing functionality at the SCSI level (disk, tape, medium changer, etc.) are not required to know about the underlying transport, and the modules implementing the transport protocol (FC, iSCSI, etc.) are not aware of the SCSI-level functionality of the packets they are transporting. The framework hides the details of allocation, providing execution context and cleanup of SCSI commands and associated resources, and simplifies the task of writing the SCSI or transport modules.

  • Oracle iPlanet Web Server 7.0 U5 is used in the user tier of the benchmark, with each web server instance supporting more than a quarter-million users while satisfying the stringent response time requirement of the TPC-C benchmark.

Disclosure Statement

TPC Benchmark C, tpmC, and TPC-C are trademarks of the Transaction Processing Performance Council (TPC). 27-node SPARC T3-4 Cluster (4 x 1.65 GHz SPARC T3 processors) with Oracle Database 11g Release 2 Enterprise Edition with Real Application Clusters and Partitioning, 30,249,688 tpmC, $1.01/tpmC, Available 6/1/2011. IBM Power 780 Cluster (3 nodes using 3.86 GHz POWER7 processors) with IBM DB2 InfoSphere Warehouse Ent. Base Ed. 9.7, 10,366,254 tpmC, $1.38 USD/tpmC, available 10/13/2010. HP Integrity Superdome(1.6GHz Itanium2, 64 processors, 128 cores, 256 threads) with Oracle 10g Enterprise Edition, 4,092,799 tpmC, $2.93/tpmC, available 8/06/07. Energy claims based upon IBM calculations and internal measurements. Source: http://www.tpc.org/tpcc, results as of 11/22/2010

World Record Performance on PeopleSoft Enterprise Financials Benchmark run on Sun SPARC Enterprise M4000 and M5000

Oracle's Sun SPARC Enterprise M4000 and M5000 servers have combined to produce a world record result on Oracle's PeopleSoft Enterprise Financial Management 9.0 benchmark.

  • The Sun SPARC Enterprise M4000 and M5000 servers configured with SPARC64 VII+ processors along with Oracle's Sun Storage F5100 Flash Array system achieved a world record result using PeopleSoft Enterprise Financial Management and Oracle Database 11g Release 2 software running on the Oracle Solaris 10 operating system.

  • The PeopleSoft Enterprise Financial Management solution processed online business transactions to support 1000 concurrent users using 32 application server threads with compliant response times while simultaneously completing complex batch jobs in record time.

  • The Sun Storage F5100 Flash Array system is a high performance, high-density solid-state flash array which provides a read latency of only 0.5 msec which is about 10 times faster than the normal disk latencies of 5 msec measured on this benchmark.

  • The Sun SPARC Enterprise M4000 and M5000 servers were able to process online users and concurrent batch jobs simultaneously in 34.72 minutes on this benchmark, which reflects a complex, multi-tier environment and utilizes a large back-end database of nearly 1 TB.

  • The combination of Oracle's PeopleSoft Enterprise Financial Management 9.00.00.331, PeopleSoft PeopleTools 8.49.23 and Oracle WebLogic server was run on the Sun SPARC Enterprise M4000 server and Oracle database 11g Release 2 was run on the Sun SPARC Enterprise M5000 server for this benchmark.

Performance Landscape

The following table discloses the current result and the single previously disclosed result for this benchmark. Results are elapsed times, so smaller numbers are better.

Servers CPU Tier Batch (mins) Batch w/Online (mins)
Sun SPARC Enterprise M4000 2.66 GHz SPARC64 VII+ Web/App 33.09 34.72
Sun SPARC Enterprise M5000 2.66 GHz SPARC64 VII+ DB

SPARC T3-1 1.65 GHz SPARC T3 Web/App 35.82 37.01
Sun SPARC Enterprise M5000 2.5 GHz SPARC64 VII DB

Configuration Summary

Web/Application Tier Configuration:

1 x Sun SPARC Enterprise M4000
4 x 2.66 GHz SPARC64 VII+ processors
128 GB of memory

Database Tier Configuration:

1 x Sun SPARC Enterprise M5000
8 x 2.66 GHz SPARC64 VII+ processors
128 GB of memory
1 x Sun Storage F5100 Flash Array (74 x 24 GB FMODs)
2 x StorageTek 2540 (12 x 146 GB SAS 15K RPM)
1 x StorageTek 2501 (12 x 146 GB SAS 15K RPM)
1 x Dual-Port SAS Fibre Channel Host Bus Adapters (HBA)

Software Configurations:

Oracle Solaris 10 10/09
PeopleSoft Enterprise Financial Management/SCM 9.00.00.311 64-bit
PeopleSoft Enterprise (PeopleTools) 8.49.23 64-bit
Oracle Database 11g Release 2 11.1.0.6 64-bit
Oracle Tuxedo 9.1 RP36 with Jolt 9.1
Micro Focus COBOL Server Express 4.0 SP4 64-bit

Benchmark Description

This Day-in-the-Life benchmark measured the concurrent batch and online performance for a large database model. This scenario more accurately represents a production environment where users and scheduled batch jobs must run concurrently. This benchmark measured performance results during a Close-the-Books process.

The PeopleSoft Enterprise Financials 9 batch processes included in this benchmark are as follows:

  • Journal Generator: (AE) This process creates journals from accounting entries (AE) generated from various data sources, including non-PeopleSoft systems as well as PeopleSoft applications. In the benchmark, the Journal Generator (FS_JGEN) process is set up to create accounting entries from Oracle's PeopleSoft applications in the same database, such as PeopleSoft Enterprise Payables, Receivables, Asset Management, Expenses, Cash Management. The process is run with the option of Edit and Post turned on to edit and post the journals created by Journal generator. Journal Edit is an AE program and Post is a COBOL program.

  • Allocation: (AE) This process allocates balances held or accumulated in one or more entities to more than one business unit, department or other entities based on user-defined rules.

  • Journal Edit & Post: (AE & COBOL) Journal Edit validates journal transactions before posting them to the ledger. This validation ensures that journals are valid, for example: valid ChartFields values and combinations, debits and credits equal, and inter/intra-unit balanced. The Journal Post process posts only valid, edited journals, ensures each journal line posts to the appropriate target detail ledgers, and then changes the journal's status to posted. In this benchmark, the Journal Edit & Post is also set up to edit and post Oracle's PeopleSoft applications from another database, such as PeopleSoft Enterprise Payroll data.

  • Summary Ledger: (AE) Summary Ledger processing summarizes detail ledger data across selected GL BUs. Summary Ledgers can be generated for reporting purposes or used in consolidations.

  • Consolidations: (COBOL) Consolidation processing summarizes ledger balances and generates elimination journal entries across business units based on user-defined rules.

  • SQR & nVision Reporting: Reporting will consist of nVision and SQR reports. A balance sheet, an income statement, and a trial balance will be generated for each GL BU by SQR processes GLS7002 and GLS7012. The consolidated results of the nVision reports are run by 10 nVision users using 4 standard delivered report request definitions such as BALANCE, INCOME, CONSBAL, and DEPTINC. Each of the nVision users will have ownership over 10 Business Units and each of the nVision users will submit multiple runs that are being executed in parallel to generate a total of 40 nVision reports.

Batch processes are run concurrently with more than 1000 emulated users executing 30 pre-defined online applications. Response times for the online applications are collected and must conform to a maximum time.

Key Points and Best Practices

The Sun SPARC Enterprise M4000 and M5000 servers were able to process online users and concurrent batch jobs simultaneously in 34.72 minutes.

The Sun Storage F5100 Flash Array system, which is highly tuned for IOPS, contributed to the result through reduced IO latency.

The family of Sun SPARC Enterprise M-series servers, with Sun Storage F5100 Flash Array systems, form an ideal environment for hosting complex multi-tier applications. This is the second public disclosure of any system running this benchmark.

The Sun SPARC Enterprise M4000 server hosted the web and application server tiers providing good response time to emulated user requests. The benchmark specification allows 1000 users, but there is headroom for increased load.

The Sun SPARC Enterprise M5000 server was used for the database server along with a Sun Storage F5100 Flash Array system. The speed of the M-series server with the low latency of the Flash Array provided the overall low latency for user requests, even while completing complex batch jobs.

Despite the systems being lightly loaded, the increased frequency of the SPARC64 VII+ processors yielded lower latencies and faster elapsed times than previously disclosed results.

The low latency of the Sun Storage F5100 Flash Array storage contributed to the excellent response times of emulated users by making data quickly available to the database back-end. The array was configured as several RAID 0 volumes and data was distributed across the volumes, maximizing storage bandwidth.

The transaction processing capacity of the Sun SPARC Enterprise M5000 server enabled very fast batch processing times while supporting over 1000 online users.

While running the maximum workload specified by the benchmark, the systems were lightly loaded, providing headroom to grow.

Please see the white paper for information on PeopleSoft payroll best practices using flash.

Disclosure Statement

Oracle's PeopleSoft Financials 9.0 benchmark, Oracle's Sun SPARC Enterprise M4000 (4 2.66 SPARC64 VII+), Oracle's Sun SPARC Enterprise M5000 (8 2.66 SPARC64 VII+), 34.72 min. Results as of 12/02/2010, see www.oracle.com/apps_benchmark/html/white-papers-peoplesoft.html for more about PeopleSoft.

Tuesday Oct 26, 2010

3D VTI Reverse Time Migration Scalability On Sun Fire X2270-M2 Cluster with Sun Storage 7210

This Oil & Gas benchmark shows the Sun Storage 7210 system delivers almost 2 GB/sec bandwidth and realizes near-linear scaling performance on a cluster of 16 Sun Fire X2270 M2 servers.

Oracle's Sun Storage 7210 system attached via QDR InfiniBand to a cluster of sixteen of Oracle's Sun Fire X2270 M2 servers was used to demonstrate the performance of a Reverse Time Migration application, an important application in the Oil & Gas industry. The total application throughput and computational kernel scaling are presented for two production sized grids of 800 samples.

  • Both the Reverse Time Migration I/O and combined computation show near-linear scaling from 8 to 16 nodes on the Sun Storage 7210 system connected via QDR InfiniBand to a Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231: 2.0x improvement
      2486 x 1151 x 1231: 1.7x improvement
  • The computational kernel of the Reverse Time Migration has linear to super-linear scaling from 8 to 16 nodes in Oracle's Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231 : 2.2x improvement
      2486 x 1151 x 1231 : 2.0x improvement
  • Intel Hyper-Threading provides additional performance benefits to both the Reverse Time Migration I/O and computation when going from 12 to 24 OpenMP threads on the Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231: 8% - computational kernel; 2% - total application throughput
      2486 x 1151 x 1231: 12% - computational kernel; 6% - total application throughput
  • The Sun Storage 7210 system delivers the Velocity, Epsilon, and Delta data to the Reverse Time Migration at a steady rate even when timing includes memory initialization and data object creation:

      1243 x 1151 x 1231: 1.4 to 1.6 GBytes/sec
      2486 x 1151 x 1231: 1.2 to 1.3 GBytes/sec

    One can see that when doubling the size of the problem, the additional complexity of overlapping I/O and multiple node file contention only produces a small reduction in read performance.

Performance Landscape

Application Scaling

Performance and scaling results of the total application, including I/O, for the reverse time migration demonstration application are presented. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server. The servers were running with hyperthreading enabled, allowing for 24 OpenMP threads per server.

Application Scaling Across Multiple Nodes
Number Nodes Grid Size - 1243 x 1151 x 1231 Grid Size - 2486 x 1151 x 1231
Total Time (sec) Kernel Time (sec) Total Speedup Kernel Speedup Total Time (sec) Kernel Time (sec) Total Speedup Kernel Speedup
16 504 259 2.0 2.2* 1024 551 1.7 2.0
14 565 279 1.8 2.0 1191 677 1.5 1.6
12 662 343 1.6 1.6 1426 817 1.2 1.4
10 784 394 1.3 1.4 1501 856 1.2 1.3
8 1024 560 1.0 1.0 1745 1108 1.0 1.0

* Super-linear scaling due to the compute kernel fitting better into available cache
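
The speedup columns are relative to the 8-node run. As a rough sketch (names are illustrative, and parallel efficiency is an added convenience metric not reported in the table), the values for the 1243 x 1151 x 1231 grid can be recomputed from the measured times; recomputed figures may differ from the published ones in the last rounded digit.

```python
BASELINE_NODES = 8

# Measured times in seconds for the 1243 x 1151 x 1231 grid, keyed by node count.
TOTAL_SEC = {8: 1024, 10: 784, 12: 662, 14: 565, 16: 504}
KERNEL_SEC = {8: 560, 10: 394, 12: 343, 14: 279, 16: 259}


def speedup(times: dict, nodes: int, baseline: int = BASELINE_NODES) -> float:
    """Elapsed time on the baseline node count divided by time on `nodes`."""
    return times[baseline] / times[nodes]


def efficiency(times: dict, nodes: int, baseline: int = BASELINE_NODES) -> float:
    """Speedup divided by the increase in node count; above 1.0 is super-linear."""
    return speedup(times, nodes, baseline) / (nodes / baseline)


for n in sorted(TOTAL_SEC):
    print(
        f"{n:2d} nodes: total {speedup(TOTAL_SEC, n):.2f}x, "
        f"kernel {speedup(KERNEL_SEC, n):.2f}x, "
        f"kernel efficiency {efficiency(KERNEL_SEC, n):.2f}"
    )
```

An efficiency above 1.0 for the kernel at 16 nodes corresponds to the super-linear result flagged in the footnote.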

Application Scaling – Hyper-Threading Study

The effects of hyperthreading are presented when running the reverse time migration demonstration application. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server.

Hyper-Threading Comparison – 12 versus 24 OpenMP Threads
Number Nodes Threads per Node Grid Size - 1243 x 1151 x 1231 Grid Size - 2486 x 1151 x 1231
Total Time (sec) Kernel Time (sec) Total HT Speedup Kernel HT Speedup Total Time (sec) Kernel Time (sec) Total HT Speedup Kernel HT Speedup
16 24 504 259 1.02 1.08 1024 551 1.06 1.12
16 12 515 279 1.00 1.00 1088 616 1.00 1.00

Read Performance

Read performance is presented for the velocity, epsilon and delta files running the reverse time migration demonstration application. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server. The servers were running with hyperthreading enabled, allowing for 24 OpenMP threads per server.

Velocity, Epsilon, and Delta File Read and Memory Initialization Performance
Number Nodes Overlap MBytes Read Grid Size - 1243 x 1151 x 1231 Grid Size - 2486 x 1151 x 1231
Time (sec) Time Relative 8-node Total GBytes Read Read Rate GB/s Time (sec) Time Relative 8-node Total GBytes Read Read Rate GB/s
16 2040 16.7 1.1 23.2 1.4 36.8 1.1 44.3 1.2
8 951 14.8 1.0 22.1 1.6 33.0 1.0 43.2 1.3

Configuration Summary

Hardware Configuration:

16 x Sun Fire X2270 M2 servers, each with
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory (12 x 4 GB at 1333 MHz)

Sun Storage 7210 system connected via QDR InfiniBand
2 x 18 GB SATA SSD (logzilla)
40 x 1 TB 7200 RPM SATA disks

Software Configuration:

SUSE Linux Enterprise Server SLES 10 SP 2
Oracle Message Passing Toolkit 8.2.1 (for MPI)
Sun Studio 12 Update 1 C++, Fortran, OpenMP

Benchmark Description

This Reverse Time Migration (RTM) demonstration application measures the total time it takes to image 800 samples of various production size grids and write the final image to disk. In this version, each node reads in only the trace, velocity, and conditioning data to be processed by that node plus a four element inline 3-D array pad (spatial order of eight) shared with its neighbors to the left and right during the initialization phase. It represents a full RTM application including the data input, computation, communication, and final output image to be used by the next work flow step involving 3D volumetric seismic interpretation.

Key Points and Best Practices

This demonstration application represents a full Reverse Time Migration solution. Many references to the RTM application tend to focus on the compute kernel and ignore the complexity that the input, communication, and output bring to the task.

I/O Characterization without Optimal Checkpointing

Velocity, Epsilon, and Delta Files - Grid Reading

The additional amount of overlapping reads to share velocity, epsilon, and delta edge data with neighbors can be calculated using the following equation:

    (number_nodes - 1) x (order_in_space) x (y_dimension) x (z_dimension) x (4 bytes) x (3 files)

For this particular benchmark study, the additional 3-D pad overlap for the 16 and 8 node cases is:

    16 nodes: 15 x 8 x 1151 x 1231 x 4 x 3 = 2.04 GB extra
    8 nodes: 7 x 8 x 1151 x 1231 x 4 x 3 = 0.95 GB extra

For the first of the two test cases, the total size of the three files used for the 1243 x 1151 x 1231 case is

    1243 x 1151 x 1231 x 4 bytes = 7.04 GB per file x 3 files = 21.13 GB

With the additional 3-D pad, the total amount of data read is:

    16 nodes: 2.04 GB + 21.13 GB = 23.2 GB
    8 nodes: 0.95 GB + 21.13 GB = 22.1 GB

For the second of the two test cases, the total size of the three files used for the 2486 x 1151 x 1231 case is

    2486 x 1151 x 1231 x 4 bytes = 14.09 GB per file x 3 files = 42.27 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: 2.04 GB + 42.27 GB = 44.3 GB
    8 nodes: 0.95 GB + 42.27 GB = 43.2 GB

Note that the amount of overlapping data read increases, not only by the number of nodes, but as the y dimension and/or the z dimension increases.
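
The grid-reading arithmetic above can be sketched as follows. The function names are illustrative, and the code assumes decimal gigabytes (1 GB = 10^9 bytes), which is what the figures in this section appear to use.

```python
BYTES_PER_SAMPLE = 4   # single-precision value per grid point
GRID_FILES = 3         # velocity, epsilon, and delta
GB = 1e9               # decimal gigabytes (assumption, see lead-in)


def grid_overlap_bytes(nodes: int, order_in_space: int, ny: int, nz: int) -> int:
    """Extra bytes read because neighbouring nodes share a pad of edge data."""
    return (nodes - 1) * order_in_space * ny * nz * BYTES_PER_SAMPLE * GRID_FILES


def grid_files_bytes(nx: int, ny: int, nz: int) -> int:
    """Combined size of the three grid files without any overlap."""
    return nx * ny * nz * BYTES_PER_SAMPLE * GRID_FILES


def total_grid_read_gb(nodes: int, nx: int, ny: int, nz: int, order_in_space: int = 8) -> float:
    """Total data read across all nodes, in GB, including the shared pad."""
    total = grid_files_bytes(nx, ny, nz) + grid_overlap_bytes(nodes, order_in_space, ny, nz)
    return total / GB


for nx in (1243, 2486):
    for nodes in (16, 8):
        print(f"{nx} x 1151 x 1231, {nodes} nodes: {total_grid_read_gb(nodes, nx, 1151, 1231):.1f} GB")
```

Since the overlap term does not depend on the x dimension, doubling x doubles the file size but leaves the extra read volume unchanged.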

Trace Reading

The additional amount of overlapping reads to share trace edge data with neighbors can be calculated using the following equation:

    (number_nodes - 1) x (order_in_space) x (y_dimension) x (4 bytes) x (number_of_time_slices)

For this particular benchmark study, the additional overlap for the 16 and 8 node cases is:

    16 nodes: 15 x 8 x 1151 x 4 x 800 = 442 MB extra
    8 nodes: 7 x 8 x 1151 x 4 x 800 = 206 MB extra

For the first case the size of the trace data file used for the 1243 x 1151 x 1231 case is

    1243 x 1151 x 4 bytes x 800 = 4.578 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: .442 GB + 4.578 GB = 5.0 GB
    8 nodes: .206 GB + 4.578 GB = 4.8 GB

For the second case the size of the trace data file used for the 2486 x 1151 x 1231 case is

    2486 x 1151 x 4 bytes x 800 = 9.156 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: .442 GB + 9.156 GB = 9.6 GB
    8 nodes: .206 GB + 9.156 GB = 9.4 GB

As the number of nodes is increased, the overlap causes more disk lock contention.
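
A similar sketch for the trace data, again with illustrative names and decimal units assumed. Unlike the grid files, the trace overlap scales with the y dimension and the number of time slices rather than with z.

```python
BYTES_PER_SAMPLE = 4
TIME_SLICES = 800      # samples imaged in this benchmark
GB = 1e9               # decimal gigabytes (assumption)


def trace_overlap_bytes(nodes: int, order_in_space: int, ny: int, slices: int = TIME_SLICES) -> int:
    """Extra trace bytes read to share edge data between neighbouring nodes."""
    return (nodes - 1) * order_in_space * ny * BYTES_PER_SAMPLE * slices


def trace_file_bytes(nx: int, ny: int, slices: int = TIME_SLICES) -> int:
    """Size of the trace data file without any overlap."""
    return nx * ny * BYTES_PER_SAMPLE * slices


def total_trace_read_gb(nodes: int, nx: int, ny: int, order_in_space: int = 8) -> float:
    """Total trace data read across all nodes, in GB."""
    return (trace_file_bytes(nx, ny) + trace_overlap_bytes(nodes, order_in_space, ny)) / GB


for nx in (1243, 2486):
    for nodes in (16, 8):
        print(f"nx={nx}, {nodes} nodes: {total_trace_read_gb(nodes, nx, 1151):.1f} GB")
```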

Writing Final Output Image

1243x1151x1231 - 7.1 GB per file:

    16 nodes: 78 x 1151 x 1231 x 4 = 442 MB/node (7.1 GB total)
    8 nodes: 156 x 1151 x 1231 x 4 = 884 MB/node (7.1 GB total)

2486x1151x1231 - 14.1 GB per file:

    16 nodes: 156 x 1151 x 1231 x 4 = 930 MB/node (14.1 GB total)
    8 nodes: 311 x 1151 x 1231 x 4 = 1808 MB/node (14.1 GB total)

Resource Allocation

It is best to allocate one node as the Oracle Grid Engine resource scheduler and MPI master host. This is especially true when running with 24 OpenMP threads in hyperthreading mode to avoid oversubscribing a node that is cooperating in delivering the solution.

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 10/20/2010.

Tuesday Sep 28, 2010

SPARC T3-2 Delivers First Oracle E-Business X-Large Benchmark Self-Service (OLTP) Result

With Oracle's SPARC T3-2 server running the application and Oracle's Sun SPARC Enterprise M5000 server running the database, Oracle set a world record result for the Oracle E-Business Standard X-Large HR Self-Service (OLTP) benchmark.

  • The combination of a SPARC T3-2 server for the application and a Sun SPARC Enterprise M5000 server for the database achieved a result of 4000 HR Self-Service Online users on the Oracle E-Business X-Large benchmark dataset.

  • Oracle's Sun Storage F5100 Flash Array storage which was utilized in the benchmark was instrumental in obtaining an average transaction response time as low as 1.2 seconds.

  • Oracle has published the first Oracle E-Business R12.1.2 XL benchmark for 4000 HR Self-Service online users on a SPARC T3-2 server for the application tier and a Sun SPARC Enterprise M5000 server for the database tier with Oracle 11g R2 database. Both servers ran the Oracle Solaris 10 operating system.

  • The combination of the SPARC T3-2 server and Oracle E-Business R12.1.2 in the application tier with low CPU utilization provides headroom for growth.

  • The Sun Storage F5100 Flash Array storage provides higher performance with a smaller footprint and lower power/cooling costs.

  • The result shows that the SPARC T3-2 server works well as a high capacity application server.

Performance Landscape

This is the FIRST published result for this X-large benchmark.

Workload: HR Self-Service, X-Large Configuration
Vendor/System | OS | Users
SPARC T3-2 | Oracle Solaris 10 9/10 | 4000

Results and Configuration Summary

Application Tier Configuration:

1 x SPARC T3-2 server
2 x SPARC T3 processors, 1.65 GHz
128 GB memory
Oracle Solaris 10 9/10
Oracle E-Business Suite 12.1.2

Database Tier Configuration:

1 x Sun SPARC Enterprise M5000 server
4 x SPARC64 VII processors, 2.53 GHz
128 GB memory
Oracle Solaris 10 10/09
Oracle Database 11g Release 2

Storage Configuration:

1 x Sun Storage F5100 Flash Array storage
1 x StorageTek 2540 array
300 GB

Benchmark Description

The Oracle R12 E-Business Standard Benchmark combines online transaction execution by simulated users with concurrent batch processing to model a typical scenario for a global enterprise. This benchmark includes one online component and 2 batch components. The goal is to obtain reference response times and throughput for Oracle EBS R12. Results can be published in four configurations:

  • X-large: Maximum online users running all business flows between 10,000 and 20,000; 750,000 order to cash lines per hour and 250,000 payroll checks per hour.
    • HR Self-Service Online -- 4000 users
      • The percentage across the 4 transactions in HR Self-Service module is:
        • Create Query Cash Expense -- 20%
        • Create Query Credit Expense -- 20%
        • View Payslip -- 30%
        • Create TimeCard -- 30%
    • Customer Support Flow -- 8000 users
    • Procure to Pay -- 2000 users
    • Order to Cash -- 2400 users
  • Large: 10,000 online users; 100,000 order to cash lines per hour and 100,000 payroll checks per hour.
  • Medium: up to 3000 online users; 50,000 order to cash lines per hour and 10,000 payroll checks per hour.
  • Small: up to 1000 online users; 10,000 order to cash lines per hour and 5,000 payroll checks per hour.
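The user counts in the X-large configuration can be cross-checked with simple arithmetic. The sketch below is illustrative only: the variable and function names are assumptions and not part of the benchmark kit, while the figures come from the list above. It splits the 4000 HR Self-Service users across the four transactions and sums the four X-large flows.

```python
# Illustrative arithmetic only; names are assumptions, figures come from the list above.
HR_USERS = 4000

# Percentage of HR Self-Service users assigned to each transaction.
HR_MIX = {
    "Create Query Cash Expense": 20,
    "Create Query Credit Expense": 20,
    "View Payslip": 30,
    "Create TimeCard": 30,
}

# Online users per business flow in the X-large configuration.
XLARGE_FLOWS = {
    "HR Self-Service Online": 4000,
    "Customer Support Flow": 8000,
    "Procure to Pay": 2000,
    "Order to Cash": 2400,
}


def users_per_transaction(total_users, mix_percent):
    """Split a user population according to integer percentages."""
    return {name: total_users * pct // 100 for name, pct in mix_percent.items()}


hr_split = users_per_transaction(HR_USERS, HR_MIX)
total_xlarge_users = sum(XLARGE_FLOWS.values())

if __name__ == "__main__":
    for name, users in hr_split.items():
        print(f"{name}: {users} users")
    print(f"All X-large flows: {total_xlarge_users} users")
```

Summing the four flows gives a total that sits inside the 10,000 to 20,000 range quoted for the X-large configuration.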

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 9/20/2010.

Monday Sep 27, 2010

Sun Fire X2270 M2 Super-Linear Scaling of Hadoop Terasort and CloudBurst Benchmarks

A 16-node cluster of Oracle's Sun Fire X2270 M2 servers showed super-linear scaling of two Hadoop benchmarks. Performance was measured using the Terasort benchmark with a 100GB data set. In addition, performance was measured using CloudBurst, which maps next generation "short read" sequence data onto the human and other genomes.

  • On the Terasort workload, a 16-node Sun Fire X2270 M2 cluster sorted the 100GB data set at a rate of 433.3 MB/s finishing in 236.3 seconds.

  • The 16-node Sun Fire X2270 M2 cluster was 9.3x faster on a per node basis than the 2010 winner of the Terasort benchmark competition (www.sortbenchmark.org) which used a 3,452-node Xeon cluster to sort 100 TB of input data in 173 minutes. Both systems used Hadoop, Terasort and 2-socket x86 servers. Allowances have to be made for the differences in problem complexity.

  • The Terasort benchmark showed super-linear scaling on the Sun Fire X2270 M2 cluster (total of 32 Intel 2.93GHz Xeons).

  • Using CloudBurst on a workload of the human genome and the SRR001113 short read data set, a 16-node Sun Fire X2270 M2 cluster finished mapping the short reads onto the human genome in 34.2 minutes.

  • On a per node basis, a 2-node Sun Fire X2270 M2 cluster was 1.7x faster than a 12-node Xeon cluster that processed the human genome and the SRR001113 short read data set in approximately 60,000 seconds (see figure 3 of this journal article). Both systems used Hadoop, CloudBurst and x86 servers.

  • The CloudBurst benchmark showed super-linear scaling on the Sun Fire X2270 M2 cluster (total of 32 Intel 2.93GHz Xeons).
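The Terasort throughput and the per-node comparison quoted above can be reproduced with a few lines of arithmetic. This is a sketch under one stated assumption: that 1 GB is counted as 1024 MB (and 1 TB as 1024 x 1024 MB), which is the convention that yields the quoted 433.3 MB/s. The function and variable names are illustrative.

```python
# Assumption: binary units (1 GB = 1024 MB, 1 TB = 1024 * 1024 MB).
MB_PER_GB = 1024
MB_PER_TB = 1024 * 1024


def rate_mb_per_s(data_mb, seconds):
    """Aggregate sort throughput in MB/s."""
    return data_mb / seconds


# 16-node Sun Fire X2270 M2 cluster: 100 GB in 236.3 seconds.
cluster_rate = rate_mb_per_s(100 * MB_PER_GB, 236.3)
cluster_per_node = cluster_rate / 16

# 2010 sort benchmark winner: 100 TB in 173 minutes on 3,452 nodes.
winner_rate = rate_mb_per_s(100 * MB_PER_TB, 173 * 60)
winner_per_node = winner_rate / 3452

per_node_ratio = cluster_per_node / winner_per_node

if __name__ == "__main__":
    print(f"Cluster rate: {cluster_rate:.1f} MB/s")
    print(f"Per-node ratio: {per_node_ratio:.1f}x")
```

As the text notes, the two runs sort very different data volumes, so the per-node ratio is a rough indicator rather than a like-for-like comparison.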

Performance Landscape

Terasort
100 GB input data set
Performance is "real" execution time reported by /usr/bin/time in seconds (smaller is better)
Number of Nodes | Seconds | Scaling | Linear Scaling
16 | 236.3 | 25.4 | 16
8 | 466.3 | 12.9 | 8
4 | 927.2 | 6.5 | 4
2 | 2140.8 | 2.8 | 2
1 | 6010.2 | 1.0 | 1

CloudBurst
SRR001113 short read data set mapped onto the hs_ref_GRCh37 human genome
Performance is "total running" time reported by CloudBurst in seconds (smaller is better)
Number of Nodes | Seconds | Scaling | Linear Scaling
16 | 2054.9 | 8.4 | 8
8 | 3615.8 | 4.7 | 4
4 | 7895.7 | 2.2 | 2
2 | 17155.1 | 1.0 | 1
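The Scaling columns are the baseline run time divided by the run time at each node count, and a result is super-linear when that ratio exceeds the ratio of node counts. The following sketch (function names are illustrative) recomputes the scaling from the reported seconds, using the 1-node run as the Terasort baseline and the 2-node run as the CloudBurst baseline. It reproduces the tabulated values to within rounding.

```python
# Reported elapsed seconds, keyed by number of nodes.
TERASORT = {16: 236.3, 8: 466.3, 4: 927.2, 2: 2140.8, 1: 6010.2}
CLOUDBURST = {16: 2054.9, 8: 3615.8, 4: 7895.7, 2: 17155.1}


def scaling(times_by_nodes, base_nodes):
    """Speedup of each run relative to the run on base_nodes nodes."""
    base_time = times_by_nodes[base_nodes]
    return {nodes: base_time / t for nodes, t in times_by_nodes.items()}


def super_linear(times_by_nodes, base_nodes):
    """Node counts whose speedup exceeds the linear expectation nodes / base_nodes."""
    speedups = scaling(times_by_nodes, base_nodes)
    return sorted(n for n, s in speedups.items() if s > n / base_nodes)


if __name__ == "__main__":
    for name, data, base in (("Terasort", TERASORT, 1), ("CloudBurst", CLOUDBURST, 2)):
        for nodes, s in sorted(scaling(data, base).items()):
            print(f"{name} {nodes:2d} nodes: {s:.2f}x (linear {nodes / base:.0f}x)")
        print(f"{name} super-linear at: {super_linear(data, base)}")
```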

Results and Configuration Summary

Hardware Configuration:

16 x Sun Fire X2270 M2 server, each server with
2 Intel Xeon X5670 2.93GHz processors, turbo enabled
96 GB memory 1066 MHz
HDD SATA 1 TB 7200 RPM 3.5-in.
2 x 10/100/1000 ethernet

Software Configuration:

Oracle Solaris 10 10/09
Java Platform, Standard Edition, JDK 6 Update 20 Performance Release
Hadoop v0.20.2

Benchmark Description

Apache Hadoop is an open-source implementation of Google's MapReduce, with much of its development done at Yahoo. MapReduce lets the programmer write serial code that the framework schedules for parallel execution. MapReduce has been applied to a wide variety of problems, including image processing, sorting, database merging and genomics.
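To make the programming model concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps. It is not Hadoop's API: every function name is hypothetical, and a real framework would run the map and reduce calls in parallel across the cluster. The point is only that the user-supplied mapper and reducer are ordinary serial functions.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the serial mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the serial reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


# User-written serial code: a word-count mapper and reducer.
def word_mapper(line):
    for word in line.split():
        yield word.lower(), 1


def sum_reducer(key, values):
    return sum(values)


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)


if __name__ == "__main__":
    print(word_count(["a b a", "B c"]))
```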

Hadoop uses the Hadoop Distributed Filesystem (HDFS) that distributes data across the local disks of a cluster such that each node in the cluster accesses its local disk to the greatest extent possible.

Results for two different Hadoop benchmarks are reported above:

  • Terasort is an I/O intensive benchmark that was originally developed by Jim Gray. By having many Hadoop data nodes, it is possible to achieve high I/O capacity. For purposes of benchmarking, the Teragen program was used to create an input data set that comprised 100 GB.

  • CloudBurst is a genome assembly benchmark that was developed by Michael Schatz, previously of the University of Maryland and presently of Cold Spring Harbor Laboratory. CloudBurst maps what is known as DNA short read data onto a reference genome. For purposes of benchmarking, the SRR001113 short read data set is mapped onto the hs_ref_GRCh37 sequence data for all chromosomes of the human genome. Specifically, the hs_ref_GRCh37 FASTA files for chromosomes 1, 2, ... 21, 22, X and Y were catenated in that order to obtain one large FASTA file that represented all chromosomes of the human genome. For purposes of benchmarking, any DNA fragment from the SRR001113 short read data set that contained more than three mismatches was ignored.
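The three-mismatch cutoff can be illustrated with a simple position-by-position comparison. This is a sketch of the idea only, assuming equal-length strings and a plain mismatch count; it is not CloudBurst's actual alignment code, and the function names are hypothetical.

```python
MAX_MISMATCHES = 3  # cutoff stated in the benchmark description


def mismatches(read, ref_window):
    """Count positions where the read differs from an equal-length reference window."""
    if len(read) != len(ref_window):
        raise ValueError("read and reference window must have the same length")
    return sum(1 for a, b in zip(read, ref_window) if a != b)


def keep_alignment(read, ref_window, max_mismatches=MAX_MISMATCHES):
    """Return True when the read is within the mismatch cutoff, False when it is ignored."""
    return mismatches(read, ref_window) <= max_mismatches


if __name__ == "__main__":
    print(keep_alignment("ACGTACGT", "TCGAACCT"))  # three mismatches
    print(keep_alignment("AAAA", "TTTT"))          # four mismatches
```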

Disclosure Statement

Hadoop, see http://hadoop.apache.org/ for more information. Results as of 9/20/2010.

SPARC T3-1 Shows Capabilities Running Online Auction Benchmark with Oracle Fusion Middleware

Oracle produced the best performance of the Fusion Middleware Online Auction workload using Oracle's SPARC T3-1 server for the application server and Oracle's Sun SPARC Enterprise M4000 server for the database server.

  • The J2EE WebLogic application achieved 19,000 concurrent users with 4 WebLogic instances running an online auction workload.

  • The SPARC T3-1 server using the J2EE WebLogic application server software on Oracle Solaris 10 showed near-linear scaling of 3.7x on 4 WebLogic instances.

  • The SPARC T3-1 server has proven its capability of balancing a large number of interactive user sessions distributed across multiple hardware threads with sub-second response time while leaving processing capacity for additional growth.

  • The database server, a Sun SPARC Enterprise M4000 server, scaled well with a large number of concurrent user sessions and utilized all the CPU resources fully to serve 19,000 users.

  • Oracle Fusion Middleware provides a family of complete, integrated, hot-pluggable and best-of-breed products known for enabling enterprise customers to create and run agile and intelligent business applications. The Oracle WebLogic Server's on-going, record-setting Java application server performance demonstrates why so many customers rely on Oracle Fusion Middleware as their foundation for innovation.

  • To obtain this leading result a number of Oracle technologies were used: Oracle Solaris 10, Oracle Java Hotspot VM, Oracle WebLogic 10.3.3, Oracle Database 11g Release 2, SPARC T3-1 server, and Sun SPARC Enterprise M4000 server.

Performance Landscape

Online Auction POC Oracle Fusion Middleware Benchmark Results

Application Server | Database Server | Users | Ops/sec*
1 x SPARC T3-1, 1 x 1.65 GHz SPARC T3, Oracle WebLogic 10.3.3 - 4 Instances, Oracle Solaris 10 9/10 | 1 x Sun SPARC Enterprise M4000, 2 x 2.53 GHz SPARC64 VII, Oracle 11g DB 11.2.0.1, Oracle Solaris 10 9/10 | 19000 | 4821
2 x Sun Blade X6270, Oracle WebLogic 10.3.3 - 8 Instances with Coherence, EclipselinkJPA, Oracle Solaris 10 10/09 | 1 x Sun SPARC Enterprise M5000, 4 x 2.53 GHz SPARC64 VII, Oracle 11g DB 11.2.0.1.0, Oracle Solaris 10 10/09 | 16500 | 4600
1 x SPARC T3-1, 1 x 1.65 GHz SPARC T3, Oracle WebLogic 10.3.3 - 1 Instance, Oracle Solaris 10 9/10 | 1 x Sun SPARC Enterprise M4000, 2 x 2.53 GHz SPARC64 VII, Oracle 11g DB 11.2.0.1, Oracle Solaris 10 9/10 | 5000 | 1302

* Online Auction Workload (Obay) Ops/sec, bigger is better.
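The 3.7x scaling figure follows from the 1-instance and 4-instance rows of the table. The sketch below (names are illustrative) derives the throughput and user scaling and the resulting efficiency relative to ideal 4x scaling.

```python
# Figures from the SPARC T3-1 rows of the results table.
ONE_INSTANCE = {"users": 5000, "ops_per_sec": 1302}
FOUR_INSTANCES = {"users": 19000, "ops_per_sec": 4821}
INSTANCE_RATIO = 4  # four WebLogic instances versus one


def scaling_factor(scaled, baseline):
    """Ratio of the scaled-up result to the baseline result."""
    return scaled / baseline


ops_scaling = scaling_factor(FOUR_INSTANCES["ops_per_sec"], ONE_INSTANCE["ops_per_sec"])
user_scaling = scaling_factor(FOUR_INSTANCES["users"], ONE_INSTANCE["users"])
scaling_efficiency = ops_scaling / INSTANCE_RATIO

if __name__ == "__main__":
    print(f"Ops/sec scaling: {ops_scaling:.1f}x")
    print(f"User scaling: {user_scaling:.1f}x")
    print(f"Efficiency versus linear: {scaling_efficiency:.1%}")
```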

Configuration Summary

Application Server Configuration:

1 x SPARC T3-1 server
1 x 1.65 GHz SPARC T3 processor
64 GB memory
1 x 10GbE NIC
Oracle Solaris 10 9/10
Oracle WebLogic 10.3.3 Application Server - Standard Edition
Oracle Fusion Middleware, Obay/DB Loader.1
Oracle Java SE, JDK 6 Update 21
Faban Harness and driver 1.0

Database Server Configuration:

1 x Sun SPARC Enterprise M4000 server
2 x 2.53 GHz SPARC64 VII processors
128 GB memory
1 x 10GbE NIC, 1 x 1GbE NIC
1 x StorageTek 2540 HW RAID controller
2 x Sun Storage 2501 JBOD Arrays
Oracle Solaris 10 9/10
Oracle Database Enterprise Edition Release 11.2.0.1

Benchmark Description

This Fusion Middleware Online Auction workload is derived from an auction demonstration application originally developed by Oracle based on eBay's online auction model. The workload includes functions such as adding new products for auction, submitting bids, user login/logout and admin account setup. It requires a J2EE application server and database. The benchmark metric is the number of total users (bidders/sellers) accessing the items database and executing a number of operations per second with an average response time below a second. The online auction workload demonstrates the performance benefits of using Oracle Fusion Middleware integrated with Oracle servers.

Key Points and Best Practices

  • TCP tunable tcp_time_wait_interval was reduced to 10000 to respond to a large number of concurrent connections.

  • TCP tunables tcp_conn_req_max_q and tcp_conn_req_max_q0 were increased to 65536.

  • Four Oracle WebLogic application server instances were hosted on a single SPARC T3-1.

  • The WebLogic application servers were executed in the FX scheduling class to reduce context switches.

  • The JVM used was 64-bit, with the heap increased to 6 GB.

  • WebLogic server made use of a performance pack enabled by LD_LIBRARY_PATH native to SPARC64.

  • The Oracle database logwriter was bound to a core on the Sun SPARC Enterprise M4000 server and ran under the RT class.

  • The Oracle database processes were bound round robin to processors and executed in the FX scheduling class to reduce thread migration and, hence, improve performance.

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Reported 9/20/2010.

Thursday Sep 23, 2010

SPARC T3-1 Performance on PeopleSoft Enterprise Financials 9.0 Benchmark

Oracle's SPARC T3-1 and Sun SPARC Enterprise M5000 servers combined with Oracle's Sun Storage F5100 Flash Array storage have produced the first world-wide disclosure and world record performance on the PeopleSoft Enterprise Financials 9.0 benchmark.

  • Using SPARC T3-1 and Sun SPARC Enterprise M5000 servers along with a Sun Storage F5100 Flash Array system, the Oracle solution processed online business transactions to support 1000 concurrent users using 32 application server threads with compliant response times while simultaneously completing complex batch jobs. This is the first publication of this benchmark by any vendor world-wide.

  • The Sun Storage F5100 Flash Array system is a high performance, high-density solid-state flash array which provides a read latency of only 0.5 msec, about 10 times faster than the normal disk latencies of 5 msec measured on this benchmark.

  • The SPARC T3-1 and Sun SPARC Enterprise M5000 servers were able to process online users and concurrent batch jobs simultaneously in 38.66 minutes on this benchmark, which reflects a complex, multi-tier environment and utilizes a large back-end database of nearly 1 TB.

  • Both the SPARC T3-1 and Sun SPARC Enterprise M5000 servers used the Oracle Solaris 10 operating system.

  • For this benchmark, Oracle's PeopleSoft Enterprise Financials/SCM 9.00.00.331, PeopleSoft Enterprise (PeopleTools) 8.49.23 and Oracle WebLogic server ran on the SPARC T3-1 server, and the Oracle database 11g Release 1 ran on the Sun SPARC Enterprise M5000 server.

Performance Landscape

As the first world-wide disclosure of this benchmark, no competitive results exist with which the current result may be compared.

Batch Processing Times
Elapsed Time in Minutes
Batch Process | Batch Alone* | Batch with 1000 Online Users*
JGEN Subsystem | 7.30 | 7.78
JEDIT1 | 2.52 | 3.77
ALLOCATION | 6.05 | 10.15
ALLOC EDIT/POST | 2.32 | 2.23
SUM LEDGER | 1.00 | 1.18
CONSOLIDATION | 1.50 | 1.55
Total Main Batch Stream | 20.69 | 26.66
SQR/GL_LEDGER | 8.92 | 9.12
SQR/GL_TBAL | 3.33 | 3.35
SQR | 11.83 | 12.00
nVisions | 8.78 | 8.83
nVision | 11.83 | 12.00
Max SQR and nVision Stream | 11.83 | 12.00
Total Batch (sum of Main Batch and Max SQR) | 32.52 | 38.66

* PeopleSoft Enterprise Financials batch processing and post-processing elapsed times.
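The totals in the table can be checked by summing the six main-stream processes and adding the longest reporting stream. The sketch below uses the reported minutes; the dictionary layout and function name are illustrative.

```python
# (batch alone, batch with 1000 online users), elapsed minutes from the table.
MAIN_STREAM = {
    "JGEN Subsystem": (7.30, 7.78),
    "JEDIT1": (2.52, 3.77),
    "ALLOCATION": (6.05, 10.15),
    "ALLOC EDIT/POST": (2.32, 2.23),
    "SUM LEDGER": (1.00, 1.18),
    "CONSOLIDATION": (1.50, 1.55),
}
MAX_SQR_NVISION_STREAM = (11.83, 12.00)


def batch_totals(column):
    """Return (main stream total, total batch) for column 0 (alone) or 1 (with users)."""
    main = sum(times[column] for times in MAIN_STREAM.values())
    total = main + MAX_SQR_NVISION_STREAM[column]
    return round(main, 2), round(total, 2)


if __name__ == "__main__":
    print("Batch alone:", batch_totals(0))
    print("Batch with 1000 online users:", batch_totals(1))
```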

Results and Configuration Summary

Hardware Configuration:

1 x SPARC T3-1 (1 x T3 at 1.65 GHz with 128 GB of memory)
1 x Sun SPARC Enterprise M5000 (8 x SPARC64 at 2.53 GHz with 64 GB of memory)
1 x Sun Storage F5100 Flash Array (74 x 24 GB FMODs)
2 x StorageTek 2540 (12 x 146 GB SAS 15K RPM)
1 x StorageTek 2501 (12 x 146 GB SAS 15K RPM)
1 x Dual-Port SAS Fibre Channel Host Bus Adapters (HBA)

Software Configuration:

Oracle Solaris 10 10/09
Oracle's PeopleSoft Enterprise Financials/SCM 9.00.00.311 64-bit
Oracle's PeopleSoft Enterprise (PeopleTools) 8.49.23 64-bit
Oracle 11g R1 11.1.0.6 64-bit
Oracle Tuxedo 9.1 RP36 with Jolt 9.1
Micro Focus COBOL Server Express 4.0 SP4 64-bit

Benchmark Description

The PeopleSoft Enterprise Financials batch processes included in this benchmark are as follows:

  • Journal Generator: (AE) This process creates journals from accounting entries (AE) generated from various data sources, including non-PeopleSoft systems as well as PeopleSoft applications. In the benchmark, the Journal Generator (FS_JGEN) process is set up to create accounting entries from Oracle's PeopleSoft applications in the same database, such as PeopleSoft Enterprise Payables, Receivables, Asset Management, Expenses, Cash Management. The process is run with the option of Edit and Post turned on to edit and post the journals created by Journal generator. Journal Edit is an AE program and Post is a COBOL program.

  • Allocation: (AE) This process allocates balances held or accumulated in one or more entities to more than one business unit, department or other entities based on user-defined rules.

  • Journal Edit & Post: (AE & COBOL) Journal Edit validates journal transactions before posting them to the ledger. This validation ensures that journals are valid, for example: valid ChartFields values and combinations, debits and credits equal, and inter/intra-unit balanced. The Journal Post process posts only valid, edited journals, ensures each journal line posts to the appropriate target detail ledgers, and then changes the journal's status to posted. In this benchmark, the Journal Edit & Post is also set up to edit and post Oracle's PeopleSoft applications from another database, such as PeopleSoft Enterprise Payroll data.

  • Summary Ledger: (AE) Summary Ledger processing summarizes detail ledger data across selected GL BUs. Summary Ledgers can be generated for reporting purposes or used in consolidations.

  • Consolidations: (COBOL) Consolidation processing summarizes ledger balances and generates elimination journal entries across business units based on user-defined rules.

  • SQR & nVision Reporting: Reporting will consist of nVision and SQR reports. A balance sheet, an income statement, and a trial balance will be generated for each GL BU by SQR processes GLS7002 and GLS7012. The consolidated results of the nVision reports are run by 10 nVision users using 4 standard delivered report request definitions such as BALANCE, INCOME, CONSBAL, and DEPTINC. Each of the nVision users will have ownership over 10 Business Units and each of the nVision users will submit multiple runs that are being executed in parallel to generate a total of 40 nVision reports.

Batch processes are run concurrently with more than 1000 emulated users executing 30 pre-defined online applications. Response times for the online applications are collected and must conform to a maximum time.

Key Points and Best Practices

Oracle's SPARC T3-1 and Oracle's Sun SPARC Enterprise M5000 servers published the first result for Oracle's PeopleSoft Enterprise Financials 9.0 benchmark for concurrent batch and 1000 online users using the large database model on Oracle 11g running Oracle Solaris 10.

The SPARC T3-1 and Sun SPARC Enterprise M5000 servers were able to process online users and concurrent batch jobs simultaneously in 38.66 minutes.

The Sun Storage F5100 Flash Array system, which is highly tuned for IOPS, contributed to the result through reduced IO latency.

The combination of the SPARC T3-1 and Sun SPARC Enterprise M5000 servers, with a Sun Storage F5100 Flash Array system, forms an ideal environment for hosting complex multi-tier applications. This is the first public disclosure of any system running this benchmark.

The SPARC T3-1 server hosted the web and application server tiers, providing good response time to emulated user requests. The benchmark specification allows 1000 users, but there is headroom for increased load.

The Sun SPARC Enterprise M5000 server was used for the database server along with a Sun Storage F5100 Flash Array system. The speed of the M-series server with the low latency of the Flash Array provided the overall low latency for user requests, even while completing complex batch jobs.

The parallelism of the SPARC T3-1 server, when used as an application and web server tier, is best taken advantage of by configuring sufficient server processes. With sufficient server processes distributed across the hardware cores, acceptable user response times are achieved.

The low latency of the Sun Storage F5100 Flash Array storage contributed to the excellent response times of emulated users by making data quickly available to the database back-end. The array was configured as several RAID 0 volumes and data was distributed across the volumes, maximizing storage bandwidth.

The transaction processing capacity of the Sun SPARC Enterprise M5000 server enabled very fast batch processing times while supporting over 1000 online users.

While running the maximum workload specified by the benchmark, the systems were lightly loaded, providing headroom to grow.

Please see the white paper for information on PeopleSoft payroll best practices using flash.

Disclosure Statement

Oracle's PeopleSoft Financials 9.0 benchmark, Oracle's SPARC T3-1 (1 1.65GHz SPARC-T3), Oracle's SPARC Enterprise M5000 (8 2.53GHz SPARC64), 38.66 min. www.oracle.com/apps_benchmark/html/white-papers-peoplesoft.html Results 09/20/2010.

Tuesday Sep 21, 2010

ProMAX Performance and Throughput on Sun Fire X2270 and Sun Storage 7410

Halliburton/Landmark's ProMAX 3D Prestack Kirchhoff Time Migration's single job scalability and multiple job throughput using various scheduling methods are evaluated on a cluster of Oracle's Sun Fire X2270 servers attached via QDR InfiniBand to Oracle's Sun Storage 7410 system.

Two resource scheduling methods, compact and distributed, are compared while increasing the system load with additional concurrent ProMAX jobs.

  • A single ProMAX job has near linear scaling of 5.5x on 6 nodes of a Sun Fire X2270 cluster.

  • A single ProMAX job has near linear scaling of 7.5x on a Sun Fire X2270 server when running from 1 to 8 threads.

  • ProMAX can take advantage of Oracle's Sun Storage 7410 system features compared to dedicated local disks. There was no significant difference in run time observed when running up to 8 concurrent 16 thread jobs.

  • The 8-thread ProMAX job throughput using the distributed scheduling method is equivalent to or slightly faster than the compact scheme for 1 to 4 concurrent jobs.

  • The 16-thread ProMAX job throughput using the distributed scheduling method is up to 8% faster when compared to the compact scheme on an 8-node Sun Fire X2270 cluster.

The multiple job throughput characterizations revealed in this benchmark study are key in pre-configuring Oracle Grid Engine resource scheduling for ProMAX on a Sun Fire X2270 cluster and provide valuable insight for server consolidation.

Performance Landscape

Single Job Scaling

Single job performance on a single node is near linear up to the number of cores in the node, i.e., 2 Intel Xeon X5570s with 4 cores each. With hyperthreading (2 active threads per core) enabled, more ProMAX threads are used, increasing the load on the CPU's memory architecture and reducing the speedups.
ProMAX single job performance on the 6-node cluster shows near linear speedup node to node.
Single Job 6-Node Scalability
Hyperthreading Enabled - 16 Threads/Node Maximum
Number of Nodes | Threads Per Node | Speedup to 1 Thread | Speedup to 1 Node
6 | 16 | 54.2 | 5.5
4 | 16 | 36.2 | 3.6
3 | 16 | 26.1 | 2.6
2 | 16 | 17.6 | 1.8
1 | 16 | 10.0 | 1.0
1 | 14 | 9.2 |
1 | 12 | 8.6 |
1 | 10 | 7.2* |
1 | 8 | 7.5 |
1 | 6 | 5.9 |
1 | 4 | 3.9 |
1 | 3 | 3.0 |
1 | 2 | 2.0 |
1 | 1 | 1.0 |

* 2 threads contend with two master node daemons
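One way to read the single-job table is as parallel efficiency, meaning speedup divided by the number of threads or nodes used. The sketch below applies that definition to the reported speedups; the function name and data layout are illustrative.

```python
# Reported speedups on one node, keyed by ProMAX threads (relative to 1 thread).
THREAD_SPEEDUP = {1: 1.0, 2: 2.0, 3: 3.0, 4: 3.9, 6: 5.9, 8: 7.5,
                  10: 7.2, 12: 8.6, 14: 9.2, 16: 10.0}

# Reported speedups at 16 threads per node, keyed by nodes (relative to 1 node).
NODE_SPEEDUP = {1: 1.0, 2: 1.8, 3: 2.6, 4: 3.6, 6: 5.5}


def efficiency(speedup, workers):
    """Fraction of ideal linear speedup achieved with the given number of workers."""
    return speedup / workers


if __name__ == "__main__":
    for threads, s in sorted(THREAD_SPEEDUP.items()):
        print(f"{threads:2d} threads: {efficiency(s, threads):.0%}")
    for nodes, s in sorted(NODE_SPEEDUP.items()):
        print(f"{nodes} nodes: {efficiency(s, nodes):.0%}")
```

Efficiency stays close to 1 up to the 8 physical cores and drops once hyperthreads are added, which matches the explanation given above the table.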

Multiple Job Throughput Scaling, Compact Scheduling

With the Sun Storage 7410 system, performance of 8 concurrent jobs on the cluster using compact scheduling is equivalent to a single job.

Multiple Job Throughput Scalability
Hyperthreading Enabled - 16 Threads/Node Maximum
Number of Jobs | Number of Nodes per Job | Threads Per Node per Job | Performance Relative to 1 Job | Total Nodes | Percent Cluster Used
1 | 1 | 16 | 1.00 | 1 | 13
2 | 1 | 16 | 1.00 | 2 | 25
4 | 1 | 16 | 1.00 | 4 | 50
8 | 1 | 16 | 1.00 | 8 | 100

Multiple 8-Thread Job Throughput Scaling, Compact vs. Distributed Scheduling

These results compare several levels of distributed scheduling against compact-method baselines for 1, 2, and 4 concurrent jobs.

Multiple 8-Thread Job Scheduling
HyperThreading Enabled - Use 8 Threads/Node Maximum
Number of Jobs | Number of Nodes per Job | Threads Per Node per Job | Performance Relative to 1 Job | Total Nodes | Total Threads per Node Used | Percent of PVM Master 8 Threads Used
1 | 1 | 8 | 1.00 | 1 | 8 | 100
1 | 4 | 2 | 1.01 | 4 | 2 | 25
1 | 8 | 1 | 1.01 | 8 | 1 | 13

2 | 1 | 8 | 1.00 | 2 | 8 | 100
2 | 4 | 2 | 1.01 | 4 | 4 | 50
2 | 8 | 1 | 1.01 | 8 | 2 | 25

4 | 1 | 8 | 1.00 | 4 | 8 | 100
4 | 4 | 2 | 1.00 | 4 | 8 | 100
4 | 8 | 1 | 1.01 | 8 | 4 | 100

Multiple 16-Thread Job Throughput Scaling, Compact vs. Distributed Scheduling

The results are reported relative to the performance of 1, 2, 4, and 8 concurrent 2-node, 8-thread jobs.

Multiple 16-Thread Job Scheduling
HyperThreading Enabled - 16 Threads/Node Available
Number of Jobs | Number of Nodes per Job | Threads Per Node per Job | Performance Relative to 1 Job | Total Nodes | Total Threads per Node Used | Percent of PVM Master 16 Threads Used
1 | 1 | 16 | 0.66 | 1 | 16 | 100*
1 | 2 | 8 | 1.00 | 2 | 8 | 50
1 | 4 | 4 | 1.03 | 4 | 4 | 25
1 | 8 | 2 | 1.06 | 8 | 2 | 13

2 | 1 | 16 | 0.70 | 2 | 16 | 100*
2 | 2 | 8 | 1.00 | 4 | 8 | 50
2 | 4 | 4 | 1.07 | 8 | 4 | 25
2 | 8 | 2 | 1.08 | 8 | 4 | 25

4 | 1 | 16 | 0.74 | 4 | 16 | 100*
4 | 4 | 4 | 0.74 | 4 | 16 | 100*
4 | 2 | 8 | 1.00 | 8 | 8 | 50
4 | 4 | 4 | 1.05 | 8 | 8 | 50
4 | 8 | 2 | 1.04 | 8 | 8 | 50

8 | 1 | 16 | 1.00 | 8 | 16 | 100*
8 | 4 | 4 | 1.00 | 8 | 16 | 100*
8 | 8 | 2 | 1.00 | 8 | 16 | 100*

* master PVM host; running 20 to 21 total threads (over-subscribed)

Results and Configuration Summary

Hardware Configuration:

8 x Sun Fire X2270 servers, each with
2 x 2.93 GHz Intel Xeon X5570 processors
48 GB memory at 1333 MHz
1 x 500 GB SATA
Sun Storage 7410 system
4 x 2.3 GHz AMD Opteron 8356 processors
128 GB memory
2 Internal 233GB SAS drives = 466 GB
2 Internal 93 GB read optimized SSD = 186 GB
1 External Sun Storage J4400 array with 22 1TB SATA drives and 2 18GB write optimized SSD
11 TB mirrored data and mirrored write optimized SSD

Software Configuration:

SUSE Linux Enterprise Server 10 SP 2
Parallel Virtual Machine 3.3.11
Oracle Grid Engine
Intel 11.1 Compilers
OpenWorks Database requires Oracle 10g Enterprise Edition
Libraries: pthreads 2.4, Java 1.6.0_01, BLAS, Stanford Exploration Project Libraries

Benchmark Description

The ProMAX family of seismic data processing tools is the most widely used Oil and Gas Industry seismic processing application. ProMAX is used for multiple applications, from field processing and quality control, to interpretive project-oriented reprocessing at oil companies and production processing at service companies. ProMAX is integrated with Halliburton's OpenWorks Geoscience Oracle Database to index prestack seismic data and populate the database with processed seismic.

This benchmark evaluates single job scalability and multiple job throughput of the ProMAX 3D Prestack Kirchhoff Time Migration while processing the Halliburton benchmark data set containing 70,808 traces with 8 msec sample interval and trace length of 4992 msec. Alternative thread scheduling methods are compared for optimizing single and multiple job throughput. The compact scheme schedules the threads of a single job in as few nodes as possible, whereas the distributed scheme schedules the threads across as many nodes as possible. The effects of load on the Sun Storage 7410 system are measured. This information provides valuable insight into determining the Oracle Grid Engine resource management policies.

Hyperthreading is enabled for all of the tests. It should be noted that every node is running a PVM daemon and ProMAX license server daemon. On the master PVM daemon node, there are three additional ProMAX daemons running.

The first test measures single job scalability across a 6-node cluster with an additional node serving as the master PVM host. Speedups relative to a single node and to a single thread are reported.

The second test measures multiple job scalability running 1 to 8 concurrent 16-thread jobs using the Sun Storage 7410 system. The performance is reported relative to a single job.

The third test compares 8-thread multiple job throughput using different job scheduling methods on a cluster. The compact method puts all 8 threads of a job on the same node. The distributed method spreads the 8 threads of a job across multiple nodes. The results compare several levels of distributed scheduling against compact-method baselines for 1, 2, and 4 concurrent jobs.

The fourth test is similar to the second test except running 16-thread ProMAX jobs. The results are reported relative to the performance of 1, 2, 4, and 8 concurrent 2-node, 8-thread jobs.
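The difference between the compact and distributed schemes can be sketched as two thread-placement functions. This is a simplified illustration under stated assumptions: the function names are hypothetical, each node offers a fixed number of thread slots, and no attempt is made to model how Oracle Grid Engine actually allocates resources.

```python
def compact(threads, nodes, slots_per_node):
    """Fill each node's slots completely before moving on to the next node."""
    placement = [0] * nodes
    remaining = threads
    for i in range(nodes):
        placement[i] = min(slots_per_node, remaining)
        remaining -= placement[i]
    if remaining:
        raise ValueError("not enough slots for the requested threads")
    return placement


def distributed(threads, nodes, slots_per_node):
    """Spread threads as evenly as possible across all the given nodes."""
    base, extra = divmod(threads, nodes)
    placement = [base + (1 if i < extra else 0) for i in range(nodes)]
    if max(placement) > slots_per_node:
        raise ValueError("not enough slots for the requested threads")
    return placement


if __name__ == "__main__":
    # One 8-thread job on an 8-node cluster, 8 usable threads per node.
    print("compact:    ", compact(8, 8, 8))
    print("distributed:", distributed(8, 8, 8))
    # The same job restricted to 4 nodes.
    print("distributed:", distributed(8, 4, 8))
```

The placements produced for 1, 4 and 8 nodes per job correspond to the 8, 2 and 1 threads-per-node cases in the 8-thread scheduling table.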

The ProMAX processing parameters used for this benchmark:

Minimum output inline = 65
Maximum output inline = 85
Inline output sampling interval = 1
Minimum output xline = 1
Maximum output xline = 200 (fold)
Xline output sampling interval = 1
Antialias inline spacing = 15
Antialias xline spacing = 15
Stretch Mute Aperature Limit with Maximum Stretch = 15
Image Gather Type = Full Offset Image Traces
No Block Moveout
Number of Alias Bands = 10
3D Amplitude Phase Correction
No compression
Maximum Number of Cache Blocks = 500000

Key Points and Best Practices

  • The application was rebuilt with the Intel 11.1 Fortran and C++ compilers with these flags:

    -xSSE4.2 -O3 -ipo -no-prec-div -static -m64 -ftz -fast-transcendentals -fp-speculation=fast

  • There are additional execution threads associated with a ProMAX node. There are two threads that run on each node: the license server and PVM daemon. There are at least three additional daemon threads that run on the PVM master server: the ProMAX interface GUI, the ProMAX job execution - SuperExec, and the PVM console and control. It is best to allocate one node as the master PVM server to handle the additional 5+ threads. Otherwise, hyperthreading can be enabled and the master PVM host can support up to 8 ProMAX job threads.

  • When hyperthreading is enabled on one of the non-master PVM hosts, there is a 7% penalty going from 8 to 10 threads. However, 12 threads are 11% faster than 8. This can be attributed to the two additional support threads when hyperthreading initiates.

  • Single job performance on a single node is near linear up to the number of cores in the node, i.e., 2 Intel Xeon X5570s with 4 cores each. With hyperthreading (2 active threads per core) enabled, more ProMAX threads are used, increasing the load on the CPU's memory architecture and reducing the speedups.

    Users need to be aware of these performance differences and how they affect their production environment.
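The thread accounting described in these points can be written down as a small calculation. The sketch below assumes 16 hardware threads per node (2 processors x 4 cores x 2 hyperthreads), two daemon threads on every node, and exactly three extra daemon threads on the master PVM host, which is the stated minimum. The names are illustrative.

```python
HW_THREADS_PER_NODE = 2 * 4 * 2   # processors x cores x hyperthreads
PER_NODE_DAEMONS = 2              # license server and PVM daemon
MASTER_EXTRA_DAEMONS = 3          # "at least three" on the master PVM host


def total_threads(job_threads, is_master):
    """ProMAX job threads plus the daemon threads sharing the node."""
    extra = MASTER_EXTRA_DAEMONS if is_master else 0
    return job_threads + PER_NODE_DAEMONS + extra


def oversubscribed(job_threads, is_master):
    """True when the node runs more threads than it has hardware threads."""
    return total_threads(job_threads, is_master) > HW_THREADS_PER_NODE


if __name__ == "__main__":
    for job_threads in (8, 16):
        n = total_threads(job_threads, is_master=True)
        print(f"master host, {job_threads} job threads: {n} total, "
              f"oversubscribed={oversubscribed(job_threads, True)}")
```

With 16 job threads the master host ends up at 21 threads, in line with the 20 to 21 threads noted under the 16-thread scheduling table, while 8 job threads leave it within its hardware thread count.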

Disclosure Statement

The following are trademarks or registered trademarks of Halliburton/Landmark Graphics: ProMAX. Results as of 9/20/2010.

Monday Sep 20, 2010

Sun Fire X4470 4 Node Cluster Delivers World Record SAP SD-Parallel Benchmark Result

Oracle delivered an SAP enhancement package 4 for SAP ERP 6.0 Sales and Distribution – Parallel (SD-Parallel) Benchmark world record result using four of Oracle's Sun Fire X4470 servers, Oracle Solaris 10 and Oracle 11g Real Application Clusters (RAC) software.

  • The Sun Fire X4470 servers delivered 8% more performance compared to the IBM Power 780 server running the SAP enhancement package 4 for SAP ERP 6.0 Sales and Distribution benchmark.

  • The Sun Fire X4470 servers result of 40,000 users delivered 2.2 times the performance of the HP ProLiant DL980 G7 result of 18,180 users.

  • The Sun Fire X4470 servers result of 40,000 users delivered 2.5 times the performance of the Fujitsu PRIMEQUEST 1800E result of 16,000 users.

This result shows that a complete software and hardware solution from Oracle using Oracle RAC, Oracle Solaris and Sun servers provides a superior performing solution.
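The comparisons above are ratios of SD benchmark user counts. A minimal sketch of that arithmetic follows, using the user counts quoted in this post; the dictionary and function names are illustrative.

```python
# SAP SD benchmark users for each result quoted in this post.
USERS = {
    "Four Sun Fire X4470": 40000,
    "IBM Power 780": 37000,
    "HP ProLiant DL980 G7": 18180,
    "Fujitsu PRIMEQUEST 1800E": 16000,
}


def advantage(competitor, reference="Four Sun Fire X4470"):
    """How many times more users the reference result supports than the competitor."""
    return USERS[reference] / USERS[competitor]


if __name__ == "__main__":
    print(f"vs IBM Power 780: {(advantage('IBM Power 780') - 1) * 100:.0f}% more")
    print(f"vs HP ProLiant DL980 G7: {advantage('HP ProLiant DL980 G7'):.1f}x")
    print(f"vs Fujitsu PRIMEQUEST 1800E: {advantage('Fujitsu PRIMEQUEST 1800E'):.1f}x")
```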

Performance Landscape

Selected SAP Sales and Distribution benchmark results are presented in decreasing order of performance. All benchmarks were using SAP enhancement package 4 for SAP ERP 6.0 (Unicode) except the result marked with an asterisk (*), which was achieved with SAP ERP 6.0.

System | OS, Database | Users | SAPS | Type | Date
Four Sun Fire X4470, 4xIntel Xeon X7560 @2.26GHz, 256 GB | Solaris 10, Oracle 11g Real Application Clusters | 40,000 | 221,014 | Parallel | 20-Sep-10
Five IBM System p 570 (*), 8xPOWER6 @4.7GHz, 128 GB | AIX 5L Version 5.3, Oracle 10g Real Application Clusters | 37,040 | 187,450 | Parallel "non-Unicode" | 25-Mar-08
IBM Power 780, 8xPOWER7 @3.8GHz, 1 TB | AIX 6.1, DB2 9.7 | 37,000 | 202,180 | 2-Tier | 7-Apr-10
Two Sun Fire X4470, 4xIntel Xeon X7560 @2.26GHz, 256 GB | Solaris 10, Oracle 11g Real Application Clusters | 21,000 | 115,300 | Parallel | 28-Jun-10
HP DL980 G7, 8xIntel Xeon X7560 @2.26GHz, 512 GB | Win Server 2008 R2 DE, SQL Server 2008 | 18,180 | 99,320 | 2-Tier | 21-Jun-10
Fujitsu PRIMEQUEST 1800E, 8xIntel Xeon X7560 @2.26GHz, 512 GB | Win Server 2008 R2 DE, SQL Server 2008 | 16,000 | 87,550 | 2-Tier | 30-Mar-10
Four Sun Blade X6270, 2xIntel Xeon X5570 @2.93GHz, 48 GB | Solaris 10, Oracle 10g Real Application Clusters | 13,718 | 75,762 | Parallel | 12-Oct-09
HP DL580 G7, 4xIntel Xeon X7560 @2.26GHz, 256 GB | Win Server 2008 R2 DE, SQL Server 2008 | 10,445 | 57,020 | 2-Tier | 21-Jun-10
Two Sun Blade X6270, 2xIntel Xeon X5570 @2.93GHz, 48 GB | Solaris 10, Oracle 10g Real Application Clusters | 7,220 | 39,420 | Parallel | 12-Oct-09
One Sun Blade X6270, 2xIntel Xeon X5570 @2.93GHz, 48 GB | Solaris 10, Oracle 10g Real Application Clusters | 3,800 | 20,750 | Parallel | 12-Oct-09

Complete benchmark results and a description can be found at the SAP benchmark website http://www.sap.com/solutions/benchmark/sd.epx.

Results and Configuration Summary

Hardware Configuration:

4 x Sun Fire X4470 servers, each with
4 x Intel Xeon X7560 2.26 GHz (4 chips, 32 cores, 64 threads)
256 GB memory

Software Configuration:

Oracle 11g Real Application Clusters (RAC)
Oracle Solaris 10

Results Summary:

Number of SAP SD benchmark users:
40,000
Average dialog response time:
0.86 seconds
Throughput:

Dialog steps/hour:
13,261,000

SAPS:
221,020
SAP Certification:
2010039
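
As a cross-check on the figures above, SAPS can be derived from throughput. The sketch below assumes SAP's published definition that 100 SAPS corresponds to 6,000 dialog steps per hour. The function name is made up for illustration, and because the reported throughput is rounded, the derived value only approximately matches the certified SAPS figure.

```python
def saps_from_dialog_steps(dialog_steps_per_hour: float) -> float:
    """Convert throughput to SAPS, assuming 100 SAPS = 6,000 dialog steps/hour."""
    return dialog_steps_per_hour * 100 / 6000


# Throughput reported for the four-node Sun Fire X4470 result.
derived = saps_from_dialog_steps(13_261_000)
print(f"Derived SAPS: {derived:,.0f}")
```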

Benchmark Description

SAP is one of the premier world-wide ERP application providers and maintains a suite of benchmark tests to demonstrate the performance of competitive systems running the various SAP products.

The SAP Standard Application SD Benchmark represents the critical tasks performed in real-world ERP business environments. The SAP Standard Application Sales and Distribution - Parallel (SD-Parallel) Benchmark is a two-tier ERP business test that is indicative of full business workloads of complete order processing and invoice processing and demonstrates the ability to run both the application and database software on a single system.

The SD-Parallel Benchmark consists of the same transactions and user interaction steps as the SD Benchmark. This means that the SD-Parallel Benchmark runs the same business processes as the SD Benchmark. The difference between the benchmarks is the technical data distribution.

The additional rule for parallel and distributed databases is one must equally distribute the benchmark users across all database nodes for the used benchmark clients (round-robin method). Following this rule, all database nodes work on data of all clients. This avoids unrealistic configurations such as having only one client per database node.
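
The round-robin rule can be pictured with a small sketch. This is not SAP's benchmark driver: the function name, the user numbering, and the client and node counts are all assumptions chosen for illustration.

```python
from collections import Counter


def assign_round_robin(users_per_client: int, clients: int, nodes: int):
    """Return (client, user, node) tuples, cycling users over all database nodes."""
    assignments = []
    slot = 0
    for client in range(clients):
        for user in range(users_per_client):
            # Each successive user goes to the next node, wrapping around.
            assignments.append((client, user, slot % nodes))
            slot += 1
    return assignments


# Hypothetical example: 3 benchmark clients, 8 users each, 4 database nodes.
plan = assign_round_robin(users_per_client=8, clients=3, nodes=4)
users_per_node = Counter(node for _, _, node in plan)
clients_per_node = {
    n: {client for client, _, node in plan if node == n} for n in range(4)
}
print(users_per_node)
print(clients_per_node)
```

With this assignment every node carries the same number of users and sees data from every client, which is the property the rule is meant to enforce.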

The SAP Benchmark Council agreed to give the parallel benchmark a different name so that the difference can be easily recognized by any interested parties - customers, prospects, and analysts. The naming convention is SD-Parallel for Sales & Distribution - Parallel.

In January 2009, a new version of the SAP enhancement package 4 for SAP ERP 6.0 (Unicode) Sales and Distribution (SD) Benchmark was released. This new release has higher CPU requirements and so yields 25-50% fewer users compared to the previous (non-Unicode) Standard Sales and Distribution (SD) Benchmark. Between 10% and 30% of this greater load is due to the extra overhead of processing the larger character strings that Unicode encoding produces.

Unicode is a computing standard that allows for the representation and manipulation of text expressed in most of the world's writing systems. Before the Unicode requirement, this benchmark used ASCII characters meaning each was just 1 byte. The new version of the benchmark requires Unicode characters and the Application layer (where ~90% of the cycles in this benchmark are spent) uses a new encoding, UTF-16, which uses 2 bytes to encode most characters (including all ASCII characters) and 4 bytes for some others. This requires computers to do more computation and use more bandwidth and storage for most character strings. Refer to the above SAP Note for more details.
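
The size difference is easy to see with Python's built-in codecs. This is a minimal sketch; the sample strings and the helper name are assumptions for illustration and are not taken from the benchmark data.

```python
def utf16_bytes(text: str) -> int:
    """Size of text in UTF-16 (little-endian, without a byte-order mark)."""
    return len(text.encode("utf-16-le"))


ascii_text = "SALES ORDER 4711"
print(len(ascii_text.encode("ascii")), "bytes as ASCII")
print(utf16_bytes(ascii_text), "bytes as UTF-16")

# Characters outside the Basic Multilingual Plane need a surrogate pair.
print(utf16_bytes("\U0001D11E"), "bytes for one supplementary character")
```

An ASCII-only string doubles in size under UTF-16, which is where the extra bandwidth and storage come from.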

See Also

Disclosure Statement

SAP enhancement package 4 for SAP ERP 6.0 (Unicode) Sales and Distribution Benchmark, results as of 9/19/2010. For more details, see http://www.sap.com/benchmark. SD-Parallel, Four Sun Fire X4470 (each 4 processors, 32 cores, 64 threads) 40,000 SAP SD Users, Cert# 2010039. SD-Parallel, Two Sun Fire X4470 (each 4 processors, 32 cores, 64 threads) 21,000 SAP SD Users, Cert# 2010029. SD 2-Tier, HP ProLiant DL980 G7 (8 processors, 64 cores, 128 threads) 18,180 SAP SD Users, Cert# 2010028. SD 2-Tier, Fujitsu PRIMEQUEST 1800E (8 processors, 64 cores, 128 threads) 16,000 SAP SD Users, Cert# 2010010. SD-Parallel, Four Sun Blade X6270 (each 2 processors, 8 cores, 16 threads) 13,718 SAP SD Users, Cert# 2009041. SD 2-Tier, HP ProLiant DL580 G7 (4 processors, 32 cores, 64 threads) 10,490 SAP SD Users, Cert# 2010032. SD 2-Tier, IBM System x3850 X5 (4 processors, 32 cores, 64 threads) 10,450 SAP SD Users, Cert# 2010012. SD 2-Tier, Fujitsu PRIMERGY RX600 S5 (4 processors, 32 cores, 64 threads) 9,560 SAP SD Users, Cert# 2010017. SD-Parallel, Two Sun Blade X6270 (each 2 processors, 8 cores, 16 threads) 7,220 SAP SD Users, Cert# 2009040. SD-Parallel, Sun Blade X6270 (2 processors, 8 cores, 16 threads) 3,800 SAP SD Users, Cert# 2009039. SD 2-Tier, Sun Fire X4270 (2 processors, 8 cores, 16 threads) 3,800 SAP SD Users, Cert# 2009033.

SAP ERP 6.0 (Unicode) Sales and Distribution Benchmark, results as of 9/19/2010. SD-Parallel, Five IBM System p 570 (each 8 processors, 16 cores, 32 threads) 37,040 SAP SD Users, Cert# 2008013.

Tuesday Jun 29, 2010

Sun Fire X2270 M2 Achieves Leading Single Node Results on ANSYS FLUENT Benchmark

Oracle's Sun Fire X2270 M2 server produced leading single node performance results running the ANSYS FLUENT benchmark cases as compared to the best single node results currently posted at the ANSYS FLUENT website. ANSYS FLUENT is a prominent MCAE application used for computational fluid dynamics (CFD).

  • The Sun Fire X2270 M2 server outperformed all single node systems in 5 of 6 test cases at the 12 core level, beating systems from Cray and SGI.
  • For the truck_14m test, the Sun Fire X2270 M2 server outperformed all single node systems at all posted core counts, beating systems from SGI, Cray and HP. When considering performance on a single node, the truck_14m model is most representative of customer CFD model sizes in the test suite.
  • The Sun Fire X2270 M2 server with 12 cores performed up to 1.3 times faster than the previous generation Sun Fire X2270 server with 8 cores.

Performance Landscape

Results are presented for six of the seven ANSYS FLUENT benchmark tests. The seventh test is not a practical test for a single system. Results are ratings, where bigger is better. A rating is the number of jobs that could be run in a single day (86,400 / run time). Competitive results are from the ANSYS FLUENT benchmark website as of 25 June 2010.
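
The rating formula can be inverted to recover an approximate run time from a published rating. The sketch below assumes only the definition given above; the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def rating(run_time_seconds: float) -> float:
    """Number of jobs that could be run back to back in one day."""
    return SECONDS_PER_DAY / run_time_seconds


def run_time_from_rating(value: float) -> float:
    """Elapsed seconds per job implied by a rating."""
    return SECONDS_PER_DAY / value


# A truck_14m rating of 94.8 implies roughly 15 minutes per run.
print(f"{run_time_from_rating(94.8):.1f} seconds")
```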

Single System Performance

ANSYS FLUENT Benchmark Tests
Results are Ratings, Bigger is Better
System Benchmark Test
eddy_417k turbo_500k aircraft_2m sedan_4m truck_14m truck_poly_14m
Sun Fire X2270 M2 1129.4 5391.6 1105.9 814.1 94.8 96.4
SGI Altix 8400EX 1338.0 5308.8 1073.3 796.3 - -
SGI Altix XE1300C 1299.2 5284.4 1071.3 801.3 90.2 -
Cray CX1 1060.8 5127.6 1069.6 808.6 86.1 87.5

Scaling of Benchmark Test truck_14m

ANSYS FLUENT truck_14m Model
Results are Ratings, Bigger is Better
System Cores Used
12 8 4 2 1
Sun Fire X2270 M2 94.8 73.6 41.4 21.0 10.4
SGI Altix XE1300C 90.2 60.9 41.1 20.7 9.0
Cray CX1 (X5570) - 71.7 33.2 18.9 8.1
HP BL460 G6 (X5570) - 70.3 38.6 19.6 9.2

Comparing System Generations, Sun Fire X2270 M2 to Sun Fire X2270

ANSYS FLUENT Benchmark Tests
Results are Ratings, Bigger is Better
System Benchmark Test
eddy_417k turbo_500k aircraft_2m sedan_4m truck_14m truck_poly_14m
Sun Fire X2270 M2 1129.4 5374.8 1103.8 814.1 94.8 96.4
Sun Fire X2270 981.5 4163.9 862.7 691.2 73.6 73.3

Ratio 1.15 1.29 1.28 1.18 1.29 1.32

Results and Configuration Summary

Hardware Configuration:

Sun Fire X2270 M2
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory
1 x 500 GB 7200 rpm SATA internal HDD

Sun Fire X2270
2 x 2.93 GHz Intel Xeon X5570 processors
48 GB memory
2 x 24 GB internal striped SSDs

Software Configuration:

64-bit SUSE Linux Enterprise Server 10 SP 3 (SP 2 for X2270)
ANSYS FLUENT V12.1.2
ANSYS FLUENT Benchmark Test Suite

Benchmark Description

The following description is from the ANSYS FLUENT website:

The FLUENT benchmarks suite comprises of a set of test cases covering a large range of mesh sizes, physical models and solvers representing typical industry usage. The cases range in size from a few 100 thousand cells to more than 100 million cells. Both the segregated and coupled implicit solvers are included, as well as hexahedral, mixed and polyhedral cell cases. This broad coverage is expected to demonstrate the breadth of FLUENT performance on a variety of hardware platforms and test cases.

The performance of a CFD code will depend on several factors, including size and topology of the mesh, physical models, numerics and parallelization, compilers and optimization, in addition to performance characteristics of the hardware where the simulation is performed. The principal objective of this benchmark suite is to provide comprehensive and fair comparative information of the performance of FLUENT on available hardware platforms.

About the ANSYS FLUENT 12 Benchmark Test Suite

    CFD models tend to be very large where grid refinement is required to capture conditions accurately in the boundary layer region adjacent to the body over which flow is occurring. Fine grids are also required to determine accurate turbulence conditions. As such, these models can run for many hours or even days, even when using a large number of processors.

See Also

Disclosure Statement

All information on the FLUENT website (http://www.fluent.com) is Copyrighted 1995-2010 by ANSYS Inc. Results as of June 25, 2010.

Monday Jun 28, 2010

Sun Fire X4470 2-Node Configuration Sets World Record for SAP SD-Parallel Benchmark

Using two of Oracle's Sun Fire X4470 servers to run the SAP Enhancement Package 4 for SAP ERP 6.0 (Unicode) Sales and Distribution – Parallel (SD-Parallel) standard application benchmark, Oracle delivered a world record result. This was run using Oracle Solaris 10 and Oracle 11g Real Application Clusters (RAC) software.

  • The Sun Fire X4470 servers result of 21,000 users delivered more than twice the performance of the IBM System x3850 X5 system result of 10,450 users.

  • The Sun Fire X4470 servers result of 21,000 users beat the HP ProLiant DL980 G7 system result of 18,180 users. Both solutions used 8 Intel Xeon X7560 processors.

  • The Sun Fire X4470 servers result of 21,000 users beat the Fujitsu PRIMEQUEST 1800E system result of 16,000 users. Both solutions used 8 Intel Xeon X7560 processors.

  • This result shows how a complete software and hardware solution from Oracle, using Oracle RAC and Oracle Solaris along with Oracle's Sun servers, can provide a superior performing solution when compared to the competition.

Performance Landscape

SAP Enhancement Package 4 for SAP ERP 6.0 (Unicode) Sales and Distribution Benchmark, select results presented in decreasing performance order. Both Parallel and 2-Tier solution results are listed in the table.

System (processors, memory) | OS, Database | Users | SAPS | Type | Date
Two Sun Fire X4470, 4xIntel Xeon X7560 @2.26GHz, 256 GB | Solaris 10, Oracle 11g Real Application Clusters | 21,000 | 115,300 | Parallel | 28-Jun-10
HP DL980 G7, 8xIntel Xeon X7560 @2.26GHz, 512 GB | Win Server 2008 R2 DE, SQL Server 2008 | 18,180 | 99,320 | 2-Tier | 21-Jun-10
Fujitsu PRIMEQUEST 1800E, 8xIntel Xeon X7560 @2.26GHz, 512 GB | Win Server 2008 R2 DE, SQL Server 2008 | 16,000 | 87,550 | 2-Tier | 30-Mar-10
Four Sun Blade X6270, 2xIntel Xeon X5570 @2.93GHz, 48 GB | Solaris 10, Oracle 10g Real Application Clusters | 13,718 | 75,762 | Parallel | 12-Oct-09
IBM System x3850 X5, 4xIntel Xeon X7560 @2.26GHz, 256 GB | Win Server 2008 EE, DB2 9.7 | 10,450 | 57,120 | 2-Tier | 30-Mar-10
HP DL580 G7, 4xIntel Xeon X7560 @2.26GHz, 256 GB | Win Server 2008 R2 DE, SQL Server 2008 | 10,445 | 57,020 | 2-Tier | 21-Jun-10
Fujitsu PRIMERGY RX600 S5, 4xIntel Xeon X7560 @2.26GHz, 512 GB | Win Server 2008 R2 DE, SQL Server 2008 | 9,560 | 52,300 | 2-Tier | 06-May-10
Two Sun Blade X6270, 2xIntel Xeon X5570 @2.93GHz, 48 GB | Solaris 10, Oracle 10g Real Application Clusters | 7,220 | 39,420 | Parallel | 12-Oct-09
One Sun Blade X6270, 2xIntel Xeon X5570 @2.93GHz, 48 GB | Solaris 10, Oracle 10g Real Application Clusters | 3,800 | 20,750 | Parallel | 12-Oct-09
Sun Fire X4270, 2xIntel Xeon X5570 @2.93GHz, 48 GB | Solaris 10, Oracle 10g | 3,800 | 21,000 | 2-Tier | 21-Aug-09

Complete benchmark results may be found at the SAP benchmark website http://www.sap.com/solutions/benchmark/sd.epx.

Results and Configuration Summary

Hardware Configuration:

2 x Sun Fire X4470 servers, each with
4 x Intel Xeon X7560 2.26 GHz (4 chips, 32 cores, 64 threads)
256 GB memory

Software Configuration:

Oracle 11g Real Application Clusters (RAC)
Oracle Solaris 10

Results Summary:

Number of SAP SD benchmark users:
21,000
Average dialog response time:
0.93 seconds
Throughput:

Dialog steps/hour:
6,918,000

SAPS:
115,300
SAP Certification:
2010029

Benchmark Description

The SAP Standard Application Sales and Distribution - Parallel (SD-Parallel) Benchmark is a two-tier ERP business test that is indicative of full business workloads of complete order processing and invoice processing, and demonstrates the ability to run both the application and database software on a single system. The SAP Standard Application SD Benchmark represents the critical tasks performed in real-world ERP business environments.

The SD-Parallel Benchmark consists of the same transactions and user interaction steps as the SD Benchmark. This means that the SD-Parallel Benchmark runs the same business processes as the SD Benchmark. The difference between the benchmarks is the technical data distribution.

An additional rule for parallel and distributed databases is one must equally distribute the benchmark users across all database nodes for the used benchmark clients (round-robin-method). Following this rule, all database nodes work on data of all clients. This avoids unrealistic configurations such as having only one client per database node.

The SAP Benchmark Council agreed to give the parallel benchmark a different name so that the difference can be easily recognized by any interested parties - customers, prospects, and analysts. The naming convention is SD-Parallel for Sales & Distribution - Parallel.

SAP is one of the premier world-wide ERP application providers, and maintains a suite of benchmark tests to demonstrate the performance of competitive systems on the various SAP products.

See Also

Disclosure Statement

SAP Enhancement Package 4 for SAP ERP 6.0 (Unicode) Sales and Distribution Benchmark, results as of 6/22/2010. For more details, see http://www.sap.com/benchmark. SD-Parallel, Two Sun Fire X4470 (each 4 processors, 32 cores, 64 threads) 21,000 SAP SD Users, Cert# 2010029. SD 2-Tier, HP ProLiant DL980 G7 (8 processors, 64 cores, 128 threads) 18,180 SAP SD Users, Cert# 2010028. SD 2-Tier, Fujitsu PRIMEQUEST 1800E (8 processors, 64 cores, 128 threads) 16,000 SAP SD Users, Cert# 2010010. SD-Parallel, Four Sun Blade X6270 (each 2 processors, 8 cores, 16 threads) 13,718 SAP SD Users, Cert# 2009041. SD 2-Tier, IBM System x3850 X5 (4 processors, 32 cores, 64 threads) 10,450 SAP SD Users, Cert# 2010012. SD 2-Tier, Fujitsu PRIMERGY RX600 S5 (4 processors, 32 cores, 64 threads) 9,560 SAP SD Users, Cert# 2010017. SD-Parallel, Two Sun Blade X6270 (each 2 processors, 8 cores, 16 threads) 7,220 SAP SD Users, Cert# 2009040. SD-Parallel, Sun Blade X6270 (2 processors, 8 cores, 16 threads) 3,800 SAP SD Users, Cert# 2009039. SD 2-Tier, Sun Fire X4270 (2 processors, 8 cores, 16 threads) 3,800 SAP SD Users, Cert# 2009033.

Wednesday Jun 09, 2010

PeopleSoft Payroll 500K Employees on Sun SPARC Enterprise M5000 World Record

Oracle's Sun SPARC Enterprise M5000 server combined with Oracle's Sun Storage F5100 Flash Array system has produced World Record Performance on PeopleSoft Payroll 9.0 (North American) 500K employees benchmark.
  • The Sun SPARC Enterprise M5000 server and the Sun Storage F5100 Flash Array system processed payroll for 500K employees using 32 payroll threads 18% faster than the IBM z10 EC 2097-709 mainframe as measured for payroll processing tasks in the PeopleSoft Payroll 9.0 (North American) benchmark. This IBM mainframe is rated at 6,512 MIPS.

  • The IBM z10 mainframe with nine 4.4 GHz Gen1 processors has a list price over $6M.

  • The Sun SPARC Enterprise M5000 server together with the Sun Storage F5100 Flash Array system processed payroll for 500K employees using 32 payroll threads 92% faster than an HP rx7640 as measured for payroll processing tasks in the PeopleSoft Payroll 9.0 (North American) benchmark.

  • The Sun Storage F5100 Flash Array system is a high performance, high density solid state flash array that provides a read latency of only 0.5 msec, about 10 times faster than the typical disk latency of 5 msec measured on this benchmark.

  • The Sun SPARC Enterprise M5000 server used the Oracle Solaris 10 operating system and ran with the Oracle 11gR1 database for this benchmark.

Performance Landscape

500K Employees

System | Processor | OS/Database | Payroll Processing Result (minutes) | Run 1 (minutes) | Run 2 (minutes) | Run 3 (minutes) | Num of Streams
Sun M5000 | 8x 2.53GHz SPARC64 VII | Solaris/Oracle 11g | 50.11 | 73.88 | 534.20 | 1267.06 | 32
IBM z10 | 9x 4.4GHz Gen1, 6,512 MIPS | Z/OS /DB2 | 58.96 | 80.5 | 250.68 | 462.6 | 8
HP rx7640 | 8x 1.6GHz Itanium2 | HP-UX/Oracle 11g | 96.17 | 133.63 | 712.72 | 1665.01 | 32

Times under the Run columns above represent Payroll processing and post-processing elapsed times, where:

  • Run 1 = 32 parallel job streams & Single Check option = "No"
  • Run 2 = 32 sequential jobs for Pay Calculation process & 32 parallel job streams for the rest. Single Check option = "Yes"
  • Run 3 = One job stream & Single Check option = "Yes"

Times under the Result column represent Payroll processing only.
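
The "percent faster" comparisons quoted for this benchmark can be reproduced from the Result column. The sketch below assumes that "X% faster" means the ratio of elapsed times minus one; the post does not state the formula, so treat that as an assumption.

```python
def percent_faster(baseline_minutes: float, new_minutes: float) -> float:
    """Percentage by which the new run beats the baseline, using elapsed times."""
    return (baseline_minutes / new_minutes - 1) * 100


sun_m5000 = 50.11  # minutes, Payroll Processing Result
ibm_z10 = 58.96
hp_rx7640 = 96.17

print(f"vs IBM z10:   {percent_faster(ibm_z10, sun_m5000):.0f}% faster")
print(f"vs HP rx7640: {percent_faster(hp_rx7640, sun_m5000):.0f}% faster")
```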

Results and Configuration Summary

Hardware Configuration:

    1 x Sun SPARC Enterprise M5000 (8 x 2.53 GHz/64 GB)
    1 x Sun Storage F5100 Flash Array (40 x 24 GB FMODs)
    1 x StorageTek 2510 (4 x 136 GB SAS 15K RPM)
    4 x Dual-Port SAS Fibre Channel Host Bus Adapters (HBA)

Software Configuration:

    Oracle Solaris 10 10/09
    Oracle PeopleSoft HCM and Campus Solutions 9.00.00.311 64-bit
    Oracle PeopleSoft Enterprise (PeopleTools) 8.49.25 64-bit
    Oracle 11g R1 11.1.0.7 64-bit
    Micro Focus COBOL Server Express 4.0 SP4 64-bit

Benchmark Description

The PeopleSoft 9.0 Payroll (North America) benchmark is a performance benchmark established by PeopleSoft to demonstrate system performance for a range of processing volumes in a specific configuration. This information may be used to determine the software, hardware, and network configurations necessary to support processing volumes. This workload represents large batch runs typical of OLTP workloads during a mass update.

The benchmark measures the run times of five application business processes for a database representing a large organization. The five processes are:

  • Paysheet Creation: Generates a payroll data worksheet for employees, consisting of standard payroll information for each employee for a given pay cycle.

  • Payroll Calculation: Looks at Paysheets and calculates checks for those employees.

  • Payroll Confirmation: Takes information generated by Payroll Calculation and updates the employees' balances with the calculated amounts.

  • Print Advice forms: Takes the information generated by Payroll Calculation and Confirmation and produces an Advice for each employee to report Earnings, Taxes, Deductions, etc.

  • Create Direct Deposit File: Takes information generated by the above processes and produces an electronic transmittal file used to transfer payroll funds directly into an employee's bank account.

For the benchmark, we collect at least three data points with different number of job streams (parallel jobs). This batch benchmark allows a maximum of thirty-two job streams to be configured to run in parallel.

Key Points and Best Practices

Please see the white paper for information on PeopleSoft payroll best practices using flash.

See Also

Disclosure Statement

Oracle PeopleSoft Payroll 9.0 benchmark, Sun SPARC Enterprise M5000 (8 2.53GHz SPARC64 VII) 50.11 min, IBM z10 (9 gen1) 58.96 min, HP rx7640 (8 1.6GHz Itanium2) 96.17 min, www.oracle.com/apps_benchmark/html/white-papers-peoplesoft.html, results 6/3/2010.

Friday Nov 20, 2009

Sun Blade 6048 and Sun Blade X6275 NAMD Molecular Dynamics Benchmark beats IBM BlueGene/L

Significance of Results

A Sun Blade 6048 chassis with 48 Sun Blade X6275 server modules ran benchmarks using the NAMD molecular dynamics applications software. NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. NAMD is driven by major trends in computing and structural biology and received a 2002 Gordon Bell Award.

  • The cluster of 32 Sun Blade X6275 server modules was 9.2x faster than the 512 processor configuration of the IBM BlueGene/L.

  • The cluster of 48 Sun Blade X6275 server modules exhibited excellent scalability for NAMD molecular dynamics simulation, up to 37.8x speedup for 48 blades relative to 1 blade.

  • For the largest molecule considered, the cluster of 48 Sun Blade X6275 server modules achieved a throughput of 0.028 seconds per simulation step.

Molecular dynamics simulation is important to biological and materials science research. Molecular dynamics is used to determine the low energy conformations or shapes of a molecule. These conformations are presumed to be the biologically active conformations.

Performance Landscape

The NAMD Performance Benchmarks web page plots the performance of NAMD when the ApoA1 benchmark is executed on a variety of clusters. The performance is expressed in terms of the time in seconds required to execute one step of the molecular dynamics simulation, multiplied by the number of "processors" on which NAMD executes in parallel. The following table compares the performance of the Sun Blade X6275 cluster to several of the clusters for which performance is reported on the web page. In this table, the performance is expressed in terms of the time in seconds required to execute one step of the molecular dynamics simulation. A smaller number implies better performance.

Cluster Name and Interconnect | Throughput for 128 Cores (seconds per step) | Throughput for 256 Cores (seconds per step) | Throughput for 512 Cores (seconds per step)
Sun Blade X6275 InfiniBand | 0.014 | 0.0073 | 0.0048
Cambridge Xeon/3.0 InfiniPath | 0.016 | 0.0088 | 0.0056
NCSA Xeon/2.33 InfiniBand | 0.019 | 0.010 | 0.008
AMD Opteron/2.2 InfiniPath | 0.025 | 0.015 | 0.008
IBM HPCx PWR4/1.7 Federation | 0.039 | 0.021 | 0.013
SDSC IBM BlueGene/L MPI | 0.108 | 0.061 | 0.044

The following tables report results for NAMD molecular dynamics using a cluster of Sun Blade X6275 server modules. The performance of the cluster is expressed in terms of the time in seconds that is required to execute one step of the molecular dynamics simulation. A smaller number implies better performance.

Blades | Cores | STMV molecule (1): Thruput (secs/step) | spdup | effi'cy | f1 ATPase molecule (2): Thruput (secs/step) | spdup | effi'cy | ApoA1 molecule (3): Thruput (secs/step) | spdup | effi'cy
48 | 768 | 0.0277 | 37.8 | 79% | 0.0075 | 35.2 | 73% | 0.0039 | 22.2 | 46%
36 | 576 | 0.0324 | 32.3 | 90% | 0.0096 | 27.4 | 76% | 0.0045 | 19.3 | 54%
32 | 512 | 0.0368 | 28.4 | 89% | 0.0104 | 25.3 | 79% | 0.0048 | 18.1 | 57%
24 | 384 | 0.0481 | 21.8 | 91% | 0.0136 | 19.3 | 80% | 0.0066 | 13.2 | 55%
16 | 256 | 0.0715 | 14.6 | 91% | 0.0204 | 12.9 | 81% | 0.0073 | 11.9 | 74%
12 | 192 | 0.0875 | 12.0 | 100% | 0.0271 | 9.7 | 81% | 0.0096 | 9.1 | 76%
8 | 128 | 0.1292 | 8.1 | 101% | 0.0337 | 7.8 | 98% | 0.0139 | 6.3 | 79%
4 | 64 | 0.2726 | 3.8 | 95% | 0.0666 | 4.0 | 100% | 0.0224 | 3.9 | 98%
1 | 16 | 1.0466 | 1.0 | 100% | 0.2631 | 1.0 | 100% | 0.0872 | 1.0 | 100%

spdup - speedup versus 1 blade result
effi'cy - speedup efficiency versus 1 blade result

(1) Satellite Tobacco Mosaic Virus (STMV) molecule, 1,066,628 atoms, 12 Angstrom cutoff, Langevin dynamics, 500 time steps
(2) f1 ATPase molecule, 327,506 atoms, 11 Angstrom cutoff, particle mesh Ewald dynamics, 500 time steps
(3) ApoA1 molecule, 92,224 atoms, 12 Angstrom cutoff, particle mesh Ewald dynamics, 500 time steps
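
The speedup and efficiency columns follow from the per-step times. Here is a minimal sketch, assuming speedup is the 1-blade time divided by the N-blade time and efficiency is speedup divided by the blade count. The function names are illustrative, and because the published times are rounded, recomputed values can differ slightly from the table.

```python
def speedup(t_one_blade: float, t_n_blades: float) -> float:
    """Speedup relative to the single-blade result."""
    return t_one_blade / t_n_blades


def efficiency(t_one_blade: float, t_n_blades: float, blades: int) -> float:
    """Fraction of ideal linear scaling achieved on the given number of blades."""
    return speedup(t_one_blade, t_n_blades) / blades


# STMV molecule: 1 blade vs 48 blades, seconds per step from the table.
stmv_1, stmv_48 = 1.0466, 0.0277
print(f"speedup:    {speedup(stmv_1, stmv_48):.1f}x")
print(f"efficiency: {efficiency(stmv_1, stmv_48, 48):.0%}")
```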

Results and Configuration Summary

Hardware Configuration

    48 x Sun Blade X6275, each with
      2 x (2 x 2.93 GHz Intel QC Xeon X5570 (Nehalem) processors)
      2 x (24 GB memory)
      Hyper-Threading (HT) off, Turbo Mode on

Software Configuration

    SUSE Linux Enterprise Server 10 SP2 kernel version 2.6.16.60-0.31_lustre.1.8.0.1-smp
    OpenMPI 1.3.2
    gcc 4.1.2 (1/15/2007), gfortran 4.1.2 (1/15/2007)

Benchmark Description

Molecular dynamics simulation is widely used in biological and materials science research. NAMD is a public-domain molecular dynamics software application for which a variety of molecular input directories are available. Three of these directories define:
  • the Satellite Tobacco Mosaic Virus (STMV) that comprises 1,066,628 atoms
  • the f1 ATPase enzyme that comprises 327,506 atoms
  • the ApoA1 enzyme that comprises 92,224 atoms
Each input directory also specifies the type of molecular dynamics simulation to be performed, for example, Langevin dynamics with a 12 Angstrom cutoff for 500 time steps, or particle mesh Ewald dynamics with an 11 Angstrom cutoff for 500 time steps.

Key Points and Best Practices

Models with large numbers of atoms scale better than models with small numbers of atoms.

The Intel QC X5570 processors include a turbo boost feature coupled with a speed-step option in the CPU section of the Advanced BIOS settings. Under specific circumstances, this can provide CPU overclocking which increases the processor frequency from 2.93GHz to 3.33GHz. This feature was enabled when generating the results reported here.

See Also

Disclosure Statement

NAMD, see http://www.ks.uiuc.edu/Research/namd/performance.html for more information, results as of 11/17/2009.

Wednesday Nov 04, 2009

New TPC-C World Record Sun/Oracle

TPC-C Sun SPARC Enterprise T5440 with Oracle RAC World Record Database Result

Sun and Oracle demonstrate the World's fastest database performance. Sun Microsystems using 12 Sun SPARC Enterprise T5440 servers, 60 Sun Storage F5100 Flash arrays and Oracle 11g Enterprise Edition with Real Application Clusters and Partitioning delivered a world-record TPC-C benchmark result.

  • The 12-node Sun SPARC Enterprise T5440 server cluster delivered a world record TPC-C benchmark result of 7,646,486.7 tpmC and $2.36/tpmC (USD) using Oracle 11g R1 on a configuration available 3/19/10.

  • The 12-node Sun SPARC Enterprise T5440 server cluster beats the performance of the IBM Power 595 (5GHz) with IBM DB2 9.5 database by 26% and has 16% better price/performance on the TPC-C benchmark.

  • The complete Oracle/Sun solution delivered 10.7x better computational density than the IBM configuration (computational density = performance/rack).

  • The complete Oracle/Sun solution used 8 times fewer racks than the IBM configuration.

  • The complete Oracle/Sun solution has 5.9x better power/performance than the IBM configuration.

  • The 12-node Sun SPARC Enterprise T5440 server cluster beats the performance of the HP Superdome (1.6GHz Itanium2) by 87% and has 19% better price/performance on the TPC-C benchmark.

  • The Oracle/Sun solution utilized Sun FlashFire technology to deliver this result. The Sun Storage F5100 flash array was used for database storage.

  • Oracle 11g Enterprise Edition with Real Application Clusters and Partitioning scales and effectively uses all of the nodes in this configuration to produce the world record performance.

  • This result showed Sun and Oracle's integrated hardware and software stacks provide industry-leading performance.

More information on this benchmark will be posted in the next several days.

Performance Landscape

TPC-C results (sorted by tpmC, bigger is better)


System | tpmC | Price/tpmC | Avail | Database | Cluster | Racks | w/KtpmC
12 x Sun SPARC Enterprise T5440 | 7,646,487 | 2.36 USD | 03/19/10 | Oracle 11g RAC | Y | 9 | 9.6
IBM Power 595 | 6,085,166 | 2.81 USD | 12/10/08 | IBM DB2 9.5 | N | 76 | 56.4
HP Integrity Superdome | 4,092,799 | 2.93 USD | 08/06/07 | Oracle 10g R2 | N | 46 | to be added

Avail - Availability date
w/KtpmC - Watts per 1000 tpmC
Racks - clients, servers, storage, infrastructure
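
Several of the Sun versus IBM comparisons can be recomputed from the table. The sketch below is one reading of those claims, not the method used in the post: it assumes "beats by N%" is the tpmC ratio minus one, "better price/performance" is the relative reduction in $/tpmC, and "better power/performance" is the ratio of watts per 1000 tpmC. The dictionary keys and function names are illustrative.

```python
sun = {"tpmC": 7_646_487, "price_per_tpmC": 2.36, "watts_per_ktpmC": 9.6}
ibm = {"tpmC": 6_085_166, "price_per_tpmC": 2.81, "watts_per_ktpmC": 56.4}


def performance_gain(a: dict, b: dict) -> float:
    """Fraction by which system a exceeds system b in tpmC."""
    return a["tpmC"] / b["tpmC"] - 1


def price_performance_gain(a: dict, b: dict) -> float:
    """Fractional reduction in $/tpmC of system a relative to system b."""
    return 1 - a["price_per_tpmC"] / b["price_per_tpmC"]


def power_performance_ratio(a: dict, b: dict) -> float:
    """How many times fewer watts per 1000 tpmC system a needs than system b."""
    return b["watts_per_ktpmC"] / a["watts_per_ktpmC"]


print(f"performance:       {performance_gain(sun, ibm):.0%} higher")
print(f"price/performance: {price_performance_gain(sun, ibm):.0%} better")
print(f"power/performance: {power_performance_ratio(sun, ibm):.1f}x better")
```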

Sun and IBM TPC-C Response times


System | tpmC | Response Time New Order 90th% | Response Time New Order Average
12 x Sun SPARC Enterprise T5440 | 7,646,487 | 0.170 | 0.168
IBM Power 595 | 6,085,166 | 1.69 | 1.22
Response Time Ratio - Sun Better | | 9.9x | 7.3x

Sun uses the 7x comparison to highlight the difference in response times between Sun's solution and IBM's, although Sun is nearly 10x faster on New Order response time at the 90th percentile.

It is also interesting to note that none of Sun's response times, average or 90th percentile, for any transaction is over 0.25 seconds, while IBM does not have even one interactive transaction, not even the menu, below 0.50 seconds. Graphs of Sun's and IBM's response times for New-Order can be found in the full disclosure reports on the TPC website (TPC-C Official Result Page).

Results and Configuration Summary

Hardware Configuration:

    9 racks used to hold

    Servers:
      12 x Sun SPARC Enterprise T5440
      4 x 1.6 GHz UltraSPARC T2 Plus
      512 GB memory
      10 GbE network for cluster
    Storage:
      60 x Sun Storage F5100 Flash Array
      61 x Sun Fire X4275, Comstar SAS target emulation
      24 x Sun StorageTek 6140 (16 x 300 GB SAS 15K RPM)
      6 x Sun Storage J4400
      3 x 80-port Brocade FC switches
    Clients:
      24 x Sun Fire X4170, each with
      2 x 2.53 GHz X5540
      48 GB memory

Software Configuration:

    Solaris 10 10/09
    OpenSolaris 6/09 (COMSTAR) for Sun Fire X4275
    Oracle 11g Enterprise Edition with Real Application Clusters and Partitioning
    Tuxedo CFS-R Tier 1
    Sun Web Server 7.0 Update 5

Benchmark Description

TPC-C is an OLTP system benchmark. It simulates a complete environment where a population of terminal operators executes transactions against a database. The benchmark is centered around the principal activities (transactions) of an order-entry environment. These transactions include entering and delivering orders, recording payments, checking the status of orders, and monitoring the level of stock at the warehouses.

See Also

Disclosure Statement

TPC Benchmark C, tpmC, and TPC-C are trademarks of the Transaction Processing Performance Council (TPC). 12-node Sun SPARC Enterprise T5440 Cluster (1.6GHz UltraSPARC T2 Plus, 4 processor) with Oracle 11g Enterprise Edition with Real Application Clusters and Partitioning, 7,646,486.7 tpmC, $2.36/tpmC. Available 3/19/10. IBM Power 595 (5GHz Power6, 32 chips, 64 cores, 128 threads) with IBM DB2 9.5, 6,085,166 tpmC, $2.81/tpmC, available 12/10/08. HP Integrity Superdome (1.6GHz Itanium2, 64 processors, 128 cores, 256 threads) with Oracle 10g Enterprise Edition, 4,092,799 tpmC, $2.93/tpmC. Available 8/06/07. Source: www.tpc.org, results as of 11/5/09.

Tuesday Oct 13, 2009

Sun T5440 Oracle BI EE Sun SPARC Enterprise T5440 World Record

A workload for Oracle BI EE, a component of Oracle Fusion Middleware, was run on two Sun SPARC Enterprise T5440 servers and achieved world record performance.
  • Two Sun SPARC Enterprise T5440 servers with four 1.6 GHz UltraSPARC T2 Plus processors delivered the best performance of 50K concurrent users on the Oracle BI EE 10.1.3.4 benchmark with Oracle 11g database running on free and open Solaris 10.

  • The two node Sun SPARC Enterprise T5440 servers with Oracle BI EE running on Solaris 10 using 8 Solaris Containers shows 1.8x scaling over Sun's previous one node SPARC Enterprise T5440 server result with 4 Solaris Containers.

  • The two node SPARC Enterprise T5440 servers demonstrated the performance and scalability of the UltraSPARC T2 Plus processor, showing that 50K users can be serviced with a 0.2776 sec response time.

  • The Sun SPARC Enterprise T5220 server was used as an NFS server with 4 internal SSDs and the ZFS file system which showed significant I/O performance improvement over traditional disk for Business Intelligence Web Catalog activity.

  • Oracle Fusion Middleware provides a family of complete, integrated, hot pluggable and best-of-breed products known for enabling enterprise customers to create and run agile and intelligent business applications. Oracle BI EE performance demonstrates why so many customers rely on Oracle Fusion Middleware as their foundation for innovation.

  • IBM has not published any POWER6 processor based results on this important benchmark.

Performance Landscape

System                           Chips  GHz  Processor Type      Users
2 x Sun SPARC Enterprise T5440   8      1.6  UltraSPARC T2 Plus  50,000
1 x Sun SPARC Enterprise T5440   4      1.6  UltraSPARC T2 Plus  28,000
5 x Sun Fire T2000               1      1.2  UltraSPARC T1       10,000

Results and Configuration Summary

Hardware Configuration:

    2 x Sun SPARC Enterprise T5440 (1.6GHz/128GB)
    1 x Sun SPARC Enterprise T5220 (1.2GHz/64GB) and 4 SSDs (used as NFS server)

Software Configuration:

    Solaris 10 05/09
    Oracle BI EE 10.1.3.4
    Oracle Fusion Middleware
    Oracle 11gR1

Benchmark Description

The objective of this benchmark is to highlight how Oracle BI EE can support pervasive deployments in large enterprises, using minimal hardware, by simulating an organization that needs to support more than 25,000 active concurrent users, each operating in mixed mode: ad-hoc reporting, application development, and report viewing.

The user population was divided into a mix of administrative users and business users. A maximum of 28,000 concurrent users were actively interacting and working in the system during the steady-state period. The tests executed 580 transactions per second, with think times of 60 seconds per user, between requests. In the test scenario 95% of the workload consisted of business users viewing reports and navigating within dashboards. The remaining 5% of the concurrent users, categorized as administrative users, were doing application development.

The benchmark scenario used a typical business user sequence of dashboard navigation, report viewing, and drill down. For example, a Service Manager logs into the system and navigates to his own set of dashboards, viz. "Service Manager". The user then selects the "Service Effectiveness" dashboard, which shows him four distinct reports, "Service Request Trend", "First Time Fix Rate", "Activity Problem Areas", and "Cost Per completed Service Call - 2002 till 2005". The user then proceeds to view the "Customer Satisfaction" dashboard, which also contains a set of 4 related reports. He then proceeds to drill-down on some of the reports to see the detail data. Then the user proceeds to more dashboards, for example "Customer Satisfaction" and "Service Request Overview". After navigating through these dashboards, he logs out of the application.

This benchmark did not use a synthetic database schema. The benchmark tests were run on a full production version of the Oracle Business Intelligence Applications with a fully populated underlying database schema. The business processes in the test scenario closely represent a true customer scenario.

See Also

Disclosure Statement

Oracle BI EE benchmark results 10/13/2009, see

CP2K Life Sciences, Ab-initio Dynamics - Sun Blade 6048 Chassis with Sun Blade X6275 - Scalability and Throughput with Quad Data Rate InfiniBand

Significance of Results

Clusters of Sun Blade X6275 and X6270 server modules were used to run benchmarks using the CP2K ab-initio dynamics applications software.

  • For the X6270 cluster with Dual Data Rate (DDR) InfiniBand the rate of increase of scalability slows dramatically at 16 nodes, whereas for the X6275 cluster with QDR InfiniBand the scalability continues to 72 nodes.
  • For 64 nodes, the speed of the Sun Blade X6275 cluster with QDR InfiniBand was 2.7X that of a Sun Blade X6270 cluster with DDR InfiniBand.

Ab-initio dynamics simulation is important to materials science research.  Dynamics simulation is used to determine the trajectories of atoms or molecules over time.

Performance Landscape

The CP2K Bulk Water Benchmarks web page plots the performance of CP2K ab-initio dynamics benchmarks that have from 32 to 512 water molecules for a cluster that comprises two 2.66GHz Xeon E5430 quad core CPUs per node and that uses Dual Data Rate InfiniBand.

The following table reports the execution time for the 512 water molecule benchmark when executed on the Sun Blade X6275 cluster having Quad Data Rate InfiniBand and on the Sun Blade X6270 cluster having Dual Data Rate InfiniBand. Each node of either Sun Blade cluster comprises two 2.93GHz Intel Xeon X5570 quad core CPUs. In the following table, the performance is expressed in terms of the "wall clock" time in seconds required to execute ten steps of the ab-initio dynamics simulation for 512 water molecules. A smaller number implies better performance.

Number of Nodes   X6275 QDR InfiniBand      X6270 DDR InfiniBand
                  (seconds for 10 steps)    (seconds for 10 steps)
96                -                         1184.36
72                564.16                    -
64                598.41                    1591.35
32                706.82                    1436.49
24                950.02                    1752.20
16                1227.73                   2119.50
12                1440.16                   1739.26
8                 1876.95                   2120.73
4                 3408.39                   3705.44
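
The comparison between the two clusters can be reproduced from the timings in the table. The following is a minimal sketch, not part of the original benchmark kit; the names `TIMINGS` and `qdr_advantage` are illustrative assumptions, and the values are simply the table entries.

```python
# Wall-clock seconds for 10 steps of the 512-water-molecule benchmark,
# copied from the table above. None marks a node count with no reported run.
TIMINGS = {
    # nodes: (X6275 with QDR InfiniBand, X6270 with DDR InfiniBand)
    96: (None, 1184.36),
    72: (564.16, None),
    64: (598.41, 1591.35),
    32: (706.82, 1436.49),
    24: (950.02, 1752.20),
    16: (1227.73, 2119.50),
    12: (1440.16, 1739.26),
    8: (1876.95, 2120.73),
    4: (3408.39, 3705.44),
}


def qdr_advantage(nodes):
    """Return DDR seconds / QDR seconds for a node count, or None if a value is missing."""
    qdr, ddr = TIMINGS[nodes]
    if qdr is None or ddr is None:
        return None
    return ddr / qdr


if __name__ == "__main__":
    for nodes in sorted(TIMINGS):
        ratio = qdr_advantage(nodes)
        label = "n/a" if ratio is None else f"{ratio:.2f}x"
        print(f"{nodes:3d} nodes: X6275/QDR advantage {label}")
```

The ratio is smallest at 4 nodes and largest at 64 nodes, where it rounds to the 2.7X quoted above. Because Turbo Mode was on for the X6275 runs and off for the X6270 runs, the ratio reflects more than the interconnect alone.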

Results and Configuration Summary

Hardware Configuration:

    Sun Blade[tm] 6048 Modular System with 3 shelves, each shelf with
      12 x Sun Blade X6275, each blade with
        2 x (2 x 2.93 GHz Intel QC Xeon X5570 processors)
        2 x (24 GB memory)
        Hyper-Threading (HT) off, Turbo Mode on
    QDR InfiniBand
    96 x Sun Blade X6270, each blade with
      2 x 2.93 GHz Intel QC Xeon X5570 processors
      1 x (24 GB memory)
      Hyper-Threading (HT) off, Turbo Mode off
    DDR InfiniBand
Software Configuration:
    SUSE Linux Enterprise Server 10 SP2 kernel version 2.6.16.60-0.31_lustre.1.8.0.1-smp
    OpenMPI 1.3.2
    Sun Studio 12 f90 compiler, ScaLAPACK, BLACS and Performance Libraries
    FFTW (Fastest Fourier Transform in the West) 3.2.1

Benchmark Description

CP2K is a parallel ab-initio dynamics code that is designed to perform atomistic and molecular simulations of solid state, liquid, molecular and biological systems. It provides a general framework for different methods, such as density functional theory (DFT) using a mixed Gaussian and plane waves approach (GPW), and classical pair and many-body potentials.

Ab-initio dynamics simulation is widely used in materials science research. CP2K is a public-domain ab-initio dynamics software application.

Key Points and Best Practices

  • QDR InfiniBand scales better than DDR InfiniBand.
  • The Intel QC X5570 processors include a turbo boost feature coupled with a speed-step option in the CPU section of the Advanced BIOS settings. Under specific circumstances, this can provide CPU overclocking, which increases the processor frequency from 2.93GHz to 3.2GHz. This feature was enabled for the X6275 and disabled for the X6270 when generating the results reported here.

See Also

Disclosure Statement

CP2K, see http://cp2k.berlios.de/ for more information, results as of 10/13/2009.

SAP 2-tier SD-Parallel on Sun Blade X6270 1-node, 2-node and 4-node

Significance of Results

  • Four Sun Blade X6270 (2 processors, 8 cores, 16 threads), running SAP ERP application Release 6.0 Enhancement Pack 4 (Unicode) with Oracle Database on top of Solaris 10 OS delivered the highest eight-processor result on the two-tier SAP SD-Parallel Standard Application Benchmark, as of Oct 12th, 2009.

  • Four Sun Blade X6270 servers with Intel Xeon X5570 processors achieved a 1.9x performance improvement over two Sun Blade X6270 servers with the same processors.

  • Two Sun Blade X6270 (2 processors, 8 cores, 16 threads), running SAP ERP application Release 6.0 Enhancement Pack 4 (Unicode) with Oracle Database on top of Solaris 10 OS delivered the highest four-processor result on the two-tier SAP SD-Parallel Standard Application Benchmark, as of Oct 12th, 2009.

  • Two Sun Blade X6270 servers with Intel Xeon X5570 processors achieved a 1.9x performance improvement over a single 2-processor Sun Blade X6270 system.

  • A one node Sun Blade X6270 server with Intel Xeon X5570 processors running Oracle RAC delivers the same result as a Sun Fire X4270 server with Intel Xeon X5570 processors running Oracle with no performance difference between Oracle 10g and Oracle 10g RAC.

  • This benchmark highlights the near-linear scaling of Oracle 10g Real Application Cluster runs on Sun Microsystems hardware in a SAP environment.

  • In January 2009, a new version, the Two-tier SAP ERP 6.0 Enhancement Pack 4 (Unicode) Standard Sales and Distribution (SD) Benchmark, was released. This new release has higher CPU requirements and so yields 25-50% fewer users than the previous Two-tier SAP ERP 6.0 (non-Unicode) Standard Sales and Distribution (SD) Benchmark. 10-30% of this is due to the extra overhead of processing the larger character strings that Unicode encoding produces. See this SAP Note for more details.

  • Unicode is a computing standard that allows for the representation and manipulation of text expressed in most of the world's writing systems. Before the Unicode requirement, this benchmark used ASCII characters meaning each was just 1 byte. The new version of the benchmark requires Unicode characters and the Application layer (where ~90% of the cycles in this benchmark are spent) uses a new encoding, UTF-16, which uses 2 bytes to encode most characters (including all ASCII characters) and 4 bytes for some others. This requires computers to do more computation and use more bandwidth and storage for most character strings. Refer to the above SAP Note for more details.
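
The storage cost described in the last bullet can be seen directly by encoding a few strings. This is a small illustration using Python's standard codecs, not anything taken from the SAP benchmark itself; the sample strings and the helper name `utf16_bytes` are arbitrary choices.

```python
def utf16_bytes(text):
    """Number of bytes needed to store text as UTF-16 (little endian, no byte-order mark)."""
    return len(text.encode("utf-16-le"))


samples = [
    "SAP",            # plain ASCII: 1 byte per character in ASCII, 2 in UTF-16
    "Gr\u00f6\u00dfe",        # German word with non-ASCII letters, still 2 bytes each
    "\u6ce8\u6587",           # two CJK characters, 2 bytes each
    "\U0001F600",     # a character outside the Basic Multilingual Plane: 4 bytes
]

if __name__ == "__main__":
    for text in samples:
        print(f"{len(text)} character(s) -> {utf16_bytes(text)} bytes in UTF-16")
```

For an all-ASCII string such as an order number, the UTF-16 form is exactly twice the size of the 1-byte-per-character form, which is the doubling the bullet refers to.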

Performance Landscape

SAP SD-Parallel 2-Tier Performance Table (in decreasing performance order).

SAP ERP 6.0 Enhancement Pack 4 (Unicode) Results
(New version of the benchmark as of January 2009)

System                                                    OS, Database                                      Users   SAP ERP/ECC Release     SAPS    Date
Four Sun Blade X6270, 2xIntel Xeon X5570 @2.93GHz, 48 GB  Solaris 10, Oracle 10g Real Application Clusters  13,718  2009 6.0 EP4 (Unicode)  75,762  12-Oct-09
Two Sun Blade X6270, 2xIntel Xeon X5570 @2.93GHz, 48 GB   Solaris 10, Oracle 10g Real Application Clusters  7,220   2009 6.0 EP4 (Unicode)  39,420  12-Oct-09
One Sun Blade X6270, 2xIntel Xeon X5570 @2.93GHz, 48 GB   Solaris 10, Oracle 10g Real Application Clusters  3,800   2009 6.0 EP4 (Unicode)  20,750  12-Oct-09
Sun Fire X4270, 2xIntel Xeon X5570 @2.93GHz, 48 GB        Solaris 10, Oracle 10g                            3,800   2009 6.0 EP4 (Unicode)  21,000  21-Aug-09

Complete benchmark results may be found at the SAP benchmark website http://www.sap.com/benchmark.

Results and Configuration Summary

Four Sun Blade X6270 Servers, each with two Intel Xeon X5570 2.93 GHz (2 processors, 8 cores, 16 threads)

    Number of SAP SD benchmark users: 13,718
    Average dialog response time: 0.86 seconds
    Throughput:
      Dialog steps/hour: 4,545,729
      SAPS: 75,762
    SAP Certification: 2009041

Two Sun Blade X6270 Servers, each with two Intel Xeon X5570 2.93 GHz (2 processors, 8 cores, 16 threads)

    Number of SAP SD benchmark users: 7,220
    Average dialog response time: 0.99 seconds
    Throughput:
      Dialog steps/hour: 2,365,000
      SAPS: 39,420
    SAP Certification: 2009040

One Sun Blade X6270 Server, with two Intel Xeon X5570 2.93 GHz (2 processors, 8 cores, 16 threads)

    Number of SAP SD benchmark users: 3,800
    Average dialog response time: 0.99 seconds
    Throughput:
      Dialog steps/hour: 1,245,000
      SAPS: 20,750
    SAP Certification: 2009039

Software:

    Oracle 10g Real Application Clusters
    Solaris 10 OS
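
The reported SAPS and scaling figures can be cross-checked with a few lines of arithmetic. The sketch below assumes SAP's published definition that 100 SAPS corresponds to 6,000 dialog steps per hour (so SAPS = dialog steps per hour / 60); that definition is not stated in this post. The dictionary layout and function names are illustrative only.

```python
# Values copied from the three SD-Parallel results listed above.
RESULTS = {
    4: {"users": 13718, "dialog_steps_per_hour": 4545729, "saps": 75762},
    2: {"users": 7220, "dialog_steps_per_hour": 2365000, "saps": 39420},
    1: {"users": 3800, "dialog_steps_per_hour": 1245000, "saps": 20750},
}


def saps_from_dialog_steps(steps_per_hour):
    """Assumed definition: 100 SAPS = 6,000 dialog steps per hour."""
    return steps_per_hour / 60


def user_scaling(more_nodes, fewer_nodes):
    """Ratio of benchmark users between two node counts."""
    return RESULTS[more_nodes]["users"] / RESULTS[fewer_nodes]["users"]


if __name__ == "__main__":
    for nodes, r in RESULTS.items():
        derived = saps_from_dialog_steps(r["dialog_steps_per_hour"])
        print(f"{nodes} node(s): derived SAPS {derived:,.0f}, reported {r['saps']:,}")
    print(f"4 nodes vs 2 nodes: {user_scaling(4, 2):.2f}x users")
    print(f"2 nodes vs 1 node:  {user_scaling(2, 1):.2f}x users")
```

The derived values agree with the reported SAPS to within the rounding used in the certifications, and both user ratios round to the 1.9x scaling quoted in the bullets above.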

Benchmark Description

The SAP Standard Application Sales and Distribution - Parallel (SD-Parallel) Benchmark is a two-tier ERP business test that is indicative of full business workloads of complete order processing and invoice processing, and demonstrates the ability to run both the application and database software on a single system. The SAP Standard Application SD Benchmark represents the critical tasks performed in real-world ERP business environments.

SD Versus SD-Parallel

The SD-Parallel Benchmark consists of the same transactions and user interaction steps as the SD Benchmark. This means that the SD-Parallel Benchmark runs the same business processes as the SD Benchmark. The difference between the benchmarks is the technical data distribution.

An Additional Rule for Parallel and Distributed Databases

The additional rule is: Equally distribute the benchmark users across all database nodes for the used benchmark clients (round-robin-method). Following this rule, all database nodes work on data of all clients. This avoids unrealistic configurations such as having only one client per database node.

The SAP Benchmark Council agreed to give the parallel benchmark a different name so that the difference can be easily recognized by any interested parties - customers, prospects, and analysts. The naming convention is SD-Parallel for Sales & Distribution - Parallel.

SAP is one of the premier world-wide ERP application providers, and maintains a suite of benchmark tests to demonstrate the performance of competitive systems on the various SAP products.
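
The round-robin rule described above can be pictured with a short sketch. This is only an illustration of the idea, not the SAP benchmark driver: the client names, the per-client user counts, and the node names are hypothetical, and `assign_round_robin` is an invented helper.

```python
from collections import Counter
from itertools import cycle


def assign_round_robin(users_per_client, db_nodes):
    """Place each client's users on database nodes in rotation.

    Returns {client: Counter(node -> number of users)}. The rotation continues
    from one client to the next, so the per-node totals stay balanced.
    """
    node_cycle = cycle(db_nodes)
    assignment = {}
    for client, n_users in users_per_client.items():
        counts = Counter()
        for _ in range(n_users):
            counts[next(node_cycle)] += 1
        assignment[client] = counts
    return assignment


# Hypothetical split of 13,718 users over four benchmark clients and four nodes.
USERS_PER_CLIENT = {"client_A": 3430, "client_B": 3430, "client_C": 3429, "client_D": 3429}
DB_NODES = ["node1", "node2", "node3", "node4"]

if __name__ == "__main__":
    for client, counts in assign_round_robin(USERS_PER_CLIENT, DB_NODES).items():
        print(client, dict(counts))
```

With this placement every node serves users from every client, which is the property the rule is meant to guarantee; the opposite extreme, one client pinned to each node, is the configuration the rule excludes.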

Disclosure Statement

SAP SD benchmark based on SAP enhancement package 4 for SAP ERP 6.0 (Unicode) application benchmark as of Oct 12th, 2009: Four Sun Blade X6270 (each 2 processors, 8 cores, 16 threads) 13,718 SAP SD Users, 2x 2.93 GHz Intel Xeon x5570, each 48 GB memory, running two-tier SAP Sales and Distribution Parallel (SD-Parallel) standard SAP SD benchmark with Oracle 10g Real Application Clusters and Solaris 10, Cert# 2009041. Two Sun Blade X6270 (each 2 processors, 8 cores, 16 threads) 7,220 SAP SD Users, 2x 2.93 GHz Intel Xeon x5570, each 48 GB memory, running two-tier SAP Sales and Distribution Parallel (SD-Parallel) standard SAP SD benchmark with Oracle 10g Real Application Clusters and Solaris 10, Cert# 2009040. Sun Blade X6270 (2 processors, 8 cores, 16 threads) 3,800 SAP SD Users, 2x 2.93 GHz Intel Xeon x5570, 48 GB memory, running two-tier SAP Sales and Distribution Parallel (SD-Parallel) standard SAP SD benchmark with Oracle 10g Real Application Clusters and Solaris 10, Cert# 2009039. Sun Fire X4270 (2 processors, 8 cores, 16 threads) 3,800 SAP SD Users, 2x 2.93 GHz Intel Xeon x5570, 48 GB memory, running two-tier SAP Sales and Distribution (SD) standard SAP SD benchmark with Oracle 10g and Solaris 10, Cert# 2009033.

SAP, R/3, reg TM of SAP AG in Germany and other countries. More info www.sap.com/benchmark

Sunday Oct 11, 2009

TPC-C World Record Sun - Oracle

TPC-C Sun SPARC Enterprise T5440 with Oracle RAC World Record Database Result

Sun and Oracle demonstrate the World's fastest database performance. Sun Microsystems using 12 Sun SPARC Enterprise T5440 servers, 60 Sun Storage F5100 Flash arrays and Oracle 11g Enterprise Edition with Real Application Clusters and Partitioning delivered a world-record TPC-C benchmark result.

  • The 12-node Sun SPARC Enterprise T5440 server cluster delivered a world record TPC-C benchmark result of 7,646,486.7 tpmC at $2.36/tpmC (USD) using Oracle 11g R1 on a configuration available 3/19/10.

  • The 12-node Sun SPARC Enterprise T5440 server cluster beats the performance of the IBM Power 595 (5GHz) with IBM DB2 9.5 database by 26% and has 16% better price/performance on the TPC-C benchmark.

  • The complete Oracle/Sun solution delivered 10.7x better computational density than the IBM configuration (computational density = performance/rack).

  • The complete Oracle/Sun solution used 8 times fewer racks than the IBM configuration.

  • The complete Oracle/Sun solution has 5.9x better power/performance than the IBM configuration.

  • The 12-node Sun SPARC Enterprise T5440 server cluster beats the performance of the HP Superdome (1.6GHz Itanium2) by 87% and has 19% better price/performance on the TPC-C benchmark.

  • The Oracle/Sun solution utilized Sun FlashFire technology to deliver this result. The Sun Storage F5100 flash array was used for database storage.

  • Oracle 11g Enterprise Edition with Real Application Clusters and Partitioning scales and effectively uses all of the nodes in this configuration to produce the world record performance.

  • This result showed Sun and Oracle's integrated hardware and software stacks provide industry-leading performance.

More information on this benchmark will be posted in the next several days.

Performance Landscape

TPC-C results (sorted by tpmC, bigger is better)


System                            tpmC       Price/tpmC  Avail     Database        Cluster  Racks  w/KtpmC
12 x Sun SPARC Enterprise T5440   7,646,487  2.36 USD    03/19/10  Oracle 11g RAC  Y        9      9.6
IBM Power 595                     6,085,166  2.81 USD    12/10/08  IBM DB2 9.5     N        76     56.4
Bull Escala PL6460R               6,085,166  2.81 USD    12/15/08  IBM DB2 9.5     N        71     56.4
HP Integrity Superdome            4,092,799  2.93 USD    08/06/07  Oracle 10g R2   N        46     to be added

Avail - Availability date
w/KtpmC - Watts per 1000 tpmC
Racks - clients, servers, storage, infrastructure
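
Several of the percentage claims in the bullets follow directly from the table. The sketch below recomputes them from the table's rounded values; the dictionary keys and function names are illustrative assumptions, and the post's own figures may have been derived from unrounded inputs.

```python
# Values copied from the TPC-C table above (price in USD per tpmC).
SUN = {"tpmc": 7646487, "price": 2.36, "racks": 9, "watts_per_ktpmc": 9.6}
IBM = {"tpmc": 6085166, "price": 2.81, "racks": 76, "watts_per_ktpmc": 56.4}
HP = {"tpmc": 4092799, "price": 2.93, "racks": 46, "watts_per_ktpmc": None}


def percent_faster(a, b):
    """How much higher a's throughput is than b's, in percent."""
    return (a["tpmc"] / b["tpmc"] - 1) * 100


def percent_better_price(a, b):
    """How much lower a's price per tpmC is than b's, in percent."""
    return (1 - a["price"] / b["price"]) * 100


def rack_ratio(a, b):
    """How many times more racks b uses than a."""
    return b["racks"] / a["racks"]


def power_ratio(a, b):
    """How many times more watts per 1000 tpmC b draws than a."""
    return b["watts_per_ktpmc"] / a["watts_per_ktpmc"]


if __name__ == "__main__":
    print(f"vs IBM: {percent_faster(SUN, IBM):.0f}% faster, "
          f"{percent_better_price(SUN, IBM):.0f}% better price/performance")
    print(f"vs HP:  {percent_faster(SUN, HP):.0f}% faster, "
          f"{percent_better_price(SUN, HP):.0f}% better price/performance")
    print(f"vs IBM: {rack_ratio(SUN, IBM):.1f}x fewer racks, "
          f"{power_ratio(SUN, IBM):.1f}x better power/performance")
```

Note that "better price/performance" is computed here as the relative reduction in $/tpmC, which is the reading that matches the 16% and 19% figures in the bullets.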

Results and Configuration Summary

Hardware Configuration:

    9 racks used to hold

    Servers:
      12 x Sun SPARC Enterprise T5440
      4 x 1.6 GHz UltraSPARC T2 Plus
      512 GB memory
      10 GbE network for cluster
    Storage:
      60 x Sun Storage F5100 Flash Array
      61 x Sun Fire X4275, Comstar SAS target emulation
      24 x Sun StorageTek 6140 (16 x 300 GB SAS 15K RPM)
      6 x Sun Storage J4400
      3 x 80-port Brocade FC switches
    Clients:
      24 x Sun Fire X4170, each with
      2 x 2.53 GHz X5540
      48 GB memory

Software Configuration:

    Solaris 10 10/09
    OpenSolaris 6/09 (COMSTAR) for Sun Fire X4275
    Oracle 11g Enterprise Edition with Real Application Clusters and Partitioning
    Tuxedo CFS-R Tier 1
    Sun Web Server 7.0 Update 5

Benchmark Description

TPC-C is an OLTP system benchmark. It simulates a complete environment where a population of terminal operators executes transactions against a database. The benchmark is centered around the principal activities (transactions) of an order-entry environment. These transactions include entering and delivering orders, recording payments, checking the status of orders, and monitoring the level of stock at the warehouses.

POSTSCRIPT: Here are some comments on IBM's grasping-at-straws-perf/core attacks on the TPC-C result:
c0t0d0s0 blog: "IBM's Reaction to Sun&Oracle TPC-C

See Also

Disclosure Statement

TPC Benchmark C, tpmC, and TPC-C are trademarks of the Transaction Processing Performance Council (TPC). 12-node Sun SPARC Enterprise T5440 Cluster (1.6GHz UltraSPARC T2 Plus, 4 processor) with Oracle 11g Enterprise Edition with Real Application Clusters and Partitioning, 7,646,486.7 tpmC, $2.36/tpmC. Available 3/19/10. IBM Power 595 (5GHz Power6, 32 chips, 64 cores, 128 threads) with IBM DB2 9.5, 6,085,166 tpmC, $2.81/tpmC, available 12/10/08. HP Integrity Superdome (1.6GHz Itanium2, 64 processors, 128 cores, 256 threads) with Oracle 10g Enterprise Edition, 4,092,799 tpmC, $2.93/tpmC. Available 8/06/07. Source: www.tpc.org, results as of 10/11/09.

Friday Oct 09, 2009

X6275 Cluster Demonstrates Performance and Scalability on WRF 2.5km CONUS Dataset

Significance of Results

Results are presented for the Weather Research and Forecasting (WRF) code running on twelve Sun Blade X6275 server modules, housed in the Sun Blade 6048 chassis, using the 2.5 km CONUS benchmark dataset.

  • The Sun Blade X6275 cluster was able to achieve 373 GFLOP/s on the CONUS 2.5-KM Dataset.
  • The results demonstrate a 91% speedup efficiency, or 11x speedup, from 1 to 12 blades.
  • The current results were run with turbo on.

Performance Landscape

Performance is expressed in terms of "simulation speedup", which is the ratio of the simulated time step per iteration to the average wall clock time required to compute it. A larger number implies better performance.

The current results were run with turbo mode on.

WRF 3.0.1.1: Weather Research and Forecasting CONUS 2.5-KM Dataset
                              Performance            Computation Rate       Speedup/Efficiency        Turbo On
                              (Simulation Speedup)   (GFLOP/sec)            (vs. 1 blade)             Relative
#Blade  #Node  #Proc  #Core   Turbo On   Turbo Off   Turbo On   Turbo Off   Turbo On     Turbo Off    Perf
12      24     48     192     13.58      12.93       373.0      355.1       11.0 / 91%   10.4 / 87%   +6%
8       16     32     128     9.27       -           254.6      -           7.5 / 93%    -            -
6       12     24     96      7.03       6.60        193.1      181.3       5.7 / 94%    5.3 / 89%    +7%
4       8      16     64      4.74       -           130.2      -           3.8 / 96%    -            -
2       4      8      32      2.44       -           67.0       -           2.0 / 98%    -            -
1       2      4      16      1.24       1.24        34.1       34.1        1.0 / 100%   1.0 / 100%   +0%

Results and Configuration Summary

Hardware Configuration:

    Sun Blade 6048 Modular System
      12 x Sun Blade X6275 Server Modules, each with
        4 x 2.93 GHz Intel QC X5570 processors
        24 GB (6 x 4GB)
        QDR InfiniBand
        HT disabled in BIOS
        Turbo mode enabled in BIOS

Software Configuration:

    OS: SUSE Linux Enterprise Server 10 SP 2
    Compiler: PGI 7.2-5
    MPI Library: Scali MPI v5.6.4
    Benchmark: WRF 3.0.1.1
    Support Library: netCDF 3.6.3

Benchmark Description

The Weather Research and Forecasting (WRF) Model is a next-generation mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs. WRF is designed to be a flexible, state-of-the-art atmospheric simulation system that is portable and efficient on available parallel computing platforms. It features multiple dynamical cores, a 3-dimensional variational (3DVAR) data assimilation system, and a software architecture allowing for computational parallelism and system extensibility.

Dataset used:

    Single domain, large size 2.5KM Continental US (CONUS-2.5K)

    • 1501x1201x35 cell volume
    • 6hr, 2.5km resolution dataset from June 4, 2005
    • Benchmark is the final 3hr simulation for hrs 3-6 starting from a provided restart file; the benchmark may also be performed (but seldom reported) for the full 6hrs starting from a cold start
    • Iterations output at every 15 sec of simulation time, with the computation cost of each time step ~412 GFLOP

Key Points and Best Practices

  • Processes were bound to processors in round-robin fashion.
  • Model simulation time is 15 seconds per iteration as defined in input job deck. An achieved speedup of 2.67X means that each model iteration of 15s of simulation time was executed in 5.6s of real wallclock time (on average).
  • Computational requirements are 412 GFLOP per simulation time step as (measured empirically and) documented on the UCAR web site for this data model.
  • Model was run as single MPI job.
  • Benchmark was built and run as a pure-MPI variant. With larger process counts building and running WRF as a hybrid OpenMP/MPI variant may be more efficient.
  • Input and output (netCDF format) datasets can be very large for some WRF data models and run times will generally benefit by using a scalable filesystem. Performance with very large datasets (>5GB) can benefit by enabling WRF quilting of I/O across designated processors/servers. The master thread (or rank-0) performs most of the I/O (unless quilting specifies otherwise), with all processes potentially generating some I/O.
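
The relationship between simulation speedup, wall clock time per step, and the GFLOP/s figures in the results table follows from the two constants given above (15 seconds of simulated time and roughly 412 GFLOP per time step). A minimal sketch, with function names chosen here for illustration:

```python
SIM_SECONDS_PER_STEP = 15.0   # simulated time per iteration, from the input job deck
GFLOP_PER_STEP = 412.0        # computational cost per time step for CONUS 2.5 km


def wallclock_per_step(sim_speedup):
    """Average wall clock seconds needed to compute one 15-second model step."""
    return SIM_SECONDS_PER_STEP / sim_speedup


def gflops(sim_speedup):
    """Sustained computation rate in GFLOP/s implied by a simulation speedup."""
    return GFLOP_PER_STEP / wallclock_per_step(sim_speedup)


def scaling(sim_speedup, one_blade_speedup, blades):
    """Return (speedup vs. one blade, parallel efficiency as a fraction)."""
    speedup = sim_speedup / one_blade_speedup
    return speedup, speedup / blades


if __name__ == "__main__":
    print(f"speedup 2.67 -> {wallclock_per_step(2.67):.1f} s of wall clock per step")
    print(f"12 blades, turbo on: {gflops(13.58):.1f} GFLOP/s")
    s, e = scaling(13.58, 1.24, 12)
    print(f"12 blades vs 1 blade: {s:.1f}x speedup, {e:.0%} efficiency")
```

Feeding in the turbo-on values from the results table (13.58 for 12 blades, 1.24 for 1 blade) gives the 373 GFLOP/s, 11x speedup, and 91% efficiency quoted in this post.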

See Also

Disclosure Statement

WRF, CONUS-2.5K, see http://www.mmm.ucar.edu/wrf/WG2/bench/, results as of 9/21/2009.

Tuesday Jun 30, 2009

Sun Blade 6048 and Sun Blade X6275 NAMD Molecular Dynamics Benchmark beats IBM BlueGene/L

Significance of Results

A Sun Blade 6048 chassis with 12 Sun Blade X6275 server modules ran benchmarks using the NAMD molecular dynamics applications software. NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. NAMD was developed by the Theoretical and Computational Biophysics Group in the Beckman Institute for Advanced Science and Technology at the University of Illinois at Urbana-Champaign. NAMD is driven by major trends in computing and structural biology and received a 2002 Gordon Bell Award.
  • The cluster of 12 Sun Blade X6275 server modules was 6.2x faster than the 256-processor configuration of the IBM BlueGene/L.
  • The cluster of 12 Sun Blade X6275 server modules exhibited excellent scalability for NAMD molecular dynamics simulation, up to 10.6x speedup for 12 blades relative to 1 blade.
  • For the largest molecule considered, the cluster of 12 Sun Blade X6275 server modules achieved a throughput of 0.094 seconds per simulation step.
Molecular dynamics simulation is important to biological and materials science research. Molecular dynamics is used to determine the low energy conformations or shapes of a molecule. These conformations are presumed to be the biologically active conformations.

Performance Landscape

The NAMD Performance Benchmarks web page plots the performance of NAMD when the ApoA1 benchmark is executed on a variety of clusters. The performance is expressed in terms of the time in seconds required to execute one step of the molecular dynamics simulation, multiplied by the number of "processors" on which NAMD executes in parallel. The following table compares the performance of NAMD version 2.6 when executed on the Sun Blade X6275 cluster to the performance of NAMD as reported for several of the clusters on the web page. In this table, the performance is expressed in terms of the time in seconds required to execute one step of the molecular dynamics simulation, however, not multiplied by the number of "processors". A smaller number implies better performance.
Throughput (seconds per step)

Cluster Name and Interconnect    128 Cores   192 Cores   256 Cores
Sun Blade X6275 InfiniBand       0.013       0.010       -
Cambridge Xeon/3.0 InfiniPath    0.016       -           0.0088
NCSA Xeon/2.33 InfiniBand        0.019       -           0.010
AMD Opteron/2.2 InfiniPath       0.025       -           0.015
IBM HPCx PWR4/1.7 Federation     0.039       -           0.021
SDSC IBM BlueGene/L MPI          0.108       -           0.062

The following tables report results for NAMD molecular dynamics using a cluster of Sun Blade X6275 server modules. The performance of the cluster is expressed in terms of the time in seconds that is required to execute one step of the molecular dynamics simulation. A smaller number implies better performance.

                STMV molecule (1)           f1 ATPase molecule (2)      ApoA1 molecule (3)
Blades  Cores   Thruput       spdup effi'cy Thruput       spdup effi'cy Thruput       spdup effi'cy
                (secs/step)                 (secs/step)                 (secs/step)
12      192     0.0941        10.6  88%     0.0270        9.1   76%     0.0102        8.1   68%
8       128     0.1322        7.5   94%     0.0317        7.7   97%     0.0131        6.3   79%
4       64      0.2656        3.7   94%     0.0610        4.0   101%    0.0204        4.1   102%
1       16      0.9952        1.0   100%    0.2454        1.0   100%    0.0829        1.0   100%

spdup - speedup versus 1 blade result
effi'cy - speedup efficiency versus 1 blade result

(1) Synthetic Tobacco Mosaic Virus (STMV) molecule, 1,066,628 atoms, 12 Angstrom cutoff, Langevin dynamics, 500 time steps
(2) f1 ATPase molecule, 327,506 atoms, 11 Angstrom cutoff, particle mesh Ewald dynamics, 500 time steps
(3) ApoA1 molecule, 92,224 atoms, 12 Angstrom cutoff, particle mesh Ewald dynamics, 500 time steps
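
The speedup and efficiency columns are derived from the throughput column. The following sketch recomputes them from the seconds-per-step values in the table; the dictionary and function names are illustrative, not part of NAMD.

```python
# Seconds per simulation step by blade count, copied from the table above.
SECONDS_PER_STEP = {
    "STMV": {1: 0.9952, 4: 0.2656, 8: 0.1322, 12: 0.0941},
    "f1 ATPase": {1: 0.2454, 4: 0.0610, 8: 0.0317, 12: 0.0270},
    "ApoA1": {1: 0.0829, 4: 0.0204, 8: 0.0131, 12: 0.0102},
}


def speedup(molecule, blades):
    """Speedup relative to the 1-blade run of the same molecule."""
    times = SECONDS_PER_STEP[molecule]
    return times[1] / times[blades]


def efficiency(molecule, blades):
    """Speedup divided by blade count; 1.0 means perfect linear scaling."""
    return speedup(molecule, blades) / blades


if __name__ == "__main__":
    for molecule in SECONDS_PER_STEP:
        for blades in (4, 8, 12):
            print(f"{molecule:10s} {blades:2d} blades: "
                  f"{speedup(molecule, blades):4.1f}x, {efficiency(molecule, blades):.0%}")
```

At 12 blades the efficiency falls from STMV (about one million atoms) to ApoA1 (about 92 thousand atoms), in line with the observation that larger models scale better. Efficiencies slightly above 100% at 4 blades are within what the rounded timings allow.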

Results and Configuration Summary

Hardware Configuration

  • Sun Blade[tm] 6048 Modular System with one shelf configured with
    • 12 x Sun Blade X6275, each with
      • 2 x (2 x 2.93 GHz Intel QC Xeon X5570 processors)
      • 2 x (24 GB memory)
      • Hyper-Threading (HT) off, Turbo Mode on

Software Configuration

  • SUSE Linux Enterprise Server 10 SP2 kernel version 2.6.16.60-0.31_lustre.1.8.0.1-smp
  • Scali MPI 5.6.6
  • gcc 4.1.2 (1/15/2007), gfortran 4.1.2 (1/15/2007)

Key Points and Best Practices

  • Models with large numbers of atoms scale better than models with small numbers of atoms.

About the Sun Blade X6275

The Intel QC X5570 processors include a turbo boost feature coupled with a speed-step option in the CPU section of the Advanced BIOS settings. Under specific circumstances, this can provide CPU overclocking, which increases the processor frequency from 2.93GHz to 3.2GHz. This feature was enabled when generating the results reported here.

Benchmark Description

Molecular dynamics simulation is widely used in biological and materials science research. NAMD is a public-domain molecular dynamics software application for which a variety of molecular input directories are available. Three of these directories define:
  • the Synthetic Tobacco Mosaic Virus (STMV) that comprises 1,066,628 atoms
  • the f1 ATPase enzyme that comprises 327,506 atoms
  • the ApoA1 enzyme that comprises 92,224 atoms
Each input directory also specifies the type of molecular dynamics simulation to be performed, for example, Langevin dynamics with a 12 Angstrom cutoff for 500 time steps, or particle mesh Ewald dynamics with an 11 Angstrom cutoff for 500 time steps.

See Also

Disclosure Statement

NAMD, see http://www.ks.uiuc.edu/Research/namd/performance.html for more information, results as of 6/26/2009.

Tuesday Jun 23, 2009

New CPU2006 Records: 3x better integer throughput, 9x better fp throughput

Significance of Results

A Sun Constellation system, composed of 48 Sun Blade X6440 server modules in a Sun Blade 6048 chassis, running OpenSolaris 2008.11 and using the Sun Studio 12 Update 1 compiler delivered World Record SPEC CPU2006 rate results.

On the SPECint_rate_base2006 benchmark, Sun delivered 4.7 times more performance than the IBM Power 595 (5GHz POWER6); this IBM system requires a slightly larger cabinet than the Sun Blade 6048 chassis (details below).

On the SPECfp_rate_base2006 benchmark, Sun delivered 3.9 times more performance than the largest IBM Power 595 (5GHz POWER6); this IBM system requires a slightly larger cabinet than the Sun Blade 6048 chassis (details below).

  • The Sun Constellation System equipped with AMD Opteron QC 8384 2.7 GHz processors, running OpenSolaris 2008.11 and using the Sun Studio 12 update 1 compiler, delivered the World Record SPECint_rate_base2006 score of 8840.
  • This SPECint_rate_base2006 score beat the previous record holding score by over three times.
  • The Sun Constellation System equipped with AMD Opteron QC 8384 2.7 GHz processors, running OpenSolaris 2008.11 and using the Sun Studio 12 update 1 compiler, delivered the fastest x86 SPECfp_rate_base2006 score of 6500.
  • This SPECfp_rate_base2006 score beat the previous x86 record holding score by nine times.

Performance Landscape

SPEC CPU2006 Performance Charts - bigger is better, selected results, please see www.spec.org for complete results.

SPECint_rate2006

System                               Processor Type  GHz   Chips  Cores  Peak   Base   Notes (1)
Sun Blade 6048                       Opteron 8384    2.7   192    768    -      8840   New Record
SGI Altix 4700 Density System        Itanium 9150M   1.66  128    256    3354   2893   Previous Best
SGI Altix 4700 Bandwidth System      Itanium2 9040   1.6   128    256    2971   2715
Fujitsu/Sun SPARC Enterprise M9000   SPARC64 VII     2.52  64     256    2290   2090
IBM Power 595                        POWER6          5.0   32     64     2160   1870   Best POWER6

(1) Results as of 23 June 2009 from www.spec.org.

SPECfp_rate2006

System                               Processor Type  GHz   Chips  Cores  Peak   Base   Notes (2)
SGI Altix 4700 Density System        Itanium 9140M   1.66  512    1024   -      10580
Sun Blade 6048                       Opteron 8384    2.7   192    768    -      6500   New x86 Record
SGI Altix 4700 Bandwidth System      Itanium2 9040   1.6   128    256    3507   3419
IBM Power 595                        POWER6          5.0   32     64     2184   1681   Best POWER6
Fujitsu/Sun SPARC Enterprise M9000   SPARC64 VII     2.52  64     256    2005   1861
SGI Altix 4700 Bandwidth System      Itanium 9150M   1.66  128    256    1947   1832
SGI Altix ICE 8200EX                 Intel X5570     2.93  8      32     742    723

(2) Results as of 23 June 2009 from www.spec.org.


Results and Configuration Summary

Hardware Configuration:
    1 x Sun Blade 6048
      48 x Sun Blade X6440, each with
        4 x 2.7 GHz QC AMD Opteron 8384 processors
        32 GB, (8 x 4GB)

Software Configuration:

    O/S: OpenSolaris 2008.11
    Compiler: Sun Studio 12 Update 1
    Other SW: MicroQuill SmartHeap Library 9.01 x64
    Benchmark: SPEC CPU2006 V1.1

Key Points and Best Practices

The Sun Blade 6048 chassis can hold a variety of server modules. In this case, the Sun Blade X6440 was used to provide this capacity solution. This single rack delivered results not previously seen in this form factor.

Running this many concurrent jobs requires a reasonably capable file server for the benchmark directories. A Sun Fire X4540 server provided the required disk space, which the blades accessed over NFS.

Compared with the IBM Power 595, Sun has shown 4.7x greater SPECint_rate_base2006 and 3.9x greater SPECfp_rate_base2006 in a slightly smaller cabinet.

IBM specifications are at: http://www-03.ibm.com/systems/power/hardware/595/specs.html
    One frame (slimline doors): 79.3"H x 30.5"W x 58.5"D, weight: 3,376 lb
    One frame (acoustic doors): 79.3"H x 30.5"W x 71.1"D, weight: 3,422 lb

The Sun Blade 6048 specifications are at: http://www.sun.com/servers/blades/6048chassis/specs.xml
    One Sun Blade 6048: 81.6"H x 23.9"W x 40.3"D, weight: 2,300 lb (fully configured)
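
The ratios and the cabinet comparison can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the base scores from the disclosure statement and the dimensions quoted above; the variable and function names are illustrative, and the IBM figures use the slimline-door frame.

```python
# Base scores as quoted in the disclosure statement.
SUN_INT_BASE, SUN_FP_BASE = 8840, 6500
IBM_INT_BASE, IBM_FP_BASE = 1870, 1681

# Cabinet dimensions in inches (height, width, depth) as quoted above.
SUN_6048 = (81.6, 23.9, 40.3)
IBM_595_SLIMLINE = (79.3, 30.5, 58.5)


def volume(dims):
    """Bounding-box volume in cubic inches."""
    height, width, depth = dims
    return height * width * depth


def footprint(dims):
    """Floor area in square inches (width x depth)."""
    _, width, depth = dims
    return width * depth


int_ratio = SUN_INT_BASE / IBM_INT_BASE
fp_ratio = SUN_FP_BASE / IBM_FP_BASE
sun_volume, ibm_volume = volume(SUN_6048), volume(IBM_595_SLIMLINE)
sun_floor, ibm_floor = footprint(SUN_6048), footprint(IBM_595_SLIMLINE)

print(f"SPECint_rate_base2006 ratio: {int_ratio:.1f}x")
print(f"SPECfp_rate_base2006 ratio:  {fp_ratio:.1f}x")
print(f"Sun/IBM cabinet volume:      {sun_volume / ibm_volume:.2f}")
print(f"Sun/IBM cabinet footprint:   {sun_floor / ibm_floor:.2f}")
```

The bounding-box volume ignores door clearance and service space, so it is only a rough proxy for how much floor space each system needs.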

Disclosure Statement:

SPEC, SPECint, SPECfp are registered trademarks of the Standard Performance Evaluation Corporation. Results from www.spec.org as of 6/22/2009 and this report. Sun Blade 6048 chassis with Sun Blade X6440 server modules (48 nodes with 4 chips, 16 cores, 16 threads each, OpenSolaris 2008.11, Studio 12 update 1) - 8840 SPECint_rate_base2006, 6500 SPECfp_rate_base2006; IBM p595, 1870 SPECint_rate_base2006, 1681 SPECfp_rate_base2006.

Wednesday Jun 03, 2009

Welcome to BestPerf group blog!

Welcome to the BestPerf group blog! This blog will contain many different performance results and the best practices learned from doing a wide variety of performance work on the broad range of Sun's products.

Over the coming days, you will see many engineers in the Strategic Applications Engineering group posting on a wide variety of topics and providing useful information to the users of Sun's technologies. Some of the areas explored will be:

world-record, performance, $/Perf, watts, watt/perf, scalability, bandwidth, RAS, virtualization, security, cluster, latency, HPC, Web, Application, Database
