Monday Oct 03, 2011

SPARC T4-4 Produces World Record Oracle OLAP Capacity

Oracle's SPARC T4-4 server delivered world record capacity on the Oracle OLAP Perf workload.

  • The SPARC T4-4 server operated on a cube built from a 3 billion row fact table of sales data with 4 dimensions, representing as many as 70 quintillion aggregate rows (70 followed by 18 zeros).

  • The SPARC T4-4 server supported 3,500 cube-queries/minute against the Oracle OLAP cube, with an average response time of 1.5 seconds and a median response time of 0.15 seconds.

Performance Landscape

Oracle OLAP Perf Benchmark

System        Fact Table      Cube-Queries/    Median Response    Average Response
              Num of Rows     minute           (sec)              (sec)
SPARC T4-4    3 Billion       3,500            0.15               1.5

Configuration Summary and Results

Hardware Configuration:

SPARC T4-4 server with
4 x SPARC T4 processors, 3.0 GHz
1 TB main memory
2 x Sun Storage F5100 Flash Array

Software Configuration:

Oracle Solaris 10 8/11
Oracle Database 11g Enterprise Edition with Oracle OLAP option

Benchmark Description

OLAP Perf is a workload designed to demonstrate and stress the Oracle OLAP product's core functionalities of fast query, fast update, and rich calculations on a dimensional model to support Enhanced Data Warehousing. The workload uses a set of realistic business intelligence (BI) queries that run against an OLAP cube.

Key Points and Best Practices

  • The SPARC T4-4 server is estimated to support 2,400 interactive users at this fast response time, assuming only 5 seconds of think time between query requests.

See Also

Disclosure Statement

Copyright 2011, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 10/3/2011.

Wednesday Sep 28, 2011

SPARC T4 Servers Set World Record on PeopleSoft HRMS 9.1

Oracle's SPARC T4-4 servers running Oracle's PeopleSoft HRMS Self-Service 9.1 benchmark and Oracle Database 11g Release 2 achieved World Record performance on Oracle Solaris 10.

  • Using two SPARC T4-4 servers to run the application and database tiers and one SPARC T4-2 server to run the webserver tier, Oracle demonstrated world record performance of 15,000 concurrent users running the PeopleSoft HRMS Self-Service 9.1 benchmark.

  • The combination of the SPARC T4 servers running the PeopleSoft HRMS 9.1 benchmark supports 3.8x more online users with faster response time compared to the best published result from IBM on the previous PeopleSoft HRMS 8.9 benchmark.

  • The average CPU utilization on the SPARC T4-4 server in the application tier handling 15,000 users was less than 50%, leaving significant room for application growth.

  • The SPARC T4-4 server on the application tier used Oracle Solaris Containers which provide a flexible, scalable and manageable virtualization environment.

Performance Landscape

PeopleSoft HRMS Self-Service 9.1 Benchmark

Systems                      Processors                Users     Ave Response -    Ave Response -
                                                                 Search (sec)      Save (sec)
SPARC T4-2 (web)             2 x SPARC T4, 2.85 GHz    15,000    1.01              0.63
SPARC T4-4 (app)             4 x SPARC T4, 3.0 GHz
SPARC T4-4 (db)              4 x SPARC T4, 3.0 GHz

PeopleSoft HRMS Self-Service 8.9 Benchmark

Systems                      Processors                Users     Ave Response -    Ave Response -
                                                                 Search (sec)      Save (sec)
IBM Power 570 (web/app)      12 x POWER5, 1.9 GHz      4,000     1.74              1.25
IBM Power 570 (db)           4 x POWER5, 1.9 GHz

IBM p690 (web)               4 x POWER4, 1.9 GHz       4,000     1.35              1.01
IBM p690 (app)               12 x POWER4, 1.9 GHz
IBM p690 (db)                6 x 4392 MIPS/Gen1

The main differences between version 9.1 and version 8.9 of the benchmark are:

  • the database expanded from 100K employees and 20K managers to 500K employees and 100K managers,
  • the manager data was expanded,
  • a new transaction, "Employee Add Profile," was added; the percentage of users executing it is less than 2%, and the transaction has a heavier footprint,
  • version 9.1 has a different benchmark metric (average response time for search/save for x number of users) versus single-user search/save time,
  • newer versions of the PeopleSoft application and PeopleTools software are used.

Configuration Summary

Application Server:

1 x SPARC T4-4 server
4 x SPARC T4 processors 3.0 GHz
512 GB main memory
5 x 300 GB SAS internal disks,
2 x 100 GB internal SSDs
1 x 300 GB internal SSD
Oracle Solaris 10 8/11
PeopleSoft PeopleTools 8.51.02
PeopleSoft HCM 9.1
Oracle Tuxedo, Version 10.3.0.0, 64-bit, Patch Level 031
Java HotSpot(TM) 64-Bit Server VM on Solaris, version 1.6.0_20

Web Server:

1 x SPARC T4-2 server
2 x SPARC T4 processors 2.85 GHz
256 GB main memory
1 x 300 GB SAS internal disk
1 x 300 GB internal SSD
Oracle Solaris 10 8/11
PeopleSoft PeopleTools 8.51.02
Oracle WebLogic Server 11g (10.3.3)
Java HotSpot(TM) 64-Bit Server VM on Solaris, version 1.6.0_20

Database Server:

1 x SPARC T4-4 server
4 x SPARC T4 processors 3.0 GHz
256 GB main memory
3 x 300 GB SAS internal disks
1 x Sun Storage F5100 Flash Array (80 flash modules)
Oracle Solaris 10 8/11
Oracle Database 11g Release 2

Benchmark Description

The purpose of the PeopleSoft HRMS Self-Service 9.1 benchmark is to measure comparative online performance of the selected processes in PeopleSoft Enterprise HCM 9.1 with Oracle Database 11g. The benchmark kit is an Oracle standard benchmark kit run by all platform vendors to measure performance. It is an OLTP benchmark with no dependency on remote COBOL calls and no batch workload; the database SQL is moderately complex. The results are certified by Oracle and a white paper is published.

PeopleSoft defines a business transaction as a series of HTML pages that guide a user through a particular scenario. Users are defined as corporate Employees, Managers and HR administrators. The benchmark consists of 14 scenarios which emulate users performing typical HCM transactions such as viewing paychecks, promoting and hiring employees, updating employee profiles and other typical HCM application transactions.

All these transactions are well-defined in the PeopleSoft HR Self-Service 9.1 benchmark kit. The benchmark metric is the average response time for search and save for 15,000 users.

Key Points and Best Practices

  • The application tier was configured with two PeopleSoft application server instances on the SPARC T4-4 server, hosted in two separate Oracle Solaris Containers, to demonstrate consolidation of multiple applications, ease of administration, and load balancing.

  • Each PeopleSoft Application Server instance running in an Oracle Solaris Container was configured to run 5 application server Domains with 30 application server instances, enough to effectively handle the 15,000-user workload with zero application server queuing and minimal use of resources.

  • The web tier was configured with 20 WebLogic instances with a 4 GB JVM heap size to load balance transactions across 10 PeopleSoft Domains. This enables equitable distribution of transactions and scaling to a high number of users.

  • Internal SSDs were configured in the application tier to host the PeopleSoft Application Server object CACHE file systems, and in the web tier for WebLogic server logging, providing near-zero millisecond service times and faster server response.

See Also

Disclosure Statement

Oracle's PeopleSoft HRMS 9.1 benchmark, www.oracle.com/us/solutions/benchmark/apps-benchmark/peoplesoft-167486.html, results 9/26/2011.

Tuesday Sep 27, 2011

SPARC T4-2 Servers Set World Record on JD Edwards EnterpriseOne Day in the Life Benchmark with Batch, Outperform IBM POWER7

Using Oracle's SPARC T4-2 server for the application tier and a SPARC T4-1 server for the database tier, a world record result was produced running Oracle's JD Edwards EnterpriseOne applications Day in the Life (DIL) benchmark concurrently with a batch workload.

  • The SPARC T4-2 server running online and batch with JD Edwards EnterpriseOne 9.0.2 is 1.7x faster and has better response time than the IBM Power 750 system which only ran the online component of JD Edwards EnterpriseOne 9.0 Day in the Life test.

  • The combination of SPARC T4 servers delivered a Day in the Life benchmark result of 10,000 online users with 0.35 seconds of average transaction response time running concurrently with 112 Universal Batch Engine (UBE) processes at 67 UBEs/minute.

  • This is the first JD Edwards EnterpriseOne benchmark result combining 10,000 users with payroll batch, using a SPARC T4-2 server for the application tier and a SPARC T4-1 server running Oracle Database 11g Release 2 for the database tier. All servers ran the Oracle Solaris 10 operating system.

  • The single-thread performance of the SPARC T4 processor produced sub-second response for the online components and provided dramatic performance for the batch jobs.

  • The SPARC T4 servers, JD Edwards EnterpriseOne 9.0.2, and Oracle WebLogic Server 11g Release 1 support 17% more users per JAS (Java Application Server) than the SPARC T3-1 server for this benchmark.

  • The SPARC T4-2 server provided a 6.7x better batch processing rate than the previous SPARC T3-1 server record result and had 2.5x faster response time.

  • The SPARC T4-2 server used Oracle Solaris Containers, which provide flexible, scalable and manageable virtualization.

  • JD Edwards EnterpriseOne uses Oracle Fusion Middleware WebLogic Server 11g R1 and Oracle Fusion Middleware Cluster Web Tier Utilities 11g HTTP server.

  • The combination of the SPARC T4-2 server and Oracle JD Edwards EnterpriseOne in the application tier with a SPARC T4-1 server in the database tier showed low CPU utilization, providing headroom for growth.

Performance Landscape

JD Edwards EnterpriseOne Day in the Life Benchmark
Online with Batch Workload

System                          Online    Resp Time    Batch Concur    Batch Rate    Version
                                Users     (sec)        (# of UBEs)     (UBEs/m)
2 x SPARC T4-2 (app+web)        10,000    0.35         112             67            9.0.2
SPARC T4-1 (db)

SPARC T3-1 (app+web)            5,000     0.88         19              10            9.0.1
SPARC Enterprise M3000 (db)

Resp Time (sec) — Response time of online jobs reported in seconds
Batch Concur (# of UBEs) — Batch concurrency presented in the number of UBEs
Batch Rate (UBEs/m) — Batch transaction rate in UBEs per minute

JD Edwards EnterpriseOne Day in the Life Benchmark
Online Workload Only

System                                                     Online    Response      Version
                                                           Users     Time (sec)
SPARC T3-1, 1 x SPARC T3 (1.65 GHz), Solaris 10 (app)      5,000     0.52          9.0.1
M3000, 1 x SPARC64 VII (2.75 GHz), Solaris 10 (db)

IBM Power 750, POWER7 (3.55 GHz) (app+db)                  4,000     0.61          9.0

IBM result from http://www-03.ibm.com/systems/i/advantages/oracle/; IBM used WebSphere.

Configuration Summary

Application Tier Configuration:

1 x SPARC T4-2 server with
2 x 2.85 GHz SPARC T4 processors
128 GB main memory
6 x 300 GB 10K RPM SAS internal HDD
Oracle Solaris 10 9/10
JD Edwards EnterpriseOne 9.0.2 with Tools 8.98.3.3

Web Tier Configuration:

1 x SPARC T4-2 server with
2 x 2.85 GHz SPARC T4 processors
256 GB main memory
2 x 300 GB SSD
4 x 300 GB 10K RPM SAS internal HDD
Oracle Solaris 10 9/10
Oracle WebLogic Server 11g Release 1

Database Tier Configuration:

1 x SPARC T4-1 server with
1 x 2.85 GHz SPARC T4 processor
128 GB main memory
6 x 300 GB 10K RPM SAS internal HDD
2 x Sun Storage F5100 Flash Array
Oracle Solaris 10 9/10
Oracle Database 11g Release 2

Benchmark Description

JD Edwards EnterpriseOne is an integrated applications suite of Enterprise Resource Planning (ERP) software. Oracle offers 70 JD Edwards EnterpriseOne application modules to support a diverse set of business operations.

Oracle's Day in the Life (DIL) kit is a suite of scripts that exercises the most common transactions of JD Edwards EnterpriseOne applications, including business processes such as payroll, sales order, purchase order, work order, and manufacturing processes such as ship confirmation. These are labeled by industry acronyms such as SCM, CRM, HCM, SRM and FMS. The kit's scripts execute transactions typical of a mid-sized manufacturing company.

  • The workload consists of online transactions and a Universal Batch Engine (UBE) workload of 42 short, 8 medium and 4 long UBEs.

  • LoadRunner runs the DIL workload, collects the users' transaction response times, and reports the key metric of Combined Weighted Average Transaction Response time.

  • The UBE workload runs from the JD Edwards Enterprise Application server.

    • Oracle's UBE processes come in three flavors:
      • Short UBEs (< 1 minute) perform Business Report and Summary Analysis,
      • Mid UBEs (> 1 minute) create a large report of Account, Balance, and Full Address,
      • Long UBEs (> 2 minutes) simulate Payroll, Sales Order, and night-only jobs.
    • The UBE workload generates a large number of PDF report files and log files.
    • The UBE queues are categorized as QBATCHD, a single-threaded queue for large and medium UBEs, and QPROCESS, a queue in which short UBEs run concurrently.

Oracle's UBE process performance metric is the number of maximum concurrent UBE processes at a given transaction rate, in UBEs/minute.

Key Points and Best Practices

One JD Edwards EnterpriseOne Application Server and two Oracle WebLogic Servers 11g R1 coupled with two Oracle Fusion Middleware 11g Web Tier HTTP Server instances on the SPARC T4-2 servers were hosted in three separate Oracle Solaris Containers to demonstrate consolidation of multiple application and web servers.

  • Interrupt fencing was configured on all Oracle Solaris Containers to channel interrupts to processors outside the processor sets used for the JD Edwards Application Server and WebLogic servers.

  • Processor 0 was left alone for clock interrupts.

  • The applications were executed in the FX scheduling class to improve performance by reducing the frequency of context switches.

  • A WebLogic vertical cluster was configured in each WebServer Container with twelve managed instances each, to load balance users' requests and to provide the infrastructure that enables scaling to a high number of users with ease of deployment and high availability.

  • The database server was run in an Oracle Solaris Container hosted on the SPARC T4-1 server.

  • The database log writer was run in the real-time (RT) scheduling class and bound to a processor set.

  • The database redo logs were configured on the raw disk partitions.

  • The private network between the SPARC T4-2 servers was configured with a 10 GbE interface.

  • The Oracle Solaris Container on the Enterprise Application server ran 42 short UBEs, 8 medium UBEs and 4 long UBEs concurrently as the mixed-size batch workload.

  • The mixed-size UBEs ran concurrently from the application server together with the 10,000 online users driven by LoadRunner.

See Also

Disclosure Statement

Copyright 2011, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 9/26/2011.

Monday Sep 19, 2011

Halliburton ProMAX® Seismic Processing on Sun Blade X6270 M2 with Sun ZFS Storage 7320

Halliburton/Landmark's ProMAX® 3D Pre-Stack Kirchhoff Time Migration (PSTM) single-workflow scalability and multiple-workflow throughput under various scheduling methods are evaluated on a cluster of Oracle's Sun Blade X6270 M2 server modules attached to Oracle's Sun ZFS Storage 7320 appliance.

Two resource scheduling methods, compact and distributed, are compared while increasing the system load with additional concurrent ProMAX® workflows.

  • Multiple concurrent 24-process ProMAX® PSTM workflow throughput is constant; 10 workflows on 10 nodes finish as fast as 1 workflow on one compute node. Additionally, processing twice the data volume yields similar traces/second throughput performance.

  • A single ProMAX® PSTM workflow shows good scaling from 1 to 10 nodes of a Sun Blade X6270 M2 cluster: 4.7x on 10 nodes with one input data set and 6.3x with two consecutive input data sets (i.e., twice the data).

  • A single ProMAX® PSTM workflow has near linear scaling of 11x on a Sun Blade X6270 M2 server module when running from 1 to 12 processes.

  • The 12-process ProMAX® workflow throughput using the distributed scheduling method is equivalent to or slightly faster than the compact scheme for 1 to 6 concurrent workflows.

Performance Landscape

Multiple 24-Process Workflow Throughput Scaling

This test measures the system throughput scalability as concurrent 24-process workflows are added, one workflow per node. The per workflow throughput and the system scalability are reported.

Aggregate system throughput scales linearly. Ten concurrent workflows finish in the same time as does one workflow on a single compute node.

Halliburton ProMAX® Pre-Stack Time Migration - Multiple Workflow Scaling


Single Workflow Scaling

This test measures single workflow scalability across a 10-node cluster. Utilizing a single data set, performance exhibits near linear scaling of 11x at 12 processes, and per-node scaling of 4x at 6 nodes; performance flattens quickly reaching a peak of 60x at 240 processors and per-node scaling of 4.7x with 10 nodes.

Running with two consecutive input data sets in the workflow, scaling is considerably improved with peak scaling ~35% higher than obtained using a single data set. Doubling the data set size minimizes time spent in workflow initialization, data input and output.

Halliburton ProMAX® Pre-Stack Time Migration - Single Workflow Scaling

This next test measures single workflow scalability across a 10-node cluster (as above) but limits scheduling to a maximum of 12 processes per node, effectively restricting it to one process per physical core. The speedups relative to a single process and to a single node are reported.

Utilizing a single data set, performance exhibits near-linear scaling of 37x at 48 processes and per-node scaling of 4.3x at 6 nodes. Performance reaches 55x at 120 processes and per-node scaling of 5x with 10 nodes, and scalability trends higher more strongly than in the case of two processes per physical core above. For equivalent total process counts, multi-node runs using only a single process per physical core run between 28% and 64% more efficiently (at 96 and 24 processes, respectively). With a full complement of 10 nodes (120 processes), peak performance is only 9.5% lower than with two processes per physical core (240 processes).

Running with two consecutive input data sets in the workflow, scaling is considerably improved with peak scaling ~35% higher than obtained using a single data set.

Halliburton ProMAX® Pre-Stack Time Migration - Single Workflow Scaling

Multiple 12-Process Workflow Throughput Scaling, Compact vs. Distributed Scheduling

The fourth test compares compact and distributed scheduling of 1, 2, 4, and 6 concurrent 12-processor workflows.

All things being equal, the system bisection bandwidth should improve with distributed scheduling of a fixed-size workflow; as more nodes are used for a workflow, more memory and system cache is employed, and any node memory bandwidth bottlenecks can be offset by distributing communication across the network (provided the network and inter-node communication stack do not become a bottleneck). When physical cores are not over-subscribed, compact and distributed scheduling performance is within 3%, suggesting that there may be little memory contention for this workflow on the benchmarked system configuration.

With compact scheduling of two concurrent 12-processor workflows, the physical cores become over-subscribed and performance degrades 36% per workflow. With four concurrent workflows, physical cores are over-subscribed 4x and performance degrades 66% per workflow. With six concurrent workflows, over-subscribed compact scheduling performance degrades 77% per workflow. As multiple 12-processor workflows become more and more distributed, performance approaches the non-over-subscribed case.

Halliburton ProMAX® Pre-Stack Time Migration - Multiple Workflow Scaling (141,616 traces x 624 samples)


Test Notes

All tests were performed with one input data set (70808 traces x 624 samples) and two consecutive input data sets (2 * (70808 traces x 624 samples)) in the workflow. All results reported are the average of at least 3 runs and performance is based on reported total wall-clock time by the application.

All tests were run with the NFS-attached Sun ZFS Storage 7320 appliance and then with an NFS-attached legacy Sun Fire X4500 server. The StorageTek Workload Analysis Tool (SWAT) was invoked to measure the I/O characteristics of the NFS-attached storage used on separate runs of all workflows.

Configuration Summary

Hardware Configuration:

10 x Sun Blade X6270 M2 server modules, each with
2 x 3.33 GHz Intel Xeon X5680 processors
48 GB DDR3-1333 memory
4 x 146 GB, Internal 10000 RPM SAS-2 HDD
10 GbE
Hyper-Threading enabled

Sun ZFS Storage 7320 Appliance
1 x Storage Controller
2 x 2.4 GHz Intel Xeon 5620 processors
48 GB memory (12 x 4 GB DDR3-1333)
2 TB Read Cache (4 x 512 GB Read Flash Accelerator)
10 GbE
1 x Disk Shelf
20.0 TB RAID-Z (20 x 1 TB SAS-2, 7200 RPM HDD)
4 x Write Flash Accelerators

Sun Fire X4500
2 x 2.8 GHz AMD Opteron 290 processors
16 GB DDR1-400 memory
34.5 TB RAID-Z (46 x 750 GB SATA-II, 7200 RPM HDD)
10 GbE

Software Configuration:

Oracle Linux 5.5
Parallel Virtual Machine 3.3.11 (bundled with ProMAX)
Intel 11.1.038 Compilers
Libraries: pthreads 2.4, Java 1.6.0_01, BLAS, Stanford Exploration Project Libraries

Benchmark Description

The ProMAX® family of seismic data processing tools is the most widely used Oil and Gas Industry seismic processing application. ProMAX® is used for multiple applications, from field processing and quality control to interpretive project-oriented reprocessing at oil companies and production processing at service companies. ProMAX® is integrated with Halliburton's OpenWorks® Geoscience Oracle Database to index prestack seismic data and populate the database with processed seismic data.

This benchmark evaluates single workflow scalability and multiple workflow throughput of the ProMAX® 3D Prestack Kirchhoff Time Migration (PSTM) while processing the Halliburton benchmark data set containing 70,808 traces with 8 msec sample interval and trace length of 4992 msec. Benchmarks were performed with both one and two consecutive input data sets.

Each workflow consisted of:

  • reading the previously constructed MPEG encoded processing parameter file
  • reading the compressed seismic data traces from disk
  • performing the PSTM imaging
  • writing the result to disk

Workflows using two input data sets were constructed by simply adding a second identical seismic data read task immediately after the first in the processing parameter file. This effectively doubled the data volume read, processed, and written.

This version of ProMAX® currently uses only Parallel Virtual Machine (PVM) as the parallel processing paradigm. The PVM software uses only TCP networking and has no internal facility for assigning memory affinity or processor binding; every compute node runs a PVM daemon.
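
Because PVM itself offers no binding controls, any processor placement has to be applied from outside the application or by the process itself at startup. The sketch below is a generic Linux illustration (not part of ProMAX® or PVM) of a process pinning itself to a core with sched_setaffinity; external tools such as taskset or numactl achieve the same effect without code changes.

    /* Generic illustration (not ProMAX/PVM code): pin the calling process to
     * one core on Linux.  The core number is an arbitrary example. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int pin_to_core(int core)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(core, &set);

        /* pid 0 means "the calling process" */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }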

The ProMAX® processing parameters used for this benchmark:

Minimum output inline = 65
Maximum output inline = 85
Inline output sampling interval = 1
Minimum output xline = 1
Maximum output xline = 200 (fold)
Xline output sampling interval = 1
Antialias inline spacing = 15
Antialias xline spacing = 15
Stretch Mute Aperture Limit with Maximum Stretch = 15
Image Gather Type = Full Offset Image Traces
No Block Moveout
Number of Alias Bands = 10
3D Amplitude Phase Correction
No compression
Maximum Number of Cache Blocks = 500000

Primary PSTM business metrics are typically time-to-solution and accuracy of the subsurface imaging solution.

Key Points and Best Practices

  • Multiple job system throughput scales perfectly; ten concurrent workflows, one per node, each complete in the same time and with the same throughput as a single workflow running on one node.
  • Best single workflow scaling is 6.6x using 10 nodes.

    When tasked with processing several similar workflows, the most efficient way to run them is to fully distribute them, one workflow per node (or even across two nodes), and run them concurrently, rather than using all nodes for each workflow and running the workflows consecutively, even though individual time-to-solution will be longer. For example, while the best-case configuration used here runs 6.6 times faster using all ten nodes compared to a single node, ten such 10-node jobs running consecutively will take over 50% longer overall than ten jobs running concurrently, one per node (10 x T/6.6 ≈ 1.5 T versus T, where T is the single-node runtime).

  • Throughput was seen to scale better with larger workflows. While throughput with large and small workflows is similar on one node, the larger dataset exhibits 11% and 35% more throughput with four and 10 nodes, respectively.

  • 200 processes appears to be a scalability asymptote with these workflows on the systems used.
  • Hyperthreading marginally helps throughput. For the largest model run on 10 nodes, 240 processes delivers 11% more performance than with 120 processes.

  • The workflows do not exhibit significant I/O bandwidth demands. Even with 10 concurrent 24-process jobs, the measured aggregate system I/O did not exceed 100 MB/s.

  • 10 GbE was the only network used and, though shared for all interprocess communication and network attached storage, it appears to have sufficient bandwidth for all test cases run.

See Also

Disclosure Statement

The following are trademarks or registered trademarks of Halliburton/Landmark Graphics: ProMAX®, GeoProbe®, OpenWorks®. Results as of 9/1/2011.

Thursday Sep 15, 2011

Sun Fire X4800 M2 Servers (now known as Sun Server X2-8) Produce World Record on SAP SD-Parallel Benchmark

Oracle delivered a world record result on the SAP enhancement package 4 for SAP ERP 6.0 (Unicode) Sales and Distribution - Parallel (SD Parallel) benchmark, achieving 180,000 users as of 10/03/2011 using eight of Oracle's Sun Fire X4800 M2 servers (now known as Sun Server X2-8), Oracle Solaris 10 and Oracle Database 11g Real Application Clusters (RAC) software.

  • The eight Sun Fire X4800 M2 servers delivered a world record result of 180,000 users on the SAP SD Parallel Benchmark.

  • The eight Sun Fire X4800 M2 server SD Parallel result of 180,000 users delivered 43% more users than the IBM Power 795 server SD two-tier result of 126,063 users.

Performance Landscape

Selected SAP Sales and Distribution (SD) benchmark results are presented in decreasing order of performance. All results used SAP enhancement package 4 for SAP ERP 6.0 (Unicode).

System                                              OS                   Database          Users      SAPS         Type        Cert #
Eight Sun Fire X4800 M2                             Oracle Solaris 10    Oracle 11g RAC    180,000    1,016,380    Parallel    2011037
  each 8 x Intel Xeon E7-8870 @ 2.4 GHz, 512 GB
Six Sun Fire X4800 M2                               Oracle Solaris 10    Oracle 11g RAC    137,904    765,470      Parallel    2011038
  each 8 x Intel Xeon E7-8870 @ 2.4 GHz, 512 GB
IBM Power 795                                       AIX 7.1              DB2 9.7           126,063    688,630      Two-Tier    2010046
  32 x POWER7 @ 4.0 GHz, 4096 GB
Four Sun Fire X4800 M2                              Oracle Solaris 10    Oracle 11g RAC    94,736     546,050      Parallel    2011039
  each 8 x Intel Xeon E7-8870 @ 2.4 GHz, 512 GB
Two Sun Fire X4800 M2                               Oracle Solaris 10    Oracle 11g RAC    49,860     274,080      Parallel    2011040
  each 8 x Intel Xeon E7-8870 @ 2.4 GHz, 512 GB
Four Sun Fire X4470                                 Oracle Solaris 10    Oracle 11g RAC    40,000     221,020      Parallel    2010039
  each 4 x Intel Xeon X7560 @ 2.26 GHz, 256 GB

Complete benchmark results and descriptions can be found at the SAP standard application benchmarks website: see the Two-Tier and Three-Tier pages for SD benchmark results and the SD Parallel page for SD Parallel benchmark results.

Configuration and Results Summary

Hardware Configuration:

8 x Sun Fire X4800 M2 servers, each with
8 x Intel Xeon E7-8870 @ 2.4 GHz (8 processors, 80 cores, 160 threads)
512 GB memory

Software Configuration:

SAP enhancement package 4 for SAP ERP 6.0
Oracle Database 11g Real Application Clusters (RAC)
Oracle Solaris 10

Results Summary:

Number of SAP SD benchmark users:                 180,000
Average dialog response time:                     0.63 seconds
Throughput:
    Fully processed order line items per hour:    20,327,670
    Dialog steps per hour:                        60,983,000
    SAPS:                                         1,016,380
Average database request time (dialog/update):    0.010 sec / 0.055 sec
SAP Certification:                                2011037

Benchmark Description

The SAP Standard Application Sales and Distribution - Parallel (SD Parallel) Benchmark is a two-tier ERP business test that is indicative of full business workloads of complete order processing and invoice processing and demonstrates the ability to run both the application and database software on a single system. The SAP Standard Application SD Benchmark represents the critical tasks performed in real-world ERP business environments.

The SD Parallel Benchmark consists of the same transactions and user interaction steps as the two-tier and three-tier SD Benchmark; it runs the same business processes as the SD Benchmark. The difference between the benchmarks is the technical data distribution. Additionally, the benchmark requires equal distribution of the benchmark users across all database nodes for the benchmark clients used (round-robin method). Following this rule, all database nodes work on data of all clients, which avoids unrealistic configurations such as having only one client per database node. A rough illustration of this round-robin rule is sketched below.
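
The sketch below is illustrative only and is not part of the SAP benchmark kit; the node, client, and user counts are arbitrary assumptions.

    /* Illustrative only (not part of the SAP benchmark kit): spreading each
     * benchmark client's simulated users across all database nodes in a
     * round-robin fashion, so every node works on data of every client. */
    #include <stdio.h>

    int main(void)
    {
        const int num_db_nodes = 4;       /* arbitrary example counts */
        const int num_clients = 3;
        const int users_per_client = 8;

        for (int client = 0; client < num_clients; client++) {
            for (int user = 0; user < users_per_client; user++) {
                int db_node = user % num_db_nodes;   /* round-robin within each client */
                printf("client %d, user %d -> database node %d\n",
                       client, user, db_node);
            }
        }
        return 0;
    }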

The SAP Benchmark Council agreed to give the parallel benchmark a different name so that the difference can be easily recognized by any interested parties - customers, prospects, and analysts. The naming convention is SD Parallel for Sales & Distribution - Parallel.

SAP is one of the premier world-wide ERP application providers, and maintains a suite of benchmark tests to demonstrate the performance of competitive systems on the various SAP products.

See Also

Disclosure Statement

SAP enhancement package 4 for SAP ERP 6.0 (Unicode) Sales and Distribution Benchmark, results as of 10/03/2011.

SD Parallel, 8 x Sun Fire X4800 M2 (each 8 processors, 80 cores, 160 threads) 180,000 SAP SD Users, Oracle Solaris 10, Oracle 11g Real Application Clusters (RAC), Certification Number 2011037.
SD Parallel, 6 x Sun Fire X4800 M2 (each 8 processors, 80 cores, 160 threads) 137,904 SAP SD Users, Oracle Solaris 10, Oracle 11g Real Application Clusters (RAC), Certification Number 2011038.
SD Parallel, 4 x Sun Fire X4470 (each 4 processors, 32 cores, 64 threads) 40,000 SAP SD Users, Oracle Solaris 10, Oracle 11g Real Application Clusters (RAC), Certification Number 2010039.
SD Two-Tier, IBM Power 795 (32 processors, 256 cores, 1024 threads) 126,063 SAP SD Users, AIX 7.1, DB2 9.7, Certification Number 2010046.

SAP, R/3 are registered trademarks of SAP AG in Germany and other countries. More information may be found at www.sap.com/benchmark.

Friday Jul 01, 2011

SPARC T3-1 Record Results Running JD Edwards EnterpriseOne Day in the Life Benchmark with Added Batch Component

Using Oracle's SPARC T3-1 server for the application tier and Oracle's SPARC Enterprise M3000 server for the database tier, a world record result was produced running Oracle's JD Edwards EnterpriseOne applications Day in the Life benchmark concurrently with a batch workload.

  • The SPARC T3-1 server based result has 25% better performance than the IBM Power 750 POWER7 server even though the IBM result did not include running a batch component.

  • The SPARC T3-1 server based result has 25% better space/performance than the IBM Power 750 POWER7 server as measured by the online component.

  • The SPARC T3-1 server based result is 5x faster than the x86-based IBM x3650 M2 server system when executing the online component of the JD Edwards EnterpriseOne 9.0.1 Day in the Life benchmark. The IBM result did not include a batch component.

  • The SPARC T3-1 server based result has 2.5x better space/performance than the x86-based IBM x3650 M2 server as measured by the online component.

  • The combination of SPARC T3-1 and SPARC Enterprise M3000 servers delivered a Day in the Life benchmark result of 5000 online users with 0.875 seconds of average transaction response time running concurrently with 19 Universal Batch Engine (UBE) processes at 10 UBEs/minute. The solution exercises various JD Edwards EnterpriseOne applications while running Oracle WebLogic Server 11g Release 1 and Oracle Web Tier Utilities 11g HTTP server in Oracle Solaris Containers, together with the Oracle Database 11g Release 2.

  • The SPARC T3-1 server showed that it could handle the additional workload of batch processing while maintaining the same number of online users for the JD Edwards EnterpriseOne Day in the Life benchmark. This was accomplished with minimal loss in response time.

  • JD Edwards EnterpriseOne 9.0.1 takes advantage of the large number of compute threads available in the SPARC T3-1 server at the application tier and achieves excellent response times.

  • The SPARC T3-1 server consolidates the application/web tier of the JD Edwards EnterpriseOne 9.0.1 application using Oracle Solaris Containers. Containers provide flexibility, easier maintenance and better CPU utilization of the server leaving processing capacity for additional growth.

  • A number of advanced Oracle technologies and features were used to obtain this result: Oracle Solaris 10, Oracle Solaris Containers, Oracle Java HotSpot Server VM, Oracle WebLogic Server 11g Release 1, Oracle Web Tier Utilities 11g, Oracle Database 11g Release 2, and the SPARC T3 and SPARC64 VII+ based servers.

  • This is the first published result running both the online and batch workloads concurrently on the JD Edwards Enterprise Application server. No published results are available from IBM running the online component together with a batch workload.

  • The 9.0.1 version of the benchmark shows some minor performance improvements relative to 9.0. When comparing 9.0.1 and 9.0 results, the reader should take this into account when the difference between results is small.

Performance Landscape

JD Edwards EnterpriseOne Day in the Life Benchmark
Online with Batch Workload

This is the first publication on the Day in the Life benchmark run concurrently with batch jobs. The batch workload was provided by Oracle's Universal Batch Engine.

System                                              Rack     Online    Resp Time    Batch Concur    Batch Rate    Version
                                                    Units    Users     (sec)        (# of UBEs)     (UBEs/m)
SPARC T3-1, 1 x SPARC T3 (1.65 GHz), Solaris 10     4        5000      0.88         19              10            9.0.1
M3000, 1 x SPARC64 VII+ (2.86 GHz), Solaris 10

Resp Time (sec) — Response time of online jobs reported in seconds
Batch Concur (# of UBEs) — Batch concurrency presented in the number of UBEs
Batch Rate (UBEs/m) — Batch transaction rate in UBEs/minute.

JD Edwards EnterpriseOne Day in the Life Benchmark
Online Workload Only

These results are for the Day in the Life benchmark. They are run without any batch workload.

System                                              Rack     Online    Response      Version
                                                    Units    Users     Time (sec)
SPARC T3-1, 1 x SPARC T3 (1.65 GHz), Solaris 10     4        5000      0.52          9.0.1
M3000, 1 x SPARC64 VII (2.75 GHz), Solaris 10

IBM Power 750, 1 x POWER7 (3.55 GHz), IBM i7.1      4        4000      0.61          9.0

IBM x3650 M2, 2 x Intel X5570 (2.93 GHz), OVM       2        1000      0.29          9.0

IBM result from http://www-03.ibm.com/systems/i/advantages/oracle/; IBM used WebSphere.

Configuration Summary

Hardware Configuration:

1 x SPARC T3-1 server
1 x 1.65 GHz SPARC T3
128 GB memory
16 x 300 GB 10000 RPM SAS
1 x Sun Flash Accelerator F20 PCIe Card, 96 GB
1 x 10 GbE NIC
1 x SPARC Enterprise M3000 server
1 x 2.86 GHz SPARC64 VII+
64 GB memory
1 x 10 GbE NIC
2 x StorageTek 2540 + 2501

Software Configuration:

JD Edwards EnterpriseOne 9.0.1 with Tools 8.98.3.3
Oracle Database 11g Release 2
Oracle WebLogic Server 11g Release 1 (10.3.2)
Oracle Web Tier Utilities 11g
Oracle Solaris 10 9/10
Mercury LoadRunner 9.10 with Oracle Day in the Life kit for JD Edwards EnterpriseOne 9.0.1
Oracle’s Universal Batch Engine - Short UBEs and Long UBEs

Benchmark Description

JD Edwards EnterpriseOne is an integrated applications suite of Enterprise Resource Planning (ERP) software. Oracle offers 70 JD Edwards EnterpriseOne application modules to support a diverse set of business operations.

Oracle's Day in the Life (DIL) kit is a suite of scripts that exercises most common transactions of JD Edwards EnterpriseOne applications, including business processes such as payroll, sales order, purchase order, work order, and other manufacturing processes, such as ship confirmation. These are labeled by industry acronyms such as SCM, CRM, HCM, SRM and FMS. The kit's scripts execute transactions typical of a mid-sized manufacturing company.

  • The workload consists of online transactions and the UBE workload of 15 short and 4 long UBEs.

  • LoadRunner runs the DIL workload, collects the users' transaction response times, and reports the key metric of Combined Weighted Average Transaction Response time.

  • The UBE workload runs from the JD Edwards Enterprise Application server.

    • Oracle's UBE processes come in three flavors:

      • Short UBEs (< 1 minute) perform Business Report and Summary Analysis,
      • Mid UBEs (> 1 minute) create a large report of Account, Balance, and Full Address,
      • Long UBEs (> 2 minutes) simulate Payroll, Sales Order, and night-only jobs.
    • The UBE workload generates a large number of PDF report files and log files.

    • The UBE queues are categorized as QBATCHD, a single-threaded queue for large UBEs, and QPROCESS, a queue in which short UBEs run concurrently.

  • One of the Oracle Solaris Containers ran 4 Long UBEs, while another Container ran 15 short UBEs concurrently.

  • The mixed-size UBEs ran concurrently from the SPARC T3-1 server together with the 5000 online users driven by LoadRunner.

  • Oracle's UBE process performance metric is the number of maximum concurrent UBE processes at a given transaction rate, in UBEs/minute.

Key Points and Best Practices

Two JD Edwards EnterpriseOne Application Servers and two Oracle Fusion Middleware WebLogic Servers 11g R1 coupled with two Oracle Fusion Middleware 11g Web Tier HTTP Server instances on the SPARC T3-1 server were hosted in four separate Oracle Solaris Containers to demonstrate consolidation of multiple application and web servers.

See Also

Disclosure Statement

Copyright 2011, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 6/27/2011.

Wednesday Mar 23, 2011

SPARC T3-1B Doubles Performance on Oracle Fusion Middleware WebLogic Avitek Medical Records Sample Application

The Oracle WebLogic Server 11g software was used to demonstrate the performance of the Avitek Medical Records sample application. A configuration using SPARC T3-1B and SPARC Enterprise M5000 servers from Oracle showed excellent scaling across different configurations as well as doubling the performance of the previous generation SPARC blade.

  • A SPARC T3-1B server, running a typical real-world J2EE application on Oracle WebLogic Server 11g, together with a SPARC Enterprise M5000 server running the Oracle database, delivered 2.1 times the transactional throughput of the previous generation UltraSPARC T2 processor based Sun Blade T6320 server module.

  • The SPARC T3-1B server shows linear scaling as the number of SPARC T3 processor cores used in the SPARC T3-1B server module is doubled.

  • The Avitek Medical Records application instances were deployed in Oracle Solaris zones on the SPARC T3-1B server, allowing for flexible, scalable and lightweight architecture of the application tier.

Performance Landscape

Performance for the application tier is presented. Results are the maximum transactions per second (TPS).

Server             Processor                              Memory    Maximum TPS
SPARC T3-1B        1 x SPARC T3, 1.65 GHz, 16 cores       128 GB    28,156
SPARC T3-1B        1 x SPARC T3, 1.65 GHz, 8 cores        128 GB    14,030
Sun Blade T6320    1 x UltraSPARC T2, 1.4 GHz, 8 cores    64 GB     13,386

The same SPARC Enterprise M5000 server from Oracle was used in each case as the database server. Internal disk storage was used.

Configuration Summary

Hardware Configuration:

1 x SPARC T3-1B
1 x 1.65 GHz SPARC T3
128 GB memory

1 x Sun Blade T6320
1 x 1.4 GHz UltraSPARC T2
64 GB memory

1 x SPARC Enterprise M5000
8 x 2.53 GHz SPARC64 VII
128 GB memory

Software Configuration:

Avitek Medical Records
Oracle Database 10g Release 2
Oracle WebLogic Server 11g R1 version 10.3.3 (Oracle Fusion Middleware)
Oracle Solaris 10 9/10
HP Mercury LoadRunner 9.5

Benchmark Description

Avitek Medical Records (or MedRec) is an Oracle WebLogic Server 11g sample application suite that demonstrates all aspects of the J2EE platform. MedRec showcases the use of each J2EE component, and illustrates best practice design patterns for component interaction and client development. Oracle WebLogic server 11g is a key component of Oracle Fusion Middleware 11g.

The MedRec application provides a framework for patients, doctors, and administrators to manage patient data using a variety of different clients. Patient data includes:

  • Patient profile information: A patient's name, address, social security number, and log-in information.

  • Patient medical records: Details about a patient's visit with a physician, such as the patient's vital signs and symptoms as well as the physician's diagnosis and prescriptions.

MedRec comprises two main Java EE applications supporting different user scenarios:

medrecEar – Patients log in to the web application (patientWebApp) to register or edit their profile. Patients can also view medical records from their prior visits. Administrators use the web application (adminWebApp) to approve or deny new patient profile requests. medrecEar also provides all of the controller and business logic used by the MedRec application suite, as well as the Web Service used by different clients.

physicianEar – Physicians and nurses log in to the web application (physicianWebApp) to search and access patient profiles, create and review medical records, and prescribe medicine to patients. The physician application is designed to communicate using the Web Service provided in medrecEar.

The medrecEar and physicianEar applications are deployed to an Oracle WebLogic Server 11g instance called MedRecServer. The physicianEar application communicates with the controller components of medrecEar using Web Services.

The workload injected into the MedRec applications measures the average transactions per second for the following sequence:

  1. A client opens page http://{host}:7011/Start.jsp (MedRec)
  2. Patient completes the registration process
  3. Administrator logs in, approves the patient profile, and logs out
  4. Physician connects to the online system and logs in
  5. Physician performs a search for a patient and looks up the patient's visit information
  6. Physician logs out
  7. Patient logs in and reviews the profile
  8. Patient makes changes to the profile and updates the information
  9. Patient logs out

Each of the above steps constitutes a single transaction.

Key Points and Best Practices

Please see the Oracle documentation on the Oracle Technical Network for tuning your Oracle WebLogic Server 11g deployment.

See Also

Disclosure Statement

Copyright 2011, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 3/22/2011.

Thursday Feb 17, 2011

SPARC T3-1 takes JD Edwards "Day In the Life" benchmark lead, beats IBM Power7 by 25%

Oracle's SPARC T3-1 server, running the application, together with Oracle's SPARC Enterprise M3000 server running the database, have achieved a record result of 5000 users, with 0.523 seconds of average transaction response time, for the online component of the "Day in the Life" JD Edwards EnterpriseOne benchmark.

  • The "Day in the Life" benchmark tests the Oracle JD Edwards EnterpriseOne applications, running Oracle Fusion Middleware WebLogic Server 11g R1, Oracle Fusion Middleware Web Tier Utilities 11g HTTP server and JD Edwards EnterpriseOne 9.0.1 in Oracle Solaris Containers, together with the Oracle Database 11g Release 2.

  • The SPARC T3-1 server is 25% faster and has better response time than the IBM Power 750 POWER7 system when executing the online component of the JD Edwards EnterpriseOne 9.0.1 Day in the Life test.

  • The SPARC T3-1 server had 25% better space/performance than the IBM Power 750 POWER7 server.

  • The SPARC T3-1 server is 5x faster than the x86-based IBM x3650 M2 server when executing the online component of the JD Edwards EnterpriseOne 9.0.1 Day in the Life test.

  • The SPARC T3-1 server had 2.5x better space/performance than the x86-based IBM x3650 M2 server.

  • The SPARC T3-1 server consolidated the application/web tier of the JD Edwards EnterpriseOne 9.0.1 application using Oracle Solaris Containers. Containers provide flexibility, easier maintenance and better CPU utilization of the server leaving processing capacity for additional growth.

  • The SPARC Enterprise M3000 server provides enterprise class RAS features for customers deploying the Oracle 11g Release 2 database software.

  • To obtain this leading result, a number of advanced Oracle technologies and features were used: Oracle Solaris 10, Oracle Solaris Containers, Oracle Java HotSpot Server VM, Oracle Fusion Middleware WebLogic Server 11g R1, Oracle Fusion Middleware Web Tier Utilities 11g, Oracle Database 11g Release 2, and the SPARC T3 and SPARC64 VII based servers.

Performance Landscape

JD Edwards EnterpriseOne DIL Online Component Performance Chart

System                                   Memory    OS            #users    JD Edwards    Rack     Response
                                         (GB)                              Version       Units    Time (sec)
SPARC T3-1, 1 x 1.65 GHz SPARC T3        128       Solaris 10    5000      9.0.1         2U       0.523
IBM Power 750, 1 x 3.55 GHz POWER7 *     120       IBM i7.1      4000      9.0           4U       0.61
IBM Power 570, 4 x 4.2 GHz POWER6        128       IBM i6.1      2400      8.12          4U       1.129
IBM x3650 M2, 2 x 2.93 GHz X5570         64        OVM           1000      9.0           2U       0.29

* IBM result from http://www-03.ibm.com/systems/i/advantages/oracle/; IBM used WebSphere

Configuration Summary

Hardware Configuration:

1 x SPARC T3-1 server
1 x 1.65 GHz SPARC T3
128 GB memory
16 x 300 GB 10000 RPM SAS
1 x 1 GbE NIC
1 x SPARC Enterprise M3000
1 x 2.75 GHz SPARC64 VII
64 GB memory
1 x 1 GbE NIC
2 x StorageTek 2540/2501

Software Configuration:

JD Edwards EnterpriseOne 9.0.1 with Tools 8.98.3.3
Oracle Database 11g Release 2
Oracle Fusion Middleware WebLogic Server 11g R1 (10.3.2)
Oracle Fusion Middleware Web Tier Utilities 11g
Oracle Solaris 10 9/10
Mercury LoadRunner 9.10 with Oracle DIL kit for JD Edwards EnterpriseOne 9.0 update 1

Benchmark Description

Oracle's JD Edwards EnterpriseOne is an integrated applications suite of Enterprise Resource Planning software.

  • Oracle offers 70 JD Edwards EnterpriseOne application modules to support a diverse set of business operations.
  • Oracle's Day in the Life (DIL) kit is a suite of scripts that exercises the most common transactions of JD Edwards EnterpriseOne applications, including business processes such as payroll, sales order, purchase order, work order, and other manufacturing processes such as ship confirmation. These are labeled by industry acronyms such as SCM, CRM, HCM, SRM and FMS.
  • Oracle's DIL kit's scripts execute transactions typical of a mid-sized manufacturing company.
  • The workload consists of online transactions. It does not include the batch processing job components.
  • LoadRunner is used to run the workload and collect the users' transaction response times for increasing numbers of users, from 500 to 5000.
  • The key metric used to evaluate performance is the transaction response time, which is reported by LoadRunner.

Key Points and Best Practices

Two JD Edwards EnterpriseOne and two Oracle Fusion Middleware WebLogic Servers 11g R1 coupled with two Fusion Middleware 11g Web Tier HTTP Servers instances on the SPARC T3-1 server were hosted in four separate Oracle Solaris Containers to demonstrate consolidation of multiple application and web servers.

  • Each Oracle Solaris Container was bound to a separate processor set, with 40 virtual processors allocated to each EnterpriseOne Server container, 16 virtual processors allocated to each WebServer container, and 16 to the default set. This was done to improve performance by using the physical memory closest to the processors, thereby reducing memory access latency and processor cross calls. The default processor set was used for network and disk interrupt handling.

  • The applications were executed in the FX scheduling class to improve performance by reducing the frequency of context switches.

  • A WebLogic vertical cluster was configured in each WebServer container with seven managed instances each, to load balance users' requests and to provide the infrastructure that enables scaling to a high number of users with ease of deployment and high availability.

  • The database server was run in an Oracle Solaris Container hosted on the Oracle's SPARC Enterprise M3000 server.

See Also

Disclosure Statement

Copyright 2011, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 2/16/2011.

Wednesday Dec 08, 2010

Sun Blade X6275 M2 Cluster with Sun Storage 7410 Performance Running Seismic Processing Reverse Time Migration

This Oil & Gas benchmark highlights both the computational performance improvements of the Sun Blade X6275 M2 server module over the previous generation server and the linear scalability achievable for total application throughput using a Sun Storage 7410 system to deliver almost 2 GB/sec effective write performance.

Oracle's Sun Storage 7410 system attached via 10 Gigabit Ethernet to a cluster of Oracle's Sun Blade X6275 M2 server modules was used to demonstrate the performance of a 3D VTI Reverse Time Migration application, a heavily used geophysical imaging and modeling application for Oil & Gas Exploration. The total application throughput scaling and computational kernel performance improvements are presented for imaging two production sized grids using 800 input samples.

  • The Sun Blade X6275 M2 server module showed up to a 40% performance improvement over the previous generation server module with super-linear scalability to 16 nodes for the 9-Point Stencil used in this Reverse Time Migration computational kernel.

  • The balanced combination of Oracle's Sun Storage 7410 system over 10 GbE to the Sun Blade X6275 M2 server module cluster showed linear scalability for the total application throughput, including the I/O and MPI communication, to produce a final 3-D seismic depth imaged cube for interpretation.

  • The final image write from the Sun Blade X6275 M2 server module nodes to Oracle's Sun Storage 7410 system achieved 10 GbE line speed of 1.25 GBytes/second or better. The effects of I/O buffer caching on the Sun Blade X6275 M2 server module nodes and the 34 GByte write optimized cache on the Sun Storage 7410 system gave up to 1.8 GBytes/second effective write performance.

Performance Landscape

Server Generational Performance Improvements

Performance improvements for the Reverse Time Migration computational kernel using a Sun Blade X6275 M2 cluster are compared to the previous generation Sun Blade X6275 cluster. Hyper-threading was enabled for both configurations allowing 24 OpenMP threads for the Sun Blade X6275 M2 server module nodes and 16 for the Sun Blade X6275 server module nodes.

Sun Blade X6275 M2 Performance Improvements

                  Grid Size: 1243 x 1151 x 1231                    Grid Size: 2486 x 1151 x 1231
Number    X6275 Kernel    X6275 M2 Kernel    X6275 M2      X6275 Kernel    X6275 M2 Kernel    X6275 M2
Nodes     Time (sec)      Time (sec)         Speedup       Time (sec)      Time (sec)         Speedup
16        306             242                1.3           728             576                1.3
14        355             271                1.3           814             679                1.2
12        435             346                1.3           945             797                1.2
10        541             390                1.4           1156            890                1.3
8         726             555                1.3           1511            1193               1.3

Application Scaling

Performance and scaling results of the total application, including I/O, for the reverse time migration demonstration application are presented. Results were obtained using a Sun Blade X6275 M2 server cluster with a Sun Storage 7410 system for the file server. The servers were running with hyperthreading enabled, allowing for 24 OpenMP threads per server node.

Application Scaling Across Multiple Nodes

                  Grid Size: 1243 x 1151 x 1231                              Grid Size: 2486 x 1151 x 1231
Number    Total Time    Kernel Time    Total      Kernel          Total Time    Kernel Time    Total      Kernel
Nodes     (sec)         (sec)          Speedup    Speedup         (sec)         (sec)          Speedup    Speedup
16        501           242            2.1*       2.3*            1060          576            2.0        2.1*
14        583           271            1.8        2.0             1219          679            1.7        1.8
12        681           346            1.6        1.6             1420          797            1.5        1.5
10        807           390            1.3        1.4             1688          890            1.2        1.3
8         1058          555            1.0        1.0             2085          1193           1.0        1.0

* Super-linear scaling due to the compute kernel fitting better into available cache for larger node counts

Image File Effective Write Performance

The performance for writing the final 3D image from a Sun Blade X6275 M2 server cluster over 10 Gigabit Ethernet to a Sun Storage 7410 system is presented. Each server node allocated one core for MPI I/O, thus allowing 22 OpenMP compute threads per node with hyperthreading enabled. Captured performance analytics from the Sun Storage 7410 system indicate effective use of its 34 Gigabyte write optimized cache.

Image File Effective Write Performance

          Grid Size: 1243 x 1151 x 1231              Grid Size: 2486 x 1151 x 1231
Number    Write Time    Write Performance            Write Time    Write Performance
Nodes     (sec)         (GB/sec)                     (sec)         (GB/sec)
16        4.8           1.5                          10.2          1.4
14        5.0           1.4                          10.2          1.4
12        4.0           1.8                          11.3          1.3
10        4.3           1.6                          9.1           1.6
8         4.6           1.5                          9.7           1.5

Note: Performance results better than 1.3 GB/sec are related to I/O buffer caching on the server nodes.

Configuration Summary

Hardware Configuration:

8 x Sun Blade X6275 M2 server modules (2 nodes each, 16 nodes total), each node with
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory (12 x 4 GB at 1333 MHz)
1 x QDR InfiniBand Host Channel Adapter

Sun Datacenter InfiniBand Switch IB-36
Sun Network 10 GbE Switch 72p

Sun Storage 7410 system connected via 10 Gigabit Ethernet
4 x 17 GB STEC ZeusIOPs SSD mirrored - 34 GB
40 x 750 GB 7500 RPM Seagate SATA disks mirrored - 14.4 TB
No L2ARC Readzilla Cache

Software Configuration:

Oracle Enterprise Linux Server release 5.5
Oracle Message Passing Toolkit 8.2.1c (for MPI)
Oracle Solaris Studio 12.2 C++, Fortran, OpenMP

Benchmark Description

This Vertical Transverse Isotropy (VTI) Anisotropic Reverse Time Depth Migration (RTM) application measures the total time it takes to image 800 samples of various production size grids and write the final image to disk for the next workflow step involving 3-D seismic volume interpretation. In doing so, it reports the compute, interprocessor communication, and I/O performance of the individual functions that comprise the total solution. Unlike most references for Reverse Time Migration, which focus solely on the performance of the 3D stencil compute kernel, this demonstration code additionally reports the total throughput involved in processing large data sets with a full 3D Anisotropic RTM application. It provides valuable insight into configuration and sizing for specific seismic processing requirements. The performance effects of new processors, interconnects, I/O subsystems, and software technologies can be evaluated while solving a real exploration business problem.

This benchmark study uses the "in-core" implementation of this demonstration code, where each node reads in only the trace, velocity, and conditioning data to be processed by that node, plus a 4-element array pad (based on spatial order 8) shared with its neighbors to the left and right during the initialization phase. It maintains previous, current, and next wavefield state information for each of the source, receiver, and anisotropic wavefields in memory. The second and third grid dimensions used in this benchmark are specifically chosen to be prime numbers to exaggerate the effects of data alignment. Algorithm adaptations for processing higher orders in space and alternative "out-of-core" solutions using SSDs for wave state checkpointing are implemented in this demonstration application to better understand the effects of problem size scaling. Care is taken to handle absorption boundary conditioning and a variety of imaging conditions appropriately.

RTM Application Structure:

Read Processing Parameter File, Determine Domain Decomposition, and Initialize Data Structures, and Allocate Memory.

Read Velocity, Epsilon, and Delta Data Based on Domain Decomposition and create source, receiver, & anisotropic previous, current, and next wave states.

First Loop over Time Steps

Compute 3D Stencil for Source Wavefield (a,s) - 8th order in space, 2nd order in time (a simplified sketch of this update appears after this outline)
Propagate over Time to Create s(t,z,y,x) & a(t,z,y,x)
Inject Estimated Source Wavelet
Apply Absorption Boundary Conditioning (a)
Update Wavefield States and Pointers
Write Snapshot of Wavefield (out-of-core) or Push Wavefield onto Stack (in-core)
Communicate Boundary Information

Second Loop over Time Steps
Compute 3D Stencil for Receiver Wavefield (a,r) - 8th order in space, 2nd order in time
Propagate over Time to Create r(t,z,y,x) & a(t,z,y,x)
Read Receiver Trace and Inject Receiver Wavelet
Apply Absorption Boundary Conditioning (a)
Update Wavefield States and Pointers
Communicate Boundary Information
Read in Source Wavefield Snapshot (out-of-core) or Pop Off of Stack (in-core)
Cross-correlate Source and Receiver Wavefields
Update image using image conditioning parameters

Write 3D Depth Image i(z,x,y) = Sum over time steps s(t,z,x,y) * r(t,z,x,y), or other imaging conditions.
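
To make the two-pass structure concrete, the following minimal C++ sketch mirrors the loops above for the in-core case. The helper functions, data layout, and names are hypothetical placeholders (their bodies are intentionally empty), and details such as the MPI halo exchange and wavelet injection are only stubbed; this illustrates the control flow and the imaging condition, and is not the demonstration code itself.

    // Minimal single-node sketch of the two-pass, in-core RTM loop outlined above.
    // All helper names are hypothetical placeholders with empty bodies; the real
    // demonstration code also performs MPI halo exchange, absorption handling,
    // and wavelet injection at each step.
    #include <cstddef>
    #include <vector>

    struct Wavefield { std::vector<float> prev, curr, next; };

    void stencil_8th_order(Wavefield&, const std::vector<float>&) { /* 8th order in space, 2nd in time */ }
    void inject_source(Wavefield&, int)         { /* add estimated source wavelet */ }
    void inject_receiver_trace(Wavefield&, int) { /* read and add receiver trace */ }
    void absorb_boundaries(Wavefield&)          { /* absorption boundary conditioning */ }
    void exchange_halos(Wavefield&)             { /* MPI boundary communication */ }

    void rotate_states(Wavefield& w) { std::swap(w.prev, w.curr); std::swap(w.curr, w.next); }

    void rtm_in_core(Wavefield& src, Wavefield& rcv, const std::vector<float>& vel,
                     std::vector<float>& image, int nt)
    {
        std::vector<std::vector<float>> stack;           // in-core source snapshots

        // First loop over time steps: forward-propagate the source wavefield.
        for (int t = 0; t < nt; ++t) {
            stencil_8th_order(src, vel);
            inject_source(src, t);
            absorb_boundaries(src);
            rotate_states(src);                          // update wavefield states and pointers
            stack.push_back(src.curr);                   // push snapshot onto stack (in-core)
            exchange_halos(src);
        }

        // Second loop over time steps: back-propagate the receiver wavefield,
        // pop the matching source snapshot, and cross-correlate into the image.
        for (int t = nt - 1; t >= 0; --t) {
            stencil_8th_order(rcv, vel);
            inject_receiver_trace(rcv, t);
            absorb_boundaries(rcv);
            rotate_states(rcv);
            exchange_halos(rcv);
            const std::vector<float>& s = stack.back();  // pop source snapshot off stack
            for (std::size_t i = 0; i < image.size(); ++i)
                image[i] += s[i] * rcv.curr[i];          // imaging condition: i += s(t) * r(t)
            stack.pop_back();
        }
    }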

Key Points and Best Practices

This demonstration application represents a full Reverse Time Migration solution. Many references to the RTM application tend to focus on the compute kernel and ignore the complexity that the input, communication, and output bring to the task.

Image File MPI Write Performance Tuning

Changing the image file write from MPI non-blocking to MPI blocking calls, and setting Oracle Message Passing Toolkit MPI environment variables, yielded an 18x improvement in write performance to the Sun Storage 7410 system:

    86.8 to 4.8 seconds for the 1243 x 1151 x 1231 grid size
    183.1 to 10.2 seconds for the 2486 x 1151 x 1231 grid size

Sun Storage 7410 analytics data captured with Swat indicated an initial write performance of about 100 MB/sec with the MPI non-blocking implementation. After changing to MPI blocking writes, Swat showed between 1.3 and 1.8 GB/sec, with up to 13,000 write ops/sec, while writing the final output image. The Swat results are consistent with the actual measured performance and provide valuable insight into the Reverse Time Migration application's I/O performance.

The reason for this vast improvement has to do with whether the MPI file mode is sequential or not (MPI_MODE_SEQUENTIAL, O_SYNC, O_DSYNC). The MPI non-blocking routines, MPI_File_iwrite_at and MPI_Wait, typically used for overlapping I/O and computation, do not support sequential file access mode. Therefore, the application could not take full performance advantage of the Sun Storage 7410 system's write-optimized cache. In contrast, the MPI blocking routine, MPI_File_write_at, defaults to MPI sequential mode, and the performance advantages of the write-optimized cache are realized. Since the final image is written at the end of RTM execution, there is no need to overlap this I/O with computation.
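
The change amounts to replacing the non-blocking write/wait pair with a single blocking call per rank. The sketch below shows the general shape of such a write; the file name, slab size, and offsets are illustrative assumptions and it is not the demonstration code's actual I/O routine.

    // Sketch: each MPI rank writes its slab of the final 3D image with the
    // blocking call MPI_File_write_at instead of MPI_File_iwrite_at + MPI_Wait.
    // Slab size (78 x-slices per node, 16-node case) and file name are illustrative.
    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const MPI_Offset slab_floats = 78LL * 1151 * 1231;   // this rank's share of i(z,y,x)
        std::vector<float> slab(slab_floats, 0.0f);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "final_image.bin",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        // Blocking write at this rank's byte offset in the shared image file.
        MPI_Offset offset = rank * slab_floats * (MPI_Offset)sizeof(float);
        MPI_File_write_at(fh, offset, slab.data(), (int)slab_floats, MPI_FLOAT,
                          MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }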

Additional MPI parameters used:

    setenv SUNW_MP_PROCBIND true
    setenv MPI_SPIN 1
    setenv MPI_PROC_BIND 1

Adjusting the Level of Multithreading for Performance

The level of multithreading (8, 10, 12, 22, or 24 threads) for the various components of the RTM should be adjustable based on the type of computation taking place. It is best to use the OpenMP num_threads clause to set the level of multithreading for each particular work task, and to use numactl to specify how the threads are allocated to cores in accordance with the OpenMP parallelism level.
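
As an illustration, the num_threads clause can be attached per parallel region so that a compute-bound phase and a lighter boundary phase run with different thread counts. The task names and thread counts below are placeholders, not values taken from the demonstration code.

    // Illustrative use of the OpenMP num_threads clause to give different RTM
    // work tasks different thread counts (names and counts are placeholders).
    #include <omp.h>

    void stencil_slice(int iz)  { /* compute-heavy stencil work for slice iz */ }
    void boundary_slice(int iz) { /* lighter absorption/boundary work for slice iz */ }

    void rtm_step(int nz)
    {
        // Compute-bound phase: use all 24 hardware threads (hyper-threading on).
        #pragma omp parallel for num_threads(24) schedule(static)
        for (int iz = 0; iz < nz; ++iz)
            stencil_slice(iz);

        // Memory/boundary-bound phase: fewer threads may be preferable here.
        #pragma omp parallel for num_threads(12) schedule(static)
        for (int iz = 0; iz < nz; ++iz)
            boundary_slice(iz);
    }

On the launch side, wrapping the MPI processes with numactl (for example, numactl --cpunodebind=<sockets>) keeps the OpenMP threads on the intended cores; the exact binding depends on the node topology and the chosen parallelism level.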

See Also

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 12/07/2010.

Sun Blade X6275 M2 Delivers Best Fluent (MCAE Application) Performance on Tested Configurations

This Manufacturing Engineering benchmark highlights the performance advantage the Sun Blade X6275 M2 server module offers over IBM, Cray, and SGI solutions as shown by the ANSYS FLUENT fluid dynamics application.

A cluster of eight of Oracle's Sun Blade X6275 M2 server modules delivered outstanding performance running the FLUENT 12 benchmark test suite.

  • The Sun Blade X6275 M2 server module cluster delivered the best results in all 36 of the test configurations run, outperforming the best posted results by as much as 42%.
  • The Sun Blade X6275 M2 server module demonstrated up to 76% performance improvement over the previous generation Sun Blade X6275 server module.

Performance Landscape

In the following tables, results are "Ratings" (bigger is better).
Rating = No. of sequential runs of test case possible in 1 day: 86,400/(Total Elapsed Run Time in Seconds)

The following table compares results on the basis of core count, irrespective of processor generation. This means that in some cases, i.e., for the 32-core and 64-core configurations, systems with the Intel Xeon X5670 six-core processors did not utilize quite all of the cores available for the specified processor count.


FLUENT 12 Benchmark Test Suite
Competitive Comparisons
A "-" indicates no posted result for that test case.

                                        Benchmark Test Case Ratings
                                        eddy     turbo    aircraft  sedan    truck   truck_poly
System              Processors  Cores   417k     500k     2m        4m       14m     14m

Sun Blade X6275 M2      16        96    9340.5   39272.7  8307.7    8533.3   903.8   786.9
Best Posted             24        96    -        -        -         7562.4   797.0   712.9
Best Posted             16        96    7337.6   33553.4  6533.1    5989.6   739.1   683.5

Sun Blade X6275 M2      11        64    6306.6   27212.6  5592.2    5158.2   568.8   518.9
Best Posted             16        64    5556.3   26381.7  5494.4    4902.1   566.6   518.6

Sun Blade X6275 M2       8        48    4620.3   19093.9  4080.3    3251.2   376.0   359.4
Best Posted              8        48    4494.1   18989.0  3990.8    3185.3   372.7   354.5

Sun Blade X6275 M2       6        32    4061.1   15091.7  3275.8    3013.1   299.5   267.8
Best Posted              8        32    3404.9   14832.6  3211.9    2630.1   286.7   266.7

Sun Blade X6275 M2       4        24    2751.6   10441.1  2161.4    1907.3   188.2   182.5
Best Posted              6        24    1458.2    9626.7  1820.9    1747.2   185.1   180.8
Best Posted              4        24    2565.7   10164.7  2109.9    1608.2   187.1   180.8

Sun Blade X6275 M2       2        12    1429.9    5358.1  1097.5     813.2    95.9    95.9
Best Posted              2        12    1338.0    5308.8  1073.3     808.6    92.9    94.4



The following table compares results on the basis of processor count showing inter-generational processor performance improvement.


FLUENT 12 Benchmark Test Suite
Intergenerational Comparisons

                                        Benchmark Test Case Ratings
                                        eddy     turbo    aircraft  sedan    truck   truck_poly
System              Processors  Cores   417k     500k     2m        4m       14m     14m

Sun Blade X6275 M2      16        96    9340.5   39272.7  8307.7    8533.3   903.8   786.9
Sun Blade X6275         16        64    5308.8   26790.7  5574.2    5074.9   547.2   525.2
X6275 M2 : X6275        16              1.76     1.47     1.49      1.68     1.65    1.50

Sun Blade X6275 M2       8        48    4620.3   19093.9  4080.3    3251.2   376.0   359.4
Sun Blade X6275          8        32    3066.5   13768.9  3066.5    2602.4   289.0   270.3
X6275 M2 : X6275         8              1.51     1.39     1.33      1.25     1.30    1.33

Sun Blade X6275 M2       4        24    2751.6   10441.1  2161.4    1907.3   188.2   182.5
Sun Blade X6275          4        16    1714.3    7545.9  1519.1    1345.8   144.4   141.8
X6275 M2 : X6275         4              1.61     1.38     1.42      1.42     1.30    1.29

Sun Blade X6275 M2       2        12    1429.9    5358.1  1097.5     813.2    95.9    95.9
Sun Blade X6275          2         8     931.8    4061.1   827.2     681.5    73.0    73.8
X6275 M2 : X6275         2               1.53     1.32     1.33      1.19     1.31    1.30

Configuration Summary

Hardware Configuration:

8 x Sun Blade X6275 M2 server modules, each with
4 Intel Xeon X5670 2.93 GHz processors, turbo enabled
96 GB memory 1333 MHz
2 x 24 GB SATA-based Sun Flash Modules
2 x QDR InfiniBand Host Channel Adapter
Sun Datacenter InfiniBand Switch IB-36

Software Configuration:

Oracle Enterprise Linux Server 5.5
ANSYS FLUENT V12.1.2
ANSYS FLUENT Benchmark Test Suite

Benchmark Description

The following description is from the ANSYS FLUENT website:

The FLUENT benchmark suite comprises a set of test cases covering a large range of mesh sizes, physical models, and solvers, representing typical industry usage. The cases range in size from a few hundred thousand cells to more than 100 million cells. Both the segregated and coupled implicit solvers are included, as well as hexahedral, mixed, and polyhedral cell cases. This broad coverage is expected to demonstrate the breadth of FLUENT performance on a variety of hardware platforms and test cases.

The performance of a CFD code will depend on several factors, including size and topology of the mesh, physical models, numerics and parallelization, compilers and optimization, in addition to performance characteristics of the hardware where the simulation is performed. The principal objective of this benchmark suite is to provide comprehensive and fair comparative information of the performance of FLUENT on available hardware platforms.

About the ANSYS FLUENT 12 Benchmark Test Suite

    CFD models tend to be very large because grid refinement is required to accurately capture conditions in the boundary layer region adjacent to the body over which flow occurs. Fine grids are also required to resolve turbulence accurately. As a result, these models can run for many hours or even days, even when using a large number of processors.

Key Points and Best Practices

  • ANSYS FLUENT has not yet been certified by the vendor on Oracle Enterprise Linux (OEL). However, the ANSYS FLUENT benchmark tests have been run successfully on Oracle hardware running OEL as is (i.e. with NO changes or modifications).
  • The performance improvement of the Sun Blade X6275 M2 server module over the previous generation Sun Blade X6275 server module was due to two main factors: the increased core count per processor (6 vs. 4), and the more optimal, iterative dataset partitioning scheme used for the Sun Blade X6275 M2 server module.

See Also

Disclosure Statement

All information on the FLUENT website (http://www.fluent.com) is Copyrighted 1995-2010 by ANSYS Inc. Results as of December 06, 2010.

Thursday Dec 02, 2010

World Record TPC-C Result on Oracle's SPARC Supercluster with T3-4 Servers

Oracle demonstrated the world's fastest database performance using a cluster of 27 of Oracle's SPARC T3-4 servers, 138 Sun Storage F5100 Flash Array storage systems, and Oracle Database 11g Release 2 Enterprise Edition with Real Application Clusters (RAC) and Partitioning, delivering a world-record TPC-C benchmark result.

  • The SPARC T3-4 server cluster delivered a world record TPC-C benchmark result of 30,249,688 tpmC and $1.01 $/tpmC (USD) using Oracle Database 11g Release 2 on a configuration available 6/1/2011.

  • The SPARC T3-4 server cluster is 2.9x faster than the performance of the IBM Power 780 (POWER7 3.86 GHz) cluster with IBM DB2 9.7 database and has 27% better price/performance on the TPC-C benchmark. Almost identical price discount levels were applied by Oracle and IBM.

  • The Oracle solution has three times better performance than the IBM configuration and only used twice the power during the run of the TPC-C benchmark.  (Based upon IBM's own claims of energy usage from their August 17, 2010 press release.)

  • The Oracle solution delivered 2.9x the performance in only 71% of the space compared to the IBM TPC-C benchmark result.

  • The SPARC T3-4 server with Sun Storage F5100 Flash Array storage solution demonstrates 3.2x faster response time than IBM Power 780 (POWER7 3.86 GHz) result on the TPC-C benchmark.

  • Oracle used a single-image database, whereas IBM used 96 separate database partitions on their 3-node cluster. It is interesting to note that IBM used 32 database images per server instead of running each server as a simple SMP.

  • IBM did not use DB2 Enterprise Database, but instead IBM used "DB2 InfoSphere Warehouse 9.7" which is a data warehouse and data management product and not their flagship OLTP product.

  • The multi-node SPARC T3-4 server cluster is 7.4x faster than the HP Superdome (1.6 GHz Itanium2) solution and has 66% better price/performance on the TPC-C benchmark.

  • The Oracle solution utilized Oracle's Sun FlashFire technology to deliver this result. The Sun Storage F5100 Flash Array storage system was used for database storage.

  • Oracle Database 11g Enterprise Edition Release 2 with Real Application Clusters and Partitioning scales and effectively uses all of the nodes in this configuration to produce the world record TPC-C benchmark performance.

  • This result showed Oracle's integrated hardware and software stacks provide industry leading performance.

Performance Landscape

TPC-C results (sorted by tpmC, bigger is better)

System                   tpmC        Price/tpmC  Avail     Database        Cluster  Racks
27 x SPARC T3-4          30,249,688  1.01 USD    6/1/2011  Oracle 11g RAC  Y        15
3 x IBM Power 780        10,366,254  1.38 USD    10/13/10  DB2 9.7         Y        10
HP Integrity Superdome    4,092,799  2.93 USD    08/06/07  Oracle 10g R2   N        46

Avail - Availability date
Racks - Clients, servers, storage, infrastructure

Oracle and IBM TPC-C Response times

System                                tpmC        New Order 90th%       New Order Average
                                                  Response Time (sec)   Response Time (sec)
27 x SPARC T3-4                       30,249,688  0.750                 0.352
3 x IBM Power 780                     10,366,254  2.1                   1.137
Response Time Ratio - Oracle Better   2.9x        2.8x                  3.2x

Oracle uses Average New Order Response time for comparison between Oracle and IBM.

Graphs of Oracle's and IBM's response times for New-Order can be found in the full disclosure reports on TPC's website TPC-C Official Result Page.

Configuration Summary and Results

Hardware Configuration:

15 racks used to hold

Servers
27 x SPARC T3-4 servers, each with
4 x 1.65 GHz SPARC T3 processors
512 GB memory
3 x 300 GB 10K RPM 2.5" SAS disks

Data Storage
69 x Sun Fire X4270 M2 servers configured as COMSTAR heads, each with
1 x 2.93 GHz Intel Xeon X5670 processor
8 GB memory
9 x 2 TB 7.2K RPM 3.5" SAS disks
2 x Sun Storage F5100 Flash Array storage (1.92 TB each)
1 x Brocade DCX switch

Redo Storage
28 x Sun Fire X4270 M2 servers configured as COMSTAR heads, each with
1 x 2.93 GHz Intel Xeon X5670 processor
8 GB memory
11 x 2 TB 7.2K RPM 3.5" SAS disks
2 x Brocade 5300 switches

Clients
81 x Sun Fire X4170 M2 servers, each with
2 x 2.93 GHz Intel X5670 processors
48 GB memory
2 x 146 GB 10K RPM 2.5" SAS disks

Software Configuration:

Oracle Solaris 10 9/10 (for SPARC T3-4 and Sun Fire X4170 M2)
Oracle Solaris 11 Express (COMSTAR for Sun Fire X4270 M2)
Oracle Database 11g Release 2 Enterprise Edition with Real Application Clusters and Partitioning
Oracle iPlanet Web Server 7.0 U5
Tuxedo CFS-R Tier 1

Results:

System 27 x SPARC T3-4
tpmC 30,249,688
Price/tpmC 1.01 USD
Avail 6/1/2011
Database Oracle Database 11g RAC
Cluster yes
Racks 15
New Order Ave Response 0.352 seconds

Benchmark Description

TPC-C is an OLTP system benchmark. It simulates a complete environment where a population of terminal operators executes transactions against a database. The benchmark is centered around the principal activities (transactions) of an order-entry environment. These transactions include entering and delivering orders, recording payments, checking the status of orders, and monitoring the level of stock at the warehouses.

Key Points and Best Practices

  • Oracle Database 11g Release 2 Enterprise Edition with Real Application Clusters and Partitioning scales easily to this high level of performance.

  • Sun Storage F5100 Flash Array storage provides high performance, very low latency, and very high storage density.

  • COMSTAR (Common Multiprotocol SCSI Target), new in Oracle Solaris 11 Express, is the software framework that enables a Solaris host to serve as a SCSI Target platform. COMSTAR uses a modular approach to break the huge task of handling all the different pieces in a SCSI target subsystem into independent functional modules which are glued together by the SCSI Target Mode Framework (STMF). The modules implementing functionality at SCSI level (disk, tape, medium changer etc.) are not required to know about the underlying transport. And the modules implementing the transport protocol (FC, iSCSI, etc.) are not aware of the SCSI-level functionality of the packets they are transporting. The framework hides the details of allocation providing execution context and cleanup of SCSI commands and associated resources and simplifies the task of writing the SCSI or transport modules.

  • Oracle iPlanet Web Server 7.0 U5 is used in the user tier of the benchmark, with each web server instance supporting more than a quarter-million users while satisfying the stringent response time requirement of the TPC-C benchmark.

See Also

Disclosure Statement

TPC Benchmark C, tpmC, and TPC-C are trademarks of the Transaction Processing Performance Council (TPC). 27-node SPARC T3-4 Cluster (4 x 1.65 GHz SPARC T3 processors) with Oracle Database 11g Release 2 Enterprise Edition with Real Application Clusters and Partitioning, 30,249,688 tpmC, $1.01/tpmC, Available 6/1/2011. IBM Power 780 Cluster (3 nodes using 3.86 GHz POWER7 processors) with IBM DB2 InfoSphere Warehouse Ent. Base Ed. 9.7, 10,366,254 tpmC, $1.38 USD/tpmC, available 10/13/2010. HP Integrity Superdome(1.6GHz Itanium2, 64 processors, 128 cores, 256 threads) with Oracle 10g Enterprise Edition, 4,092,799 tpmC, $2.93/tpmC, available 8/06/07. Energy claims based upon IBM calculations and internal measurements. Source: http://www.tpc.org/tpcc, results as of 11/22/2010

World Record Performance on PeopleSoft Enterprise Financials Benchmark run on Sun SPARC Enterprise M4000 and M5000

Oracle's Sun SPARC Enterprise M4000 and M5000 servers have combined to produce a world record result on Oracle's PeopleSoft Enterprise Financial Management 9.0 benchmark.

  • The Sun SPARC Enterprise M4000 and M5000 servers configured with SPARC64 VII+ processors along with Oracle's Sun Storage F5100 Flash Array system achieved a world record result using PeopleSoft Enterprise Financial Management and Oracle Database 11g Release 2 software running on the Oracle Solaris 10 operating system.

  • The PeopleSoft Enterprise Financial Management solution processed online business transactions to support 1000 concurrent users using 32 application server threads with compliant response times while simultaneously completing complex batch jobs in record time.

  • The Sun Storage F5100 Flash Array system is a high performance, high-density solid-state flash array which provides a read latency of only 0.5 msec which is about 10 times faster than the normal disk latencies of 5 msec measured on this benchmark.

  • The Sun SPARC Enterprise M4000 and M5000 servers were able to process online users and concurrent batch jobs simultaneously in 34.72 minutes on this benchmark, which reflects a complex, multi-tier environment and utilizes a large back-end database of nearly 1 TB.

  • The combination of Oracle's PeopleSoft Enterprise Financial Management 9.00.00.331, PeopleSoft PeopleTools 8.49.23 and Oracle WebLogic Server was run on the Sun SPARC Enterprise M4000 server, and Oracle Database 11g Release 2 was run on the Sun SPARC Enterprise M5000 server for this benchmark.

Performance Landscape

The following table lists the current result and the single previously disclosed result for this benchmark. Results are elapsed times, so smaller numbers are better.

Servers                       CPU                    Tier     Batch (mins)  Batch w/Online (mins)
Sun SPARC Enterprise M4000    2.66 GHz SPARC64 VII+  Web/App  33.09         34.72
Sun SPARC Enterprise M5000    2.66 GHz SPARC64 VII+  DB

SPARC T3-1                    1.65 GHz SPARC T3      Web/App  35.82         37.01
Sun SPARC Enterprise M5000    2.5 GHz SPARC64 VII    DB

Configuration Summary

Web/Application Tier Configuration:

1 x Sun SPARC Enterprise M4000
4 x 2.66 GHz SPARC64 VII+ processors
128 GB of memory

Database Tier Configuration:

1 x Sun SPARC Enterprise M5000
8 x 2.66 GHz SPARC64 VII+ processors
128 GB of memory
1 x Sun Storage F5100 Flash Array (74 x 24 GB FMODs)
2 x StorageTek 2540 (12 x 146 GB SAS 15K RPM)
1 x StorageTek 2501 (12 x 146 GB SAS 15K RPM)
1 x Dual-Port SAS Fibre Channel Host Bus Adapters (HBA)

Software Configurations:

Oracle Solaris 10 10/09
PeopleSoft Enterprise Financial Management/SCM 9.00.00.311 64-bit
PeopleSoft Enterprise (PeopleTools) 8.49.23 64-bit
Oracle Database 11g Release 2 11.1.0.6 64-bit
Oracle Tuxedo 9.1 RP36 with Jolt 9.1
Micro Focus COBOL Server Express 4.0 SP4 64-bit

Benchmark Description

This Day-in-the-Life benchmark measured the concurrent batch and online performance for a large database model. This scenario more accurately represents a production environment where users and scheduled batch jobs must run concurrently. This benchmark measured performance results during a Close-the-Books process.

The PeopleSoft Enterprise Financials 9 batch processes included in this benchmark are as follows:

  • Journal Generator: (AE) This process creates journals from accounting entries (AE) generated from various data sources, including non-PeopleSoft systems as well as PeopleSoft applications. In the benchmark, the Journal Generator (FS_JGEN) process is set up to create accounting entries from Oracle's PeopleSoft applications in the same database, such as PeopleSoft Enterprise Payables, Receivables, Asset Management, Expenses, Cash Management. The process is run with the option of Edit and Post turned on to edit and post the journals created by Journal generator. Journal Edit is an AE program and Post is a COBOL program.

  • Allocation: (AE) This process allocates balances held or accumulated in one or more entities to more than one business unit, department or other entities based on user-defined rules.

  • Journal Edit & Post: (AE & COBOL) Journal Edit validates journal transactions before posting them to the ledger. This validation ensures that journals are valid, for example: valid ChartField values and combinations, debits and credits equal, and inter/intra-unit balanced. The Journal Post process posts only valid, edited journals, ensures each journal line posts to the appropriate target detail ledgers, and then changes the journal's status to posted. In this benchmark, Journal Edit & Post is also set up to edit and post data from Oracle's PeopleSoft applications in another database, such as PeopleSoft Enterprise Payroll data.

  • Summary Ledger: (AE) Summary Ledger processing summarizes detail ledger data across selected GL BUs. Summary Ledgers can be generated for reporting purposes or used in consolidations.

  • Consolidations: (COBOL) Consolidation processing summarizes ledger balances and generates elimination journal entries across business units based on user-defined rules.

  • SQR & nVision Reporting: Reporting will consist of nVision and SQR reports. A balance sheet, an income statement, and a trial balance will be generated for each GL BU by SQR processes GLS7002 and GLS7012. The consolidated results of the nVision reports are run by 10 nVision users using 4 standard delivered report request definitions such as BALANCE, INCOME, CONSBAL, and DEPTINC. Each of the nVision users will have ownership over 10 Business Units and each of the nVision users will submit multiple runs that are being executed in parallel to generate a total of 40 nVision reports.

Batch processes are run concurrently with more than 1000 emulated users executing 30 pre-defined online applications. Response times for the online applications are collected and must conform to a maximum time.

Key Points and Best Practices

The Sun SPARC Enterprise M4000 and M5000 servers were able to process online users and concurrent batch jobs simultaneously in 34.72 minutes.

The Sun Storage F5100 Flash Array system, which is highly tuned for IOPS, contributed to the result through reduced IO latency.

The family of Sun SPARC Enterprise M-series servers, with Sun Storage F5100 Flash Array systems, form an ideal environment for hosting complex multi-tier applications. This is the second public disclosure of any system running this benchmark.

The Sun SPARC Enterprise M4000 server hosted the web and application server tiers providing good response time to emulated user requests. The benchmark specification allows 1000 users, but there is headroom for increased load.

The Sun SPARC Enterprise M5000 server was used for the database server along with a Sun Storage F5100 Flash Array system. The speed of the M-series server with the low latency of the Flash Array provided the overall low latency for user requests, even while completing complex batch jobs.

Despite the systems being lightly loaded, the increased frequency of the SPARC64 VII+ processors yielded lower latencies and faster elapsed times than previously disclosed results.

The low latency of the Sun Storage F5100 Flash Array storage contributed to the excellent response times of emulated users by making data quickly available to the database back-end. The array was configured as several RAID 0 volumes and data was distributed across the volumes, maximizing storage bandwidth.

The transaction processing capacity of the Sun SPARC Enterprise M5000 server enabled very fast batch processing times while supporting over 1000 online users.

While running the maximum workload specified by the benchmark, the systems were lightly loaded, providing headroom to grow.

Please see the white paper for information on PeopleSoft payroll best practices using flash.

See Also

Disclosure Statement

Oracle's PeopleSoft Financials 9.0 benchmark, Oracle's Sun SPARC Enterprise M4000 (4 2.66 SPARC64 VII+), Oracle's Sun SPARC Enterprise M5000 (8 2.66 SPARC64 VII+), 34.72 min. Results as of 12/02/2010, see www.oracle.com/apps_benchmark/html/white-papers-peoplesoft.html for more about PeopleSoft.

Tuesday Oct 26, 2010

3D VTI Reverse Time Migration Scalability On Sun Fire X2270-M2 Cluster with Sun Storage 7210

This Oil & Gas benchmark shows the Sun Storage 7210 system delivers almost 2 GB/sec bandwidth and realizes near-linear scaling performance on a cluster of 16 Sun Fire X2270 M2 servers.

Oracle's Sun Storage 7210 system attached via QDR InfiniBand to a cluster of sixteen of Oracle's Sun Fire X2270 M2 servers was used to demonstrate the performance of a Reverse Time Migration application, an important application in the Oil & Gas industry. The total application throughput and computational kernel scaling are presented for two production sized grids of 800 samples.

  • Both the Reverse Time Migration I/O and the combined computation show near-linear scaling from 8 to 16 nodes on the Sun Storage 7210 system connected via QDR InfiniBand to a Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231: 2.0x improvement
      2486 x 1151 x 1231: 1.7x improvement
  • The computational kernel of the Reverse Time Migration has linear to super-linear scaling from 8 to 16 nodes in Oracle's Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231 : 2.2x improvement
      2486 x 1151 x 1231 : 2.0x improvement
  • Intel Hyper-Threading provides additional performance benefits to both the Reverse Time Migration I/O and computation when going from 12 to 24 OpenMP threads on the Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231: 8% - computational kernel; 2% - total application throughput
      2486 x 1151 x 1231: 12% - computational kernel; 6% - total application throughput
  • The Sun Storage 7210 system delivers the Velocity, Epsilon, and Delta data to the Reverse Time Migration at a steady rate even when timing includes memory initialization and data object creation:

      1243 x 1151 x 1231: 1.4 to 1.6 GBytes/sec
      2486 x 1151 x 1231: 1.2 to 1.3 GBytes/sec

    One can see that when doubling the size of the problem, the additional complexity of overlapping I/O and multiple node file contention only produces a small reduction in read performance.

Performance Landscape

Application Scaling

Performance and scaling results of the total application, including I/O, for the reverse time migration demonstration application are presented. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server. The servers were running with hyperthreading enabled, allowing for 24 OpenMP threads per server.

Application Scaling Across Multiple Nodes

          Grid Size 1243 x 1151 x 1231                  Grid Size 2486 x 1151 x 1231
Number    Total       Kernel      Total     Kernel      Total       Kernel      Total     Kernel
Nodes     Time (sec)  Time (sec)  Speedup   Speedup     Time (sec)  Time (sec)  Speedup   Speedup
16         504         259        2.0       2.2*        1024         551        1.7       2.0
14         565         279        1.8       2.0         1191         677        1.5       1.6
12         662         343        1.6       1.6         1426         817        1.2       1.4
10         784         394        1.3       1.4         1501         856        1.2       1.3
 8        1024         560        1.0       1.0         1745        1108        1.0       1.0

* Super-linear scaling due to the compute kernel fitting better into available cache

Application Scaling – Hyper-Threading Study

The effects of hyper-threading are presented when running the reverse time migration demonstration application. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server.

Hyper-Threading Comparison - 12 versus 24 OpenMP Threads

                      Grid Size 1243 x 1151 x 1231                  Grid Size 2486 x 1151 x 1231
Number    Threads     Total       Kernel      Total     Kernel      Total       Kernel      Total     Kernel
Nodes     per Node    Time (sec)  Time (sec)  HT Spdup  HT Spdup    Time (sec)  Time (sec)  HT Spdup  HT Spdup
16          24         504         259        1.02      1.08        1024         551        1.06      1.12
16          12         515         279        1.00      1.00        1088         616        1.00      1.00

Read Performance

Read performance is presented for the velocity, epsilon and delta files running the reverse time migration demonstration application. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server. The servers were running with hyperthreading enabled, allowing for 24 OpenMP threads per server.

Velocity, Epsilon, and Delta File Read and Memory Initialization Performance

                        Grid Size 1243 x 1151 x 1231                   Grid Size 2486 x 1151 x 1231
Number    Overlap       Time    Time Relative  Total GBytes  Read Rate  Time    Time Relative  Total GBytes  Read Rate
Nodes     MBytes Read   (sec)   to 8-node      Read          (GB/s)     (sec)   to 8-node      Read          (GB/s)
16        2040          16.7    1.1            23.2          1.4        36.8    1.1            44.3          1.2
 8         951          14.8    1.0            22.1          1.6        33.0    1.0            43.2          1.3

Configuration Summary

Hardware Configuration:

16 x Sun Fire X2270 M2 servers, each with
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory (12 x 4 GB at 1333 MHz)

Sun Storage 7210 system connected via QDR InfiniBand
2 x 18 GB SATA SSD (logzilla)
40 x 1 TB 7200 RPM SATA disks

Software Configuration:

SUSE Linux Enterprise Server SLES 10 SP 2
Oracle Message Passing Toolkit 8.2.1 (for MPI)
Sun Studio 12 Update 1 C++, Fortran, OpenMP

Benchmark Description

This Reverse Time Migration (RTM) demonstration application measures the total time it takes to image 800 samples of various production-size grids and write the final image to disk. In this version, each node reads in only the trace, velocity, and conditioning data to be processed by that node, plus a four-element inline 3-D array pad (spatial order of eight) shared with its neighbors to the left and right, during the initialization phase. It represents a full RTM application including the data input, computation, communication, and final output image to be used by the next workflow step involving 3D volumetric seismic interpretation.

Key Points and Best Practices

This demonstration application represents a full Reverse Time Migration solution. Many references to the RTM application tend to focus on the compute kernel and ignore the complexity that the input, communication, and output bring to the task.

I/O Characterization without Optimal Checkpointing

Velocity, Epsilon, and Delta Files - Grid Reading

The additional amount of overlapping reads to share velocity, epsilon, and delta edge data with neighbors can be calculated using the following equation:

    (number_nodes - 1) x (order_in_space) x (y_dimension) x (z_dimension) x (4 bytes) x (3 files)

For this particular benchmark study, the additional 3-D pad overlap for the 16 and 8 node cases is:

    16 nodes: 15 x 8 x 1151 x 1231 x 4 x 3 = 2.04 GB extra
    8 nodes: 7 x 8 x 1151 x 1231 x 4 x 3 = 0.95 GB extra

For the first of the two test cases, the total size of the three files used for the 1243 x 1151 x 1231 case is

    1243 x 1151 x 1231 x 4 bytes = 7.05 GB per file x 3 files = 21.13 GB

With the additional 3-D pad, the total amount of data read is:

    16 nodes: 2.04 GB + 21.13 GB = 23.2 GB
    8 nodes: 0.95 GB + 21.13 GB = 22.1 GB

For the second of the two test cases, the total size of the three files used for the 2486 x 1151 x 1231 case is

    2486 x 1151 x 1231 x 4 bytes = 14.09 GB per file x 3 files = 42.27 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: 2.04 GB + 42.27 GB = 44.3 GB
    8 nodes: 0.95 GB + 42.27 GB = 43.2 GB

Note that the amount of overlapping data read increases, not only by the number of nodes, but as the y dimension and/or the z dimension increases.

Trace Reading

The additional amount of overlapping reads to share trace edge data with neighbors for can be calculated using the following equation:

    (number_nodes - 1) x (order_in_space) x (y_dimension) x (4 bytes) x (number_of_time_slices)

For this particular benchmark study, the additional overlap for the 16 and 8 node cases is:

    16 nodes: 15 x 8 x 1151 x 4 x 800 = 442MB extra
    8 nodes: 7 x 8 x 1151 x 4 x 800 = 206MB extra

For the first case the size of the trace data file used for the 1243 x 1151 x 1231 case is

    1243 x 1151 x 4 bytes x 800 = 4.578 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: .442 GB + 4.578 GB = 5.0 GB
    8 nodes: .206 GB + 4.578 GB = 4.8 GB

For the second case the size of the trace data file used for the 2486 x 1151 x 1231 case is

    2486 x 1151 x 4 bytes x 800 = 9.156 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: .442 GB + 9.156 GB = 9.6 GB
    8 nodes: .206 GB + 9.156 GB = 9.4 GB

As the number of nodes is increased, the overlap causes more disk lock contention.
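
As a quick cross-check of the two overlap formulas above, the short C++ program below reproduces the grid and trace read totals quoted for the 8- and 16-node cases. The constants come directly from the text; the program is purely an illustrative calculation and is not part of the demonstration application.

    // Cross-check of the grid and trace read-size arithmetic above.
    // Constants (grid dimensions, spatial order, time slices) come from the text.
    #include <cstdio>
    #include <initializer_list>

    int main()
    {
        const double GB = 1e9;                      // decimal GB, as used in the text
        const double ny = 1151, nz = 1231, order = 8, nt = 800;

        for (double nx : {1243.0, 2486.0}) {
            for (int nodes : {8, 16}) {
                // Velocity, epsilon, and delta grids (3 files) plus shared pad overlap
                double grid_base    = nx * ny * nz * 4 * 3;
                double grid_overlap = (nodes - 1) * order * ny * nz * 4 * 3;
                // Trace file plus shared pad overlap
                double trace_base    = nx * ny * 4 * nt;
                double trace_overlap = (nodes - 1) * order * ny * 4 * nt;

                std::printf("grid %4.0f x %.0f x %.0f, %2d nodes: grids %.1f GB, traces %.1f GB\n",
                            nx, ny, nz, nodes,
                            (grid_base + grid_overlap) / GB,
                            (trace_base + trace_overlap) / GB);
            }
        }
        return 0;
    }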

Writing Final Output Image

1243x1151x1231 - 7.1 GB per file:

    16 nodes: 78 x 1151 x 1231 x 4 = 442MB/node (7.1 GB total)
    8 nodes: 156 x 1151 x 1231 x 4 = 884MB/node (7.1 GB total)

2486x1151x1231 - 14.1 GB per file:

    16 nodes: 156 x 1151 x 1231 x 4 = 930 MB/node (14.1 GB total)
    8 nodes: 311 x 1151 x 1231 x 4 = 1808 MB/node (14.1 GB total)

Resource Allocation

It is best to allocate one node as the Oracle Grid Engine resource scheduler and MPI master host. This is especially true when running with 24 OpenMP threads in hyperthreading mode to avoid oversubscribing a node that is cooperating in delivering the solution.

See Also

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 10/20/2010.

Tuesday Sep 28, 2010

SPARC T3-2 Delivers First Oracle E-Business X-Large Benchmark Self-Service (OLTP) Result

With Oracle's SPARC T3-2 server running the application and Oracle's Sun SPARC Enterprise M5000 server running the database, Oracle set a world record result for the Oracle E-Business Standard X-Large HR Self-Service (OLTP) benchmark.

  • The combination of a SPARC T3-2 server for the application and a Sun SPARC Enterprise M5000 server for the database achieved a result of 4000 HR Self-Service Online users on the Oracle E-Business X-Large benchmark dataset.

  • Oracle's Sun Storage F5100 Flash Array storage, which was utilized in the benchmark, was instrumental in obtaining an average transaction response time as low as 1.2 seconds.

  • Oracle has published the first Oracle E-Business R12.1.2 XL benchmark for 4000 HR Self-Service online users on a SPARC T3-2 server for the application tier and a Sun SPARC Enterprise M5000 server on database tier with Oracle 11g R2 database. Both servers ran with the Oracle Solaris 10 operating system.

  • The combination of the SPARC T3-2 server and Oracle E-Business R12.1.2 in the application tier with low CPU utilization provides headroom for growth.

  • The Sun Storage F5100 Flash Array storage provides higher performance with smaller footprint and lower power/cooling costs.

  • The result shows that the SPARC T3-2 server works well as a high capacity application server.

Performance Landscape

This is the FIRST published result for this X-large benchmark.

Workload: HR Self-Service, X-Large Configuration

Vendor/System  OS                      Users
SPARC T3-2     Oracle Solaris 10 9/10  4,000

Results and Configuration Summary

Application Tier Configuration:

1 x SPARC T3-2 server
2 x SPARC T3 processors, 1.65 GHz
128 GB memory
Oracle Solaris 10 9/10
Oracle E-Business Suite 12.1.2

Database Tier Configuration:

1 x Sun SPARC Enterprise M5000 server
4 x SPARC64 VII processors, 2.53 GHz
128 GB memory
Oracle Solaris 10 10/09
Oracle Database 11g Release 2

Storage Configuration:

1 x Sun Storage F5100 Flash Array storage
1 x StorageTek 2540 array
300 GB

Benchmark Description

The Oracle R12 E-Business Standard Benchmark combines online transaction execution by simulated users with concurrent batch processing to model a typical scenario for a global enterprise. This benchmark includes one online component and 2 batch components. The goal is to obtain reference response times and throughput for Oracle EBS R12. Results can be published in four configurations:

  • X-large: Maximum online users running all business flows ranging from 10,000 to 20,000; 750,000 order to cash lines per hour and 250,000 payroll checks per hour.
    • HR Self-Service Online -- 4000 users
      • The percentage across the 4 transactions in HR Self-Service module is:
        • Create Query Cash Expense -- 20%
        • Create Query Credit Expense -- 20%
        • View Payslip -- 30%
        • Create TimeCard -- 30%
    • Customer Support Flow -- 8000 users
    • Procure to Pay -- 2000 users
    • Order to Cash -- 2400 users
  • Large: 10,000 online users; 100,000 order to cash lines per hour and 100,000 payroll checks per hour.
  • Medium: up to 3000 online users; 50,000 order to cash lines per hour and 10,000 payroll checks per hour.
  • Small: up to 1000 online users; 10,000 order to cash lines per hour and 5,000 payroll checks per hour.

See Also

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 9/20/2010.

Monday Sep 27, 2010

Sun Fire X2270 M2 Super-Linear Scaling of Hadoop Terasort and CloudBurst Benchmarks

A 16-node cluster of Oracle's Sun Fire X2270 M2 servers showed super-linear scaling on two Hadoop benchmarks. Performance was measured using the Terasort benchmark with a 100 GB data set. In addition, performance was measured using CloudBurst, which maps next-generation "short read" sequence data onto the human and other genomes.

  • On the Terasort workload, a 16-node Sun Fire X2270 M2 cluster sorted the 100GB data set at a rate of 433.3 MB/s finishing in 236.3 seconds.

  • The 16-node Sun Fire X2270 M2 cluster was 9.3x faster on a per node basis than the 2010 winner of the Terasort benchmark competition (www.sortbenchmark.org) which used a 3,452-node Xeon cluster to sort 100 TB of input data in 173 minutes. Both systems used Hadoop, Terasort and 2-socket x86 servers. Allowances have to be made for the differences in problem complexity.

  • The Terasort benchmark showed super-linear scaling on the Sun Fire X2270 M2 cluster (total of 32 Intel 2.93GHz Xeons).

  • Using Cloudburst on a workload of the human genome and the SRR001113 short read data set, a 16-node Sun Fire X2270 M2 cluster finished mapping the short reads onto the human genome in 34.2 minutes.

  • On a per node basis, a 2-node Sun Fire X2270 M2 cluster was 1.7x faster than a 12-node Xeon cluster that processed the human genome and the SRR001113 short read data set in approximately 60,000 seconds (see figure 3 of this journal article). Both systems used Hadoop, CloudBurst and x86 servers.

Performance Landscape

Terasort
100 GB input data set
Performance is "real" execution time reported by /usr/bin/time in seconds (smaller is better)

Number of Nodes  Seconds   Scaling  Linear Scaling
16                 236.3    25.4    16
 8                 466.3    12.9     8
 4                 927.2     6.5     4
 2                2140.8     2.8     2
 1                6010.2     1.0     1

CloudBurst
SRR001113 short read data set mapped onto the hs_ref_GRCh37 human genome
Performance is "total running" time reported by CloudBurst in seconds (smaller is better)

Number of Nodes  Seconds   Scaling  Linear Scaling
16                2054.9    8.4     8
 8                3615.8    4.7     4
 4                7895.7    2.2     2
 2               17155.1    1.0     1

Results and Configuration Summary

Hardware Configuration:

16 x Sun Fire X2270 M2 servers, each with
2 Intel Xeon X5670 2.93GHz processors, turbo enabled
96 GB memory 1066 MHz
HDD SATA 1 TB 7200 RPM 3.5-in.
2 x 10/100/1000 ethernet

Software Configuration:

Oracle Solaris 10 10/09
Java Platform, Standard Edition, JDK 6 Update 20 Performance Release
Hadoop v0.20.2

Benchmark Description

The Apache Hadoop middleware is the Yahoo implementation of Google's MapReduce. MapReduce permits the programmer to write serial code that the framework schedules for parallel execution. MapReduce has been applied to a wide variety of problems, including image processing, sorting, database merging, and genomics.

Hadoop uses the Hadoop Distributed Filesystem (HDFS) that distributes data across the local disks of a cluster such that each node in the cluster accesses its local disk to the greatest extent possible.

Results for two different Hadoop benchmarks are reported above:

  • Terasort is an I/O intensive benchmark that was originally developed by Jim Gray. By having many Hadoop data nodes, it is possible to achieve high I/O capacity. For purposes of benchmarking, the Teragen program was used to create an input data set that comprised 100 GB.

  • CloudBurst is a genome assembly benchmark that was developed by Michael Schatz, previously of the University of Maryland and presently of Cold Springs Harbor Laboratory. CloudBurst maps what is known as DNA short read data onto a reference genome. For purposes of benchmarking, the SRR001113 short read data set is mapped onto the hs_ref_GRCh37 sequence data for all chromosomes of the human genome. Specifically, the hs_ref_GRCh37 FASTA files for chromosomes 1, 2, ... 21, 22, X and Y were catenated in that order to obtain one large FASTA file that represented all chromosomes of the human genome. For purposes of benchmarking, any DNA fragment from the SRR001113 short read data set that contained more than three mismatches was ignored.

See Also

Disclosure Statement

Hadoop, see http://hadoop.apache.org/ for more information. Results as of 9/20/2010.