Thursday Nov 08, 2012

Improved Performance on PeopleSoft Combined Benchmark using SPARC T4-4

Oracle's SPARC T4-4 server running Oracle's PeopleSoft HCM 9.1 combined online and batch benchmark achieved a world record 18,000 concurrent users, with subsecond response times, while executing a PeopleSoft Payroll batch job for 500,000 employees in 32.4 minutes.

  • This result was obtained with a SPARC T4-4 server running Oracle Database 11g Release 2, a SPARC T4-4 server running the PeopleSoft HCM 9.1 application server, and a SPARC T4-2 server running Oracle WebLogic Server in the web tier.

  • The SPARC T4-4 server running the application tier used Oracle Solaris Zones which provide a flexible, scalable and manageable virtualization environment.

  • The average CPU utilization was 17% on the SPARC T4-2 server in the web tier, 59% on the SPARC T4-4 server in the application tier, and 47% (online and batch) on the SPARC T4-4 server in the database tier, leaving significant headroom for additional processing across the three tiers.

  • The SPARC T4-4 server used for the database tier hosted Oracle Database 11g Release 2 using Oracle Automatic Storage Management (ASM) for database file management, with I/O performance equivalent to raw devices.

Performance Landscape

Results are presented for the PeopleSoft HRMS Self-Service and Payroll combined benchmark. The new result with 128 streams shows significant improvement in the payroll batch processing time with little impact on the self-service component response time.

PeopleSoft HRMS Self-Service and Payroll Benchmark

Systems              Users    Ave Response   Ave Response   Batch        Streams
                              Search (sec)   Save (sec)     Time (min)
SPARC T4-2 (web)     18,000   0.988          0.539          32.4         128
SPARC T4-4 (app)
SPARC T4-4 (db)

SPARC T4-2 (web)     18,000   0.944          0.503          43.3         64
SPARC T4-4 (app)
SPARC T4-4 (db)

The following results are for the PeopleSoft HRMS Self-Service benchmark that was previously run. The results are not directly comparable with the combined results because they do not include the payroll component.

PeopleSoft HRMS Self-Service 9.1 Benchmark

Systems              Users    Ave Response   Ave Response   Batch        Streams
                              Search (sec)   Save (sec)     Time (min)
SPARC T4-2 (web)     18,000   1.048          0.742          N/A          N/A
SPARC T4-4 (app)
2x SPARC T4-2 (db)

The following results are for the PeopleSoft Payroll benchmark that was previously run. The results are not directly comparable with the combined results because they do not include the self-service component.

PeopleSoft Payroll (N.A.) 9.1 - 500K Employees (7 Million SQL PayCalc, Unicode)

Systems              Users    Ave Response   Ave Response   Batch        Streams
                              Search (sec)   Save (sec)     Time (min)
SPARC T4-4 (db)      N/A      N/A            N/A            30.84        96

Configuration Summary

Application Configuration:

1 x SPARC T4-4 server with
4 x SPARC T4 processors, 3.0 GHz
512 GB memory
Oracle Solaris 11 11/11
PeopleTools 8.52
PeopleSoft HCM 9.1
Oracle Tuxedo, Version 10.3.0.0, 64-bit, Patch Level 031
Java Platform, Standard Edition Development Kit 6 Update 32

Database Configuration:

1 x SPARC T4-4 server with
4 x SPARC T4 processors, 3.0 GHz
256 GB memory
Oracle Solaris 11 11/11
Oracle Database 11g Release 2
PeopleTools 8.52
Oracle Tuxedo, Version 10.3.0.0, 64-bit, Patch Level 031
Micro Focus Server Express (COBOL v 5.1.00)

Web Tier Configuration:

1 x SPARC T4-2 server with
2 x SPARC T4 processors, 2.85 GHz
256 GB memory
Oracle Solaris 11 11/11
PeopleTools 8.52
Oracle WebLogic Server 10.3.4
Java Platform, Standard Edition Development Kit 6 Update 32

Storage Configuration:

1 x Sun Server X2-4 as a COMSTAR head for data
4 x Intel Xeon X7550, 2.0 GHz
128 GB memory
1 x Sun Storage F5100 Flash Array (80 flash modules)
1 x Sun Storage F5100 Flash Array (40 flash modules)

1 x Sun Fire X4275 as a COMSTAR head for redo logs
12 x 2 TB SAS disks with Niwot RAID controller

Benchmark Description

This benchmark combines PeopleSoft HCM 9.1 HR Self Service online and PeopleSoft Payroll batch workloads to run on a unified database deployed on Oracle Database 11g Release 2.

The PeopleSoft HRSS benchmark kit is an Oracle standard benchmark kit run by all platform vendors to measure performance. It is an OLTP benchmark with moderately complex database SQL. The results are certified by Oracle and a white paper is published.

PeopleSoft HR SS defines a business transaction as a series of HTML pages that guide a user through a particular scenario. Users are defined as corporate Employees, Managers, and HR administrators. The benchmark consists of 14 scenarios that emulate users performing typical HCM transactions, such as viewing a paycheck, promoting and hiring employees, and updating an employee profile.

All these transactions are well defined in the PeopleSoft HR Self-Service 9.1 benchmark kit. The benchmark metric is the weighted average search/save response time across all transactions.

The PeopleSoft 9.1 Payroll (North America) benchmark demonstrates system performance for a range of processing volumes in a specific configuration. This workload represents large batch runs typical of an ERP environment during a mass update. The benchmark measures the run times of five application business processes for a database representing a large organization: Paysheet Creation, Payroll Calculation, Payroll Confirmation, Print Advice Forms, and Create Direct Deposit File. The benchmark metric is the cumulative elapsed time taken to complete the Paysheet Creation, Payroll Calculation, and Payroll Confirmation business application processes.

The benchmark metrics are taken for each respective benchmark while running simultaneously on the same database back-end. Specifically, the payroll batch processes are started when the online workload reaches steady state (the maximum number of online users) and overlap with online transactions for the duration of the steady state.

Key Points and Best Practices

  • Two PeopleSoft domain sets, each with 200 application servers, were hosted in two separate Oracle Solaris Zones on a SPARC T4-4 server to demonstrate consolidation of multiple application servers, ease of administration, and performance tuning.

  • Each Oracle Solaris Zone was bound to a separate processor set, each containing 15 cores (120 threads in total). The default set (one core from the first and third processor sockets, 16 threads in total) was used for network and disk interrupt handling. This improved performance by reducing memory access latency (using the physical memory closest to the processors) and by offloading I/O interrupt handling to the default-set threads, freeing CPU resources for application server threads and balancing the application workload across 240 threads. (A configuration sketch follows this list.)

  • A total of 128 PeopleSoft payroll stream server processes were used on the database node to complete the payroll batch job of 500,000 employees in 32.4 minutes.
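
To make the zone and processor-set binding concrete, here is a minimal sketch using Solaris resource pools. The pool, pset, and zone names are hypothetical, and the thread counts assume the SPARC T4's 8 threads per core:

    # Enable the resource pools facility and save the current configuration
    pooladm -e
    pooladm -s

    # Create a 15-core (120-thread) processor set plus a pool for the
    # first application zone
    poolcfg -c 'create pset appset1 (uint pset.min = 120; uint pset.max = 120)'
    poolcfg -c 'create pool apppool1'
    poolcfg -c 'associate pool apppool1 (pset appset1)'
    pooladm -c                                # commit the configuration

    # Bind the first application zone to the pool; repeat the
    # pset/pool/zone steps for the second zone
    zonecfg -z appzone1 'set pool=apppool1'

Any CPUs not placed in a user-defined processor set remain in the default set, which is where the interrupt handling described above takes place.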

See Also

Disclosure Statement

Copyright 2012, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 8 November 2012.

Tuesday Oct 02, 2012

Performance of Oracle Business Intelligence Benchmark on SPARC T4-4

Oracle's SPARC T4-4 server configured with four SPARC T4 3.0 GHz processors delivered 25,000 concurrent users on the Oracle Business Intelligence Enterprise Edition (BI EE) 11g benchmark using Oracle Database 11g Release 2 running on Oracle Solaris 10.

  • A SPARC T4-4 server running Oracle Business Intelligence Enterprise Edition 11g achieved 25,000 concurrent users with an average response time of 0.36 seconds with Oracle BI server cache set to ON.

  • The benchmark data clearly shows that the underlying hardware, the SPARC T4 server, and the Oracle BI EE 11g (11.1.1.6.0 64-bit) platform scale within a single system, supporting 25,000 concurrent users while executing 415 transactions/sec.

  • The benchmark demonstrated the scalability of Oracle Business Intelligence Enterprise Edition 11g 11.1.1.6.0, which was deployed in a vertical scale-out fashion on a single SPARC T4-4 server.

  • Oracle Internet Directory configured on the SPARC T4 server provided authentication for the 25,000 Oracle BI EE users with sub-second response time.

  • A SPARC T4-4 with internal Solid State Drive (SSD) using the ZFS file system showed significant I/O performance improvement over traditional disk for the Web Catalog activity. In addition, ZFS helped get past the UFS limitation of 32767 sub-directories in a Web Catalog directory.

  • The multi-threaded 64-bit Oracle Business Intelligence Enterprise Edition 11g and SPARC T4-4 server proved to be a successful combination by providing sub-second response times for the end user transactions, consuming only half of the available CPU resources at 25,000 concurrent users, leaving plenty of head room for increased load.

  • The Oracle Business Intelligence on SPARC T4-4 server benchmark results demonstrate that comprehensive BI functionality built on a unified infrastructure with a unified business model yields best-in-class scalability, reliability and performance.

  • Oracle BI EE 11g is a newer version of Business Intelligence Suite with richer and superior functionality. Results produced with Oracle BI EE 11g benchmark are not comparable to results with Oracle BI EE 10g benchmark. Oracle BI EE 11g is a more difficult benchmark to run, exercising more features of Oracle BI.

Performance Landscape

Results for the Oracle BI EE 11g version of the benchmark. Results are not comparable to the Oracle BI EE 10g version of the benchmark.

Oracle BI EE 11g Benchmark
System Number of Users Response Time (sec)
1 x SPARC T4-4 (4 x SPARC T4 3.0 GHz) 25,000 0.36

Results for the Oracle BI EE 10g version of the benchmark. Results are not comparable to the Oracle BI EE 11g version of the benchmark.

Oracle BI EE 10g Benchmark
System Number of Users
2 x SPARC T5440 (4 x SPARC T2+ 1.6 GHz) 50,000
1 x SPARC T5440 (4 x SPARC T2+ 1.6 GHz) 28,000

Configuration Summary

Hardware Configuration:

SPARC T4-4 server
4 x SPARC T4 processors, 3.0 GHz
128 GB memory
4 x 300 GB internal SSD

Storage Configuration:

Sun ZFS Storage 7120
16 x 146 GB disks

Software Configuration:

Oracle Solaris 10 8/11
Oracle Solaris Studio 12.1
Oracle Business Intelligence Enterprise Edition 11g (11.1.1.6.0)
Oracle WebLogic Server 10.3.5
Oracle Internet Directory 11.1.1.6.0
Oracle Database 11g Release 2

Benchmark Description

Oracle Business Intelligence Enterprise Edition (Oracle BI EE) delivers a robust set of reporting, ad-hoc query and analysis, OLAP, dashboard, and scorecard functionality with a rich end-user experience that includes visualization, collaboration, and more.

The Oracle BI EE benchmark test used five different business user roles - Marketing Executive, Sales Representative, Sales Manager, Sales Vice-President, and Service Manager. These roles included a maximum of 5 different pre-built dashboards. Each dashboard page had an average of 5 reports in the form of a mix of charts, tables and pivot tables, returning anywhere from 50 rows to approximately 500 rows of aggregated data. The test scenario also included drill-down into multiple levels from a table or chart within a dashboard.

The benchmark test scenario uses a typical business user sequence of dashboard navigation, report viewing, and drill-down. For example, a Service Manager logs into the system and navigates to his own set of dashboards via the Service Manager role. The BI user selects the Service Effectiveness dashboard, which shows four distinct reports, Service Request Trend, First Time Fix Rate, Activity Problem Areas, and Cost Per Completed Service Call, spanning 2002 to 2005. The user then proceeds to the Customer Satisfaction dashboard, which also contains a set of four related reports, and drills down on some of the reports to see the detail data. The BI user continues to view more dashboards, such as Customer Satisfaction and Service Request Overview. After navigating through those dashboards, the user logs out of the application. The benchmark test is executed against a full production version of the Oracle Business Intelligence 11g Applications with a fully populated underlying database schema. The business processes in the test scenario closely represent a real-world customer scenario.

See Also

Disclosure Statement

Copyright 2012, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 30 September 2012.

SPARC T4-4 Delivers World Record First Result on PeopleSoft Combined Benchmark

Oracle's SPARC T4-4 servers running Oracle's PeopleSoft HCM 9.1 combined online and batch benchmark achieved a world record 18,000 concurrent users while executing a PeopleSoft Payroll batch job for 500,000 employees in 43.32 minutes and maintaining online user response times under 2 seconds.

  • This world record is the first to run online and batch workloads concurrently.

  • This result was obtained with a SPARC T4-4 server running Oracle Database 11g Release 2, a SPARC T4-4 server running the PeopleSoft HCM 9.1 application server, and a SPARC T4-2 server running Oracle WebLogic Server in the web tier.

  • The SPARC T4-4 server running the application tier used Oracle Solaris Zones which provide a flexible, scalable and manageable virtualization environment.

  • The average CPU utilization was 17% on the SPARC T4-2 server in the web tier, 59% on the SPARC T4-4 server in the application tier, and 35% (online and batch) on the SPARC T4-4 server in the database tier, leaving significant headroom for additional processing across the three tiers.

  • The SPARC T4-4 server used for the database tier hosted Oracle Database 11g Release 2 using Oracle Automatic Storage Management (ASM) for database file management, with I/O performance equivalent to raw devices.

  • This is the first three-tier, mixed-workload PeopleSoft benchmark, combining online transactions with the PeopleSoft payroll batch workload.

Performance Landscape

PeopleSoft HR Self-Service and Payroll Benchmark

Systems              Users    Ave Response   Ave Response   Batch        Streams
                              Search (sec)   Save (sec)     Time (min)
SPARC T4-2 (web)     18,000   0.944          0.503          43.32        64
SPARC T4-4 (app)
SPARC T4-4 (db)

Configuration Summary

Application Configuration:

1 x SPARC T4-4 server with
4 x SPARC T4 processors, 3.0 GHz
512 GB memory
1 x 600 GB SAS internal disk
4 x 300 GB SAS internal disks
1 x 100 GB and 2 x 300 GB internal SSDs
2 x 10 GbE HBAs
Oracle Solaris 11 11/11
PeopleTools 8.52
PeopleSoft HCM 9.1
Oracle Tuxedo, Version 10.3.0.0, 64-bit, Patch Level 031
Java Platform, Standard Edition Development Kit 6 Update 32

Database Configuration:

1 x SPARC T4-4 server with
4 x SPARC T4 processors, 3.0 GHz
256 GB memory
1 x 600 GB SAS internal disk
2 x 300 GB SAS internal disks
Oracle Solaris 11 11/11
Oracle Database 11g Release 2
PeopleTools 8.52
Oracle Tuxedo, Version 10.3.0.0, 64-bit, Patch Level 031

Web Tier Configuration:

1 x SPARC T4-2 server with
2 x SPARC T4 processors, 2.85 GHz
256 GB memory
2 x 300 GB SAS internal disks
1 x 300 GB internal SSD
1 x 100 GB internal SSD
Oracle Solaris 11 11/11
PeopleTools 8.52
Oracle WebLogic Server 10.3.4
Java Platform, Standard Edition Development Kit 6 Update 32

Storage Configuration:

1 x Sun Server X2-4 as a COMSTAR head for data
4 x Intel Xeon X7550, 2.0 GHz
128 GB memory
1 x Sun Storage F5100 Flash Array (80 flash modules)
1 x Sun Storage F5100 Flash Array (40 flash modules)

1 x Sun Fire X4275 as a COMSTAR head for redo logs
12 x 2 TB SAS disks with Niwot Raid controller

Benchmark Description

This benchmark combines PeopleSoft HCM 9.1 HR Self Service online and PeopleSoft Payroll batch workloads to run on a unified database deployed on Oracle Database 11g Release 2.

The PeopleSoft HRSS benchmark kit is an Oracle standard benchmark kit run by all platform vendors to measure performance. It is an OLTP benchmark with moderately complex database SQL. The results are certified by Oracle and a white paper is published.

PeopleSoft HR SS defines a business transaction as a series of HTML pages that guide a user through a particular scenario. Users are defined as corporate Employees, Managers, and HR administrators. The benchmark consists of 14 scenarios that emulate users performing typical HCM transactions, such as viewing a paycheck, promoting and hiring employees, and updating an employee profile.

All these transactions are well defined in the PeopleSoft HR Self-Service 9.1 benchmark kit. The benchmark metric is the weighted average search/save response time across all transactions.

The PeopleSoft 9.1 Payroll (North America) benchmark demonstrates system performance for a range of processing volumes in a specific configuration. This workload represents large batch runs typical of an ERP environment during a mass update. The benchmark measures the run times of five application business processes for a database representing a large organization: Paysheet Creation, Payroll Calculation, Payroll Confirmation, Print Advice Forms, and Create Direct Deposit File. The benchmark metric is the cumulative elapsed time taken to complete the Paysheet Creation, Payroll Calculation, and Payroll Confirmation business application processes.

The benchmark metrics are taken for each respective benchmark while running simultaneously on the same database back-end. Specifically, the payroll batch processes are started when the online workload reaches steady state (the maximum number of online users) and overlap with online transactions for the duration of the steady state.

Key Points and Best Practices

  • Two Oracle PeopleSoft domain sets, each with 200 application servers, were hosted in two separate Oracle Solaris Zones on a SPARC T4-4 server to demonstrate consolidation of multiple application servers, ease of administration, and performance tuning.

  • Each Oracle Solaris Zone was bound to a separate processor set, each containing 15 cores (120 threads in total). The default set (one core from the first and third processor sockets, 16 threads in total) was used for network and disk interrupt handling. This improved performance by reducing memory access latency (using the physical memory closest to the processors) and by offloading I/O interrupt handling to the default-set threads, freeing CPU resources for application server threads and balancing the application workload across 240 threads.

See Also

Disclosure Statement

Oracle's PeopleSoft HR and Payroll combined benchmark, www.oracle.com/us/solutions/benchmark/apps-benchmark/peoplesoft-167486.html, results 09/30/2012.

Tuesday Apr 10, 2012

World Record Oracle E-Business Suite 12.1.3 Standard Extra-Large Payroll (Batch) Benchmark on Sun Server X3-2L

Oracle's Sun Server X3-2L (formerly Sun Fire X4270 M3) server set a world record running the Oracle E-Business Suite 12.1.3 Standard Extra-Large Payroll (Batch) benchmark.

  • This is the first published result using Oracle E-Business Suite 12.1.3.

  • The Sun Server X3-2L ran the Extra-Large Payroll workload in 19 minutes.

Performance Landscape

This is the first published result for the Payroll Extra-Large model using the Oracle E-Business Suite 12.1.3 benchmark.

Batch Workload: Payroll Extra-Large Model
System Employees/Hr Elapsed Time
Sun Server X3-2L 789,515 19 minutes

Configuration Summary

Hardware Configuration:

Sun Server X3-2L
2 x Intel Xeon E5-2690, 2.9 GHz
128 GB memory
8 x 100 GB SSD for data
1 x 300 GB SSD for log

Software Configuration:

Oracle Linux 5.7
Oracle E-Business Suite R12 (12.1.3)
Oracle Database 11g (11.2.0.3)

Benchmark Description

The Oracle E-Business Suite Standard R12 Benchmark combines online transaction execution by simulated users with concurrent batch processing to model a typical scenario for a global enterprise. This benchmark ran one batch component, Payroll, in the Extra-Large size. The goal of the benchmark was to achieve the best batch payroll performance using the X-Large configuration.

Results can be published in four sizes, using one or more online/batch modules:

  • X-large: Maximum online users running all business flows between 10,000 and 20,000; 750,000 order to cash lines per hour and 250,000 payroll checks per hour.
    • Order to Cash Online -- 2400 users
      • The percentage across the 5 transactions in the Order Management module is:
        • Insert Manual Invoice -- 16.66%
        • Insert Order -- 32.33%
        • Order Pick Release -- 16.66%
        • Ship Confirm -- 16.66%
        • Order Summary Report -- 16.66%
    • HR Self-Service -- 4000 users
    • Customer Support Flow -- 8000 users
    • Procure to Pay -- 2000 users
  • Large: 10,000 online users; 100,000 order to cash lines per hour and 100,000 payroll checks per hour.
  • Medium: up to 3000 online users; 50,000 order to cash lines per hour and 10,000 payroll checks per hour.
  • Small: up to 1000 online users; 10,000 order to cash lines per hour and 5,000 payroll checks per hour.

See Also

Disclosure Statement

Oracle E-Business X-Large Batch-Payroll benchmark, Sun Server X3-2L, 2.90 GHz, 2 chips, 16 cores, 32 threads, 128 GB memory, elapsed time 19.0 minutes, 789,515 Employees/HR, Oracle Linux 5.7, Oracle E-Business Suite 12.1.3, Oracle Database 11g Release 2, Results as of 7/10/2012.

Wednesday Dec 08, 2010

Sun Blade X6275 M2 Cluster with Sun Storage 7410 Performance Running Seismic Processing Reverse Time Migration

This Oil & Gas benchmark highlights both the computational performance improvements of the Sun Blade X6275 M2 server module over the previous generation server module and the linear scalability achievable for total application throughput using a Sun Storage 7410 system, which delivered almost 2 GB/sec effective write performance.

Oracle's Sun Storage 7410 system attached via 10 Gigabit Ethernet to a cluster of Oracle's Sun Blade X6275 M2 server modules was used to demonstrate the performance of a 3D VTI Reverse Time Migration application, a heavily used geophysical imaging and modeling application for Oil & Gas Exploration. The total application throughput scaling and computational kernel performance improvements are presented for imaging two production sized grids using 800 input samples.

  • The Sun Blade X6275 M2 server module showed up to a 40% performance improvement over the previous generation server module with super-linear scalability to 16 nodes for the 9-Point Stencil used in this Reverse Time Migration computational kernel.

  • The balanced combination of Oracle's Sun Storage 7410 system over 10 GbE to the Sun Blade X6275 M2 server module cluster showed linear scalability for the total application throughput, including the I/O and MPI communication, to produce a final 3-D seismic depth imaged cube for interpretation.

  • The final image write from the Sun Blade X6275 M2 server module nodes to Oracle's Sun Storage 7410 system achieved 10 GbE line speed (1.25 GBytes/second) or better. The effects of I/O buffer caching on the Sun Blade X6275 M2 server module nodes and the 34 GByte write-optimized cache on the Sun Storage 7410 system gave up to 1.8 GBytes/second effective write performance.

Performance Landscape

Server Generational Performance Improvements

Performance improvements for the Reverse Time Migration computational kernel using a Sun Blade X6275 M2 cluster are compared to the previous generation Sun Blade X6275 cluster. Hyper-threading was enabled for both configurations allowing 24 OpenMP threads for the Sun Blade X6275 M2 server module nodes and 16 for the Sun Blade X6275 server module nodes.

Sun Blade X6275 M2 Performance Improvements

          Grid Size: 1243 x 1151 x 1231            Grid Size: 2486 x 1151 x 1231
Number    X6275         X6275 M2      X6275 M2     X6275         X6275 M2      X6275 M2
Nodes     Kernel (sec)  Kernel (sec)  Speedup      Kernel (sec)  Kernel (sec)  Speedup
16        306           242           1.3          728           576           1.3
14        355           271           1.3          814           679           1.2
12        435           346           1.3          945           797           1.2
10        541           390           1.4          1156          890           1.3
8         726           555           1.3          1511          1193          1.3

Application Scaling

Performance and scaling results of the total application, including I/O, for the reverse time migration demonstration application are presented. Results were obtained using a Sun Blade X6275 M2 server cluster with a Sun Storage 7410 system for the file server. The servers were running with hyperthreading enabled, allowing for 24 OpenMP threads per server node.

Application Scaling Across Multiple Nodes

          Grid Size: 1243 x 1151 x 1231             Grid Size: 2486 x 1151 x 1231
Number    Total       Kernel      Total    Kernel   Total       Kernel      Total    Kernel
Nodes     Time (sec)  Time (sec)  Speedup  Speedup  Time (sec)  Time (sec)  Speedup  Speedup
16        501         242         2.1*     2.3*     1060        576         2.0      2.1*
14        583         271         1.8      2.0      1219        679         1.7      1.8
12        681         346         1.6      1.6      1420        797         1.5      1.5
10        807         390         1.3      1.4      1688        890         1.2      1.3
8         1058        555         1.0      1.0      2085        1193        1.0      1.0

* Super-linear scaling due to the compute kernel fitting better into available cache at larger node counts

Image File Effective Write Performance

The performance for writing the final 3D image from a Sun Blade X6275 M2 server cluster over 10 Gigabit Ethernet to a Sun Storage 7410 system is presented. Each server allocated one core per node for MPI I/O, thus allowing 22 OpenMP compute threads per node with hyperthreading enabled. Captured performance analytics from the Sun Storage 7410 system indicate effective use of its 34 Gigabyte write-optimized cache.

Image File Effective Write Performance

          Grid Size: 1243 x 1151 x 1231       Grid Size: 2486 x 1151 x 1231
Number    Write Time   Write Performance      Write Time   Write Performance
Nodes     (sec)        (GB/sec)               (sec)        (GB/sec)
16        4.8          1.5                    10.2         1.4
14        5.0          1.4                    10.2         1.4
12        4.0          1.8                    11.3         1.3
10        4.3          1.6                    9.1          1.6
8         4.6          1.5                    9.7          1.5

Note: Performance results better than 1.3 GB/sec are related to I/O buffer caching on the server nodes.

Configuration Summary

Hardware Configuration:

8 x Sun Blade X6275 M2 server modules (two nodes per module, 16 nodes in total), each node with
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory (12 x 4 GB at 1333 MHz)
1 x QDR InfiniBand Host Channel Adapter

Sun Datacenter InfiniBand Switch IB-36
Sun Network 10 GbE Switch 72p

Sun Storage 7410 system connected via 10 Gigabit Ethernet
4 x 17 GB STEC ZeusIOPs SSD mirrored - 34 GB
40 x 750 GB 7500 RPM Seagate SATA disks mirrored - 14.4 TB
No L2ARC Readzilla Cache

Software Configuration:

Oracle Enterprise Linux Server release 5.5
Oracle Message Passing Toolkit 8.2.1c (for MPI)
Oracle Solaris Studio 12.2 C++, Fortran, OpenMP

Benchmark Description

This Vertical Transverse Isotropy (VTI) Anisotropic Reverse Time Depth Migration (RTM) application measures the total time it takes to image 800 samples of various production-size grids and write the final image to disk for the next workflow step involving 3-D seismic volume interpretation. In doing so, it reports the compute, interprocessor communication, and I/O performance of the individual functions that comprise the total solution. Unlike most references for Reverse Time Migration, which focus solely on the performance of the 3D stencil compute kernel, this demonstration code additionally reports the total throughput involved in processing large data sets with a full 3D Anisotropic RTM application. It provides valuable insight into configuration and sizing for specific seismic processing requirements. The performance effects of new processors, interconnects, I/O subsystems, and software technologies can be evaluated while solving a real exploration business problem.

This benchmark study uses the "in-core" implementation of this demonstration code, where each node reads in only the trace, velocity, and conditioning data to be processed by that node, plus a 4-element array pad (based on spatial order 8) shared with its neighbors to the left and right during the initialization phase. It maintains previous, current, and next wavefield state information for each of the source, receiver, and anisotropic wavefields in memory. The second two grid dimensions used in this benchmark are specifically chosen to be prime numbers to exaggerate the effects of data alignment. Algorithm adaptations for processing higher orders in space and alternative "out-of-core" solutions using SSDs for wave state checkpointing are implemented in this demonstration application to better understand the effects of problem size scaling. Care is taken to handle absorption boundary conditioning and a variety of imaging conditions appropriately.

RTM Application Structure:

Read Processing Parameter File, Determine Domain Decomposition, Initialize Data Structures, and Allocate Memory.

Read Velocity, Epsilon, and Delta Data Based on Domain Decomposition, and Create Source, Receiver, and Anisotropic Previous, Current, and Next Wave States.

First Loop over Time Steps:
    Compute 3D Stencil for Source Wavefield (a,s) - 8th order in space, 2nd order in time
    Propagate over Time to Create s(t,z,y,x) & a(t,z,y,x)
    Inject Estimated Source Wavelet
    Apply Absorption Boundary Conditioning (a)
    Update Wavefield States and Pointers
    Write Snapshot of Wavefield (out-of-core) or Push Wavefield onto Stack (in-core)
    Communicate Boundary Information

Second Loop over Time Steps:
    Compute 3D Stencil for Receiver Wavefield (a,r) - 8th order in space, 2nd order in time
    Propagate over Time to Create r(t,z,y,x) & a(t,z,y,x)
    Read Receiver Trace and Inject Receiver Wavelet
    Apply Absorption Boundary Conditioning (a)
    Update Wavefield States and Pointers
    Communicate Boundary Information
    Read in Source Wavefield Snapshot (out-of-core) or Pop Off of Stack (in-core)
    Cross-correlate Source and Receiver Wavefields
    Update Image Using Image Conditioning Parameters

Write 3D Depth Image i(z,x,y) = Sum over Time Steps s(t,z,x,y) * r(t,z,x,y), or Other Imaging Conditions.
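
To illustrate the shape of the stencil compute kernel named above, here is a minimal, hypothetical sketch of an 8th-order-in-space, 2nd-order-in-time wavefield update in C with OpenMP. It is an isotropic simplification for illustration only; the demonstration code adds the VTI anisotropic terms, absorption boundary conditioning, and MPI halo exchange:

    /* Minimal stencil sketch (isotropic, illustration only).
       Array and parameter names are hypothetical. */
    #include <stddef.h>

    #define R 4   /* stencil radius: 8th order in space = 4 points per side */

    void stencil_update(size_t nz, size_t ny, size_t nx,
                        const float *prev, const float *cur, float *next,
                        const float *v2dt2,     /* v^2 * dt^2 per cell */
                        const float c[R + 1])   /* finite-difference coefficients */
    {
        #pragma omp parallel for collapse(2)
        for (size_t z = R; z < nz - R; z++)
            for (size_t y = R; y < ny - R; y++)
                for (size_t x = R; x < nx - R; x++) {
                    size_t i = (z * ny + y) * nx + x;
                    float lap = 3.0f * c[0] * cur[i];
                    for (int k = 1; k <= R; k++)
                        lap += c[k] * (cur[i + k]           + cur[i - k]             /* x */
                                     + cur[i + k * nx]      + cur[i - k * nx]        /* y */
                                     + cur[i + k * nx * ny] + cur[i - k * nx * ny]); /* z */
                    /* 2nd order in time: u(t+dt) = 2u(t) - u(t-dt) + v^2*dt^2 * Laplacian */
                    next[i] = 2.0f * cur[i] - prev[i] + v2dt2[i] * lap;
                }
    }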

Key Points and Best Practices

This demonstration application represents a full Reverse Time Migration solution. Many references to the RTM application tend to focus on the compute kernel and ignore the complexity that the input, communication, and output bring to the task.

Image File MPI Write Performance Tuning

Changing the image file write from MPI non-blocking to MPI blocking and setting Oracle Message Passing Toolkit MPI environment variables yielded an 18x improvement in write performance to the Sun Storage 7410 system, going from:

    86.8 to 4.8 seconds for the 1243 x 1151 x 1231 grid size
    183.1 to 10.2 seconds for the 2486 x 1151 x 1231 grid size

The Swat Sun Storage 7410 analytics data capture indicated an initial write performance of about 100 MB/sec with the MPI non-blocking implementation. After modifying to MPI blocking writes, Swat showed between 1.3 and 1.8 GB/sec with up to 13000 write ops/sec to write the final output image. The Swat results are consistent with the actual measured performance and provide valuable insight into the Reverse Time Migration application I/O performance.

The reason for this vast improvement has to do with whether the MPI file mode is sequential or not (MPI_MODE_SEQUENTIAL, O_SYNC, O_DSYNC). The MPI non-blocking routines, MPI_File_iwrite_at and MPI_Wait, typically used for overlapping I/O and computation, do not support sequential file access mode. Therefore, the application could not take full performance advantage of the Sun Storage 7410 system write-optimized cache. In contrast, the MPI blocking routine, MPI_File_write_at, defaults to MPI sequential mode, and the performance advantages of the write-optimized cache are realized. Since writing the final image occurs at the end of RTM execution, there is no need to overlap the I/O with computation.
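
A minimal sketch of the change described above (file name and argument names are hypothetical, and error checking is omitted):

    /* Sketch of the final image write, per MPI rank */
    #include <mpi.h>

    void write_image(MPI_Comm comm, const float *img, int nfloats,
                     MPI_Offset my_offset)
    {
        MPI_File fh;
        MPI_Status st;
        MPI_File_open(comm, "final_image.bin",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Before (non-blocking, overlapped, defeats the write cache):
         *   MPI_Request req;
         *   MPI_File_iwrite_at(fh, my_offset, (void *)img, nfloats, MPI_FLOAT, &req);
         *   MPI_Wait(&req, &st);
         */

        /* After (blocking; nothing left to overlap at this stage) */
        MPI_File_write_at(fh, my_offset, (void *)img, nfloats, MPI_FLOAT, &st);

        MPI_File_close(&fh);
    }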

Additional environment settings used:

    setenv SUNW_MP_PROCBIND true   # bind OpenMP threads to processors (Oracle Solaris Studio)
    setenv MPI_SPIN 1              # MPI processes spin-wait rather than yield (Oracle Message Passing Toolkit)
    setenv MPI_PROC_BIND 1         # bind MPI processes to processors

Adjusting the Level of Multithreading for Performance

The level of multithreading (8, 10, 12, 22, or 24) for the various components of the RTM should be adjustable based on the type of computation taking place. It is best to use the OpenMP num_threads clause to adjust the level of multithreading for each particular work task, and to use numactl to specify how the threads are allocated to cores in accordance with the OpenMP parallelism level.
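
A hypothetical illustration of per-phase thread counts using the num_threads clause (the phase names and counts are invented for the example):

    /* Per-phase OpenMP thread counts via the num_threads clause */
    void compute_phase(void)      /* compute-heavy stencil work */
    {
        #pragma omp parallel num_threads(22)
        {
            /* ... stencil updates ... */
        }
    }

    void boundary_phase(void)     /* lighter work: fewer threads */
    {
        #pragma omp parallel num_threads(8)
        {
            /* ... pack/unpack halo buffers ... */
        }
    }

The process can then be launched under numactl, for example numactl --cpunodebind=0,1 ./app, so that threads are placed on the intended cores and memory.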

See Also

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 12/07/2010.

Tuesday Jun 29, 2010

Sun Fire X2270 M2 Demonstrates Outstanding Single Node Performance on MSC.Nastran Benchmarks

Oracle's Sun Fire X2270 M2 server demonstrated outstanding performance running the MCAE application MSC.Nastran, as shown by the MD Nastran MDR3 serial and parallel test cases.

Performance Landscape

Complete information about the serial results presented below can be found on the MSC Nastran website.


MD Nastran MDR3 Serial Test Results
Results are total elapsed run time in seconds

Platform               xl0imf1   xx0xst0   xl1fn40   vl0sst1
Sun Fire X2270 M2      999       704       2337      115
Sun Blade X6275        1107      798       2285      120
Intel Nehalem          1235      971       2453      123
Intel Nehalem w/ SSD   1484      767       2456      120
IBM:P6 570 (I8)        -         1510      4612      132
IBM:P6 570 (I4)        1016      1618      5534      147

Complete information about the parallel results presented below can be found on the MSC Nastran website.


MD Nastran MDR3 Parallel Test Results
Results are total elapsed run time in seconds

                         xx0cmd2                          md0mdf1
Platform                 Serial   DMP=2   DMP=4   DMP=8   Serial   DMP=2   DMP=4
Sun Blade X6275          840      532     391     279     880      422     223
Sun Fire X2270 M2        847      558     371     297     889      462     232
Intel Nehalem w/ 4 SSD   887      639     405     -       902      479     235
Intel Nehalem            915      561     408     -       922      470     251
IBM:P6 570 (I8)          920      574     392     322     -        -       -
IBM:P6 570 (I4)          959      616     419     343     911      469     242

Results and Configuration Summary

Hardware Configuration:

Sun Fire X2270 M2
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory
4 x 24 GB SSDs (striped)

Software Configuration:

64-bit SUSE Linux Enterprise Server 10 SP 3
MSC Software MD 2008 R3
MD Nastran MDR3 benchmark test suite

Benchmark Description

The benchmark tests are representative of typical MSC.Nastran applications including both serial and parallel (DMP) runs involving linear statics, nonlinear statics, and natural frequency extraction as well as others. MD Nastran is an integrated simulation system with a broad set of multidiscipline analysis capabilities.

Key Points and Best Practices

  • The test cases for the MSC.Nastran module all have a substantial I/O component, where 15% to 25% of the total run times are associated with I/O activity (primarily scratch files). The required scratch file size ranges from less than 1 GB up to about 140 GB. To obtain best performance, it is important to have a high performance storage system when running MD Nastran.

  • To improve performance, use the MD Nastran feature that sets the maximum amount of memory the application will use. Capping application memory lets a user place temporary files in memory file systems such as tmpfs (see the sketch below).
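
As a minimal sketch on Linux (the mount point, size, and job name are hypothetical, and the scratch-directory keyword may vary by MD Nastran release):

    # Create an in-memory file system for MD Nastran scratch files,
    # leaving enough physical memory for the application itself
    mkdir -p /scratch_mem
    mount -t tmpfs -o size=32g tmpfs /scratch_mem

    # Point the Nastran scratch directory at it
    nastran myjob sdirectory=/scratch_mem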

See Also

Disclosure Statement

MSC.Software is a registered trademark of MSC. All information on the MSC.Software website is copyrighted. MD Nastran MDR3 results from http://www.mscsoftware.com and this report as of June 28, 2010.

Sun Fire X2270 M2 Achieves Leading Single Node Results on ABAQUS Benchmark

Oracle's Sun Fire X2270 M2 server outperforms the best posted results running the ABAQUS/Standard Server Benchmark test suite on a single platform.

  • The Sun Fire X2270 M2 server performed up to 1.25 times faster than the previous generation Sun Fire X2270 server on the ABAQUS/Standard Server Benchmark test suite.

Performance Landscape

Comparisons below are against the top results found at the ABAQUS website. For the complete set of results, please go to the benchmark website.


ABAQUS/Standard Server Benchmark Test Suite (Single Platform)
Results are total elapsed run time in seconds

Platform            Cores   S2a   S4b    S4d    S6
Sun Fire X2270 M2   12      256   1033   730    1234

Sun Fire X2270 M2   8       313   1248   834    1340
Sun Fire X2270      8       319   1280   832    1360
HP BL460c G6        8       324   1309   843    1322
SGI XE340           8       332   1338   867    1348

Sun Fire X2270 M2   4       544   1981   1230   1794
Sun Fire X2270      4       546   1983   1210   1771
HP BL460c G6        4       561   2062   1234   1812
SGI XE340           4       548   2089   1232   1770

Ratio: 8-core X2270 / 12-core X2270 M2   1.25   1.24   1.14   1.10

Results and Configuration Summary

Hardware Configuration:

Sun Fire X2270 M2
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory
4 x 24 GB striped SSDs

Sun Fire X2270
2 x 2.93 GHz Intel Xeon X5570 processors
48 GB memory
2 x 24 GB internal striped SSDs

Software Configuration:

64-bit SUSE Linux Enterprise Server 10 SP 3 (SP 2 for X2270)
ABAQUS Standard Module V6.9
ABAQUS Standard Benchmark Test Suite - Server test cases

Benchmark Description

ABAQUS/Standard server benchmark problems provide an estimate of the performance that can be expected when running the ABAQUS/Standard module. The tests include analysis of a flywheel with centrifugal loading (S2A), bolting a cylinder head onto an engine block (S4B & S4D), and determining the footprint on an automobile tire (S6).

Key Points and Best Practices

  • The memory requirements for the test cases in the ABAQUS/Standard server benchmark test suite are substantial, with some of the test cases requiring over 20 GB of memory. There are two memory limit parameters that can be set to tune performance. One controls when out-of-core memory will be used; tune it to avoid excessive use of the out-of-core algorithms. The other concerns I/O and minimizes unneeded disk activity. Both limits can be determined before a full run by doing a preliminary run in diagnostic mode.

  • The test cases for the ABAQUS standard module all have a substantial I/O component where 15% to 25% of the total run times are associated with I/O activity (primarily scratch files). Performance will be enhanced by using a high performance file system.

See Also

Disclosure Statement

The following are trademarks or registered trademarks of Dassault Systemes, or its subsidiaries in the United States and/or other countries: Simulia, ABAQUS, ABAQUS/Standard, ABAQUS/Explicit. All information on the ABAQUS website is Copyrighted 2004-2010 by Dassault Systemes. Results from http://www.simulia.com/support/v69/v69_performance.php as of June 28, 2010.

Tuesday Apr 13, 2010

Oracle Sun Flash Accelerator F20 PCIe Card Accelerates Web Caching Performance

Using Oracle's Sun FlashFire technology, the Sun Flash Accelerator F20 PCIe Card is shown to be a high performance, cost effective caching device for web servers. Many current web and application servers are designed with an active cache that holds items such as session objects, files, and web pages. The Sun F20 card proves to be an excellent candidate for improving performance over HDD-based solutions.

  • The Sun Flash Accelerator F20 PCIe Card provides 2x better Quality of Service (QoS) at the same load as compared to 15K RPM high performance disk drives.

  • The Sun Flash Accelerator F20 PCIe Card enables scaling to 3x more users than 15K RPM high performance disk drives.

  • The Sun Flash Accelerator F20 PCIe Card provides 25% higher Quality of Service (QoS) than 15K RPM high performance disk drives at maximum rate.

  • The Sun Flash Accelerator F20 PCIe Card allows for easy expansion of the webcache. Each card provides an additional 96 GB of storage.

  • The Sun Flash Accelerator F20 PCIe Card used as a caching device offers Bitrate and Quality of Service (QoS) comparable to that provided by memory. While memory also provides excellent caching performance in comparison to disk, memory capacity is limited in servers.

Performance Landscape

Experiment results using three Sun Flash Accelerator F20 PCIe Cards.

                      No Cache    F20 Webcache              Memcache
Load Factor           Max Load    @Disk Load   Max Load     @F20 Load
Max Connections       7,000       7,000        27,000       27,000
Average Bitrate       445 Kbps    870 Kbps     602 Kbps     678 Kbps
Cache Hit Rate        0%          98%          99%          56%

QoS Bitrates          %Connect    %Connect     %Connect     %Connect
900 Kbps - 1 Mbps     0%          97%          0%           0%
800 Kbps              0%          3%           0%           6%
700 Kbps              0%          0%           64%          70%
600 Kbps              18%         0%           24%          15%
420 Kbps - 500 Kbps   88%         0%           12%          9%

Experiment results using two Sun Flash Accelerator F20 PCIe Cards.

                      No Cache    F20 Webcache              Memcache
Load Factor           Max Load    @Disk Load   Max Load     @F20 Load
Max Connections       7,000       7,000        22,000       27,000
Average Bitrate       445 Kbps    870 Kbps     622 Kbps     678 Kbps
Cache Hit Rate        0%          98%          80%          56%

QoS Bitrates          %Connect    %Connect     %Connect     %Connect
900 Kbps - 1 Mbps     0%          97%          0%           0%
800 Kbps              0%          3%           1%           6%
700 Kbps              0%          0%           68%          70%
600 Kbps              18%         0%           26%          15%
420 Kbps - 500 Kbps   88%         0%           5%           9%

Results and Configuration Summary

Hardware Configuration:

Sun Fire X4270, 72 GB memory
3 x Sun Flash Accelerator F20 PCIe Cards
Sun Storage J4400 (12 15K RPM disks)

Software Configuration:

Sun Java System Web Server 7
OpenSolaris
Flickr Photo Download Workload
Oracle Solaris Zettabyte File System (ZFS)

Three configurations are compared:

  1. No cache, 12 x high-speed 15K RPM Disks
  2. 3 x Sun Flash Accelerator F20 PCIe Cards as cache device
  3. 64 GB server memory as cache device

Benchmark Description

This benchmark is based upon the description of the flickr website presented at http://highscalability.com/flickr-architecture. It measures the performance of an HTTP-based photo slide show workload. The workload randomly selects and downloads one of 80 photos stored in 4 bins (a selection sketch follows the list below):

  • 20 large photos, 1800x1800p, 1 MB, 1% probability
  • 20 medium photos, 1000x1000p, 500 KB, 4% probability
  • 20 small photos, 540x540p, 100 KB, 35% probability
  • 20 thumbnail photos, 100x100p, 5 KB, 60% probability
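
The workload generator itself is not published; the following is a minimal sketch, in C, of how such a weighted selection could work:

    #include <stdlib.h>

    /* Pick one of 80 photos: choose a bin by probability,
       then a photo uniformly within the 20-photo bin. */
    int pick_photo(void)
    {
        int r = rand() % 100;              /* 0..99 */
        int bin = (r < 1)  ? 0             /* large:     1% */
                : (r < 5)  ? 1             /* medium:    4% */
                : (r < 40) ? 2             /* small:    35% */
                :            3;            /* thumbnail: 60% */
        return bin * 20 + rand() % 20;
    }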

Benchmark metrics are:

  • Scalability – Number of persistent connections achieved
  • Quality of Service (QoS) – bitrate achieved by each user
    • max speed: 1 Mbps, min speed SLA: 420 Kbps
    • divides bitrates between max and min in 5 bands, corresponding to dial-in, T1, etc.
    • example: 900 Kbps, 800 Kbps, 700 Kbps, 600 Kbps, 500 Kbps
    • reports %users in each bitrate band

Three cases were tested:

  • Disk as OverFlow Cache – Contents are served from 12 high-performance 15K RPM disks configured in a ZFS zpool.
  • Sun Flash Accelerator F20 PCIe Card as Cache Device – Contents are served from 2 F20 cards, with 8 component DOMs configured in a ZFS zpool (see the sketch after this list)
  • Memory as Cache – Contents are served from tmpfs
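
A minimal sketch of such a pool (device names are hypothetical; each F20 card presents its flash modules as separate DOM devices):

    # Stripe the F20 flash modules (DOMs) into a ZFS pool for the web cache
    zpool create webcache c2t0d0 c2t1d0 c2t2d0 c2t3d0 \
                          c3t0d0 c3t1d0 c3t2d0 c3t3d0
    zfs create webcache/docs
    # then point the web server's cache directory at /webcache/docs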

Key Points and Best Practices

See Also

Disclosure Statement

Results as of 4/1/2010.

Thursday Jan 21, 2010

SPARC Enterprise M4000 PeopleSoft NA Payroll 240K Employees Performance (16 Streams)

The Sun SPARC Enterprise M4000 server, combined with Sun FlashFire technology in the form of the Sun Storage F5100 flash array, has produced world record performance on the PeopleSoft Payroll 9.0 (North American) 240K employees benchmark.

  • The Sun SPARC Enterprise M4000 server with four 2.53 GHz SPARC64 VII processors and the Sun Storage F5100 flash array using 16 job streams (payroll threads) is 55% faster than the HP rx6600 (4 x 1.6GHz Itanium2 processors) as measured for payroll processing tasks in the PeopleSoft Payroll 9.0 (North American) benchmark. The Sun result used the Oracle 11gR1 database running on Solaris 10.

  • The Sun SPARC Enterprise M4000 server with four 2.53GHz SPARC64 VII processors and the Sun Storage F5100 flash array is 2.1x faster than the 2027 MIPS IBM Z990 (6 Z990 Gen1 processors) as measured for payroll processing tasks in the PeopleSoft Payroll 9.0 (North American) benchmark. The Sun result used the Oracle 11gR1 database running on Solaris 10, while the IBM result was run with 8 payroll threads and used IBM DB2 for z/OS 8.1 for the database.

  • The Sun SPARC Enterprise M4000 server with four 2.53GHz SPARC64 VII processors and a Sun Storage F5100 flash array processed payroll for 240K employees using PeopleSoft Payroll 9.0 (North American) and Oracle 11gR1 running on Solaris 10 with different execution strategies, which resulted in a maximum CPU utilization of 45%, compared to HP's reported CPU utilization of 89%.

  • The Sun SPARC Enterprise M4000 server combined with Sun FlashFire technology processed 16 sequential jobs with a single run control in a total time of 534 minutes, an improvement of 19% compared to HP's time of 633 minutes.

  • Sun's FlashFire technology dramatically improves I/O performance for the PeopleSoft Payroll 9.0 (North American) benchmark, providing a significant performance boost over best-optimized FC disk configurations (60+ drives).

  • The Sun Storage F5100 Flash Array is a high performance, high density solid state flash array providing a read latency of only 0.5 ms, about 10 times faster than the 5 ms disk latency measured on this benchmark.

  • Sun estimates that the MIPS rating for a Sun SPARC Enterprise M4000 server is over 3000 MIPS.

Performance Landscape

240K Employees

                                                             Time in Minutes
System      Processor                  OS/Database           Payroll     Run 1    Run 2    Run 3    Streams   Ver
                                                             Processing
                                                             Result
Sun M4000   4 x 2.53GHz SPARC64 VII    Solaris/Oracle 11gR1  43.78       51.26    286.11   534.35   16        9.0
HP rx6600   4 x 1.6GHz Itanium2        HP-UX/Oracle 11g      68.07       81.17    350.16   633.25   16        9.0
IBM Z990    6 x Gen1 2027 MIPS         Z/OS/DB2              91.70       107.34   328.66   544.80   8         9.0

Note: IBM benchmark documents show that 6 Gen1 processors equal 2027 MIPS. This configuration contained 13 Gen1 processors, but only 6 were available for testing.

Results and Configuration Summary

Hardware Configuration:

    1 x Sun SPARC Enterprise M4000 (4 x 2.53 GHz/32GB)
    1 x Sun Storage F5100 Flash Array (40 x 24GB FMODs)
    1 x Sun Storage J4200 (12 x 450GB SAS 15K RPM)

Software Configuration:

    Solaris 10 5/09
    Oracle PeopleSoft HCM 9.0 64-bit
    Oracle PeopleSoft Enterprise (PeopleTools) 8.49.08 64-bit
    Micro Focus Server Express 4.0 SP4 64-bit
    Oracle RDBMS 11.1.0.7 64-bit
    HP's Mercury Interactive QuickTest Professional 9.0

Benchmark Description

The PeopleSoft 9.0 Payroll (North America) benchmark is a performance benchmark established by PeopleSoft to demonstrate system performance for a range of processing volumes in a specific configuration. This information may be used to determine the software, hardware, and network configurations necessary to support processing volumes. This workload represents large batch runs typical of OLTP workloads during a mass update.

The benchmark measures five application business process run times for a database representing a large organization. The five processes are:

  • Paysheet Creation: generates a payroll data worksheet for employees, consisting of standard payroll information for each employee for a given pay cycle.

  • Payroll Calculation: looks at paysheets and calculates paychecks for those employees.

  • Payroll Confirmation: takes the information generated by Payroll Calculation and updates the employees' balances with the calculated amounts.

  • Print Advice Forms: takes the information generated by Payroll Calculation and Confirmation and produces an Advice form for each employee reporting earnings, taxes, deductions, etc.

  • Create Direct Deposit File: takes the information generated by the above processes and produces an electronic transmittal file used to transfer payroll funds directly into an employee's bank account.

For the benchmark, at least three data points are collected with different numbers of job streams (parallel jobs). This batch benchmark allows a maximum of sixteen job streams to be configured to run in parallel.

Key Points and Best Practices

Please see the white paper for information on PeopleSoft payroll best practices using flash.

See Also

Disclosure Statement

Oracle PeopleSoft Payroll 9.0 benchmark, Sun M4000 (4 2.53GHz SPARC64) 43.78 min, IBM Z990 (6 gen1) 91.70 min, HP rx6600 (4 1.6GHz Itanium2) 68.07 min, www.oracle.com/apps_benchmark/html/white-papers-peoplesoft.html, results 1/21/2010.

Monday Oct 12, 2009

MCAE ABAQUS faster on Sun F5100 and Sun X4270 - Single Node World Record

The Sun Storage F5100 Flash Array can substantially improve performance over internal hard disk drives as shown by the I/O intensive ABAQUS MCAE application Standard benchmark tests on a Sun Fire X4270 server.

The I/O intensive ABAQUS "Standard" benchmark test cases were run on a single Sun Fire X4270 server. Data is presented for runs at both 8 and 16 thread counts.

The ABAQUS "Standard" module is an MCAE application based on the finite element analysis (FEA) method. This computer-based numerical method inherently involves a substantial I/O component. The purpose was to evaluate the performance of the Sun Storage F5100 Flash Array relative to high performance 15K RPM internal striped HDDs.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "S4b" test case by 14%.

  • The Sun Fire X4270 server coupled with a Sun Storage F5100 Flash Array established the world record performance on a single node for the four test cases S2A, S4B, S4D and S6.

Performance Landscape

ABAQUS "Standard" Benchmark Test S4B: Advantage of Sun Storage F5100

Results are total elapsed run times in seconds

Threads   4 x 15K RPM 72 GB SAS HDD   Sun F5100                  Sun F5100 Performance
          (striped HW RAID0)          (r/w buff 4096, striped)   Advantage
8         1504                        1318                       14%
16        1811                        1649                       10%
ABAQUS Standard Server Benchmark Subset: Single Node Record Performance

Results are total elapsed run times in seconds

Platform        Cores   S2a   S4b    S4d    S6
X4270 w/F5100   8       302   1192   779    1237
HP BL460c G6    8       324   1309   843    1322
X4270 w/F5100   4       552   1970   1181   1706
HP BL460c G6    4       561   2062   1234   1812

Results and Configuration Summary

Hardware Configuration:
    Sun Fire X4270
      2 x 2.93 GHz QC Intel Xeon X5570 processors
      Hyperthreading enabled
      24 GB memory
      4 x 72 GB 15K RPM striped (HW RAID0) SAS disks
    Sun Storage F5100 Flash Array
      20 x 24 GB flash modules
      Intel controller

Software Configuration:

    O/S: 64-bit SUSE Linux Enterprise Server 10 SP 2
    Application: ABAQUS V6.9-1 Standard Module
    Benchmark: ABAQUS Standard Benchmark Test Suite

Benchmark Description

Abaqus/Standard Benchmark Problems

These problems provide an estimate of the performance that can be expected when running Abaqus/Standard or similar commercially available MCAE (FEA) codes like ANSYS and MSC/Nastran on different computers. The jobs are representative of those typically analyzed by Abaqus/Standard and other MCAE applications. These analyses include linear statics, nonlinear statics, and natural frequency extraction.

Please go here for a more complete description of the tests.

Key Points and Best Practices

  • The memory requirements for the test cases in the ABAQUS Standard benchmark test suite are rather substantial, with some of the test cases requiring slightly over 20 GB of memory. There are two memory limits: a minimum, above which out-of-core "memory" will be used, requiring more time-consuming CPU work, and a maximum memory limit that minimizes I/O operations. These memory limits are given in the ABAQUS output and can be established before a full execution with a preliminary diagnostic-mode run.
  • Based on the maximum physical memory on a platform, the user can stipulate the maximum portion of this memory that can be allocated to the ABAQUS job. This is done in the "abaqus_v6.env" file, which resides either in the subdirectory from which the job was launched or in the ABAQUS "site" subdirectory under the home installation directory.
  • Sometimes when running multiple cores on a single node, it is preferable from a performance standpoint to run in "smp" shared memory mode. This is specified using the "THREADS" option on the "mp_mode" line in the abaqus_v6.env file, as opposed to the "MPI" option on that line. The test case considered here illustrates this point. (A sketch of such an environment file follows this list.)
  • The test cases for the ABAQUS standard module all have a substantial I/O component, where 15% to 25% of the total run times are associated with I/O activity (primarily scratch files). Performance will be enhanced by using the fastest available drives and striping together more than one of them, or by using a high performance disk storage system with high performance interconnects. On Linux operating systems, advantage can be taken of excess memory that can be used to cache and accelerate I/O.
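
A hypothetical abaqus_v6.env fragment reflecting these points (parameter names and accepted values vary by ABAQUS release; the values are illustrative for a 24 GB node):

    # abaqus_v6.env (fragment, illustrative)
    memory = "20 gb"      # cap ABAQUS memory so the OS retains file cache
    mp_mode = THREADS     # shared-memory ("smp") mode on a single node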

See Also

Disclosure Statement

The following are trademarks or registered trademarks of Abaqus, Inc. or its subsidiaries in the United States and/or other countries: Abaqus, Abaqus/Standard, Abaqus/Explicit. All information on the ABAQUS website is Copyrighted 2004-2009 by Dassault Systemes. Results from http://www.simulia.com/support/v69/v69_performance.php as of October 12, 2009.

MCAE ANSYS faster on Sun F5100 and Sun X4270

Significance of Results

The Sun Storage F5100 Flash Array can greatly improve performance over internal hard disk drives as shown by the I/O intensive ANSYS MCAE application BMD benchmark tests on a Sun Fire X4270 server.

Select ANSYS 12 BMD benchmarks were run on a single Sun Fire X4270 server. These I/O intensive test cases were run to compare the performance of conventional high performance disk to Sun FlashFire technology.

The ANSYS 12.0 module is an MCAE application based on the finite element analysis (FEA) method. This computer-based numerical method inherently involves a substantial I/O component. The purpose was to evaluate the performance of the Sun Storage F5100 Flash Array relative to high performance 15K RPM internal striped HDDs.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "BMD-4" test case by 67% in the 8-core/8-thread server configuration.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "BMD-7" test case by 18% in the 8-core/16-thread server configuration.

Performance Landscape

ANSYS 12 "BMD" Test Suite on Single X4270 (24GB mem.) - SMP Mode

Results are total elapsed run times in seconds

Test Case   SMP   4 x 15K RPM 72 GB SAS HDD   Sun F5100                  Sun F5100 Performance
                  (striped HW RAID0)          (r/w buff 4096, striped)   Advantage
bmd-4       8     523                         314                        67%
bmd-7       16    357                         303                        18%

Results and Configuration Summary

Hardware Configuration:
    Sun Fire X4270
      2 x 2.93 GHz QC Intel Xeon X5570 processors
      Hyperthreading enabled
      24 GB memory
      4 x 72 GB 15K RPM striped (HW RAID0) SAS disks
    Sun Storage F5100 Flash Array
      20 x 24 GB flash modules
      Intel controller

Software Configuration:

    O/S: 64-bit SUSE Linux Enterprise Server 10 SP 2
    Application: ANSYS Multiphysics 12.0
    Benchmark: ANSYS 12 "BMD" Benchmark Test Suite

Benchmark Description

ANSYS is a general purpose engineering analysis MCAE application that is based on the Finite Element Method. It performs both structural (stress) analysis and thermal analysis. These analyses may be either static or transient dynamic and can be linear or nonlinear as far as material behavior or deformations are concerned. Ansys provides a number of benchmark tests which exercise the capabilities of the software.

Please go here for a more complete description of the tests.

Key Points and Best Practices

Performance Considerations

The performance of ANSYS (an I/O intensive MCAE application) can be increased by reducing the I/O demands of the application, either by increasing server memory or by using SSDs to increase bandwidth and reduce latency. The most I/O intensive case in the ANSYS distributed "BMD" test suite is BMD-4, particularly at the (maximum) 8-core level for a single node.


  • Ansys now takes full advantage of inexpensive RAID0 disk arrays and delivers sustained I/O rates.

  • Large memory can cache file accesses but often the size of ANSYS files grows much larger than the available physical memory so that system file caching is not able to hide the I/O cost.
  • For fast ANSYS runs the recommended configuration is a RAID 0 setup using 4 or more disks and a fast RAID controller. These fast I/O configurations are inexpensive to put together for systems and can achieve I/O rates in excess of 200 MB/sec.
  • SSD drives have much lower seek times, use less power, and tend to be about 2X faster than the fastest rotating disks for sustained throughput. The observed speed of a RAID 0 configuration of SSD drives for ANSYS simulations has been nearly as fast as I/O that is cached by large memory systems. SSD drives then may be the most affordable way to extend the capacity of a system to jobs that are too large to run in-core without incurring the performance penalty usually associated with I/O demands.

More About The ANSYS BMD "Distributed" Benchmarks

ANSYS is a general purpose engineering analysis MCAE application that is based on the Finite Element Method. It performs both structural (stress) analysis and thermal analysis. These analyses may be either static or transient dynamic and can be linear or nonlinear as far as material behavior or deformations are concerned.

In the most recent release of the ANSYS benchmarks there are now two test suites: the SMP "BM" suite, designed to run on a single node with multiple processors, and the DMP "BMD" suite, intended to run on multi-node clusters but also able to run on a single node in SMP mode, as in this study.

  • The test cases from both ANSYS test suites all have a substantial I/O component, with 15% to 20% of the total run times associated with I/O activity (primarily scratch files). Performance will be enhanced by using the fastest available drives and striping more than one of them together, or by using a high performance disk storage system with high performance interconnects. When running with the SX64 (Solaris x64) build, ZFS can be a good choice for the scratch file system (see the sketch after this list).
  • The ANSYS test cases don't scale very well (BMD scales better than BM); at best up to 8 cores.
  • The memory requirements for the test cases in the ANSYS BMD suite are greater than for the standard benchmark test suite. The requirements for the standard suite are modest, at less than 3 GB.
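
As a sketch of the ZFS suggestion above, a striped pool for ANSYS scratch files might be created as follows on Solaris; the device names and pool name are placeholders, not the tested configuration.

    # Create a striped ZFS pool from two disks and a file system for scratch I/O
    zpool create anscratch c1t1d0 c1t2d0
    zfs create anscratch/ansys
    # 128k is the default (and maximum) recordsize; shown explicitly because
    # large sequential scratch I/O benefits from large records
    zfs set recordsize=128k anscratch/ansys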

See Also

MCAE, SSD, HPC, ANSYS, Linux, SuSE, Performance, X64, Intel

Disclosure Statement

ANSYS Multiphysics™ is a trademark or registered trademark of ANSYS, Inc. All information on the ANSYS website is copyrighted by ANSYS, Inc. Results from http://www.ansys.com/services/ss-intel-bench120.htm as of October 12, 2009.

Thursday Jul 23, 2009

World Record Performance of Sun CMT Servers

This week, Sun continues to highlight the record-breaking performance of its latest update to the chip multi-threaded (CMT) Sun SPARC Enterprise server family running Solaris. Some of these benchmarks leverage a variety of Sun's unique technologies, including ZFS, SSDs, and various storage products. These benchmarks were blogged about by various members of our team, and the URLs are shown below.

Messages

  • Sun's CMT is the most powerful CPU regardless of architectural/implementation details (#transistors, #cores, threads, MHz, etc.)!
  • Performance tests show that Sun can outperform IBM Power6 by more than 2x on a variety of benchmarks.
  • Performance tests show Sun's new 1.6GHz CMT systems can be 20% faster than Sun's previous generation 1.4GHz processors, given Sun's continual advancements in both hardware and software.

Benchmark Results Recently Blogged

Sun T5440 Oracle BI EE World Record Performance
http://blogs.sun.com/BestPerf/entry/sun_t5440_oracle_bi_ee

Sun T5440 World Record SAP-SD 4-Processor Two-tier SAP ERP 6.0 EP 4 (Unicode), Beats IBM POWER6 (note1)
http://blogs.sun.com/BestPerf/entry/sun_t5440_world_record_sap

Zeus ZXTM Traffic Manager World Record on Sun T5240
http://blogs.sun.com/BestPerf/entry/top_performance_on_sun_sparc

Sun T5440 SPECjbb2005, Sun 1.6GHz T2 Plus chip is 2.3x IBM 4.7GHz POWER6 chip
http://blogs.sun.com/BestPerf/entry/sun_t5440_specjbb2005_beats_ibm

New SPECjAppServer2004 Performance on the Sun SPARC Enterprise T5440
http://blogs.sun.com/BestPerf/entry/new_specjappserver2004_performance_on_sun

1.6 GHz SPEC CPU2006: World Record 4-chip system, Rate Benchmarks, Beats IBM POWER6
http://blogs.sun.com/BestPerf/entry/1_6_ghz_spec_cpu2006

Sun Blade T6320 World Record 1-chip SPECjbb2005 performance, Sun 1.6GHz T2 Plus chip is 2.6x IBM 4.7GHz POWER6 chip
http://blogs.sun.com/BestPerf/entry/new_specjbb2005_performance_on_the

Comparison Table

Each entry below gives the benchmark, the Sun CMT system, the tier(s) exercised, the software stack, and the key messages.

Oracle BI EE | Sun T5440 | Appl, Database | Oracle 11g, Oracle BI EE, ZFS, Solaris
  • World Record: T5440
  • Achieved 28,000 users
  • Reference

SAP-SD 2-Tier | Sun T5440 | Appl, Database | SAP ECC 6.0 EP4, Oracle 10g, Solaris
  • World Record 4-socket: T5440
  • T5440 beats 4-socket IBM 550 5GHz Power6 by 26% (note1)
  • T5440 beats HP DL585 G6 4-socket Opteron (note1)
  • Unicode version

SPECjAppServer2004 | Sun T5440 | Appl, Database | Oracle WebLogic, Oracle 11g, JDK 1.6.0_14, Solaris
  • World Record Single System (Appl Tier): T5440
  • T5440 is 6.4x faster than IBM Power 570 4.7GHz Power6
  • T5440 is 73% faster than HP DL580 G5 Xeon 6C
  • Oracle Fusion Middleware

SPECjbb2005 | Sun T5440 | Appl | Java HotSpot, OpenSolaris
  • 1.6GHz US T2 Plus CPU is 2.3x faster than IBM 4.7GHz Power6 CPU
  • 1.6GHz US T2 Plus CPU is 21% faster than the previous generation 1.4GHz US T2 Plus CPU
  • Sun T5440 has 2.3x better power/perf than the IBM 570 (8 x 4.7GHz Power6)

SPECjbb2005 | Sun Blade T6320 | Appl | Java HotSpot, OpenSolaris
  • World Record 1-socket: T6320
  • 1.6GHz US T2 Plus CPU is 2.6x faster than IBM 4.7GHz Power6 CPU
  • T6320 is 3% faster than Fujitsu 3.16GHz Xeon QC

SPEC CPU2006 | Sun T5440, T5240, T5220, T5120, T6320 | all tiers | Sun Studio 12, Solaris, ZFS
  • World Record 4-socket: T5440
  • 1.6GHz US T2 Plus CPU is 2.6x faster than IBM 4.7GHz Power6 CPU
  • T6320 is 3% faster than Fujitsu 3.16GHz Xeon QC

Zeus ZXTM Traffic Manager | Sun T5240 | Web | Zeus ZXTM v5.1r1, Solaris
  • World Record: T5240
  • T5240 beats f5 BIG-IP VIPRION by 34%; 2.6x better $/perf
  • T5240 beats f5 BIG-IP 8800 by 91%; 2.7x better $/perf
  • T5240 beats Citrix 12000 by 2.2x; 3.3x better $/perf
  • No IBM result

Virtualization

Sun's announcement also included updated virtualization software (LDoms 1.1). Downloads are available to existing SPARC Enterprise server customers at: http://www.sun.com/servers/coolthreads/ldoms/index.jsp. Also see the blog posting "LDoms for Dummies" at http://blogs.sun.com/PierreReynes/entry/ldoms_for_dummies

Try & Buy Program

Sun is also offering free 60-day trials on Sun CMT servers with a very popular Try and Buy program: http://www.sun.com/tryandbuy.

Benchmark Performance Disclosure Statements (the URLs listed above go into more detail on each of these benchmarks)

Note1: 4-processor world record on the two-tier SAP SD Standard Application Benchmark with 4,720 SD Users, as of July 23, 2009. IBM System 550 (4 processors, 8 cores, 16 threads) 3,752 SAP SD Users, 4x 5 GHz Power6, 64 GB memory, DB2 9.5, AIX 6.1, Cert# 2009023. The T5440 beats HP's new 4-socket Opteron servers (HP DL585 G6 with 4,665 SD Users and HP BL685c G6 with 4,422 SD Users).

Two-tier SAP Sales and Distribution (SD) standard SAP ERP 6.0 2005/EP4 (Unicode) application benchmarks as of 07/21/09: Sun SPARC Enterprise T5440 Server (4 processors, 32 cores, 256 threads) 4,720 SAP SD Users, 4x 1.6 GHz UltraSPARC T2 Plus, 256 GB memory, Oracle10g, Solaris10, Cert# 2009026. HP ProLiant DL585 G6 (4 processors, 24 cores, 24 threads) 4,665 SAP SD Users, 4x 2.8 GHz AMD Opteron Processor 8439 SE, 64 GB memory, SQL Server 2008, Windows Server 2008 Enterprise Edition, Cert# 2009025. HP ProLiant BL685c G6 (4 processors, 24 cores, 24 threads) 4,422 SAP SD Users, 4x 2.6 GHz AMD Opteron Processor 8435, 64 GB memory, SQL Server 2008, Windows Server 2008 Enterprise Edition, Cert# 2009021. IBM System 550 (4 processors, 8 cores, 16 threads) 3,752 SAP SD Users, 4x 5 GHz Power6, 64 GB memory, DB2 9.5, AIX 6.1, Cert# 2009023. HP ProLiant DL585 G5 (4 processors, 16 cores, 16 threads) 3,430 SAP SD Users, 4x 3.1 GHz AMD Opteron Processor 8393 SE, 64 GB memory, SQL Server 2008, Windows Server 2008 Enterprise Edition, Cert# 2009008. HP ProLiant BL685 G6 (4 processors, 16 cores, 16 threads) 3,118 SAP SD Users, 4x 2.9 GHz AMD Opteron Processor 8389, 64 GB memory, SQL Server 2008, Windows Server 2008 Enterprise Edition, Cert# 2009007. NEC Express5800 (4 processors, 24 cores, 24 threads) 2,957 SAP SD Users, 4x 2.66 GHz Intel Xeon Processor X7460, 64 GB memory, SQL Server 2008, Windows Server 2008 Enterprise Edition, Cert# 2009018. Dell PowerEdge M905 (4 processors, 16 cores, 16 threads) 2,129 SAP SD Users, 4x 2.7 GHz AMD Opteron Processor 8384, 96 GB memory, SQL Server 2005, Windows Server 2003 Enterprise Edition, Cert# 2009017. Sun Fire X4600M2 (8 processors, 32 cores, 32 threads) 7,825 SAP SD Users, 8x 2.7 GHz AMD Opteron 8384, 128 GB memory, MaxDB 7.6, Solaris 10, Cert# 2008070. IBM System x3650 M2 (2 Processors, 8 Cores, 16 Threads) 5,100 SAP SD users,2x 2.93 Ghz Intel Xeon X5570, DB2 9.5, Windows Server 2003 Enterprise Edition, Cert# 2008079. HP ProLiant DL380 G6 (2 processors, 8 cores, 16 threads) 4,995 SAP SD Users, 2x 2.93 GHz Intel Xeon x5570, 48 GB memory, SQL Server 2005, Windows Server 2003 Enterprise Edition, Cert# 2008071. SAP, R/3, reg TM of SAP AG in Germany and other countries. More info www.sap.com/benchmark.

Oracle Business Intelligence Enterprise Edition benchmark, see http://www.oracle.com/solutions/business_intelligence/resource-library-whitepapers.html for more. Results as of 7/20/09.

Zeus is TM of Zeus Technology Limited. Results as of 7/21/2009 on http://www.zeus.com/news/press_articles/zeus-price-performance-press-release.html?gclid=CLn4jLuuk5cCFQsQagod7gTkJA.

SPEC, SPECint, SPECfp reg tm of Standard Performance Evaluation Corporation. Competitive results from www.spec.org as of 16 July 2009. Sun's new results quoted on this page have been submitted to SPEC. Sun Blade T6320 89.2 SPECint_rate_base2006, 96.7 SPECint_rate2006, 64.1 SPECfp_rate_base2006, 68.5 SPECfp_rate2006; Sun SPARC Enterprise T5220/T5120 89.1 SPECint_rate_base2006, 97.0 SPECint_rate2006, 64.1 SPECfp_rate_base2006, 68.5 SPECfp_rate2006; Sun SPARC Enterprise T5240 172 SPECint_rate_base2006, 183 SPECint_rate2006, 124 SPECfp_rate_base2006, 133 SPECfp_rate2006; Sun SPARC Enterprise T5440 338 SPECint_rate_base2006, 360 SPECint_rate2006, 254 SPECfp_rate_base2006, 270 SPECfp_rate2006; Sun Blade T6320 76.4 SPECint_rate_base2006, 85.5 SPECint_rate2006, 58.1 SPECfp_rate_base2006, 62.3 SPECfp_rate2006; Sun SPARC Enterprise T5220/T5120 76.2 SPECint_rate_base2006, 83.9 SPECint_rate2006, 57.9 SPECfp_rate_base2006, 62.3 SPECfp_rate2006; Sun SPARC Enterprise T5240 142 SPECint_rate_base2006, 157 SPECint_rate2006, 111 SPECfp_rate_base2006, 119 SPECfp_rate2006; Sun SPARC Enterprise T5440 270 SPECint_rate_base2006, 301 SPECint_rate2006, 212 SPECfp_rate_base2006, 230 SPECfp_rate2006; IBM p 570 53.2 SPECint_rate_base2006, 60.9 SPECint_rate2006, 51.5 SPECfp_rate_base2006, 58.0 SPECfp_rate2006; IBM Power 520 102 SPECint_rate_base2006, 124 SPECint_rate2006, 88.7 SPECfp_rate_base2006, 105 SPECfp_rate2006; IBM Power 550 215 SPECint_rate_base2006, 263 SPECint_rate2006, 188 SPECfp_rate_base2006, 222 SPECfp_rate2006; HP Integrity BL870c 114 SPECint_rate_base2006; HP Integrity rx7640 87.4 SPECfp_rate_base2006, 90.8 SPECfp_rate2006.

SPEC, SPECjbb reg tm of Standard Performance Evaluation Corporation. Results as of 7/17/2009 on http://www.spec.org. SPECjbb2005, Sun Blade T6320 229576 SPECjbb2005 bops, 28697 SPECjbb2005 bops/JVM; IBM p 570 88089 SPECjbb2005 bops, 88089 SPECjbb2005 bops/JVM; Fujitsu TX100 223691 SPECjbb2005 bops, 111846 SPECjbb2005 bops/JVM; IBM x3350 194256 SPECjbb2005 bops, 97128 SPECjbb2005 bops/JVM; Sun SPARC Enterprise T5120 192055 SPECjbb2005 bops, 24007 SPECjbb2005 bops/JVM.

SPECjAppServer2004, Sun SPARC Enterprise T5440 (4 chips, 32 cores) 7661.16 SPECjAppServer2004 JOPS@Standard; HP DL580 G5 (4 chips, 24 cores) 4410.07 SPECjAppServer2004 JOPS@Standard; HP DL580 G5 (4 chips, 16 cores) 3339.94 SPECjAppServer2004 JOPS@Standard; Two Dell PowerEdge 2950 (4 chips, 16 cores) 4794.33 SPECjAppServer2004 JOPS@Standard; Dell PowerEdge R610 (2 chips, 8 cores) 3975.13 SPECjAppServer2004 JOPS@Standard; Two Dell PowerEdge R610 (4 chips, 16 cores) 7311.50 SPECjAppServer2004 JOPS@Standard; IBM Power 570 (2 chips, 4 cores) 1197.51 SPECjAppServer2004 JOPS@Standard; SPEC, SPECjAppServer reg tm of Standard Performance Evaluation Corporation. Results from http://www.spec.org as of 7/20/09.

SPECjbb2005 Sun SPARC Enterprise T5440 (4 chips, 32 cores) 841380 SPECjbb2005 bops, 26293 SPECjbb2005 bops/JVM. Results submitted to SPEC. HP DL585 G5 (4 chips, 24 cores) 937207 SPECjbb2005 bops, 234302 SPECjbb2005 bops/JVM. IBM Power 570 (8 chips, 16 cores) 798752 SPECjbb2005 bops, 99844 SPECjbb2005 bops/JVM. Sun SPARC Enterprise T5440 (4 chips, 32 cores) 692736 SPECjbb2005 bops, 21648 SPECjbb2005 bops/JVM. SPEC, SPECjbb reg tm of Standard Performance Evaluation Corporation. Results from www.spec.org as of 7/20/09.

IBM p 570 8P 4.7GHz (4 building blocks) power specifications calculated as 80% of maximum input power reported 7/8/09 in “Facts and Features Report”: ftp://ftp.software.ibm.com/common/ssi/pm/br/n/psb01628usen/PSB01628USEN.PDF

Thursday Jun 25, 2009

Sun SSD Server Platform Bandwidth and IOPS (Speeds & Feeds)

The Sun SSD (32 GB SATA 2.5" SSD) is the world's first enterprise-quality, open-standard Flash design. Built to an industry-standard JEDEC form factor, the module is being made available to developers and the OpenSolaris Storage community to foster Flash innovation. The Sun SSD delivers unprecedented IO performance, saves on power, space, and cooling, and will enable new levels of server optimization and datacenter efficiencies.

  • The Sun SSD demonstrated performance of 98K 4K random read IOPS on a Sun Fire X4450 server running the Solaris operating system.

Performance Landscape

Solaris 10 Results

Test                     Sun Fire X4450    Sun SPARC Enterprise T5240
Random Read (4K)         98.4K IOPS        71.5K IOPS
Random Write (4K)        31.8K IOPS        14.4K IOPS
50-50 Read/Write (4K)    14.9K IOPS        15.7K IOPS
Sequential Read          764 MB/sec        1012 MB/sec
Sequential Write         376 MB/sec        531 MB/sec

Results and Configuration Summary

Storage:

    4 x Sun SSD
    32 GB SATA 2.5" SSD (24 GB usable)
    2.5in drive form factor

Servers:

    Sun SPARC Enterprise T5240 - 4 internal drive slots used (LSI driver)
    Sun Fire X4450 - 4 internal drive slots used (LSI driver)

Software:

    OpenSolaris 2009.06 or Solaris 10 10/09 (MPT driver enhancements)
    Vdbench 5.0

Benchmark Description

Sun measured a wide variety of I/O performance metrics on the Sun SSD using Vdbench 5.0, covering 100% random read, 100% random write, 100% sequential read, 100% sequential write, and a 50-50 read/write mix. This demonstrates the maximum performance and throughput of the storage system. In the profile below, readpct sets the read/write mix, while seekpct selects random (100) or sequential (0) access.

Vdbench profile:

    wd=wm_80dr,sd=sd*,readpct=0,rhpct=0,seekpct=100
    wd=ws_80dr,sd=sd*,readpct=0,rhpct=0,seekpct=0
    wd=rm_80dr,sd=(sd1-sd80),readpct=100,rhpct=0,seekpct=100
    wd=rs_80dr,sd=(sd1-sd80),readpct=100,rhpct=0,seekpct=0
    wd=rwm_80dr,sd=sd*,readpct=50,rhpct=0,seekpct=100
    rd=default
    ###Random Read and writes tests varying transfer size
    rd=default,el=30m,in=6,forx=(4K),forth=(32),io=max,pause=20
    rd=run1_rm_80dr,wd=rm_80dr
    rd=run2_wm_80dr,wd=wm_80dr
    rd=run3_rwm_80dr,wd=rwm_80dr
    ###Sequential read and Write tests varying transfer size
    rd=default,el=30m,in=6,forx=(512k),forth=(32),io=max,pause=20
    rd=run4_rs_80dr,wd=rs_80dr
    rd=run5_ws_80dr,wd=ws_80dr
Vdbench is publicly available for download at: http://vdbench.org
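
A profile like the one above would typically be run with a command of the following form; the parameter-file and output-directory names here are illustrative only:

    # Execute the workload profile and write interval reports to ./ssd_out
    ./vdbench -f ssd_profile.txt -o ssd_out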

Key Points and Best Practices

  • All measurements were done with the internal HBA and not the internal RAID.

See Also

Disclosure Statement

Sun SSD delivered 71.5K 4K read IOPS and 1012 MB/sec sequential read. Vdbench 5.0 (http://vdbench.org) was used for the test. Results as of June 17, 2009.

Friday Jun 19, 2009

SSDs in HPC: Reducing the I/O Bottleneck BluePrint Best Practices

The performance of High-Performance Computing (HPC) applications can be dramatically increased by simply using SSDs instead of traditional hard drives. To read about these findings, see the Sun BluePrint by Larry McIntosh and Michael Burke, called "Solid State Drives in HPC: Reducing the I/O Bottleneck".

There was a BestPerf blog posting on the NASTRAN/SSD results at:
http://blogs.sun.com/BestPerf/entry/sun_fire_x2270_msc_nastran

Our BestPerf authors will blog about more of their recent benchmarks in the coming weeks.

Tuesday Jun 16, 2009

Sun Fire X2270 MSC/Nastran Vendor_2008 Benchmarks

Significance of Results

The I/O intensive MSC/Nastran Vendor_2008 benchmark test suite was used to compare the performance on a Sun Fire X2270 server when using SSDs internally instead of HDDs.

The effect on performance of increasing memory to augment I/O caching was also examined. The Sun Fire X2270 server was equipped with Intel QC Xeon X5570 (Nehalem) processors. The positive effect of adding memory to increase I/O caching is offset to some degree by the reduction in memory frequency when additional DIMMs populate the bays of each memory channel on each CPU socket of these Nehalem processors.

  • SSDs can significantly improve NASTRAN performance, especially on runs with larger core counts.
  • Additional memory in the server can also increase performance; however, in some systems additional memory reduces the memory clock rate, which may offset the benefit of the increased capacity.
  • If SSDs are not used, striped disks will often improve the performance of I/O-bound MCAE applications.
  • To obtain the highest performance, it is recommended that SSDs be used and that servers be configured with the largest memory possible without decreasing the memory clock rate. One should always examine the workload characteristics and compare them against this benchmark to set expectations correctly.

SSD vs. HDD Performance

The performance of two striped 30GB SSDs was compared to two striped 7200 rpm 500GB SATA drives on a Sun Fire X2270 server.

  • At the 8-core level (the maximum for a single node), SSDs were 2.2x faster for both the larger xx0cmd2 case and the smaller xl0tdf1 case.
  • For 1-core results, SSDs are up to 3% faster.
  • On the smaller md0mdf1 test case there was no increase in performance in the 1-, 2-, and 4-core configurations.

Performance Enhancement with I/O Memory Caching

Performance of Nastran can often be increased by adding memory, which provides additional in-core space to cache I/O and thereby reduces the I/O demands.

The main memory was doubled from 24GB to 48GB. At the 24GB level one 4GB DIMM was placed in the first bay of each of the 3 CPU memory channels on each of the two CPU sockets on the Sun Fire X2270 platform. This configuration allows a memory frequency of 1333MHz.

At the 48GB level a second 4GB DIMM was placed in the second bay of each of the 3 CPU memory channels on each socket. This reduces the memory frequency to 1066MHz.

Adding Memory With HDDs (SATA)

  • The additional server memory increased performance when running with the slower SATA drives at the higher core levels (e.g., 4 and 8 cores on a single node).
  • The larger xx0cmd2 case was 42% faster and the smaller xl0tdf1 case was 32% faster at the maximum 8-core level on a single system.
  • The special I/O-intensive getrag case was 8% faster at the 1-core level.

Adding Memory With SSDs

  • At the maximum 8-core level (for a single node), the larger xx0cmd2 case was 47% faster in overall run time.
  • The effects were much smaller at lower core counts; at the 1-core level most test cases ran 5% to 14% slower, with the lower CPU memory frequency outweighing the added in-core space available for I/O caching versus direct transfer to SSD.
  • Only the special I/O-intensive getrag case was an exception, running 6% faster at the 1-core level.

Increasing Performance with Two Striped (SATA) Drives

The performance of multiple striped drives was also compared to that of a single drive. The study compared two striped internal 7200 RPM 500 GB SATA drives to a single internal SATA drive.

  • On a single node with 8 cores, the largest test case xx0cmd2 was 40% faster, a smaller test case xl0tdf1 was 33% faster, and even the smallest test case md0mdf1 was 12% faster.

  • At the 1-core level, the added boost in performance with striped disks ranged from 4% to 13% across the various test cases.

  • At the 1-core level, the special I/O-intensive test case getrag was 29% faster.

Performance Landscape

Times in the table are elapsed run times in seconds. Ratio columns divide the elapsed time of the first-named configuration by that of the second, so values below 1.00 mean the first-named configuration was faster.

MSC/Nastran Vendor_2008 Benchmark Test Suite

Server: Sun Fire X2270, 2 x X5570 QC 2.93 GHz. The HDD columns use internal 7200 RPM SATA drives (one or two, as noted); the SSD columns use two internal striped SSDs.

                      ------------- SATA HDDs --------------    ------------- SSDs -------------
Test         Cores    48 GB    24 GB    24 GB    Ratio  Ratio    48 GB    24 GB    Ratio    Ratio
                      2xSATA   2xSATA   1xSATA   48GB/  2xSATA/  2xSSD    2xSSD    48GB/    2xSATA/
                      1067MHz  1333MHz  1333MHz  24GB   1xSATA   1067MHz  1333MHz  24GB     2xSSD
                                               (2xSATA) (24GB)                    (2xSSD)   (24GB)

vlosst1        1       133      127      134     1.05    0.95      133     126     1.05     1.01

xx0cmd2        1       946      895      978     1.06    0.87      947     884     1.07     1.01
               2       622      614      703     1.01    0.87      600     583     1.03     1.05
               4       466      631      991     0.74    0.64      426     404     1.05     1.56
               8      1049     1554     2590     0.68    0.60      381     711     0.53     2.18

xl0tdf1        1      2226     2000     2081     1.11    0.96     2214    1939     1.14     1.03
               2      1307     1240     1308     1.05    0.95     1315    1189     1.10     1.04
               4       858      833     1030     1.03    0.81      744     751     0.99     1.11
               8       912     1562     2336     0.58    0.67      674     712     0.95     2.19

xl0imf1        1      1216     1151     1236     1.06    0.93     1228    1290     0.95     0.89

md0mdf1        1       987      913      983     1.08    0.93      987     911     1.08     1.00
               2       524      485      520     1.08    0.93      524     484     1.08     1.00
               4       270      237      269     1.14    0.88      270     250     1.08     0.95

Sol400_1       1      2555     2479     2674     1.03    0.93     2549    2402     1.06     1.03
(xl1fn40_1)

Sol400_S       1      2450     2302     2481     1.06    0.93     2449    2262     1.08     1.02
(xl1fn40_S)

getrag         1       778      843     1178     0.92    0.71      771     817     0.94     1.03
(xx0xst0)

Results and Configuration Summary

Hardware Configuration:
    Sun Fire X2270
1 x 2-socket rack-mounted server
      2 x 2.93 GHz QC Intel Xeon X5570 processors
      2 x internal striped SSDs
      2 x internal striped 7200 rpm 500GB SATA drives

Software Configuration:

O/S: 64-bit SUSE Linux Enterprise Server (SLES) 10 SP 2
    Application: MSC/NASTRAN MD 2008
    Benchmark: MSC/NASTRAN Vendor_2008 Benchmark Test Suite
    HP MPI: 02.03.00.00 [7585] Linux x86-64
    Voltaire OFED-5.1.3.1_5 GridStack for SLES 10

Benchmark Description

The benchmark tests are representative of typical MSC/Nastran applications including both SMP and DMP runs involving linear statics, nonlinear statics, and natural frequency extraction.

The MD (Multi Discipline) Nastran 2008 application performs both structural (stress) analysis and thermal analysis. These analyses may be either static or transient dynamic and can be linear or nonlinear as far as material behavior and/or deformations are concerned. The new release includes the MARC module for general purpose nonlinear analyses and the Dytran module that employs an explicit solver to analyze crash and high velocity impact conditions.

  • As of Summer 2008, there is an official Solaris x64 version of the MD Nastran 2008 system that is certified and maintained.
  • The memory requirements for the test cases in the new MSC/Nastran Vendor 2008 benchmark test suite range from a few hundred megabytes to no more than 5 GB.

Please go here for a more complete description of the tests.

Key Points and Best Practices

For more on Best Practices of SSD on HPC applications also see the Sun Blueprint:
http://wikis.sun.com/display/BluePrints/Solid+State+Drives+in+HPC+-+Reducing+the+IO+Bottleneck

Additional information on the MSC/Nastran Vendor 2008 benchmark test suite.

  • Based on the maximum physical memory of a platform, the user can stipulate the maximum portion of that memory that can be allocated to the Nastran job. This is done on the command line with the mem= option. On Linux-based systems, where the platform has a large amount of memory and the model does not have large scratch I/O requirements, the memory can be allocated to a tmpfs scratch space file system. On Solaris x64 systems, advantage can be taken of ZFS for higher I/O performance (see the sketch after this list).

  • The MSC/Nastran Vendor 2008 test cases don't scale very well; a few do not scale at all, and the rest scale up to 8 cores at best.

  • The test cases for the MSC/Nastran module all have a substantial I/O component, with 15% to 25% of the total run times associated with I/O activity (primarily scratch files). The required scratch file size ranges from less than 1 GB up to about 140 GB. Performance will be enhanced by using the fastest available drives and striping more than one of them together, or by using a high performance disk storage system, further enhanced as indicated here by implementing a Lustre-based I/O system. High performance interconnects such as InfiniBand, for inter-node cluster message passing as well as I/O transfer from the storage system, can also enhance performance substantially.
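
As a sketch of the tmpfs approach mentioned in the first bullet above (Linux; the mount size, job file, memory setting, and scratch-directory keyword are illustrative and should be checked against the local Nastran installation):

    # Back the Nastran scratch directory with RAM via tmpfs (size is an example)
    mkdir -p /mnt/nast_scratch
    mount -t tmpfs -o size=16g tmpfs /mnt/nast_scratch
    # Launch the job with an explicit memory allocation and scratch location;
    # mem= is the option described above, sdirectory= names the scratch area
    nastran xx0cmd2.dat mem=4gb sdirectory=/mnt/nast_scratch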

See Also

Disclosure Statement

MSC.Software is a registered trademark of MSC. All information on the MSC.Software website is copyrighted. MSC/Nastran Vendor 2008 results from http://www.mscsoftware.com and this report as of June 9, 2009.

About

BestPerf is the source of Oracle performance expertise. In this blog, Oracle's Strategic Applications Engineering group explores Oracle's performance results and shares best practices learned from working on Enterprise-wide Applications.
