Monday Oct 26, 2015

In-Memory Aggregation: SPARC T7-2 Beats 4-Chip x86 E7 v2

Oracle's SPARC T7-2 server demonstrates better performance in both throughput and number of users compared to a four-chip x86 E7 v2 server. The workload consists of a realistic set of business intelligence (BI) queries in a multi-user environment against a 500 million row fact table, using Oracle Database 12c Enterprise Edition with the In-Memory option.

  • The SPARC M7 chip delivers 2.3 times more query throughput per hour compared to an x86 E7 v2 chip.

  • The two-chip SPARC T7-2 server delivered 13% more query throughput per hour compared to a four-chip x86 E7 v2 server.

  • The two-chip SPARC T7-2 server supported over 10% more users than a four-chip x86 E7 v2 server.

  • Both the SPARC server and the x86 server ran with an average response time of just under 5 seconds.

Performance Landscape

The results below were run as part of this benchmark. All results use a 500,000,000-row fact table and had an average CPU utilization of 100%.

In-Memory Aggregation
500 Million Row Fact Table

System                       Users   Queries     Queries per Hour   Average
                                     per Hour    per Chip           Response Time
SPARC T7-2
2 x SPARC M7 (32 cores)      190     127,540     63,770             4.99 (sec)
x86 E7 v2
4 x E7-8895 v2 (15 cores)    170     112,470     28,118             4.92 (sec)

The number of cores is listed per chip.

Configuration Summary

SPARC Configuration:

SPARC T7-2
2 x 4.13 GHz SPARC M7 processors
1 TB memory (32 x 32 GB)
Oracle Solaris 11.3
Oracle Database 12c Enterprise Edition (12.1.0.2.0)

x86 Configuration:

Sun Server X4-4
4 x Intel Xeon Processor E7-8895 v2 processors
1 TB memory (64 x 16 GB)
Oracle Linux Server 6.5 (kernel 2.6.32-431.el6.x86_64)
Oracle Database 12c Enterprise Edition (12.1.0.2.0)

Benchmark Description

The benchmark is designed to highlight the efficacy of the Oracle Database 12c In-Memory Aggregation facility (join and aggregation optimizations) together with the fast scan and filtering capability of Oracle's in-memory column store facility.

The benchmark runs analytic queries such as those seen in typical customer business intelligence (BI) applications. These are run in the context of a star schema database. The key metrics are query throughput, number of users, and average response time.

The implementation of the workload used to achieve the results is based on a schema consisting of 9 dimension tables together with a 500 million row fact table.

The query workload consists of randomly generated star-style queries simulating a collection of ad-hoc business intelligence users. Up to 300 concurrent users have been run, with each user running approximately 500 queries. The implementation includes a relatively small materialized view, which contains some precomputed data. The creation of the materialized view takes only a few minutes.
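
For illustration, a generated query of this style might look like the following sketch. The fact and dimension table names are taken from the materialized view definition shown under Key Points and Best Practices below, the PARALLEL(4) hint matches the settings listed there, and the filter value is hypothetical:

    SELECT /*+ PARALLEL(4) */
           d1.calendar_year_name,
           d2.department_name,
           SUM(f.sales) AS sales,
           SUM(f.units) AS units
    FROM   units_fact_500M_10 f, time_dim d1, product_dim d2
    WHERE  d1.day_id = f.day_id
      AND  d2.item_id = f.item_id
      AND  d1.calendar_year_name = 'CY2014'   -- hypothetical filter value
    GROUP BY d1.calendar_year_name, d2.department_name;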

Key Points and Best Practices

The reported results were obtained by using the following settings on both systems except where otherwise noted:

  1. starting with a completely cold shared pool
  2. without making use of the result cache
  3. without using dynamic sampling or adaptive query optimization
  4. running all queries in parallel, where
    parallel_max_servers = 1600 (on the SPARC T7-2) or
    parallel_max_servers = 240 (on the Sun Server X4-4)
    each query hinted with PARALLEL(4)
    parallel_degree_policy = limited
  5. having appropriate queries rewritten to the materialized view, MV3, defined as
    SELECT
    /*+ append vector_transform */
    d1.calendar_year_name, d1.calendar_quarter_name, d2.all_products_name,
    d2.department_name, d2.category_name, d2.type_name, d3.all_customers_name,
    d3.region_name, d3.country_name, d3.state_province_name, d4.all_channels_name,
    d4.class_name, d4.channel_name, d5.all_ages_name, d5.age_name, d6.all_sizes_name,
    d6.household_size_name, d7.all_years_name, d7.years_customer_name, d8.all_incomes_name,
    d8.income_name, d9.all_status_name, d9.marital_status_name,
    SUM(f.sales) AS sales,
    SUM(f.units) AS units,
    SUM(f.measure_3) AS measure_3,
    SUM(f.measure_4) AS measure_4,
    SUM(f.measure_5) AS measure_5,
    SUM(f.measure_6) AS measure_6,
    SUM(f.measure_7) AS measure_7,
    SUM(f.measure_8) AS measure_8,
    SUM(f.measure_9) AS measure_9,
    SUM(f.measure_10) AS measure_10
    FROM time_dim d1, product_dim d2, customer_dim_500M_10 d3, channel_dim d4, age_dim d5,
    household_size_dim d6, years_customer_dim d7, income_dim d8, marital_status_dim d9,
    units_fact_500M_10 f
    WHERE d1.day_id = f.day_id AND
    d2.item_id = f.item_id AND
    d3.customer_id = f.customer_id AND
    d4.channel_id = f.channel_id AND
    d5.age_id = f.age_id AND
    d6.household_size_id = f.household_size_id AND
    d7.years_customer_id = f.years_customer_id AND
    d8.income_id = f.income_id AND
    d9.marital_status_id = f.marital_status_id
    GROUP BY d1.calendar_year_name, d1.calendar_quarter_name, d2.all_products_name,
    d2.department_name, d2.category_name, d2.type_name, d3.all_customers_name,
    d3.region_name, d3.country_name, d3.state_province_name, d4.all_channels_name,
    d4.class_name, d4.channel_name, d5.all_ages_name, d5.age_name, d6.all_sizes_name,
    d6.household_size_name, d7.all_years_name, d7.years_customer_name, d8.all_incomes_name,
    d8.income_name, d9.all_status_name, d9.marital_status_name
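
Note that the text above shows only the defining query. To be usable for query rewrite, it would be wrapped in a CREATE MATERIALIZED VIEW statement along the lines of the following sketch (storage clauses omitted; the BUILD and REWRITE options shown are assumptions consistent with the description above):

    CREATE MATERIALIZED VIEW mv3
      BUILD IMMEDIATE        -- populate at creation time (takes only a few minutes)
      ENABLE QUERY REWRITE   -- allow the optimizer to rewrite matching queries to MV3
    AS
      SELECT ...;            -- the full SELECT statement shown above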

See Also

Disclosure Statement

Copyright 2015, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of October 25, 2015.

Tuesday Apr 10, 2012

World Record Oracle E-Business Suite 12.1.3 Standard Extra-Large Payroll (Batch) Benchmark on Sun Server X3-2L

Oracle's Sun Server X3-2L (formerly Sun Fire X4270 M3) server set a world record running the Oracle E-Business Suite 12.1.3 Standard Extra-Large Payroll (Batch) benchmark.

  • This is the first published result using Oracle E-Business 12.1.3.

  • The Sun Server X3-2L result ran the Extra-Large Payroll workload in 19 minutes.

Performance Landscape

This is the first published result for the Payroll Extra-Large model using Oracle E-Business 12.1.3 benchmark.

Batch Workload: Payroll Extra-Large Model
System Employees/Hr Elapsed Time
Sun Server X3-2L 789,515 19 minutes

Configuration Summary

Hardware Configuration:

Sun Server X3-2L
2 x Intel Xeon E5-2690, 2.9 GHz
128 GB memory
8 x 100 GB SSD for data
1 x 300 GB SSD for log

Software Configuration:

Oracle Linux 5.7
Oracle E-Business Suite R12 (12.1.3)
Oracle Database 11g (11.2.0.3)

Benchmark Description

The Oracle E-Business Suite Standard R12 Benchmark combines online transaction execution by simulated users with concurrent batch processing to model a typical scenario for a global enterprise. This benchmark ran one batch component, Payroll, in the Extra-Large size. The goal of the benchmark is to achieve the best batch payroll performance using the X-Large configuration.

Results can be published in four sizes, using one or more online/batch modules:

  • X-large: Maximum online users running all business flows, ranging from 10,000 to 20,000; 750,000 order to cash lines per hour and 250,000 payroll checks per hour.
    • Order to Cash Online -- 2400 users
      • The percentages across the 5 transactions in the Order Management module are:
        • Insert Manual Invoice -- 16.66%
        • Insert Order -- 32.33%
        • Order Pick Release -- 16.66%
        • Ship Confirm -- 16.66%
        • Order Summary Report -- 16.66%
    • HR Self-Service -- 4000 users
    • Customer Support Flow -- 8000 users
    • Procure to Pay -- 2000 users
  • Large: 10,000 online users; 100,000 order to cash lines per hour and 100,000 payroll checks per hour.
  • Medium: up to 3000 online users; 50,000 order to cash lines per hour and 10,000 payroll checks per hour.
  • Small: up to 1000 online users; 10,000 order to cash lines per hour and 5,000 payroll checks per hour.

See Also

Disclosure Statement

Oracle E-Business X-Large Batch-Payroll benchmark, Sun Server X3-2L, 2.90 GHz, 2 chips, 16 cores, 32 threads, 128 GB memory, elapsed time 19.0 minutes, 789,515 Employees/HR, Oracle Linux 5.7, Oracle E-Business Suite 12.1.3, Oracle Database 11g Release 2, Results as of 7/10/2012.

SPEC CPU2006 Results on Oracle's Sun x86 Servers

Oracle's new Sun x86 servers delivered world records on the SPECfp2006 and SPECint_rate2006 benchmarks for two-processor servers. This was accomplished with Oracle Solaris 11 and Oracle Solaris Studio 12.3 software.

  • The Sun Fire X4170 M3 (now known as Sun Server X3-2) server achieved a world record result for the SPECfp2006 benchmark with a score of 96.8.

  • The Sun Blade X6270 M3 server module (now known as Sun Blade X3-2B) produced best integer throughput performance for all 2-socket servers with a SPECint_rate2006 score of 705.

  • The Sun x86 servers with Intel Xeon E5-2690 2.9 GHz processors produced a cross-generational performance improvement of up to 1.8x over the previous generation Sun x86 M2 servers.

Performance Landscape

Complete benchmark results are at the SPEC website, SPEC CPU2006 Results. The tables below provide the new Oracle results, as well as select results from other vendors.

SPECint2006
System Processor c/c/t * Peak Base O/S Compiler
Fujitsu PRIMERGY BX924 S3 Intel E5-2690, 2.9 GHz 2/16/16 60.8 56.0 RHEL 6.2 Intel 12.1.2.273
Sun Fire X4170 M3 Intel E5-2690, 2.9 GHz 2/16/32 58.5 54.3 Oracle Linux 6.1 Intel 12.1.0.225
Sun Fire X4270 M2 Intel X5690, 3.47 GHz 2/12/12 46.2 43.9 Oracle Linux 5.5 Intel 12.0.1.116

SPECfp2006
System Processor c/c/t * Peak Base O/S Compiler
Sun Fire X4170 M3 Intel E5-2690, 2.9 GHz 2/16/32 96.8 86.4 Oracle Solaris 11 Studio 12.3
Sun Blade X6270 M3 Intel E5-2690, 2.9 GHz 2/16/32 96.0 85.2 Oracle Solaris 11 Studio 12.3
Sun Fire X4270 M3 Intel E5-2690, 2.9 GHz 2/16/32 95.9 85.1 Oracle Solaris 11 Studio 12.3
Fujitsu CELSIUS R920 Intel E5-2687, 2.9 GHz 2/16/16 93.8 87.6 RHEL 6.1 Intel 12.1.2.273
Sun Fire X4270 M2 Intel X5690, 3.47 GHz 2/12/24 64.2 59.2 Oracle Solaris 10 Studio 12.2

Only 2-chip server systems are listed below; workstations are excluded.

SPECint_rate2006
System Processor Base Copies c/c/t * Peak Base O/S Compiler
Sun Blade X6270 M3 Intel E5-2690, 2.9 GHz 32 2/16/32 705 632 Oracle Solaris 11 Studio 12.3
Sun Fire X4270 M3 Intel E5-2690, 2.9 GHz 32 2/16/32 705 630 Oracle Solaris 11 Studio 12.3
Sun Fire X4170 M3 Intel E5-2690, 2.9 GHz 32 2/16/32 702 628 Oracle Solaris 11 Studio 12.3
Cisco UCS C220 M3 Intel E5-2690, 2.9 GHz 32 2/16/32 697 671 RHEL 6.2 Intel 12.1.0.225
Sun Blade X6270 M2 Intel X5690, 3.47 GHz 24 2/12/24 410 386 Oracle Linux 5.5 Intel 12.0.1.116

SPECfp_rate2006
System Processor Base Copies c/c/t * Peak Base O/S Compiler
Cisco UCS C240 M3 Intel E5-2690, 2.9 GHz 32 2/16/32 510 496 RHEL 6.2 Intel 12.1.2.273
Sun Fire X4270 M3 Intel E5-2690, 2.9 GHz 64 2/16/32 497 461 Oracle Solaris 11 Studio 12.3
Sun Blade X6270 M3 Intel E5-2690, 2.9 GHz 32 2/16/32 497 460 Oracle Solaris 11 Studio 12.3
Sun Fire X4170 M3 Intel E5-2690, 2.9 GHz 64 2/16/32 495 464 Oracle Solaris 11 Studio 12.3
Sun Fire X4270 M2 Intel X5690, 3.47 GHz 24 2/12/24 273 265 Oracle Linux 5.5 Intel 12.0.1.116

* c/c/t — chips / cores / threads enabled

Configuration Summary and Results

Hardware Configuration:

Sun Fire X4170 M3 server
2 x 2.90 GHz Intel Xeon E5-2690 processors
128 GB memory (16 x 8 GB 2Rx4 PC3-12800R-11, ECC)

Sun Fire X4270 M3 server
2 x 2.90 GHz Intel Xeon E5-2690 processors
128 GB memory (16 x 8 GB 2Rx4 PC3-12800R-11, ECC)

Sun Blade X6270 M3 server module
2 x 2.90 GHz Intel Xeon E5-2690 processors
128 GB memory (16 x 8 GB 2Rx4 PC3-12800R-11, ECC)

Software Configuration:

Oracle Solaris 11 11/11 (SRU2)
Oracle Solaris Studio 12.3 (patch update 1 nightly build 120313)
Oracle Linux Server Release 6.1
Intel C++ Studio XE 12.1.0.225
SPEC CPU2006 V1.2

Benchmark Description

SPEC CPU2006 is SPEC's most popular benchmark. It measures:

  • Speed — single copy performance of chip, memory, compiler
  • Rate — multiple copy (throughput)

The benchmark is also divided into integer intensive applications and floating point intensive applications:

  • integer: 12 benchmarks derived from real applications such as perl, gcc, XML processing, and pathfinding
  • floating point: 17 benchmarks derived from real applications, including chemistry, physics, genetics, and weather.

It is also divided depending upon the amount of optimization allowed:

  • base: optimization is consistent per compiled language, all benchmarks must be compiled with the same flags per language.
  • peak: specific compiler optimization is allowed per application.

The overall metrics for the benchmark which are commonly used are:

  • SPECint_rate2006, SPECint_rate_base2006: integer, rate
  • SPECfp_rate2006, SPECfp_rate_base2006: floating point, rate
  • SPECint2006, SPECint_base2006: integer, speed
  • SPECfp2006, SPECfp_base2006: floating point, speed

See here for additional information.

See Also

Disclosure Statement

SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Results as of 10 April 2012 from www.spec.org and this report.

SPEC CPU2006 Results on Oracle's Netra Server X3-2

Oracle's Netra Server X3-2 (formerly Sun Netra X4270 M3) equipped with the new Intel Xeon processor E5-2658, is up to 2.5x faster than the previous generation Netra systems on SPEC CPU2006 workloads.

Performance Landscape

Complete benchmark results are at the SPEC website, SPEC CPU2006 results. The tables below provide the new Oracle results and previous generation results.

SPECint2006
System Processor c/c/t * Peak Base O/S Compiler
Netra Server X3-2 Intel E5-2658, 2.1 GHz 2/16/32 38.5 36.0 Oracle Linux 6.1 Intel 12.1.0.225
Sun Netra X4270 Intel L5518, 2.13 GHz 2/8/16 27.9 25.0 Oracle Linux 5.4 Intel 11.1
Sun Netra X4250 Intel L5408, 2.13 GHz 2/8/8 20.3 17.9 SLES 10 SP1 Intel 11.0

SPECfp2006
System Processor c/c/t * Peak Base O/S Compiler
Netra Server X3-2 Intel E5-2658, 2.1 GHz 2/16/32 65.3 61.6 Oracle Linux 6.1 Intel 12.1.0.225
Sun Netra X4270 Intel L5518, 2.13 GHz 2/8/16 32.5 29.4 Oracle Linux 5.4 Intel 11.1
Sun Netra X4250 Intel L5408, 2.13 GHz 2/8/8 18.5 17.7 SLES 10 SP1 Intel 11.0

SPECint_rate2006
System Processor Base Copies c/c/t * Peak Base O/S Compiler
Netra Server X3-2 Intel E5-2658, 2.1 GHz 32 2/16/32 477 455 Oracle Linux 6.1 Intel 12.1.0.225
Sun Netra X4270 Intel L5518, 2.13 GHz 16 2/8/16 201 189 Oracle Linux 5.4 Intel 11.1
Sun Netra X4250 Intel L5408, 2.13 GHz 8 2/8/8 103 82.0 SLES 10 SP1 Intel 11.0

SPECfp_rate2006
System Processor Base Copies c/c/t * Peak Base O/S Compiler
Netra Server X3-2 Intel E5-2658, 2.1 GHz 32 2/16/32 392 383 Oracle Linux 6.1 Intel 12.1.0.225
Sun Netra X4270 Intel L5518, 2.13 GHz 16 2/8/16 155 153 Oracle Linux 5.4 Intel 11.1
Sun Netra X4250 Intel L5408, 2.13 GHz 8 2/8/8 55.9 52.3 SLES 10 SP1 Intel 11.0

* c/c/t — chips / cores / threads enabled

Configuration Summary

Hardware Configuration:

Netra Server X3-2
2 x 2.10 GHz Intel Xeon E5-2658 processors
128 GB memory (16 x 8 GB 2Rx4 PC3-12800R-11, ECC)

Software Configuration:

Oracle Linux Server Release 6.1
Intel C++ Studio XE 12.1.0.225
SPEC CPU2006 V1.2

Benchmark Description

SPEC CPU2006 is SPEC's most popular benchmark. It measures:

  • Speed — single copy performance of chip, memory, compiler
  • Rate — multiple copy (throughput)

The benchmark is also divided into integer intensive applications and floating point intensive applications:

  • integer: 12 benchmarks derived from real applications such as perl, gcc, XML processing, and pathfinding
  • floating point: 17 benchmarks derived from real applications, including chemistry, physics, genetics, and weather.

It is also divided depending upon the amount of optimization allowed:

  • base: optimization is consistent per compiled language, all benchmarks must be compiled with the same flags per language.
  • peak: specific compiler optimization is allowed per application.

The overall metrics for the benchmark which are commonly used are:

  • SPECint_rate2006, SPECint_rate_base2006: integer, rate
  • SPECfp_rate2006, SPECfp_rate_base2006: floating point, rate
  • SPECint2006, SPECint_base2006: integer, speed
  • SPECfp2006, SPECfp_base2006: floating point, speed

See here for additional information.

See Also

Disclosure Statement

SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Results as of 10 July 2012 from www.spec.org and this report.

Thursday Mar 29, 2012

Sun Server X2-8 (formerly Sun Fire X4800 M2) Delivers World Record TPC-C for x86 Systems

Oracle's Sun Server X2-8 (formerly Sun Fire X4800 M2 server) equipped with eight 2.4 GHz Intel Xeon Processor E7-8870 chips obtained a result of 5,055,888 tpmC on the TPC-C benchmark. This result is a world record for x86 servers. Oracle demonstrated this world record database performance running Oracle Database 11g Release 2 Enterprise Edition with Partitioning.

  • The Sun Server X2-8 delivered a new x86 TPC-C world record of 5,055,888 tpmC with a price performance of $0.89/tpmC using Oracle Database 11g Release 2. This configuration is available 7/10/12.

  • The Sun Server X2-8 delivers 3.0 times better performance than the next 8-processor result, an IBM System p 570 equipped with POWER6 processors.

  • The Sun Server X2-8 has 3.1 times better price/performance than the 8-processor 4.7 GHz POWER6 IBM System p 570.

  • The Sun Server X2-8 has 1.6 times better performance than the 4-processor IBM x3850 X5 system equipped with Intel Xeon processors.

  • This is the first TPC-C result on any system using eight Intel Xeon Processor E7-8800 Series chips.

  • The Sun Server X2-8 is the first x86 system to get over 5 million tpmC.

  • The Oracle solution utilized Oracle Linux operating system and Oracle Database 11g Enterprise Edition Release 2 with Partitioning to produce the x86 world record TPC-C benchmark performance.

Performance Landscape

Select TPC-C results (sorted by tpmC, bigger is better)

System            p/c/t     tpmC       Price/tpmC  Avail       Database       Memory Size
Sun Server X2-8   8/80/160  5,055,888  0.89 USD    7/10/2012   Oracle 11g R2  4 TB
IBM x3850 X5      4/40/80   3,014,684  0.59 USD    7/11/2011   DB2 ESE 9.7    3 TB
IBM x3850 X5      4/32/64   2,308,099  0.60 USD    5/20/2011   DB2 ESE 9.7    1.5 TB
IBM System p 570  8/16/32   1,616,162  3.54 USD    11/21/2007  DB2 9.0        2 TB

p/c/t - processors, cores, threads
Avail - availability date

Oracle and IBM TPC-C Response times

System                  tpmC       New Order 90th%      New Order Average
                                   Response Time (sec)  Response Time (sec)
Sun Server X2-8         5,055,888  0.210                0.166
IBM x3850 X5            3,014,684  0.500                0.272
Ratios - Oracle Better  1.6x       1.4x                 1.3x

Oracle uses average new order response time for comparison between Oracle and IBM.

Graphs of Oracle's and IBM's response times for New-Order can be found in the full disclosure reports on TPC's website TPC-C Official Result Page.

Configuration Summary and Results

Hardware Configuration:

Server
Sun Server X2-8
8 x 2.4 GHz Intel Xeon Processor E7-8870
4 TB memory
8 x 300 GB 10K RPM SAS internal disks
8 x dual-port 8 Gb/s FC HBA

Data Storage
10 x Sun Fire X4270 M2 servers configured as COMSTAR heads, each with
1 x 3.06 GHz Intel Xeon X5675 processor
8 GB memory
10 x 2 TB 7.2K RPM 3.5" SAS disks
2 x Sun Storage F5100 Flash Array storage (1.92 TB each)
1 x Brocade 5300 switch

Redo Storage
2 x Sun Fire X4270 M2 servers configured as COMSTAR heads, each with
1 x 3.06 GHz Intel Xeon X5675 processor
8 GB memory
11 x 2 TB 7.2K RPM 3.5" SAS disks

Clients
8 x Sun Fire X4170 M2 servers, each with
2 x 3.06 GHz Intel Xeon X5675 processors
48 GB memory
2 x 300 GB 10K RPM SAS disks

Software Configuration:

Oracle Linux (Sun Fire X4800 M2)
Oracle Solaris 11 Express (COMSTAR for Sun Fire X4270 M2)
Oracle Solaris 10 9/10 (Sun Fire X4170 M2)
Oracle Database 11g Release 2 Enterprise Edition with Partitioning
Oracle iPlanet Web Server 7.0 U5
Tuxedo CFS-R Tier 1

Results:

System: Sun Server X2-8
tpmC: 5,055,888
Price/tpmC: 0.89 USD
Available: 7/10/2012
Database: Oracle Database 11g
Cluster: no
New Order Average Response: 0.166 seconds

Benchmark Description

TPC-C is an OLTP system benchmark. It simulates a complete environment where a population of terminal operators executes transactions against a database. The benchmark is centered around the principal activities (transactions) of an order-entry environment. These transactions include entering and delivering orders, recording payments, checking the status of orders, and monitoring the level of stock at the warehouses.

Key Points and Best Practices

  • Oracle Database 11g Release 2 Enterprise Edition with Partitioning scales easily to this high level of performance.

  • COMSTAR (Common Multiprotocol SCSI Target) is the software framework that enables an Oracle Solaris host to serve as a SCSI target platform. COMSTAR uses a modular approach to break the huge task of handling all the different pieces in a SCSI target subsystem into independent functional modules which are glued together by the SCSI Target Mode Framework (STMF). The modules implementing functionality at the SCSI level (disk, tape, medium changer, etc.) are not required to know about the underlying transport, and the modules implementing the transport protocol (FC, iSCSI, etc.) are not aware of the SCSI-level functionality of the packets they are transporting. The framework hides the details of allocation, providing execution context and cleanup of SCSI commands and associated resources, which simplifies the task of writing SCSI or transport modules.

  • Oracle iPlanet Web Server middleware is used for the client tier of the benchmark. Each web server instance supports more than a quarter-million users while satisfying the response time requirement from the TPC-C benchmark.

See Also

Disclosure Statement

TPC Benchmark C, tpmC, and TPC-C are trademarks of the Transaction Processing Performance Council (TPC). Sun Server X2-8 (8/80/160) with Oracle Database 11g Release 2 Enterprise Edition with Partitioning, 5,055,888 tpmC, $0.89 USD/tpmC, available 7/10/2012. IBM x3850 X5 (4/40/80) with DB2 ESE 9.7, 3,014,684 tpmC, $0.59 USD/tpmC, available 7/11/2011. IBM x3850 X5 (4/32/64) with DB2 ESE 9.7, 2,308,099 tpmC, $0.60 USD/tpmC, available 5/20/2011. IBM System p 570 (8/16/32) with DB2 9.0, 1,616,162 tpmC, $3.54 USD/tpmC, available 11/21/2007. Source: http://www.tpc.org/tpcc, results as of 7/15/2011.

Wednesday Nov 02, 2011

SPARC T4-2 Server Beats 2-Socket 3.46 GHz x86 on Black-Scholes Option Pricing Test

Oracle's SPARC T4-2 server (two SPARC T4 processors at 2.85 GHz) delivered 21% better performance compared to a two-socket x86 server (with two Intel X5690 3.46 GHz processors) running a Black-Scholes options pricing test on 10 million options.

  • The hyper-threads of the Intel processor did not deliver additional performance; they actually caused a 6% reduction in performance. The performance benefit of hyper-threading on Intel processors will vary depending on the workload.

  • This test shows that delivered performance is not easily predicted by processor frequency alone. It is vital that hardware and software be designed in tandem in order to deliver the best performance.

Performance Landscape

Black-Scholes options pricing, 10 million options, results in seconds, 100 iterations of the test, smaller is better.

System Time (sec)
SPARC T4-2 (2 x SPARC T4, 2.85 GHz, 128 software threads) 9.2
2-socket x86 (2 x X5690, 3.46 GHz, 12 software threads) 11.7

Advantage SPARC T4-2 21% faster

The hyper-threads of the Intel processor did not deliver additional performance, causing a reduction in performance of 6%.

Configuration Summary

SPARC Configuration:

SPARC T4-2 server
2 x SPARC T4 processor 2.85 GHz
128 GB memory
Oracle Solaris 10 8/11

Intel Configuration:

Sun Fire X4270 M2
2 x Intel Xeon X5690 3.46 GHz, Hyper-Threading and Turbo Boost active
48 GB memory
Oracle Linux 6.1

Benchmark Description

The Black-Scholes option pricing model is a financial market algorithm that uses the Black-Scholes partial differential equation (PDE) to calculate prices for European stock options. The key idea is that the value of the option fluctuates over time with the actual value of the stock. The reported time covers only the option calculations, with no I/O component. The computation is floating point intensive and requires the calculation of logarithms, exponentials, and square roots.
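
For reference, the closed-form Black-Scholes price of a European call option, the quantity evaluated for each of the 10 million options (standard formula; the notation below is ours, not taken from the benchmark code):

    C = S\,N(d_1) - K\,e^{-rT}\,N(d_2)

    d_1 = \frac{\ln(S/K) + (r + \sigma^2/2)\,T}{\sigma\sqrt{T}},
    \qquad d_2 = d_1 - \sigma\sqrt{T}

Here S is the stock price, K the strike price, r the risk-free rate, \sigma the volatility, T the time to expiration, and N(\cdot) the standard normal cumulative distribution function. The \ln, e^{x}, and \sqrt{} terms are the logarithm, exponential, and square-root computations mentioned above.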

See Also

Disclosure Statement

Copyright 2011, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 11/1/2011.

Wednesday Sep 28, 2011

SPARC T4-2 Server Beats Intel (Westmere AES-NI) on Oracle Database Tablespace Encryption Queries

Oracle's SPARC T4 processor with Encryption Instruction Accelerators greatly improves performance over software implementations, which should expand the use of TDE for many customers.

  • Oracle's SPARC T4-2 server is over 42% faster than Oracle's Sun Fire X4270 M2 (Intel AES-NI) when running DSS-style queries referencing an encrypted tablespace.

Oracle's Transparent Data Encryption (TDE) feature of the Oracle Database simplifies the encryption of data within datafiles, preventing unauthorized access to it from the operating system. Tablespace encryption allows encryption of the entire contents of a tablespace.

TDE tablespace encryption has been certified with Siebel, PeopleSoft, and Oracle E-Business Suite applications.
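
For example, a tablespace can be created with encryption enabled, after which objects stored in it are encrypted transparently. A minimal sketch, with hypothetical tablespace, file, and table names (the benchmark's actual DDL is not shown in this post):

    CREATE TABLESPACE secure_ts
      DATAFILE '/u01/oradata/orcl/secure_ts01.dbf' SIZE 25G
      ENCRYPTION USING 'AES128'      -- or 'AES192' / 'AES256'
      DEFAULT STORAGE (ENCRYPT);

    -- Tables placed in the tablespace require no application changes:
    CREATE TABLE customer_private
      ( customer_id NUMBER,
        cc_number   VARCHAR2(19) )
      TABLESPACE secure_ts;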

Performance Landscape

Total Query Time (in seconds)

System                           GHz   AES-128  AES-192  AES-256
SPARC T4-2 server                2.85  588      588      588
Sun Fire X4270 M2 (Intel X5690)  3.46  836      841      842
SPARC T4-2 Advantage                   42%      43%      43%

Configuration Summary

SPARC Configuration:

SPARC T4-2 server
2 x SPARC T4 processors, 2.85 GHz
256 GB memory
2 x Sun Storage F5100 Flash Array
Oracle Solaris 11
Oracle Database 11g Release 2

Intel Configuration:

Sun Fire X4270 M2 server
2 x Intel Xeon X5690 processors, 3.46 GHz
48 GB memory
2 x Sun Storage F5100 Flash Array
Oracle Linux 5.7
Oracle Database 11g Release 2

Benchmark Description

To test the performance of TDE, a 1 TB database was created. To demonstrate secure transactions, four 25 GB tables emulating customer private data were created: clear text, encrypted AES-128, encrypted AES-192, and encrypted AES-256. Eight queries of varying complexity that join on the customer table were executed.

The time spent scanning the customer table during each query was measured and query plans analyzed to ensure a fair comparison, e.g. no broken queries. The total query time for all queries is reported.

Key Points and Best Practices

  • Oracle Database 11g Release 2 is required for SPARC T4 processor Encryption Instruction Accelerators support with TDE tablespaces.

  • TDE tablespaces support the SPARC T4 processor Encryption Instruction Accelerators for Advanced Encryption Standard (AES) only.

  • AES-CFB is the mode used in the Oracle database with TDE.

  • Prior to using TDE tablespaces you must create a wallet and setup an encryption key. Here is one method to do that:

  • Create a wallet entry in $ORACLE_HOME/network/admin/sqlnet.ora.
    ENCRYPTION_WALLET_LOCATION=
    (SOURCE=(METHOD=FILE)(METHOD_DATA=
    (DIRECTORY=/oracle/app/oracle/product/11.2.0/dbhome_1/encryption_wallet)))
    
    Set an encryption key. This also opens the wallet.
    $ sqlplus / as sysdba
    SQL> ALTER SYSTEM SET ENCRYPTION KEY IDENTIFIED BY "tDeDem0";
    
    On subsequent instance startup open the wallet.
    $ sqlplus / as sysdba
    SQL> STARTUP;
    SQL> ALTER SYSTEM SET ENCRYPTION WALLET OPEN IDENTIFIED BY "tDeDem0";
    
  • TDE tablespace encryption and decryption occur on physical writes and reads of database blocks, respectively.

  • For parallel queries using direct path reads, decryption overhead varies inversely with the complexity of the query.

    For a simple full table scan query, overhead can be reduced and performance improved by reducing the degree of parallelism (DOP) of the query.

See Also

Disclosure Statement

Copyright 2011, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 9/26/2011.

Monday Sep 19, 2011

Halliburton ProMAX® Seismic Processing on Sun Blade X6270 M2 with Sun ZFS Storage 7320

Halliburton/Landmark's ProMAX® 3D Pre-Stack Kirchhoff Time Migration (PSTM) is evaluated for single workflow scalability and multiple workflow throughput, using various scheduling methods, on a cluster of Oracle's Sun Blade X6270 M2 server modules attached to Oracle's Sun ZFS Storage 7320 appliance.

Two resource scheduling methods, compact and distributed, are compared while increasing the system load with additional concurrent ProMAX® workflows.

  • Multiple concurrent 24-process ProMAX® PSTM workflow throughput is constant; 10 workflows on 10 nodes finish as fast as 1 workflow on one compute node. Additionally, processing twice the data volume yields similar traces/second throughput performance.

  • A single ProMAX® PSTM workflow shows good scaling from 1 to 10 nodes of a Sun Blade X6270 M2 cluster, scaling 4.5X. ProMAX® scales to 4.7X on 10 nodes with one input data set and 6.3X with two consecutive input data sets (i.e., twice the data).

  • A single ProMAX® PSTM workflow has near linear scaling of 11x on a Sun Blade X6270 M2 server module when running from 1 to 12 processes.

  • The 12-thread ProMAX® workflow throughput using the distributed scheduling method is equivalent to or slightly faster than the compact scheme for 1 to 6 concurrent workflows.

Performance Landscape

Multiple 24-Process Workflow Throughput Scaling

This test measures the system throughput scalability as concurrent 24-process workflows are added, one workflow per node. The per workflow throughput and the system scalability are reported.

Aggregate system throughput scales linearly. Ten concurrent workflows finish in the same time as does one workflow on a single compute node.

Halliburton ProMAX® Pre-Stack Time Migration - Multiple Workflow Scaling


Single Workflow Scaling

This test measures single workflow scalability across a 10-node cluster. Utilizing a single data set, performance exhibits near linear scaling of 11x at 12 processes and per-node scaling of 4x at 6 nodes; performance flattens quickly, reaching a peak of 60x at 240 processes and per-node scaling of 4.7x with 10 nodes.

Running with two consecutive input data sets in the workflow, scaling is considerably improved, with peak scaling ~35% higher than obtained using a single data set. Doubling the data set size minimizes the fraction of time spent in workflow initialization, data input, and output.

Halliburton ProMAX® Pre-Stack Time Migration - Single Workflow Scaling

This next test measures single workflow scalability across a 10-node cluster (as above) but limits scheduling to a maximum of 12 processes per node, effectively restricting it to one process per physical core. The speedup relative to a single process and to a single node is reported.

Utilizing a single data set, performance exhibits near linear scaling of 37x at 48 processes and per-node scaling of 4.3x at 6 nodes. Performance reaches 55x at 120 processes with per-node scaling of 5x at 10 nodes, and scalability trends higher more strongly than in the case of two processes per physical core above. For equivalent total process counts, multi-node runs using only a single process per physical core run between 28% and 64% more efficiently (at 96 and 24 processes, respectively). With a full complement of 10 nodes (120 processes), the peak performance is only 9.5% lower than with 2 processes per vcpu (240 processes).

Running with two consecutive input data sets in the workflow, scaling is considerably improved with peak scaling ~35% higher than obtained using a single data set.

Halliburton ProMAX® Pre-Stack Time Migration - Single Workflow Scaling

Multiple 12-Process Workflow Throughput Scaling, Compact vs. Distributed Scheduling

The fourth test compares compact and distributed scheduling of 1, 2, 4, and 6 concurrent 12-processor workflows.

All things being equal, the system bi-section bandwidth should improve with distributed scheduling of a fixed-size workflow; as more nodes are used for a workflow, more memory and system cache is employed and any node memory bandwidth bottlenecks can be offset by distributing communication across the network (provided the network and inter-node communication stack do not become a bottleneck). When physical cores are not over-subscribed, compact and distributed scheduling performance is within 3% suggesting that there may be little memory contention for this workflow on the benchmarked system configuration.

With compact scheduling of two concurrent 12-processor workflows, the physical cores become over-subscribed and performance degrades 36% per workflow. With four concurrent workflows, physical cores are oversubscribed 4x and performance is seen to degrade 66% per workflow. With six concurrent workflows over-subscribed compact scheduling performance degrades 77% per workflow. As multiple 12-processor workflows become more and more distributed, the performance approaches the non over-subscribed case.

Halliburton ProMAX® Pre-Stack Time Migration - Multiple Workflow Scaling

141616 traces x 624 samples


Test Notes

All tests were performed with one input data set (70808 traces x 624 samples) and two consecutive input data sets (2 * (70808 traces x 624 samples)) in the workflow. All results reported are the average of at least 3 runs and performance is based on reported total wall-clock time by the application.

All tests were run with the NFS-attached Sun ZFS Storage 7320 appliance and then with an NFS-attached legacy Sun Fire X4500 server. The StorageTek Workload Analysis Tool (SWAT) was invoked to measure the I/O characteristics of the NFS-attached storage used on separate runs of all workflows.

Configuration Summary

Hardware Configuration:

10 x Sun Blade X6270 M2 server modules, each with
2 x 3.33 GHz Intel Xeon X5680 processors
48 GB DDR3-1333 memory
4 x 146 GB, Internal 10000 RPM SAS-2 HDD
10 GbE
Hyper-Threading enabled

Sun ZFS Storage 7320 Appliance
1 x Storage Controller
2 x 2.4 GHz Intel Xeon 5620 processors
48 GB memory (12 x 4 GB DDR3-1333)
2 TB Read Cache (4 x 512 GB Read Flash Accelerator)
10 GbE
1 x Disk Shelf
20.0 TB RAID-Z (20 x 1 TB SAS-2, 7200 RPM HDD)
4 x Write Flash Accelerators

Sun Fire X4500
2 x 2.8 GHz AMD Opteron 290 processors
16 GB DDR1-400 memory
34.5 TB RAID-Z (46 x 750 GB SATA-II, 7200 RPM HDD)
10 GbE

Software Configuration:

Oracle Linux 5.5
Parallel Virtual Machine 3.3.11 (bundled with ProMAX)
Intel 11.1.038 Compilers
Libraries: pthreads 2.4, Java 1.6.0_01, BLAS, Stanford Exploration Project Libraries

Benchmark Description

The ProMAX® family of seismic data processing tools is the most widely used Oil and Gas Industry seismic processing application. ProMAX® is used for multiple applications, from field processing and quality control, to interpretive project-oriented reprocessing at oil companies and production processing at service companies. ProMAX® is integrated with Halliburton's OpenWorks® Geoscience Oracle Database to index prestack seismic data and populate the database with processed seismic.

This benchmark evaluates single workflow scalability and multiple workflow throughput of the ProMAX® 3D Prestack Kirchhoff Time Migration (PSTM) while processing the Halliburton benchmark data set containing 70,808 traces with 8 msec sample interval and trace length of 4992 msec. Benchmarks were performed with both one and two consecutive input data sets.

Each workflow consisted of:

  • reading the previously constructed MPEG encoded processing parameter file
  • reading the compressed seismic data traces from disk
  • performing the PSTM imaging
  • writing the result to disk

Workflows using two input data sets were constructed by simply adding a second identical seismic data read task immediately after the first in the processing parameter file. This effectively doubled the data volume read, processed, and written.

This version of ProMAX® currently uses only Parallel Virtual Machine (PVM) as the parallel processing paradigm. The PVM software uses only TCP networking and has no internal facility for assigning memory affinity and processor binding. Every compute node runs a PVM daemon.

The ProMAX® processing parameters used for this benchmark:

Minimum output inline = 65
Maximum output inline = 85
Inline output sampling interval = 1
Minimum output xline = 1
Maximum output xline = 200 (fold)
Xline output sampling interval = 1
Antialias inline spacing = 15
Antialias xline spacing = 15
Stretch Mute Aperture Limit with Maximum Stretch = 15
Image Gather Type = Full Offset Image Traces
No Block Moveout
Number of Alias Bands = 10
3D Amplitude Phase Correction
No compression
Maximum Number of Cache Blocks = 500000

Primary PSTM business metrics are typically time-to-solution and accuracy of the subsurface imaging solution.

Key Points and Best Practices

  • Multiple job system throughput scales perfectly; ten concurrent workflows, one per node on 10 nodes, complete in the same time and have the same throughput as a single workflow running on one node.
  • Best single workflow scaling is 6.6x using 10 nodes.

    When tasked with processing several similar workflows, while individual time-to-solution will be longer, the most efficient way to run is to fully distribute them one workflow per node (or even across two nodes) and run these concurrently, rather than to use all nodes for each workflow and running consecutively. For example, while the best-case configuration used here will run 6.6 times faster using all ten nodes compared to a single node, ten such 10-node jobs running consecutively will overall take over 50% longer to complete than ten jobs one per node running concurrently.

  • Throughput was seen to scale better with larger workflows. While throughput with both large and small workflows is similar with only one node, the larger dataset exhibits 11% and 35% more throughput with four and 10 nodes, respectively.

  • 200 processes appears to be a scalability asymptote with these workflows on the systems used.
  • Hyperthreading marginally helps throughput. For the largest model run on 10 nodes, 240 processes delivers 11% more performance than with 120 processes.

  • The workflows do not exhibit significant I/O bandwidth demands. Even with 10 concurrent 24-process jobs, the measured aggregate system I/O did not exceed 100 MB/s.

  • 10 GbE was the only network used and, though shared for all interprocess communication and network attached storage, it appears to have sufficient bandwidth for all test cases run.

See Also

Disclosure Statement

The following are trademarks or registered trademarks of Halliburton/Landmark Graphics: ProMAX®, GeoProbe®, OpenWorks®. Results as of 9/1/2011.

Wednesday Dec 08, 2010

Sun Blade X6275 M2 Delivers Best Fluent (MCAE Application) Performance on Tested Configurations

This Manufacturing Engineering benchmark highlights the performance advantage the Sun Blade X6275 M2 server module offers over IBM, Cray, and SGI solutions as shown by the ANSYS FLUENT fluid dynamics application.

A cluster of eight of Oracle's Sun Blade X6275 M2 server modules delivered outstanding performance running the FLUENT 12 benchmark test suite.

  • The Sun Blade X6275 M2 server module cluster delivered the best results in all 36 of the test configurations run, outperforming the best posted results by as much as 42%.
  • The Sun Blade X6275 M2 server module demonstrated up to 76% performance improvement over the previous generation Sun Blade X6275 server module.

Performance Landscape

In the following tables, results are "Ratings" (bigger is better).
Rating = No. of sequential runs of test case possible in 1 day: 86,400/(Total Elapsed Run Time in Seconds)
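
For example, the 96-core Sun Blade X6275 M2 rating of 9340.5 on the eddy 417k case below corresponds to a total elapsed run time of about 86,400 / 9340.5 ≈ 9.25 seconds per run.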

The following table compares results on the basis of core count, irrespective of processor generation. This means that in some cases, i.e., for the 32-core and 64-core configurations, systems with the Intel Xeon X5670 six-core processors did not utilize quite all of the cores available for the specified processor count.


FLUENT 12 Benchmark Test Suite

Competitive Comparisons

                                       Benchmark Test Case Ratings
System              Processors  Cores  eddy    turbo    aircraft  sedan   truck  truck_poly
                                       417k    500k     2m        4m      14m    14m
Sun Blade X6275 M2  16          96     9340.5  39272.7  8307.7    8533.3  903.8  786.9
Best Posted         24          96     -       -        -         7562.4  797.0  712.9
Best Posted         16          96     7337.6  33553.4  6533.1    5989.6  739.1  683.5
Sun Blade X6275 M2  11          64     6306.6  27212.6  5592.2    5158.2  568.8  518.9
Best Posted         16          64     5556.3  26381.7  5494.4    4902.1  566.6  518.6
Sun Blade X6275 M2  8           48     4620.3  19093.9  4080.3    3251.2  376.0  359.4
Best Posted         8           48     4494.1  18989.0  3990.8    3185.3  372.7  354.5
Sun Blade X6275 M2  6           32     4061.1  15091.7  3275.8    3013.1  299.5  267.8
Best Posted         8           32     3404.9  14832.6  3211.9    2630.1  286.7  266.7
Sun Blade X6275 M2  4           24     2751.6  10441.1  2161.4    1907.3  188.2  182.5
Best Posted         6           24     1458.2  9626.7   1820.9    1747.2  185.1  180.8
Best Posted         4           24     2565.7  10164.7  2109.9    1608.2  187.1  180.8
Sun Blade X6275 M2  2           12     1429.9  5358.1   1097.5    813.2   95.9   95.9
Best Posted         2           12     1338.0  5308.8   1073.3    808.6   92.9   94.4

(- indicates no result posted for that test case)



The following table compares results on the basis of processor count showing inter-generational processor performance improvement.


FLUENT 12 Benchmark Test Suite

Intergenerational Comparisons

                                       Benchmark Test Case Ratings
System              Processors  Cores  eddy    turbo    aircraft  sedan   truck  truck_poly
                                       417k    500k     2m        4m      14m    14m
Sun Blade X6275 M2  16          96     9340.5  39272.7  8307.7    8533.3  903.8  786.9
Sun Blade X6275     16          64     5308.8  26790.7  5574.2    5074.9  547.2  525.2
X6275 M2 : X6275    16                 1.76    1.47     1.49      1.68    1.65   1.50

Sun Blade X6275 M2  8           48     4620.3  19093.9  4080.3    3251.2  376.0  359.4
Sun Blade X6275     8           32     3066.5  13768.9  3066.5    2602.4  289.0  270.3
X6275 M2 : X6275    8                  1.51    1.39     1.33      1.25    1.30   1.33

Sun Blade X6275 M2  4           24     2751.6  10441.1  2161.4    1907.3  188.2  182.5
Sun Blade X6275     4           16     1714.3  7545.9   1519.1    1345.8  144.4  141.8
X6275 M2 : X6275    4                  1.61    1.38     1.42      1.42    1.30   1.29

Sun Blade X6275 M2  2           12     1429.9  5358.1   1097.5    813.2   95.9   95.9
Sun Blade X6275     2           8      931.8   4061.1   827.2     681.5   73.0   73.8
X6275 M2 : X6275    2                  1.53    1.32     1.33      1.19    1.31   1.30

Configuration Summary

Hardware Configuration:

8 x Sun Blade X6275 M2 server modules, each with
4 x Intel Xeon X5670 2.93 GHz processors, turbo enabled
96 GB memory 1333 MHz
2 x 24 GB SATA-based Sun Flash Modules
2 x QDR InfiniBand Host Channel Adapter
Sun Datacenter InfiniBand Switch IB-36

Software Configuration:

Oracle Enterprise Linux 5.5
ANSYS FLUENT V12.1.2
ANSYS FLUENT Benchmark Test Suite

Benchmark Description

The following description is from the ANSYS FLUENT website:

The FLUENT benchmark suite comprises a set of test cases covering a large range of mesh sizes, physical models, and solvers, representing typical industry usage. The cases range in size from a few hundred thousand cells to more than 100 million cells. Both the segregated and coupled implicit solvers are included, as well as hexahedral, mixed and polyhedral cell cases. This broad coverage is expected to demonstrate the breadth of FLUENT performance on a variety of hardware platforms and test cases.

The performance of a CFD code will depend on several factors, including size and topology of the mesh, physical models, numerics and parallelization, compilers and optimization, in addition to performance characteristics of the hardware where the simulation is performed. The principal objective of this benchmark suite is to provide comprehensive and fair comparative information of the performance of FLUENT on available hardware platforms.

About the ANSYS FLUENT 12 Benchmark Test Suite

    CFD models tend to be very large where grid refinement is required to capture with accuracy conditions in the boundary layer region adjacent to the body over which flow is occurring. Fine grids are required to also determine accurate turbulence conditions. As such these models can run for many hours or even days as well using a large number of processors.

Key Points and Best Practices

  • ANSYS FLUENT has not yet been certified by the vendor on Oracle Enterprise Linux (OEL). However, the ANSYS FLUENT benchmark tests have been run successfully on Oracle hardware running OEL as is (i.e. with NO changes or modifications).
  • The performance improvement of the Sun Blade X6275 M2 server module over the previous generation Sun Blade X6275 server module was due to two main factors: the increased core count per processor (6 vs. 4), and the more optimal, iterative dataset partitioning scheme used for the Sun Blade X6275 M2 server module.

See Also

Disclosure Statement

All information on the FLUENT website (http://www.fluent.com) is Copyrighted 1995-2010 by ANSYS Inc. Results as of December 06, 2010.

Tuesday Oct 26, 2010

3D VTI Reverse Time Migration Scalability On Sun Fire X2270-M2 Cluster with Sun Storage 7210

This Oil & Gas benchmark shows the Sun Storage 7210 system delivers almost 2 GB/sec bandwidth and realizes near-linear scaling performance on a cluster of 16 Sun Fire X2270 M2 servers.

Oracle's Sun Storage 7210 system attached via QDR InfiniBand to a cluster of sixteen of Oracle's Sun Fire X2270 M2 servers was used to demonstrate the performance of a Reverse Time Migration application, an important application in the Oil & Gas industry. The total application throughput and computational kernel scaling are presented for two production sized grids of 800 samples.

  • Both the Reverse Time Migration I/O and combined computation show near-linear scaling from 8 to 16 nodes on the Sun Storage 7210 system connected via QDR InfiniBand to a Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231: 2.0x improvement
      2486 x 1151 x 1231: 1.7x improvement
  • The computational kernel of the Reverse Time Migration has linear to super-linear scaling from 8 to 16 nodes in Oracle's Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231 : 2.2x improvement
      2486 x 1151 x 1231 : 2.0x improvement
  • Intel Hyper-Threading provides additional performance benefits to both the Reverse Time Migration I/O and computation when going from 12 to 24 OpenMP threads on the Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231: 8% - computational kernel; 2% - total application throughput
      2486 x 1151 x 1231: 12% - computational kernel; 6% - total application throughput
  • The Sun Storage 7210 system delivers the Velocity, Epsilon, and Delta data to the Reverse Time Migration at a steady rate even when timing includes memory initialization and data object creation:

      1243 x 1151 x 1231: 1.4 to 1.6 GBytes/sec
      2486 x 1151 x 1231: 1.2 to 1.3 GBytes/sec

    One can see that when doubling the size of the problem, the additional complexity of overlapping I/O and multiple node file contention only produces a small reduction in read performance.

Performance Landscape

Application Scaling

Performance and scaling results of the total application, including I/O, for the reverse time migration demonstration application are presented. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server. The servers were running with hyperthreading enabled, allowing for 24 OpenMP threads per server.

Application Scaling Across Multiple Nodes

                Grid Size - 1243 x 1151 x 1231        Grid Size - 2486 x 1151 x 1231
Number  Total     Kernel    Total    Kernel           Total     Kernel    Total    Kernel
Nodes   Time (s)  Time (s)  Speedup  Speedup          Time (s)  Time (s)  Speedup  Speedup
16      504       259       2.0      2.2*             1024      551       1.7      2.0
14      565       279       1.8      2.0              1191      677       1.5      1.6
12      662       343       1.6      1.6              1426      817       1.2      1.4
10      784       394       1.3      1.4              1501      856       1.2      1.3
8       1024      560       1.0      1.0              1745      1108      1.0      1.0

* Super-linear scaling due to the compute kernel fitting better into available cache

Application Scaling – Hyper-Threading Study

The effects of hyperthreading are presented when running the reverse time migration demonstration application. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server.

Hyper-Threading Comparison – 12 versus 24 OpenMP Threads

                          Grid Size - 1243 x 1151 x 1231           Grid Size - 2486 x 1151 x 1231
Number  Threads   Total     Kernel    Total HT  Kernel HT          Total     Kernel    Total HT  Kernel HT
Nodes   per Node  Time (s)  Time (s)  Speedup   Speedup            Time (s)  Time (s)  Speedup   Speedup
16      24        504       259       1.02      1.08               1024      551       1.06      1.12
16      12        515       279       1.00      1.00               1088      616       1.00      1.00

Read Performance

Read performance is presented for the velocity, epsilon and delta files running the reverse time migration demonstration application. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server. The servers were running with hyperthreading enabled, allowing for 24 OpenMP threads per server.

Velocity, Epsilon, and Delta File Read and Memory Initialization Performance

                     Grid Size - 1243 x 1151 x 1231              Grid Size - 2486 x 1151 x 1231
Number  Overlap      Time   Time Rel.  GBytes  Read Rate         Time   Time Rel.  GBytes  Read Rate
Nodes   MBytes Read  (sec)  to 8-node  Read    GB/s              (sec)  to 8-node  Read    GB/s
16      2040         16.7   1.1        23.2    1.4               36.8   1.1        44.3    1.2
8       951          14.8   1.0        22.1    1.6               33.0   1.0        43.2    1.3

Configuration Summary

Hardware Configuration:

16 x Sun Fire X2270 M2 servers, each with
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory (12 x 4 GB at 1333 MHz)

Sun Storage 7210 system connected via QDR InfiniBand
2 x 18 GB SATA SSD (logzilla)
40 x 1 TB 7200 RPM SATA disk

Software Configuration:

SUSE Linux Enterprise Server SLES 10 SP 2
Oracle Message Passing Toolkit 8.2.1 (for MPI)
Sun Studio 12 Update 1 C++, Fortran, OpenMP

Benchmark Description

This Reverse Time Migration (RTM) demonstration application measures the total time it takes to image 800 samples of various production size grids and write the final image to disk. In this version, each node reads in only the trace, velocity, and conditioning data to be processed by that node plus a four element inline 3-D array pad (spatial order of eight) shared with its neighbors to the left and right during the initialization phase. It represents a full RTM application including the data input, computation, communication, and final output image to be used by the next work flow step involving 3D volumetric seismic interpretation.

Key Points and Best Practices

This demonstration application represents a full Reverse Time Migration solution. Many references to the RTM application tend to focus on the compute kernel and ignore the complexity that the input, communication, and output bring to the task.

I/O Characterization without Optimal Checkpointing

Velocity, Epsilon, and Delta Files - Grid Reading

The additional amount of overlapping reads to share velocity, epsilon, and delta edge data with neighbors can be calculated using the following equation:

    (number_nodes - 1) x (order_in_space) x (y_dimension) x (z_dimension) x (4 bytes) x (3 files)

For this particular benchmark study, the additional 3-D pad overlap for the 16 and 8 node cases is:

    16 nodes: 15 x 8 x 1151 x 1231 x 4 x 3 = 2.04 GB extra
    8 nodes: 7 x 8 x 1151 x 1231 x 4 x 3 = 0.95 GB extra

For the first of the two test cases, the total size of the three files used for the 1243 x 1151 x 1231 case is

    1243 x 1151 x 1231 x 4 bytes = 7.05 GB per file x 3 files = 21.13 GB

With the additional 3-D pad, the total amount of data read is:

    16 nodes: 2.04 GB + 21.13 GB = 23.2 GB
    8 nodes: 0.95 GB + 21.13 GB = 22.1 GB

For the second of the two test cases, the total size of the three files used for the 2486 x 1151 x 1231 case is

    2486 x 1151 x 1231 x 4 bytes = 14.09 GB per file x 3 files = 42.27 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: 2.04 GB + 42.27 GB = 44.3 GB
    8 nodes: 0.95 GB + 42.27 GB = 43.2 GB

Note that the amount of overlapping data read increases, not only by the number of nodes, but as the y dimension and/or the z dimension increases.

Trace Reading

The additional amount of overlapping reads to share trace edge data with neighbors can be calculated using the following equation:

    (number_nodes - 1) x (order_in_space) x (y_dimension) x (4 bytes) x (number_of_time_slices)

For this particular benchmark study, the additional overlap for the 16 and 8 node cases is:

    16 nodes: 15 x 8 x 1151 x 4 x 800 = 442MB extra
    8 nodes: 7 x 8 x 1151 x 4 x 800 = 206MB extra

For the first case the size of the trace data file used for the 1243 x 1151 x 1231 case is

    1243 x 1151 x 4 bytes x 800 = 4.578 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: .442 GB + 4.578 GB = 5.0 GB
    8 nodes: .206 GB + 4.578 GB = 4.8 GB

For the second case the size of the trace data file used for the 2486 x 1151 x 1231 case is

    2486 x 1151 x 4 bytes x 800 = 9.156 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: .442 GB + 9.156 GB = 9.6 GB
    8 nodes: .206 GB + 9.156 GB = 9.4 GB

As the number of nodes is increased, the overlap causes more disk lock contention.

Writing Final Output Image

1243x1151x1231 - 7.1 GB per file:

    16 nodes: 78 x 1151 x 1231 x 4 = 442MB/node (7.1 GB total)
    8 nodes: 156 x 1151 x 1231 x 4 = 884MB/node (7.1 GB total)

2486x1151x1231 - 14.1 GB per file:

    16 nodes: 156 x 1151 x 1231 x 4 = 930 MB/node (14.1 GB total)
    8 nodes: 311 x 1151 x 1231 x 4 = 1808 MB/node (14.1 GB total)

Resource Allocation

It is best to allocate one node as the Oracle Grid Engine resource scheduler and MPI master host. This is especially true when running with 24 OpenMP threads in hyperthreading mode to avoid oversubscribing a node that is cooperating in delivering the solution.

See Also

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 10/20/2010.

Tuesday Sep 21, 2010

ProMAX Performance and Throughput on Sun Fire X2270 and Sun Storage 7410

Halliburton/Landmark's ProMAX 3D Prestack Kirchhoff Time Migration is evaluated for single job scalability and multiple job throughput, using various scheduling methods, on a cluster of Oracle's Sun Fire X2270 servers attached via QDR InfiniBand to Oracle's Sun Storage 7410 system.

Two resource scheduling methods, compact and distributed, are compared while increasing the system load with additional concurrent ProMAX jobs.

  • A single ProMAX job has near linear scaling of 5.5x on 6 nodes of a Sun Fire X2270 cluster.

  • A single ProMAX job has near linear scaling of 7.5x on a Sun Fire X2270 server when running from 1 to 8 threads.

  • ProMAX can take advantage of Oracle's Sun Storage 7410 system features as compared to dedicated local disks. There was no significant difference in run time observed when running up to 8 concurrent 16-thread jobs.

  • The 8-thread ProMAX job throughput using the distributed scheduling method is equivalent to or slightly faster than the compact scheme for 1 to 4 concurrent jobs.

  • The 16-thread ProMAX job throughput using the distributed scheduling method is up to 8% faster when compared to the compact scheme on an 8-node Sun Fire X2270 cluster.

The multiple job throughput characterization revealed in this benchmark study is key to pre-configuring Oracle Grid Engine resource scheduling for ProMAX on a Sun Fire X2270 cluster and provides valuable insight for server consolidation.

Performance Landscape

Single Job Scaling

Single job performance on a single node is near linear up to the number of cores in the node, i.e., 2 Intel Xeon X5570s with 4 cores each. With hyperthreading (2 active threads per core) enabled, more ProMAX threads are used, increasing the load on the CPU's memory architecture and causing the reduced speedups.
ProMAX single job performance on the 6-node cluster shows near linear speedup node to node.
Single Job 6-Node Scalability
Hyperthreading Enabled - 16 Threads/Node Maximum
Number of Nodes Threads Per Node Speedup to 1 Thread Speedup to 1 Node
6 16 54.2 5.5
4 16 36.2 3.6
3 16 26.1 2.6
2 16 17.6 1.8
1 16 10.0 1.0
1 14 9.2
1 12 8.6
1 10 7.2*
1 8 7.5
1 6 5.9
1 4 3.9
1 3 3.0
1 2 2.0
1 1 1.0

* 2 threads contend with two master node daemons

Multiple Job Throughput Scaling, Compact Scheduling

With the Sun Storage 7410 system, performance of 8 concurrent jobs on the cluster using compact scheduling is equivalent to a single job.

Multiple Job Throughput Scalability
Hyperthreading Enabled - 16 Threads/Node Maximum
Number of Jobs Number of Nodes per Job Threads Per Node per Job Performance Relative to 1 Job Total Nodes Percent Cluster Used
1 1 16 1.00 1 13
2 1 16 1.00 2 25
4 1 16 1.00 4 50
8 1 16 1.00 8 100

Multiple 8-Thread Job Throughput Scaling, Compact vs. Distributed Scheduling

These results compare several distributed-method resource scheduling levels against 1, 2, and 4 concurrent-job compact-method baselines.

Multiple 8-Thread Job Scheduling
HyperThreading Enabled - Use 8 Threads/Node Maximum
Number of Jobs Number of Nodes per Job Threads Per Node per Job Performance Relative to 1 Job Total Nodes Total Threads per Node Used Percent of PVM Master 8 Threads Used
1 1 8 1.00 1 8 100
1 4 2 1.01 4 2 25
1 8 1 1.01 8 1 13

2 1 8 1.00 2 8 100
2 4 2 1.01 4 4 50
2 8 1 1.01 8 2 25

4 1 8 1.00 4 8 100
4 4 2 1.00 4 8 100
4 8 1 1.01 8 4 100

Multiple 16-Thread Job Throughput Scaling, Compact vs. Distributed Scheduling

The results are reported relative to the performance of 1, 2, 4, and 8 concurrent 2-node, 8-thread jobs.

Multiple 16-Thread Job Scheduling
HyperThreading Enabled - 16 Threads/Node Available
Number of Jobs Number of Nodes per Job Threads Per Node per Job Performance Relative to 1 Job Total Nodes Total Threads per Node Used Percent of PVM Master 16 Threads Used
1 1 16 0.66 1 16 100\*
1 2 8 1.00 2 8 50
1 4 4 1.03 4 4 25
1 8 2 1.06 8 2 13

2 1 16 0.70 2 16 100\*
2 2 8 1.00 4 8 50
2 4 4 1.07 8 4 25
2 8 2 1.08 8 4 25

4 1 16 0.74 4 16 100\*
4 4 4 0.74 4 16 100\*
4 2 8 1.00 8 8 50
4 4 4 1.05 8 8 50
4 8 2 1.04 8 8 50

8 1 16 1.00 8 16 100\*
8 4 4 1.00 8 16 100\*
8 8 2 1.00 8 16 100\*

\* master PVM host; running 20 to 21 total threads (over-subscribed)

Results and Configuration Summary

Hardware Configuration:

8 x Sun Fire X2270 servers, each with
2 x 2.93 GHz Intel Xeon X5570 processors
48 GB memory at 1333 MHz
1 x 500 GB SATA
Sun Storage 7410 system
4 x 2.3 GHz AMD Opteron 8356 processors
128 GB memory
2 Internal 233 GB SAS drives = 466 GB
2 Internal 93 GB read optimized SSD = 186 GB
1 External Sun Storage J4400 array with 22 x 1 TB SATA drives and 2 x 18 GB write optimized SSD
11 TB mirrored data and mirrored write optimized SSD

Software Configuration:

SUSE Linux Enterprise Server 10 SP 2
Parallel Virtual Machine 3.3.11
Oracle Grid Engine
Intel 11.1 Compilers
OpenWorks Database requires Oracle 10g Enterprise Edition
Libraries: pthreads 2.4, Java 1.6.0_01, BLAS, Stanford Exploration Project Libraries

Benchmark Description

The ProMAX family of seismic data processing tools is the most widely used Oil and Gas Industry seismic processing application. ProMAX is used for multiple applications, from field processing and quality control, to interpretive project-oriented reprocessing at oil companies and production processing at service companies. ProMAX is integrated with Halliburton's OpenWorks Geoscience Oracle Database to index prestack seismic data and populate the database with processed seismic data.

This benchmark evaluates single job scalability and multiple job throughput of the ProMAX 3D Prestack Kirchhoff Time Migration while processing the Halliburton benchmark data set containing 70,808 traces with an 8 msec sample interval and a trace length of 4992 msec. Alternative thread scheduling methods are compared for optimizing single and multiple job throughput. The compact scheme schedules the threads of a single job on as few nodes as possible, whereas the distributed scheme schedules the threads across as many nodes as possible. The effects of load on the Sun Storage 7410 system are measured. This information provides valuable insight into determining the Oracle Grid Engine resource management policies.

Hyperthreading is enabled for all of the tests. It should be noted that every node is running a PVM daemon and ProMAX license server daemon. On the master PVM daemon node, there are three additional ProMAX daemons running.

The first test measures single job scalability across a 6-node cluster with an additional node serving as the master PVM host. The speedup relative to a single node, single thread is reported.

The second test measures multiple job scalability running 1 to 8 concurrent 16-thread jobs using the Sun Storage 7410 system. The performance is reported relative to a single job.

The third test compares 8-thread multiple job throughput using different job scheduling methods on a cluster. The compact method involves putting all 8 threads for a job on the same node. The distributed method involves spreading the 8 threads of a job across multiple nodes. The results compare several distributed-method resource scheduling levels against 1, 2, and 4 concurrent-job compact-method baselines. A toy sketch of the two policies follows.
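
A minimal sketch (illustrative only; not Oracle Grid Engine or PVM code) of how the two policies place the threads of 8-thread jobs on an 8-node cluster:

    /* Toy model of the two scheduling policies compared in this study:
       compact     - all threads of a job are packed onto one node
       distributed - the threads of a job are spread one per node */
    #include <stdio.h>

    #define NODES 8

    static void schedule(int jobs, int threads_per_job, int distributed)
    {
        int load[NODES] = {0};

        for (int j = 0; j < jobs; j++)
            for (int t = 0; t < threads_per_job; t++) {
                int node = distributed ? t % NODES   /* spread across nodes */
                                       : j % NODES;  /* pack onto one node  */
                load[node]++;
            }

        printf("%-11s %d jobs x %d threads -> threads/node:",
               distributed ? "distributed" : "compact", jobs, threads_per_job);
        for (int n = 0; n < NODES; n++)
            printf(" %d", load[n]);
        printf("\n");
    }

    int main(void)
    {
        schedule(4, 8, 0);  /* compact:     4 nodes carry 8 threads each, 4 idle */
        schedule(4, 8, 1);  /* distributed: all 8 nodes carry 4 threads each     */
        return 0;
    }

This mirrors the 4-job rows in the table above: compact loads 4 nodes with 8 threads each, while distributed places 1 thread of each job on every node, 4 threads per node.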

The fourth test is similar to the second test except running 16-thread ProMAX jobs. The results are reported relative to the performance of 1, 2, 4, and 8 concurrent 2-node, 8-thread jobs.

The ProMAX processing parameters used for this benchmark:

Minimum output inline = 65
Maximum output inline = 85
Inline output sampling interval = 1
Minimum output xline = 1
Maximum output xline = 200 (fold)
Xline output sampling interval = 1
Antialias inline spacing = 15
Antialias xline spacing = 15
Stretch Mute Aperture Limit with Maximum Stretch = 15
Image Gather Type = Full Offset Image Traces
No Block Moveout
Number of Alias Bands = 10
3D Amplitude Phase Correction
No compression
Maximum Number of Cache Blocks = 500000

Key Points and Best Practices

  • The application was rebuilt with the Intel 11.1 Fortran and C++ compilers with these flags.

    -xSSE4.2 -O3 -ipo -no-prec-div -static -m64 -ftz -fast-transcendentals -fp-speculation=fast
  • There are additional execution threads associated with a ProMAX node. Two threads run on each node: the license server and the PVM daemon. At least three additional daemon threads run on the PVM master server: the ProMAX interface GUI, the ProMAX job execution (SuperExec), and the PVM console and control. It is best to allocate one node as the master PVM server to handle these additional 5+ threads. Otherwise, with hyperthreading enabled, the master PVM host can still support up to 8 ProMAX job threads.

  • When hyperthreading is enabled on one of the non-master PVM hosts, there is a 7% penalty going from 8 to 10 threads; however, 12 threads are 11% faster than 8. This can be attributed to the two additional support threads that start when hyperthreading initiates.

  • Single job performance on a single node is near linear up to the number of cores in the node, i.e. 2 Intel Xeon X5570 processors with 4 cores each. With hyperthreading (2 active threads per core) enabled, more ProMAX threads are used, increasing the load on the CPU's memory architecture and reducing the incremental speedups.

    Users need to be aware of these performance differences and how they affect their production environment.

See Also

Disclosure Statement

The following are trademarks or registered trademarks of Halliburton/Landmark Graphics: ProMAX. Results as of 9/20/2010.

Monday Sep 20, 2010

Schlumberger's ECLIPSE 300 Performance Throughput On Sun Fire X2270 Cluster with Sun Storage 7410

Oracle's Sun Storage 7410 system, attached via QDR InfiniBand to a cluster of eight of Oracle's Sun Fire X2270 servers, was used to evaluate multiple job throughput of Schlumberger's Linux-64 ECLIPSE 300 compositional reservoir simulator processing their standard 2 Million Cell benchmark model with 8 rank parallelism (MM8 job).

  • The Sun Storage 7410 system showed little difference in performance (2%) compared to running the MM8 job with dedicated local disk.

  • When running 8 concurrent jobs on 8 different nodes, all to the Sun Storage 7410 system, the performance saw little degradation (5%) compared to a single MM8 job running on dedicated local disk.

Experiments were run changing how the cluster was utilized in scheduling jobs. Rather than running with the default compact mode, tests were run distributing the single job among the various nodes. Performance improvements were measured when changing from the default compact scheduling scheme (1 job to 1 node) to a distributed scheduling scheme (1 job to multiple nodes).

  • When running at 75% of the cluster capacity, distributed scheduling outperformed the compact scheduling by up to 34%. Even when running at 100% of the cluster capacity, the distributed scheduling is still slightly faster than compact scheduling.

  • When combining workloads, using the distributed scheduling allowed two MM8 jobs to finish 19% faster than the reference time and a concurrent PSTM workload to finish 2% faster.

The Oracle Solaris Studio Performance Analyzer and Sun Storage 7410 system analytics were used to identify a 3D Prestack Kirchhoff Time Migration (PSTM) as a potential candidate for consolidating with ECLIPSE. Both scheduling schemes are compared while running various job mixes of these two applications using the Sun Storage 7410 system for I/O.

These experiments showed a potential opportunity for consolidating applications using Oracle Grid Engine resource scheduling and Oracle Virtual Machine templates.

Performance Landscape

Results are presented below on a variety of experiments run using the 2009.2 ECLIPSE 300 2 Million Cell Performance Benchmark (MM8). The compute nodes are a cluster of Sun Fire X2270 servers connected with QDR InfiniBand. First, some definitions used in the tables below:

Local HDD: Each job runs on a single node to its dedicated direct attached storage.
NFSoIB: One node hosts its local disk for NFS mounting to other nodes over InfiniBand.
IB 7410: Sun Storage 7410 system over QDR InfiniBand.
Compact Scheduling: All 8 MM8 MPI processes run on a single node.
Distributed Scheduling: Allocate the 8 MM8 MPI processes across all available nodes.

First Test

The first test compares the performance of a single MM8 job on a single node using local storage against a number of jobs running across the cluster, showing the effect of different storage solutions.

Compact Scheduling
Multiple Job Throughput Results Relative to Single Job
2009.2 ECLIPSE 300 MM8 2 Million Cell Performance Benchmark

Cluster Load Number of MM8 Jobs Local HDD Relative Throughput NFSoIB Relative Throughput IB 7410 Relative Throughput
13% 1 1.00 1.00\* 0.98
25% 2 0.98 0.97 0.98
50% 4 0.98 0.96 0.97
75% 6 0.98 0.95 0.95
100% 8 0.98 0.95 0.95

\* Performance measured on node hosting its local disk to other nodes in the cluster.

Second Test

This next test uses the Sun Storage 7410 system and compares the performance of running the MM8 job on 1 node with compact scheduling to running multiple jobs with either compact or distributed scheduling. The tests are run on an 8-node cluster, so each distributed job has only 1 MPI process per node.

Comparing Compact and Distributed Scheduling
Multiple Job Throughput Results Relative to Single Job
2009.2 ECLIPSE 300 MM8 2 Million Cell Performance Benchmark

Cluster Load Number of MM8 Jobs Compact Scheduling Relative Throughput Distributed Scheduling\* Relative Throughput
13% 1 1.00 1.34
25% 2 1.00 1.32
50% 4 0.99 1.25
75% 6 0.97 1.10
100% 8 0.97 0.98

\* Each distributed job has 1 MPI process per node.

Third Test

This next test uses the Sun Storage 7410 system and compares the performance of running the MM8 job on 1 node with compact scheduling to running multiple jobs with either compact or distributed scheduling. This test only uses 4 nodes, so each distributed job has two MPI processes per node.

Comparing Compact and Distributed Scheduling on 4 Nodes
Multiple Job Throughput Results Relative to Single Job
2009.2 ECLIPSE 300 MM8 2 Million Cell Performance Benchmark

Cluster Load Number of MM8 Jobs Compact Scheduling Relative Throughput Distributed Scheduling\* Relative Throughput
25% 1 1.00 1.39
50% 2 1.00 1.28
100% 4 1.00 1.00

\* Each distributed job has two MPI processes per node.

Fourth Test

The last test involves running two different applications on the 4 node cluster. It compares the performance of running the cluster fully loaded and changing how the applications are run, either compact or distributed. The comparisons are made against the individual application running the compact strategy (as few nodes as possible). It shows that appropriately mixing jobs can give better job performance than running just one kind of application on a single cluster.

Multiple Job, Multiple Application Throughput Results
Comparing Scheduling Strategies
2009.2 ECLIPSE 300 MM8 2 Million Cell and 3D Kirchoff Time Migration (PSTM)

"-" indicates a combination that was not run

Number of PSTM Jobs / Number of MM8 Jobs / ECLIPSE Compact Scheduling (1 node x 8 processes per job) / ECLIPSE Distributed Scheduling (4 nodes x 2 processes per job) / PSTM Distributed Scheduling (4 nodes x 4 processes per job) / PSTM Compact Scheduling (2 nodes x 8 processes per job) / Cluster Load
0 1 1.00 1.40 - - 25%
0 2 1.00 1.27 - - 50%
0 4 0.99 0.98 - - 100%
1 2 - 1.19 1.02 - 100%
2 0 - - 1.07 0.96 100%
1 0 - - 1.08 1.00 50%

Results and Configuration Summary

Hardware Configuration:

8 x Sun Fire X2270 servers, each with
2 x 2.93 GHz Intel Xeon X5570 processors
24 GB memory (6 x 4 GB memory at 1333 MHz)
1 x 500 GB SATA
Sun Storage 7410 system, 24 TB total, QDR InfiniBand
4 x 2.3 GHz AMD Opteron 8356 processors
128 GB memory
2 Internal 233 GB SAS drives (466 GB total)
2 Internal 93 GB read optimized SSD (186 GB total)
1 Sun Storage J4400 with 22 x 1 TB SATA drives and 2 x 18 GB write optimized SSD
20 TB RAID-Z2 (double parity) data and 2-way striped write optimized SSD or
11 TB mirrored data and mirrored write optimized SSD
QDR InfiniBand Switch

Software Configuration:

SUSE Linux Enterprise Server 10 SP 2
Scali MPI Connect 5.6.6
GNU C 4.1.2 compiler
2009.2 ECLIPSE 300
ECLIPSE license daemon flexlm v11.3.0.0
3D Kirchoff Time Migration

Benchmark Description

The benchmark is a home-grown study in resource usage options when running the Schlumberger ECLIPSE 300 Compositional reservoir simulator with 8 rank parallelism (MM8) to process Schlumberger's standard 2 Million Cell benchmark model. Schlumberger pre-built executables were used to process a 260x327x73 (2 Million Cell) sub-grid with 6,206,460 total grid cells and model 7 different compositional components within a reservoir. No source code modifications or executable rebuilds were conducted.

The ECLIPSE 300 MM8 job uses 8 MPI processes. It can run within a single node (compact) or across multiple nodes of a cluster (distributed). By using the MM8 job, it is possible to compare the performance of running each job on a separate node using local disk with that of a shared network attached storage solution. The benchmark tests study the effect of increasing the number of MM8 jobs in a throughput model.

The first test compares the performance of running 1, 2, 4, 6 and 8 jobs on a cluster of 8 nodes using local disk, NFSoIB disk, and the Sun Storage 7410 system connected via InfiniBand. Results are compared against the time it takes to run 1 job with local disk. This test shows what performance impact there is when loading down a cluster.

The second test compares different methods of scheduling jobs on a cluster. The compact method involves putting all 8 MPI processes for a job on the same node. The distributed method involves using 1 MPI process per node. The results compare the performance against 1 job on one node.

The third test is similar to the second test, but uses only 4 nodes in the cluster, so when running distributed, there are 2 MPI processes per node.

The fourth test compares the compact and distributed scheduling methods on 4 nodes while running two MM8 jobs and one 16-way parallel 3D Prestack Kirchhoff Time Migration (PSTM).

Key Points and Best Practices

  • ECLIPSE is very sensitive to memory bandwidth and needs to be run with 1333 MHz or faster memory. In order to maintain 1333 MHz memory speed, the maximum memory configuration for the processors used in this benchmark is 24 GB. BIOS upgrades now allow 1333 MHz memory for up to 48 GB of memory. Additional nodes can be used to handle data sets that require more memory than is available per node. Allocating at least 20% of memory per node for I/O caching helps application performance.

  • If allocating an 8-way parallel job (MM8) to a single node, it is best to use an ECLIPSE license for that particular node to avoid any additional network overhead of sharing a global license with all the nodes in a cluster.

  • Understanding the ECLIPSE MM8 I/O access patterns is essential to optimizing a shared storage solution. The analytics available on the Sun Storage 7410 system provide valuable I/O characterization information even without source code access. A single MM8 job run shows an initial read and write load related to reading the input grid, parsing Petrel ASCII input parameter files, and creating an initial solution grid and runtime specifications. This is followed by a very long running simulation that writes data and restart files, and generates reports to the 7410. Due to the nature of the small block I/O, the mirrored configuration for the 7410 outperformed the RAID-Z2 configuration.

    A single MM8 job reads, processes, and writes approximately 240 MB of grid and property data in the first 36 seconds of execution. The actual read and write of the grid data, that is intermixed with this first stage of processing, is done at a rate of 240 MB/sec to the 7410 for each of the two operations.

    Then, it calculates and reports the well connections at an average 260 KB writes/second with 32 operations/second = 32 x 8 KB writes/second. However, the actual size of each I/O operation varies between 2 to 100 KB and there are peaks every 20 seconds. The write cache is on average operating at 8 accesses/second at approximately 61 KB/second (8 x 8 KB writes/sec). As the number of concurrent jobs increases, the interconnect traffic and random I/O operations per second to the 7410 increases.
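
    The quoted rates can be cross-checked with simple arithmetic; the sketch below is an illustration of that arithmetic only, not measured data.

    /* Back-of-the-envelope checks of the MM8 I/O characterization above. */
    #include <stdio.h>

    int main(void)
    {
        /* Load phase: ~240 MB moved at ~240 MB/s means each of the read and
           write passes spends only ~1 s on I/O within the 36 s phase. */
        printf("load-phase I/O time per pass: ~%.1f s of 36 s\n", 240.0 / 240.0);

        /* Well-connection reporting: 32 ops/s of ~8 KB each. */
        printf("report writes: 32 x 8 KB = %d KB/s (~260 KB/s quoted)\n", 32 * 8);

        /* Write cache: 8 accesses/s of ~8 KB each. */
        printf("write cache:    8 x 8 KB = %d KB/s (~61 KB/s quoted)\n", 8 * 8);
        return 0;
    }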

  • MM8 multiple job startup time is reduced on shared file systems, if each job uses separate input files.

See Also

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 9/20/2010.

Tuesday Jun 29, 2010

Sun Fire X2270 M2 Achieves Leading Single Node Results on ANSYS FLUENT Benchmark

Oracle's Sun Fire X2270 M2 server produced leading single node performance results running the ANSYS FLUENT benchmark cases as compared to the best single node results currently posted at the ANSYS FLUENT website. ANSYS FLUENT is a prominent MCAE application used for computational fluid dynamics (CFD).

  • The Sun Fire X2270 M2 server outperformed all single node systems in 5 of 6 test cases at the 12 core level, beating systems from Cray and SGI.
  • For the truck_14m test, the Sun Fire X2270 M2 server outperformed all single node systems at all posted core counts, beating systems from SGI, Cray and HP. When considering performance on a single node, the truck_14m model is most representative of customer CFD model sizes in the test suite.
  • The Sun Fire X2270 M2 server with 12 cores performed up to 1.3 times faster than the previous generation Sun Fire X2270 server with 8 cores.

Performance Landscape

Results are presented for six of the seven ANSYS FLUENT benchmark tests. The seventh test is not a practical test for a single system. Results are ratings, where bigger is better. A rating is the number of jobs that could be run in a single day (86,400 / run time). Competitive results are from the ANSYS FLUENT benchmark website as of 25 June 2010.
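
To make the rating arithmetic concrete, the sketch below converts elapsed run times to ratings; for instance, the eddy_417k rating of 1129.4 below corresponds to an elapsed time of about 76.5 seconds, and the truck_14m rating of 94.8 to about 911 seconds. The times in the code are back-computed examples chosen only to illustrate the formula.

    /* Rating arithmetic used for all results on this page:
       rating = jobs per day = 86400 s / total elapsed run time (s).
       The elapsed times below are illustrative examples. */
    #include <stdio.h>

    int main(void)
    {
        const double elapsed[] = {76.5, 911.0};  /* example run times, seconds */
        const int n = sizeof elapsed / sizeof elapsed[0];

        for (int i = 0; i < n; i++)
            printf("elapsed %6.1f s -> rating %6.1f jobs/day\n",
                   elapsed[i], 86400.0 / elapsed[i]);
        return 0;
    }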

Single System Performance

ANSYS FLUENT Benchmark Tests
Results are Ratings, Bigger is Better
System Benchmark Test
eddy_417k turbo_500k aircraft_2m sedan_4m truck_14m truck_poly_14m
Sun Fire X2270 M2 1129.4 5391.6 1105.9 814.1 94.8 96.4
SGI Altix 8400EX 1338.0 5308.8 1073.3 796.3 - -
SGI Altix XE1300C 1299.2 5284.4 1071.3 801.3 90.2 -
Cray CX1 1060.8 5127.6 1069.6 808.6 86.1 87.5

Scaling of Benchmark Test truck_14m

ANSYS FLUENT truck_14m Model
Results are Ratings, Bigger is Better
System Cores Used
12 8 4 2 1
Sun Fire X2270 M2 94.8 73.6 41.4 21.0 10.4
SGI Altix XE1300C 90.2 60.9 41.1 20.7 9.0
Cray CX1 (X5570) - 71.7 33.2 18.9 8.1
HP BL460 G6 (X5570) - 70.3 38.6 19.6 9.2

Comparing System Generations, Sun Fire X2270 M2 to Sun Fire X2270

ANSYS FLUENT Benchmark Tests
Results are Ratings, Bigger is Better
System Benchmark Test
eddy_417k turbo_500k aircraft_2m sedan_4m truck_14m truck_poly_14m
Sun Fire X2270 M2 1129.4 5374.8 1103.8 814.1 94.8 96.4
Sun Fire X2270 981.5 4163.9 862.7 691.2 73.6 73.3

Ratio 1.15 1.29 1.28 1.18 1.29 1.32

Results and Configuration Summary

Hardware Configuration:

Sun Fire X2270 M2
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory
1 x 500 GB 7200 rpm SATA internal HDD

Sun Fire X2270
2 x 2.93 GHz Intel Xeon X5570 processors
48 GB memory
2 x 24 GB internal striped SSDs

Software Configuration:

64-bit SUSE Linux Enterprise Server 10 SP 3 (SP 2 for X2270)
ANSYS FLUENT V12.1.2
ANSYS FLUENT Benchmark Test Suite

Benchmark Description

The following description is from the ANSYS FLUENT website:

The FLUENT benchmark suite comprises a set of test cases covering a large range of mesh sizes, physical models and solvers representing typical industry usage. The cases range in size from a few hundred thousand cells to more than 100 million cells. Both the segregated and coupled implicit solvers are included, as well as hexahedral, mixed and polyhedral cell cases. This broad coverage is expected to demonstrate the breadth of FLUENT performance on a variety of hardware platforms and test cases.

The performance of a CFD code will depend on several factors, including size and topology of the mesh, physical models, numerics and parallelization, compilers and optimization, in addition to performance characteristics of the hardware where the simulation is performed. The principal objective of this benchmark suite is to provide comprehensive and fair comparative information of the performance of FLUENT on available hardware platforms.

About the ANSYS FLUENT 12 Benchmark Test Suite

    CFD models tend to be very large where grid refinement is required to capture with accuracy conditions in the boundary layer region adjacent to the body over which flow is occurring. Fine grids are required to also determine accurate turbulence conditions. As such these models can run for many hours or even days as well using a large number of processors.

See Also

Disclosure Statement

All information on the FLUENT website (http://www.fluent.com) is Copyrighted 1995-2010 by ANSYS Inc. Results as of June 25, 2010.

Sun Fire X2270 M2 Demonstrates Outstanding Single Node Performance on MSC.Nastran Benchmarks

Oracle's Sun Fire X2270 M2 server delivered outstanding performance running the MCAE application MSC.Nastran, as shown by the MD Nastran MDR3 serial and parallel test cases.

Performance Landscape

Complete information about the serial results presented below can be found on the MSC Nastran website.


MD Nastran MDR3 Serial Test Results
Results are total elapsed run time in seconds; "-" = not reported
Platform xl0imf1 xx0xst0 xl1fn40 vl0sst1
Sun Fire X2270 M2 999 704 2337 115
Sun Blade X6275 1107 798 2285 120
Intel Nehalem 1235 971 2453 123
Intel Nehalem w/ SSD 1484 767 2456 120
IBM:P6 570 ( I8 ) - 1510 4612 132
IBM:P6 570 ( I4 ) 1016 1618 5534 147

Complete information about the parallel results presented below can be found on the MSC Nastran website.


MD Nastran MDR3 Parallel Test Results
Results are total elapsed run time in seconds; "-" = not run
Platform xx0cmd2 (Serial DMP=2 DMP=4 DMP=8) md0mdf1 (Serial DMP=2 DMP=4)
Sun Blade X6275 840 532 391 279 880 422 223
Sun Fire X2270 M2 847 558 371 297 889 462 232
Intel Nehalem w/ 4 SSD 887 639 405 - 902 479 235
Intel Nehalem 915 561 408 - 922 470 251
IBM:P6 570 ( I8 ) 920 574 392 322 - - -
IBM:P6 570 ( I4 ) 959 616 419 343 911 469 242

Results and Configuration Summary

Hardware Configuration:

Sun Fire X2270 M2
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory
4 x 24 GB SSDs (striped)

Software Configuration:

64-bit SUSE Linux Enterprise Server 10 SP 3
MSC Software MD 2008 R3
MD Nastran MDR3 benchmark test suite

Benchmark Description

The benchmark tests are representative of typical MSC.Nastran applications including both serial and parallel (DMP) runs involving linear statics, nonlinear statics, and natural frequency extraction as well as others. MD Nastran is an integrated simulation system with a broad set of multidiscipline analysis capabilities.

Key Points and Best Practices

  • The test cases for the MSC.Nastran module all have a substantial I/O component where 15% to 25% of the total run times are associated with I/O activity (primarily scratch files). The required scratch file size ranges from less than 1 GB on up to about 140 GB. To obtain best performance, it is important to have a high performance storage system when running MD Nastran.

  • To improve performance, it is possible to make use of the MD Nastran feature which sets the maximum amount of memory the application will use. This allows a user to configure where temporary files are held, including in memory file systems like tmpfs.

See Also

Disclosure Statement

MSC.Software is a registered trademark of MSC. All information on the MSC.Software website is copyrighted. MD Nastran MDR3 results from http://www.mscsoftware.com and this report as of June 28, 2010.

Sun Fire X4170 M2 Sets World Record on SPEC CPU2006 Benchmark

Oracle's Sun Fire X4170 M2 server equipped with two Intel Xeon X5670 2.93 GHz processors and running the Oracle Solaris 10 operating system delivered a world record score of 53.5 SPECfp_base2006.

  • The Sun Fire X4170 M2 server using the Oracle Solaris Studio Express 06/10 compiler delivered a world record result of 53.5 SPECfp_base2006.

  • The Sun Fire X4170 M2 server delivered 20% better performance on the SPECfp_base2006 benchmark compared to the IBM 780 POWER7 based system.

  • The Sun Fire X4170 M2 server beat systems from Supermicro (X8DTU-LN4F+), Dell (R710), IBM (x3650 M3) and Bull (R460 F2) on SPECfp_base2006.

Performance Landscape

SPEC CPU2006 Performance Charts - bigger is better, selected results, please see www.spec.org for complete results. All results as of 06/28/10.

In the tables below
"Base" = SPECint_base2006, SPECfp_base2006, SPECint_rate_base2006 or SPECfp_rate_base2006
"Peak" = SPECint2006, SPECfp2006, SPECint_rate2006 or SPECfp_rate2006

SPECfp2006 results

System Cores/Chips Processor Type GHz Peak Base
Sun Fire X4170 M2 12/2 Xeon X5670 2.93 57.6 53.5
Sun Fire X2270 M2 12/2 Xeon X5670 2.93 58.6 49.9
Supermicro X8DTU-LN4F+ 8/2 Xeon X5677 3.46 48.8 45.9
IBM x3650 M3 8/2 Xeon X5677 3.46 48.9 45.8
Bull R460 F2 8/2 Xeon X5677 3.46 49.3 45.8
Dell R710 8/2 Xeon X5677 3.46 49.3 45.8
Dell R710 12/2 Xeon X5680 3.33 48.5 45.0
IBM 780 16/2 POWER7 3.94 71.5 44.5
Dell R710 12/2 Xeon X5670 2.93 45.8 42.5

SPECint_rate2006 results

System Cores/Chips Processor Type GHz Base Copies Peak Base
Dell R815 24/2 Opteron 6176 2.3 24 401 314
Fujitsu BX922 S2 12/2 Xeon X5680 3.33 24 381 354
Dell R710 12/2 Xeon X5680 3.33 24 379 355
Sun Blade X6270 M2 12/2 Xeon X5680 3.33 24 369 337
Sun Fire X4170 M2 12/2 Xeon X5670 2.93 24 353 316
Sun Fire X2270 M2 (S10) 12/2 Xeon X5670 2.93 24 346 311
Sun Fire X2270 M2 (OEL) 12/2 Xeon X5670 2.93 24 342 320

SPECfp_rate2006 results

System Cores/Chips Processor Type GHz Base Copies Peak Base
Dell R815 24/2 Opteron 6176 2.3 24 323 295
Dell R710 12/2 Xeon X5680 3.33 24 256 248
Fujitsu BX922 S2 12/2 Xeon X5680 3.33 24 256 248
Sun Blade X6270 M2 12/2 Xeon X5680 3.33 24 255 247
Sun Fire X4170 M2 12/2 Xeon X5670 2.93 24 245 234
Sun Fire X2270 M2 (S10) 12/2 Xeon X5670 2.93 24 240 231
Sun Fire X2270 M2 (OEL) 12/2 Xeon X5670 2.93 24 235 226

Results and Configuration Summary

Hardware Configuration:

Sun Fire X4170 M2
2 x 2.93 GHz Intel Xeon X5670
48 GB memory
Sun Fire X2270 M2
2 x 2.93 GHz Intel Xeon X5670
48 GB memory
Sun Blade X6270 M2
2 x 3.33 GHz Intel Xeon X5680
48 GB memory

Software Configuration:

Oracle Solaris 10 10/09
Oracle Solaris Studio Express 6/10
SPEC CPU2006 suite v1.1
MicroQuill SmartHeap Library v8.1

Benchmark Description

SPEC CPU2006 is SPEC's most popular benchmark, with over 8000 results published in the three years since it was introduced. It measures:

  • "Speed" - single copy performance of chip, memory, compiler
  • "Rate" - multiple copy (throughput)

The rate metrics are used for the throughput-oriented systems described on this page. These metrics include:

  • SPECint_rate2006: throughput for 12 integer benchmarks derived from real applications such as perl, gcc, XML processing, and pathfinding
  • SPECfp_rate2006: throughput for 17 floating point benchmarks derived from real applications, including chemistry, physics, genetics, and weather.

There are base variants of both the above metrics that require more conservative compilation. In particular, all benchmarks of a particular programming language must use the same compilation flags.

See Also

Disclosure Statement

SPEC, SPECint, SPECfp reg tm of Standard Performance Evaluation Corporation. Results from www.spec.org as of 24 June 2010 and this report. Sun Fire X4170 M2 53.5 SPECfp_base2006.

Monday Jun 28, 2010

Sun Fire X4470 Sets World Record on SPEC CPU2006 Rate Benchmark

Oracle's Sun Fire X4470 server delivered a world record SPECint_rate2006 result for all x86 systems with 4 chips.

  • The Sun Fire X4470 server with four Intel Xeon X7560 processors achieved a SPECint_rate2006 score of 788 and a SPECfp_rate2006 score of 573

  • The Sun Fire X4470 server delivered better 4-socket x86 system performance on the SPECint_rate2006 benchmark compared to HP (DL585 G7), Cisco (UCS C460 M1), Dell (R815) and IBM (x3850 X5).

  • The Sun Fire X4470 server delivered better performance on the SPECfp_rate2006 benchmark compared to similar Intel Xeon X7560 processor based systems from Cisco (UCS C460 M1), IBM (x3850 X5), and Fujitsu (RX600 S5).

Performance Landscape

SPEC CPU2006 Performance Charts - bigger is better, selected results, please see www.spec.org for complete results. All results as of 06/28/10.

In the tables below
"Base" = SPECint_rate_base2006 or SPECfp_rate_base2006
"Peak" = SPECint_rate2006 or SPECfp_rate2006

SPECint_rate2006 results

System Cores/Chips Processor Type GHz Base Copies Peak Base
Sun Fire X4470 32/4 Xeon X7560 2.26 64 788 724
HP DL585 G7 48/4 Opteron 6176 2.3 48 782 610
Cisco UCS C460 M1 32/4 Xeon X7560 2.26 64 772 723
Dell R815 48/4 Opteron 6174 2.20 48 771 602
IBM x3850 X5 32/4 Xeon X7560 2.26 64 770 720
Sun Fire X4640 48/8 Opteron 8435 2.6 48 730 574

SPECfp_rate2006 results

System Cores/Chips Processor Type GHz Base Copies Peak Base
Dell R815 48/4 Opteron 6174 2.20 48 626 574
HP DL585 G7 48/4 Opteron 6176 2.3 48 619 572
Sun Fire X4470 32/4 Xeon X7560 2.26 64 573 547
Cisco UCS C460 M1 32/4 Xeon X7560 2.26 64 568 549
IBM x3850 X5 32/4 Xeon X7560 2.26 64 560 543
Fujitsu RX600 S5 32/4 Xeon X7560 2.26 64 559 538
Sun Fire X4640 48/8 Opteron 8435 2.6 48 470 434

Results and Configuration Summary

Hardware Configuration:

Sun Fire X4470
4 x 2.26 GHz Intel Xeon X7560
256 GB

Software Configuration:

Oracle Solaris 10 10/09
Oracle Solaris Studio Express 6/10
SPEC CPU2006 suite v1.1
MicroQuill SmartHeap Library v8.1

Benchmark Description

SPEC CPU2006 is SPEC's most popular benchmark, with over 8000 results published in the three years since it was introduced. It measures:

  • "Speed" - single copy performance of chip, memory, compiler
  • "Rate" - multiple copy (throughput)

The rate metrics are used for the throughput-oriented systems described on this page. These metrics include:

  • SPECint_rate2006: throughput for 12 integer benchmarks derived from real applications such as perl, gcc, XML processing, and pathfinding
  • SPECfp_rate2006: throughput for 17 floating point benchmarks derived from real applications, including chemistry, physics, genetics, and weather.

There are base variants of both the above metrics that require more conservative compilation. In particular, all benchmarks of a particular programming language must use the same compilation flags.

See Also

Disclosure Statement

SPEC, SPECint, SPECfp reg tm of Standard Performance Evaluation Corporation. Results from www.spec.org as of 24 June 2010 and this report. Sun Fire X4470 788 SPECint_rate2006.

Tuesday Apr 06, 2010

WRF Benchmark: X6275 Beats Power6

Significance of Results

Oracle's Sun Blade X6275 cluster is 28% faster than the IBM POWER6 cluster on Weather Research and Forecasting (WRF) continental United States (CONUS) benchmark datasets. The Sun Blade X6275 cluster used a Quad Data Rate (QDR) InfiniBand connection along with Intel compilers and MPI.

  • On the 12 km CONUS data set, the Sun Blade X6275 cluster was 28% faster than the IBM POWER6 cluster at 512 cores.

  • The Sun Blade X6275 cluster with 768 cores (one full Sun Blade 6048 chassis) was 47% faster than 1024 cores of the IBM POWER6 cluster (multiple racks).

  • On the 2.5 km CONUS data set, the Sun Blade X6275 cluster was 21% faster than the IBM POWER6 cluster at 512 cores.

  • The Sun Blade X6275 cluster with 768 cores (one full Sun Blade 6048 chassis) outperformed the IBM POWER6 cluster with 1024 cores by 28% on the 2.5 km CONUS dataset.

Performance Landscape

The performance in GFLOPS is shown below on multiple datasets.

Weather Research and Forecasting
CONUS 12 KM Dataset
Performance in GFLOPS; "-" = not reported
Cores Sun X6275 Intel Whitebox IBM POWER6 Cray XT5 SGI TACC Ranger Blue Gene/P
8 17.5 19.8 17.0 - 10.2 - -
16 38.7 37.5 33.6 21.4 20.1 10.8 -
32 71.6 73.3 66.5 40.4 39.8 21.2 5.9
64 132.5 131.4 117.2 75.2 77.0 37.8 -
128 235.8 232.8 209.1 137.4 114.0 74.5 20.4
192 323.6 - - - - - -
256 405.2 415.1 363.1 243.2 197.9 121.0 37.4
384 556.6 - - - - - -
512 691.9 696.7 542.2 392.2 375.2 193.9 65.6
768 912.0 - - - - - -
1024 - - 618.5 634.1 605.9 271.7 108.5
1700 - - - 840.1 - - -
2048 - - - - - - 175.6
All cores used on each node which participates in each run.

Sun X6275 - 2.93 GHz X5570, InfiniBand
Intel Whitebox - 2.8 GHz X5560, InfiniBand
IBM POWER6 - IBM Power 575, 4.7 GHz POWER6, InfiniBand, 3 frames
Cray XT5 - 2.7 GHz AMD Opteron (Shanghai), Cray SeaStar 2.1
SGI - best of a variety of results
TACC Ranger - 2.3 GHz AMD Opteron (Barcelona), InfiniBand
Blue Gene/P - 850 MHz PowerPC 450, 3D-Torus (proprietary)

Weather Research and Forecasting
CONUS 2.5 KM Dataset
Performance in GFLOPS; "-" = not reported
Cores Sun X6275 SGI 8200EX Blue Gene/L IBM POWER6 Cray XT5 Intel Whitebox TACC Ranger
16 35.2 - - - - - -
32 69.6 - - 64.3 - - -
64 140.2 - - 130.9 - 147.8 24.5
128 278.9 89.9 - 242.5 152.1 290.6 87.7
192 400.5 - - - - - -
256 514.8 179.6 8.3 431.3 306.3 535.0 145.3
384 735.1 - - - - - -
512 973.5 339.9 16.5 804.4 566.2 1019.9 311.0
768 1367.7 - - - - - -
1024 - 721.5 124.8 1067.3 1075.9 1911.4 413.4
2048 - 1389.5 241.2 - 1849.7 3251.1 -
2600 - - - - - 4320.6 -
3072 - 1918.7 350.5 - 2651.3 - -
4096 - 2543.5 453.2 - 3288.7 - -
6144 - 3057.3 642.3 - 4280.1 - -
8192 - 3569.7 820.4 - 5140.4 - -
18432 - - 1238.0 - - - -

Sun X6275 - 2.93 GHz X5570, InfiniBand
SGI 8200EX - 3.0 GHz E5472, InfiniBand
Blue Gene/L - 700 MHz PowerPC 440, 3D-Torus (proprietary)
IBM POWER6 - IBM Power 575, 4.7 GHz POWER6, InfiniBand, 3 frames
Cray XT5 - 2.4 GHz AMD Opteron (Shanghai), Cray SeaStar 2.1
Intel Whitebox - 2.8 GHz X5560, InfiniBand
TACC Ranger - 2.3 GHz AMD Opteron (Barcelona), InfiniBand

Results and Configuration Summary

Hardware Configuration:

48 x Sun Blade X6275 server modules, 2 nodes per blade, each node with
2 Intel Xeon X5570 2.93 GHz processors, turbo enabled, ht disabled
24 GB memory
QDR InfiniBand

Software Configuration:

SUSE Linux Enterprise Server 10 SP2
Intel Compilers 11.1.059
Intel MPI 3.2.2
WRF 3.0.1.1
WRF 3.1.1
netCDF 4.0.1

Benchmark Description

The Weather Research and Forecasting (WRF) Model is a next-generation mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs. WRF is designed to be a flexible, state-of-the-art atmospheric simulation system that is portable and efficient on available parallel computing platforms. It features multiple dynamical cores, a 3-dimensional variational (3DVAR) data assimilation system, and a software architecture allowing for computational parallelism and system extensibility.

There are two fixed-size benchmark cases.

Single domain, medium size 12KM Continental US (CONUS-12K)

  • 425x300x35 cell volume
  • 48hr, 12km resolution dataset from Oct 24, 2001
  • Benchmark is a 3hr simulation for hrs 25-27 starting from a provided restart file
  • Iterations output at every 72 sec of simulation time, with the computation cost of each time step ~30 GFLOP

Single domain, large size 2.5KM Continental US (CONUS-2.5K)

  • 1501x1201x35 cell volume
  • 6hr, 2.5km resolution dataset from June 4, 2005
  • Benchmark is the final 3hr simulation for hrs 3-6 starting from a provided restart file; the benchmark may also be performed (but seldom reported) for the full 6hrs starting from a cold start
  • Iterations output at every 15 sec of simulation time, with the computation cost of each time step ~412 GFLOP
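
These definitions determine how a GFLOPS figure relates to wall-clock time: total work is (simulated seconds / step length) x cost per step, and the reported rate divides that work by the elapsed wall time. A small sketch of the arithmetic, using the 512-core Sun Blade X6275 rates from the tables above:

    /* How the WRF GFLOPS figures follow from the dataset definitions. */
    #include <stdio.h>

    int main(void)
    {
        /* CONUS-12K: 3 h simulated at 72 s/step, ~30 GFLOP per step. */
        double work12 = (3.0 * 3600.0 / 72.0) * 30.0;   /* ~4,500 GFLOP   */
        /* CONUS-2.5K: 3 h simulated at 15 s/step, ~412 GFLOP per step. */
        double work25 = (3.0 * 3600.0 / 15.0) * 412.0;  /* ~296,640 GFLOP */

        printf("CONUS-12K  total work: %8.0f GFLOP\n", work12);
        printf("CONUS-2.5K total work: %8.0f GFLOP\n", work25);

        /* Implied elapsed times for the 512-core Sun X6275 results above. */
        printf("12K  at 691.9 GFLOPS -> %6.1f s elapsed\n", work12 / 691.9);
        printf("2.5K at 973.5 GFLOPS -> %6.1f s elapsed\n", work25 / 973.5);
        return 0;
    }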

See Also

Disclosure Statement

WRF, see http://www.mmm.ucar.edu/wrf/WG2/bench/, results as of 3/8/2010.

Monday Mar 29, 2010

Sun Blade X6275/QDR IB/ Reverse Time Migration

Significance of Results

Oracle's Sun Blade X6275 cluster with a Lustre file system was used to demonstrate the performance potential of the system when running reverse time migration applications complete with I/O processing.

  • Reduced the Total Application run time for the Reverse Time Migration when processing 800 input traces for two production sized surveys from a QDR InfiniBand Lustre file system on 24 X6275 nodes, by implementing algorithm I/O optimizations and taking advantage of MPI I/O features in HPC ClusterTools:

    • 1243x1151x1231 - Wall clock time reduced from 11.5 to 6.3 minutes (1.8x improvement)
    • 2486x1151x1231 - Wall clock time reduced from 21.5 to 13.5 minutes (1.6x improvement)
  • Reduced the I/O Intensive Trace-Input time for the Reverse Time Migration when reading 800 input traces for two production sized surveys from a QDR InfiniBand Lustre file system on 24 X6275 nodes running HPC ClusterTools, by modifying the algorithm to minimize the per-node data requirement and avoiding unneeded synchronization:

    • 2486x1151x1231 : Time reduced from 121.5 to 3.2 seconds (38.0x improvement)
    • 1243x1151x1231 : Time reduced from 71.5 to 2.2 seconds (32.5x improvement)
  • Reduced the I/O Intensive Grid Initialization time for the Reverse Time Migration Grid when reading the Velocity, Epsilon, and Delta slices for two production sized surveys from a QDR InfiniBand Lustre file system on 24 X6275 nodes running HPC ClusterTools, by modifying the algorithm to minimize the per-node grid data requirement:

    • 2486x1151x1231 : Time reduced from 15.6 to 4.9 seconds (3.2x improvement)
    • 1243x1151x1231 : Time reduced from 8.9 to 1.2 seconds (7.4x improvement)

Performance Landscape

In the tables below, the hyperthreading feature is enabled and the systems are fully utilized.

This first table presents the total application performance in minutes. The overall performance improved significantly because of the improved I/O performance and other benefits.


Total Application Performance Comparison
Reverse Time Migration - SMP Threads and MPI Mode
Each grid size reports: Original Time (mins), MPI I/O Time (mins), Improvement
Nodes 1243 x 1151 x 1231 (800 Samples) 2486 x 1151 x 1231 (800 Samples)
24 11.5 6.3 1.8x 21.5 13.5 1.6x
20 12.0 8.0 1.5x 21.9 15.4 1.4x
16 13.8 9.7 1.4x 26.2 18.0 1.5x
12 21.7 13.2 1.6x 29.5 23.1 1.3x

This next table presents the initialization I/O time. The results are presented in seconds and shows the advantage of the improved MPI I/O strategy.


Initialization Time Performance Comparison
Reverse Time Migration - SMP Threads and MPI Mode
Each grid size reports: Original Time (sec), MPI I/O Time (sec), Improvement
Nodes 1243 x 1151 x 1231 (800 Samples) 2486 x 1151 x 1231 (800 Samples)
24 8.9 1.2 7.4x 15.6 4.9 3.2x
20 9.3 1.5 6.2x 16.9 3.9 4.3x
16 9.7 2.5 3.9x 17.4 11.3 1.5x
12 9.8 3.3 3.0x 22.5 14.9 1.5x

This last table presents the trace I/O time. The results are presented in seconds and shows the significant advantage of the improved MPI I/O strategy.


Trace I/O Time Performance Comparison
Reverse Time Migration - SMP Threads and MPI Mode
Each grid size reports: Original Time (sec), MPI I/O Time (sec), Improvement
Nodes 1243 x 1151 x 1231 (800 Samples) 2486 x 1151 x 1231 (800 Samples)
24 71.5 2.2 32.5x 121.5 3.2 38.0x
20 67.7 2.4 28.2x 118.3 3.9 30.3x
16 64.2 2.7 23.7x 110.7 4.6 24.1x
12 69.9 4.2 16.6x 296.3 14.6 20.3x

Results and Configuration Summary

Hardware Configuration:

Oracle's Sun Blade 6048 Modular System with
12 x Oracle's Sun Blade X6275 server modules, each with
4 x 2.93 GHz Intel Xeon QC X5570 processors
12 x 4 GB memory at 1333 MHz
2 x 24 GB Internal Flash
QDR InfiniBand Lustre 1.8.0.1 File System

Software Configuration:

OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
MPI: Oracle Message Passing Toolkit 8.2.1 for I/O optimization to Lustre file system
MPI: Scali MPI Connect 5.6.6-59413 for original Lustre file system runs
Compiler: Oracle Solaris Studio 12 C++, Fortran, OpenMP

Benchmark Description

The primary objective of this Reverse Time Migration benchmark is to present MPI I/O tuning techniques, exploit the power of Sun's HPC ClusterTools MPI I/O implementation, and demonstrate the world-class performance of Sun's Lustre file system to exploration geophysicists throughout the world. A Sun Blade 6048 Modular System with 12 Sun Blade X6275 server modules was clustered together with a QDR InfiniBand Lustre file system to show performance improvements in Reverse Time Migration throughput, using the Sun HPC ClusterTools MPI-IO features to implement specific algorithm I/O optimizations.

This Reverse Time Migration benchmark measures the total time it takes to image 800 samples of various production size grids and write the final image to disk. In this new I/O optimized version, each node reads in only the data to be processed by that node, plus a 4 element inline pad shared with its left and right neighbors. This latest version essentially loads the boundary condition data during the initialization phase. The previous version handled boundary conditions by having each node read in all the trace, velocity, and conditioning data; alternatively, the master node would read in all the data and distribute it in its entirety to every node in the cluster. With the previous version, each node had full memory copies of all input data sets even when it only processed a subset of that data. The new version holds in memory only the inline dimensions and pads to be processed by a particular node.

Key Points and Best Practices

  • The original implementation of the trace I/O involved the master node reading in nx \* ny floats and communicating this trace data to all the other nodes in a synchronous manner. Each node only used a subset of the trace data for each of the 800 time steps. The optimized I/O version has each node asynchronously read in only the (nx/num_procs + 8) \* ny floats that it will be processing. The additional 8 inline values in the optimized version are the 4 element pads of a node's left and right neighbors, needed for the initial boundary conditions. The MPI_Barrier needed for synchronization in the original implementation, and the additional I/O required for each node to load all the data values, truly impact performance. In the I/O optimized version, each node reads only the data values it needs and does not require the MPI_Barrier synchronization of the original version. By performing such I/O optimizations, a significant improvement is seen in the trace I/O; a sketch of the pattern appears below.
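
    A minimal sketch of the optimized trace read follows. It is illustrative only: the file name, grid sizes, and flat inline-major file layout are assumptions, and the production code uses the Sun HPC ClusterTools MPI-IO implementation. Each rank computes its inline slab plus the 4-element halos and issues one collective MPI-IO read instead of waiting on a master-node broadcast.

        /* Sketch: per-rank slab read with halo pads via collective MPI-IO. */
        #include <mpi.h>
        #include <stdio.h>
        #include <stdlib.h>

        int main(int argc, char **argv)
        {
            int rank, nprocs;
            const int nx = 1243, ny = 1151;  /* example trace grid dimensions  */
            const int pad = 4;               /* halo shared with each neighbor */

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

            /* Inline range owned by this rank, extended by the halo pads. */
            int chunk = nx / nprocs;
            int first = rank * chunk - (rank > 0 ? pad : 0);
            int last  = (rank == nprocs - 1 ? nx : (rank + 1) * chunk)
                        + (rank < nprocs - 1 ? pad : 0);
            int count = (last - first) * ny;
            float *slab = malloc((size_t)count * sizeof(float));

            MPI_File fh;
            MPI_File_open(MPI_COMM_WORLD, "traces.bin",  /* hypothetical file */
                          MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
            /* Collective read: no MPI_Barrier, no redundant whole-file reads. */
            MPI_File_read_at_all(fh, (MPI_Offset)first * ny * sizeof(float),
                                 slab, count, MPI_FLOAT, MPI_STATUS_IGNORE);
            MPI_File_close(&fh);

            printf("rank %d read inlines [%d,%d) = %d floats\n",
                   rank, first, last, count);
            free(slab);
            MPI_Finalize();
            return 0;
        }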

  • For the best MPI performance, allocate the X6275 nodes in blade-by-blade order and run with hyperthreading enabled. The "Binary Conditioning" part of the Reverse Time Migration particularly benefits from hyperthreading.

  • To get the best I/O performance, use a maximum of 70% of each node's available memory for the Reverse Time Migration application. Execution times may vary, and inconsistent I/O results can occur, if the nodes have different memory size configurations.

See Also

Friday Nov 20, 2009

Sun Blade X6275 cluster delivers leading results for Fluent truck_111m benchmark

A Sun Blade 6048 Modular System with 16 Sun Blade X6275 Server Modules configured with QDR InfiniBand cluster interconnect delivered outstanding performance running the FLUENT benchmark test suite truck_111m case.

  • A cluster of Sun Blade X6275 server modules with 2.93 GHz Intel X5570 processors achieved leading 32-node performance for the largest truck test case, truck_111m.
  • The Sun Blade X6275 cluster delivered the best performance for the 64-core/8-node, 128-core/16-node, and 256-core/32-node configurations, outperforming the SGI Altix result by as much as 8%.
  • NOTE: These results will not be published on the FLUENT website, as Fluent has stopped accepting results for this version.

Performance Landscape


FLUENT 12 Benchmark Test Suite - truck_111m
  Results are "Ratings" (bigger is better)
  Rating = No. of sequential runs of test case possible in 1 day = 86,400 sec/(Total Elapsed Run Time in seconds)

System (1) Cores truck_111m

Sun Blade X6275, 32 nodes 256 240.0
SGI Altix ICE 8200 IP95, 32 nodes 256 238.9
Intel Whitebox, 32 nodes 256 219.8

Sun Blade X6275, 16 nodes 128 129.6
SGI Altix ICE 8200 IP95, 16 nodes 128 120.8
Intel Whitebox, 16 nodes 128 116.9

Sun Blade X6275, 8 nodes 64 64.6
SGI Altix ICE 8200 IP95, 8 nodes 64 59.8
Intel Whitebox, 8 nodes 64 57.4

(1) Sun Blade X6275, X5570 QC 2.93GHz, QDR
Intel Whitebox, X5560 QC 2.8GHz, DDR
SGI Altix ICE 8200, X5570 QC 2.93GHz, DDR

Results and Configuration Summary

Hardware Configuration:

    16 x Sun Blade X6275 Server Module ( Dual-Node Blade, 32 nodes ) each node with
      2 x 2.93GHz Intel X5570 QC processors
      24 GB (6 x 4GB, 1333 MHz DDR3 dimms)
      On-board QDR InfiniBand Host Channel Adapters, QNEM

Software Configuration:

    OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
    Interconnect Software: OFED ver 1.4.1
    Shared File System: Lustre ver 1.8.1
    Application: FLUENT V12.0.16
    Benchmark: FLUENT 12 Benchmark Test Suite

Benchmark Description

The benchmark tests are representative of typical large user CFD models intended for execution in distributed memory processor (DMP) mode over a cluster of multi-processor platforms.

Key Points and Best Practices

Observations About the Results

The Sun Blade X6275 cluster delivered excellent performance on the largest Fluent benchmark problem, truck_111m.

The Intel X5570 processors include a turbo boost feature coupled with a speedstep option in the CPU section of the advanced BIOS settings. Under specific circumstances, this can provide CPU up-clocking, temporarily increasing the processor frequency from 2.93 GHz to 3.2 GHz.

Memory placement is a very significant factor with Nehalem processors. Current Nehalem platforms have two sockets. Each socket has three memory channels, and each channel has 3 bays for DIMMs. For example, if one DIMM is placed in the 1st bay of each of the 3 channels, the DIMM speed will be 1333 MHz with the X5570s. Altering the DIMM arrangement to an unbalanced configuration, say by adding just one more DIMM into the 2nd bay of one channel, will cause the DIMM frequency to drop from 1333 MHz to 1067 MHz.

About the FLUENT 12 Benchmark Test Suite

The FLUENT application performs computational fluid dynamics analysis on a variety of different types of flow, and allows for chemically reacting species. Analyses can be transient or steady state, and linear or nonlinear.

  • CFD models tend to be very large where grid refinement is required to capture with accuracy conditions in the boundary layer region adjacent to the body over which flow is occurring. Fine grids are required to also determine accurate turbulence conditions. As such these models can run for many hours or even days as well using a large number of processors.
  • CFD models typically scale very well and are very suited for execution on clusters. The FLUENT 12 benchmark test cases scale well.
  • The memory requirements for the test cases in the FLUENT 12 benchmark test suite range from a few hundred megabytes to about 25 GB. As the job is distributed over multiple nodes the memory requirements per node correspondingly are reduced.
  • The benchmark test cases for the FLUENT module do not have a substantial I/O component. However, performance will be enhanced very substantially by using high performance interconnects such as InfiniBand for inter-node cluster message passing. This nodal message passing data can be stored locally on each node or on a shared file system.

See Also

Current FLUENT 12 Benchmark:
http://www.fluent.com/software/fluent/fl6bench/fl6bench_6.4.x/

Disclosure Statement

All information on the Fluent website is Copyrighted 1995-2009 by Fluent Inc. Results from http://www.fluent.com/software/fluent/fl6bench/ as of November 12, 2009 and this presentation.

Monday Nov 02, 2009

Sun Blade X6275 Cluster Beats SGI Running Fluent Benchmarks

A Sun Blade 6048 Modular System with 8 Sun Blade X6275 Server Modules configured with QDR InfiniBand cluster interconnect delivered outstanding performance running the FLUENT 12 benchmark test suite. Sun consistently delivered the best or near-best results per node across the 6 benchmark tests considered, up to the node counts available for these runs.

  • The Sun Blade X6275 cluster delivered the best results for the truck_poly_14M tests for all Rank counts tested.
  • For this large truck_poly_14m test case, the Sun Blade X6275 cluster beat the best results by SGI by as much as 19%.

  • Of the 54 test cases presented here, the Sun Blade X6275 cluster delivered the best results in 87% of the tests, 47 of the 54 cases.

Performance Landscape


FLUENT 12 Benchmark Test Suite
  Results are "Ratings" (bigger is better)
  Rating = No. of sequential runs of test case possible in 1 day = 86,400/(Total Elapsed Run Time in Seconds)

System Nodes Ranks eddy_417k turbo_500k aircraft_2m sedan_4m truck_14m truck_poly_14m

Sun Blade X6275 16 128 6496.2 19307.3 8408.8 6341.3 1060.1 984.1
Best Intel 16 128 5236.4 (3) 15638.0 (7) 7981.5 (1) 6582.9 (1) 1005.8 (1) 933.0 (1)
Best SGI 16 128 7578.9 (5) 14706.4 (6) 6789.8 (4) 6249.5 (5) 1044.7 (4) 926.0 (4)

Sun Blade X6275 8 64 5308.8 26790.7 5574.2 5074.9 547.2 525.2
Best Intel 8 64 5016.0 (1) 25226.3 (1) 5220.5 (1) 4614.2 (1) 513.4 (1) 490.9 (1)
Best SGI 8 64 5142.9 (4) 23834.5 (4) 4614.2 (4) 4352.6 (4) 529.4 (4) 479.2 (4)

Sun Blade X6275 4 32 3066.5 13768.9 3066.5 2602.4 289.0 270.3
Best Intel 4 32 2856.2 (1) 13041.5 (1) 2837.4 (1) 2465.0 (1) 266.4 (1) 251.2 (1)
Best SGI 4 32 3083.0 (4) 13190.8 (4) 2588.8 (5) 2445.9 (5) 266.6 (4) 246.5 (4)

Sun Blade X6275 2 16 1714.3 7545.9 1519.1 1345.8 144.4 141.8
Best Intel 2 16 1585.3 (1) 7125.8 (1) 1428.1 (1) 1278.6 (1) 134.7 (1) 132.5 (1)
Best SGI 2 16 1708.4 (4) 7384.6 (4) 1507.9 (4) 1264.1 (5) 128.8 (4) 133.5 (4)

Sun Blade X6275 1 8 931.8 4061.1 827.2 681.5 73.0 73.8
Best Intel 1 8 920.1 (2) 3900.7 (2) 784.9 (2) 644.9 (1) 70.2 (2) 70.9 (2)
Best SGI 1 8 953.1 (4) 4032.7 (4) 843.3 (4) 651.0 (4) 71.4 (4) 72.0 (4)

Sun Blade X6275 1 4 550.4 2425.3 533.6 423.0 41.6 41.6
Best Intel 1 4 515.7 (1) 2244.2 (1) 490.8 (1) 392.2 (1) 37.8 (1) 38.4 (1)
Best SGI 1 4 561.6 (4) 2416.8 (4) 526.9 (4) 412.6 (4) 40.9 (4) 40.8 (4)

Sun Blade X6275 1 2 299.6 1328.2 293.9 232.1 21.3 21.6
Best Intel 1 2 274.3 (1) 1201.7 (1) 266.1 (1) 214.2 (1) 18.9 (1) 19.6 (1)
Best SGI 1 2 294.2 (4) 1302.7 (4) 289.0 (4) 226.4 (4) 20.5 (4) 21.2 (4)

Sun Blade X6275 1 1 154.7 682.6 149.1 114.8 9.7 10.1
Best Intel 1 1 143.5 (1) 631.1 (1) 137.4 (1) 106.2 (1) 8.8 (1) 9.0 (1)
Best SGI 1 1 153.3 (4) 677.5 (4) 147.3 (4) 111.2 (4) 10.3 (4) 9.5 (4)

Sun Blade X6275 1 serial 155.6 676.6 156.9 110.0 9.4 10.3
Best Intel 1 serial 146.6 (2) 650.0 (2) 150.2 (2) 105.6 (2) 8.8 (2) 9.7 (2)

    Sun Blade X6275, X5570 QC 2.93 GHz, QDR SMT on / Turbo mode on

    (1) Intel Whitebox (X5560 QC 2.80 GHz, RHEL5, IB)
    (2) Intel Whitebox (X5570 QC 2.93 GHz, RHEL5)
    (3) Intel Whitebox (X5482 QC 3.20 GHz, RHEL5, IB)
    (4) SGI Altix ICE_8200IP95 (X5570 2.93 GHz +turbo, SLES10, IB)
    (5) SGI Altix_ICE_8200IP95 (X5570 2.93 GHz, SLES10, IB)
    (6) SGI Altix_ICE_8200EX (Intel64 QC 3.00 GHz, Linux, IB)
    (7) Qlogic Cluster (X5472 QC 3.00 GHz, RHEL5.2, IB Truescale)

Results and Configuration Summary

Hardware Configuration:

    8 x Sun Blade X6275 Server Module ( Dual-Node Blade, 16 nodes ) each node with
      2 x 2.93GHz Intel X5570 QC processors
      24 GB (6 x 4GB, 1333 MHz DDR3 dimms)
      On-board QDR InfiniBand Host Channel Adapters, QNEM

Software Configuration:

    OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
    Interconnect Software: OFED ver 1.4.1
    Shared File System: Lustre ver 1.8.0.1
    Application: FLUENT V12.0.16
    Benchmark: FLUENT 12 Benchmark Test Suite

Benchmark Description

The benchmark tests are representative of typical large user CFD models intended for execution in distributed memory processor (DMP) mode over a cluster of multi-processor platforms.

Key Points and Best Practices

Observations About the Results

The Sun Blade X6275 cluster delivered excellent performance, especially on the larger models.

These processors include a turbo boost feature coupled with a speedstep option in the CPU section of the advanced BIOS settings. Under specific circumstances, this can provide CPU up-clocking, temporarily increasing the processor frequency from 2.93 GHz to 3.2 GHz.

Memory placement is a very significant factor with Nehalem processors. Current Nehalem platforms have two sockets. Each socket has three memory channels, and each channel has 3 bays for DIMMs. For example, if one DIMM is placed in the 1st bay of each of the 3 channels, the DIMM speed will be 1333 MHz with the X5570s. Altering the DIMM arrangement to an unbalanced configuration, say by adding just one more DIMM into the 2nd bay of one channel, will cause the DIMM frequency to drop from 1333 MHz to 1067 MHz.

About the FLUENT 12 Benchmark Test Suite

The FLUENT application performs computational fluid dynamics analysis on a variety of different types of flow, and allows for chemically reacting species. Analyses can be transient or steady state, and linear or nonlinear.

  • CFD models tend to be very large where grid refinement is required to capture with accuracy conditions in the boundary layer region adjacent to the body over which flow is occurring. Fine grids are required to also determine accurate turbulence conditions. As such these models can run for many hours or even days as well using a large number of processors.
  • CFD models typically scale very well and are very suited for execution on clusters. The FLUENT 12 benchmark test cases scale well.
  • The memory requirements for the test cases in the FLUENT 12 benchmark test suite range from a few hundred megabytes to about 25 GB. As the job is distributed over multiple nodes the memory requirements per node correspondingly are reduced.
  • The benchmark test cases for the FLUENT module do not have a substantial I/O component. However, performance will be enhanced very substantially by using high performance interconnects such as InfiniBand for inter-node cluster message passing. This nodal message passing data can be stored locally on each node or on a shared file system.
  • As a result of the large amount of inter-node message passing, performance can be further enhanced, by more than a factor of 3 as indicated here, by implementing a Lustre-based shared file I/O system.

See Also

FLUENT 12.0 Benchmark:
http://www.fluent.com/software/fluent/fl6bench/fl6bench_6.4.x/

Disclosure Statement

All information on the Fluent website is Copyrighted 1995-2009 by Fluent Inc. Results from http://www.fluent.com/software/fluent/fl6bench/ as of October 20, 2009 and this presentation.

Monday Oct 12, 2009

MCAE ABAQUS faster on Sun F5100 and Sun X4270 - Single Node World Record

The Sun Storage F5100 Flash Array can substantially improve performance over internal hard disk drives as shown by the I/O intensive ABAQUS MCAE application Standard benchmark tests on a Sun Fire X4270 server.

The I/O intensive ABAQUS "Standard" benchmarks test cases were run on a single Sun Fire X4270 server. Data is presented for runs at both 8 and 16 thread counts.

The ABAQUS "Standard" module is an MCAE application based on the finite element analysis (FEA) method. This computer-based numerical method inherently involves a substantial I/O component. The purpose was to evaluate the performance of the Sun Storage F5100 Flash Array relative to high performance 15K RPM internal striped HDDs.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "S4b" test case by 14%.

  • The Sun Fire X4270 server coupled with a Sun Storage F5100 Flash Array established the world record performance on a single node for the four test cases S2A, S4B, S4D and S6.

Performance Landscape

ABAQUS "Standard" Benchmark Test S4B: Advantage of Sun Storage F5100

Results are total elapsed run times in seconds

Threads   4x15K RPM 72 GB SAS HDD   Sun F5100       Sun F5100
          striped HW RAID0          r/w buff 4096   Performance
                                    striped         Advantage
8         1504                      1318            14%
16        1811                      1649            10%

ABAQUS Standard Server Benchmark Subset: Single Node Record Performance

Results are total elapsed run times in seconds

Platform        Cores   S2a   S4b    S4d    S6
X4270 w/F5100   8       302   1192   779    1237
HP BL460c G6    8       324   1309   843    1322
X4270 w/F5100   4       552   1970   1181   1706
HP BL460c G6    4       561   2062   1234   1812

Results and Configuration Summary

Hardware Configuration:
    Sun Fire X4270
      2 x 2.93 GHz QC Intel Xeon X5570 processors
      Hyperthreading enabled
      24 GB memory
      4 x 72 GB 15K RPM striped (HW RAID0) SAS disks
    Sun Storage F5100 Flash Array
      20 x 24 GB flash modules
      Intel controller

Software Configuration:

    O/S: 64-bit SUSE Linux Enterprise Server 10 SP 2
    Application: ABAQUS V6.9-1 Standard Module
    Benchmark: ABAQUS Standard Benchmark Test Suite

Benchmark Description

Abaqus/Standard Benchmark Problems

These problems provide an estimate of the performance that can be expected when running Abaqus/Standard or similar commercially available MCAE (FEA) codes like ANSYS and MSC/Nastran on different computers. The jobs are representative of those typically analyzed by Abaqus/Standard and other MCAE applications. These analyses include linear statics, nonlinear statics, and natural frequency extraction.

Please go here for a more complete description of the tests.

Key Points and Best Practices

  • The memory requirements for the test cases in the ABAQUS Standard benchmark test suite are rather substantial, with some of the test cases requiring slightly over 20GB of memory. There are two memory limits: a minimum limit, below which time-consuming out-of-core "memory" will be used, and a maximum memory limit that minimizes I/O operations. These memory limits are given in the ABAQUS output and can be established before making a full execution in a preliminary diagnostic mode run.
  • Based on the maximum physical memory on a platform, the user can stipulate the maximum portion of this memory that can be allocated to the ABAQUS job. This is done in the "abaqus_v6.env" file that resides either in the subdirectory from which the job was launched or in the abaqus "site" subdirectory under the home installation directory.
  • Sometimes when running multiple cores on a single node, it is preferable from a performance standpoint to run in "smp" shared memory mode. This is specified using the "THREADS" option on the "mpi_mode" line in the abaqus_v6.env file, as opposed to the "MPI" option on this line. The test case considered here illustrates this point; see the sketch after this list.
  • The test cases for the ABAQUS standard module all have a substantial I/O component, where 15% to 25% of the total run times are associated with I/O activity (primarily scratch files). Performance will be enhanced by using the fastest available drives and striping together more than one of them, or by using a high performance disk storage system with high performance interconnects. On Linux, excess memory can be used to cache and accelerate I/O.
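
The two environment-file settings above can be sketched as follows. This is a minimal illustration only: the memory figure is hypothetical, and exact parameter names and accepted values vary by ABAQUS release, so consult the release documentation before relying on it.

    # abaqus_v6.env -- hypothetical fragment for a 24 GB, 8-core server
    # Cap the memory the ABAQUS job may allocate (illustrative value)
    memory = "19 gb"
    # Run multi-core single-node jobs in shared-memory ("smp") mode,
    # i.e. THREADS rather than MPI on the mpi_mode line
    mpi_mode = THREADS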

See Also

Disclosure Statement

The following are trademarks or registered trademarks of Abaqus, Inc. or its subsidiaries in the United States and/or other countries: Abaqus, Abaqus/Standard, Abaqus/Explicit. All information on the ABAQUS website is Copyrighted 2004-2009 by Dassault Systemes. Results from http://www.simulia.com/support/v69/v69_performance.php as of October 12, 2009.

MCAE ANSYS faster on Sun F5100 and Sun X4270

Significance of Results

The Sun Storage F5100 Flash Array can greatly improve performance over internal hard disk drives as shown by the I/O intensive ANSYS MCAE application BMD benchmark tests on a Sun Fire X4270 server.

Select ANSYS 12 BMD benchmarks were run on a single Sun Fire X4270 server. These I/O intensive test cases were run to compare the performance of conventional high performance disk to Sun FlashFire technology.

The ANSYS 12.0 module is an MCAE application based on the finite element analysis (FEA) method. This computer-based numerical method inherently involves a substantial I/O component. The purpose was to evaluate the performance of the Sun Storage F5100 Flash Array relative to high performance 15K RPM internal striped HDDs.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "BMD-4" test case by 67% in the 8-core/8-thread server configuration.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "BMD-7" test case by 18% in the 8-core/16-thread server configuration.

Performance Landscape

ANSYS 12 "BMD" Test Suite on Single X4270 (24GB mem.) - SMP Mode

Results are total elapsed run times in seconds

Test Case   SMP   4x15K RPM 72 GB SAS HDD   Sun F5100       Sun F5100
                  striped HW RAID0          r/w buff 4096   Performance
                                            striped         Advantage
bmd-4       8     523                       314             67%
bmd-7       16    357                       303             18%

Results and Configuration Summary

Hardware Configuration:
    Sun Fire X4270
      2 x 2.93 GHz QC Intel Xeon X5570 processors
      Hyperthreading enabled
      24 GB memory
      4 x 72 GB 15K RPM striped (HW RAID0) SAS disks
    Sun Storage F5100 Flash Array
      20 x 24 GB flash modules
      Intel controller

Software Configuration:

    O/S: 64-bit SUSE Linux Enterprise Server 10 SP 2
    Application: ANSYS Multiphysics 12.0
    Benchmark: ANSYS 12 "BMD" Benchmark Test Suite

Benchmark Description

ANSYS is a general purpose engineering analysis MCAE application that is based on the Finite Element Method. It performs both structural (stress) analysis and thermal analysis. These analyses may be either static or transient dynamic and can be linear or nonlinear as far as material behavior or deformations are concerned. ANSYS provides a number of benchmark tests which exercise the capabilities of the software.

Please go here for a more complete description of the tests.

Key Points and Best Practices

Performance Considerations

The performance of ANSYS (an I/O-intensive MCAE application) can be increased by reducing its I/O demands: either by increasing server memory or by using SSDs to increase bandwidth and reduce latency. The most I/O intensive case in the ANSYS distributed "BMD" test suite is BMD-4, particularly at the (maximum) 8-core level for a single node.


  • ANSYS now takes full advantage of inexpensive RAID0 disk arrays and delivers sustained I/O rates.

  • Large memory can cache file accesses but often the size of ANSYS files grows much larger than the available physical memory so that system file caching is not able to hide the I/O cost.
  • For fast ANSYS runs the recommended configuration is a RAID 0 setup using 4 or more disks and a fast RAID controller. These fast I/O configurations are inexpensive to put together and can achieve I/O rates in excess of 200 MB/sec; a sketch of such a setup follows this list.
  • SSD drives have much lower seek times, use less power, and tend to be about 2X faster than the fastest rotating disks for sustained throughput. The observed speed of a RAID 0 configuration of SSD drives for ANSYS simulations has been nearly as fast as I/O that is cached by large memory systems. SSD drives then may be the most affordable way to extend the capacity of a system to jobs that are too large to run in-core without incurring the performance penalty usually associated with I/O demands.
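
For concreteness, the RAID 0 recommendation above might be realized on Linux with mdadm along these lines. This is a sketch only; the device names, filesystem, and mount point are illustrative assumptions, not part of the benchmark configuration.

    # Assemble four disks into a striped (RAID 0) scratch volume
    mdadm --create /dev/md0 --level=0 --raid-devices=4 \
        /dev/sdb /dev/sdc /dev/sdd /dev/sde
    # Create a filesystem and mount it as the ANSYS scratch area
    mkfs.ext3 /dev/md0
    mkdir -p /ansys_scratch
    mount /dev/md0 /ansys_scratch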

More About The ANSYS BMD "Distributed" Benchmarks

ANSYS is a general purpose engineering analysis MCAE application that is based on the Finite Element Method. It performs both structural (stress) analysis and thermal analysis. These analyses may be either static or transient dynamic and can be linear or nonlinear as far as material behavior or deformations are concerned.

In the most recent release of the ANSYS benchmarks there are now two test suites: The SMP "BM" suite designed to run on a single node with multi processors and the DMP "BMD" suite intended to run on multi node clusters but which can also run on a single node in SMP mode as in this study.

  • The test cases from both ANSYS test suites all have a substantial I/O component, where 15% to 20% of the total run times are associated with I/O activity (primarily scratch files). Performance will be enhanced by using the fastest available drives and striping together more than one of them, or by using a high performance disk storage system with high performance interconnects. When running with the Solaris x64 (SX64) build, employing ZFS might be a good idea.
  • The ANSYS test cases do not scale very well (the BMD suite scales better than BM); at best up to 8 cores.
  • The memory requirements for the test cases in the ANSYS BMD suite are greater than for the standard benchmark test suite. The requirements for the standard suite are modest, at less than 3GB.

See Also

MCAE, SSD, HPC, ANSYS, Linux, SuSE, Performance, X64, Intel

Disclosure Statement

The following are trademarks or registered trademarks of ANSYS, Inc., ANSYS Multiphysics TM. All information on the ANSYS website is Copyrighted by ANSYS, Inc. Results from http://www.ansys.com/services/ss-intel-bench120.htm as of October 12, 2009.

MCAE MSC/NASTRAN faster on Sun F5100 and Fire X4270

Significance of Results

The Sun Storage F5100 Flash Array can double performance over internal hard disk drives as shown by the I/O intensive MSC/Nastran MCAE application MDR3 benchmark tests on a Sun Fire X4270 server.

The MD Nastran MDR3 benchmarks were run on a single Sun Fire X4270 server. The I/O intensive test cases were run at different core levels from one up to the maximum of 8 available cores in SMP mode.

The MSC/Nastran MD 2008 R3 module is an MCAE application based on the finite element analysis (FEA) method. This computer-based numerical method inherently involves a substantial I/O component. The purpose was to evaluate the performance of the Sun Storage F5100 Flash Array relative to high performance 15K RPM internal striped HDDs.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "xx0cmd2" test case by 107% in the 8-core server configuration.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "xl0tdf1" test case by 85% in the 8-core server configuration.

The MD Nastran MDR3 test suite was designed to include some very I/O intensive test cases, albeit some are not very scalable. These cases are called "xx0wmd0" and "xx0xst0". Both were run, and results are presented, using a single-core server configuration.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "xx0xst0" test case by 33% in the single-core server configuration.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "xx0wmd0" test case by 20% in the single-core server configuration.

Performance Landscape

MD Nastran MDR3 Benchmark Tests

Results in seconds

Test Case   DMP   4x15K RPM 72 GB SAS HDD   Sun F5100       Sun F5100
                  striped HW RAID0          r/w buff 4096   Performance
                                            striped         Advantage
xx0cmd2     8       959                       463           107%
xl0tdf1     8      1104                       596            85%
xx0xst0     1      1307                       980            33%
xx0wmd0     1     20250                     16806            20%

Results and Configuration Summary

Hardware Configuration:
    Sun Fire X4270
      2 x 2.93 GHz QC Intel Xeon X5570 processors
      24 GB memory
      4 x 72 GB 15K RPM striped (HW RAID0) SAS disks
    Sun Storage F5100 Flash Array
      20 x 24 GB flash modules
      Intel controller

Software Configuration:

    O/S: 64-bit SUSE Linux Enterprise Server 10 SP 2
    Application: MSC/NASTRAN MD 2008 R3
    Benchmark: MDR3 Benchmark Test Suite
    HP MPI: 02.03.00.00 [7585] Linux x86-64

Benchmark Description

The benchmark tests are representative of typical MSC/Nastran applications including both SMP and DMP runs involving linear statics, nonlinear statics, and natural frequency extraction.

The MD (Multi Discipline) Nastran 2008 application performs both structural (stress) analysis and thermal analysis. These analyses may be either static or transient dynamic and can be linear or nonlinear as far as material behavior and/or deformations are concerned. The new release includes the MARC module for general purpose nonlinear analyses and the Dytran module that employs an explicit solver to analyze crash and high velocity impact conditions.

Please go here for a more complete description of the tests.

Key Points and Best Practices

  • Based on the maximum physical memory on a platform, the user can stipulate the maximum portion of this memory that can be allocated to the Nastran job. This is done on the command line with the mem= option. On Linux based systems where the platform has a large amount of memory and where the model does not have large scratch I/O requirements, the memory can be allocated to a tmpfs scratch space file system, as sketched after this list. On Solaris X64 systems advantage can be taken of ZFS for higher I/O performance.

  • The MD Nastran MDR3 test cases don't scale very well, a few not at all and the rest on up to 8 cores at best.

  • The test cases for the MSC/Nastran module all have a substantial I/O component where 15% to 25% of the total run times are associated with I/O activity (primarily scratch files). The required scratch file size ranges from less than 1 GB on up to about 140 GB. Performance will be enhanced by using the fastest available drives and striping together more than one of them or using a high performance disk storage system, further enhanced as indicated here by implementing the Lustre based I/O system. High performance interconnects such as InfiniBand for inter node cluster message passing as well as I/O transfer from the storage system can also enhance performance substantially.
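
Putting the mem= and tmpfs points together, a run could be staged as in the following sketch. Only the mem= option comes from the text above; the mount size, scratch path, and the sdirectory= keyword are illustrative assumptions.

    # Stage Nastran scratch I/O on a RAM-backed tmpfs filesystem
    mkdir -p /nastran_scratch
    mount -t tmpfs -o size=16g tmpfs /nastran_scratch
    # Run a test case with an explicit memory cap and the tmpfs scratch area
    nastran xx0cmd2.dat mem=16gb sdirectory=/nastran_scratch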

See Also

Disclosure Statement

MSC.Software is a registered trademark of MSC. All information on the MSC.Software website is copyrighted. MD Nastran MDR3 results from http://www.mscsoftware.com and this report as of October 12, 2009.

Friday Oct 09, 2009

X6275 Cluster Demonstrates Performance and Scalability on WRF 2.5km CONUS Dataset

Significance of Results

Results are presented for the Weather Research and Forecasting (WRF) code running on twelve Sun Blade X6275 server modules, housed in the Sun Blade 6048 chassis, using the 2.5 km CONUS benchmark dataset.

  • The Sun Blade X6275 cluster was able to achieve 373 GFLOP/s on the CONUS 2.5-KM Dataset.
  • The results demonstrate a 91% speedup efficiency, or 11x speedup, from 1 to 12 blades.
  • The current results were run with turbo mode on.

Performance Landscape

Performance is expressed in terms of "simulation speedup", which is the ratio of the simulated time step per iteration to the average wall clock time required to compute it. A larger number implies better performance.

The current results were run with turbo mode on.

WRF 3.0.1.1: Weather Research and Forecasting CONUS 2.5-KM Dataset

#        #       #       #       Performance            Computation Rate      Speedup/Efficiency          Turbo On
Blades   Nodes   Procs   Cores   (Simulation Speedup)   (GFLOP/sec)           (vs. 1 blade)               Relative Perf
                                 Turbo On   Turbo Off   Turbo On   Turbo Off  Turbo On      Turbo Off
12       24      48      192     13.58      12.93       373.0      355.1      11.0 / 91%    10.4 / 87%    +6%
 8       16      32      128      9.27      -           254.6      -           7.5 / 93%    -             -
 6       12      24       96      7.03      6.60        193.1      181.3       5.7 / 94%     5.3 / 89%    +7%
 4        8      16       64      4.74      -           130.2      -           3.8 / 96%    -             -
 2        4       8       32      2.44      -            67.0      -           2.0 / 98%    -             -
 1        2       4       16      1.24      1.24         34.1       34.1       1.0 / 100%    1.0 / 100%   +0%

Results and Configuration Summary

Hardware Configuration:

    Sun Blade 6048 Modular System
      12 x Sun Blade X6275 Server Modules, each with
        4 x 2.93 GHz Intel QC X5570 processors
        24 GB (6 x 4GB)
        QDR InfiniBand
        HT disabled in BIOS
        Turbo mode enabled in BIOS

Software Configuration:

    OS: SUSE Linux Enterprise Server 10 SP 2
    Compiler: PGI 7.2-5
    MPI Library: Scali MPI v5.6.4
    Benchmark: WRF 3.0.1.1
    Support Library: netCDF 3.6.3

Benchmark Description

The Weather Research and Forecasting (WRF) Model is a next-generation mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs. WRF is designed to be a flexible, state-of-the-art atmospheric simulation system that is portable and efficient on available parallel computing platforms. It features multiple dynamical cores, a 3-dimensional variational (3DVAR) data assimilation system, and a software architecture allowing for computational parallelism and system extensibility.

Dataset used:

    Single domain, large size 2.5KM Continental US (CONUS-2.5K)

    • 1501x1201x35 cell volume
    • 6hr, 2.5km resolution dataset from June 4, 2005
    • Benchmark is the final 3hr simulation for hrs 3-6 starting from a provided restart file; the benchmark may also be performed (but seldom reported) for the full 6hrs starting from a cold start
    • Iterations output at every 15 sec of simulation time, with the computation cost of each time step ~412 GFLOP

Key Points and Best Practices

  • Processes were bound to processors in round-robin fashion.
  • Model simulation time is 15 seconds per iteration as defined in input job deck. An achieved speedup of 2.67X means that each model iteration of 15s of simulation time was executed in 5.6s of real wallclock time (on average).
  • Computational requirements are 412 GFLOP per simulation time step as (measured empirically and) documented on the UCAR web site for this data model.
  • Model was run as single MPI job.
  • Benchmark was built and run as a pure-MPI variant. With larger process counts building and running WRF as a hybrid OpenMP/MPI variant may be more efficient.
  • Input and output (netCDF format) datasets can be very large for some WRF data models, and run times will generally benefit from using a scalable filesystem. Performance with very large datasets (>5GB) can benefit from enabling WRF quilting of I/O across designated processors/servers; a sketch follows this list. The master thread (or rank-0) performs most of the I/O (unless quilting specifies otherwise), with all processes potentially generating some I/O.
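
Quilting is configured in the WRF namelist.input file. A minimal sketch, with purely illustrative task counts, might look like:

    &namelist_quilt
     nio_tasks_per_group = 2,
     nio_groups = 1,
    /

With these settings, two ranks per group are dedicated to collecting and writing output, leaving the remaining ranks free to continue computing.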

See Also

Disclosure Statement

WRF, CONUS-2.5K, see http://www.mmm.ucar.edu/wrf/WG2/bench/, results as of 9/21/2009.

Monday Jul 06, 2009

Sun Blade 6048 Chassis with Sun Blade X6275: RADIOSS Benchmark Results

Significance of Results

The Sun Blade X6275 cluster, equipped with 2.93 GHz Intel QC X5570 processors and QDR InfiniBand interconnect, delivered the best performance at 32, 64 and 128 cores for the RADIOSS Neon_1M and Taurus_Frontal benchmarks.

  • Using half the nodes (16), the Sun Blade X6275 cluster was 3% faster than the 32-node SGI cluster running the Neon_1M test case.
  • In the 128-core configuration, the Sun Blade X6275 cluster was 49% faster than the SGI cluster running the Neon_1M test case.
  • In the 128-core configuration, the Sun Blade X6275 cluster was 16% faster than the top SGI cluster running the Taurus_Frontal test case.
  • At both the 32- and 64-core levels, the Sun Blade X6275 cluster was 60% faster than the SGI cluster running the Neon_1M test case.
  • At both the 32- and 64-core levels, the Sun Blade X6275 cluster was 4% faster than the SGI cluster running the Taurus_Frontal test case.

Performance Landscape


RADIOSS Public Benchmark Test Suite
  Results are Total Elapsed Run Times (sec.)

System                                           Cores   TAURUS_FRONTAL   NEON_1M   NEON_300K
                                                         1.8M             1.06M     277K

SGI Altix ICE 8200 IP95 2.93GHz, 32 nodes, DDR    256     3559             1672       310

Sun Blade X6275 2.93GHz, 16 nodes, QDR            128     4397             1627       361
SGI Altix ICE 8200 IP95 2.93GHz, 16 nodes, DDR    128     5033             2422       360

Sun Blade X6275 2.93GHz, 8 nodes, QDR              64     5934             2526       587
SGI Altix ICE 8200 IP95 2.93GHz, 8 nodes, DDR      64     6181             4088       584

Sun Blade X6275 2.93GHz, 4 nodes, QDR              32     9764             4720      1035
SGI Altix ICE 8200 IP95 2.93GHz, 4 nodes, DDR      32    10120             7574      1017

Results and Configuration Summary

Hardware Configuration:
    8 x Sun Blade X6275
    2x2.93 GHz Intel QC X5570 processors, turbo enabled (per half blade)
    24 GB (6 x 4GB 1333 MHz DDR3 dimms)
    InfiniBand QDR interconnects

Software Configuration:

    OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
    Application: RADIOSS V9.0 SP 1
    Benchmark: RADIOSS Public Benchmark Test Suite

Benchmark Description

Altair has provided a suite of benchmarks to demonstrate the performance of RADIOSS. The initial set of benchmarks provides four automotive crash models. Future updates will add in marine and aerospace applications, as well as including automotive NVH applications. The benchmarks use real data, requiring double precision computations and the parith feature (Parallel arithmetic algorithm) to obtain exactly the same results whatever the number of processors used.

Please go here for a more complete description of the tests.

Key Points and Best Practices

The Intel QC X5570 processors include a turbo boost feature coupled with a SpeedStep option in the CPU section of the Advanced BIOS settings. Under specific circumstances, this can provide CPU overclocking which increases the processor frequency from 2.93GHz to 3.2GHz. This feature was enabled when generating the results reported here.

Node to Node MPI ping-pong tests show a bandwidth of 3000 MB/sec on the Sun Blade X6275 cluster using QDR. The same tests performed on a Sun Fire X2270 cluster and equipped with DDR interconnect produced a bandwidth of 1500 MB/sec. On another recent Intel based Sun Fire X2250 cluster (3.4 GHz DC E5272 processors) also equipped with DDR interconnects, the bandwidth was 1250 MB/sec. This same Sun Fire X2250 cluster equipped with SDR IB interconnect produced an MPI ping-pong bandwidth of 975 MB/sec.
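
As a hedged illustration, a two-node ping-pong measurement of this kind can be made with a standard suite such as the Intel MPI Benchmarks; the hostfile and binary path below are assumptions, and launcher flags vary by MPI implementation (Scali MPI ships its own launcher options).

    # Run a two-rank ping-pong across two nodes (illustrative invocation)
    mpirun -np 2 -machinefile ./two_node_hosts ./IMB-MPI1 PingPong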

See Also

Current RADIOSS Benchmark Results:
http://www.altairhyperworks.com/Benchmark.aspx

Disclosure Statement

All information on the Altair website is Copyright 2009 Altair Engineering, Inc. All Rights Reserved. Results from http://www.altairhyperworks.com/Benchmark.aspx

Friday Jun 26, 2009

Sun Fire X2270 Cluster Fluent Benchmark Results

Significance of Results

A Sun Fire X2270 cluster equipped with 2.93 GHz QC Intel X5570 processors and DDR InfiniBand interconnects delivered outstanding performance running the FLUENT benchmark test suite.

  • The Sun Fire X2270 cluster running at 64-cores delivered the best performance for the 3 largest test cases. On the "truck" workload Sun was 14% faster than SGI Altix ICE 8200.
  • The Sun Fire X2270 cluster running at 32-cores delivered the best performance for 5 of the 6 test cases.
  • The Sun Fire X2270 cluster running at 16-cores beat all comers in all 6 test cases.

Performance Landscape


New FLUENT Benchmark Test Suite
  Results are "Ratings" (bigger is better)
  Rating = number of sequential runs of a test case possible in 1 day = 86,400/(Total Elapsed Run Time in Seconds). For example, an elapsed time of 174.6 seconds yields a rating of 86,400/174.6 ≈ 494.8.
  Results ordered by truck_poly column

System (1)                        cores   eddy    turbo    aircraft   sedan   truck   truck_poly
                                          417k    500k     2m         4m      14m     14m

Sun Fire X2270, 8 node 64 4645.2 23671.2 3445.7 4909.1 566.9 494.8
Intel Endeavor, 8 node 64 5016.0 25226.3 5220.5 4614.2 513.4 490.9
SGI Altix ICE 8200 IP95, 8 node 64 5142.9 23834.5 4614.2 4352.6 496.8 479.2

Sun Fire X2270, 4-node 32 2971.6 13824.0 3074.7 2644.2 291.8 271.8
Intel Endeavor, 4-node 32 2856.2 13041.5 2837.4 2465.0 266.4 251.2
SGI Altix ICE 8200 IP95, 4-node 32 3083.0 13190.8 2563.8 2405.0 266.6 246.5
Sun Fire X2250, 8-node 32 2095.8 9600.0 1844.2 1394.1 203.2 196.8

Sun Fire X2270, 2-node 16 1726.3 7595.6 1520.5 1363.3 145.5 141.8
SGI Altix ICE 8200 IP95, 2-node 16 1708.4 7384.6 1507.9 1211.8 128.8 133.5
Intel Endeavor, 2-node 16 1585.3 7125.8 1428.1 1278.6 134.7 132.5
Sun Fire X2250, 4-node 16 1404.9 6249.5 1324.6 996.3 127.7 129.2

Sun Fire X2270, 1-node 8 945.8 4129.0 883.0 682.5 73.5 72.4
SGI Altix ICE 8200 IP95, 1-node 8 953.1 4032.7 843.3 651.0 71.4 72.0
Sun Fire X2250, 2-node 8 824.2 3248.1 711.4 517.9 66.1 67.9

SGI Altix ICE 8200 IP95, 1-node 4 561.6 2416.8 526.9 412.6 40.9 40.8
Sun Fire X2270, 1-node 4 541.5 2346.2 515.7 409.3 40.8 40.2
Sun Fire X2250, 1-node 4 449.2 1691.6 389.0 271.8 33.6 34.9

Sun Fire X2270, 1-node 2 292.8 1282.4 283.4 223.1 20.9 21.2
SGI Altix ICE 8200 IP95, 1-node 2 294.2 1302.7 289.0 226.4 20.5 21.2
Sun Fire X2250, 1-node 2 224.4 881.0 197.9 134.4 16.3 17.6

Sun Fire X2270, 1-node 1 150.7 658.3 143.2 110.1 10.2 10.6
SGI Altix ICE 8200 IP95, 1-node 1 153.3 677.5 147.3 111.2 10.3 9.5
Sun Fire X2250, 1-node 1 115.4 458.2 100.1 66.6 8.0 9.0

Sun Fire X2270, 1-node serial 151.4 656.7 151.3 107.1 9.3 10.1
Intel Endeavor, 1-node serial 146.6 650.0 150.2 105.6 8.8 9.7
Sun Fire X2250, 1-node serial 115.2 461.7 101.0 65.0 7.2 9.0

(1) SGI Altix ICE 8200, X5570 QC 2.93GHz, DDR
Intel Endeavor, X5560 QC 2.8GHz, DDR
Sun Fire X2250, X5272 DC 3.4GHz, DDR IB
Sun Fire X2270, X5570 QC 2.93GHz, DDR

Results and Configuration Summary

Hardware Configuration

    8 x Sun Fire X2270 (each with)
    2 x 2.93GHz Intel X5570 QC processors (Nehalem)
    1333 MHz DDR3 dimms
    Infiniband (Voltaire) DDR interconnects & DDR switch, IB

Software Configuration

    OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
    Interconnect software: Voltaire OFED GridStack-5.1.3.1_5
    Application: FLUENT Beta V12.0.15
    Benchmark: FLUENT "6.3" Benchmark Test Suite

Benchmark Description

The benchmark tests are representative of typical large user CFD models intended for execution in distributed memory processor (DMP) mode over a cluster of multi-processor platforms.

Please go here for a more complete description of the tests.

Key Points and Best Practices

Observations About the Results

The Sun Fire X2270 cluster delivered excellent performance, especially shining with the larger problems (truck and truck_poly).

These processors include a turbo boost feature coupled with a SpeedStep option in the CPU section of the Advanced BIOS settings. Under specific circumstances, this can provide CPU overclocking, temporarily increasing the processor frequency from 2.93GHz to 3.2GHz.

Memory placement is a very significant factor with Nehalem processors. Current Nehalem platforms have two sockets. Each socket has three memory channels and each channel has 3 bays for DIMMs. For example, if one DIMM is placed in the 1st bay of each of the 3 channels, the DIMM speed with the X5570's will be 1333 MHz. Altering the DIMM arrangement to an off-balance configuration, by say adding just one more DIMM into the 2nd bay of one channel, will cause the DIMM frequency to drop from 1333 MHz to 1067 MHz.

About the FLUENT "6.3" Benchmark Test Suite

The FLUENT application performs computational fluid dynamics analysis on a variety of different types of flow and allows for chemically reacting species.

  • CFD models tend to be very large where grid refinement is required to capture with accuracy conditions in the boundary layer region adjacent to the body over which flow is occurring. Fine grids are required to also determine accurate turbulence conditions. As such these models can run for many hours or even days as well using a large number of processors.
  • CFD models typically scale very well and are very suited for execution on clusters. The FLUENT "6.3" benchmark test cases scale well particularly up to 64 cores.
  • The memory requirements for the test cases in the new FLUENT "6.3" benchmark test suite range from a few hundred megabytes to about 25 GB. As the job is distributed over multiple nodes the memory requirements per node correspondingly are reduced.
  • The benchmark test cases for the FLUENT module do not have a substantial I/O component. However, performance will be enhanced very substantially by using high performance interconnects such as InfiniBand for inter-node cluster message passing. This nodal message passing data can be stored locally on each node or on a shared file system.
  • As a result of the large amount of inter-node message passing, performance can be further enhanced, by more than a factor of 3 as indicated here, by implementing the Lustre-based shared file I/O system.

See Also

Current FLUENT "12.0 Beta" Benchmark:
http://www.fluent.com/software/fluent/fl6bench/fl6bench_6.4.x/

Disclosure Statement

All information on the Fluent website is Copyrighted 1995-2009 by Fluent Inc. Results from http://www.fluent.com/software/fluent/fl6bench/ as of June 9, 2009 and this presentation.

Tuesday Jun 16, 2009

Sun Fire X2270 MSC/Nastran Vendor_2008 Benchmarks

Significance of Results

The I/O intensive MSC/Nastran Vendor_2008 benchmark test suite was used to compare the performance on a Sun Fire X2270 server when using SSDs internally instead of HDDs.

The effect on performance from increasing memory to augment I/O caching was also examined. The Sun Fire X2270 server was equipped with Intel QC Xeon X5570 processors (Nehalem). The positive effect of adding memory to increase I/O caching is offset to some degree by the reduction in memory frequency with additional DIMMs in the bays of each memory channel on each cpu socket for these Nehalem processors.

  • SSDs can significantly improve NASTRAN performance especially on runs with larger core counts.
  • Additional memory in the server can also increase performance; however, in some systems additional memory can decrease memory frequency, which may offset the benefits of increased capacity.
  • If SSDs are not used, striped disks will often improve the performance of I/O-bound MCAE applications.
  • To obtain the highest performance it is recommended that SSDs be used and that servers be configured with the largest memory possible without decreasing memory frequency. One should always look at the workload characteristics and compare against this benchmark to correctly set expectations.

SSD vs. HDD Performance

The performance of two striped 30GB SSDs was compared to two striped 7200 rpm 500GB SATA drives on a Sun Fire X2270 server.

  • At the 8-core level (maximum cores for a single node) SSDs were 2.2x faster for the larger xxocmd2 and the smaller xlotdf1 cases.
  • For 1-core results SSDs are up to 3% faster.
  • On the smaller mdomdf1 test case there was no increase in performance on the 1-, 2-, and 4-cores configurations.

Performance Enhancement with I/O Memory Caching

Performance for Nastran can often be increased by additional memory to provide additional in-core space to cache I/O and thereby reduce the IO demands.

The main memory was doubled from 24GB to 48GB. At the 24GB level one 4GB DIMM was placed in the first bay of each of the 3 CPU memory channels on each of the two CPU sockets on the Sun Fire X2270 platform. This configuration allows a memory frequency of 1333MHz.

At the 48GB level a second 4GB DIMM was placed in the second bay of each of the 3 CPU memory channels on each socket. This reduces the memory frequency to 1066MHz.

Adding Memory With HDDs (SATA)

  • The additional server memory increased the performance when running with the slower SATA drives at the higher core levels (e.g. 4- & 8-cores on a single node)
  • The larger xxocmd2 case was 42% faster and the smaller xlotdf1 case was 32% faster at the maximum 8-core level on a single system.
  • The special I/O intensive getrag case was 8% faster at the 1-core level.

Adding Memory With SSDs

  • At the maximum 8-core level (for a single node) the larger xxocmd2 case was 47% faster in overall run time.
  • The effects were much smaller at lower core counts and in the tests at the 1-core level most test cases ran from 5% to 14% slower with the slower CPU memory frequency dominating over the added in-core space available for I/O caching vs. direct transfer to SSD.
  • Only the special I/O intensive getrag case was an exception running 6% faster at the 1-core level.

Increasing performance with Two Striped (SATA) Drives

The performance of multiple striped drives was also compared to a single drive. The study compared two striped internal 7200 rpm 500GB SATA drives to a single internal SATA drive.

  • On a single node with 8 cores, the largest test xx0cmd2 was 40% faster, a smaller test case xl0tdf1 was 33% faster and even the smallest test case mdomdf1 case was 12% faster.

  • On 1-core the added boost in performance with striped disks was from 4% to 13% on the various test cases.

  • On 1-core the special I/O-intensive test case getrag was 29% faster.

Performance Landscape

Times in table are total elapsed run times (sec).


MSC/Nastran Vendor_2008 Benchmark Test Suite
Sun Fire X2270, 2 x X5570 QC 2.93 GHz

                   ----------- 2 x 7200 RPM SATA HDDs -----------    ----------- 2 x SSDs -----------
Test        Cores  48 GB     24 GB     24 GB     Ratio      Ratio    48 GB     24 GB     Ratio   Ratio (24GB)
                   1067MHz   2 SATA    1 SATA    (2xSATA)   2xSATA/  1067MHz   1333MHz   48GB/   2xSATA/
                             1333MHz   1333MHz   48GB/24GB  1xSATA                       24GB    2xSSD

vlosst1       1      133       127       134     1.05       0.95       133       126     1.05    1.01

xxocmd2       1      946       895       978     1.06       0.87       947       884     1.07    1.01
              2      622       614       703     1.01       0.87       600       583     1.03    1.05
              4      466       631       991     0.74       0.64       426       404     1.05    1.56
              8     1049      1554      2590     0.68       0.60       381       711     0.53    2.18

xlotdf1       1     2226      2000      2081     1.11       0.96      2214      1939     1.14    1.03
              2     1307      1240      1308     1.05       0.95      1315      1189     1.10    1.04
              4      858       833      1030     1.03       0.81       744       751     0.99    1.11
              8      912      1562      2336     0.58       0.67       674       712     0.95    2.19

xloimf1       1     1216      1151      1236     1.06       0.93      1228      1290     0.95    0.89

mdomdf1       1      987       913       983     1.08       0.93       987       911     1.08    1.00
              2      524       485       520     1.08       0.93       524       484     1.08    1.00
              4      270       237       269     1.14       0.88       270       250     1.08    0.95

Sol400_1      1     2555      2479      2674     1.03       0.93      2549      2402     1.06    1.03
(xl1fn40_1)

Sol400_S      1     2450      2302      2481     1.06       0.93      2449      2262     1.08    1.02
(xl1fn40_S)

getrag        1      778       843      1178     0.92       0.71       771       817     0.94    1.03
(xx0xst0)

Results and Configuration Summary

Hardware Configuration:
    Sun Fire X2270
      a 2-socket rack-mounted server
      2 x 2.93 GHz QC Intel Xeon X5570 processors
      2 x internal striped SSDs
      2 x internal striped 7200 rpm 500GB SATA drives

Software Configuration:

    O/S: Linux 64-bit SUSE SLES 10 SP 2
    Application: MSC/NASTRAN MD 2008
    Benchmark: MSC/NASTRAN Vendor_2008 Benchmark Test Suite
    HP MPI: 02.03.00.00 [7585] Linux x86-64
    Voltaire OFED-5.1.3.1_5 GridStack for SLES 10

Benchmark Description

The benchmark tests are representative of typical MSC/Nastran applications including both SMP and DMP runs involving linear statics, nonlinear statics, and natural frequency extraction.

The MD (Multi Discipline) Nastran 2008 application performs both structural (stress) analysis and thermal analysis. These analyses may be either static or transient dynamic and can be linear or nonlinear as far as material behavior and/or deformations are concerned. The new release includes the MARC module for general purpose nonlinear analyses and the Dytran module that employs an explicit solver to analyze crash and high velocity impact conditions.

  • As of the Summer '08 there is now an official Solaris X64 version of the MD Nastran 2008 system that is certified and maintained.
  • The memory requirements for the test cases in the new MSC/Nastran Vendor 2008 benchmark test suite range from a few hundred megabytes to no more than 5 GB.

Please go here for a more complete description of the tests.

Key Points and Best Practices

For more on Best Practices of SSD on HPC applications also see the Sun Blueprint:
http://wikis.sun.com/display/BluePrints/Solid+State+Drives+in+HPC+-+Reducing+the+IO+Bottleneck

Additional information on the MSC/Nastran Vendor 2008 benchmark test suite.

  • Based on the maximum physical memory on a platform the user can stipulate the maximum portion of this memory that can be allocated to the Nastran job. This is done on the command line with the mem= option. On Linux based systems where the platform has a large amount of memory and where the model does not have large scratch I/O requirements the memory can be allocated to a tmpfs scratch space file system. On Solaris X64 systems advantage can be taken of ZFS for higher I/O performance.

  • The MSC/Nastran Vendor 2008 test cases don't scale very well, a few not at all and the rest on up to 8 cores at best.

  • The test cases for the MSC/Nastran module all have a substantial I/O component where 15% to 25% of the total run times are associated with I/O activity (primarily scratch files). The required scratch file size ranges from less than 1 GB on up to about 140 GB. Performance will be enhanced by using the fastest available drives and striping together more than one of them or using a high performance disk storage system, further enhanced as indicated here by implementing the Lustre based I/O system. High performance interconnects such as Infiniband for inter node cluster message passing as well as I/O transfer from the storage system can also enhance performance substantially.

See Also

Disclosure Statement

MSC.Software is a registered trademark of MSC. All information on the MSC.Software website is copyrighted. MSC/Nastran Vendor 2008 results from http://www.mscsoftware.com and this report as of June 9, 2009.

Friday Jun 12, 2009

OpenSolaris Beats Linux on memcached Sun Fire X2270

OpenSolaris provided 25% better performance on memcached than Linux on the Sun Fire X2270 server. memcached 1.3.2 using OpenSolaris gave a maximum throughput of 352K ops/sec compared to the same server running RHEL5 (with kernel 2.6.29) which produced a result of 281K ops/sec.

memcached is the de-facto distributed caching server used to scale many web2.0 sites today. With the requirement to support a very large number of users as sites grow, memcached aids scalability by effectively cutting down on MySQL traffic and improving response times.

  • memcached is a very light-weight server but is known not to scale beyond 4-6 threads. Some scalability improvements have gone into the 1.3 release (still in beta).
  • As customers move to the newer, more powerful Intel Nehalem based systems, it is important that they have the ability to use these systems efficiently using appropriate software and hardware components.

Performance Landscape

memcached performance results: ops/sec (bigger is better)

System           C/C/T    Processors                 Memory   Operating System                             Performance (Ops/Sec)
Sun Fire X2270   2/8/16   2.93 GHz Intel X5570 QC    48GB     OpenSolaris 2009.06                          352K
Sun Fire X2270   2/8/16   2.93 GHz Intel X5570 QC    48GB     RedHat Enterprise Linux 5 (kernel 2.6.29)    281K

C/C/T: Chips, Cores, Threads

Results and Configuration Summary

Sun's results used the following hardware and software components.

Hardware:

    Sun Fire X2270
    2 x Intel X5570 QC 2.93 GHz
    48GB of memory
    10GbE Intel Oplin card

Software:

    OpenSolaris 2009.06
    Linux RedHat 5 (on kernel 2.6.29)

Benchmark Description

memcached is a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load. The memcached benchmark was based on Apache Olio - a web2.0 workload.

The benchmark initially populates the server cache with objects of different sizes to simulate the types of data that real sites typically store in memcached:

  • small objects (4-100 bytes) to represent locks and query results
  • medium objects (1-2 KBytes) to represent thumbnails, database rows, resultsets
  • large objects (5-20 KBytes) to represent whole or partially generated pages

The benchmark then runs a mixture of operations (90% gets, 10% sets) and measures the throughput and response times when the system reaches steady-state. The workload is implemented using Faban, an open-source benchmark development framework. It not only speeds benchmark development, but the Faban harness is a great way to queue, monitor and archive runs for analysis.

Key Points and Best Practices

OpenSolaris Tuning

The following /etc/system settings were used to set the number of MSIX:

  • set ddi_msix_alloc_limit=4
  • set pcplusmp:apic_intr_policy=1

For the ixgbe interface, 4 transmit and 4 receive rings gave the best performance:

  • tx_queue_number=4, rx_queue_number=4

The crossbow threads were bound:

dladm set-linkprop -p cpus=12,13,14,15 ixgbe0
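
Gathered in one place, the OpenSolaris tuning above amounts to a few lines of configuration. The /kernel/drv/ixgbe.conf location for the ring-count properties is an assumption that may vary by release; the remaining lines restate the settings listed above.

    # /etc/system -- MSI-X allocation settings
    set ddi_msix_alloc_limit=4
    set pcplusmp:apic_intr_policy=1

    # /kernel/drv/ixgbe.conf -- 4 transmit and 4 receive rings
    tx_queue_number=4;
    rx_queue_number=4;

    # Bind the crossbow threads once the interface is plumbed
    dladm set-linkprop -p cpus=12,13,14,15 ixgbe0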

Linux Tuning

Linux was more complicated to tune. The following Linux tunables were changed to try to get the best performance (a consolidated sketch follows these lists):

  • net.ipv4.tcp_timestamps = 0
  • net.core.wmem_default = 67108864
  • net.core.wmem_max = 67108864
  • net.core.optmem_max = 67108864
  • net.ipv4.tcp_dsack = 0
  • net.ipv4.tcp_sack = 0
  • net.ipv4.tcp_window_scaling = 0
  • net.core.netdev_max_backlog = 300000
  • net.ipv4.tcp_max_syn_backlog = 200000

Here are the ixgbe specific settings that were used (2 transmit, 2 receive rings):

  • RSS=2,2 InterruptThrottleRate=1600,1600
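
Applied at runtime, the settings above might look like the following sketch; persisting them in /etc/sysctl.conf and the modprobe configuration is omitted for brevity.

    # Apply the TCP/core tunables listed above
    sysctl -w net.ipv4.tcp_timestamps=0
    sysctl -w net.core.wmem_default=67108864
    sysctl -w net.core.wmem_max=67108864
    sysctl -w net.core.optmem_max=67108864
    sysctl -w net.ipv4.tcp_dsack=0
    sysctl -w net.ipv4.tcp_sack=0
    sysctl -w net.ipv4.tcp_window_scaling=0
    sysctl -w net.core.netdev_max_backlog=300000
    sysctl -w net.ipv4.tcp_max_syn_backlog=200000
    # Load the ixgbe driver with 2 transmit and 2 receive rings
    modprobe ixgbe RSS=2,2 InterruptThrottleRate=1600,1600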

Linux Issues

The 10GbE Intel Oplin card on Linux resulted in the following driver and kernel re-builds.

  • With the default ixgbe driver from the RedHat distribution (version 1.3.30-k2 on kernel 2.6.18), the interface simply hung during the benchmark test.
  • This led to downloading the driver from the Intel site (1.3.56.11-2-NAPI) and re-compiling it. This version worked, giving a maximum throughput of 232K operations/sec on the same Linux kernel (2.6.18). However, this kernel version does not support multiple TX rings.
  • Kernel version 2.6.29 supports multiple TX rings but does not yet include the 1.3.56.11-2-NAPI ixgbe driver, so we downloaded, built, and installed these versions of the kernel and driver. This combination worked well, giving a maximum throughput of 281K ops/sec with some tuning.

See Also

Disclosure Statement

Sun Fire X2270 server with OpenSolaris 352K ops/sec. Sun Fire X2270 server with RedHat Linux 281K ops/sec. For memcached information, visit http://www.danga.com/memcached. Results as of June 8, 2009.

Tuesday Jun 09, 2009

Free Compiler Wins Nehalem Race by 2x

Not the Free Compiler That You Thought, No, This Other One.

Nehalem performance measured with several software configurations

Contributed by: John Henning and Karsten Guthridge

Introduction


The GNU C Compiler, GCC, is popular, widely available, and an exemplary collaborative effort.

But how does it do for performance -- for example, on Intel's latest hot "Nehalem" processor family? How does it compare to the freely available Sun Studio compiler?

Using the SPEC CPU benchmarks, we take a look at this question. These benchmarks depend primarily on performance of the chip, the memory hierarchy, and the compiler. By holding the first two of these constant, it is possible to focus in on compiler contributions to performance.

Current Record Holder

The current SPEC CPU2006 floating point speed record holder is the Sun Blade X6270 server module. Using 2x Intel Xeon X5570 processor chips and 24 GB of DDR3-1333 memory, it delivers a result of 45.0 SPECfp_base2006 and 50.4 SPECfp2006. [1]

We used this same blade system to compare GCC vs. Studio. On separate, but same-model disks, this software was installed:

  • SuSE Linux Enterprise Server 11.0 (x86_64) and GCC V4.4.0 built with gmp-4.3.1 and mpfr-2.4.1
  • OpenSolaris2008.11 and Sun Studio 12 Update 1

Tie One Hand Behind Studio's Back

In order to make the comparison more fair to GCC, we took several steps.

  1. We simplified the tuning for the OpenSolaris/Sun Studio configuration. This was done in order to counter the criticism that one sometimes hears that SPEC benchmarks have overly aggressive tuning. Benchmarks were optimized with a reasonably short tuning string:

    For all:  -fast -xipo=2 -m64 -xvector=simd -xautopar
    For C++, add:  -library=stlport4
  2. Recall that SPEC CPU2006 allows two kinds of tuning: "base", and "peak". The base metrics require that all benchmarks of a given language use the same tuning. The peak metrics allow individual benchmarks to have differing tuning, and more aggressive optimizations, such as compiler feedback. The simplified Studio configuration used only the less aggressive base tuning.

Both of the above changes limited the performance of Sun Studio.  Several measures were used to increase the performance of GCC:

  1. We tested the latest released version of GCC, 4.4.0, which was announced on 21 April 2009. In our testing, GCC 4.4.0 provides about 10% better overall floating point performance than V4.3.2. Note that GCC 4.4.0 is more recent than the compiler that is included with recent Linux distributions such as SuSE 11, which includes 4.3.2; or Ubuntu 8.10, which updates to 4.3.2 when one does "apt-get install gcc". It was installed with the math libraries mpfr 2.4.1 and gmp 4.3.1, which are labeled as the latest releases as of 1 June 2009.

  2. A tuning effort was undertaken with GCC, including testing of -O2 -O3 -fprefetch-loop-arrays -funroll-all-loops -ffast-math -fno-strict-aliasing -ftree-loop-distribution -fwhole-program -combine and -fipa-struct-reorg

  3. Eventually, we settled on this tuning string for GCC base:

    For all:  -O3 -m64 -mtune=core2 -msse4.2 -march=core2
    -fprefetch-loop-arrays -funroll-all-loops
    -Wl,-z common-page-size=2M
    For C++, add:  -ffast-math

    The reason that only the C++ benchmarks used the fast math flag was that 435.gromacs, which uses C and Fortran, fails validation with this flag. (Note: we verified that the benchmarks successfully obtained 2MB pages.) Concrete compile lines for both compilers are sketched after this list.
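
To make the tuning strings concrete, here is roughly how they translate into compile lines. This is a sketch only: source and output names are illustrative, and the page-size request is written in the comma-separated form the compiler driver passes to the linker.

    # Sun Studio 12 Update 1 base flags (C shown; C++ adds -library=stlport4)
    cc -fast -xipo=2 -m64 -xvector=simd -xautopar -o bench bench.c

    # GCC 4.4.0 base flags (C shown; C++ adds -ffast-math)
    gcc -O3 -m64 -mtune=core2 -msse4.2 -march=core2 \
        -fprefetch-loop-arrays -funroll-all-loops \
        -Wl,-z,common-page-size=2M -o bench bench.c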

Studio wins by 2x, even with one hand tied behind its back

At this point, a fair base-to-base comparison can be made, and Sun Studio/OpenSolaris finishes the race while GCC/Linux is still looking for its glasses: 44.8 vs. 21.1 (see Table 1). Notice that Sun Studio provides more than 2x the performance of GCC.

Table 1: Initial Comparisons, SPECfp2006,
Sun Studio/Solaris vs. GCC/Linux

                                                       Base   Peak
Industry FP Record
  (Sun Studio 12 Update 1, OpenSolaris 2008.11)        45.0   50.4
Studio/OpenSolaris: simplified as above (less tuned)   44.8   -
GCC V4.4 / SuSE Linux 11                               21.1   -

Notes: All results reported are from rule-compliant, "reportable" runs of the SPEC CPU2006 floating point suite, CFP2006. "Base" indicates the metric SPECfp_base2006. "Peak" indicates SPECfp2006. Peak uses the same benchmarks and workloads as base, but allows more aggressive tuning. A base result may optionally be quoted as peak, but the converse is not allowed. For details, see SPEC's Readme1st.

Fair? Did you say "Fair"?

Wait, wait, the reader may protest - this is all very unfair to GCC, because the Studio result used all 8 cores on this 2-chip system, whereas GCC used only one core! You're using trickery!

To this plaintive whine, we respond that:

  • Compiler auto-parallelization technology is not a trick. Rather, it is an essential technology in order to get the best performance from today's multi-core systems. Nearly all contemporary CPU chips provide support for multiple cores. Compilers should do everything possible to make it easy to take advantage of these resources.

  • We tried to use more than one core for GCC, via the -ftree-parallelize-loops=n flag. GCC's autoparallelization appears to be at a much earlier development stage than Studio's, since we did not observe any improvement for any value of "n" that we tested. From the GCC wiki, it appears that a new autoparallelization effort is under development, which may improve its results at a later time.

  • But, all right, if you insist, we will make things even harder for Studio, and see how it does.

Tie Another Hand Behind Studio's Back

The earlier section mentioned various ways in which the performance comparison had been made easier for GCC. Continuing the paragraph numbering from above, we took these additional measures:

  1. Removed the autoparallelization from Studio, substituting instead a request for 2MB pagesizes (which the GCC tuning already had).

  2. Added "peak" tuning to GCC: for benchmarks that benefit, add -ffast-math, and compiler profile-driven feedback

At this point, Studio base beats GCC base by 38%, and Studio base beats GCC peak by more than 25% (see table 2).

Table 2: Additional Comparisons, SPECfp2006,
Sun Studio/Solaris vs. GCC/Linux

                                                   Base   Peak
Sun Studio/OpenSolaris: base only, no autopar      29.1   -
GCC V4.4 / SuSE Linux 11                           21.1   23.1

The notes from Table 1 apply here as well.

Bottom line

The freely available Sun Studio 12 Update 1 compiler on OpenSolaris provides more than double the performance of GCC V4.4 on SuSE Linux, as measured by SPECfp_base2006.

If compilation is restricted to avoid using autoparallelization, Sun Studio still wins by 38% (base to base), or by more than 25% (Studio base vs. GCC peak).

YMMV

Your mileage may vary. It is certain that both GCC and Studio could be improved with additional tuning efforts. Both provide dozens of compiler flags, which can keep the tester delightfully engaged for an unbounded number of days. We feel that the tuning presented here is reasonable, and that additional tuning effort, if applied to both compilers, would not radically alter the conclusions.

Additional Information

The results disclosed in this article are from "reportable" runs of the SPECfp2006 benchmarks, which have been submitted to SPEC.

[1] SPEC and SPECfp are registered trademarks of the Standard Performance Evaluation Corporation. Competitive comparisons are based on data published at www.spec.org as of 1 June 2009. The X6270 result can be found at http://www.spec.org/cpu2006/results/res2009q2/cpu2006-20090413-07019.html.

Wednesday Jun 03, 2009

Wide Variety of Topics to be discussed on BestPerf

A sample of the various Sun and partner technologies to be discussed:
OpenSolaris, Solaris, Linux, Windows, vmware, gcc, Java, Glassfish, MySQL, Java, Sun-Studio, ZFS, dtrace, perflib, Oracle, DB2, Sybase, OpenStorage, CMT, SPARC64, X64, X86, Intel, AMD

About

BestPerf is the source of Oracle performance expertise. In this blog, Oracle's Strategic Applications Engineering group explores Oracle's performance results and shares best practices learned from working on Enterprise-wide Applications.
