Friday Sep 30, 2011

SPARC T4 Processor Beats Intel (Westmere AES-NI) on AES Encryption Tests

The cryptography benchmark suite was internally developed by Oracle to measure the maximum throughput of in-memory, on-chip encryption operations that a system can perform. Multiple threads are used to achieve the maximum throughput.

  • Oracle's SPARC T4 processor running Oracle Solaris 11 is 1.5x faster on AES 256-bit key CFB mode encryption than the Intel Xeon X5690 processor running Oracle Linux 6.1 for in-memory encryption of 32 KB blocks.

  • The SPARC T4 processor running Oracle Solaris 11 is 1.7x faster on AES 256-bit key CBC mode encryption than the Intel Xeon X5690 processor running Oracle Linux 6.1 for in-memory encryption of 32 KB blocks.

  • The SPARC T4 processor running Oracle Solaris 11 is 3.6x faster on AES 256-bit key CCM mode encryption than the Intel Xeon X5690 processor running Oracle Linux 6.1 for in-memory encryption with authentication of 32 KB blocks.

  • The SPARC T4 processor running Oracle Solaris 11 is 1.4x faster on AES 256-bit key GCM mode encryption than the Intel Xeon X5690 processor running Oracle Linux 6.1 for in-memory encryption with authentication of 32 KB blocks.

  • The SPARC T4 processor running Oracle Solaris 11 is 9% faster on single-threaded AES 256-bit key CFB mode encryption than the Intel Xeon X5690 processor running Oracle Linux 6.1 for in-memory encryption of 32 KB blocks.

  • The SPARC T4 processor running Oracle Solaris 11 is 1.8x faster on AES 256-bit key CFB mode encryption than the SPARC T3 running Solaris 11 Express.

  • AES CFB mode is used by Oracle Database 11g for Transparent Data Encryption (TDE), which protects data stored in the database.

Performance Landscape

Encryption Performance – AES-CFB

Performance is presented for in-memory AES-CFB128 mode encryption. Multiple key sizes of 256-bit, 192-bit and 128-bit are presented. The encryption was performed on 32 KB of pseudo-random data (same data for each run).

AES-256-CFB
Microbenchmark Performance (MB/sec)
Processor GHz Th Performance Software Environment
SPARC T4 2.85 64 10,963 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 12 7,526 Oracle Linux 6.1, IPP/AES-NI
SPARC T3 1.65 32 6,023 Oracle Solaris 11 Express, libpkcs11
Intel X5690 3.47 12 2,894 Oracle Solaris 11, libsoftcrypto
SPARC T4 2.85 1 712 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 1 653 Oracle Linux 6.1, IPP/AES-NI
Intel X5690 3.47 1 425 Oracle Solaris 11, libsoftcrypto
SPARC T3 1.65 1 331 Oracle Solaris 11 Express, libpkcs11

AES-192-CFB
Microbenchmark Performance (MB/sec)
Processor GHz Th Performance Software Environment
SPARC T4 2.85 64 12,451 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 12 8,677 Oracle Linux 6.1, IPP/AES-NI
SPARC T3 1.65 32 6,175 Oracle Solaris 11 Express, libpkcs11
Intel X5690 3.47 12 2,976 Oracle Solaris 11, libsoftcrypto
SPARC T4 2.85 1 816 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 1 752 Oracle Linux 6.1, IPP/AES-NI
Intel X5690 3.47 1 461 Oracle Solaris 11, libsoftcrypto
SPARC T3 1.65 1 371 Oracle Solaris 11 Express, libpkcs11

AES-128-CFB
Microbenchmark Performance (MB/sec)
Processor GHz Th Performance Software Environment
SPARC T4 2.85 64 14,388 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 12 10,214 Oracle Solaris 11, libsoftcrypto
SPARC T3 1.65 32 6,390 Oracle Solaris 11 Express, libpkcs11
Intel X5690 3.47 12 3,115 Oracle Linux 6.1, IPP/AES-NI
SPARC T4 2.85 1 953 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 1 886 Oracle Linux 6.1, IPP/AES-NI
Intel X5690 3.47 1 509 Oracle Solaris 11, libsoftcrypto
SPARC T3 1.65 1 395 Oracle Solaris 11 Express, libpkcs11

Encryption Performance – AES-CBC

Performance is presented for in-memory AES-CBC mode encryption. Multiple key sizes of 256-bit, 192-bit and 128-bit are presented. The encryption was performed on 32 KB of pseudo-random data (same data for each run).

AES-256-CBC
Microbenchmark Performance (MB/sec)
Processor GHz Th Performance Software Environment
SPARC T4 2.85 64 11,588 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 12 7,171 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 12 6,704 Oracle Linux 6.1, IPP/AES-NI
SPARC T3 1.65 32 5,980 Oracle Solaris 11 Express, libpkcs11
SPARC T4 2.85 1 748 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 1 592 Oracle Linux 6.1, IPP/AES-NI
Intel X5690 3.47 1 569 Oracle Solaris 11, libsoftcrypto
SPARC T3 1.65 1 336 Oracle Solaris 11 Express, libpkcs11

AES-192-CBC
Microbenchmark Performance (MB/sec)
Processor GHz Th Performance Software Environment
SPARC T4 2.85 64 13,216 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 12 8,211 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 12 7,588 Oracle Linux 6.1, IPP/AES-NI
SPARC T3 1.65 32 6,333 Oracle Solaris 11 Express, libpkcs11
SPARC T4 2.85 1 862 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 1 672 Oracle Linux 6.1, IPP/AES-NI
Intel X5690 3.47 1 643 Oracle Solaris 11, libsoftcrypto
SPARC T3 1.65 1 358 Oracle Solaris 11 Express, libpkcs11

AES-128-CBC
Microbenchmark Performance (MB/sec)
Processor GHz Th Performance Software Environment
SPARC T4 2.85 64 15,323 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 12 9,785 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 12 8,746 Oracle Linux 6.1, IPP/AES-NI
SPARC T3 1.65 32 6,347 Oracle Solaris 11 Express, libpkcs11
SPARC T4 2.85 1 1,017 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 1 781 Oracle Linux 6.1, IPP/AES-NI
Intel X5690 3.47 1 739 Oracle Solaris 11, libsoftcrypto
SPARC T3 1.65 1 434 Oracle Solaris 11 Express, libpkcs11

Encryption Performance – AES-CCM

Performance is presented for in-memory AES-CCM mode encryption with authentication. Multiple key sizes of 256-bit, 192-bit and 128-bit are presented. The encryption/authentication was performed on 32 KB of pseudo-random data (same data for each run).

AES-256-CCM
Microbenchmark Performance (MB/sec)
Processor GHz Th Performance Software Environment
SPARC T4 2.85 64 5,850 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 12 1,860 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 12 1,613 Oracle Linux 6.1, IPP/AES-NI
SPARC T4 2.85 1 480 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 1 258 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 1 190 Oracle Linux 6.1, IPP/AES-NI

AES-192-CCM
Microbenchmark Performance (MB/sec)
Processor GHz Th Performance Software Environment
SPARC T4 2.85 64 6,709 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 12 1,930 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 12 1,715 Oracle Linux 6.1, IPP/AES-NI
SPARC T4 2.85 1 565 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 1 293 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 1 206 Oracle Linux 6.1, IPP/AES-NI

AES-128-CCM
Microbenchmark Performance (MB/sec)
Processor GHz Th Performance Software Environment
SPARC T4 2.85 64 7,856 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 12 2,031 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 12 1,838 Oracle Linux 6.1, IPP/AES-NI
SPARC T4 2.85 1 664 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 1 321 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 1 225 Oracle Linux 6.1, IPP/AES-NI

Encryption Performance – AES-GCM

Performance is presented for in-memory AES-GCM mode encryption with authentication. Multiple key sizes of 256-bit, 192-bit and 128-bit are presented. The encryption/authentication was performed on 32 KB of pseudo-random data (same data for each run).

AES-256-GCM
Microbenchmark Performance (MB/sec)
Processor GHz Th Performance Software Environment
SPARC T4 2.85 64 6,871 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 12 4,794 Oracle Linux 6.1, IPP/AES-NI
Intel X5690 3.47 12 1,685 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 1 691 Oracle Linux 6.1, IPP/AES-NI
SPARC T4 2.85 1 571 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 1 253 Oracle Solaris 11, libsoftcrypto

AES-192-GCM
Microbenchmark Performance (MB/sec)
Processor GHz Th Performance Software Environment
SPARC T4 2.85 64 7,450 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 12 5,054 Oracle Linux 6.1, IPP/AES-NI
Intel X5690 3.47 12 1,724 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 1 727 Oracle Linux 6.1, IPP/AES-NI
SPARC T4 2.85 1 618 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 1 268 Oracle Solaris 11, libsoftcrypto

AES-128-GCM
Microbenchmark Performance (MB/sec)
Processor GHz Th Performance Software Environment
SPARC T4 2.85 64 7,987 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 12 5,315 Oracle Linux 6.1, IPP/AES-NI
Intel X5690 3.47 12 1,781 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 1 765 Oracle Linux 6.1, IPP/AES-NI
SPARC T4 2.85 1 655 Oracle Solaris 11, libsoftcrypto
Intel X5690 3.47 1 281 Oracle Solaris 11, libsoftcrypto

Configuration Summary

SPARC T4-1 server
1 x SPARC T4 processor, 2.85 GHz
128 GB memory
Oracle Solaris 11

SPARC T3-1 server
1 x SPARC T3 processor, 1.65 GHz
128 GB memory
Oracle Solaris 11 Express

Sun Fire X4270 M2 server
2 x Intel Xeon X5690, 3.47 GHz
Hyper-Threading enabled
Turbo Boost enabled
24 GB memory
Oracle Linux 6.1

Sun Fire X4270 M2 server
2 x Intel Xeon X5690, 3.47 GHz
Hyper-Threading enabled
Turbo Boost enabled
24 GB memory
Oracle Solaris 11 Express

Benchmark Description

The benchmark measures cryptographic capabilities in terms of general low-level encryption, in-memory and on-chip using various ciphers, including AES-128-CFB, AES-192-CFB, AES-256-CFB, AES-128-CBC, AES-192-CBC, AES-256-CBC, AES-128-CCM, AES-192-CCM, AES-256-CCM, AES-128-GCM, AES-192-GCM and AES-256-GCM.

The benchmark results were obtained using tests created by Oracle which use various application interfaces to perform the various ciphers. They were run using optimized libraries for each platform to obtain the best possible performance.
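
To make the measurement concrete, the sketch below times repeated in-memory AES-256-CFB encryption of a 32 KB buffer using the OpenSSL 1.0.x-style EVP interface. This is only a minimal single-threaded illustration of this kind of microbenchmark, not the Oracle test code; the iteration count, timing method and use of OpenSSL rather than the platform-optimized libraries are assumptions for the example.

    /* Minimal single-threaded in-memory AES-256-CFB throughput sketch.
     * Illustrative only; not the Oracle benchmark code. */
    #include <stdio.h>
    #include <sys/time.h>
    #include <openssl/evp.h>
    #include <openssl/rand.h>

    #define BLOCK (32 * 1024)   /* 32 KB in-memory block, as in the benchmark */
    #define ITERS 100000        /* repeat enough times for a stable rate      */

    int main(void)
    {
        unsigned char key[32], iv[16], in[BLOCK], out[BLOCK + EVP_MAX_BLOCK_LENGTH];
        EVP_CIPHER_CTX ctx;
        struct timeval t0, t1;
        int outl;
        long i;
        double secs;

        RAND_bytes(key, sizeof(key));
        RAND_bytes(iv, sizeof(iv));
        RAND_bytes(in, sizeof(in));          /* same pseudo-random data each pass */

        EVP_CIPHER_CTX_init(&ctx);
        EVP_EncryptInit_ex(&ctx, EVP_aes_256_cfb128(), NULL, key, iv);

        gettimeofday(&t0, NULL);
        for (i = 0; i < ITERS; i++)
            EVP_EncryptUpdate(&ctx, out, &outl, in, BLOCK);
        gettimeofday(&t1, NULL);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("AES-256-CFB: %.0f MB/sec (single thread)\n",
               (double)ITERS * BLOCK / (1024.0 * 1024.0) / secs);

        EVP_CIPHER_CTX_cleanup(&ctx);
        return 0;
    }

A multi-threaded variant would run one such loop per thread with an independent cipher context per thread and sum the per-thread rates to obtain the maximum throughput figures reported above.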

See Also

Disclosure Statement

Copyright 2012, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 1/13/2012.

Thursday Sep 29, 2011

SPARC T4 Processor Outperforms IBM POWER7 and Intel (Westmere AES-NI) on OpenSSL AES Encryption Test

Oracle's SPARC T4 processor is faster than the Intel Xeon X5690 (with AES-NI) and the IBM POWER7.

  • On single-thread OpenSSL encryption, the 2.85 GHz SPARC T4 processor is 4.3 times faster than the 3.5 GHz IBM POWER7 processor.

  • On single-thread OpenSSL encryption, the 2.85 GHz SPARC T4 processor is 17% faster than the 3.46 GHz Intel Xeon X5690 processor.

The SPARC T4 processor has Encryption Instruction Accelerators for encryption and decryption for AES and many other ciphers. The Intel Xeon X5690 processor has AES-NI instructions which accelerate only AES ciphers. The IBM POWER7 does not have cryptographic instructions, but cryptographic coprocessors are available.

Performance Landscape

The table below shows results when running the OpenSSL speed command with the AES-256-CBC cipher. The reported results are for a message size of 8192 bytes. Results are reported for a single thread and for running on all available hardware threads (no over subscribing).

OpenSSL Performance with AES-256-CBC Encryption (MB/sec)
Processor                    1 Thread   Maximum Throughput (at number of threads)
SPARC T4, 2.85 GHz             769      11,967 (64)
Intel Xeon X5690, 3.46 GHz     660       7,362 (12)
IBM POWER7, 3.5 GHz            179       2,860 (est*)

(est*) The performance of the IBM POWER7 is estimated at 16 times the rate of the single thread performance. The estimate is considered an upper bound on expected performance for this processor.

Configuration Summary

SPARC Configuration:

SPARC T4-1 server
1 x SPARC T4 processor, 2.85 GHz
64 GB memory
Oracle Solaris 11

Intel Configuration:

Sun Fire X4270 M2 server
1 x Intel Xeon X5690 processor, 3.46 GHz
24 GB memory
Oracle Solaris 11

Software Configuration:

OpenSSL 1.0.0.d
gcc 3.4.3

Benchmark Description

The in-memory SSL performance was measured with the openssl command. openssl has an option for measuring the speed of various ciphers and message sizes. The actual command used to measure the speed of AES-256-CBC was:

openssl speed -multi {number of threads} -evp aes-256-cbc

openssl runs for several minutes and measures the speed, in units of MB/sec, of the specified cipher for messages of sizes 16 bytes to 8192 bytes.

Key Points and Best Practices

  • The Encryption Instruction Accelerators are accessed through a platform independent API for cryptographic engines.
  • The OpenSSL libraries use this API. The default is to not use the Encryption Instruction Accelerators; a sketch of explicitly selecting a hardware-backed engine follows this list.
  • Cryptography is compute intensive. Using all available hardware threads, both the SPARC T4 processor and the Intel Xeon processor were able to saturate the memory bandwidth of their respective systems.
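
For reference, OpenSSL can be pointed at a hardware-backed provider explicitly, for example with the speed command's -engine option (openssl speed -engine pkcs11 -evp aes-256-cbc) or, from C, with the ENGINE API as sketched below. This is a generic OpenSSL 1.0.x illustration; the "pkcs11" engine name and its availability are assumptions about the platform's OpenSSL build, not details reported in this result.

    /* Sketch: route EVP operations through an assumed "pkcs11" engine. */
    #include <stdio.h>
    #include <openssl/engine.h>
    #include <openssl/evp.h>

    int main(void)
    {
        ENGINE *e;

        OpenSSL_add_all_algorithms();
        ENGINE_load_builtin_engines();

        e = ENGINE_by_id("pkcs11");               /* assumed engine name */
        if (e == NULL || !ENGINE_init(e)) {
            fprintf(stderr, "pkcs11 engine not available\n");
            return 1;
        }
        ENGINE_set_default(e, ENGINE_METHOD_ALL); /* route EVP calls through it */

        printf("using engine: %s\n", ENGINE_get_name(e));

        ENGINE_finish(e);
        ENGINE_free(e);
        return 0;
    }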

See Also

Disclosure Statement

Copyright 2011, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 9/26/2011.

SPARC T4-1 Server Outperforms Intel (Westmere AES-NI) on IPsec Encryption Tests

Oracle's SPARC T4 processor has significantly greater performance than the Intel Xeon X5690 processor when both are using Oracle Solaris 11 secure IP networking (IPsec). The SPARC T4 processor using IPsec AES-256-CCM mode achieves line speed over a 10 GbE network.

  • On IPsec, the SPARC T4 processor is 23% faster than the 3.46 GHz Intel Xeon X5690 processor (with Intel AES-NI).

  • The SPARC T4 processor is only at 23% utilization when running at its maximum throughput, making it 3.6 times more efficient at secure networking than the 3.46 GHz Intel Xeon X5690 processor.

  • The 3.46 GHz Intel Xeon X5690 processor is nearly fully utilized at its maximum throughput leaving little CPU for application processing.

  • The SPARC T4 processor using IPsec AES-256-CCM mode achieves line speed over a 10 GbE network.

  • The SPARC T4 processor approaches line speed with fewer than one-quarter the number of IPsec streams required for the Intel Xeon X5690 processor to achieve its peak throughput. The SPARC T4 processor supports the additional streams with minimal extra CPU utilization.

IPsec provides general purpose network security that is transparent to applications, making it ideal for networking applications that do not have cryptography built in. IPsec is useful for more than the Virtual Private Network (VPN) deployments where the technology is often first encountered.

Performance Landscape

Performance was measured using the AES-256-CCM cipher in megabits per second (Mb/sec), aggregated over enough TCP/IP streams to reach the line rate threshold (SPARC T4 processor) or to drive peak throughput (Intel Xeon X5690).

                             AES Decrypt                        AES Encrypt
Processor         GHz   B/W (Mb/sec)  CPU Util  Streams   B/W (Mb/sec)  CPU Util  Streams

Peak performance
SPARC T4          2.85     9,800        23%       96         9,800        20%       78
Intel Xeon X5690  3.46     8,000        83%        -         4,700        81%        -

Load at which SPARC T4 processor performance crosses 9000 Mb/sec
SPARC T4          2.85     9,300        19%       17         9,200        15%       17
Intel Xeon X5690  3.46     4,700        41%        -         3,200        47%        -

Configuration Summary

SPARC Configuration:

SPARC T4-1 server
1 x SPARC T4 processor 2.85 GHz
128 GB memory
Oracle Solaris 11
Single 10-Gigabit Ethernet XAUI Adapter

Intel Configuration:

Sun Fire X4270 M2
1 x Intel Xeon X5690 3.46 GHz, Hyper-Threading and Turbo Boost active
48 GB memory
Oracle Solaris 11
Sun Dual Port 10GbE PCIe 2.0 Networking Card with Intel 82599 10GbE Controller

Driver Systems Configuration:

2 x Sun Blade 6000 chassis each with
1 x Sun Blade 6000 Virtualized Ethernet Switched Network Express Module 10GbE (NEM)
10 x Sun Blade X6270 M2 server modules each with
2 x Intel Xeon X5680 3.33 GHz, Hyper-Threading and Turbo Boost active
48 GB memory
Oracle Solaris 11
Dual 10-Gigabit Ethernet Fabric Expansion Module (FEM)

Benchmark Configuration:

Netperf 2.4.5 network benchmark adapted for testing bandwidth of multiple streams in aggregate.

Benchmark Description

The results here are derived from runs of the Netperf 2.4.5 benchmark. Netperf is a client/server benchmark measuring network performance providing a number of independent tests, including the TCP streaming bandwidth tests used here.

Netperf is, however, a single network stream benchmark; demonstrating peak network bandwidth over a 10 GbE line under encryption requires many streams.

The Netperf documentation provides an example of using the software to drive multiple streams. That example is not sufficient for this workload because it does not scale beyond a single driver node, which limits the processing power that can be applied and, in turn, how many full-bandwidth streams can be supported. We chose to run a single server process on the target system (containing either the SPARC T4 processor or the Intel Xeon processor) and to spawn one or more Netperf client processes on each node of a cluster of driver systems. The client processes are managed by the mpirun program of the Oracle Message Passing Toolkit.
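
A hypothetical sketch of such an mpirun-managed driver is shown below: each MPI rank launches one netperf client against the target system so that many streams start together. The host name, test length and netperf options are placeholders, not the settings used for the published results.

    /* Hypothetical multi-stream driver: one netperf client per MPI rank. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, nranks;
        char cmd[256];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* One TCP_STREAM test per rank; all ranks start together. */
        MPI_Barrier(MPI_COMM_WORLD);
        snprintf(cmd, sizeof(cmd),
                 "netperf -H target-host -t TCP_STREAM -l 120 -P 0");
        int rc = system(cmd);

        if (rank == 0)
            printf("launched %d netperf streams (rank 0 rc=%d)\n", nranks, rc);

        MPI_Finalize();
        return 0;
    }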

Tabular results include aggregate bandwidth and CPU utilization. The aggregate bandwidth is computed by dividing the total traffic of the client processes by the overall runtime. CPU utilization on the target system is the average of that reported by all of the Netperf client processes.

IPsec is configured in the operating system of each participating server transparently to Netperf and applied to the dedicated network connecting the target system to the driver systems.

Key Points and Best Practices

  • Line speed is defined as data bandwidth within 10% of the theoretical maximum bit rate of the network line. For 10 GbE, bandwidth greater than 9000 Mb/sec is defined as line speed.

  • IPsec provides network security that is configured completely in the operating system and is transparent to the application.

  • Peak bandwidths under IPsec are achieved only in aggregate with multiple client network streams to the target server.

  • Oracle Solaris receiver fanout must be increased from the default to support the large numbers of streams at quoted peak rates.

  • The ixgbe network driver, used on servers with Intel 82599 10GbE controllers (the driver systems and the Intel Xeon target system), was limited to only a single receiver queue to maximize utilization of the extra fanout.

  • IPsec is configured to make a unique security association (SA) for each connection to avoid a bottleneck over the large stream counts.

  • Jumbo frames are enabled (MTU of 9000) and network interrupt blanking (sometimes called interrupt coalescence) is disabled.

  • The TCP streaming bandwidth tests, which run continuously for minutes and multiple times to determine statistical significance, are configured to use message sizes of 1,048,576 bytes.

  • IPsec configuration defines that each SA is established through the use of a preshared key and Internet Key Exchange (IKE).

  • IPsec encryption uses the Solaris Cryptographic Framework which applies the appropriate accelerated provider on both the SPARC T4 processor and the Intel Xeon processor.

  • There is no need to configure a specific authentication algorithm for IPsec. With the Encapsulated Security Payload (ESP) security protocol and choosing AES-256-CCM for the encryption algorithm, the encapsulation is self-authenticating.

See Also

Disclosure Statement

Copyright 2011, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 9/26/2011.

SPARC T4-2 Server Beats Intel (Westmere AES-NI) on SSL Network Tests

Oracle's SPARC T4 processor is faster and more efficient than the Intel Xeon X5690 processor (with AES-NI) when running network SSL throughput tests.

  • The SPARC T4 processor at 2.85 GHz is 20% faster than the 3.46 GHz Intel Xeon X5690 processor on single stream network SSL encryption.

  • The SPARC T4 processor requires fewer streams to reach near line speed on a 10 GbE secure network and does so with one-fifth the CPU utilization of the Intel Xeon X5690 processor.

  • Oracle's SPARC T4-2 server using 8 threads achieves line speed over a 10 GbE network with only 9% CPU utilization.

  • Oracle's Sun Fire X4270 M2 with two Intel Xeon X5690 processors achieves line speed with 8 threads, but at 45% CPU utilization.

The SPARC T4 processor has hardware support via Encryption Instruction Accelerators for encryption and decryption for AES and many other ciphers. The Intel Xeon X5690 processor has AES-NI instructions which accelerate only AES ciphers.

Performance Landscape

The following table shows single stream results running encrypted (SSL Read) and unencrypted (Clear Text) messages of 1 MB in size. These tests were run with the uperf benchmark and used the AES-256-CBC cipher. They were run across a 10 GbE connection. Write messages saw similar performance.

Single Stream Network Communication with Uperf (Mb/sec)
Processor                    Clear Text   SSL Read
SPARC T4, 2.85 GHz              4,194       1,678
Intel Xeon X5690, 3.46 GHz      5,591       1,398

The next table shows how many streams it takes to achieve 90% of the 10 GbE network bandwidth (9000 Mb/sec) for encrypted read messages of 1 MB in size. These tests were run with the uperf benchmark and used the AES-256-CBC cipher. Write messages saw similar performance.

Uperf SSL Read with AES-256-CBC
Processor                    Streams for 90% Network Utilization   CPU Utilization
SPARC T4, 2.85 GHz                         8                             9%
Intel Xeon X5690, 3.46 GHz                12                            45%

Configuration Summary

SPARC T4 Configuration:

2 x SPARC T4-2 servers each with
2 x SPARC T4 processors, 2.85 GHz
128 GB memory
1 x 10-Gigabit Ethernet XAUI Adapter
Oracle Solaris 11
Back-to-back 10 GbE connection

Intel Configuration:

2 x Sun Fire X4270 M2 servers each with
2 x Intel Xeon X5690 processors, 3.46 GHz
48 GB memory
1 x Sun Dual Port 10GbE PCIe 2.0 Networking Card with Intel 82599 10GbE Controller
Oracle Solaris 11
Back-to-back 10 GbE connection

Software Configuration:

OpenSSL 1.0.0.d
uperf 1.0.3
gcc 3.4.3

Benchmark Description

Uperf is an open source benchmark program for simulating and measuring network performance. Uperf is able to measure the performance of various protocols, including TCP, UDP, SCTP and SSL. The uperf benchmark uses an input-defined workload to test network performance. This input workload can be used to model complex situations or to isolate simple tasks. The workload used for these tests was simple network reads and simple network writes.
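
For illustration, the sketch below shows the shape of such a "simple network read" over SSL: it reads 1 MB messages from an established TLS connection with OpenSSL and reports the rate in Mb/sec. It is not uperf itself; the server address, port, message count and cipher negotiation details are assumptions for the example.

    /* Illustrative SSL read loop; not the uperf workload. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/time.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <openssl/ssl.h>

    #define MSG_BYTES (1024 * 1024)   /* 1 MB messages, as in the benchmark */

    int main(void)
    {
        struct sockaddr_in addr;
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        char *buf = malloc(MSG_BYTES);
        long long total = 0;
        struct timeval t0, t1;
        double secs;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(5000);                     /* assumed port   */
        addr.sin_addr.s_addr = inet_addr("192.168.0.1"); /* assumed server */
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0)
            return 1;

        SSL_library_init();
        SSL_CTX *ctx = SSL_CTX_new(SSLv23_client_method());
        SSL *ssl = SSL_new(ctx);
        SSL_set_fd(ssl, fd);
        if (SSL_connect(ssl) != 1)
            return 1;

        gettimeofday(&t0, NULL);
        for (int i = 0; i < 1000; i++) {                 /* read 1000 messages */
            int got = 0;
            while (got < MSG_BYTES) {
                int n = SSL_read(ssl, buf + got, MSG_BYTES - got);
                if (n <= 0) return 1;
                got += n;
            }
            total += got;
        }
        gettimeofday(&t1, NULL);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%.0f Mb/sec\n", total * 8.0 / 1e6 / secs);

        SSL_shutdown(ssl);
        SSL_free(ssl);
        SSL_CTX_free(ctx);
        close(fd);
        free(buf);
        return 0;
    }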

Key Points and Best Practices

  • The Encryption Instruction Accelerators are accessed through a platform independent API for cryptographic engines.
  • The OpenSSL libraries use the API. The default is to not use the Encryption Instruction Accelerators.
  • Cryptography is compute intensive. Using 8 threads (one per stream), the SPARC T4 processor was able to match the bandwidth of the 10 GbE network.

See Also

Disclosure Statement

Copyright 2011, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 9/26/2011.

Monday Sep 19, 2011

Halliburton ProMAX® Seismic Processing on Sun Blade X6270 M2 with Sun ZFS Storage 7320

Halliburton/Landmark's ProMAX® 3D Pre-Stack Kirchhoff Time Migration's (PSTM) single workflow scalability and multiple workflow throughput using various scheduling methods are evaluated on a cluster of Oracle's Sun Blade X6270 M2 server modules attached to Oracle's Sun ZFS Storage 7320 appliance.

Two resource scheduling methods, compact and distributed, are compared while increasing the system load with additional concurrent ProMAX® workflows.

  • Multiple concurrent 24-process ProMAX® PSTM workflow throughput is constant; 10 workflows on 10 nodes finish as fast as 1 workflow on one compute node. Additionally, processing twice the data volume yields similar traces/second throughput performance.

  • A single ProMAX® PSTM workflow shows good scaling from 1 to 10 nodes of a Sun Blade X6270 M2 cluster: ProMAX® scales to 4.7X on 10 nodes with one input data set and 6.3X with two consecutive input data sets (i.e. twice the data).

  • A single ProMAX® PSTM workflow has near linear scaling of 11x on a Sun Blade X6270 M2 server module when running from 1 to 12 processes.

  • The throughput of 12-process ProMAX® workflows using the distributed scheduling method is equivalent to or slightly faster than with the compact scheme for 1 to 6 concurrent workflows.

Performance Landscape

Multiple 24-Process Workflow Throughput Scaling

This test measures the system throughput scalability as concurrent 24-process workflows are added, one workflow per node. The per workflow throughput and the system scalability are reported.

Aggregate system throughput scales linearly. Ten concurrent workflows finish in the same time as does one workflow on a single compute node.

Halliburton ProMAX® Pre-Stack Time Migration - Multiple Workflow Scaling


Single Workflow Scaling

This test measures single workflow scalability across a 10-node cluster. Utilizing a single data set, performance exhibits near linear scaling of 11x at 12 processes and per-node scaling of 4x at 6 nodes; performance then flattens quickly, reaching a peak of 60x at 240 processes and per-node scaling of 4.7x with 10 nodes.

Running with two consecutive input data sets in the workflow, scaling is considerably improved with peak scaling ~35% higher than obtained using a single data set. Doubling the data set size minimizes time spent in workflow initialization, data input and output.

Halliburton ProMAX® Pre-Stack Time Migration - Single Workflow Scaling

This next test measures single workflow scalability across a 10-node cluster (as above), but limits scheduling to a maximum of 12 processes per node, effectively restricting each node to one process per physical core. The speedups relative to a single process and to a single node are reported.

Utilizing a single data set, performance exhibits near linear scaling of 37x at 48 processes and per-node scaling of 4.3x at 6 nodes. Performance of 55x at 120 processes and per-node scaling of 5x with 10 nodes is reached, and scalability trends higher more strongly than in the case of two processes per physical core above. For equivalent total process counts, multi-node runs using only a single process per physical core run between 28% and 64% more efficiently (at 96 and 24 processes, respectively). With a full complement of 10 nodes (120 processes), the peak performance is only 9.5% lower than with 2 processes per vcpu (240 processes).

Running with two consecutive input data sets in the workflow, scaling is considerably improved with peak scaling ~35% higher than obtained using a single data set.

Halliburton ProMAX® Pre-Stack Time Migration - Single Workflow Scaling

Multiple 12-Process Workflow Throughput Scaling, Compact vs. Distributed Scheduling

The fourth test compares compact and distributed scheduling of 1, 2, 4, and 6 concurrent 12-processor workflows.

All things being equal, the system bisection bandwidth should improve with distributed scheduling of a fixed-size workflow; as more nodes are used for a workflow, more memory and system cache are employed, and any node memory bandwidth bottlenecks can be offset by distributing communication across the network (provided the network and inter-node communication stack do not become a bottleneck). When physical cores are not over-subscribed, compact and distributed scheduling performance is within 3%, suggesting that there may be little memory contention for this workflow on the benchmarked system configuration.

With compact scheduling of two concurrent 12-processor workflows, the physical cores become over-subscribed and performance degrades 36% per workflow. With four concurrent workflows, physical cores are over-subscribed 4x and performance degrades 66% per workflow. With six concurrent workflows, over-subscribed compact scheduling performance degrades 77% per workflow. As multiple 12-processor workflows become more and more distributed, performance approaches the non-over-subscribed case.

Halliburton ProMAX® Pre-Stack Time Migration - Multiple Workflow Scaling

141616 traces x 624 samples


Test Notes

All tests were performed with one input data set (70808 traces x 624 samples) and two consecutive input data sets (2 * (70808 traces x 624 samples)) in the workflow. All results reported are the average of at least 3 runs and performance is based on reported total wall-clock time by the application.

All tests were run with the NFS-attached Sun ZFS Storage 7320 appliance and then with an NFS-attached legacy Sun Fire X4500 server. The StorageTek Workload Analysis Tool (SWAT) was invoked to measure the I/O characteristics of the NFS-attached storage on separate runs of all workflows.

Configuration Summary

Hardware Configuration:

10 x Sun Blade X6270 M2 server modules, each with
2 x 3.33 GHz Intel Xeon X5680 processors
48 GB DDR3-1333 memory
4 x 146 GB, Internal 10000 RPM SAS-2 HDD
10 GbE
Hyper-Threading enabled

Sun ZFS Storage 7320 Appliance
1 x Storage Controller
2 x 2.4 GHz Intel Xeon 5620 processors
48 GB memory (12 x 4 GB DDR3-1333)
2 TB Read Cache (4 x 512 GB Read Flash Accelerator)
10 GbE
1 x Disk Shelf
20.0 TB RAID-Z (20 x 1 TB SAS-2, 7200 RPM HDD)
4 x Write Flash Accelerators

Sun Fire X4500
2 x 2.8 GHz AMD 290 processors
16 GB DDR1-400 memory
34.5 TB RAID-Z (46 x 750 GB SATA-II, 7200 RPM HDD)
10 GbE

Software Configuration:

Oracle Linux 5.5
Parallel Virtual Machine 3.3.11 (bundled with ProMAX)
Intel 11.1.038 Compilers
Libraries: pthreads 2.4, Java 1.6.0_01, BLAS, Stanford Exploration Project Libraries

Benchmark Description

The ProMAX® family of seismic data processing tools is the most widely used seismic processing application in the Oil and Gas industry. ProMAX® is used for multiple applications, from field processing and quality control to interpretive project-oriented reprocessing at oil companies and production processing at service companies. ProMAX® is integrated with Halliburton's OpenWorks® Geoscience Oracle Database to index prestack seismic data and populate the database with processed seismic data.

This benchmark evaluates single workflow scalability and multiple workflow throughput of the ProMAX® 3D Prestack Kirchhoff Time Migration (PSTM) while processing the Halliburton benchmark data set containing 70,808 traces with 8 msec sample interval and trace length of 4992 msec. Benchmarks were performed with both one and two consecutive input data sets.

Each workflow consisted of:

  • reading the previously constructed MPEG encoded processing parameter file
  • reading the compressed seismic data traces from disk
  • performing the PSTM imaging
  • writing the result to disk

Workflows using two input data sets were constructed by simply adding a second identical seismic data read task immediately after the first in the processing parameter file. This effectively doubled the data volume read, processed, and written.

This version of ProMAX® currently uses only Parallel Virtual Machine (PVM) as its parallel processing paradigm. PVM uses only TCP networking and has no internal facility for assigning memory affinity or processor binding. Every compute node runs a PVM daemon.
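
As background on that model, the toy PVM master below spawns workers and exchanges one message with each over TCP. It is purely illustrative of the PVM message-passing interface; the task name, task count and message tags are invented for the example and are not taken from ProMAX®.

    /* Toy PVM master illustrating spawn/pack/send/receive; not ProMAX code. */
    #include <stdio.h>
    #include <pvm3.h>

    int main(void)
    {
        int mytid = pvm_mytid();          /* enroll this process in PVM */
        int tids[4], n, i, msg = 42;

        n = pvm_spawn("worker", NULL, PvmTaskDefault, "", 4, tids);

        for (i = 0; i < n; i++) {
            pvm_initsend(PvmDataDefault); /* pack and send one int to each worker */
            pvm_pkint(&msg, 1, 1);
            pvm_send(tids[i], 1);
        }

        for (i = 0; i < n; i++) {         /* collect one reply per worker */
            pvm_recv(-1, 2);
            pvm_upkint(&msg, 1, 1);
        }

        printf("master %d finished with %d workers\n", mytid, n);
        pvm_exit();
        return 0;
    }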

The ProMAX® processing parameters used for this benchmark:

Minimum output inline = 65
Maximum output inline = 85
Inline output sampling interval = 1
Minimum output xline = 1
Maximum output xline = 200 (fold)
Xline output sampling interval = 1
Antialias inline spacing = 15
Antialias xline spacing = 15
Stretch Mute Aperture Limit with Maximum Stretch = 15
Image Gather Type = Full Offset Image Traces
No Block Moveout
Number of Alias Bands = 10
3D Amplitude Phase Correction
No compression
Maximum Number of Cache Blocks = 500000

Primary PSTM business metrics are typically time-to-solution and accuracy of the subsurface imaging solution.

Key Points and Best Practices

  • Multiple job system throughput scales perfectly; ten concurrent workflows on 10 nodes each completes in the same time and has the same throughput as a single workflow running on one node.
  • Best single workflow scaling is 6.6x using 10 nodes.

    When tasked with processing several similar workflows, while individual time-to-solution will be longer, the most efficient way to run is to fully distribute them one workflow per node (or even across two nodes) and run these concurrently, rather than to use all nodes for each workflow and running consecutively. For example, while the best-case configuration used here will run 6.6 times faster using all ten nodes compared to a single node, ten such 10-node jobs running consecutively will overall take over 50% longer to complete than ten jobs one per node running concurrently.

  • Throughput was seen to scale better with larger workflows. While throughput with both large and small workflows is similar with only one node, the larger dataset exhibits 11% and 35% more throughput with four and 10 nodes, respectively.

  • 200 processes appears to be a scalability asymptote with these workflows on the systems used.
  • Hyperthreading marginally helps throughput. For the largest model run on 10 nodes, 240 processes delivers 11% more performance than with 120 processes.

  • The workflows do not exhibit significant I/O bandwidth demands. Even with 10 concurrent 24-process jobs, the measured aggregate system I/O did not exceed 100 MB/s.

  • 10 GbE was the only network used and, though shared for all interprocess communication and network attached storage, it appears to have sufficient bandwidth for all test cases run.

See Also

Disclosure Statement

The following are trademarks or registered trademarks of Halliburton/Landmark Graphics: ProMAX®, GeoProbe®, OpenWorks®. Results as of 9/1/2011.

Thursday Sep 15, 2011

Sun Fire X4800 M2 Servers (now known as Sun Server X2-8) Produce World Record on SAP SD-Parallel Benchmark

Oracle delivered a world record result on the SAP enhancement package 4 for SAP ERP 6.0 (Unicode) Sales and Distribution - Parallel (SD Parallel) Benchmark, achieving 180,000 users (as of 10/03/2011) using eight of Oracle's Sun Fire X4800 M2 servers (now known as Sun Server X2-8), Oracle Solaris 10 and Oracle Database 11g Real Application Clusters (RAC) software.

  • The eight Sun Fire X4800 M2 servers delivered a world record result of 180,000 users on the SAP SD Parallel Benchmark.

  • The eight Sun Fire X4800 M2 server SD Parallel result of 180,000 users delivered 43% more performance compared to the IBM Power 795 server SD two-tier result of 126,063 users.

Performance Landscape

Selected SAP Sales and Distribution (SD) benchmark results are presented in decreasing order of performance. All benchmarks were using SAP enhancement package 4 for SAP ERP 6.0 (Unicode).

Eight Sun Fire X4800 M2 (each with 8 x Intel Xeon E7-8870 @ 2.4 GHz, 512 GB)
    Oracle Solaris 10, Oracle 11g RAC
    Users: 180,000   SAPS: 1,016,380   Type: Parallel   Cert #: 2011037

Six Sun Fire X4800 M2 (each with 8 x Intel Xeon E7-8870 @ 2.4 GHz, 512 GB)
    Oracle Solaris 10, Oracle 11g RAC
    Users: 137,904   SAPS: 765,470   Type: Parallel   Cert #: 2011038

IBM Power 795 (32 x POWER7 @ 4.0 GHz, 4096 GB)
    AIX 7.1, DB2 9.7
    Users: 126,063   SAPS: 688,630   Type: Two-Tier   Cert #: 2010046

Four Sun Fire X4800 M2 (each with 8 x Intel Xeon E7-8870 @ 2.4 GHz, 512 GB)
    Oracle Solaris 10, Oracle 11g RAC
    Users: 94,736   SAPS: 546,050   Type: Parallel   Cert #: 2011039

Two Sun Fire X4800 M2 (each with 8 x Intel Xeon E7-8870 @ 2.4 GHz, 512 GB)
    Oracle Solaris 10, Oracle 11g RAC
    Users: 49,860   SAPS: 274,080   Type: Parallel   Cert #: 2011040

Four Sun Fire X4470 (each with 4 x Intel Xeon X7560 @ 2.26 GHz, 256 GB)
    Solaris 10, Oracle 11g RAC
    Users: 40,000   SAPS: 221,020   Type: Parallel   Cert #: 2010039

Complete benchmark results and descriptions can be found at the SAP standard application benchmarks website; SD results are published on the Two-Tier, Three-Tier, and SD Parallel results pages.

Configuration and Results Summary

Hardware Configuration:

8 x Sun Fire X4800 M2 servers, each with
8 x Intel Xeon E7-8870 @ 2.4 GHz (8 processors, 80 cores, 160 threads)
512 GB memory

Software Configuration:

SAP enhancement package 4 for SAP ERP 6.0
Oracle Database 11g Real Application Clusters (RAC)
Oracle Solaris 10

Results Summary:

Number of SAP SD benchmark users: 180,000
Average dialog response time: 0.63 seconds
Throughput:
    Fully processed order line items per hour: 20,327,670
    Dialog steps per hour: 60,983,000
    SAPS: 1,016,380
Average database request time (dialog/update): 0.010 sec / 0.055 sec
SAP Certification: 2011037

Benchmark Description

The SAP Standard Application Sales and Distribution - Parallel (SD Parallel) Benchmark is a two-tier ERP business test that is indicative of full business workloads of complete order processing and invoice processing and demonstrates the ability to run both the application and database software on a single system. The SAP Standard Application SD Benchmark represents the critical tasks performed in real-world ERP business environments.

The SD Parallel Benchmark consists of the same transactions and user interaction steps as the two-tier and three-tier SD Benchmark. This means that the SD Parallel Benchmark runs the same business processes as the SD Benchmark. The difference between the benchmarks is the technical data distribution. Additionally, the benchmark requires equal distribution of the benchmark users across all database nodes for the used benchmark clients (round-robin method). Following this rule, all database nodes work on data of all clients. This avoids unrealistic configurations such as having only one client per database node.

The SAP Benchmark Council agreed to give the parallel benchmark a different name so that the difference can be easily recognized by any interested parties - customers, prospects, and analysts. The naming convention is SD Parallel for Sales & Distribution - Parallel.

SAP is one of the premier world-wide ERP application providers, and maintains a suite of benchmark tests to demonstrate the performance of competitive systems on the various SAP products.

See Also

Disclosure Statement

SAP enhancement package 4 for SAP ERP 6.0 (Unicode) Sales and Distribution Benchmark, results as of 10/03/2011.

SD Parallel, 8 x Sun Fire X4800 M2 (each 8 processors, 80 cores, 160 threads) 180,000 SAP SD Users, Oracle Solaris 10, Oracle 11g Real Application Clusters (RAC), Certification Number 2011037.
SD Parallel, 6 x Sun Fire X4800 M2 (each 8 processors, 80 cores, 160 threads) 137,904 SAP SD Users, Oracle Solaris 10, Oracle 11g Real Application Clusters (RAC), Certification Number 2011038.
SD Parallel, 4 x Sun Fire X4470 (each 4 processors, 32 cores, 64 threads) 40,000 SAP SD Users, Oracle Solaris 10, Oracle 11g Real Application Clusters (RAC), Certification Number 2010039.
SD Two-Tier, IBM Power 795 (32 processors, 256 cores, 1024 threads) 126,063 SAP SD Users, AIX 7.1, DB2 9.7, Certification Number 2010046.

SAP, R/3 are registered trademarks of SAP AG in Germany and other countries. More information may be found at www.sap.com/benchmark.

Wednesday Dec 08, 2010

Sun Blade X6275 M2 Cluster with Sun Storage 7410 Performance Running Seismic Processing Reverse Time Migration

This Oil & Gas benchmark highlights both the computational performance improvements of the Sun Blade X6275 M2 server module over the previous generation server module and the linear scalability achievable for total application throughput using a Sun Storage 7410 system to deliver almost 2 GB/sec effective write performance.

Oracle's Sun Storage 7410 system attached via 10 Gigabit Ethernet to a cluster of Oracle's Sun Blade X6275 M2 server modules was used to demonstrate the performance of a 3D VTI Reverse Time Migration application, a heavily used geophysical imaging and modeling application for Oil & Gas Exploration. The total application throughput scaling and computational kernel performance improvements are presented for imaging two production sized grids using 800 input samples.

  • The Sun Blade X6275 M2 server module showed up to a 40% performance improvement over the previous generation server module with super-linear scalability to 16 nodes for the 9-Point Stencil used in this Reverse Time Migration computational kernel.

  • The balanced combination of Oracle's Sun Storage 7410 system over 10 GbE to the Sun Blade X6275 M2 server module cluster showed linear scalability for the total application throughput, including the I/O and MPI communication, to produce a final 3-D seismic depth imaged cube for interpretation.

  • The final image write time from the Sun Blade X6275 M2 server module nodes to Oracle's Sun Storage 7410 system achieved 10GbE line speed of 1.25 GBytes/second or better write performance. The effects of I/O buffer caching on the Sun Blade X6275 M2 server module nodes and 34 GByte write optimized cache on the Sun Storage 7410 system gave up to 1.8 GBytes/second effective write performance.

Performance Landscape

Server Generational Performance Improvements

Performance improvements for the Reverse Time Migration computational kernel using a Sun Blade X6275 M2 cluster are compared to the previous generation Sun Blade X6275 cluster. Hyper-threading was enabled for both configurations allowing 24 OpenMP threads for the Sun Blade X6275 M2 server module nodes and 16 for the Sun Blade X6275 server module nodes.

Sun Blade X6275 M2 Performance Improvements (Kernel Time in seconds)

Grid Size 1243 x 1151 x 1231
Nodes   X6275   X6275 M2   X6275 M2 Speedup
16      306     242        1.3
14      355     271        1.3
12      435     346        1.3
10      541     390        1.4
8       726     555        1.3

Grid Size 2486 x 1151 x 1231
Nodes   X6275   X6275 M2   X6275 M2 Speedup
16      728     576        1.3
14      814     679        1.2
12      945     797        1.2
10      1156    890        1.3
8       1511    1193       1.3

Application Scaling

Performance and scaling results of the total application, including I/O, for the reverse time migration demonstration application are presented. Results were obtained using a Sun Blade X6275 M2 server cluster with a Sun Storage 7410 system for the file server. The servers were running with hyperthreading enabled, allowing for 24 OpenMP threads per server node.

Application Scaling Across Multiple Nodes

Grid Size 1243 x 1151 x 1231
Nodes   Total Time (sec)   Kernel Time (sec)   Total Speedup   Kernel Speedup
16      501                242                 2.1*            2.3*
14      583                271                 1.8             2.0
12      681                346                 1.6             1.6
10      807                390                 1.3             1.4
8       1058               555                 1.0             1.0

Grid Size 2486 x 1151 x 1231
Nodes   Total Time (sec)   Kernel Time (sec)   Total Speedup   Kernel Speedup
16      1060               576                 2.0             2.1*
14      1219               679                 1.7             1.8
12      1420               797                 1.5             1.5
10      1688               890                 1.2             1.3
8       2085               1193                1.0             1.0

* Super-linear scaling due to the compute kernel fitting better into available cache for larger node counts

Image File Effective Write Performance

The performance for writing the final 3D image from the Sun Blade X6275 M2 server cluster over 10 Gigabit Ethernet to a Sun Storage 7410 system is presented. Each node allocated one core to MPI I/O, allowing 22 OpenMP compute threads per node with hyper-threading enabled. Captured performance analytics from the Sun Storage 7410 system indicate effective use of its 34 Gigabyte write optimized cache.

Image File Effective Write Performance

Grid Size 1243 x 1151 x 1231
Nodes   Write Time (sec)   Write Performance (GB/sec)
16      4.8                1.5
14      5.0                1.4
12      4.0                1.8
10      4.3                1.6
8       4.6                1.5

Grid Size 2486 x 1151 x 1231
Nodes   Write Time (sec)   Write Performance (GB/sec)
16      10.2               1.4
14      10.2               1.4
12      11.3               1.3
10      9.1                1.6
8       9.7                1.5

Note: Performance results better than 1.3 GB/sec are related to I/O buffer caching on the server nodes.

Configuration Summary

Hardware Configuration:

8 x Sun Blade X6275 M2 server modules (2 nodes per module, 16 nodes total), each node with
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory (12 x 4 GB at 1333 MHz)
1 x QDR InfiniBand Host Channel Adapter

Sun Datacenter InfiniBand Switch IB-36
Sun Network 10 GbE Switch 72p

Sun Storage 7410 system connected via 10 Gigabit Ethernet
4 x 17 GB STEC ZeusIOPs SSD mirrored - 34 GB
40 x 750 GB 7500 RPM Seagate SATA disks mirrored - 14.4 TB
No L2ARC Readzilla Cache

Software Configuration:

Oracle Enterprise Linux Server release 5.5
Oracle Message Passing Toolkit 8.2.1c (for MPI)
Oracle Solaris Studio 12.2 C++, Fortran, OpenMP

Benchmark Description

This Vertical Transverse Isotropy (VTI) Anisotropic Reverse Time Depth Migration (RTM) application measures the total time it takes to image 800 samples of various production size grids and write the final image to disk for the next work flow step involving 3-D seismic volume interpretation. In doing so, it reports the compute, interprocessor communication, and I/O performance of the individual functions that comprise the total solution. Unlike most references for the Reverse Time Migration, that focus solely on the performance of the 3D stencil compute kernel, this demonstration code additionally reports the total throughput involved in processing large data sets with a full 3D Anisotropic RTM application. It provides valuable insight into configuration and sizing for specific seismic processing requirements. The performance effects of new processors, interconnects, I/O subsystems, and software technologies can be evaluated while solving a real Exploration business problem.

This benchmark study uses the "in-core" implementation of this demonstration code, where each node reads in only the trace, velocity, and conditioning data to be processed by that node plus a 4-element array pad (based on spatial order 8) shared with its neighbors to the left and right during the initialization phase. It maintains previous, current, and next wavefield state information for each of the source, receiver, and anisotropic wavefields in memory. The second two grid dimensions used in this benchmark are specifically chosen to be prime numbers to exaggerate the effects of data alignment. Algorithm adaptations for processing higher orders in space and alternative "out-of-core" solutions using SSDs for wave state checkpointing are implemented in this demonstration application to better understand the effects of problem size scaling. Care is taken to handle absorption boundary conditioning and a variety of imaging conditions appropriately.
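
For orientation, the sketch below shows the shape of the wavefield update at the heart of such a kernel: an 8th-order-in-space, 2nd-order-in-time step for one interior point. The coefficients are the standard 8th-order central-difference weights, but the isotropic simplification, array layout and function boundaries are assumptions for illustration; the actual VTI kernel in the demonstration code is more involved.

    /* Illustrative 8th-order-in-space, 2nd-order-in-time acoustic update
     * for one interior point.  Not the demonstration code's VTI kernel. */
    static const float c[5] = {   /* standard 8th-order Laplacian weights */
        -205.0f/72.0f, 8.0f/5.0f, -1.0f/5.0f, 8.0f/315.0f, -1.0f/560.0f
    };

    /* u_prev/u_cur/u_next: wavefield states; v: velocity; i: linear index
     * into a padded grid with z varying fastest. */
    void step_point(const float *u_prev, const float *u_cur, float *u_next,
                    const float *v, float dt, float h,
                    long i, long nz, long ny)
    {
        long sz = 1, sy = nz, sx = nz * ny;   /* strides for z-fastest layout */
        float lap = 3.0f * c[0] * u_cur[i];   /* center point counted once per axis */
        for (int k = 1; k <= 4; k++) {
            lap += c[k] * (u_cur[i + k*sz] + u_cur[i - k*sz]
                         + u_cur[i + k*sy] + u_cur[i - k*sy]
                         + u_cur[i + k*sx] + u_cur[i - k*sx]);
        }
        lap /= (h * h);                       /* uniform grid spacing assumed */
        u_next[i] = 2.0f*u_cur[i] - u_prev[i] + v[i]*v[i]*dt*dt*lap;
    }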

RTM Application Structure:

Read Processing Parameter File, Determine Domain Decomposition, and Initialize Data Structures, and Allocate Memory.

Read Velocity, Epsilon, and Delta Data Based on Domain Decomposition and create source, receiver, & anisotropic previous, current, and next wave states.

First Loop over Time Steps

Compute 3D Stencil for Source Wavefield (a,s) - 8th order in space, 2nd order in time
Propagate over Time to Create s(t,z,y,x) & a(t,z,y,x)
Inject Estimated Source Wavelet
Apply Absorption Boundary Conditioning (a)
Update Wavefield States and Pointers
Write Snapshot of Wavefield (out-of-core) or Push Wavefield onto Stack (in-core)
Communicate Boundary Information

Second Loop over Time Steps
Compute 3D Stencil for Receiver Wavefield (a,r) - 8th order in space, 2nd order in time
Propagate over Time to Create r(t,z,y,x) & a(t,z,y,x)
Read Receiver Trace and Inject Receiver Wavelet
Apply Absorption Boundary Conditioning (a)
Update Wavefield States and Pointers
Communicate Boundary Information
Read in Source Wavefield Snapshot (out-of-core) or Pop Off of Stack (in-core)
Cross-correlate Source and Receiver Wavefields
Update image using image conditioning parameters

Write 3D Depth Image i(z,x,y) = Sum over time steps s(t,z,x,y) * r(t,z,x,y) or other imaging conditions.

Key Points and Best Practices

This demonstration application represents a full Reverse Time Migration solution. Many references to the RTM application tend to focus on the compute kernel and ignore the complexity that the input, communication, and output bring to the task.

Image File MPI Write Performance Tuning

Changing the Image File Write from MPI non-blocking to MPI blocking and setting Oracle Message Passing Toolkit MPI environment variables revealed an 18x improvement in write performance to the Sun Storage 7410 system going from:

    86.8 to 4.8 seconds for the 1243 x 1151 x 1231 grid size
    183.1 to 10.2 seconds for the 2486 x 1151 x 1231 grid size

The Swat Sun Storage 7410 analytics data capture indicated an initial write performance of about 100 MB/sec with the MPI non-blocking implementation. After modifying to MPI blocking writes, Swat showed between 1.3 and 1.8 GB/sec with up to 13000 write ops/sec to write the final output image. The Swat results are consistent with the actual measured performance and provide valuable insight into the Reverse Time Migration application I/O performance.

The reason for this vast improvement has to do with whether the MPI file mode is sequential or not (MPI_MODE_SEQUENTIAL, O_SYNC, O_DSYNC). The MPI non-blocking routines, MPI_File_iwrite_at and MPI_Wait, typically used for overlapping I/O and computation, do not support sequential file access mode, so the application could not take full performance advantage of the Sun Storage 7410 system write optimized cache. In contrast, the MPI blocking routine, MPI_File_write_at, defaults to MPI sequential mode and the performance advantages of the write optimized cache are realized. Since writing the final image occurs at the end of the RTM execution, there is no need to overlap this I/O with computation.

Additional MPI parameters used:

    setenv SUNW_MP_PROCBIND true
    setenv MPI_SPIN 1
    setenv MPI_PROC_BIND 1
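
A minimal sketch of the two write paths discussed above is shown below, contrasting the non-blocking MPI_File_iwrite_at/MPI_Wait pair with the blocking MPI_File_write_at call. The file name, datatype and offsets are placeholders, not the application's actual I/O layout.

    /* Sketch of blocking versus non-blocking MPI-IO image writes. */
    #include <mpi.h>

    void write_image(float *image, MPI_Offset offset, int count, int use_blocking)
    {
        MPI_File fh;
        MPI_Status st;
        MPI_Request req;

        MPI_File_open(MPI_COMM_WORLD, "image.bin",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        if (use_blocking) {
            /* Blocking write: in this application it ran ~18x faster because it
             * could exploit the filer's write-optimized cache. */
            MPI_File_write_at(fh, offset, image, count, MPI_FLOAT, &st);
        } else {
            /* Non-blocking write intended to overlap I/O with computation. */
            MPI_File_iwrite_at(fh, offset, image, count, MPI_FLOAT, &req);
            /* ... computation could go here ... */
            MPI_Wait(&req, &st);
        }

        MPI_File_close(&fh);
    }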

Adjusting the Level of Multithreading for Performance

The level of multithreading (8, 10, 12, 22, or 24) for the various components of the RTM should be adjustable based on the type of computation taking place. It is best to use the OpenMP num_threads clause to set the level of multithreading for each particular work task, and to use numactl to specify how threads are allocated to cores in accordance with the OpenMP parallelism level.
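
A small sketch of that practice, assuming two hypothetical work phases, is shown below; the thread counts are examples of the levels mentioned above, not tuned values.

    /* Per-phase OpenMP thread counts via the num_threads clause. */
    #include <omp.h>

    void compute_phase(float *a, long n)
    {
        #pragma omp parallel for num_threads(22)   /* heavy stencil-style work */
        for (long i = 0; i < n; i++)
            a[i] *= 2.0f;
    }

    void io_prep_phase(float *a, long n)
    {
        #pragma omp parallel for num_threads(8)    /* lighter per-node task */
        for (long i = 0; i < n; i++)
            a[i] += 1.0f;
    }

The resulting binary can then be launched under numactl (for example, numactl --cpunodebind=0,1 ./rtm) so that the operating system places the threads on the intended cores.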

See Also

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 12/07/2010.

Sun Blade X6275 M2 Delivers Best Fluent (MCAE Application) Performance on Tested Configurations

This Manufacturing Engineering benchmark highlights the performance advantage the Sun Blade X6275 M2 server module offers over IBM, Cray, and SGI solutions as shown by the ANSYS FLUENT fluid dynamics application.

A cluster of eight of Oracle's Sun Blade X6275 M2 server modules delivered outstanding performance running the FLUENT 12 benchmark test suite.

  • The Sun Blade X6275 M2 server module cluster delivered the best results in all 36 of the test configurations run, outperforming the best posted results by as much as 42%.
  • The Sun Blade X6275 M2 server module demonstrated up to 76% performance improvement over the previous generation Sun Blade X6275 server module.

Performance Landscape

In the following tables, results are "Ratings" (bigger is better).
Rating = No. of sequential runs of test case possible in 1 day: 86,400/(Total Elapsed Run Time in Seconds)

The following table compares results on the basis of core count, irrespective of processor generation. This means that in some cases, i.e., for the 32-core and 64-core configurations, systems with the Intel Xeon X5670 six-core processors did not utilize quite all of the cores available for the specified processor count.


FLUENT 12 Benchmark Test Suite - Competitive Comparisons
(Ratings by benchmark test case; bigger is better)

System               Procs  Cores  eddy 417k  turbo 500k  aircraft 2m  sedan 4m  truck 14m  truck_poly 14m
Sun Blade X6275 M2     16     96     9340.5    39272.7      8307.7      8533.3     903.8       786.9
Best Posted            24     96          -          -           -      7562.4     797.0       712.9
Best Posted            16     96     7337.6    33553.4      6533.1      5989.6     739.1       683.5
Sun Blade X6275 M2     11     64     6306.6    27212.6      5592.2      5158.2     568.8       518.9
Best Posted            16     64     5556.3    26381.7      5494.4      4902.1     566.6       518.6
Sun Blade X6275 M2      8     48     4620.3    19093.9      4080.3      3251.2     376.0       359.4
Best Posted             8     48     4494.1    18989.0      3990.8      3185.3     372.7       354.5
Sun Blade X6275 M2      6     32     4061.1    15091.7      3275.8      3013.1     299.5       267.8
Best Posted             8     32     3404.9    14832.6      3211.9      2630.1     286.7       266.7
Sun Blade X6275 M2      4     24     2751.6    10441.1      2161.4      1907.3     188.2       182.5
Best Posted             6     24     1458.2     9626.7      1820.9      1747.2     185.1       180.8
Best Posted             4     24     2565.7    10164.7      2109.9      1608.2     187.1       180.8
Sun Blade X6275 M2      2     12     1429.9     5358.1      1097.5       813.2      95.9        95.9
Best Posted             2     12     1338.0     5308.8      1073.3       808.6      92.9        94.4



The following table compares results on the basis of processor count showing inter-generational processor performance improvement.


FLUENT 12 Benchmark Test Suite - Intergenerational Comparisons
(Ratings by benchmark test case; bigger is better)

System               Procs  Cores  eddy 417k  turbo 500k  aircraft 2m  sedan 4m  truck 14m  truck_poly 14m
Sun Blade X6275 M2     16     96     9340.5    39272.7      8307.7      8533.3     903.8       786.9
Sun Blade X6275        16     64     5308.8    26790.7      5574.2      5074.9     547.2       525.2
X6275 M2 : X6275       16            1.76      1.47         1.49        1.68       1.65        1.50

Sun Blade X6275 M2      8     48     4620.3    19093.9      4080.3      3251.2     376.0       359.4
Sun Blade X6275         8     32     3066.5    13768.9      3066.5      2602.4     289.0       270.3
X6275 M2 : X6275        8            1.51      1.39         1.33        1.25       1.30        1.33

Sun Blade X6275 M2      4     24     2751.6    10441.1      2161.4      1907.3     188.2       182.5
Sun Blade X6275         4     16     1714.3     7545.9      1519.1      1345.8     144.4       141.8
X6275 M2 : X6275        4            1.61      1.38         1.42        1.42       1.30        1.29

Sun Blade X6275 M2      2     12     1429.9     5358.1      1097.5       813.2      95.9        95.9
Sun Blade X6275         2      8      931.8     4061.1       827.2       681.5      73.0        73.8
X6275 M2 : X6275        2            1.53      1.32         1.33        1.19       1.31        1.30

Configuration Summary

Hardware Configuration:

8 x Sun Blade X6275 M2 server modules, each with
4 Intel Xeon X5670 2.93 GHz processors, turbo enabled
96 GB memory 1333 MHz
2 x 24 GB SATA-based Sun Flash Modules
2 x QDR InfiniBand Host Channel Adapter
Sun Datacenter InfiniBand Switch IB-36

Software Configuration:

Oracle Enterprise Linux Enterprise Server 5.5
ANSYS FLUENT V12.1.2
ANSYS FLUENT Benchmark Test Suite

Benchmark Description

The following description is from the ANSYS FLUENT website:

The FLUENT benchmarks suite comprises of a set of test cases covering a large range of mesh sizes, physical models and solvers representing typical industry usage. The cases range in size from a few 100 thousand cells to more than 100 million cells. Both the segregated and coupled implicit solvers are included, as well as hexahedral, mixed and polyhedral cell cases. This broad coverage is expected to demonstrate the breadth of FLUENT performance on a variety of hardware platforms and test cases.

The performance of a CFD code will depend on several factors, including size and topology of the mesh, physical models, numerics and parallelization, compilers and optimization, in addition to performance characteristics of the hardware where the simulation is performed. The principal objective of this benchmark suite is to provide comprehensive and fair comparative information of the performance of FLUENT on available hardware platforms.

About the ANSYS FLUENT 12 Benchmark Test Suite

    CFD models tend to be very large where grid refinement is required to capture with accuracy conditions in the boundary layer region adjacent to the body over which flow is occurring. Fine grids are also required to determine accurate turbulence conditions. As such, these models can run for many hours or even days, even when using a large number of processors.

Key Points and Best Practices

  • ANSYS FLUENT has not yet been certified by the vendor on Oracle Enterprise Linux (OEL). However, the ANSYS FLUENT benchmark tests have been run successfully on Oracle hardware running OEL as is (i.e. with NO changes or modifications).
  • The performance improvement of the Sun Blade X6275 M2 server module over the previous generation Sun Blade X6275 server module was due to two main factors: the increased core count per processor (6 vs. 4), and the more optimal, iterative dataset partitioning scheme used for the Sun Blade X6275 M2 server module.

See Also

Disclosure Statement

All information on the FLUENT website (http://www.fluent.com) is Copyrighted 1995-2010 by ANSYS Inc. Results as of December 06, 2010.

Tuesday Oct 26, 2010

3D VTI Reverse Time Migration Scalability On Sun Fire X2270-M2 Cluster with Sun Storage 7210

This Oil & Gas benchmark shows the Sun Storage 7210 system delivers almost 2 GB/sec bandwidth and realizes near-linear scaling performance on a cluster of 16 Sun Fire X2270 M2 servers.

Oracle's Sun Storage 7210 system attached via QDR InfiniBand to a cluster of sixteen of Oracle's Sun Fire X2270 M2 servers was used to demonstrate the performance of a Reverse Time Migration application, an important application in the Oil & Gas industry. The total application throughput and computational kernel scaling are presented for two production sized grids of 800 samples.

  • Both the Reverse Time Migration I/O and combined computation shows near-linear scaling from 8 to 16 nodes on the Sun Storage 7210 system connected via QDR InfiniBand to a Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231: 2.0x improvement
      2486 x 1151 x 1231: 1.7x improvement
  • The computational kernel of the Reverse Time Migration has linear to super-linear scaling from 8 to 16 nodes in Oracle's Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231 : 2.2x improvement
      2486 x 1151 x 1231 : 2.0x improvement
  • Intel Hyper-Threading provides additional performance benefits to both the Reverse Time Migration I/O and computation when going from 12 to 24 OpenMP threads on the Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231: 8% - computational kernel; 2% - total application throughput
      2486 x 1151 x 1231: 12% - computational kernel; 6% - total application throughput
  • The Sun Storage 7210 system delivers the Velocity, Epsilon, and Delta data to the Reverse Time Migration at a steady rate even when timing includes memory initialization and data object creation:

      1243 x 1151 x 1231: 1.4 to 1.6 GBytes/sec
      2486 x 1151 x 1231: 1.2 to 1.3 GBytes/sec

    One can see that when doubling the size of the problem, the additional complexity of overlapping I/O and multiple node file contention only produces a small reduction in read performance.

Performance Landscape

Application Scaling

Performance and scaling results of the total application, including I/O, for the reverse time migration demonstration application are presented. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server. The servers were running with hyperthreading enabled, allowing for 24 OpenMP threads per server.

Application Scaling Across Multiple Nodes
Number Nodes Grid Size - 1243 x 1151 x 1231 Grid Size - 2486 x 1151 x 1231
Total Time (sec) Kernel Time (sec) Total Speedup Kernel Speedup Total Time (sec) Kernel Time (sec) Total Speedup Kernel Speedup
16 504 259 2.0 2.2* 1024 551 1.7 2.0
14 565 279 1.8 2.0 1191 677 1.5 1.6
12 662 343 1.6 1.6 1426 817 1.2 1.4
10 784 394 1.3 1.4 1501 856 1.2 1.3
8 1024 560 1.0 1.0 1745 1108 1.0 1.0

* Super-linear scaling due to the compute kernel fitting better into available cache
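
For clarity, the speedup columns above are simply the 8-node run time divided by the run time at the larger node count. A minimal Python sketch (the helper name is illustrative; the times are taken from the table):

    # Reproduce the speedup columns from the measured run times.
    def speedup(baseline_time_sec, time_sec):
        # Speedup relative to the 8-node baseline run (bigger is better).
        return baseline_time_sec / time_sec

    # Grid 1243 x 1151 x 1231, 16 nodes vs. the 8-node baseline
    print(round(speedup(1024, 504), 1))  # total application: 2.0
    print(round(speedup(560, 259), 1))   # compute kernel: 2.2 (super-linear)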

Application Scaling – Hyper-Threading Study

The effects of hyperthreading are presented when running the reverse time migration demonstration application. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server.

Hyper-Threading Comparison – 12 versus 24 OpenMP Threads
Number Nodes Threads per Node Grid Size - 1243 x 1151 x 1231 Grid Size - 2486 x 1151 x 1231
Total Time (sec) Kernel Time (sec) Total HT Speedup Kernel HT Speedup Total Time (sec) Kernel Time (sec) Total HT Speedup Kernel HT Speedup
16 24 504 259 1.02 1.08 1024 551 1.06 1.12
16 12 515 279 1.00 1.00 1088 616 1.00 1.00

Read Performance

Read performance is presented for the velocity, epsilon and delta files running the reverse time migration demonstration application. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server. The servers were running with hyperthreading enabled, allowing for 24 OpenMP threads per server.

Velocity, Epsilon, and Delta File Read and Memory Initialization Performance
Number Nodes Overlap MBytes Read Grid Size - 1243 x 1151 x 1231 Grid Size - 2486 x 1151 x 1231
Time (sec) Time Relative 8-node Total GBytes Read Read Rate GB/s Time (sec) Time Relative 8-node Total GBytes Read Read Rate GB/s
16 2040 16.7 1.1 23.2 1.4 36.8 1.1 44.3 1.2
8 951 14.8 1.0 22.1 1.6 33.0 1.0 43.2 1.3

Configuration Summary

Hardware Configuration:

16 x Sun Fire X2270 M2 servers, each with
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory (12 x 4 GB at 1333 MHz)

Sun Storage 7210 system connected via QDR InfiniBand
2 x 18 GB SATA SSD (logzilla)
40 x 1 TB 7200 RPM SATA disks

Software Configuration:

SUSE Linux Enterprise Server 10 SP 2
Oracle Message Passing Toolkit 8.2.1 (for MPI)
Sun Studio 12 Update 1 C++, Fortran, OpenMP

Benchmark Description

This Reverse Time Migration (RTM) demonstration application measures the total time it takes to image 800 samples of various production size grids and write the final image to disk. In this version, each node reads in only the trace, velocity, and conditioning data to be processed by that node plus a four element inline 3-D array pad (spatial order of eight) shared with its neighbors to the left and right during the initialization phase. It represents a full RTM application including the data input, computation, communication, and final output image to be used by the next work flow step involving 3D volumetric seismic interpretation.

Key Points and Best Practices

This demonstration application represents a full Reverse Time Migration solution. Many references to the RTM application tend to focus on the compute kernel and ignore the complexity that the input, communication, and output bring to the task.

I/O Characterization without Optimal Checkpointing

Velocity, Epsilon, and Delta Files - Grid Reading

The additional amount of overlapping reads to share velocity, epsilon, and delta edge data with neighbors can be calculated using the following equation:

    (number_nodes - 1) x (order_in_space) x (y_dimension) x (z_dimension) x (4 bytes) x (3 files)

For this particular benchmark study, the additional 3-D pad overlap for the 16 and 8 node cases is:

    16 nodes: 15 x 8 x 1151 x 1231 x 4 x 3 = 2.04 GB extra
    8 nodes: 7 x 8 x 1151 x 1231 x 4 x 3 = 0.95 GB extra

For the first of the two test cases, the total size of the three files used for the 1243 x 1151 x 1231 case is

    1243 x 1151 x 1231 x 4 bytes = 7.05 GB per file x 3 files = 21.13 GB

With the additional 3-D pad, the total amount of data read is:

    16 nodes: 2.04 GB + 21.13 GB = 23.2 GB
    8 nodes: 0.95 GB + 21.13 GB = 22.1 GB

For the second of the two test cases, the total size of the three files used for the 2486 x 1151 x 1231 case is

    2486 x 1151 x 1231 x 4 bytes = 14.09 GB per file x 3 files = 42.27 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: 2.04 GB + 42.27 GB = 44.3 GB
    8 nodes: 0.95 GB + 42.27 GB = 43.2 GB

Note that the amount of overlapping data read increases, not only by the number of nodes, but as the y dimension and/or the z dimension increases.
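
The grid-file sizing above can be reproduced with a short Python sketch. The function names are illustrative; the constants are the ones given in the formulas of this section, and sizes are reported in decimal gigabytes as in the text:

    GB = 1e9  # decimal gigabytes, as used in the text

    def pad_overlap_bytes(nodes, order_in_space, y, z, files=3):
        # Extra bytes read so neighbors can share edge data (4-byte samples, 3 files).
        return (nodes - 1) * order_in_space * y * z * 4 * files

    def grid_files_bytes(x, y, z, files=3):
        # Combined size of the velocity, epsilon and delta files (4-byte samples).
        return x * y * z * 4 * files

    y, z = 1151, 1231
    for nodes in (16, 8):
        pad = pad_overlap_bytes(nodes, 8, y, z)
        total = pad + grid_files_bytes(1243, y, z)
        print(nodes, round(pad / GB, 2), round(total / GB, 1))
    # 16 nodes: 2.04 GB extra, 23.2 GB total; 8 nodes: 0.95 GB extra, 22.1 GB total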

Trace Reading

The additional amount of overlapping reads to share trace edge data with neighbors can be calculated using the following equation:

    (number_nodes - 1) x (order_in_space) x (y_dimension) x (4 bytes) x (number_of_time_slices)

For this particular benchmark study, the additional overlap for the 16 and 8 node cases is:

    16 nodes: 15 x 8 x 1151 x 4 x 800 = 442 MB extra
    8 nodes: 7 x 8 x 1151 x 4 x 800 = 206 MB extra

For the first case the size of the trace data file used for the 1243 x 1151 x 1231 case is

    1243 x 1151 x 4 bytes x 800 = 4.578 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: 0.442 GB + 4.578 GB = 5.0 GB
    8 nodes: 0.206 GB + 4.578 GB = 4.8 GB

For the second case the size of the trace data file used for the 2486 x 1151 x 1231 case is

    2486 x 1151 x 4 bytes x 800 = 9.156 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: 0.442 GB + 9.156 GB = 9.6 GB
    8 nodes: 0.206 GB + 9.156 GB = 9.4 GB

As the number of nodes is increased, the overlap causes more disk lock contention.
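
The trace-read sizing follows the same pattern. A brief Python sketch with illustrative names, using the formula and constants above:

    def trace_overlap_bytes(nodes, order_in_space, y, time_slices):
        # Extra trace bytes read to share edge data with neighboring nodes.
        return (nodes - 1) * order_in_space * y * 4 * time_slices

    def trace_file_bytes(x, y, time_slices):
        # Size of the base trace data file (4-byte samples).
        return x * y * 4 * time_slices

    y, slices = 1151, 800
    total = trace_overlap_bytes(16, 8, y, slices) + trace_file_bytes(1243, y, slices)
    print(round(total / 1e9, 1))  # 16 nodes, 1243 x 1151 x 1231 case: about 5.0 GB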

Writing Final Output Image

1243x1151x1231 - 7.1 GB per file:

    16 nodes: 78 x 1151 x 1231 x 4 = 442MB/node (7.1 GB total)
    8 nodes: 156 x 1151 x 1231 x 4 = 884MB/node (7.1 GB total)

2486x1151x1231 - 14.1 GB per file:

    16 nodes: 156 x 1151 x 1231 x 4 = 930 MB/node (14.1 GB total)
    8 nodes: 311 x 1151 x 1231 x 4 = 1808 MB/node (14.1 GB total)

Resource Allocation

It is best to allocate one node as the Oracle Grid Engine resource scheduler and MPI master host. This is especially true when running with 24 OpenMP threads in hyperthreading mode to avoid oversubscribing a node that is cooperating in delivering the solution.

See Also

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 10/20/2010.

Thursday Sep 30, 2010

Consolidation of 30 x86 Servers onto One SPARC T3-2

One of Oracle's SPARC T3-2 servers was able to consolidate the database workloads off of thirty older x86 servers in a secure virtualized environment.

  • The thirty x86 servers required 6.7 times more power than the consolidated workload on the SPARC T3-2 server.

  • The x86 configuration used 10 times more rack space than the consolidated workload on the SPARC T3-2 server.

  • In addition to power & space considerations, there are also administrative cost savings resulting from having to manage just one server, as opposed to thirty servers.

  • Gartner says, "They need to realize that removing a single x86 server from a data center will result in savings of more than $400 a year in energy costs alone".

  • The total transaction throughput for the SPARC T3-2 server (132,000 transactions per minute) was almost the same as the aggregate throughput of the thirty x86 servers (138,000), with each x86 server running at 10% utilization.

  • The average transaction response time on the SPARC T3-2 server (24 ms) was just a little higher than the average transaction response time on the Intel servers (19.5 ms).

Performance Landscape

System   Oracle Instances   Average System Utilization   Transactions/min/system   Average Response Time (ms)   Watts/system   OS
Sun Fire X4250 (2x 3.0 GHz Xeon)   1   10%   4,600   19.5   320   Linux
SPARC T3-2 (1x 1.65 GHz SPARC T3)   30   80%   132,000   24.0   1,400*   Solaris

* Power consumption includes storage and peripheral devices.

Notes:
total throughput for 30 Intel systems = 30 x 4,600 = 138,000
total watts for 30 Intel systems = 30 x 320 = 9,600

Results and Configuration Summary

x86 Server Configuration:

30 x Sun Fire X4250 servers, each with
2 x 3.0 GHz Intel Xeon E5450 processors
16 GB memory
6 x internal 146 GB 15K SAS disks
RedHat Linux 5.3
Oracle Database 11g Release 2

SPARC T3 Server Configuration:

1 x SPARC T3-2 server
2 x 1.65 GHz SPARC T3 processors
256 GB memory
2 x 300 GB 10K RPM internal SAS disks
1 x Sun Storage F5100 Flash Array storage
1 x Sun Fire X4270 server as COMSTAR target
Oracle Solaris 10 9/10
Oracle Database 11g Release 2

Benchmark Description

This demonstration was designed to show the benefits of virtualization when upgrading from older x86 systems to one of Oracle's T-series servers. A 30:1 consolidation was shown, moving from thirty x86 Linux servers to a single T-series server running Oracle Solaris in a secure virtualized environment. After the consolidation, there was still 20% headroom on the SPARC T3-2 server for additional growth in the workload.

The 200 scale iGen OLTP workload was used to test the consolidation. Each x86 system was loaded with iGen clients up to a level of 10% CPU utilization. This load level for x86 systems is typically found in many data centers.

Thirty Oracle Solaris zones (containers) were created on the SPARC T3-2 server, with each zone configured identically to the Oracle configuration on the x86 server. The throughput on each zone was ramped up to the same level as on the baseline Intel server.

The overall CPU utilization on the SPARC T3-2 server and the average iGen transaction response times were then measured, along with the power consumption.

Key Points and Best Practices

  • Each Oracle Solaris container was assigned to a processor set consisting of eight virtual CPUs. This use of processor sets was critical to obtaining the reported performance. Without processor sets, performance dropped to about one-half the reported number.

  • Once the first container was completely configured (with Oracle 11g and iGen installed), the remaining containers were created by a simple cloning procedure, which took no more than a few minutes for each container.

  • Setting up a standalone x86 server with Linux, Oracle and iGen is a far more time consuming task than setting up additional containers once the first container has been created.

See Also

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 9/20/2010.

Tuesday Sep 21, 2010

ProMAX Performance and Throughput on Sun Fire X2270 and Sun Storage 7410

Halliburton/Landmark's ProMAX 3D Prestack Kirchhoff Time Migration's single job scalability and multiple job throughput using various scheduling methods are evaluated on a cluster of Oracle's Sun Fire X2270 servers attached via QDR InfiniBand to Oracle's Sun Storage 7410 system.

Two resource scheduling methods, compact and distributed, are compared while increasing the system load with additional concurrent ProMAX jobs.

  • A single ProMAX job has near linear scaling of 5.5x on 6 nodes of a Sun Fire X2270 cluster.

  • A single ProMAX job has near linear scaling of 7.5x on a Sun Fire X2270 server when running from 1 to 8 threads.

  • ProMAX can take advantage of Oracle's Sun Storage 7410 system in place of dedicated local disks: there was no significant difference in run time observed when running up to 8 concurrent 16-thread jobs.

  • The 8-thread ProMAX job throughput using the distributed scheduling method is equivalent to or slightly faster than the compact scheme for 1 to 4 concurrent jobs.

  • The 16-thread ProMAX job throughput using the distributed scheduling method is up to 8% faster when compared to the compact scheme on an 8-node Sun Fire X2270 cluster.

The multiple job throughput characterization revealed in this benchmark study is key to pre-configuring Oracle Grid Engine resource scheduling for ProMAX on a Sun Fire X2270 cluster and provides valuable insight for server consolidation.

Performance Landscape

Single Job Scaling

Single job performance on a single node is near linear up to the number of cores in the node, i.e. 2 Intel Xeon X5570 processors with 4 cores each. With hyperthreading (2 active threads per core) enabled, more ProMAX threads are used, increasing the load on the CPU's memory architecture and causing the reduced speedups.
ProMAX single job performance on the 6-node cluster shows near linear speedup node to node.
Single Job 6-Node Scalability
Hyperthreading Enabled - 16 Threads/Node Maximum
Number of Nodes Threads Per Node Speedup to 1 Thread Speedup to 1 Node
6 16 54.2 5.5
4 16 36.2 3.6
3 16 26.1 2.6
2 16 17.6 1.8
1 16 10.0 1.0
1 14 9.2
1 12 8.6
1 10 7.2*
1 8 7.5
1 6 5.9
1 4 3.9
1 3 3.0
1 2 2.0
1 1 1.0

* 2 threads contend with two master node daemons

Multiple Job Throughput Scaling, Compact Scheduling

With the Sun Storage 7410 system, performance of 8 concurrent jobs on the cluster using compact scheduling is equivalent to a single job.

Multiple Job Throughput Scalability
Hyperthreading Enabled - 16 Threads/Node Maximum
Number of Jobs Number of Nodes per Job Threads Per Node per Job Performance Relative to 1 Job Total Nodes Percent Cluster Used
1 1 16 1.00 1 13
2 1 16 1.00 2 25
4 1 16 1.00 4 50
8 1 16 1.00 8 100

Multiple 8-Thread Job Throughput Scaling, Compact vs. Distributed Scheduling

These results compare several distributed scheduling configurations against compact scheduling baselines of 1, 2, and 4 concurrent jobs.

Multiple 8-Thread Job Scheduling
HyperThreading Enabled - Use 8 Threads/Node Maximum
Number of Jobs Number of Nodes per Job Threads Per Node per Job Performance Relative to 1 Job Total Nodes Total Threads per Node Used Percent of PVM Master 8 Threads Used
1 1 8 1.00 1 8 100
1 4 2 1.01 4 2 25
1 8 1 1.01 8 1 13

2 1 8 1.00 2 8 100
2 4 2 1.01 4 4 50
2 8 1 1.01 8 2 25

4 1 8 1.00 4 8 100
4 4 2 1.00 4 8 100
4 8 1 1.01 8 4 100

Multiple 16-Thread Job Throughput Scaling, Compact vs. Distributed Scheduling

The results are reported relative to the performance of 1, 2, 4, and 8 concurrent 2-node, 8-thread jobs.

Multiple 16-Thread Job Scheduling
HyperThreading Enabled - 16 Threads/Node Available
Number of Jobs Number of Nodes per Job Threads Per Node per Job Performance Relative to 1 Job Total Nodes Total Threads per Node Used Percent of PVM Master 16 Threads Used
1 1 16 0.66 1 16 100*
1 2 8 1.00 2 8 50
1 4 4 1.03 4 4 25
1 8 2 1.06 8 2 13

2 1 16 0.70 2 16 100*
2 2 8 1.00 4 8 50
2 4 4 1.07 8 4 25
2 8 2 1.08 8 4 25

4 1 16 0.74 4 16 100*
4 4 4 0.74 4 16 100*
4 2 8 1.00 8 8 50
4 4 4 1.05 8 8 50
4 8 2 1.04 8 8 50

8 1 16 1.00 8 16 100*
8 4 4 1.00 8 16 100*
8 8 2 1.00 8 16 100*

* master PVM host; running 20 to 21 total threads (over-subscribed)

Results and Configuration Summary

Hardware Configuration:

8 x Sun Fire X2270 servers, each with
2 x 2.93 GHz Intel Xeon X5570 processors
48 GB memory at 1333 MHz
1 x 500 GB SATA
Sun Storage 7410 system
4 x 2.3 GHz AMD Opteron 8356 processors
128 GB memory
2 x internal 233 GB SAS drives (466 GB total)
2 x internal 93 GB read-optimized SSDs (186 GB total)
1 x external Sun Storage J4400 array with 22 x 1 TB SATA drives and 2 x 18 GB write-optimized SSDs
11 TB mirrored data and mirrored write optimized SSD

Software Configuration:

SUSE Linux Enterprise Server 10 SP 2
Parallel Virtual Machine 3.3.11
Oracle Grid Engine
Intel 11.1 Compilers
OpenWorks Database requires Oracle 10g Enterprise Edition
Libraries: pthreads 2.4, Java 1.6.0_01, BLAS, Stanford Exploration Project Libraries

Benchmark Description

The ProMAX family of seismic data processing tools is the most widely used Oil and Gas Industry seismic processing application. ProMAX is used for multiple applications, from field processing and quality control, to interpretive project-oriented reprocessing at oil companies and production processing at service companies. ProMAX is integrated with Halliburton's OpenWorks Geoscience Oracle Database to index prestack seismic data and populate the database with processed seismic.

This benchmark evaluates single job scalability and multiple job throughput of the ProMAX 3D Prestack Kirchhoff Time Migration while processing the Halliburton benchmark data set containing 70,808 traces with an 8 msec sample interval and a trace length of 4992 msec. Alternative thread scheduling methods are compared for optimizing single and multiple job throughput. The compact scheme schedules the threads of a single job on as few nodes as possible, whereas the distributed scheme schedules the threads across as many nodes as possible. The effects of load on the Sun Storage 7410 system are measured. This information provides valuable insight into determining the Oracle Grid Engine resource management policies.

Hyperthreading is enabled for all of the tests. It should be noted that every node is running a PVM daemon and ProMAX license server daemon. On the master PVM daemon node, there are three additional ProMAX daemons running.

The first test measures single job scalability across a 6-node cluster with an additional node serving as the master PVM host. The speedup relative to a single node, single thread is reported.

The second test measures multiple job scalability running 1 to 8 concurrent 16-thread jobs using the Sun Storage 7410 system. The performance is reported relative to a single job.

The third test compares 8-thread multiple job throughput using different job scheduling methods on a cluster. The compact method involves putting all 8 threads for a job on the same node. The distributed method involves spreading the 8 threads of a job across multiple nodes. The results compare several distributed scheduling configurations against compact scheduling baselines of 1, 2, and 4 concurrent jobs.

The fourth test is similar to the second test except running 16-thread ProMAX jobs. The results are reported relative to the performance of 1, 2, 4, and 8 concurrent 2-node, 8-thread jobs.

The ProMAX processing parameters used for this benchmark:

Minimum output inline = 65
Maximum output inline = 85
Inline output sampling interval = 1
Minimum output xline = 1
Maximum output xline = 200 (fold)
Xline output sampling interval = 1
Antialias inline spacing = 15
Antialias xline spacing = 15
Stretch Mute Aperture Limit with Maximum Stretch = 15
Image Gather Type = Full Offset Image Traces
No Block Moveout
Number of Alias Bands = 10
3D Amplitude Phase Correction
No compression
Maximum Number of Cache Blocks = 500000

Key Points and Best Practices

  • The application was rebuilt with the Intel 11.1 Fortran and C++ compilers with these flags.

    -xSSE4.2 -O3 -ipo -no-prec-div -static -m64 -ftz -fast-transcendentals -fp-speculation=fast
  • There are additional execution threads associated with a ProMAX node. There are two threads that run on each node: the license server and PVM daemon. There are at least three additional daemon threads that run on the PVM master server: the ProMAX interface GUI, the ProMAX job execution - SuperExec, and the PVM console and control. It is best to allocate one node as the master PVM server to handle the additional 5+ threads. Otherwise, hyperthreading can be enabled and the master PVM host can support up to 8 ProMAX job threads.

  • When hyperthreading is enabled on one of the non-master PVM hosts, there is a 7% penalty going from 8 to 10 threads. However, 12 threads are 11 percent faster than 8. This can be attributed to the two additional support threads that run when hyperthreading initiates.

  • Single job performance on a single node is near linear up to the number of cores in the node, i.e. 2 Intel Xeon X5570 processors with 4 cores each. With hyperthreading (2 active threads per core) enabled, more ProMAX threads are used, increasing the load on the CPU's memory architecture and causing the reduced speedups.

    Users need to be aware of these performance differences and how they affect their production environment.

See Also

Disclosure Statement

The following are trademarks or registered trademarks of Halliburton/Landmark Graphics: ProMAX. Results as of 9/20/2010.

Monday Sep 20, 2010

Schlumberger's ECLIPSE 300 Performance Throughput On Sun Fire X2270 Cluster with Sun Storage 7410

Oracle's Sun Storage 7410 system, attached via QDR InfiniBand to a cluster of eight of Oracle's Sun Fire X2270 servers, was used to evaluate multiple job throughput of Schlumberger's Linux-64 ECLIPSE 300 compositional reservoir simulator processing their standard 2 Million Cell benchmark model with 8 rank parallelism (MM8 job).

  • The Sun Storage 7410 system showed little difference in performance (2%) compared to running the MM8 job with dedicated local disk.

  • When running 8 concurrent jobs on 8 different nodes, all using the Sun Storage 7410 system, performance saw little degradation (5%) compared to a single MM8 job running on dedicated local disk.

Experiments were run changing how the cluster was utilized in scheduling jobs. Rather than running with the default compact mode, tests were run distributing the single job among the various nodes. Performance improvements were measured when changing from the default compact scheduling scheme (1 job to 1 node) to a distributed scheduling scheme (1 job to multiple nodes).

  • When running at 75% of the cluster capacity, distributed scheduling outperformed the compact scheduling by up to 34%. Even when running at 100% of the cluster capacity, the distributed scheduling is still slightly faster than compact scheduling.

  • When combining workloads, using the distributed scheduling allowed two MM8 jobs to finish 19% faster than the reference time and a concurrent PSTM workload to finish 2% faster.

The Oracle Solaris Studio Performance Analyzer and Sun Storage 7410 system analytics were used to identify a 3D Prestack Kirchhoff Time Migration (PSTM) as a potential candidate for consolidating with ECLIPSE. Both scheduling schemes are compared while running various job mixes of these two applications using the Sun Storage 7410 system for I/O.

These experiments showed a potential opportunity for consolidating applications using Oracle Grid Engine resource scheduling and Oracle Virtual Machine templates.

Performance Landscape

Results are presented below on a variety of experiments run using the 2009.2 ECLIPSE 300 2 Million Cell Performance Benchmark (MM8). The compute nodes are a cluster of Sun Fire X2270 servers connected with QDR InfiniBand. First, some definitions used in the tables below:

Local HDD: Each job runs on a single node using its dedicated direct-attached storage.
NFSoIB: One node hosts its local disk for NFS mounting to other nodes over InfiniBand.
IB 7410: Sun Storage 7410 system over QDR InfiniBand.
Compact Scheduling: All 8 MM8 MPI processes run on a single node.
Distributed Scheduling: Allocate the 8 MM8 MPI processes across all available nodes.
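
To make the two schemes concrete, the placement of the 8 MM8 MPI processes can be sketched as follows. This is only an illustration in Python; the helper functions and node names are hypothetical, not Scali MPI or Oracle Grid Engine code:

    # Illustrative placement of the 8 MM8 MPI processes under each scheduling scheme.
    def compact(num_ranks, nodes, slots_per_node):
        # Pack the ranks onto as few nodes as possible.
        return [nodes[rank // slots_per_node] for rank in range(num_ranks)]

    def distributed(num_ranks, nodes):
        # Spread the ranks round-robin across all available nodes.
        return [nodes[rank % len(nodes)] for rank in range(num_ranks)]

    cluster = ["node%d" % i for i in range(8)]
    print(compact(8, cluster, slots_per_node=8))  # all 8 processes land on node0
    print(distributed(8, cluster))                # one process on each of the 8 nodes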

First Test

The first test compares the performance of a single MM8 test on a single node using local storage to running a number of jobs across the cluster and showing the effect of different storage solutions.

Compact Scheduling
Multiple Job Throughput Results Relative to Single Job
2009.2 ECLIPSE 300 MM8 2 Million Cell Performance Benchmark

Cluster Load Number of MM8 Jobs Local HDD Relative Throughput NFSoIB Relative Throughput IB 7410 Relative Throughput
13% 1 1.00 1.00* 0.98
25% 2 0.98 0.97 0.98
50% 4 0.98 0.96 0.97
75% 6 0.98 0.95 0.95
100% 8 0.98 0.95 0.95

* Performance measured on the node hosting the local disk that is NFS-shared to the other nodes in the cluster.

Second Test

This next test uses the Sun Storage 7410 system and compares the performance of running the MM8 job on 1 node using compact scheduling to running multiple jobs with compact scheduling and to running multiple jobs with distributed scheduling. The tests are run on an 8-node cluster, so each distributed job has only 1 MPI process per node.

Comparing Compact and Distributed Scheduling
Multiple Job Throughput Results Relative to Single Job
2009.2 ECLIPSE 300 MM8 2 Million Cell Performance Benchmark

Cluster Load   Number of MM8 Jobs   Compact Scheduling Relative Throughput   Distributed Scheduling* Relative Throughput
13% 1 1.00 1.34
25% 2 1.00 1.32
50% 4 0.99 1.25
75% 6 0.97 1.10
100% 8 0.97 0.98

* Each distributed job has 1 MPI process per node.

Third Test

This next test uses the Sun Storage 7410 system and compares the performance of running the MM8 job on 1 node using compact scheduling to running multiple jobs with compact scheduling and to running multiple jobs with distributed scheduling. This test only uses 4 nodes, so each distributed job has two MPI processes per node.

Comparing Compact and Distributed Scheduling on 4 Nodes
Multiple Job Throughput Results Relative to Single Job
2009.2 ECLIPSE 300 MM8 2 Million Cell Performance Benchmark

Cluster Load   Number of MM8 Jobs   Compact Scheduling Relative Throughput   Distributed Scheduling* Relative Throughput
25% 1 1.00 1.39
50% 2 1.00 1.28
100% 4 1.00 1.00

* Each distributed job has two MPI processes per node.

Fourth Test

The last test involves running two different applications on the 4 node cluster. It compares the performance of running the cluster fully loaded and changing how the applications are run, either compact or distributed. The comparisons are made against the individual application running the compact strategy (as few nodes as possible). It shows that appropriately mixing jobs can give better job performance than running just one kind of application on a single cluster.

Multiple Job, Multiple Application Throughput Results
Comparing Scheduling Strategies
2009.2 ECLIPSE 300 MM8 2 Million Cell and 3D Kirchhoff Time Migration (PSTM)

Number of PSTM Jobs   Number of MM8 Jobs   ECLIPSE Compact Scheduling (1 node x 8 processes per job)   ECLIPSE Distributed Scheduling (4 nodes x 2 processes per job)   PSTM Distributed Scheduling (4 nodes x 4 processes per job)   PSTM Compact Scheduling (2 nodes x 8 processes per job)   Cluster Load
0   1   1.00   1.40   -   -   25%
0   2   1.00   1.27   -   -   50%
0   4   0.99   0.98   -   -   100%
1   2   -   1.19   1.02   -   100%
2   0   -   -   1.07   0.96   100%
1   0   -   -   1.08   1.00   50%

Results and Configuration Summary

Hardware Configuration:

8 x Sun Fire X2270 servers, each with
2 x 2.93 GHz Intel Xeon X5570 processors
24 GB memory (6 x 4 GB memory at 1333 MHz)
1 x 500 GB SATA
Sun Storage 7410 system, 24 TB total, QDR InfiniBand
4 x 2.3 GHz AMD Opteron 8356 processors
128 GB memory
2 x internal 233 GB SAS drives (466 GB total)
2 x internal 93 GB read-optimized SSDs (186 GB total)
1 x Sun Storage J4400 array with 22 x 1 TB SATA drives and 2 x 18 GB write-optimized SSDs
20 TB RAID-Z2 (double parity) data and 2-way striped write optimized SSD or
11 TB mirrored data and mirrored write optimized SSD
QDR InfiniBand Switch

Software Configuration:

SUSE Linux Enterprise Server 10 SP 2
Scali MPI Connect 5.6.6
GNU C 4.1.2 compiler
2009.2 ECLIPSE 300
ECLIPSE license daemon flexlm v11.3.0.0
3D Kirchhoff Time Migration

Benchmark Description

The benchmark is a home-grown study in resource usage options when running the Schlumberger ECLIPSE 300 Compositional reservoir simulator with 8 rank parallelism (MM8) to process Schlumberger's standard 2 Million Cell benchmark model. Schlumberger pre-built executables were used to process a 260x327x73 (2 Million Cell) sub-grid with 6,206,460 total grid cells and model 7 different compositional components within a reservoir. No source code modifications or executable rebuilds were conducted.

The ECLIPSE 300 MM8 job uses 8 MPI processes. It can run within a single node (compact) or across multiple nodes of a cluster (distributed). By using the MM8 job, it is possible to compare the performance between running each job on a separate node using local disk and using a shared network attached storage solution. The benchmark tests study the effect of increasing the number of MM8 jobs in a throughput model.

The first test compares the performance of running 1, 2, 4, 6 and 8 jobs on a cluster of 8 nodes using local disk, NFSoIB disk, and the Sun Storage 7410 system connected via InfiniBand. Results are compared against the time it takes to run 1 job with local disk. This test shows what performance impact there is when loading down a cluster.

The second test compares different methods of scheduling jobs on a cluster. The compact method involves putting all 8 MPI processes for a job on the same node. The distributed method involves using 1 MPI process per node. The results compare the performance against 1 job on one node.

The third test is similar to the second test, but uses only 4 nodes in the cluster, so when running distributed, there are 2 MPI processes per node.

The fourth test compares the compact and distributed scheduling methods on 4 nodes while running 2 MM8 jobs and one 16-way parallel 3D Prestack Kirchhoff Time Migration (PSTM).

Key Points and Best Practices

  • ECLIPSE is very sensitive to memory bandwidth and needs to be run with 1333 MHz or faster memory. In order to maintain 1333 MHz memory, the maximum memory configuration for the processors used in this benchmark is 24 GB. BIOS upgrades now allow 1333 MHz memory for up to 48 GB of memory. Additional nodes can be used to handle data sets that require more memory than is available per node. Allocating at least 20% of memory per node for I/O caching helps application performance.

  • If allocating an 8-way parallel job (MM8) to a single node, it is best to use an ECLIPSE license for that particular node to avoid any additional network overhead of sharing a global license with all the nodes in a cluster.

  • Understanding the ECLIPSE MM8 I/O access patterns is essential to optimizing a shared storage solution. The analytics available on the Oracle Unified Storage 7410 provide valuable I/O characterization information even without source code access. A single MM8 job run shows an initial read and write load related to reading the input grid, parsing Petrel ascii input parameter files and creating an initial solution grid and runtime specifications. This is followed by a very long running simulation that writes data, restart files, and generates reports to the 7410. Due to the nature of the small block I/O, the mirrored configuration for the 7410 outperformed the RAID-Z2 configuration.

    A single MM8 job reads, processes, and writes approximately 240 MB of grid and property data in the first 36 seconds of execution. The actual read and write of the grid data, that is intermixed with this first stage of processing, is done at a rate of 240 MB/sec to the 7410 for each of the two operations.

    Then, it calculates and reports the well connections at an average 260 KB writes/second with 32 operations/second = 32 x 8 KB writes/second. However, the actual size of each I/O operation varies between 2 to 100 KB and there are peaks every 20 seconds. The write cache is on average operating at 8 accesses/second at approximately 61 KB/second (8 x 8 KB writes/sec). As the number of concurrent jobs increases, the interconnect traffic and random I/O operations per second to the 7410 increases.

  • MM8 multiple job startup time is reduced on shared file systems, if each job uses separate input files.

See Also

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 9/20/2010.

Sun Fire X4470 4 Node Cluster Delivers World Record SAP SD-Parallel Benchmark Result

Oracle delivered an SAP enhancement package 4 for SAP ERP 6.0 Sales and Distribution – Parallel (SD-Parallel) Benchmark world record result using four of Oracle's Sun Fire X4470 servers, Oracle Solaris 10 and Oracle 11g Real Application Clusters (RAC) software.

  • The Sun Fire X4470 servers delivered 8% more performance compared to the IBM Power 780 server running the SAP enhancement package 4 for SAP ERP 6.0 Sales and Distribution benchmark.

  • The Sun Fire X4470 servers result of 40,000 users delivered 2.2 times the performance of the HP ProLiant DL980 G7 result of 18,180 users.

  • The Sun Fire X4470 servers result of 40,000 users delivered 2.5 times the performance of the Fujitsu PRIMEQUEST 1800E result of 16,000 users.

This result shows that a complete software and hardware solution from Oracle using Oracle RAC, Oracle Solaris and Sun servers provides a superior performing solution.

Performance Landscape

Selected SAP Sales and Distribution benchmark results are presented in decreasing order of performance. All benchmarks used SAP enhancement package 4 for SAP ERP 6.0 (Unicode) except the result marked with an asterisk (*), which was achieved with SAP ERP 6.0.

System OS
Database
Users SAPS Type Date
Four Sun Fire X4470
4xIntel Xeon X7560 @2.26GHz
256 GB
Solaris 10
Oracle 11g Real Application Clusters
40,000 221,014 Parallel 20-Sep-10
Five IBM System p 570 (*)
8xPOWER6 @4.7GHz
128 GB
AIX 5L Version 5.3
Oracle 10g Real Application Clusters
37,040 187,450 Parallel "non-Unicode" 25-Mar-08
IBM Power 780
8xPOWER7 @3.8GHz
1 TB
AIX 6.1
DB2 9.7
37,000 202,180 2-Tier 7-Apr-10
Two Sun Fire X4470
4xIntel Xeon X7560 @2.26GHz
256 GB
Solaris 10
Oracle 11g Real Application Clusters
21,000 115,300 Parallel 28-Jun-10
HP DL980 G7
8xIntel Xeon X7560 @2.26GHz
512 GB
Win Server 2008 R2 DE
SQL Server 2008
18,180 99,320 2-Tier 21-Jun-10
Fujitsu PRIMEQUEST 1800E
8xIntel Xeon X7560 @2.26GHz
512 GB
Win Server 2008 R2 DE
SQL Server 2008
16,000 87,550 2-Tier 30-Mar-10
Four Sun Blade X6270
2xIntel Xeon X5570 @2.93GHz
48 GB
Solaris 10
Oracle 10g Real Application Clusters
13,718 75,762 Parallel 12-Oct-09
HP DL580 G7
4xIntel Xeon X7560 @2.26GHz
256 GB
Win Server 2008 R2 DE
SQL Server 2008
10,445 57,020 2-Tier 21-Jun-10
Two Sun Blade X6270
2xIntel Xeon X5570 @2.93GHz
48 GB
Solaris 10
Oracle 10g Real Application Clusters
7,220 39,420 Parallel 12-Oct-09
One Sun Blade X6270
2xIntel Xeon X5570 @2.93GHz
48 GB
Solaris 10
Oracle 10g Real Application Clusters
3,800 20,750 Parallel 12-Oct-09

Complete benchmark results and a description can be found at the SAP benchmark website http://www.sap.com/solutions/benchmark/sd.epx.

Results and Configuration Summary

Hardware Configuration:

4 x Sun Fire X4470 servers, each with
4 x Intel Xeon X7560 2.26 GHz (4 chips, 32 cores, 64 threads)
256 GB memory

Software Configuration:

Oracle 11g Real Application Clusters (RAC)
Oracle Solaris 10

Results Summary:

Number of SAP SD benchmark users:
40,000
Average dialog response time:
0.86 seconds
Throughput:

Dialog steps/hour:
13,261,000

SAPS:
221,020
SAP Certification:
2010039

Benchmark Description

SAP is one of the premier world-wide ERP application providers and maintains a suite of benchmark tests to demonstrate the performance of competitive systems running the various SAP products.

The SAP Standard Application SD Benchmark represents the critical tasks performed in real-world ERP business environments. The SAP Standard Application Sales and Distribution - Parallel (SD-Parallel) Benchmark is a two-tier ERP business test that is indicative of full business workloads of complete order processing and invoice processing and demonstrates the ability to run both the application and database software on a single system.

The SD-Parallel Benchmark consists of the same transactions and user interaction steps as the SD Benchmark. This means that the SD-Parallel Benchmark runs the same business processes as the SD Benchmark. The difference between the benchmarks is the technical data distribution.

The additional rule for parallel and distributed databases is that the benchmark users must be equally distributed across all database nodes for the benchmark clients used (round-robin method). Following this rule, all database nodes work on data of all clients. This avoids unrealistic configurations such as having only one client per database node.
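
As an illustration of the round-robin rule, a minimal Python sketch (the function and node names are hypothetical, not part of the SAP benchmark kit):

    # Round-robin assignment of benchmark users to database nodes.
    def assign_users_round_robin(num_users, db_nodes):
        # Each successive user is assigned to the next database node in turn.
        return {user: db_nodes[user % len(db_nodes)] for user in range(num_users)}

    nodes = ["rac_node1", "rac_node2", "rac_node3", "rac_node4"]
    print(assign_users_round_robin(8, nodes))  # users 0..7 cycle through the 4 nodes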

The SAP Benchmark Council agreed to give the parallel benchmark a different name so that the difference can be easily recognized by any interested parties - customers, prospects, and analysts. The naming convention is SD-Parallel for Sales & Distribution - Parallel.

In January 2009, a new version of the SAP enhancement package 4 for SAP ERP 6.0 (Unicode) Sales and Distribution (SD) Benchmark was released. This new release has higher CPU requirements and so yields 25-50% fewer users compared to the previous (non-Unicode) Standard Sales and Distribution (SD) Benchmark. Between 10-30% of this greater load is due to the extra overhead from the processing of the larger character strings due to Unicode encoding.

Unicode is a computing standard that allows for the representation and manipulation of text expressed in most of the world's writing systems. Before the Unicode requirement, this benchmark used ASCII characters meaning each was just 1 byte. The new version of the benchmark requires Unicode characters and the Application layer (where ~90% of the cycles in this benchmark are spent) uses a new encoding, UTF-16, which uses 2 bytes to encode most characters (including all ASCII characters) and 4 bytes for some others. This requires computers to do more computation and use more bandwidth and storage for most character strings. Refer to the above SAP Note for more details.
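
The encoded-size difference is easy to confirm. A small Python illustration (the strings are arbitrary examples, not benchmark data):

    # UTF-16 needs 2 bytes for most characters (including all ASCII) and 4 for some others.
    for text in ["ORDER 4711", "\U00010348"]:  # the second is a supplementary-plane character
        print(repr(text), len(text), "chars ->", len(text.encode("utf-16-le")), "UTF-16 bytes")
    # 'ORDER 4711': 10 chars -> 20 bytes; '\U00010348': 1 char -> 4 bytes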

See Also

Disclosure Statement

SAP enhancement package 4 for SAP ERP 6.0 (Unicode) Sales and Distribution Benchmark, results as of 9/19/2010. For more details, see http://www.sap.com/benchmark. SD-Parallel, Four Sun Fire X4470 (each 4 processors, 32 cores, 64 threads) 40,000 SAP SD Users, Cert# 2010039. SD-Parallel, Two Sun Fire X4470 (each 4 processors, 32 cores, 64 threads) 21,000 SAP SD Users, Cert# 2010029. SD 2-Tier, HP ProLiant DL980 G7 (8 processors, 64 cores, 128 threads) 18,180 SAP SD Users, Cert# 2010028. SD 2-Tier, Fujitsu PRIMEQUEST 1800E (8 processors, 64 cores, 128 threads) 16,000 SAP SD Users, Cert# 2010010. SD-Parallel, Four Sun Blade X6270 (each 2 processors, 8 cores, 16 threads) 13,718 SAP SD Users, Cert# 2009041. SD 2-Tier, HP ProLiant DL580 G7 (4 processors, 32 cores, 64 threads) 10,490 SAP SD Users, Cert# 2010032. SD 2-Tier, IBM System x3850 X5 (4 processors, 32 cores, 64 threads) 10,450 SAP SD Users, Cert# 2010012. SD 2-Tier, Fujitsu PRIMERGY RX600 S5 (4 processors, 32 cores, 64 threads) 9,560 SAP SD Users, Cert# 2010017. SD-Parallel, Two Sun Blade X6270 (each 2 processors, 8 cores, 16 threads) 7,220 SAP SD Users, Cert# 2009040. SD-Parallel, Sun Blade X6270 (2 processors, 8 cores, 16 threads) 3,800 SAP SD Users, Cert# 2009039. SD 2-Tier, Sun Fire X4270 (2 processors, 8 cores, 16 threads) 3,800 SAP SD Users, Cert# 2009033.

SAP ERP 6.0 (Unicode) Sales and Distribution Benchmark, results as of 9/19/2010. SD-Parallel, Five IBM System p 570 (each 8 processors, 16 cores, 32 threads) 37,040 SAP SD Users, Cert# 2008013.

Tuesday Jun 29, 2010

Sun Fire X2270 M2 Achieves Leading Single Node Results on ANSYS FLUENT Benchmark

Oracle's Sun Fire X2270 M2 server produced leading single node performance results running the ANSYS FLUENT benchmark cases as compared to the best single node results currently posted at the ANSYS FLUENT website. ANSYS FLUENT is a prominent MCAE application used for computational fluid dynamics (CFD).

  • The Sun Fire X2270 M2 server outperformed all single node systems in 5 of 6 test cases at the 12 core level, beating systems from Cray and SGI.
  • For the truck_14m test, the Sun Fire X2270 M2 server outperformed all single node systems at all posted core counts, beating systems from SGI, Cray and HP. When considering performance on a single node, the truck_14m model is most representative of customer CFD model sizes in the test suite.
  • The Sun Fire X2270 M2 server with 12 cores performed up to 1.3 times faster than the previous generation Sun Fire X2270 server with 8 cores.

Performance Landscape

Results are presented for six of the seven ANSYS FLUENT benchmark tests. The seventh test is not a practical test for a single system. Results are ratings, where bigger is better. A rating is the number of jobs that could be run in a single day (86,400 / run time). Competitive results are from the ANSYS FLUENT benchmark website as of 25 June 2010.
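
For reference, the rating arithmetic is a one-liner; the 911-second run time below is illustrative, chosen so the rating matches the truck_14m entry in the table that follows:

    # A rating is the number of jobs of that size that fit in a 24-hour day.
    def fluent_rating(run_time_seconds):
        return 86400.0 / run_time_seconds

    print(round(fluent_rating(911), 1))  # a 911-second run rates about 94.8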

Single System Performance

ANSYS FLUENT Benchmark Tests
Results are Ratings, Bigger is Better
System Benchmark Test
eddy_417k turbo_500k aircraft_2m sedan_4m truck_14m truck_poly_14m
Sun Fire X2270 M2 1129.4 5391.6 1105.9 814.1 94.8 96.4
SGI Altix 8400EX 1338.0 5308.8 1073.3 796.3 - -
SGI Altix XE1300C 1299.2 5284.4 1071.3 801.3 90.2 -
Cray CX1 1060.8 5127.6 1069.6 808.6 86.1 87.5

Scaling of Benchmark Test truck_14m

ANSYS FLUENT truck_14m Model
Results are Ratings, Bigger is Better
System Cores Used
12 8 4 2 1
Sun Fire X2270 M2 94.8 73.6 41.4 21.0 10.4
SGI Altix XE1300C 90.2 60.9 41.1 20.7 9.0
Cray CX1 (X5570) - 71.7 33.2 18.9 8.1
HP BL460 G6 (X5570) - 70.3 38.6 19.6 9.2

Comparing System Generations, Sun Fire X2270 M2 to Sun Fire X2270

ANSYS FLUENT Benchmark Tests
Results are Ratings, Bigger is Better
System Benchmark Test
eddy_417k turbo_500k aircraft_2m sedan_4m truck_14m truck_poly_14m
Sun Fire X2270 M2 1129.4 5374.8 1103.8 814.1 94.8 96.4
Sun Fire X2270 981.5 4163.9 862.7 691.2 73.6 73.3

Ratio 1.15 1.29 1.28 1.18 1.29 1.32

Results and Configuration Summary

Hardware Configuration:

Sun Fire X2270 M2
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory
1 x 500 GB 7200 rpm SATA internal HDD

Sun Fire X2270
2 x 2.93 GHz Intel Xeon X5570 processors
48 GB memory
2 x 24 GB internal striped SSDs

Software Configuration:

64-bit SUSE Linux Enterprise Server 10 SP 3 (SP 2 for X2270)
ANSYS FLUENT V12.1.2
ANSYS FLUENT Benchmark Test Suite

Benchmark Description

The following description is from the ANSYS FLUENT website:

The FLUENT benchmark suite comprises a set of test cases covering a large range of mesh sizes, physical models and solvers representing typical industry usage. The cases range in size from a few hundred thousand cells to more than 100 million cells. Both the segregated and coupled implicit solvers are included, as well as hexahedral, mixed and polyhedral cell cases. This broad coverage is expected to demonstrate the breadth of FLUENT performance on a variety of hardware platforms and test cases.

The performance of a CFD code will depend on several factors, including size and topology of the mesh, physical models, numerics and parallelization, compilers and optimization, in addition to performance characteristics of the hardware where the simulation is performed. The principal objective of this benchmark suite is to provide comprehensive and fair comparative information of the performance of FLUENT on available hardware platforms.

About the ANSYS FLUENT 12 Benchmark Test Suite

    CFD models tend to be very large, since grid refinement is required to accurately capture conditions in the boundary layer region adjacent to the body over which flow occurs. Fine grids are also required to determine turbulence conditions accurately. As a result, these models can run for many hours or even days, even when using a large number of processors.

See Also

Disclosure Statement

All information on the FLUENT website (http://www.fluent.com) is Copyrighted 1995-2010 by ANSYS Inc. Results as of June 25, 2010.

Sun Fire X2270 M2 Demonstrates Outstanding Single Node Performance on MSC.Nastran Benchmarks

Oracle's Sun Fire X2270 M2 server showed outstanding performance running the MCAE application MSC.Nastran, as demonstrated by the MD Nastran MDR3 serial and parallel test cases.

Performance Landscape

Complete information about the serial results presented below can be found on the MSC Nastran website.


MD Nastran MDR3 Serial Test Results
Platform Benchmark Problem
Results are total elapsed run time in seconds
xl0imf1 xx0xst0 xl1fn40 vl0sst1
Sun Fire X2270 M2 999 704 2337 115
Sun Blade X6275 1107 798 2285 120
Intel Nehalem 1235 971 2453 123
Intel Nehalem w/ SSD 1484 767 2456 120
IBM:P6 570 ( I8 ) - 1510 4612 132
IBM:P6 570 ( I4 ) 1016 1618 5534 147

Complete information about the parallel results presented below can be found on the MSC Nastran website.


MD Nastran MDR3 Parallel Test Results
Platform Benchmark Problem
Results are total elapsed run time in seconds
xx0cmd2 md0mdf1
Serial DMP=2 DMP=4 DMP=8 Serial DMP=2 DMP=4
Sun Blade X6275 840 532 391 279 880 422 223
Sun Fire X2270 M2 847 558 371 297 889 462 232
Intel Nehalem w/ 4 SSD 887 639 405 - 902 479 235
Intel Nehalem 915 561 408 - 922 470 251
IBM:P6 570 ( I8 ) 920 574 392 322 - - -
IBM:P6 570 ( I4 ) 959 616 419 343 911 469 242

Results and Configuration Summary

Hardware Configuration:

Sun Fire X2270 M2
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory
4 x 24 GB SSDs (striped)

Software Configuration:

64-bit SUSE Linux Enterprise Server 10 SP 3
MSC Software MD 2008 R3
MD Nastran MDR3 benchmark test suite

Benchmark Description

The benchmark tests are representative of typical MSC.Nastran applications including both serial and parallel (DMP) runs involving linear statics, nonlinear statics, and natural frequency extraction as well as others. MD Nastran is an integrated simulation system with a broad set of multidiscipline analysis capabilities.

Key Points and Best Practices

  • The test cases for the MSC.Nastran module all have a substantial I/O component where 15% to 25% of the total run times are associated with I/O activity (primarily scratch files). The required scratch file size ranges from less than 1 GB on up to about 140 GB. To obtain best performance, it is important to have a high performance storage system when running MD Nastran.

  • To improve performance, it is possible to make use of the MD Nastran feature which sets the maximum amount of memory the application will use. This allows a user to configure where temporary files are held, including in memory file systems like tmpfs.

See Also

Disclosure Statement

MSC.Software is a registered trademark of MSC. All information on the MSC.Software website is copyrighted. MD Nastran MDR3 results from http://www.mscsoftware.com and this report as of June 28, 2010.

Sun Fire X2270 M2 Sets World Record on SPEC OMP2001 Benchmark

Oracle's Sun Fire X2270 M2 server, running Oracle Solaris 10 10/09 with the Oracle Solaris Studio 12 Update 1 compiler, produced the top x86 SPECompM2001 result for all 2-socket servers.

  • The Sun Fire X2270 M2 server with two Intel Xeon X5670 processors running 24 OpenMP threads achieved a SPEC OMP2001 result of 55,178 SPECompM2001.

  • The Sun Fire X2270 M2 server beat the Cisco B200 M2 system even though the Cisco system used the faster Intel Xeon X5680 (3.33 GHz) chips.

Performance Landscape

SPEC OMP2001 Performance Charts - bigger is better, selected results, please see www.spec.org for complete results. All results as of 06/28/10.

In the tables below
"Base" = SPECompMbase2001 and "Peak" = SPECompMpeak2001

SPEC OMPM2001 results

System   Cores/Chips   Processor Type   GHz   Base Threads   Peak   Base
Sun Fire X2270 M2 12/2 Xeon X5670 2.93 24 55178 49548
Cisco B200 M2 12/2 Xeon X5680 3.33 24 55072 52314
Intel SR1600UR 12/2 Xeon X5680 3.33 24 54249 51510
Intel SR1600UR 12/2 Xeon X5670 2.93 24 53313 50283

Results and Configuration Summary

Hardware Configuration:

Sun Fire X2270 M2
2 x 2.93 GHz Intel Xeon X5670
24 GB

Software Configuration:

Oracle Solaris 10 10/09
Oracle Solaris Studio 12 Update 1
SPEC OMP2001 suite v3.2

Benchmark Description

The SPEC OMPM2001 Benchmark Suite was released in June 2001 and tests HPC performance using OpenMP for parallelism.

  • 11 programs (3 in C and 8 in Fortran) parallelized using OpenMP API
Goals of the suite:
  • Targeted to mid-range (4-32 processor) parallel systems
  • Run rules, tools and reporting similar to SPEC CPU2006
  • Programs representative of HPC and Scientific Applications

The SPEC OMPL2001 Benchmark Suite was released in June 2001 and tests HPC performance using OpenMP for parallelism.

  • 9 programs (2 in C and 7 in Fortran) parallelized using OpenMP API
Goals of the suite:
  • Targeted to larger parallel systems
  • Run rules, tools and reporting similar to SPEC CPU2006
  • Programs representative of HPC and Scientific Applications

There are "base" variants of both the above metrics that require more conservative compilation, such as using the same flags for all benchmarks.

See Also

Disclosure Statement

SPEC, SPEComp reg tm of Standard Performance Evaluation Corporation. Results from www.spec.org as of 28 June 2010 and this report. Sun Fire X2270 M2 (2 chips, 12 cores, 24 OpenMP threads) 55,178 SPECompM2001;

Monday Jun 28, 2010

Sun Fire X4470 Sets World Records on SPEC OMP2001 Benchmarks

Oracle's Sun Fire X4470 server, with four Intel Xeon X7560 processors capable of running OpenMP applications with 64 compute threads, delivered outstanding performance on both the medium and large suites of the industry-standard SPEC OMP2001 benchmark.

  • The Sun Fire X4470 server running the Oracle Solaris 10 10/09 operating system with Oracle Solaris Studio 12 Update 1 compiler software, produced the top x86 result on SPECompM2001.

  • The Sun Fire X4470 server running the Oracle Solaris 10 10/09 operating system with Oracle Solaris Studio 12 Update 1 compiler software, produced the top x86 result on SPECompL2001.

  • The Sun Fire X4470 server beats the IBM Power 750 Express (POWER7, 3.55 GHz) SPECompM2001 score by 14%, while using half the number of OpenMP threads of the IBM Power 750.
  • The Sun Fire X4470 server with four Intel Xeon 7560 processors, running 64 OpenMP threads, achieved SPEC OMP2001 results of 118,264 SPECompM2001 and 642,479 SPECompL2001.

  • The Sun Fire X4470 server produced better SPECompL2001 results than Cisco (UCS C460 M1) and Intel (QSSC-S4R) even though they all used the same number of Intel Xeon X7560 processors.

  • The Sun Fire X4470 server produced better SPECompM2001 results than Cisco (UCS C460 M1), SGI (Altix UV 10) and Intel (QSSC-S4R) even though they all used the same number of Intel Xeon X7560 processors.

Performance Landscape

SPEC OMP2001 Performance Charts - bigger is better, selected results, please see www.spec.org for complete results. All results as of 06/28/10.

In the tables below
"Base" = SPECompLbase2001 or SPECompMbase2001
"Peak" = SPECompLpeak2001 or SPECompMpeak2001

SPEC OMPL2001 results

System   Cores/Chips   Processor Type   GHz   Base Threads   Peak   Base
Sun Fire X4470 32/4 Xeon 7560 2.26 64 642479 615790
Cisco UCS C460 M1 32/4 Xeon 7560 2.26 64 628126 607818
Intel QSSC-S4R X7560 32/4 Xeon 7560 2.26 64 610386 591375
Sun/Fujitsu SPARC M8000 64/16 SPARC64 VII 2.52 64 581807 532576

SPEC OMPM2001 results

System   Cores/Chips   Processor Type   GHz   Base Threads   Peak   Base
Sun Fire X4470 32/4 Xeon 7560 2.26 64 118264 95650
Cisco UCS C460 M1 32/4 Xeon 7560 2.26 64 109077 100258
SGI Altix UV 10 32/4 Xeon 7560 2.26 64 107248 96797
Intel QSSC-S4R X7560 32/4 Xeon 7560 2.26 64 106369 98288
Sun/Fujitsu SPARC M8000 64/16 SPARC64 VII 2.52 64 104714 75418
IBM Power 750 Express 32/4 POWER7 3.55 128 104175 92957

Results and Configuration Summary

Hardware Configuration:

Sun Fire X4470
4 x 2.26 GHz Intel Xeon X7560
256 GB

Software Configuration:

Oracle Solaris 10 10/09
Oracle Solaris Studio 12 Update 1
SPEC OMP2001 suite v3.2

Benchmark Description

The SPEC OMPM2001 Benchmark Suite was released in June 2001 and tests HPC performance using OpenMP for parallelism.

  • 11 programs (3 in C and 8 in Fortran) parallelized using OpenMP API
Goals of the suite:
  • Targeted to mid-range (4-32 processor) parallel systems
  • Run rules, tools and reporting similar to SPEC CPU2006
  • Programs representative of HPC and Scientific Applications

The SPEC OMPL2001 Benchmark Suite was released in June 2001 and tests HPC performance using OpenMP for parallelism.

  • 9 programs (2 in C and 7 in Fortran) parallelized using OpenMP API
Goals of the suite:
  • Targeted to larger parallel systems
  • Run rules, tools and reporting similar to SPEC CPU2006
  • Programs representative of HPC and Scientific Applications

There are "base" variants of both the above metrics that require more conservative compilation, such as using the same flags for all benchmarks.

See Also

Disclosure Statement

SPEC, SPEComp reg tm of Standard Performance Evaluation Corporation. Results from www.spec.org as of 28 June 2010 and this report. Sun Fire X4470 (4 chips, 32 cores, 64 OpenMP threads) 642,479 SPECompL2001, 118,264 SPECompM2001.

Sun Fire X4470 2-Node Configuration Sets World Record for SAP SD-Parallel Benchmark

Using two of Oracle's Sun Fire X4470 servers to run the SAP Enhancement Package 4 for SAP ERP 6.0 (Unicode) Sales and Distribution – Parallel (SD-Parallel) standard application benchmark, Oracle delivered a world record result. This was run using Oracle Solaris 10 and Oracle 11g Real Application Clusters (RAC) software.

  • The Sun Fire X4470 servers result of 21,000 users delivered more than twice the performance of the IBM System x3850 X5 system result of 10,450 users.

  • The Sun Fire X4470 servers result of 21,000 users beat the HP ProLiant DL980 G7 system result of 18,180 users. Both solutions used 8 Intel Xeon X7560 processors.

  • The Sun Fire X4470 servers result of 21,000 users beat the Fujitsu PRIMEQUEST 1800E system result of 16,000 users. Both solutions used 8 Intel Xeon X7560 processors.

  • This result shows how a complete software and hardware solution from Oracle, using Oracle RAC, Oracle Solaris, and Oracle's Sun servers, can provide a superior performing solution when compared to the competition.

Performance Landscape

SAP Enhancement Package 4 for SAP ERP 6.0 (Unicode) Sales and Distribution Benchmark, select results presented in decreasing performance order. Both Parallel and 2-Tier solution results are listed in the table.

System | OS / Database | Users | SAPS | Type | Date
Two Sun Fire X4470, 4xIntel Xeon X7560 @2.26GHz, 256 GB | Solaris 10, Oracle 11g Real Application Clusters | 21,000 | 115,300 | Parallel | 28-Jun-10
HP DL980 G7, 8xIntel Xeon X7560 @2.26GHz, 512 GB | Win Server 2008 R2 DE, SQL Server 2008 | 18,180 | 99,320 | 2-Tier | 21-Jun-10
Fujitsu PRIMEQUEST 1800E, 8xIntel Xeon X7560 @2.26GHz, 512 GB | Win Server 2008 R2 DE, SQL Server 2008 | 16,000 | 87,550 | 2-Tier | 30-Mar-10
Four Sun Blade X6270, 2xIntel Xeon X5570 @2.93GHz, 48 GB | Solaris 10, Oracle 10g Real Application Clusters | 13,718 | 75,762 | Parallel | 12-Oct-09
IBM System x3850 X5, 4xIntel Xeon X7560 @2.26GHz, 256 GB | Win Server 2008 EE, DB2 9.7 | 10,450 | 57,120 | 2-Tier | 30-Mar-10
HP DL580 G7, 4xIntel Xeon X7560 @2.26GHz, 256 GB | Win Server 2008 R2 DE, SQL Server 2008 | 10,445 | 57,020 | 2-Tier | 21-Jun-10
Fujitsu PRIMERGY RX600 S5, 4xIntel Xeon X7560 @2.26GHz, 512 GB | Win Server 2008 R2 DE, SQL Server 2008 | 9,560 | 52,300 | 2-Tier | 06-May-10
Two Sun Blade X6270, 2xIntel Xeon X5570 @2.93GHz, 48 GB | Solaris 10, Oracle 10g Real Application Clusters | 7,220 | 39,420 | Parallel | 12-Oct-09
One Sun Blade X6270, 2xIntel Xeon X5570 @2.93GHz, 48 GB | Solaris 10, Oracle 10g Real Application Clusters | 3,800 | 20,750 | Parallel | 12-Oct-09
Sun Fire X4270, 2xIntel Xeon X5570 @2.93GHz, 48 GB | Solaris 10, Oracle 10g | 3,800 | 21,000 | 2-Tier | 21-Aug-09

Complete benchmark results may be found at the SAP benchmark website http://www.sap.com/solutions/benchmark/sd.epx.

Results and Configuration Summary

Hardware Configuration:

2 x Sun Fire X4470 servers, each with
4 x Intel Xeon X7560 2.26 GHz (4 chips, 32 cores, 64 threads)
256 GB memory

Software Configuration:

Oracle 11g Real Application Clusters (RAC)
Oracle Solaris 10

Results Summary:

Number of SAP SD benchmark users:
21,000
Average dialog response time:
0.93 seconds
Throughput:

Dialog steps/hour:
6,918,000

SAPS:
115,300
SAP Certification:
2010029
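
The SAPS figure follows from the throughput line above via the standard SAP convention that 100 SAPS corresponds to 6,000 fully processed dialog steps per hour (i.e. 1 SAPS = 60 dialog steps per hour); a quick cross-check of the published numbers:

    // Cross-check of the published throughput figures (Cert# 2010029).
    #include <cstdio>

    int main() {
        const double dialog_steps_per_hour = 6918000.0;
        const double saps = dialog_steps_per_hour / 60.0;  // 100 SAPS = 6,000 dialog steps/hour
        std::printf("SAPS = %.0f\n", saps);                 // prints 115300
        return 0;
    }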

Benchmark Description

The SAP Standard Application Sales and Distribution - Parallel (SD-Parallel) Benchmark is a two-tier ERP business test that is indicative of full business workloads of complete order processing and invoice processing, and demonstrates the ability to run both the application and database software on a single system. The SAP Standard Application SD Benchmark represents the critical tasks performed in real-world ERP business environments.

The SD-Parallel Benchmark consists of the same transactions and user interaction steps as the SD Benchmark. This means that the SD-Parallel Benchmark runs the same business processes as the SD Benchmark. The difference between the benchmarks is the technical data distribution.

An additional rule for parallel and distributed databases is one must equally distribute the benchmark users across all database nodes for the used benchmark clients (round-robin-method). Following this rule, all database nodes work on data of all clients. This avoids unrealistic configurations such as having only one client per database node.
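
The round-robin rule itself is simple; the following hypothetical helper (an illustration, not SAP code) shows how benchmark users end up spread across database nodes so that every node works on data of every client:

    #include <cstdio>

    // Hypothetical illustration of the round-robin rule: user i of a given
    // benchmark client is assigned to database node (i mod N).
    int node_for_user(int user_index, int num_db_nodes) {
        return user_index % num_db_nodes;
    }

    int main() {
        const int num_db_nodes = 2;  // e.g. the two Sun Fire X4470 RAC nodes
        for (int user = 0; user < 8; ++user)
            std::printf("user %d -> database node %d\n",
                        user, node_for_user(user, num_db_nodes));
        return 0;
    }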

The SAP Benchmark Council agreed to give the parallel benchmark a different name so that the difference can be easily recognized by any interested parties - customers, prospects, and analysts. The naming convention is SD-Parallel for Sales & Distribution - Parallel.

SAP is one of the premier world-wide ERP application providers, and maintains a suite of benchmark tests to demonstrate the performance of competitive systems on the various SAP products.

See Also

Disclosure Statement

SAP Enhancement Package 4 for SAP ERP 6.0 (Unicode) Sales and Distribution Benchmark, results as of 6/22/2010. For more details, see http://www.sap.com/benchmark. SD-Parallel, Two Sun Fire X4470 (each 4 processors, 32 cores, 64 threads) 21,000 SAP SD Users, Cert# 2010029. SD 2-Tier, HP ProLiant DL980 G7 (8 processors, 64 cores, 128 threads) 18,180 SAP SD Users, Cert# 2010028. SD 2-Tier, Fujitsu PRIMEQUEST 1800E (8 processors, 64 cores, 128 threads) 16,000 SAP SD Users, Cert# 2010010. SD-Parallel, Four Sun Blade X6270 (each 2 processors, 8 cores, 16 threads) 13,718 SAP SD Users, Cert# 2009041. SD 2-Tier, IBM System x3850 X5 (4 processors, 32 cores, 64 threads) 10,450 SAP SD Users, Cert# 2010012. SD 2-Tier, Fujitsu PRIMERGY RX600 S5 (4 processors, 32 cores, 64 threads) 9,560 SAP SD Users, Cert# 2010017. SD-Parallel, Two Sun Blade X6270 (each 2 processors, 8 cores, 16 threads) 7,220 SAP SD Users, Cert# 2009040. SD-Parallel, Sun Blade X6270 (2 processors, 8 cores, 16 threads) 3,800 SAP SD Users, Cert# 2009039. SD 2-Tier, Sun Fire X4270 (2 processors, 8 cores, 16 threads) 3,800 SAP SD Users, Cert# 2009033.

Tuesday Apr 06, 2010

WRF Benchmark: X6275 Beats Power6

Significance of Results

Oracle's Sun Blade X6275 cluster is 28% faster than the IBM POWER6 cluster on Weather Research and Forecasting (WRF) continental United States (CONUS) benchmark datasets. The Sun Blade X6275 cluster used a Quad Data Rate (QDR) InfiniBand connection along with Intel compilers and MPI.

  • On the 12 km CONUS data set, the Sun Blade X6275 cluster was 28% faster than the IBM POWER6 cluster at 512 cores.

  • The Sun Blade X6275 cluster with 768 cores (one full Sun Blade 6048 chassis) was 47% faster than 1024 cores of the IBM POWER6 cluster (multiple racks).

  • On the 2.5 km CONUS data set, the Sun Blade X6275 cluster was 21% faster than the IBM POWER6 cluster at 512 cores.

  • The Sun Blade X6275 cluster with 768 cores (one full Sun Blade 6048 chassis) outperforms the IBM POWER6 cluster with 1024 cores by 28% on the 2.5 km CONUS dataset.

Performance Landscape

The performance in GFLOPS is shown below on multiple datasets.

Weather Research and Forecasting
CONUS 12 KM Dataset
Performance in GFLOPS
Cores | Sun X6275 | Intel Whitebox | IBM POWER6 | Cray XT5 | SGI | TACC Ranger | Blue Gene/P
8 17.5 19.8 17.0
10.2

16 38.7 37.5 33.6 21.4 20.1 10.8
32 71.6 73.3 66.5 40.4 39.8 21.2 5.9
64 132.5 131.4 117.2 75.2 77.0 37.8
128 235.8 232.8 209.1 137.4 114.0 74.5 20.4
192 323.6





256 405.2 415.1 363.1 243.2 197.9 121.0 37.4
384 556.6





512 691.9 696.7 542.2 392.2 375.2 193.9 65.6
768 912.0






1024

618.5 634.1 605.9 271.7 108.5
1700



840.1


2048





175.6

All cores were used on each node that participated in each run.

Sun X6275 - 2.93 GHz X5570, InfiniBand
Intel Whitebox - 2.8 GHz X5560, InfiniBand
IBM POWER6 - IBM Power 575, 4.7 GHz POWER6, InfiniBand, 3 frames
Cray XT5 - 2.7 GHz AMD Opteron (Shanghai), Cray SeaStar 2.1
SGI - best of a variety of results
TACC Ranger - 2.3 GHz AMD Opteron (Barcelona), InfiniBand
Blue Gene/P - 850 MHz PowerPC 450, 3D-Torus (proprietary)

Weather Research and Forecasting
CONUS 2.5 KM Dataset
Performance in GFLOPS
Cores | Sun X6275 | SGI 8200EX | Blue Gene/L | IBM POWER6 | Cray XT5 | Intel Whitebox | TACC Ranger
16 35.2






32 69.6

64.3


64 140.2

130.9
147.8 24.5
128 278.9 89.9
242.5 152.1 290.6 87.7
192 400.5





256 514.8 179.6 8.3 431.3 306.3 535.0 145.3
384 735.1





512 973.5 339.9 16.5 804.4 566.2 1019.9 311.0
768 1367.7





1024
721.5 124.8 1067.3 1075.9 1911.4 413.4
2048
1389.5 241.2
1849.7 3251.1
2600




4320.6
3072
1918.7 350.5
2651.3

4096
2543.5 453.2
3288.7

6144
3057.3 642.3
4280.1

8192
3569.7 820.4
5140.4

18432

1238.0



Sun X6275 - 2.93 GHz X5570, InfiniBand
SGI 8200EX - 3.0 GHz E5472, InfiniBand
Blue Gene/L - 700 MHz PowerPC 440, 3D-Torus (proprietary)
IBM POWER6 - IBM Power 575, 4.7 GHz POWER6, InfiniBand, 3 frames
Cray XT5 - 2.4 GHz AMD Opteron (Shanghai), Cray SeaStar 2.1
Intel Whitebox - 2.8 GHz X5560, InfiniBand
TACC Ranger - 2.3 GHz AMD Opteron (Barcelona), InfiniBand

Results and Configuration Summary

Hardware Configuration:

48 x Sun Blade X6275 server modules, 2 nodes per blade, each node with
2 Intel Xeon X5570 2.93 GHz processors, turbo enabled, ht disabled
24 GB memory
QDR InfiniBand

Software Configuration:

SUSE Linux Enterprise Server 10 SP2
Intel Compilers 11.1.059
Intel MPI 3.2.2
WRF 3.0.1.1
WRF 3.1.1
netCDF 4.0.1

Benchmark Description

The Weather Research and Forecasting (WRF) Model is a next-generation mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs. WRF is designed to be a flexible, state-of-the-art atmospheric simulation system that is portable and efficient on available parallel computing platforms. It features multiple dynamical cores, a 3-dimensional variational (3DVAR) data assimilation system, and a software architecture allowing for computational parallelism and system extensibility.

There are two fixed-size benchmark cases.

Single domain, medium size 12KM Continental US (CONUS-12K)

  • 425x300x35 cell volume
  • 48hr, 12km resolution dataset from Oct 24, 2001
  • Benchmark is a 3hr simulation for hrs 25-27 starting from a provided restart file
  • Iterations output at every 72 sec of simulation time, with the computation cost of each time step ~30 GFLOP

Single domain, large size 2.5KM Continental US (CONUS-2.5K)

  • 1501x1201x35 cell volume
  • 6hr, 2.5km resolution dataset from June 4, 2005
  • Benchmark is the final 3hr simulation for hrs 3-6 starting from a provided restart file; the benchmark may also be performed (but seldom reported) for the full 6hrs starting from a cold start
  • Iterations output at every 15 sec of simulation time, with the computation cost of each time step ~412 GFLOP
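
The GFLOPS figures in the tables above follow from these per-step costs: the sustained rate is the work per model time step divided by the measured wall-clock time per step. A sketch of that arithmetic, where the 0.05 seconds per step is a made-up illustration rather than a measured value:

    #include <cstdio>

    int main() {
        const double gflop_per_step   = 30.0;  // CONUS-12K cost per time step (from the text)
        const double seconds_per_step = 0.05;  // hypothetical measured wall-clock average
        const double gflops = gflop_per_step / seconds_per_step;
        std::printf("sustained rate = %.1f GFLOPS\n", gflops);  // 600.0 for these inputs
        return 0;
    }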

See Also

Disclosure Statement

WRF, see http://www.mmm.ucar.edu/wrf/WG2/bench/, results as of 3/8/2010.

Monday Jan 25, 2010

Sun/Solaris Leadership in SAP SD Benchmarks and HP claims

COMMENTS ON  SIGNIFICANT SAP SD 2 Tier RESULTS and HP MISLEADING CLAIMS:

HP is making claims of "leadership" in the Two-Tier SAP SD benchmark by carefully fencing the claims based on the OS (Linux and Windows), and conveniently omitting the actual leading results from Sun on Solaris.

HP's claims: ftp://ftp.hp.com/pub/c-products/servers/benchmarks/HP_ProLiant_785_585_385_SAP_perf_brief_121009.pdf

It is worthwhile to take a closer look at the results and the real leadership of Sun and Solaris in this SAP benchmark. All the SAP SD Two-Tier results discussed here can be seen at: http://www.sap.com/solutions/benchmark/sd2tier.epx  All results here use the latest version of SAP Enhancement Package 4  for SAP ERP 6.0 (Unicode).

Here is a summary of the HP claims and the counterpoints showing Sun and Solaris real leadership in performance and scalability.

HP claims the #1 position for 8-processor, 4-processor and 2-processor servers as follows:

  • 8-processor: yes, but only on Windows (8,280 SAP SD Benchmark users) and on Linux (8,022 SAP SD Benchmark users), both with the HP ProLiant DL785 G6 (8x 2.8GHz Opteron 8439 SE).
  • These statements are formally correct, but HP fails to mention Sun's actual #1 8-processor overall record result, by far: 10,000 SAP SD Benchmark users on a Sun Fire X4640 running Solaris with eight 2.6GHz Opteron 8435 processors, leading by more than 20% at a lower clock speed.
  • 4-processor: yes, but only on Linux (4,370 SAP SD Benchmark users, using the HP ProLiant DL585 G6, 4x 2.8GHz AMD Opteron 8439 SE).
  • Again formally correct, but HP fails to mention that Sun holds the 4-processor overall record result on Solaris: 4,720 SAP SD Benchmark users, obtained on a Sun SPARC Enterprise T5440 with four 1.6GHz UltraSPARC T2 Plus processors.
  • 2-processor: similarly, HP claims the #1 and #2 rankings, but only on Linux (3,171 SAP SD Benchmark users, HP ProLiant DL380 G6, 2x 2.93GHz Xeon X5570; and 2,315 SAP SD Benchmark users, HP ProLiant DL385 G6, 2x 2.6GHz Opteron 2435).
  • Again, HP omits the fact that Sun holds the #1 2-processor overall record result on Solaris: 3,800 SAP SD Benchmark users, obtained on a Sun Fire X4270 with 2x 2.93GHz Xeon X5570, leading HP's results by 20% and 64% respectively.

The only conclusion is that Sun servers running Solaris have real leadership in the SAP SD Two-Tier Benchmark. This fact is confirmed not only for the 2- to 8-processor servers but also at the high end, where the Sun M9000 with Solaris leads with the overall World Record for this benchmark, showing real record performance and top vertical scalability.

More details on the World Record Sun M9000 SAP SD 2-Tier results at BestPerf blog: http://blogs.sun.com/BestPerf/entry/sun_m9000_fastest_sap_2

SAP Benchmark Disclosure statement

Two-tier SAP Sales and Distribution (SD) standard SAP enhancement package 4 for SAP ERP 6.0 (Unicode) application benchmarks as of Jan 22, 2010: 

Cert# Server Users SAPS Procs Cores Thrds CPU CPU MHz Mem (MB) Operating System RDBMS Release
2009046 Sun SPARC M9000 32000 175600 64 256 512 SPARC64 VII 2880 1179648 Solaris 10 Oracle 10g
2009049 Sun Fire X4640 10000 55070 8 48 48 AMD Opt 8435 2600 262144 Solaris 10 Oracle 10g
2009052 HP ProL DL785 G6 8022 43800 8 48 48 AMD Opt 8439SE 2800 131072 SuSE Linux ES10 MaxDB 7.8
2009035 HP ProL DL785 G6 8280 45350 8 48 48 AMD Opt 8439SE 2800 131072 Windows 2008-EE SQL Server 2008
2009026 Sun SPARC T5440 4720 25830 4 32 256 UltraSPARC T2Plus 1600 262144 Solaris 10 Oracle 10g
2009025 HP ProL DL585 G6 4665 25530 4 24 24 AMD Opt 8439SE 2800 65536 Windows 2008-EE SQL Server 2008
2009051 HP ProL DL585 G6 4370 23850 4 24 24 AMD Opt 8439SE 2800 65536 SuSE Linux ES10 MaxDB 7.8
2009033 Sun Fire X4270 3800 21000 2 8 16 Intel Xeon X5570 2930 49152 Solaris 10 Oracle 10g
2009004 HP ProL DL380 G6 3300 18030 2 8 16 Intel Xeon X5570 2930 49152 Windows 2008-EE SQL Server 2008
2009006 HP ProL DL380 G6 3171 17380 2 8 16 Intel Xeon X5570 2930 49152 SuSE Linux ES10 MaxDB 7.8
2009050 HP ProL DL385 G6 2315 12650 2 12 12 AMD Opt 2435 2600 49152 SuSE Linux ES10 MaxDB 7.8

SAP, R/3, reg TM of SAP AG in Germany and other countries. More info www.sap.com/benchmark

Friday Nov 20, 2009

Sun Blade 6048 and Sun Blade X6275 NAMD Molecular Dynamics Benchmark beats IBM BlueGene/L

Significance of Results

A Sun Blade 6048 chassis with 48 Sun Blade X6275 server modules ran benchmarks using the NAMD molecular dynamics applications software. NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. NAMD is driven by major trends in computing and structural biology and received a 2002 Gordon Bell Award.

  • The cluster of 32 Sun Blade X6275 server modules was 9.2x faster than the 512 processor configuration of the IBM BlueGene/L.

  • The cluster of 48 Sun Blade X6275 server modules exhibited excellent scalability for NAMD molecular dynamics simulation, up to 37.8x speedup for 48 blades relative to 1 blade.

  • For the largest molecule considered, the cluster of 48 Sun Blade X6275 server modules achieved a throughput of 0.028 seconds per simulation step.

Molecular dynamics simulation is important to biological and materials science research. Molecular dynamics is used to determine the low energy conformations or shapes of a molecule. These conformations are presumed to be the biologically active conformations.

Performance Landscape

The NAMD Performance Benchmarks web page plots the performance of NAMD when the ApoA1 benchmark is executed on a variety of clusters. The performance is expressed in terms of the time in seconds required to execute one step of the molecular dynamics simulation, multiplied by the number of "processors" on which NAMD executes in parallel. The following table compares the performance of the Sun Blade X6275 cluster to several of the clusters for which performance is reported on the web page. In this table, the performance is expressed in terms of the time in seconds required to execute one step of the molecular dynamics simulation. A smaller number implies better performance.

Cluster Name and Interconnect | Throughput for 128 Cores (seconds per step) | Throughput for 256 Cores (seconds per step) | Throughput for 512 Cores (seconds per step)
Sun Blade X6275 InfiniBand 0.014 0.0073 0.0048
Cambridge Xeon/3.0 InfiniPath 0.016 0.0088 0.0056
NCSA Xeon/2.33 InfiniBand 0.019 0.010 0.008
AMD Opteron/2.2 InfiniPath 0.025 0.015 0.008
IBM HPCx PWR4/1.7 Federation 0.039 0.021 0.013
SDSC IBM BlueGene/L MPI 0.108 0.061 0.044

The following tables report results for NAMD molecular dynamics using a cluster of Sun Blade X6275 server modules. The performance of the cluster is expressed in terms of the time in seconds that is required to execute one step of the molecular dynamics simulation. A smaller number implies better performance.

Blades | Cores | STMV molecule (1): Thruput (secs/step) / spdup / effi'cy | f1 ATPase molecule (2): Thruput (secs/step) / spdup / effi'cy | ApoA1 molecule (3): Thruput (secs/step) / spdup / effi'cy
48 768 0.0277 37.8 79% 0.0075 35.2 73% 0.0039 22.2 46%
36 576 0.0324 32.3 90% 0.0096 27.4 76% 0.0045 19.3 54%
32 512 0.0368 28.4 89% 0.0104 25.3 79% 0.0048 18.1 57%
24 384 0.0481 21.8 91% 0.0136 19.3 80% 0.0066 13.2 55%
16 256 0.0715 14.6 91% 0.0204 12.9 81% 0.0073 11.9 74%
12 192 0.0875 12.0 100% 0.0271 9.7 81% 0.0096 9.1 76%
8 128 0.1292 8.1 101% 0.0337 7.8 98% 0.0139 6.3 79%
4 64 0.2726 3.8 95% 0.0666 4.0 100% 0.0224 3.9 98%
1 16 1.0466 1.0 100% 0.2631 1.0 100% 0.0872 1.0 100%

spdup - speedup versus 1 blade result
effi'cy - speedup efficiency versus 1 blade result

(1) Satellite Tobacco Mosaic Virus (STMV) molecule, 1,066,628 atoms, 12 Angstrom cutoff, Langevin dynamics, 500 time steps
(2) f1 ATPase molecule, 327,506 atoms, 11 Angstrom cutoff, particle mesh Ewald dynamics, 500 time steps
(3) ApoA1 molecule, 92,224 atoms, 12 Angstrom cutoff, particle mesh Ewald dynamics, 500 time steps
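
The spdup and effi'cy columns are consistent with the usual definitions: speedup is the 1-blade time per step divided by the N-blade time per step, and efficiency is that speedup divided by the blade count. For example, using the STMV column:

    #include <cstdio>

    int main() {
        const double secs_per_step_1_blade   = 1.0466;  // STMV, 1 blade (table above)
        const double secs_per_step_48_blades = 0.0277;  // STMV, 48 blades
        const int blades = 48;

        const double speedup    = secs_per_step_1_blade / secs_per_step_48_blades;
        const double efficiency = 100.0 * speedup / blades;
        std::printf("speedup = %.1f, efficiency = %.0f%%\n", speedup, efficiency);  // 37.8, 79%
        return 0;
    }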

Results and Configuration Summary

Hardware Configuration

    48 x Sun Blade X6275, each with
      2 x (2 x 2.93 GHz Intel QC Xeon X5570 (Nehalem) processors)
      2 x (24 GB memory)
      Hyper-Threading (HT) off, Turbo Mode on

Software Configuration

    SUSE Linux Enterprise Server 10 SP2 kernel version 2.6.16.60-0.31_lustre.1.8.0.1-smp
    OpenMPI 1.3.2
    gcc 4.1.2 (1/15/2007), gfortran 4.1.2 (1/15/2007)

Benchmark Description

Molecular dynamics simulation is widely used in biological and materials science research. NAMD is a public-domain molecular dynamics software application for which a variety of molecular input directories are available. Three of these directories define:
  • the Satellite Tobacco Mosaic Virus (STMV) that comprises 1,066,628 atoms
  • the f1 ATPase enzyme that comprises 327,506 atoms
  • the ApoA1 enzyme that comprises 92,224 atoms
Each input directory also specifies the type of molecular dynamics simulation to be performed, for example, Langevin dynamics with a 12 Angstrom cutoff for 500 time steps, or particle mesh Ewald dynamics with an 11 Angstrom cutoff for 500 time steps.

Key Points and Best Practices

Models with large numbers of atoms scale better than models with small numbers of atoms.

The Intel QC X5570 processors include a turbo boost feature coupled with a speed-step option in the CPU section of the Advanced BIOS settings. Under specific circumstances, this can provide CPU overclocking which increases the processor frequency from 2.93GHz to 3.33GHz. This feature was enabled when generating the results reported here.

See Also

Disclosure Statement

NAMD, see http://www.ks.uiuc.edu/Research/namd/performance.html for more information, results as of 11/17/2009.

Monday Nov 02, 2009

Sun Ultra 27 Delivers Leading Single Frame Buffer SPECviewperf 10 Results

A Sun Ultra 27 workstation configured with an nVidia FX5800 graphics card delivered outstanding performance running the SPECviewperf® 10 benchmark.

  • When compared with other workstations running a single graphics card (i.e. not running two or more cards in SLI mode), the Sun Ultra 27 workstation places first in 6 of 8 subtests and second in the remaining two subtests.

  • The calculated geometric mean shows that the Sun Ultra 27 workstation is 11% faster than the competitors' workstations.

  • The optimum point for price/performance is the nVidia FX1800 graphics card.

Results have been published on the SPEC web site at http://www.spec.org/gwpg/gpc.data/vp10/summary.html.

Performance Landscape

Performance of the Sun Ultra 27 versus the competition. Bigger is better for each of the eight tests; performance is measured in frames per second. The percentage columns show how much faster the Sun Ultra 27 workstation is than each competing configuration (a negative value means the competitor is faster).


Workstation / Graphics | 3DSMAX Perf / % | CATIA Perf / % | ENSIGHT Perf / % | MAYA Perf / %
Sun Ultra 27 FX5800 | 59.34 / - | 68.81 / - | 58.07 / - | 246.09 / -
HP xw4600 ATI FireGL V7700 | 49.71 / 19 | 48.05 / 43 | 57.11 / 2 | 268.62 / -8
HP xw4600 FX4800 | 52.26 / 14 | 63.26 / 12 | 53.79 / 8 | 226.82 / 7
Fujitsu Celsius M470 FX3800 | 53.67 / 11 | 65.25 / 7 | 52.19 / 10 | 227.37 / 7

Workstation / Graphics | PROENGINEER Perf / % | SOLIDWORKS Perf / % | TEAMCENTER Perf / % | UGS Perf / %
Sun Ultra 27 FX5800 | 68.96 / - | 152.01 / - | 42.02 / - | 36.04 / -
HP xw4600 ATI FireGL V7700 | 47.25 / 32 | 109.71 / 28 | 40.18 / 4 | 56.65 / -57
HP xw4600 FX4800 | 61.15 / 11 | 131.31 / 14 | 28.42 / 32 | 33.43 / 7
Fujitsu Celsius M470 FX3800 | 64.39 / 7 | 139.2 / 8 | 29.02 / 31 | 33.27 / 8

Comparison of various frame buffers on the Sun Ultra 27 running SPECviewperf 10. Performance is reported for each test along with the difference in performance as compared to the FX5800 frame buffer. The runs in the table below were made with 3.2GHz W3570 processors.


Frame Buffer | 3DSMAX Perf / % | CATIA Perf / % | ENSIGHT Perf / % | MAYA Perf / % | PROENGR Perf / % | SOLIDWRKS Perf / % | TEAMCNTR Perf / % | UGS Perf / %
FX5800 | 57.07 / - | 67.84 / - | 58.63 / - | 219.4 / - | 68.05 / - | 152.3 / - | 40.85 / - | 34.73 / -
FX3800 | 57.17 / 0 | 66.57 / 2 | 54.91 / 7 | 206.4 / 6 | 66.48 / 2 | 146.3 / 4 | 38.48 / 6 | 33.12 / 5
FX1800 | 56.73 / 1 | 64.33 / 6 | 52.05 / 13 | 189.3 / 16 | 64.67 / 5 | 135.2 / 13 | 34.18 / 20 | 30.46 / 14
FX380 | 45.90 / 24 | 55.81 / 22 | 34.93 / 68 | 120.3 / 82 | 46.09 / 48 | 64.11 / 138 | 17.00 / 140 | 13.88 / 150
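
The percentage columns in the frame buffer table above appear to be the relative advantage of the reference configuration (the FX5800) over each alternative, computed from the frames-per-second numbers; for example:

    #include <cstdio>

    int main() {
        // Example: TEAMCNTR, FX5800 vs FX380 (values from the table above).
        const double reference = 40.85;
        const double other     = 17.00;

        // Positive means the reference frame buffer is faster.
        const double advantage_pct = 100.0 * (reference / other - 1.0);
        std::printf("advantage = %.0f%%\n", advantage_pct);  // ~140%
        return 0;
    }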

Results and Configuration Summary

Hardware Configuration:

    Sun Ultra 27 Workstation
    1 x 3.33 GHz Intel Xeon (tm) W3580
    2GB (1 x 2GB PC10600 1333MHz)
    1 x 500GB SATA
    nVidia Quadro FX380, FX1800, FX3800 & FX5800
    $7,529.00 (includes Microsoft Windows and monitor)

Software Configuration:

    OS: Microsoft Windows Vista Ultimate, 32-bit
    Benchmark: SPECviewperf 10

Benchmark Description

SPECviewperf measures 3D graphics rendering performance of systems running under OpenGL. SPECviewperf is a synthetic benchmark designed to be a predictor of application performance and a measure of graphics subsystem performance (primarily graphics bus, driver and graphics hardware) and its impact on the system without the full overhead of an application. SPECviewperf reports performance in frames per second.

Please go here for a more complete description of the tests.

Key Points and Best Practices

SPECviewperf measures the 3D rendering performance of systems running under OpenGL.

The SPECopc project group's SPECviewperf 10 is totally new performance evaluation software. In addition to features found in previous versions, it now provides the ability to compare performance of systems running in higher-quality graphics modes that use full-scene anti-aliasing, and measures how effectively graphics subsystems scale when running multithreaded graphics content. Since the SPECviewperf source and binaries have been upgraded to support these changes, no comparisons should be made between past results and current results for viewsets running under SPECviewperf 10.

SPECviewperf 10 requires OpenGL 1.5 and a minimum of 1GB of system memory. It currently supports Windows 32/64.

See Also

Disclosure Statement

SPEC® and the benchmark name SPECviewperf® are registered trademarks of the Standard Performance Evaluation Corporation. Competitive benchmark results stated above reflect results published on www.spec.org as of Oct 18, 2009. For the latest SPECviewperf benchmark results, visit www.spec.org/gwpg.

Saturday Oct 24, 2009

Sun C48 & Lustre fast for Seismic Reverse Time Migration using Sun X6275

Significance of Results

A Sun Blade 6048 Modular System with 12 Sun Blade X6275 server modules was clustered together with QDR InfiniBand and a Lustre file system (also over QDR InfiniBand) to show performance improvements over an NFS file system when reading in Velocity, Epsilon, and Delta Slices and imaging 800 samples of various grid sizes using the Reverse Time Migration.

  • The Initialization Time for populating the processing grids demonstrates significant advantages of Lustre over NFS:
    • 2486x1151x1231 : 20x improvement
    • 1243x1151x1231 : 20x improvement
    • 125x1151x1231 : 11x improvement
  • The Total Application Performance shows the Interconnect and I/O advantages of using QDR InfiniBand Lustre for the large grid sizes:
    • 2486x1151x1231 : 2x improvement - processed in less than 19 minutes
    • 1243x1151x1231 : 2x improvement - processed in less than 10 minutes

  • The Computational Kernel Scalability Efficiency for the 3 grid sizes:
    • 125x1151x1231 : 97% (1-8 nodes)
    • 1243x1151x1231 : 102% (8-24 nodes)
    • 2486x1151x1231 : 100% (12-24 nodes)

  • The Total Application Scalability Efficiency for the large grid sizes:
    • 1243x1151x1231 : 72% (8-24 nodes)
    • 2486x1151x1231 : 71% (12-24 nodes)

  • On the Intel X5570 processor, enabling HyperThreading and running 16 OpenMP threads per node gives approximately a 10% performance improvement over running 8 threads per node.

Performance Landscape

This first table presents the initialization time, comparing different numbers of processors and different problem sizes. The results are presented in seconds and show the advantage that the Lustre file system running over QDR InfiniBand provides when compared to a simple NFS file system.


Initialization Time Performance Comparison
Reverse Time Migration - SMP Threads and MPI Mode
(800 samples per grid; times in seconds)

Nodes Procs | 125 x 1151 x 1231: Lustre / NFS | 1243 x 1151 x 1231: Lustre / NFS | 2486 x 1151 x 1231: Lustre / NFS
24 48 | 1.59 / 18.90 | 8.90 / 181.78 | 15.63 / 362.48
20 40 | 1.60 / 18.90 | 8.93 / 181.49 | 16.91 / 358.81
16 32 | 1.58 / 18.59 | 8.97 / 181.58 | 17.39 / 353.72
12 24 | 1.54 / 18.61 | 9.35 / 182.31 | 22.50 / 364.25
8 16 | 1.40 / 18.60 | 10.02 / 183.79 | -
4 8 | 1.57 / 18.80 | - | -
2 4 | 2.54 / 19.31 | - | -
1 2 | 4.54 / 20.34 | - | -

This next table presents the total application run time, comparing different numbers of processors and different problem sizes. It shows that for larger problems, using the Lustre file system running over QDR InfiniBand provided a big performance advantage when compared to a simple NFS file system.


Total Application Performance Comparison
Reverse Time Migration - SMP Threads and MPI Mode
(800 samples per grid; times in seconds)

Nodes Procs | 125 x 1151 x 1231: Lustre / NFS | 1243 x 1151 x 1231: Lustre / NFS | 2486 x 1151 x 1231: Lustre / NFS
24 48 | 251.48 / 273.79 | 553.75 / 1125.02 | 1107.66 / 2310.25
20 40 | 232.00 / 253.63 | 658.54 / 971.65 | 1143.47 / 2062.80
16 32 | 227.91 / 209.66 | 826.37 / 1003.81 | 1309.32 / 2348.60
12 24 | 217.77 / 234.61 | 884.27 / 1027.23 | 1579.95 / 3877.88
8 16 | 223.38 / 203.14 | 1200.71 / 1362.42 | -
4 8 | 341.14 / 272.68 | - | -
2 4 | 605.62 / 625.25 | - | -
1 2 | 892.40 / 841.94 | - | -

The following table presents the run time and speedup of just the computational kernel for different processor counts for the three different problem sizes considered. The scaling results are based upon the smallest number of nodes run and that number is used as the baseline reference point.


Computational Kernel Performance & Scalability
Reverse Time Migration - SMP Threads and MPI Mode
(800 samples per grid; X6275 times in seconds; speedup relative to 1 node)

Nodes Procs | 125 x 1151 x 1231: Time / Speedup | 1243 x 1151 x 1231: Time / Speedup | 2486 x 1151 x 1231: Time / Speedup
24 48 | 35.38 / 13.7 | 210.82 / 24.5 | 427.40 / 24.0
20 40 | 35.02 / 13.8 | 255.27 / 20.2 | 517.03 / 19.8
16 32 | 41.76 / 11.6 | 317.96 / 16.2 | 646.22 / 15.8
12 24 | 49.53 / 9.8 | 422.17 / 12.2 | 853.37 / 12.0*
8 16 | 62.34 / 7.8 | 645.27 / 8.0* | -
4 8 | 124.66 / 3.9 | - | -
2 4 | 238.80 / 2.0 | - | -
1 2 | 484.89 / 1.0 | - | -

The last table presents the speedup of the total application for different processor counts for the three different problem sizes presented. The scaling results are based upon the smallest number of nodes run and that number is used as the baseline reference point.


Total Application Scalability Comparison
Reverse Time Migration - SMP Threads and MPI Mode
(800 samples per grid; Lustre speedup relative to 1 node)

Nodes Procs | 125 x 1151 x 1231: Speedup | 1243 x 1151 x 1231: Speedup | 2486 x 1151 x 1231: Speedup
24 48 | 3.6 | 17.3 | 17.1
20 40 | 3.8 | 14.6 | 16.6
16 32 | 4.0 | 11.6 | 14.5
12 24 | 4.1 | 10.9 | 12.0*
8 16 | 4.0 | 8.0* | -
4 8 | 2.6 | - | -
2 4 | 1.5 | - | -
1 2 | 1.0 | - | -

Note: HyperThreading is enabled and running 16 threads per Node.

Results and Configuration Summary

Hardware Configuration:
    Sun Blade 6048 Modular System with
      12 x Sun Blade x6275 Server Modules, each with
        4 x 2.93 GHz Intel Xeon QC X5570 processors
        12 x 4 GB memory at 1333 MHz
        2 x 24 GB Internal Flash
    QDR InfiniBand Lustre 1.8.0.1 File System
    GBit NFS file system

Software Configuration:

    OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
    MPI: Scali MPI Connect 5.6.6-59413
    Compiler: Sun Studio 12 C++, Fortran, OpenMP

Benchmark Description

The Reverse Time Migration (RTM) is currently the most popular seismic processing algorithm because of its ability to produce quality images of complex substructures. It can accurately image steep dips that can not be imaged correctly with traditional Kirchhoff 3D or frequency domain algorithms. The Wave Equation Migration (WEM) can image steep dips but does not produce the image quality that can be achieved by the RTM. However, the increased computational complexity of the RTM over the WEM introduces new areas for performance optimization. The current trend in seismic processing is to perform iterative migrations on wide azimuth marine data surveys using the Reverse Time Migration.

This Reverse Time Migration code reads in processing parameters that define the grid dimensions, number of threads, number of processors, imaging condition, and various other parameters. The master node calculates the memory requirements to determine if there is sufficient memory to process the migration "in-core". The domain decomposition across all the nodes is determined by dividing the first grid dimension by the number of nodes. Each node then reads in its section of the Velocity Slices, Delta Slices, and Epsilon Slices using MPI IO reads. The three source and receiver wavefield state vectors are created: previous, current, and next state. The processing steps through the input trace data reading both the receiver and source data for each of the 800 time steps. It uses forward propagation for the source wave field and backward propagation in time to cross correlate the receiver wavefield. The computational kernel consists of a 13-point stencil to process a subgrid within the memory of each node using OpenMP parallelism. Afterwards, conditioning and absorption are applied and boundary data is communicated to neighboring nodes as each time step is processed. The final image is written out using MPI IO.
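
The production kernel is not reproduced here; the sketch below is only a generic 13-point (fourth-order in space) stencil update over one node's subgrid, parallelized with OpenMP as described above. The coefficients, unit grid spacing, function and parameter names, and the absence of absorbing-boundary handling are simplifying assumptions.

    #include <vector>
    #include <cstddef>

    // Generic 13-point stencil time step: next = 2*cur - prev + dt^2 * v^2 * laplacian(cur).
    void stencil_step(const std::vector<float>& prev, const std::vector<float>& cur,
                      std::vector<float>& next, const std::vector<float>& vel2,
                      std::size_t nx, std::size_t ny, std::size_t nz, float dt2)
    {
        auto idx = [=](std::size_t i, std::size_t j, std::size_t k) {
            return (i * ny + j) * nz + k;
        };
        // Fourth-order central-difference weights (unit grid spacing assumed).
        const float c0 = -2.5f, c1 = 4.0f / 3.0f, c2 = -1.0f / 12.0f;

        #pragma omp parallel for collapse(2)
        for (std::size_t i = 2; i < nx - 2; ++i)
            for (std::size_t j = 2; j < ny - 2; ++j)
                for (std::size_t k = 2; k < nz - 2; ++k) {
                    const float lap =
                        3.0f * c0 * cur[idx(i, j, k)]
                      + c1 * (cur[idx(i+1,j,k)] + cur[idx(i-1,j,k)]
                            + cur[idx(i,j+1,k)] + cur[idx(i,j-1,k)]
                            + cur[idx(i,j,k+1)] + cur[idx(i,j,k-1)])
                      + c2 * (cur[idx(i+2,j,k)] + cur[idx(i-2,j,k)]
                            + cur[idx(i,j+2,k)] + cur[idx(i,j-2,k)]
                            + cur[idx(i,j,k+2)] + cur[idx(i,j,k-2)]);
                    // Second-order update of the wavefield in time.
                    next[idx(i,j,k)] = 2.0f * cur[idx(i,j,k)] - prev[idx(i,j,k)]
                                     + dt2 * vel2[idx(i,j,k)] * lap;
                }
    }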

Total memory requirements for each grid size:

    125x1151x1231: 7.5GB
    1243x1151x1231: 78GB
    2486x1151x1231: 156GB

For this phase of benchmarking, the focus was to optimize the data initialization. In the next phase of benchmarking, the trace data reading will be optimized so that each node reads in only its section of interest. In this benchmark the trace data reading skews the Total Application Performance as the number of nodes increases. This will be optimized in the next phase of benchmarking, as well as further node optimization with OpenMP. The IO description for this benchmark phase on each grid size:

    125x1151x1231:
      Initialization MPI Read: 3 x 709MB = 2.1GB / number of nodes
      Trace Data Read per Node: 2 x 800 x 576KB = 920MB x number of nodes
      Final Output Image MPI Write: 709MB / number of nodes
    1243x1151x1231: 78GB
      Initialization MPI Read: 3 x 7.1GB = 21.3GB / number of nodes
      Trace Data Read per Node: 2 x 800 x 5.7MB = 9.2GB x number of nodes
      Final Output Image MPI Write: 7.1GB / number of nodes
    2486x1151x1231: 156GB
      Initialization MPI Read: 3 x 14.2GB = 42.6GB / number of nodes
      Trace Data Read per Node: 2 x 800 x 11.4MB = 18.4GB x number of nodes
      Final Output Image MPI Write: 14.2GB / number of nodes
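
These I/O volumes follow directly from the grid dimensions, assuming 4-byte floats and decimal (base-10) units; the small check below approximately reproduces the 709 MB, 7.1 GB and 14.2 GB per-slice figures quoted above:

    #include <cstdio>

    int main() {
        const long long grids[3][3] = { { 125, 1151, 1231},
                                        {1243, 1151, 1231},
                                        {2486, 1151, 1231} };
        for (const auto& g : grids) {
            const double slice_bytes = 4.0 * g[0] * g[1] * g[2];  // one float per cell
            // Three slices (Velocity, Epsilon, Delta) are read at initialization.
            std::printf("%lld x %lld x %lld: slice %.2f GB, initialization read %.1f GB\n",
                        g[0], g[1], g[2], slice_bytes / 1e9, 3.0 * slice_bytes / 1e9);
        }
        return 0;
    }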

Key Points and Best Practices

  • Additional evaluations were performed to compare GBit NFS, Infiniband NFS, and Infiniband Lustre for the Reverse Time Migration Initialization. Infiniband NFS was 6x faster than GBit NFS and Infiniband Lustre was 3x faster than Infiniband NFS using the same disk configurations. On 12 nodes for grid size 2486x1151x1231 the initialization time was 22.50 seconds for IB Lustre, 61.03 seconds for IB NFS, and 364.25 seconds for GBit NFS.
  • The Reverse Time Migration computational performance scales nicely as a function of the grid size being processed. This is consistent with the IBM published results for this application.
  • The Total Application performance results are not typically reported in benchmark studies for this application. The IBM report specifically states that the execution times do not include I/O times and non-recurring allocation or initialization delays. Examining the total application performance reveals that the workload is no longer dominated by the partial differential equation (PDE) solver, as IBM suggests, but is constrained by the I/O for grid initialization, reading in the traces, saving/restoring wave state data, and writing out the final image. Aggressive optimization of the PDE solver has little effect on the overall throughput of this application. It is clearly more important to optimize the I/O. The trend in seismic processing, as stated at the 2008 Society of Exploration Geophysicists (SEG) conference, is to run the reverse time migration iteratively on wide azimuth data. Thus, optimizing the I/O and application throughput is imperative to meet this trend. SSD and Flash technologies in conjunction with Sun's Lustre file system can reduce this I/O bottleneck and pave the path for the future in seismic processing.
  • Minimal tuning effort was applied to achieve the results presented. Sun's HPC software stack, which includes the Sun Studio compiler, was used to build the 70000 lines of C++ and Fortran source into the application executable. The only compiler option used was "-fast". No assembly level optimizations, like those performed by IBM to use SIMD registers (SSE registers), were performed in this benchmark. Similarly, no explicit cache blocking, loop unrolling, or memory bandwidth optimizations were conducted. The idea was to demonstrate the performance that a customer can expect from their existing applications without extensive, platform specific optimizations.

See Also

Disclosure Statement

Reverse Time Migration, Results as of 10/23/2009. For more info http://www.caam.rice.edu/tech_reports/2006/TR06-18.pdf

Sun F5100 and Seismic Reverse Time Migration with faster Optimal Checkpointing

A prominent seismic processing algorithm, Reverse Time Migration with Optimal Checkpointing, in SMP "THREADS" mode, was tested using a Sun Fire X4270 server configured with four high performance 15K SAS hard disk drives (HDDs) and a Sun Storage F5100 Flash Array. This benchmark compares I/O devices for checkpointing wave state information while processing a production seismic migration.

  • Sun Storage F5100 Flash Array is 2.2x faster than high-performance 15K RPM disks.

  • Multithreading the checkpointing using the Sun Studio C++ Compiler OpenMP implementation gives a 12.8x performance improvement over the original single threaded version.

These results show that the emerging trend in seismic processing, running iterative Reverse Time Migrations with migration playback, is now practical. This is made possible through the use of Sun FlashFire technology to provide good checkpointing speeds without additional disk cache memory. The application can take advantage of all the memory within a node without regard to checkpoint cache buffers required for performance to HDDs. Similarly, larger problem sizes can be solved without increasing the memory footprint of each computational node.

Performance Landscape


Reverse Time Migration Optimal Checkpointing - SMP Threads Mode
Grid Size - 800 x 1151 x 1231 with 800 Samples - 60GB of memory
(times in seconds)

Number of Checkpts | HDD: Put / Get / Total | F5100: Put / Get / Total | F5100 Speedup
80 | 660.8 / 25.8 / 686.6 | 277.4 / 40.2 / 317.6 | 2.2x
400 | 1615.6 / 382.3 / 1997.9 | 989.5 / 269.7 / 1259.2 | 1.6x


Reverse Time Migration Optimal Checkpointing - SMP Threads Mode
Grid Size - 125 x 1151 x 1231 with 800 Samples - 9GB of memory
(times in seconds)

Number of Checkpts | HDD: Put / Get / Total | F5100: Put / Get / Total | F5100 Speedup
80 | 10.2 / 0.2 / 10.4 | 8.0 / 0.2 / 8.2 | 1.3x
400 | 52.3 / 0.4 / 52.7 | 45.2 / 0.3 / 45.5 | 1.2x
800 | 102.6 / 0.7 / 103.3 | 91.8 / 0.6 / 92.4 | 1.1x


Reverse Time Migration Optimal Checkpointing
Single Thread vs Multithreaded I/O Performance
Grid Size - 125 x 1151 x 1231 with 800 Samples - 9GB of memory

Number of Checkpts | Single Thread F5100 Total Time (secs) | Multithreaded F5100 Total Time (secs) | Multithread Speedup
80 | 105.3 | 8.2 | 12.8x
400 | 482.9 | 45.5 | 10.6x
800 | 963.5 | 92.4 | 10.4x

Note: Hyperthreading and Turbo Mode enabled while running 16 threads per node.

Results and Configuration Summary

Hardware Configuration:

    Sun Fire 4270 Server
      2 x 2.93 GHz Quad-core Intel Xeon X5570 processors
      72 GB memory
      4 x 73 GB 15K SAS drives
        File system striped across 4 15K RPM high-performance SAS HD RAID0
      Sun Storage F5100 Flash Array with local/internal r/w buff 4096
        20 x 24 GB flash modules

Software Configuration:

    OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
    Compiler: Sun Studio 12 C++, Fortran, OpenMP

Benchmark Description

The Reverse Time Migration (RTM) is currently the most popular seismic processing algorithm because of its ability to produce quality images of complex substructures. It can accurately image steep dips that can not be imaged correctly with traditional Kirchhoff 3D or frequency domain algorithms. The Wave Equation Migration (WEM) can image steep dips but does not produce the image quality that can be achieved by the RTM. However, the increased computational complexity of the RTM over the WEM introduces new areas for performance optimization. The current trend in seismic processing is to perform iterative migrations on wide azimuth marine data surveys using the Reverse Time Migration.

The Reverse Time Migration with Optimal Checkpointing was introduced so large migrations could be performed within minimal memory configurations of x86 cluster nodes. The idea is to only have three wavestate vectors in memory for each of the source and receiver wavefields instead of holding the entire wavefields in memory for the duration of processing. With the Sun Flash F5100, this can be done with little performance penalty to the full migration time. Another advantage of checkpointing is to provide the ability to playback migrations and facilitate iterative migrations.

  • The stored snapshot data can be reprocessed with different filtering, image conditioning, or a variety of other parameters.
  • Fine grain snapshoting can help the processing of more complex subsurface data.
  • A Geoscientist can "playback" a migration from the saved snapshots to visually validate migration accuracy or pick areas of interest for additional processing.

The Reverse Time Migration with Optimal Checkpointing is an algorithm designed by Griewank (Griewank, 1992; Blanch et al., 1998; Griewank, 2000; Griewank and Walther, 2000; Akcelik et al., 2003).

  • The application takes snapshots of wavefield state data for some interval of the total number of samples.
  • This adjoint state method performs crosscorrelation of the source and receiver wavefields at each level.
  • Forward recursion is used for the source wavefield and backward recursion for the receiver wavefield.
  • For relatively small seismic migrations, all of the forward processed state information can be saved and restored with minimal impact on the total processing time.
  • Effectively, the computational complexity increases while the memory requirements decrease by a logarithmic factor of the number of snapshots.
  • Griewank's algorithm helps define the most optimal tradeoff between computational performance and the number of memory buffers (memory requirements) to support this cross correlation.

For the purposes of this benchmark, this implementation of the Reverse Time Migration with Optimal Checkpointing does not fully implement the optimal memory buffer scheme proposed by Griewank. The intent is to compare various I/O alternatives for saving wave state data for each node in a compute cluster.

This benchmark measures the time to perform the wave state saves and restores while simultaneously processing the wave state data.
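
The benchmark's actual I/O code is not reproduced here. The sketch below only illustrates the general idea of a multithreaded checkpoint "put": each OpenMP thread writes its own slab of the wave-state buffer at a distinct offset with pwrite, so the threads do not contend on a shared file position. The file name, buffer size and thread count are illustrative assumptions.

    #include <algorithm>
    #include <cstdio>
    #include <vector>
    #include <fcntl.h>
    #include <unistd.h>
    #include <omp.h>

    int main() {
        const size_t n = size_t(64) << 20;          // 64M floats (~256 MB) of wave state
        std::vector<float> state(n, 1.0f);

        int fd = open("checkpoint.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { std::perror("open"); return 1; }

        #pragma omp parallel num_threads(16)
        {
            const int t  = omp_get_thread_num();
            const int nt = omp_get_num_threads();
            const size_t chunk = (n + nt - 1) / nt;
            const size_t begin = std::min(n, size_t(t) * chunk);
            const size_t end   = std::min(n, begin + chunk);
            // Each thread writes its slab at its own offset (thread-safe).
            if (begin < end &&
                pwrite(fd, state.data() + begin, (end - begin) * sizeof(float),
                       off_t(begin * sizeof(float))) < 0) {
                std::perror("pwrite");
            }
        }
        close(fd);
        return 0;
    }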

Key Points and Best Practices

  • Multithreading the checkpointing using Sun Studio OpenMP and running 16 I/O threads with hyperthreading enabled gives a performance advantage over single-threaded I/O to the Sun Storage F5100 flash array. The Sun Storage F5100 flash array can process concurrent I/O requests from multiple threads very efficiently.
  • Allocating the majority of a node's available memory to the Reverse Time Migration algorithm and leaving little memory for I/O caching favors the Sun Storage F5100 flash array over direct attached high performance disk drives. This performance advantage decreases as the number of snapshots increase. The reason for this is that increasing the number of snapshots decreases the memory requirement for the application.

See Also

Disclosure Statement

Reverse Time Migration with Optimal Checkpointing, Results as of 10/23/2009. For more info http://www.caam.rice.edu/tech_reports/2006/TR06-18.pdf

Tuesday Oct 13, 2009

CP2K Life Sciences, Ab-initio Dynamics - Sun Blade 6048 Chassis with Sun Blade X6275 - Scalability and Throughput with Quad Data Rate InfiniBand

Significance of Results

Clusters of Sun Blade X6275 and X6270 server modules were used to run benchmarks using the CP2K ab-initio dynamics applications software.

  • For the X6270 cluster with Dual Data Rate (DDR) InfiniBand the rate of increase of scalability slows dramatically at 16 nodes, whereas for the X6275 cluster with QDR InfiniBand the scalability continues to 72 nodes.
  • For 64 nodes, the speed of the Sun Blade X6275 cluster with QDR InfiniBand was 2.7X that of a Sun Blade X6270 cluster with DDR InfiniBand.

Ab-initio dynamics simulation is important to materials science research.  Dynamics simulation is used to determine the trajectories of atoms or molecules over time.

Performance Landscape

The CP2K Bulk Water Benchmarks web page plots the performance of CP2K ab-initio dynamics benchmarks that have from 32 to 512 water molecules for a cluster that comprises two 2.66GHz Xeon E5430 quad core CPUs per node and that uses Dual Data Rate InfiniBand.

The following table reports the execution time for the 512 water molecule benchmark when executed on the Sun Blade X6275 cluster having Quad Data Rate InfiniBand and on the Sun Blade X6270 cluster having Dual Data Rate InfiniBand. Each node of either Sun Blade cluster comprises two 2.93GHz Intel Xeon X5570 quad core CPUs. In the following table, the performance is expressed in terms of the "wall clock" time in seconds required to execute ten steps of the ab-initio dynamics simulation for 512 water molecules. A smaller number implies better performance.

Number of Nodes | X6275 QDR InfiniBand (seconds for 10 steps) | X6270 DDR InfiniBand (seconds for 10 steps)
96 | - | 1184.36
72 | 564.16 | -
64 | 598.41 | 1591.35
32 | 706.82 | 1436.49
24 | 950.02 | 1752.20
16 | 1227.73 | 2119.50
12 | 1440.16 | 1739.26
8 | 1876.95 | 2120.73
4 | 3408.39 | 3705.44

Results and Configuration Summary

Hardware Configuration:

    Sun Blade[tm] 6048 Modular System with 3 shelves, each shelf with
      12 x Sun Blade X6275, each blade with
        2 x (2 x 2.93 GHz Intel QC Xeon X5570 processors)
        2 x (24 GB memory)
        Hyper-Threading (HT) off, Turbo Mode on
    QDR InfiniBand
    96 x Sun Blade X6270, each blade with
      2 x 2.93 GHz Intel QC Xeon X5570 processors
      1 x (24 GB memory)
      Hyper-Threading (HT) off, Turbo Mode off
    DDR InfiniBand
Software Configuration:
    SUSE Linux Enterprise Server 10 SP2 kernel version 2.6.16.60-0.31_lustre.1.8.0.1-smp
    OpenMPI 1.3.2
    Sun Studio 12 f90 compiler, ScaLAPACK, BLACS and Performance Libraries
    FFTW (Fastest Fourier Transform in the West) 3.2.1

Benchmark Description

CP2K is a parallel ab-initio dynamics code that is designed to perform atomistic and molecular simulations of solid state, liquid, molecular and biological systems. It provides a general framework for different methods such as e.g. density functional theory (DFT) using a mixed Gaussian and plane waves approach (GPW), and classical pair and many-body potentials.

Ab-initio dynamics simulation is widely used in materials science research. CP2K is a public-domain ab-initio dynamics software application.

Key Points and Best Practices

  • QDR InfiniBand scales better than DDR InfiniBand.
  • The Intel QC X5570 processors include a turbo boost feature coupled with a speed-step option in the CPU section of the Advanced BIOS settings. Under specific circumstances, this can provide CPU overclocking which increases the processor frequency from 2.93GHz to 3.2GHz. This feature was enabled for the X6275 and disabled for the X6270 when generating the results reported here.

See Also

Disclosure Statement

CP2K, see http://cp2k.berlios.de/ for more information, results as of 10/13/2009.

SAP 2-tier SD-Parallel on Sun Blade X6270 1-node, 2-node and 4-node

Significance of Results

  • Four Sun Blade X6270 (2 processors, 8 cores, 16 threads), running SAP ERP application Release 6.0 Enhancement Pack 4 (Unicode) with Oracle Database on top of Solaris 10 OS delivered the highest eight-processor result on the two-tier SAP SD-Parallel Standard Application Benchmark, as of Oct 12th, 2009.

  • Four Sun Blade X6270 servers with Intel Xeon X5570 processors achieved 1.9x performance improvement from two Sun Blade X6270 with the same processors.

  • Two Sun Blade X6270 (2 processors, 8 cores, 16 threads), running SAP ERP application Release 6.0 Enhancement Pack 4 (Unicode) with Oracle Database on top of Solaris 10 OS delivered the highest four-processor result on the two-tier SAP SD-Parallel Standard Application Benchmark, as of Oct 12th, 2009.

  • Two Sun Blade X6270 servers with Intel Xeon X5570 processors achieved a 1.9x performance improvement over a single 2-processor Sun Blade X6270 system.

  • A one-node Sun Blade X6270 server with Intel Xeon X5570 processors running Oracle RAC delivers the same 3,800-user result as a Sun Fire X4270 server with the same processors running Oracle, showing no performance difference between Oracle 10g and Oracle 10g RAC.

  • This benchmark highlights the near-linear scaling of Oracle 10g Real Application Cluster runs on Sun Microsystems hardware in a SAP environment.

  • In January 2009, a new version, the Two-tier SAP ERP 6.0 Enhancement Pack 4 (Unicode) Standard Sales and Distribution (SD) Benchmark, was released. This new release has higher cpu requirements and so yields from 25-50% fewer users compared to the previous Two-tier SAP ERP 6.0 (non-unicode) Standard Sales and Distribution (SD) Benchmark. 10-30% of this is due to the extra overhead from the processing of the larger character strings due to Unicode encoding. See this SAP Note for more details.

  • Unicode is a computing standard that allows for the representation and manipulation of text expressed in most of the world's writing systems. Before the Unicode requirement, this benchmark used ASCII characters, meaning each character was just 1 byte. The new version of the benchmark requires Unicode characters, and the Application layer (where ~90% of the cycles in this benchmark are spent) uses a new encoding, UTF-16, which uses 2 bytes to encode most characters (including all ASCII characters) and 4 bytes for some others. This requires computers to do more computation and use more bandwidth and storage for most character strings, as the short example below illustrates. Refer to the above SAP Note for more details.
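
A tiny illustration of the storage difference for a field that happens to be pure ASCII (the 4-byte surrogate-pair case for rarer characters is ignored here):

    #include <cstdio>
    #include <cstring>

    int main() {
        // One byte per character in the old non-Unicode kernel,
        // two bytes per character once the same text is held as UTF-16.
        const char field[] = "SD benchmark order line item";
        const size_t chars = std::strlen(field);
        std::printf("characters: %zu, ASCII bytes: %zu, UTF-16 bytes: %zu\n",
                    chars, chars, 2 * chars);
        return 0;
    }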

Performance Landscape

SAP SD-Parallel 2-Tier Performance Table (in decreasing performance order).

SAP ERP 6.0 Enhancement Pack 4 (Unicode) Results
(New version of the benchmark as of January 2009)

System | OS / Database | Users | SAP ERP/ECC Release | SAPS | Date
Four Sun Blade X6270, 2xIntel Xeon X5570 @2.93GHz, 48 GB | Solaris 10, Oracle 10g Real Application Clusters | 13,718 | 2009 6.0 EP4 (Unicode) | 75,762 | 12-Oct-09
Two Sun Blade X6270, 2xIntel Xeon X5570 @2.93GHz, 48 GB | Solaris 10, Oracle 10g Real Application Clusters | 7,220 | 2009 6.0 EP4 (Unicode) | 39,420 | 12-Oct-09
One Sun Blade X6270, 2xIntel Xeon X5570 @2.93GHz, 48 GB | Solaris 10, Oracle 10g Real Application Clusters | 3,800 | 2009 6.0 EP4 (Unicode) | 20,750 | 12-Oct-09
Sun Fire X4270, 2xIntel Xeon X5570 @2.93GHz, 48 GB | Solaris 10, Oracle 10g | 3,800 | 2009 6.0 EP4 (Unicode) | 21,000 | 21-Aug-09

Complete benchmark results may be found at the SAP benchmark website http://www.sap.com/benchmark.

Results and Configuration Summary

Four Sun Blade X6270 Servers, each with two Intel Xeon X5570 2.93 GHz(2 processors, 8 cores, 16 threads)

    Number of SAP SD benchmark users:
    13,718
    Average dialog response time:
    0.86 seconds
    Throughput:

    Dialog steps/hour:
    4,545,729

    SAPS:
    75,762
    SAP Certification:
    2009041

Two Sun Blade X6270 Servers, each with two Intel Xeon X5570 2.93 GHz(2 processors, 8 cores, 16 threads)

    Number of SAP SD benchmark users:
    7,220
    Average dialog response time:
    0.99 seconds
    Throughput:

    Dialog steps/hour:
    2,365,000

    SAPS:
    39,420
    SAP Certification:
    2009040

One Sun Blade X6270 Server, with two Intel Xeon X5570 2.93 GHz (2 processors, 8 cores, 16 threads)

    Number of SAP SD benchmark users:
    3,800
    Average dialog response time:
    0.99 seconds
    Throughput:

    Dialog steps/hour:
    1,245,000

    SAPS:
    20,750
    SAP Certification:
    2009039

Software:

    Oracle 10g Real Application Clusters
    Solaris 10 OS

Benchmark Description

The SAP Standard Application Sales and Distribution - Parallel (SD-Parallel) Benchmark is a two-tier ERP business test that is indicative of full business workloads of complete order processing and invoice processing, and demonstrates the ability to run both the application and database software on a single system. The SAP Standard Application SD Benchmark represents the critical tasks performed in real-world ERP business environments.

SD Versus SD-Parallel

The SD-Parallel Benchmark consists of the same transactions and user interaction steps as the SD Benchmark. This means that the SD-Parallel Benchmark runs the same business processes as the SD Benchmark. The difference between the benchmarks is the technical data distribution.

An Additional Rule for Parallel and Distributed Databases

The additional rule is: equally distribute the benchmark users across all database nodes for the used benchmark clients (round-robin method). Following this rule, all database nodes work on data of all clients. This avoids unrealistic configurations such as having only one client per database node.

The SAP Benchmark Council agreed to give the parallel benchmark a different name so that the difference can be easily recognized by any interested parties - customers, prospects, and analysts. The naming convention is SD-Parallel for Sales & Distribution - Parallel.

SAP is one of the premier world-wide ERP application providers, and maintains a suite of benchmark tests to demonstrate the performance of competitive systems on the various SAP products.

Disclosure Statement

SAP SD benchmark based on SAP enhancement package 4 for SAP ERP 6.0 (Unicode) application benchmark as of Oct 12th, 2009: Four Sun Blade X6270 (each 2 processors, 8 cores, 16 threads) 13,718 SAP SD Users, 2x 2.93 GHz Intel Xeon x5570, each 48 GB memory, running two-tier SAP Sales and Distribution Parallel (SD-Parallel) standard SAP SD benchmark with Oracle 10g Real Application Clusters and Solaris 10, Cert# 2009041. Two Sun Blade X6270 (each 2 processors, 8 cores, 16 threads) 7,220 SAP SD Users, 2x 2.93 GHz Intel Xeon x5570, each 48 GB memory, running two-tier SAP Sales and Distribution Parallel (SD-Parallel) standard SAP SD benchmark with Oracle 10g Real Application Clusters and Solaris 10, Cert# 2009040. Sun Blade X6270 (2 processors, 8 cores, 16 threads) 3,800 SAP SD Users, 2x 2.93 GHz Intel Xeon x5570, 48 GB memory, running two-tier SAP Sales and Distribution Parallel (SD-Parallel) standard SAP SD benchmark with Oracle 10g Real Application Clusters and Solaris 10, Cert# 2009039. Sun Fire X4270 (2 processors, 8 cores, 16 threads) 3,800 SAP SD Users, 2x 2.93 GHz Intel Xeon x5570, 48 GB memory, running two-tier SAP Sales and Distribution (SD) standard SAP SD benchmark with Oracle 10g and Solaris 10, Cert# 2009033.

SAP, R/3, reg TM of SAP AG in Germany and other countries. More info www.sap.com/benchmark

Monday Oct 12, 2009

MCAE ABAQUS faster on Sun F5100 and Sun X4270 - Single Node World Record

The Sun Storage F5100 Flash Array can substantially improve performance over internal hard disk drives as shown by the I/O intensive ABAQUS MCAE application Standard benchmark tests on a Sun Fire X4270 server.

The I/O intensive ABAQUS "Standard" benchmarks test cases were run on a single Sun Fire X4270 server. Data is presented for runs at both 8 and 16 thread counts.

The ABAQUS "Standard" module is an MCAE application based on the finite element method (FEA) of analysis. This computer based numerical method inherently involves a substantial I/O component. The purpose was to evaluate the performance of the Sun Storage F5100 Flash Array relative to high performance 15K RPM internal striped HDDs.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "S4b" test case by 14%.

  • The Sun Fire X4270 server coupled with a Sun Storage F5100 Flash Array established the world record performance on a single node for the four test cases S2A, S4B, S4D and S6.

Performance Landscape

ABAQUS "Standard" Benchmark Test S4B: Advantage of Sun Storage F5100

Results are total elapsed run times in seconds

Threads | 4 x 15K RPM 72 GB SAS HDD, striped HW RAID0 | Sun F5100 (r/w buff 4096, striped) | Sun F5100 Performance Advantage
8 | 1504 | 1318 | 14%
16 | 1811 | 1649 | 10%

ABAQUS Standard Server Benchmark Subset: Single Node Record Performance

Results are total elapsed run times in seconds

Platform Cores S2a S4b S4d S6
X4270 w/F5100 8 302 1192 779 1237
HP BL460c G6 8 324 1309 843 1322
X4270 w/F5100 4 552 1970 1181 1706
HP BL460c G6 4 561 2062 1234 1812

Results and Configuration Summary

Hardware Configuration:
    Sun Fire X4270
      2 x 2.93 GHz QC Intel Xeon X5570 processors
      Hyperthreading enabled
      24 GB memory
      4 x 72 GB 15K RPM striped (HW RAID0) SAS disks
    Sun Storage F5100 Flash Array
      20 x 24 GB flash modules
      Intel controller

Software Configuration:

    O/S: 64-bit SUSE Linux Enterprise Server 10 SP 2
    Application: ABAQUS V6.9-1 Standard Module
    Benchmark: ABAQUS Standard Benchmark Test Suite

Benchmark Description

Abaqus/Standard Benchmark Problems

These problems provide an estimate of the performance that can be expected when running Abaqus/Standard or similar commercially available MCAE (FEA) codes like ANSYS and MSC/Nastran on different computers. The jobs are representative of those typically analyzed by Abaqus/Standard and other MCAE applications. These analyses include linear statics, nonlinear statics, and natural frequency extraction.

Please go here for a more complete description of the tests.

Key Points and Best Practices

  • The memory requirements for the test cases in the ABAQUS Standard benchmark test suite are rather substantial, with some of the test cases requiring slightly over 20GB of memory. There are two memory limits: a minimum, below which the more time-consuming out-of-core "memory" processing is used, and a maximum memory limit that minimizes I/O operations. These memory limits are given in the ABAQUS output and can be established before making a full execution in a preliminary diagnostic-mode run.
  • Based on the maximum physical memory on a platform, the user can stipulate the maximum portion of this memory that can be allocated to the ABAQUS job. This is done in the "abaqus_v6.env" file that resides either in the subdirectory from which the job was launched or in the abaqus "site" subdirectory under the home installation directory.
  • Sometimes when running multiple cores on a single node, it is preferable from a performance standpoint to run in "smp" shared memory mode. This is specified using the "THREADS" option on the "mpi_mode" line in the abaqus_v6.env file, as opposed to the "MPI" option on this line. The test case considered here illustrates this point.
  • The test cases for the ABAQUS Standard module all have a substantial I/O component, where 15% to 25% of the total run times are associated with I/O activity (primarily scratch files); a rough bound on the whole-job gain from faster scratch storage is sketched below. Performance will be enhanced by using the fastest available drives and striping together more than one of them, or by using a high-performance disk storage system with high-performance interconnects. On Linux operating systems, advantage can be taken of excess memory that can be used to cache and accelerate I/O.
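
As a rough, illustrative bound only (not a measurement): if a fraction f of the run is scratch-file I/O and faster storage speeds that I/O up by a factor s, Amdahl's law caps the whole-job gain. Taking f = 20% from the 15%-25% range above and s = 2.2x from the flash-versus-15K-SAS comparisons reported elsewhere in this document gives roughly the 10%-14% job-level gains observed:

    #include <cstdio>

    int main() {
        const double f = 0.20;  // assumed I/O fraction of total run time
        const double s = 2.2;   // assumed I/O speedup of flash over 15K SAS
        const double job_speedup = 1.0 / ((1.0 - f) + f / s);
        std::printf("whole-job speedup = %.2f\n", job_speedup);  // ~1.12
        return 0;
    }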

See Also

Disclosure Statement

The following are trademarks or registered trademarks of Abaqus, Inc. or its subsidiaries in the United States and/or o ther countries: Abaqus, Abaqus/Standard, Abaqus/Explicit. All information on the ABAQUS website is Copyrighted 2004-2009 by Dassault Systemes. Results from http://www.simulia.com/support/v69/v69_performance.php as of October 12, 2009.

MCAE ANSYS faster on Sun F5100 and Sun X4270

Significance of Results

The Sun Storage F5100 Flash Array can greatly improve performance over internal hard disk drives as shown by the I/O intensive ANSYS MCAE application BMD benchmark tests on a Sun Fire X4270 server.

Select ANSYS 12 BMD benchmarks were run on a single Sun Fire X4270 server. These I/O intensive test cases were run to compare the performance of conventional high performance disk to Sun FlashFire technology.

The ANSYS 12.0 module is an MCAE application based on the finite element analysis (FEA) method. This computer-based numerical method inherently involves a substantial I/O component. The purpose was to evaluate the performance of the Sun Storage F5100 Flash Array relative to high-performance 15K RPM internal striped HDDs.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "BMD-4" test case by 67% in the 8-core/8-thread server configuration.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "BMD-7" test case by 18% in the 8-core/16-thread server configuration.

Performance Landscape

ANSYS 12 "BMD" Test Suite on Single X4270 (24GB mem.) - SMP Mode

Results are total elapsed run times in seconds

Test Case   SMP   4x15K RPM 72 GB SAS HDD, striped HW RAID0   Sun F5100, r/w buff 4096, striped   Sun F5100 Performance Advantage
bmd-4         8   523                                          314                                 67%
bmd-7        16   357                                          303                                 18%

Results and Configuration Summary

Hardware Configuration:
    Sun Fire X4270
      2 x 2.93 GHz QC Intel Xeon X5570 processors
      Hyperthreading enabled
      24 GB memory
      4 x 72 GB 15K RPM striped (HW RAID0) SAS disks
    Sun Storage F5100 Flash Array
      20 x 24 GB flash modules
      Intel controller

Software Configuration:

    O/S: 64-bit SUSE Linux Enterprise Server 10 SP 2
    Application: ANSYS Multiphysics 12.0
    Benchmark: ANSYS 12 "BMD" Benchmark Test Suite

Benchmark Description

ANSYS is a general purpose engineering analysis MCAE application that is based on the Finite Element Method. It performs both structural (stress) analysis and thermal analysis. These analyses may be either static or transient dynamic and can be linear or nonlinear as far as material behavior or deformations are concerned. Ansys provides a number of benchmark tests which exercise the capabilities of the software.

Please go here for a more complete description of the tests.

Key Points and Best Practices

Performance Considerations

The performance of ANSYS (an I/O-intensive MCAE application) can be increased by reducing its I/O demands, either by increasing server memory or by using SSDs to increase I/O bandwidth and reduce latency. The most I/O-intensive case in the ANSYS distributed "BMD" test suite is BMD-4, particularly at the (maximum) 8-core level for a single node.


  • ANSYS now takes full advantage of inexpensive RAID0 disk arrays and can sustain high I/O rates on them.

  • Large memory can cache file accesses but often the size of ANSYS files grows much larger than the available physical memory so that system file caching is not able to hide the I/O cost.
  • For fast ANSYS runs the recommended configuration is a RAID 0 setup using 4 or more disks and a fast RAID controller. These I/O configurations are inexpensive to assemble and can achieve I/O rates in excess of 200 MB/sec (a rough way to check such rates is sketched after this list).
  • SSD drives have much lower seek times, use less power, and tend to be about 2X faster than the fastest rotating disks for sustained throughput. The observed speed of a RAID 0 configuration of SSD drives for ANSYS simulations has been nearly as fast as I/O that is cached by large-memory systems. SSD drives may therefore be the most affordable way to extend the capacity of a system to jobs that are too large to run in-core without incurring the performance penalty usually associated with I/O demands.
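
As a rough way to check whether a scratch area can sustain rates in the range quoted above, a small probe such as the following writes a large file in big blocks and reports the throughput. This is only a sketch: the mount points are hypothetical, and real ANSYS scratch traffic mixes reads, writes and seeks, so treat the number as an upper-bound sanity check rather than a prediction.

    import os
    import time

    def sequential_write_mb_per_s(path, total_mb=1024, block_kb=4096):
        """Write total_mb of data in block_kb chunks to a file under path
        and return the sustained throughput in MB/s."""
        block = os.urandom(block_kb * 1024)
        fname = os.path.join(path, "io_probe.tmp")
        start = time.time()
        with open(fname, "wb") as f:
            for _ in range(total_mb * 1024 // block_kb):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())   # push data to the device, not just the page cache
        elapsed = time.time() - start
        os.remove(fname)
        return total_mb / elapsed

    # Hypothetical mount points for the striped-HDD and F5100-backed scratch areas.
    for scratch in ("/scratch_hdd", "/scratch_f5100"):
        if os.path.isdir(scratch):
            print(scratch, round(sequential_write_mb_per_s(scratch)), "MB/s")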

More About The ANSYS BMD "Distributed" Benchmarks

ANSYS is a general purpose engineering analysis MCAE application that is based on the Finite Element Method. It performs both structural (stress) analysis and thermal analysis. These analyses may be either static or transient dynamic and can be linear or nonlinear as far as material behavior or deformations are concerned.

In the most recent release of the ANSYS benchmarks there are now two test suites: the SMP "BM" suite, designed to run on a single node with multiple processors, and the DMP "BMD" suite, intended to run on multi-node clusters but which can also run on a single node in SMP mode, as in this study.

  • The test cases from both ANSYS test suites all have a substantial I/O component, with 15% to 20% of the total run time spent on I/O activity (primarily scratch files). Performance is enhanced by using the fastest available drives and striping more than one of them together, or by using a high-performance disk storage system with high-performance interconnects. When running with the SX64 (Solaris x64) build, a ZFS file system may be worth employing.
  • The ANSYS test cases do not scale very well (the BMD suite scales better than the BM suite), at best up to 8 cores.
  • The memory requirements for the test cases in the ANSYS BMD suite are greater than for the standard benchmark test suite; the standard suite's requirements are modest, at less than 3 GB.

See Also

MCAE, SSD, HPC, ANSYS, Linux, SuSE, Performance, X64, Intel

Disclosure Statement

The following are trademarks or registered trademarks of ANSYS, Inc.: ANSYS Multiphysics™. All information on the ANSYS website is copyrighted by ANSYS, Inc. Results from http://www.ansys.com/services/ss-intel-bench120.htm as of October 12, 2009.

MCAE MSC/NASTRAN faster on Sun F5100 and Sun Fire X4270

Significance of Results

The Sun Storage F5100 Flash Array can double performance over internal hard disk drives as shown by the I/O intensive MSC/Nastran MCAE application MDR3 benchmark tests on a Sun Fire X4270 server.

The MD Nastran MDR3 benchmarks were run on a single Sun Fire X4270 server. The I/O intensive test cases were run at different core levels from one up to the maximum of 8 available cores in SMP mode.

The MSC/Nastran MD 2008 R3 module is an MCAE application based on the finite element analysis (FEA) method. This computer-based numerical method inherently involves a substantial I/O component. The purpose was to evaluate the performance of the Sun Storage F5100 Flash Array relative to high-performance 15K RPM internal striped HDDs.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "xx0cmd2" test case by 107% in the 8-core server configuration.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "xl0tdf1"test case by 85% in the 8-core server configuration.

The MD Nastran MDR3 test suite was designed to include some very I/O intensive test cases, albeit some are not very scalable. These cases are called "xx0wmd0" and "xx0xst0". Both were run, and results are presented using a single-core server configuration.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "xx0xst0" test case by 33% in the single-core server configuration.

  • The Sun Storage F5100 Flash Array outperformed the high performance 15K RPM SAS drives on the "xx0wmd0" test case by 20% in the single-core server configuration.

Performance Landscape

MD Nastran MDR3 Benchmark Tests

Results in seconds

Test Case   DMP   4x15K RPM 72 GB SAS HDD, striped HW RAID0   Sun F5100, r/w buff 4096, striped   Sun F5100 Performance Advantage
xx0cmd2       8     959                                          463                               107%
xl0tdf1       8    1104                                          596                                85%
xx0xst0       1    1307                                          980                                33%
xx0wmd0       1   20250                                        16806                                20%

Results and Configuration Summary

Hardware Configuration:
    Sun Fire X4270
      2 x 2.93 GHz QC Intel Xeon X5570 processors
      24 GB memory
      4 x 72 GB 15K RPM striped (HW RAID0) SAS disks
    Sun Storage F5100 Flash Array
      20 x 24 GB flash modules
      Intel controller

Software Configuration:

    O/S: 64-bit SUSE Linux Enterprise Server 10 SP 2
    Application: MSC/NASTRAN MD 2008 R3
    Benchmark: MDR3 Benchmark Test Suite
    HP MPI: 02.03.00.00 [7585] Linux x86-64

Benchmark Description

The benchmark tests are representative of typical MSC/Nastran applications including both SMP and DMP runs involving linear statics, nonlinear statics, and natural frequency extraction.

The MD (Multi Discipline) Nastran 2008 application performs both structural (stress) analysis and thermal analysis. These analyses may be either static or transient dynamic and can be linear or nonlinear as far as material behavior and/or deformations are concerned. The new release includes the MARC module for general purpose nonlinear analyses and the Dytran module that employs an explicit solver to analyze crash and high velocity impact conditions.

Please go here for a more complete description of the tests.

Key Points and Best Practices

  • Based on the maximum physical memory on a platform, the user can stipulate the maximum portion of this memory that can be allocated to the Nastran job. This is done on the command line with the mem= option (a sizing sketch follows this list). On Linux-based systems, where the platform has a large amount of memory and the model does not have large scratch I/O requirements, this memory can be allocated to a tmpfs scratch-space file system. On Solaris x64 systems, advantage can be taken of ZFS for higher I/O performance.

  • The MD Nastran MDR3 test cases do not scale very well; a few do not scale at all, and the rest scale up to 8 cores at best.

  • The test cases for the MSC/Nastran module all have a substantial I/O component, with 15% to 25% of the total run time spent on I/O activity (primarily scratch files). The required scratch file size ranges from less than 1 GB up to about 140 GB. Performance is enhanced by using the fastest available drives and striping more than one of them together, or by using a high-performance disk storage system, and can be enhanced further, as indicated here, by implementing a Lustre-based I/O system. High-performance interconnects such as InfiniBand, for inter-node cluster message passing as well as I/O transfer from the storage system, can also enhance performance substantially.
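
As a small illustration of sizing the mem= value against physical memory, the sketch below queries the machine's RAM and prints a candidate command line. The 80% fraction, the "<n>gb" value form, and the input deck name are assumptions for illustration, not values taken from this study; check the MD Nastran documentation for the exact keyword syntax accepted by the installed release.

    import os

    def suggested_mem_option(fraction=0.8):
        """Return a mem= setting equal to a fraction of physical RAM.
        The fraction and the "<n>gb" form are assumptions to be verified
        against the Nastran documentation for the installed release."""
        total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        return "mem=%dgb" % int(total_bytes / 2**30 * fraction)

    # On the 24 GB Sun Fire X4270 used here this prints roughly "mem=19gb".
    # The input deck name is illustrative only.
    print("nastran xx0cmd2.dat", suggested_mem_option())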

See Also

Disclosure Statement

MSC.Software is a registered trademark of MSC. All information on the MSC.Software website is copyrighted. MD Nastran MDR3 results from http://www.mscsoftware.com and this report as of October 12, 2009.

Friday Oct 09, 2009

X6275 Cluster Demonstrates Performance and Scalability on WRF 2.5km CONUS Dataset

Significance of Results

Results are presented for the Weather Research and Forecasting (WRF) code running on twelve Sun Blade X6275 server modules, housed in the Sun Blade 6048 chassis, using the 2.5 km CONUS benchmark dataset.

  • The Sun Blade X6275 cluster was able to achieve 373 GFLOP/s on the CONUS 2.5-KM Dataset.
  • The results demonstrate a 91% speedup efficiency, or an 11x speedup, from 1 to 12 blades.
  • The current results were run with turbo mode on.

Performance Landscape

Performance is expressed in terms of "simulation speedup", which is the ratio of the simulated time per iteration (the model time step) to the average wall clock time required to compute it. A larger number implies better performance.

The current results were run with turbo mode on.
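
To make the metric concrete, the short sketch below recomputes the headline 12-blade, turbo-on entry of the table that follows directly from this definition: with a 15-second model time step and roughly 412 GFLOP of work per step (both figures are given in the benchmark description further down), a simulation speedup of 13.58 corresponds to about 1.1 s of wall clock per step, or about 373 GFLOP/sec.

    # Recompute the headline WRF numbers from the "simulation speedup" definition.
    sim_step_s = 15.0           # model time advanced per iteration (from the job deck)
    flop_per_step = 412e9       # ~412 GFLOP of computation per time step (UCAR figure)

    speedup = 13.58             # 12-blade, turbo-on result from the table below
    wallclock_per_step = sim_step_s / speedup            # ~1.10 s of real time per step
    gflops = flop_per_step / wallclock_per_step / 1e9    # ~373 GFLOP/sec

    print("%.2f s/step, %.0f GFLOP/sec" % (wallclock_per_step, gflops))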

WRF 3.0.1.1: Weather Research and Forecasting CONUS 2.5-KM Dataset

                             Performance            Computation Rate      Speedup/Efficiency        Turbo On
                             (Simulation Speedup)   (GFLOP/sec)           (vs. 1 blade)             Relative
Blades  Nodes  Procs  Cores  Turbo On   Turbo Off   Turbo On   Turbo Off  Turbo On     Turbo Off    Perf
  12     24     48     192    13.58      12.93       373.0      355.1     11.0 / 91%   10.4 / 87%   +6%
   8     16     32     128     9.27      -           254.6      -          7.5 / 93%   -            -
   6     12     24      96     7.03       6.60       193.1      181.3      5.7 / 94%    5.3 / 89%   +7%
   4      8     16      64     4.74      -           130.2      -          3.8 / 96%   -            -
   2      4      8      32     2.44      -            67.0      -          2.0 / 98%   -            -
   1      2      4      16     1.24       1.24        34.1       34.1      1.0 / 100%   1.0 / 100%  +0%

Results and Configuration Summary

Hardware Configuration:

    Sun Blade 6048 Modular System
      12 x Sun Blade X6275 Server Modules, each with
        4 x 2.93 GHz Intel QC X5570 processors
        24 GB (6 x 4GB)
        QDR InfiniBand
        HT disabled in BIOS
        Turbo mode enabled in BIOS

Software Configuration:

    OS: SUSE Linux Enterprise Server 10 SP 2
    Compiler: PGI 7.2-5
    MPI Library: Scali MPI v5.6.4
    Benchmark: WRF 3.0.1.1
    Support Library: netCDF 3.6.3

Benchmark Description

The Weather Research and Forecasting (WRF) Model is a next-generation mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs. WRF is designed to be a flexible, state-of-the-art atmospheric simulation system that is portable and efficient on available parallel computing platforms. It features multiple dynamical cores, a 3-dimensional variational (3DVAR) data assimilation system, and a software architecture allowing for computational parallelism and system extensibility.

Dataset used:

    Single domain, large size 2.5KM Continental US (CONUS-2.5K)

    • 1501x1201x35 cell volume
    • 6hr, 2.5km resolution dataset from June 4, 2005
    • Benchmark is the final 3hr simulation for hrs 3-6 starting from a provided restart file; the benchmark may also be performed (but seldom reported) for the full 6hrs starting from a cold start
    • Iterations output at every 15 sec of simulation time, with the computation cost of each time step ~412 GFLOP

Key Points and Best Practices

  • Processes were bound to processors in round-robin fashion.
  • Model simulation time is 15 seconds per iteration as defined in input job deck. An achieved speedup of 2.67X means that each model iteration of 15s of simulation time was executed in 5.6s of real wallclock time (on average).
  • Computational requirements are 412 GFLOP per simulation time step as (measured empirically and) documented on the UCAR web site for this data model.
  • Model was run as single MPI job.
  • Benchmark was built and run as a pure-MPI variant. With larger process counts building and running WRF as a hybrid OpenMP/MPI variant may be more efficient.
  • Input and output (netCDF format) datasets can be very large for some WRF data models and run times will generally benefit by using a scalable filesystem. Performance with very large datasets (>5GB) can benefit by enabling WRF quilting of I/O across designated processors/servers. The master thread (or rank-0) performs most of the I/O (unless quilting specifies otherwise), with all processes potentially generating some I/O.

See Also

Disclosure Statement

WRF, CONUS-2.5K, see http://www.mmm.ucar.edu/wrf/WG2/bench/, results as of 9/21/2009.