Wednesday Dec 08, 2010

Sun Blade X6275 M2 Cluster with Sun Storage 7410 Performance Running Seismic Processing Reverse Time Migration

This Oil & Gas benchmark highlights both the computational performance improvements of the Sun Blade X6275 M2 server module over the previous genernation server and the linear scalability achievable for the total application throughput using a Sun Storage 7410 system to deliver almost 2 GB/sec I/O effective write performance.

Oracle's Sun Storage 7410 system attached via 10 Gigabit Ethernet to a cluster of Oracle's Sun Blade X6275 M2 server modules was used to demonstrate the performance of a 3D VTI Reverse Time Migration application, a heavily used geophysical imaging and modeling application for Oil & Gas Exploration. The total application throughput scaling and computational kernel performance improvements are presented for imaging two production sized grids using 800 input samples.

  • The Sun Blade X6275 M2 server module showed up to a 40% performance improvement over the previous generation server module with super-linear scalability to 16 nodes for the 9-Point Stencil used in this Reverse Time Migration computational kernel.

  • The balanced combination of Oracle's Sun Storage 7410 system over 10 GbE to the Sun Blade X6275 M2 server module cluster showed linear scalability for the total application throughput, including the I/O and MPI communication, to produce a final 3-D seismic depth imaged cube for interpretation.

  • The final image write time from the Sun Blade X6275 M2 server module nodes to Oracle's Sun Storage 7410 system achieved 10GbE line speed of 1.25 GBytes/second or better write performance. The effects of I/O buffer caching on the Sun Blade X6275 M2 server module nodes and 34 GByte write optimized cache on the Sun Storage 7410 system gave up to 1.8 GBytes/second effective write performance.

Performance Landscape

Server Generational Performance Improvements

Performance improvements for the Reverse Time Migration computational kernel using a Sun Blade X6275 M2 cluster are compared to the previous generation Sun Blade X6275 cluster. Hyper-threading was enabled for both configurations allowing 24 OpenMP threads for the Sun Blade X6275 M2 server module nodes and 16 for the Sun Blade X6275 server module nodes.

Sun Blade X6275 M2 Performance Improvements
Number Nodes Grid Size - 1243 x 1151 x 1231 Grid Size - 2486 x 1151 x1231
X6275 Kernel Time (sec) X6275 M2 Kernel Time (sec) X6275 M2 Speedup X6275 Kernel Time (sec) X6275 M2 Kernel Time (sec) X6275 M2 Speedup
16 306 242 1.3 728 576 1.3
14 355 271 1.3 814 679 1.2
12 435 346 1.3 945 797 1.2
10 541 390 1.4 1156 890 1.3
8 726 555 1.3 1511 1193 1.3

Application Scaling

Performance and scaling results of the total application, including I/O, for the reverse time migration demonstration application are presented. Results were obtained using a Sun Blade X6275 M2 server cluster with a Sun Storage 7410 system for the file server. The servers were running with hyperthreading enabled, allowing for 24 OpenMP threads per server node.

Application Scaling Across Multiple Nodes
Number Nodes Grid Size - 1243 x 1151 x 1231 Grid Size - 2486 x 1151 x1231
Total Time (sec) Kernel Time (sec) Total Speedup Kernel Speedup Total Time (sec) Kernel Time (sec) Total Speedup Kernel Speedup
16 501 242 2.1\* 2.3\* 1060 576 2.0 2.1\*
14 583 271 1.8 2.0 1219 679 1.7 1.8
12 681 346 1.6 1.6 1420 797 1.5 1.5
10 807 390 1.3 1.4 1688 890 1.2 1.3
8 1058 555 1.0 1.0 2085 1193 1.0 1.0

\* Super-linear scaling due to the compute kernel fitting better into available cache for larger node counts

Image File Effective Write Performance

The performance for writing the final 3D image from a Sun Blade X6275 M2 server cluster over 10 Gigabit Ethernet to a Sun Storage 7410 system are presented. Each server allocated one core per node for MPI I/O thus allowing 22 OpenMP compute threads per node with hyperthreading enabled. Captured performance analytics from the Sun Storage 7410 system indicate effective use of its 34 Gigabyte write optimized cache.

Image File Effective Write Performance
Number Nodes Grid Size - 1243 x 1151 x 1231 Grid Size - 2486 x 1151 x1231
Write Time (sec) Write Performance (GB/sec) Write Time (sec) Write Performance (GB/sec)
16 4.8 1.5 10.2 1.4
14 5.0 1.4 10.2 1.4
12 4.0 1.8 11.3 1.3
10 4.3 1.6 9.1 1.6
8 4.6 1.5 9.7 1.5

Note: Performance results better than 1.3GB/sec related to I/O buffer caching on server nodes.

Configuration Summary

Hardware Configuration:

8 x 2 node Sun Blade X6275 M2 server nodes, each node with
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory (12 x 4 GB at 1333 MHz)
1 x QDR InfiniBand Host Channel Adapter

Sun Datacenter InfiniBand Switch IB-36
Sun Network 10 GbE Switch 72p

Sun Storage 7410 system connected via 10 Gigabit Ethernet
4 x 17 GB STEC ZeusIOPs SSD mirrored - 34 GB
40 x 750 GB 7500 RPM Seagate SATA disks mirrored - 14.4 TB
No L2ARC Readzilla Cache

Software Configuration:

Oracle Enterprise Linux Server release 5.5
Oracle Message Passing Toolkit 8.2.1c (for MPI)
Oracle Solaris Studio 12.2 C++, Fortran, OpenMP

Benchmark Description

This Vertical Transverse Isotropy (VTI) Anisotropic Reverse Time Depth Migration (RTM) application measures the total time it takes to image 800 samples of various production size grids and write the final image to disk for the next work flow step involving 3-D seismic volume interpretation. In doing so, it reports the compute, interprocessor communication, and I/O performance of the individual functions that comprise the total solution. Unlike most references for the Reverse Time Migration, that focus solely on the performance of the 3D stencil compute kernel, this demonstration code additionally reports the total throughput involved in processing large data sets with a full 3D Anisotropic RTM application. It provides valuable insight into configuration and sizing for specific seismic processing requirements. The performance effects of new processors, interconnects, I/O subsystems, and software technologies can be evaluated while solving a real Exploration business problem.

This benchmark study uses the "in-core" implementation of this demonstration code where each node reads in only the trace, velocity, and conditioning data to be processed by that node plus a 4 element array pad (based on spatial order 8) shared with it's neighbors to the left and right during the initialization phase. It maintains previous, current, and next wavefield state information for each of the source, receiver, and anisotropic wavefields in memory. The second two grid dimensions used in this benchmark are specifically chosen to be prime numbers to exaggerate the effects of data alignment. Algorithm adaptions for processing higher orders in space and alternative "out-of-core" solutions using SSDs for wave state checkpointing are implemented in this demonstration application to better understand the effects of problem size scaling. Care is taken to handle absorption boundary conditioning and a variety of imaging conditions, appropriately.

RTM Application Structure:

Read Processing Parameter File, Determine Domain Decomposition, and Initialize Data Structures, and Allocate Memory.

Read Velocity, Epsilon, and Delta Data Based on Domain Decomposition and create source, receiver, & anisotropic previous, current, and next wave states.

First Loop over Time Steps

Compute 3D Stencil for Source Wavefield (a,s) - 8th order in space, 2nd order in time
Propagate over Time to Create s(t,z,y,x) & a(t,z,y,x)
Inject Estimated Source Wavelet
Apply Absorption Boundary Conditioning (a)
Update Wavefield States and Pointers
Write Snapshot of Wavefield (out-of-core) or Push Wavefield onto Stack (in-core)
Communicate Boundary Information

Second Loop over Time Steps
Compute 3D Stencil for Receiver Wavefield (a,r) - 8th order in space, 2nd order in time
Propagate over Time to Create r(t,z,y,x) & a(t,z,y,x)
Read Receiver Trace and Inject Receiver Wavelet
Apply Absorption Boundary Conditioning (a)
Update Wavefield States and Pointers
Communicate Boundary Information
Read in Source Wavefield Snapshot (out-of-core) or Pop Off of Stack (in-core)
Cross-correlate Source and Receiver Wavefields
Update image using image conditioning parameters

Write 3D Depth Image i(z,x,y) = Sum over time steps s(t,z,x,y) \* r(t,z,x,y) or other imaging conditions.

Key Points and Best Practices

This demonstration application represents a full Reverse Time Migration solution. Many references to the RTM application tend to focus on the compute kernel and ignore the complexity that the input, communication, and output bring to the task.

Image File MPI Write Performance Tuning

Changing the Image File Write from MPI non-blocking to MPI blocking and setting Oracle Message Passing Toolkit MPI environment variables revealed an 18x improvement in write performance to the Sun Storage 7410 system going from:

    86.8 to 4.8 seconds for the 1243 x 1151 x 1231 grid size
    183.1 to 10.2 seconds for the 2486 x 1151 x 1231 grid size

The Swat Sun Storage 7410 analytics data capture indicated an initial write performance of about 100 MB/sec with the MPI non-blocking implementation. After modifying to MPI blocking writes, Swat showed between 1.3 and 1.8 GB/sec with up to 13000 write ops/sec to write the final output image. The Swat results are consistent with the actual measured performance and provide valuable insight into the Reverse Time Migration application I/O performance.

The reason for this vast improvement has to do with whether the MPI file mode is sequential or not (MPI_MODE_SEQUENTIAL, O_SYNC, O_DSYNC). The MPI non-blocking routines, MPI_File_iwrite_at and MPI_wait, typically used for overlapping I/O and computation, do not support sequential file access mode. Therefore, the application could not take full performance advantages of the Sun Storage 7410 system write optimized cache. In contrast, the MPI blocking routine, MPI_File_write_at, defaults to MPI sequential mode and the performance advantages of the write optimized cache are realized. Since writing the final image is at the end of RTM execution, there is no need to overlap the I/O with computation.

Additional MPI parameters used:

    setenv SUNW_MP_PROCBIND true
    setenv MPI_SPIN 1
    setenv MPI_PROC_BIND 1

Adjusting the Level of Multithreading for Performance

The level of multithreading (8, 10, 12, 22, or 24) for various components of the RTM should be adjustable based on the type of computation taking place. Best to use OpenMP num_threads clause to adjust the level of multi-threading for each particular work task. Use numactl to specify how the threads are allocated to cores in accordance to the OpenMP parallelism level.

See Also

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 12/07/2010.

Monday Mar 29, 2010

Sun Blade X6275/QDR IB/ Reverse Time Migration

Significance of Results

Oracle's Sun Blade X6275 cluster with a Lustre file system was used to demonstrate the performance potential of the system when running reverse time migration applications complete with I/O processing.

  • Reduced the Total Application run time for the Reverse Time Migration when processing 800 input traces for two production sized surveys from a QDR Infiniband Lustre file system on 24 X6275 nodes, by implementing algorithm I/O optimizations and taking advantage of MPI I/O features in HPC ClusterTools:

    • 1243x1151x1231 - Wall clock time reduced from 11.5 to 6.3 minutes (1.8x improvement)
    • 2486x1151x1231 - Wall clock time reduced from 21.5 to 13.5 minutes (1.6x improvement)
  • Reduced the I/O Intensive Trace-Input time for the Reverse Time Migration when reading 800 input traces for two production sized surveys from a QDR Infiniband Lustre file system on 24 X6275 nodes running HPC ClusterTools, by modifying the algorithm to minimize the per node data requirement and avoiding unneeded synchronization:

    • 2486x1151x1231 : Time reduced from 121.5 to 3.2 seconds (38.0x improvement)
    • 1243x1151x1231 : Time reduced from 71.5 to 2.2 seconds (32.5x improvement)
  • Reduced the I/O Intensive Grid Initialization time for the Reverse Time Migration Grid when reading the Velocity, Epsilon, and Delta slices for two production sized surveys from a QDR Infiniband Lustre file system on 24 X6275 nodes running HPC ClusterTools, by modifying the algorithm to minimize the per node grid data requirement:

    • 2486x1151x1231 : Time reduced from 15.6 to 4.9 seconds (3.2x improvement)
    • 1243x1151x1231 : Time reduced from 8.9 to 1.2 seconds (7.4x improvement)

Performance Landscape

In the tables below, the hyperthreading feature is enabled and the systems are fully utilized.

This first table presents the total application performance in minutes. The overall performance improved significantly because of the improved I/O performance and other benefits.


Total Application Performance Comparison
Reverse Time Migration - SMP Threads and MPI Mode
Nodes 1243 x 1151 x 1231
800 Samples
2486 x 1151 x 1231
800 Samples
Original
Time (mins)
MPI I/O
Time (mins)
Improvement Original
Time (mins)
MPI I/O
Time (mins)
Improvement
24 11.5 6.3 1.8x 21.5 13.5 1.6x
20 12.0 8.0 1.5x 21.9 15.4 1.4x
16 13.8 9.7 1.4x 26.2 18.0 1.5x
12 21.7 13.2 1.6x 29.5 23.1 1.3x

This next table presents the initialization I/O time. The results are presented in seconds and shows the advantage of the improved MPI I/O strategy.


Initialization Time Performance Comparison
Reverse Time Migration - SMP Threads and MPI Mode
Nodes 1243 x 1151 x 1231
800 Samples
2486 x 1151 x 1231
800 Samples
Original
Time (sec)
MPI I/O
Time (sec)
Improvement Original
Time (sec)
MPI I/O
Time (sec)
Improvement
24 8.9 1.2 7.4x 15.6 4.9 3.2x
20 9.3 1.5 6.2x 16.9 3.9 4.3x
16 9.7 2.5 3.9x 17.4 11.3 1.5x
12 9.8 3.3 3.0x 22.5 14.9 1.5x

This last table presents the trace I/O time. The results are presented in seconds and shows the significant advantage of the improved MPI I/O strategy.


Trace I/O Time Performance Comparison
Reverse Time Migration - SMP Threads and MPI Mode
Nodes 1243 x 1151 x 1231
800 Samples
2486 x 1151 x 1231
800 Samples
Original
Time (sec)
MPI I/O
Time (sec)
Improvement Original
Time (sec)
MPI I/O
Time (sec)
Improvement
24 71.5 2.2 32.5x 121.5 3.2 38.0x
20 67.7 2.4 28.2x 118.3 3.9 30.3x
16 64.2 2.7 23.7x 110.7 4.6 24.1x
12 69.9 4.2 16.6x 296.3 14.6 20.3x

Results and Configuration Summary

Hardware Configuration:

Oracle's Sun Blade 6048 Modular Modular System with
12 x Oracle's Sun Blade x6275 Server Modules, each with
4 x 2.93 GHz Intel Xeon QC X5570 processors
12 x 4 GB memory at 1333 MHz
2 x 24 GB Internal Flash
QDR InfiniBand Lustre 1.8.0.1 File System

Software Configuration:

OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
MPI: Oracle Message Passing Toolkit 8.2.1 for I/O optimization to Lustre file system
MPI: Scali MPI Connect 5.6.6-59413 for original Lustre file system runs
Compiler: Oracle Solaris Studio 12 C++, Fortran, OpenMP

Benchmark Description

The primary objective of this Reverse Time Migration Benchmark is to present MPI I/O tuning techniques, exploit the power of the Sun's HPC ClusterTools MPI I/O implementation, and demonstrate the world-class performance of Sun's Lustre File System to Exploration Geophysicists throughout the world. A Sun Blade 6048 Modular System with 12 Sun Blade X6275 server modules were clustered together with a QDR Infiniband Lustre File System to show performance improvements in the Reverse Time Migration Throughput by using the Sun HPC ClusterTools MPI-IO features to implement specific algorithm I/O optimizations.

This Reverse Time Migration Benchmark measures the total time it takes to image 800 samples of various production size grids and write the final image to disk. In this new I/O optimized version, each node reads in only the data to be processed by that node plus a 4 element inline pad shared with it's neighbors to the left and right. This latest version, essentially loads the boundary condition data during the initialization phase. The previous version handled boundary conditions by having each node read in all the trace, velocity, and conditioning data. Or, alternatively, the master node would read in all the data and distribute it in it's entirety to every node in the cluster. With the previous version, each node had full memory copies of all input data sets even when it only processed a subset of that data. The new version only holds the inline dimensions and pads to be processed by a particular node in memory.

Key Points and Best Practices

  • The original implementation of the trace I/O involved the master node reading in nx \* ny floats and communicating this trace data to all the other nodes in a synchronous manner. Each node only used a subset of the trace data for each of the 800 time steps. The optimized I/O version has each node asynchronously read in only the (nx/num_procs + 8) \* ny floats that it will be processing. The additional 8 inline values for the optimized I/O version are the 4 element pads of a node's left and right neighbors to handle initial boundary conditions. The MPI_Barrier needed for the original implementation, for synchronization, and the additional I/O for each node to load all the data values, truly impacts performance. For the I/O optimized version, each node reads only the data values it needs and does not require the same MPI_Barrier synchronization as the original version of the Reverse Time Migration Benchmark. By performing such I/O optimizations, a significant improvement is seen in the Trace I/O.

  • For the best MPI performance, allocate the X6275 nodes in blade by blade order and run with HyperThreading enabled. The "Binary Conditioning" part of the Reverse Time Migration specifically likes hyperthreading.

  • To get the best I/O performance, use a maximum of 70% of each nodes available memory for the Reverse Time Migration application. Execution time may vary I/O results can occur if the nodes have different memory size configurations.

See Also

About

BestPerf is the source of Oracle performance expertise. In this blog, Oracle's Strategic Applications Engineering group explores Oracle's performance results and shares best practices learned from working on Enterprise-wide Applications.

Index Pages
Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today