Sun Blade X6275/QDR IB/ Reverse Time Migration

Significance of Results

Oracle's Sun Blade X6275 cluster with a Lustre file system was used to demonstrate the performance potential of the system when running reverse time migration applications complete with I/O processing.

  • Reduced the Total Application run time for the Reverse Time Migration when processing 800 input traces for two production sized surveys from a QDR Infiniband Lustre file system on 24 X6275 nodes, by implementing algorithm I/O optimizations and taking advantage of MPI I/O features in HPC ClusterTools:

    • 1243x1151x1231 - Wall clock time reduced from 11.5 to 6.3 minutes (1.8x improvement)
    • 2486x1151x1231 - Wall clock time reduced from 21.5 to 13.5 minutes (1.6x improvement)
  • Reduced the I/O Intensive Trace-Input time for the Reverse Time Migration when reading 800 input traces for two production sized surveys from a QDR Infiniband Lustre file system on 24 X6275 nodes running HPC ClusterTools, by modifying the algorithm to minimize the per node data requirement and avoiding unneeded synchronization:

    • 2486x1151x1231 : Time reduced from 121.5 to 3.2 seconds (38.0x improvement)
    • 1243x1151x1231 : Time reduced from 71.5 to 2.2 seconds (32.5x improvement)
  • Reduced the I/O Intensive Grid Initialization time for the Reverse Time Migration Grid when reading the Velocity, Epsilon, and Delta slices for two production sized surveys from a QDR Infiniband Lustre file system on 24 X6275 nodes running HPC ClusterTools, by modifying the algorithm to minimize the per node grid data requirement:

    • 2486x1151x1231 : Time reduced from 15.6 to 4.9 seconds (3.2x improvement)
    • 1243x1151x1231 : Time reduced from 8.9 to 1.2 seconds (7.4x improvement)

Performance Landscape

In the tables below, the hyperthreading feature is enabled and the systems are fully utilized.

This first table presents the total application performance in minutes. The overall performance improved significantly because of the improved I/O performance and other benefits.


Total Application Performance Comparison
Reverse Time Migration - SMP Threads and MPI Mode
Nodes 1243 x 1151 x 1231
800 Samples
2486 x 1151 x 1231
800 Samples
Original
Time (mins)
MPI I/O
Time (mins)
Improvement Original
Time (mins)
MPI I/O
Time (mins)
Improvement
24 11.5 6.3 1.8x 21.5 13.5 1.6x
20 12.0 8.0 1.5x 21.9 15.4 1.4x
16 13.8 9.7 1.4x 26.2 18.0 1.5x
12 21.7 13.2 1.6x 29.5 23.1 1.3x

This next table presents the initialization I/O time. The results are presented in seconds and shows the advantage of the improved MPI I/O strategy.


Initialization Time Performance Comparison
Reverse Time Migration - SMP Threads and MPI Mode
Nodes 1243 x 1151 x 1231
800 Samples
2486 x 1151 x 1231
800 Samples
Original
Time (sec)
MPI I/O
Time (sec)
Improvement Original
Time (sec)
MPI I/O
Time (sec)
Improvement
24 8.9 1.2 7.4x 15.6 4.9 3.2x
20 9.3 1.5 6.2x 16.9 3.9 4.3x
16 9.7 2.5 3.9x 17.4 11.3 1.5x
12 9.8 3.3 3.0x 22.5 14.9 1.5x

This last table presents the trace I/O time. The results are presented in seconds and shows the significant advantage of the improved MPI I/O strategy.


Trace I/O Time Performance Comparison
Reverse Time Migration - SMP Threads and MPI Mode
Nodes 1243 x 1151 x 1231
800 Samples
2486 x 1151 x 1231
800 Samples
Original
Time (sec)
MPI I/O
Time (sec)
Improvement Original
Time (sec)
MPI I/O
Time (sec)
Improvement
24 71.5 2.2 32.5x 121.5 3.2 38.0x
20 67.7 2.4 28.2x 118.3 3.9 30.3x
16 64.2 2.7 23.7x 110.7 4.6 24.1x
12 69.9 4.2 16.6x 296.3 14.6 20.3x

Results and Configuration Summary

Hardware Configuration:

Oracle's Sun Blade 6048 Modular Modular System with
12 x Oracle's Sun Blade x6275 Server Modules, each with
4 x 2.93 GHz Intel Xeon QC X5570 processors
12 x 4 GB memory at 1333 MHz
2 x 24 GB Internal Flash
QDR InfiniBand Lustre 1.8.0.1 File System

Software Configuration:

OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
MPI: Oracle Message Passing Toolkit 8.2.1 for I/O optimization to Lustre file system
MPI: Scali MPI Connect 5.6.6-59413 for original Lustre file system runs
Compiler: Oracle Solaris Studio 12 C++, Fortran, OpenMP

Benchmark Description

The primary objective of this Reverse Time Migration Benchmark is to present MPI I/O tuning techniques, exploit the power of the Sun's HPC ClusterTools MPI I/O implementation, and demonstrate the world-class performance of Sun's Lustre File System to Exploration Geophysicists throughout the world. A Sun Blade 6048 Modular System with 12 Sun Blade X6275 server modules were clustered together with a QDR Infiniband Lustre File System to show performance improvements in the Reverse Time Migration Throughput by using the Sun HPC ClusterTools MPI-IO features to implement specific algorithm I/O optimizations.

This Reverse Time Migration Benchmark measures the total time it takes to image 800 samples of various production size grids and write the final image to disk. In this new I/O optimized version, each node reads in only the data to be processed by that node plus a 4 element inline pad shared with it's neighbors to the left and right. This latest version, essentially loads the boundary condition data during the initialization phase. The previous version handled boundary conditions by having each node read in all the trace, velocity, and conditioning data. Or, alternatively, the master node would read in all the data and distribute it in it's entirety to every node in the cluster. With the previous version, each node had full memory copies of all input data sets even when it only processed a subset of that data. The new version only holds the inline dimensions and pads to be processed by a particular node in memory.

Key Points and Best Practices

  • The original implementation of the trace I/O involved the master node reading in nx \* ny floats and communicating this trace data to all the other nodes in a synchronous manner. Each node only used a subset of the trace data for each of the 800 time steps. The optimized I/O version has each node asynchronously read in only the (nx/num_procs + 8) \* ny floats that it will be processing. The additional 8 inline values for the optimized I/O version are the 4 element pads of a node's left and right neighbors to handle initial boundary conditions. The MPI_Barrier needed for the original implementation, for synchronization, and the additional I/O for each node to load all the data values, truly impacts performance. For the I/O optimized version, each node reads only the data values it needs and does not require the same MPI_Barrier synchronization as the original version of the Reverse Time Migration Benchmark. By performing such I/O optimizations, a significant improvement is seen in the Trace I/O.

  • For the best MPI performance, allocate the X6275 nodes in blade by blade order and run with HyperThreading enabled. The "Binary Conditioning" part of the Reverse Time Migration specifically likes hyperthreading.

  • To get the best I/O performance, use a maximum of 70% of each nodes available memory for the Reverse Time Migration application. Execution time may vary I/O results can occur if the nodes have different memory size configurations.

See Also

Comments:

Post a Comment:
Comments are closed for this entry.
About

BestPerf is the source of Oracle performance expertise. In this blog, Oracle's Strategic Applications Engineering group explores Oracle's performance results and shares best practices learned from working on Enterprise-wide Applications.

Index Pages
Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today