Saturday Oct 24, 2009

Sun C48 & Lustre fast for Seismic Reverse Time Migration using Sun X6275

Significance of Results

A Sun Blade 6048 Modular System with 12 Sun Blade X6275 server modules was clustered together over QDR InfiniBand and used a Lustre file system, also running over QDR InfiniBand, to show performance improvements over an NFS file system when reading in the Velocity, Epsilon, and Delta slices and imaging 800 samples at various grid sizes with the Reverse Time Migration.

  • The Initialization Time for populating the processing grids demonstrates significant advantages of Lustre over NFS:
    • 2486x1151x1231 : 20x improvement
    • 1243x1151x1231 : 20x improvement
    • 125x1151x1231 : 11x improvement
  • The Total Application Performance shows the Interconnect and I/O advantages of using QDR InfiniBand Lustre for the large grid sizes:
    • 2486x1151x1231 : 2x improvement - processed in less than 19 minutes
    • 1243x1151x1231 : 2x improvement - processed in less than 10 minutes

  • The Computational Kernel Scalability Efficiency for the 3 grid sizes:
    • 125x1151x1231 : 97% (1-8 nodes)
    • 1243x1151x1231 : 102% (8-24 nodes)
    • 2486x1151x1231 : 100% (12-24 nodes)

  • The Total Application Scalability Efficiency for the large grid sizes:
    • 1243x1151x1231 : 72% (8-24 nodes)
    • 2486x1151x1231 : 71% (12-24 nodes)

  • On the Intel Xeon X5570 processor with HyperThreading enabled, running 16 OpenMP threads per node gives approximately a 10% performance improvement over running 8 threads per node.

Performance Landscape

This first table presents the initialization time, comparing different numbers of processors across the different problem sizes. The results, in seconds, show the advantage the Lustre file system running over QDR InfiniBand provides over a simple NFS file system.


Initialization Time Performance Comparison
Reverse Time Migration - SMP Threads and MPI Mode

                    125 x 1151 x 1231        1243 x 1151 x 1231       2486 x 1151 x 1231
                       800 Samples              800 Samples              800 Samples
  Nodes  Procs   Lustre (sec)  NFS (sec)  Lustre (sec)  NFS (sec)   Lustre (sec)  NFS (sec)
    24     48        1.59        18.90        8.90        181.78       15.63        362.48
    20     40        1.60        18.90        8.93        181.49       16.91        358.81
    16     32        1.58        18.59        8.97        181.58       17.39        353.72
    12     24        1.54        18.61        9.35        182.31       22.50        364.25
     8     16        1.40        18.60       10.02        183.79         -             -
     4      8        1.57        18.80         -             -           -             -
     2      4        2.54        19.31         -             -           -             -
     1      2        4.54        20.34         -             -           -             -

This next table presents the total application run time, comparing different numbers of processors across the different problem sizes. It shows that for the larger problems, the Lustre file system running over QDR InfiniBand provides a large performance advantage over a simple NFS file system.


Total Application Performance Comparison
Reverse Time Migration - SMP Threads and MPI Mode

                    125 x 1151 x 1231        1243 x 1151 x 1231       2486 x 1151 x 1231
                       800 Samples              800 Samples              800 Samples
  Nodes  Procs   Lustre (sec)  NFS (sec)  Lustre (sec)  NFS (sec)   Lustre (sec)  NFS (sec)
    24     48      251.48       273.79       553.75      1125.02      1107.66      2310.25
    20     40      232.00       253.63       658.54       971.65      1143.47      2062.80
    16     32      227.91       209.66       826.37      1003.81      1309.32      2348.60
    12     24      217.77       234.61       884.27      1027.23      1579.95      3877.88
     8     16      223.38       203.14      1200.71      1362.42         -             -
     4      8      341.14       272.68         -             -           -             -
     2      4      605.62       625.25         -             -           -             -
     1      2      892.40       841.94         -             -           -             -

The following table presents the run time and speedup of just the computational kernel for different processor counts for the three problem sizes considered. The scaling results use the smallest number of nodes run for each grid size as the baseline reference point.


Computational Kernel Performance & Scalability
Reverse Time Migration - SMP Threads and MPI Mode

                    125 x 1151 x 1231        1243 x 1151 x 1231       2486 x 1151 x 1231
                       800 Samples              800 Samples              800 Samples
  Nodes  Procs    Time (sec)   Speedup     Time (sec)   Speedup     Time (sec)   Speedup
    24     48        35.38       13.7        210.82       24.5        427.40       24.0
    20     40        35.02       13.8        255.27       20.2        517.03       19.8
    16     32        41.76       11.6        317.96       16.2        646.22       15.8
    12     24        49.53        9.8        422.17       12.2        853.37       12.0*
     8     16        62.34        7.8        645.27        8.0*         -            -
     4      8       124.66        3.9          -            -           -            -
     2      4       238.80        2.0          -            -           -            -
     1      2       484.89        1.0          -            -           -            -

Speedup is relative to a single node. * Smallest node count run for that grid size; used as the scaling baseline and assigned linear speedup.

The last table presents the speedup of the total application for different processor counts for the three problem sizes. The scaling results use the smallest number of nodes run for each grid size as the baseline reference point.


Total Application Scalability Comparison
Reverse Time Migration - SMP Threads and MPI Mode

                        Lustre Speedup (relative to 1 node)
  Nodes  Procs   125 x 1151 x 1231   1243 x 1151 x 1231   2486 x 1151 x 1231
                    800 Samples          800 Samples          800 Samples
    24     48          3.6                  17.3                 17.1
    20     40          3.8                  14.6                 16.6
    16     32          4.0                  11.6                 14.5
    12     24          4.1                  10.9                 12.0*
     8     16          4.0                   8.0*                  -
     4      8          2.6                    -                    -
     2      4          1.5                    -                    -
     1      2          1.0                    -                    -

* Smallest node count run for that grid size; used as the scaling baseline and assigned linear speedup.

Note: HyperThreading is enabled, running 16 threads per node.

Results and Configuration Summary

Hardware Configuration:
    Sun Blade 6048 Modular System with
      12 x Sun Blade X6275 Server Modules, each with
        4 x 2.93 GHz Intel Xeon QC X5570 processors
        12 x 4 GB memory at 1333 MHz
        2 x 24 GB Internal Flash
    QDR InfiniBand Lustre 1.8.0.1 File System
    GBit NFS file system

Software Configuration:

    OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
    MPI: Scali MPI Connect 5.6.6-59413
    Compiler: Sun Studio 12 C++, Fortran, OpenMP

Benchmark Description

The Reverse Time Migration (RTM) is currently the most popular seismic processing algorithm because of its ability to produce quality images of complex substructures. It can accurately image steep dips that cannot be imaged correctly with traditional Kirchhoff 3D or frequency domain algorithms. The Wave Equation Migration (WEM) can image steep dips but does not produce the image quality that can be achieved by the RTM. However, the increased computational complexity of the RTM over the WEM introduces new areas for performance optimization. The current trend in seismic processing is to perform iterative migrations on wide azimuth marine data surveys using the Reverse Time Migration.

This Reverse Time Migration code reads in processing parameters that define the grid dimensions, number of threads, number of processors, imaging condition, and various other parameters. The master node calculates the memory requirements to determine if there is sufficient memory to process the migration "in-core". The domain decomposition across all the nodes is determined by dividing the first grid dimension by the number of nodes. Each node then reads in its section of the Velocity Slices, Delta Slices, and Epsilon Slices using MPI IO reads. Three wavefield state vectors, previous, current, and next, are created for each of the source and receiver wavefields. The processing steps through the input trace data, reading both the receiver and source data for each of the 800 time steps. It uses forward propagation in time for the source wavefield and backward propagation in time for the receiver wavefield, cross correlating the two. The computational kernel consists of a 13-point stencil that processes a subgrid within the memory of each node using OpenMP parallelism. Afterwards, conditioning and absorption are applied and boundary data is communicated to neighboring nodes as each time step is processed. The final image is written out using MPI IO.
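
To make the kernel structure concrete, here is a minimal sketch of one forward time step in C++ with OpenMP. This is illustrative, not the benchmark source: the 13-point stencil is assumed to be a 4th-order-in-space, 2nd-order-in-time finite difference (the centre point plus two points on each side along each axis), and the array layout, names, and the folding of velocity^2 * dt^2 into a single per-cell grid are all assumptions.

    // One forward time step on this node's subgrid (illustrative sketch).
    // prev/cur/next are the three wavefield state vectors described above.
    #include <vector>

    // Assumed 4th-order finite-difference coefficients (unit grid spacing).
    static const float c0 = -2.5f, c1 = 4.0f/3.0f, c2 = -1.0f/12.0f;

    void time_step(const std::vector<float>& prev, const std::vector<float>& cur,
                   std::vector<float>& next, const std::vector<float>& v2dt2,
                   int nx, int ny, int nz)
    {
        const int sy = nz, sx = ny * nz;      // strides for an [x][y][z] layout
        #pragma omp parallel for              // OpenMP parallelism within the node
        for (int x = 2; x < nx - 2; ++x)
            for (int y = 2; y < ny - 2; ++y)
                for (int z = 2; z < nz - 2; ++z) {
                    const int i = x*sx + y*sy + z;
                    // 13-point stencil: centre + 2 points per side along x, y, z
                    const float lap = 3.0f*c0*cur[i]
                        + c1*(cur[i+1]    + cur[i-1]    + cur[i+sy]   + cur[i-sy]
                            + cur[i+sx]   + cur[i-sx])
                        + c2*(cur[i+2]    + cur[i-2]    + cur[i+2*sy] + cur[i-2*sy]
                            + cur[i+2*sx] + cur[i-2*sx]);
                    // 2nd-order time update; v2dt2 holds velocity^2 * dt^2 per cell
                    next[i] = 2.0f*cur[i] - prev[i] + v2dt2[i]*lap;
                }
        // The halo exchange of the two boundary planes with neighboring nodes
        // (e.g. via MPI_Sendrecv) would follow here, as described above.
    }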

Total memory requirements for each grid size:

    125x1151x1231: 7.5GB
    1243x1151x1231: 78GB
    2486x1151x1231: 156GB
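
As a rough cross-check (our arithmetic, not figures from the report): one single-precision grid of 125 x 1151 x 1231 cells occupies 125 x 1151 x 1231 x 4 bytes ≈ 708MB, matching the 709MB per-slice figure in the I/O description below; likewise 2486 x 1151 x 1231 x 4 bytes ≈ 14.1GB per grid, so the 156GB total corresponds to roughly eleven full grids, consistent with three input property grids, six wavefield state vectors, an output image, and working buffers.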

For this phase of benchmarking, the focus was on optimizing the data initialization. In the next phase, the trace data reading will be optimized so that each node reads in only its section of interest; in this benchmark, the trace data reading skews the Total Application Performance as the number of nodes increases. That, along with further per-node OpenMP optimization, will be addressed in the next phase of benchmarking. The I/O profile of this benchmark phase for each grid size is listed below (a sketch of the per-node MPI IO initialization read follows the list):

    125x1151x1231 (7.5GB):
      Initialization MPI Read: 3 x 709MB = 2.1GB / number of nodes
      Trace Data Read per Node: 2 x 800 x 576KB = 920MB (x number of nodes in aggregate, since each node reads the full trace data in this phase)
      Final Output Image MPI Write: 709MB / number of nodes
    1243x1151x1231 (78GB):
      Initialization MPI Read: 3 x 7.1GB = 21.3GB / number of nodes
      Trace Data Read per Node: 2 x 800 x 5.7MB = 9.2GB (x number of nodes in aggregate)
      Final Output Image MPI Write: 7.1GB / number of nodes
    2486x1151x1231 (156GB):
      Initialization MPI Read: 3 x 14.2GB = 42.6GB / number of nodes
      Trace Data Read per Node: 2 x 800 x 11.4MB = 18.4GB (x number of nodes in aggregate)
      Final Output Image MPI Write: 14.2GB / number of nodes
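
As referenced above, here is a minimal sketch of the per-node initialization read using MPI IO. The float element type, the collective read, an evenly divisible first dimension, and all names are assumptions; the benchmark's actual reader may differ.

    // Each node reads only its slab of one slice file (illustrative sketch).
    #include <mpi.h>
    #include <vector>

    std::vector<float> read_slab(const char* path, int n1, int n2, int n3)
    {
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        // Domain decomposition: first grid dimension divided by the node count.
        const int slab  = n1 / nprocs;                 // rows owned by this node
        const int count = slab * n2 * n3;              // floats in this node's slab
        const MPI_Offset offset =
            (MPI_Offset)rank * count * sizeof(float);  // byte offset of the slab

        std::vector<float> buf(count);
        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, path, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
        // Collective read: every node pulls its own section in one operation.
        MPI_File_read_at_all(fh, offset, &buf[0], count, MPI_FLOAT,
                             MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        return buf;
    }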

Key Points and Best Practices

  • Additional evaluations were performed to compare GBit NFS, InfiniBand NFS, and InfiniBand Lustre for the Reverse Time Migration initialization. InfiniBand NFS was 6x faster than GBit NFS, and InfiniBand Lustre was 3x faster than InfiniBand NFS, using the same disk configurations. On 12 nodes for grid size 2486x1151x1231, the initialization time was 22.50 seconds for IB Lustre, 61.03 seconds for IB NFS, and 364.25 seconds for GBit NFS.
  • The Reverse Time Migration computational performance scales well as a function of the grid size being processed, consistent with the IBM published results for this application.
  • The Total Application performance results are not typically reported in benchmark studies for this application. The IBM report specifically states that its execution times do not include I/O times and non-recurring allocation or initialization delays. Examining the total application performance reveals that the workload is no longer dominated by the partial differential equation (PDE) solver, as IBM suggests, but is constrained by the I/O for grid initialization, reading in the traces, saving and restoring wave state data, and writing out the final image. Aggressive optimization of the PDE solver has little effect on the overall throughput of this application; it is clearly more important to optimize the I/O. The trend in seismic processing, as stated at the 2008 Society of Exploration Geophysicists (SEG) conference, is to run the reverse time migration iteratively on wide azimuth data, so optimizing I/O and application throughput is imperative to meet this trend. SSD and Flash technologies in conjunction with Sun's Lustre file system can reduce this I/O bottleneck and pave the path for the future of seismic processing.
  • Minimal tuning effort was applied to achieve the results presented. Sun's HPC software stack, which includes the Sun Studio compiler, was used to build the 70,000 lines of C++ and Fortran source into the application executable. The only compiler option used was "-fast". No assembly-level optimizations, like those performed by IBM to use SIMD (SSE) registers, were performed in this benchmark. Similarly, no explicit cache blocking, loop unrolling, or memory bandwidth optimizations were conducted. The idea was to demonstrate the performance that a customer can expect from their existing applications without extensive, platform-specific optimizations.
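
For reference, a build along these lines with the Sun Studio compilers would look roughly like the following. The file names are illustrative; "-fast" is the option actually reported, while "-xopenmp" (Sun Studio's OpenMP flag) and the link step are assumptions:

    CC  -fast -xopenmp -c rtm_kernel.cpp             # Sun Studio C++ compiler
    f90 -fast -xopenmp -c rtm_io.f90                 # Sun Studio Fortran compiler
    CC  -fast -xopenmp rtm_kernel.o rtm_io.o -o rtm  # link (MPI libraries omitted)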


Disclosure Statement

Reverse Time Migration, Results as of 10/23/2009. For more info http://www.caam.rice.edu/tech_reports/2006/TR06-18.pdf

Sun F5100 and Seismic Reverse Time Migration with faster Optimal Checkpointing

A prominent seismic processing algorithm, Reverse Time Migration with Optimal Checkpointing, was tested in SMP "THREADS" mode using a Sun Fire X4270 server configured with four high-performance 15K RPM SAS hard disk drives (HDDs) and a Sun Storage F5100 Flash Array. This benchmark compares I/O devices for checkpointing wave state information while processing a production seismic migration.

  • Sun Storage F5100 Flash Array is 2.2x faster than high-performance 15K RPM disks.

  • Multithreading the checkpointing using the Sun Studio C++ Compiler OpenMP implementation gives a 12.8x performance improvement over the original single threaded version.

These results show that the new trend in seismic processing, running iterative Reverse Time Migrations with migration playback, is now a reality. This is made possible by Sun FlashFire technology, which provides good checkpointing speeds without additional disk cache memory. The application can use all the memory within a node without reserving the checkpoint cache buffers required for performance with HDDs. Similarly, larger problem sizes can be solved without increasing the memory footprint of each computational node.

Performance Landscape


Reverse Time Migration Optimal Checkpointing - SMP Threads Mode
Grid Size: 800 x 1151 x 1231 with 800 Samples - 60GB of memory

                             HDD                              F5100
  Number      Put       Get       Total        Put       Get       Total      F5100
  Checkpts   (secs)    (secs)     (secs)      (secs)    (secs)     (secs)    Speedup
     80       660.8      25.8      686.6       277.4      40.2      317.6      2.2x
    400      1615.6     382.3     1997.9       989.5     269.7     1259.2      1.6x


Reverse Time Migration Optimal Checkpointing - SMP Threads Mode
Grid Size: 125 x 1151 x 1231 with 800 Samples - 9GB of memory

                             HDD                              F5100
  Number      Put       Get       Total        Put       Get       Total      F5100
  Checkpts   (secs)    (secs)     (secs)      (secs)    (secs)     (secs)    Speedup
     80        10.2       0.2       10.4         8.0       0.2        8.2      1.3x
    400        52.3       0.4       52.7        45.2       0.3       45.5      1.2x
    800       102.6       0.7      103.3        91.8       0.6       92.4      1.1x


Reverse Time Migration Optimal Checkpointing
Single Thread vs Multithreaded I/O Performance
Grid Size: 125 x 1151 x 1231 with 800 Samples - 9GB of memory

  Number      Single Thread F5100     Multithreaded F5100     Multithread
  Checkpts    Total Time (secs)       Total Time (secs)        Speedup
     80             105.3                     8.2                12.8x
    400             482.9                    45.5                10.6x
    800             963.5                    92.4                10.4x

Note: HyperThreading and Turbo Mode are enabled, running 16 threads per node.

Results and Configuration Summary

Hardware Configuration:

    Sun Fire X4270 Server
      2 x 2.93 GHz Quad-core Intel Xeon X5570 processors
      72 GB memory
      4 x 73 GB 15K SAS drives
        File system striped (RAID 0) across the four 15K RPM high-performance SAS drives
      Sun Storage F5100 Flash Array with local/internal read/write buffer of 4096
        20 x 24 GB flash modules

Software Configuration:

    OS: 64-bit SUSE Linux Enterprise Server SLES 10 SP 2
    Compiler: Sun Studio 12 C++, Fortran, OpenMP

Benchmark Description

The Reverse Time Migration (RTM) is currently the most popular seismic processing algorithm because of its ability to produce quality images of complex substructures. It can accurately image steep dips that cannot be imaged correctly with traditional Kirchhoff 3D or frequency domain algorithms. The Wave Equation Migration (WEM) can image steep dips but does not produce the image quality that can be achieved by the RTM. However, the increased computational complexity of the RTM over the WEM introduces new areas for performance optimization. The current trend in seismic processing is to perform iterative migrations on wide azimuth marine data surveys using the Reverse Time Migration.

The Reverse Time Migration with Optimal Checkpointing was introduced so that large migrations could be performed within the minimal memory configurations of x86 cluster nodes. The idea is to keep only three wave state vectors in memory for each of the source and receiver wavefields, instead of holding the entire wavefields in memory for the duration of processing. With the Sun Storage F5100 Flash Array, this can be done with little performance penalty to the full migration time. Another advantage of checkpointing is the ability to play back migrations and facilitate iterative migrations.
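
The buffer rotation can be sketched as follows. This is illustrative: put_snapshot() is a hypothetical helper standing in for the checkpoint write, and the fixed snapshot interval is a simplification of Griewank's schedule described below.

    #include <algorithm>   // std::swap
    #include <cstddef>
    #include <vector>

    // Hypothetical checkpoint write; stands in for the HDD/F5100 "put".
    void put_snapshot(int /*step*/, const std::vector<float>& /*state*/) { }

    void forward_sweep(int nsamples, int snap_interval, std::size_t cells)
    {
        // Only three wave state vectors are resident, regardless of nsamples.
        std::vector<float> prev(cells), cur(cells), next(cells);
        for (int t = 0; t < nsamples; ++t) {
            // time_step(prev, cur, next, ...);   // propagate one sample
            if (t % snap_interval == 0)
                put_snapshot(t, cur);             // save wave state for playback
            std::swap(prev, cur);                 // rotate the buffers instead of
            std::swap(cur, next);                 // keeping the full history
        }
    }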

  • The stored snapshot data can be reprocessed with different filtering, image conditioning, or a variety of other parameters.
  • Fine-grained snapshotting can help the processing of more complex subsurface data.
  • A geoscientist can "play back" a migration from the saved snapshots to visually validate migration accuracy or to pick areas of interest for additional processing.

The Reverse Time Migration with Optimal Checkpointing is an algorithm designed by Griewank (Griewank, 1992; Blanch et al., 1998; Griewank, 2000; Griewank and Walther, 2000; Akcelik et al., 2003).

  • The application takes snapshots of wavefield state data for some interval of the total number of samples.
  • This adjoint state method performs crosscorrelation of the source and receiver wavefields at each level.
  • Forward recursion is used for the source wavefield and backward recursion for the receiver wavefield.
  • For relatively small seismic migrations, all of the forward processed state information can be saved and restored with minimal impact on the total processing time.
  • Effectively, the computational complexity increases while the memory requirements decrease by a logarithmic factor of the number of snapshots.
  • Griewank's algorithm defines the optimal tradeoff between computational performance and the number of memory buffers (memory requirements) needed to support this crosscorrelation (see the illustration below).
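
As a concrete illustration of that tradeoff (a standard result of Griewank's analysis, not a figure from this report): with c checkpoint buffers and at most r repeated forward sweeps, up to (c+r choose r) time steps can be reversed. For example, c = 10 buffers and r = 3 sweeps suffice to reverse C(13,3) = 286 time steps.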

For the purposes of this benchmark, this implementation of the Reverse Time Migration with Optimal Checkpointing does not fully implement the optimal memory buffer scheme proposed by Griewank. The intent is to compare various I/O alternatives for saving wave state data for each node in a compute cluster.

This benchmark measures the time to perform the wave state saves and restores while simultaneously processing the wave state data.
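
Since the headline 12.8x result above comes from parallelizing these saves, here is a minimal sketch of a multithreaded checkpoint "put" using OpenMP. The chunked pwrite() scheme and the 16-thread count are assumptions based on the configuration notes, not the benchmark's actual implementation:

    #include <algorithm>   // std::min
    #include <cstddef>
    #include <omp.h>
    #include <unistd.h>    // pwrite
    #include <vector>

    // Write one wave state vector to the checkpoint device from many threads.
    // Error handling is omitted in this sketch.
    void put_snapshot_mt(int fd, const std::vector<float>& state)
    {
        const std::size_t bytes = state.size() * sizeof(float);
        #pragma omp parallel num_threads(16)   // 16 I/O threads (HT enabled)
        {
            const int nt = omp_get_num_threads();
            const std::size_t chunk = (bytes + nt - 1) / nt;
            const std::size_t off   = (std::size_t)omp_get_thread_num() * chunk;
            if (off < bytes)
                // Concurrent positional writes; the F5100 services concurrent
                // requests from multiple threads very efficiently.
                pwrite(fd, (const char*)&state[0] + off,
                       std::min(chunk, bytes - off), (off_t)off);
        }
    }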

Key Points and Best Practices

  • Multithreading the checkpointing using Sun Studio OpenMP and running 16 I/O threads with HyperThreading enabled gives a performance advantage over single-threaded I/O to the Sun Storage F5100 Flash Array. The Sun Storage F5100 Flash Array can process concurrent I/O requests from multiple threads very efficiently.
  • Allocating the majority of a node's available memory to the Reverse Time Migration algorithm, leaving little memory for I/O caching, favors the Sun Storage F5100 Flash Array over direct-attached high-performance disk drives. This advantage decreases as the number of snapshots increases, because more snapshots lower the application's memory requirement and leave more memory available for file system caching, which benefits the disk drives.


Disclosure Statement

Reverse Time Migration with Optimal Checkpointing, Results as of 10/23/2009. For more info http://www.caam.rice.edu/tech_reports/2006/TR06-18.pdf