Monday Sep 19, 2011

Halliburton ProMAX® Seismic Processing on Sun Blade X6270 M2 with Sun ZFS Storage 7320

Halliburton/Landmark's ProMAX® 3D Pre-Stack Kirchhoff Time Migration's (PSTM) single workflow scalability and multiple workflow throughput using various scheduling methods are evaluated on a cluster of Oracle's Sun Blade X6270 M2 server modules attached to Oracle's Sun ZFS Storage 7320 appliance.

Two resource scheduling methods, compact and distributed, are compared while increasing the system load with additional concurrent ProMAX® workflows.

  • Multiple concurrent 24-process ProMAX® PSTM workflow throughput is constant; 10 workflows on 10 nodes finish as fast as 1 workflow on one compute node. Additionally, processing twice the data volume yields similar traces/second throughput performance.

  • A single ProMAX® PSTM workflow has good scaling from 1 to 10 nodes of a Sun Blade X6270 M2 cluster scaling 4.5X. ProMAX® scales to 4.7X on 10 nodes with one input data set and 6.3X with two consecutive input data sets (i.e. twice the data).

  • A single ProMAX® PSTM workflow has near linear scaling of 11x on a Sun Blade X6270 M2 server module when running from 1 to 12 processes.

  • The 12-thread ProMAX® workflow throughput using the distributed scheduling method is equivalent or slightly faster than the compact scheme for 1 to 6 concurrent workflows.

Performance Landscape

Multiple 24-Process Workflow Throughput Scaling

This test measures the system throughput scalability as concurrent 24-process workflows are added, one workflow per node. The per workflow throughput and the system scalability are reported.

Aggregate system throughput scales linearly. Ten concurrent workflows finish in the same time as does one workflow on a single compute node.

Halliburton ProMAX® Pre-Stack Time Migration - Multiple Workflow Scaling


Single Workflow Scaling

This test measures single workflow scalability across a 10-node cluster. Utilizing a single data set, performance exhibits near linear scaling of 11x at 12 processes, and per-node scaling of 4x at 6 nodes; performance flattens quickly reaching a peak of 60x at 240 processors and per-node scaling of 4.7x with 10 nodes.

Running with two consecutive input data sets in the workflow, scaling is considerably improved with peak scaling ~35% higher than obtained using a single data set. Doubling the data set size minimizes time spent in workflow initialization, data input and output.

Halliburton ProMAX® Pre-Stack Time Migration - Single Workflow Scaling

This next test measures single workflow scalability across a 10-node cluster (as above) but limiting scheduling to a maximum of 12-process per node; effectively restricting a maximum of one process per physical core. The speedup relative to a single process, and single node are reported.

Utilizing a single data set, performance exhibits near linear scaling of 37x at 48 processes, and per-node scaling of 4.3x at 6 nodes. Performance of 55x at 120 processors and per-node scaling of 5x with 10 nodes is reached and scalability is trending higher more strongly compared to the the case of two processes running per physical core above. For equivalent total process counts, multi-node runs using only a single process per physical core appear to run between 28-64% more efficiently (96 and 24 processes respectively). With a full compliment of 10 nodes (120 processes) the peak performance is only 9.5% lower than with 2 processes per vcpu (240 processes).

Running with two consecutive input data sets in the workflow, scaling is considerably improved with peak scaling ~35% higher than obtained using a single data set.

Halliburton ProMAX® Pre-Stack Time Migration - Single Workflow Scaling

Multiple 12-Process Workflow Throughput Scaling, Compact vs. Distributed Scheduling

The fourth test compares compact and distributed scheduling of 1, 2, 4, and 6 concurrent 12-processor workflows.

All things being equal, the system bi-section bandwidth should improve with distributed scheduling of a fixed-size workflow; as more nodes are used for a workflow, more memory and system cache is employed and any node memory bandwidth bottlenecks can be offset by distributing communication across the network (provided the network and inter-node communication stack do not become a bottleneck). When physical cores are not over-subscribed, compact and distributed scheduling performance is within 3% suggesting that there may be little memory contention for this workflow on the benchmarked system configuration.

With compact scheduling of two concurrent 12-processor workflows, the physical cores become over-subscribed and performance degrades 36% per workflow. With four concurrent workflows, physical cores are oversubscribed 4x and performance is seen to degrade 66% per workflow. With six concurrent workflows over-subscribed compact scheduling performance degrades 77% per workflow. As multiple 12-processor workflows become more and more distributed, the performance approaches the non over-subscribed case.

Halliburton ProMAX® Pre-Stack Time Migration - Multiple Workflow Scaling

141616 traces x 624 samples


Test Notes

All tests were performed with one input data set (70808 traces x 624 samples) and two consecutive input data sets (2 * (70808 traces x 624 samples)) in the workflow. All results reported are the average of at least 3 runs and performance is based on reported total wall-clock time by the application.

All tests were run with NFS attached Sun ZFS Storage 7320 appliance and then with NFS attached legacy Sun Fire X4500 server. The StorageTek Workload Analysis Tool (SWAT) was invoked to measure the I/O characteristics of the NFS attached storage used on separate runs of all workflows.

Configuration Summary

Hardware Configuration:

10 x Sun Blade X6270 M2 server modules, each with
2 x 3.33 GHz Intel Xeon X5680 processors
48 GB DDR3-1333 memory
4 x 146 GB, Internal 10000 RPM SAS-2 HDD
10 GbE
Hyper-Threading enabled

Sun ZFS Storage 7320 Appliance
1 x Storage Controller
2 x 2.4 GHz Intel Xeon 5620 processors
48 GB memory (12 x 4 GB DDR3-1333)
2 TB Read Cache (4 x 512 GB Read Flash Accelerator)
10 GbE
1 x Disk Shelf
20.0 TB RAID-Z (20 x 1 TB SAS-2, 7200 RPM HDD)
4 x Write Flash Accelerators

Sun Fire X4500
2 x 2.8 GHz AMD 290 processors
16 GB DDR1-400 memory
34.5 TB RAID-Z (46 x 750 GB SATA-II, 7200 RPM HDD)
10 GbE

Software Configuration:

Oracle Linux 5.5
Parallel Virtual Machine 3.3.11 (bundled with ProMAX)
Intel 11.1.038 Compilers
Libraries: pthreads 2.4, Java 1.6.0_01, BLAS, Stanford Exploration Project Libraries

Benchmark Description

The ProMAX® family of seismic data processing tools is the most widely used Oil and Gas Industry seismic processing application. ProMAX® is used for multiple applications, from field processing and quality control, to interpretive project-oriented reprocessing at oil companies and production processing at service companies. ProMAX® is integrated with Halliburton's OpenWorks® Geoscience Oracle Database to index prestack seismic data and populate the database with processed seismic.

This benchmark evaluates single workflow scalability and multiple workflow throughput of the ProMAX® 3D Prestack Kirchhoff Time Migration (PSTM) while processing the Halliburton benchmark data set containing 70,808 traces with 8 msec sample interval and trace length of 4992 msec. Benchmarks were performed with both one and two consecutive input data sets.

Each workflow consisted of:

  • reading the previously constructed MPEG encoded processing parameter file
  • reading the compressed seismic data traces from disk
  • performing the PSTM imaging
  • writing the result to disk

Workflows using two input data sets were constructed by simply adding a second identical seismic data read task immediately after the first in the processing parameter file. This effectively doubled the data volume read, processed, and written.

This version of ProMAX® currently only uses Parallel Virtual Machine (PVM) as the parallel processing paradigm. The PVM software only used TCP networking and has no internal facility for assigning memory affinity and processor binding. Every compute node is running a PVM daemon.

The ProMAX® processing parameters used for this benchmark:

Minimum output inline = 65
Maximum output inline = 85
Inline output sampling interval = 1
Minimum output xline = 1
Maximum output xline = 200 (fold)
Xline output sampling interval = 1
Antialias inline spacing = 15
Antialias xline spacing = 15
Stretch Mute Aperature Limit with Maximum Stretch = 15
Image Gather Type = Full Offset Image Traces
No Block Moveout
Number of Alias Bands = 10
3D Amplitude Phase Correction
No compression
Maximum Number of Cache Blocks = 500000

Primary PSTM business metrics are typically time-to-solution and accuracy of the subsurface imaging solution.

Key Points and Best Practices

  • Multiple job system throughput scales perfectly; ten concurrent workflows on 10 nodes each completes in the same time and has the same throughput as a single workflow running on one node.
  • Best single workflow scaling is 6.6x using 10 nodes.

    When tasked with processing several similar workflows, while individual time-to-solution will be longer, the most efficient way to run is to fully distribute them one workflow per node (or even across two nodes) and run these concurrently, rather than to use all nodes for each workflow and running consecutively. For example, while the best-case configuration used here will run 6.6 times faster using all ten nodes compared to a single node, ten such 10-node jobs running consecutively will overall take over 50% longer to complete than ten jobs one per node running concurrently.

  • Throughput was seen to scale better with larger workflows. While throughput with both large and small workflows are similar with only one node, the larger dataset exhibits 11% and 35% more throughput with four and 10 nodes respectively.

  • 200 processes appears to be a scalability asymptote with these workflows on the systems used.
  • Hyperthreading marginally helps throughput. For the largest model run on 10 nodes, 240 processes delivers 11% more performance than with 120 processes.

  • The workflows do not exhibit significant I/O bandwidth demands. Even with 10 concurrent 24-process jobs, the measured aggregate system I/O did not exceed 100 MB/s.

  • 10 GbE was the only network used and, though shared for all interprocess communication and network attached storage, it appears to have sufficient bandwidth for all test cases run.

See Also

Disclosure Statement

The following are trademarks or registered trademarks of Halliburton/Landmark Graphics: ProMAX®, GeoProbe®, OpenWorks®. Results as of 9/1/2011.

Wednesday Dec 08, 2010

Sun Blade X6275 M2 Cluster with Sun Storage 7410 Performance Running Seismic Processing Reverse Time Migration

This Oil & Gas benchmark highlights both the computational performance improvements of the Sun Blade X6275 M2 server module over the previous genernation server and the linear scalability achievable for the total application throughput using a Sun Storage 7410 system to deliver almost 2 GB/sec I/O effective write performance.

Oracle's Sun Storage 7410 system attached via 10 Gigabit Ethernet to a cluster of Oracle's Sun Blade X6275 M2 server modules was used to demonstrate the performance of a 3D VTI Reverse Time Migration application, a heavily used geophysical imaging and modeling application for Oil & Gas Exploration. The total application throughput scaling and computational kernel performance improvements are presented for imaging two production sized grids using 800 input samples.

  • The Sun Blade X6275 M2 server module showed up to a 40% performance improvement over the previous generation server module with super-linear scalability to 16 nodes for the 9-Point Stencil used in this Reverse Time Migration computational kernel.

  • The balanced combination of Oracle's Sun Storage 7410 system over 10 GbE to the Sun Blade X6275 M2 server module cluster showed linear scalability for the total application throughput, including the I/O and MPI communication, to produce a final 3-D seismic depth imaged cube for interpretation.

  • The final image write time from the Sun Blade X6275 M2 server module nodes to Oracle's Sun Storage 7410 system achieved 10GbE line speed of 1.25 GBytes/second or better write performance. The effects of I/O buffer caching on the Sun Blade X6275 M2 server module nodes and 34 GByte write optimized cache on the Sun Storage 7410 system gave up to 1.8 GBytes/second effective write performance.

Performance Landscape

Server Generational Performance Improvements

Performance improvements for the Reverse Time Migration computational kernel using a Sun Blade X6275 M2 cluster are compared to the previous generation Sun Blade X6275 cluster. Hyper-threading was enabled for both configurations allowing 24 OpenMP threads for the Sun Blade X6275 M2 server module nodes and 16 for the Sun Blade X6275 server module nodes.

Sun Blade X6275 M2 Performance Improvements
Number Nodes Grid Size - 1243 x 1151 x 1231 Grid Size - 2486 x 1151 x1231
X6275 Kernel Time (sec) X6275 M2 Kernel Time (sec) X6275 M2 Speedup X6275 Kernel Time (sec) X6275 M2 Kernel Time (sec) X6275 M2 Speedup
16 306 242 1.3 728 576 1.3
14 355 271 1.3 814 679 1.2
12 435 346 1.3 945 797 1.2
10 541 390 1.4 1156 890 1.3
8 726 555 1.3 1511 1193 1.3

Application Scaling

Performance and scaling results of the total application, including I/O, for the reverse time migration demonstration application are presented. Results were obtained using a Sun Blade X6275 M2 server cluster with a Sun Storage 7410 system for the file server. The servers were running with hyperthreading enabled, allowing for 24 OpenMP threads per server node.

Application Scaling Across Multiple Nodes
Number Nodes Grid Size - 1243 x 1151 x 1231 Grid Size - 2486 x 1151 x1231
Total Time (sec) Kernel Time (sec) Total Speedup Kernel Speedup Total Time (sec) Kernel Time (sec) Total Speedup Kernel Speedup
16 501 242 2.1\* 2.3\* 1060 576 2.0 2.1\*
14 583 271 1.8 2.0 1219 679 1.7 1.8
12 681 346 1.6 1.6 1420 797 1.5 1.5
10 807 390 1.3 1.4 1688 890 1.2 1.3
8 1058 555 1.0 1.0 2085 1193 1.0 1.0

\* Super-linear scaling due to the compute kernel fitting better into available cache for larger node counts

Image File Effective Write Performance

The performance for writing the final 3D image from a Sun Blade X6275 M2 server cluster over 10 Gigabit Ethernet to a Sun Storage 7410 system are presented. Each server allocated one core per node for MPI I/O thus allowing 22 OpenMP compute threads per node with hyperthreading enabled. Captured performance analytics from the Sun Storage 7410 system indicate effective use of its 34 Gigabyte write optimized cache.

Image File Effective Write Performance
Number Nodes Grid Size - 1243 x 1151 x 1231 Grid Size - 2486 x 1151 x1231
Write Time (sec) Write Performance (GB/sec) Write Time (sec) Write Performance (GB/sec)
16 4.8 1.5 10.2 1.4
14 5.0 1.4 10.2 1.4
12 4.0 1.8 11.3 1.3
10 4.3 1.6 9.1 1.6
8 4.6 1.5 9.7 1.5

Note: Performance results better than 1.3GB/sec related to I/O buffer caching on server nodes.

Configuration Summary

Hardware Configuration:

8 x 2 node Sun Blade X6275 M2 server nodes, each node with
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory (12 x 4 GB at 1333 MHz)
1 x QDR InfiniBand Host Channel Adapter

Sun Datacenter InfiniBand Switch IB-36
Sun Network 10 GbE Switch 72p

Sun Storage 7410 system connected via 10 Gigabit Ethernet
4 x 17 GB STEC ZeusIOPs SSD mirrored - 34 GB
40 x 750 GB 7500 RPM Seagate SATA disks mirrored - 14.4 TB
No L2ARC Readzilla Cache

Software Configuration:

Oracle Enterprise Linux Server release 5.5
Oracle Message Passing Toolkit 8.2.1c (for MPI)
Oracle Solaris Studio 12.2 C++, Fortran, OpenMP

Benchmark Description

This Vertical Transverse Isotropy (VTI) Anisotropic Reverse Time Depth Migration (RTM) application measures the total time it takes to image 800 samples of various production size grids and write the final image to disk for the next work flow step involving 3-D seismic volume interpretation. In doing so, it reports the compute, interprocessor communication, and I/O performance of the individual functions that comprise the total solution. Unlike most references for the Reverse Time Migration, that focus solely on the performance of the 3D stencil compute kernel, this demonstration code additionally reports the total throughput involved in processing large data sets with a full 3D Anisotropic RTM application. It provides valuable insight into configuration and sizing for specific seismic processing requirements. The performance effects of new processors, interconnects, I/O subsystems, and software technologies can be evaluated while solving a real Exploration business problem.

This benchmark study uses the "in-core" implementation of this demonstration code where each node reads in only the trace, velocity, and conditioning data to be processed by that node plus a 4 element array pad (based on spatial order 8) shared with it's neighbors to the left and right during the initialization phase. It maintains previous, current, and next wavefield state information for each of the source, receiver, and anisotropic wavefields in memory. The second two grid dimensions used in this benchmark are specifically chosen to be prime numbers to exaggerate the effects of data alignment. Algorithm adaptions for processing higher orders in space and alternative "out-of-core" solutions using SSDs for wave state checkpointing are implemented in this demonstration application to better understand the effects of problem size scaling. Care is taken to handle absorption boundary conditioning and a variety of imaging conditions, appropriately.

RTM Application Structure:

Read Processing Parameter File, Determine Domain Decomposition, and Initialize Data Structures, and Allocate Memory.

Read Velocity, Epsilon, and Delta Data Based on Domain Decomposition and create source, receiver, & anisotropic previous, current, and next wave states.

First Loop over Time Steps

Compute 3D Stencil for Source Wavefield (a,s) - 8th order in space, 2nd order in time
Propagate over Time to Create s(t,z,y,x) & a(t,z,y,x)
Inject Estimated Source Wavelet
Apply Absorption Boundary Conditioning (a)
Update Wavefield States and Pointers
Write Snapshot of Wavefield (out-of-core) or Push Wavefield onto Stack (in-core)
Communicate Boundary Information

Second Loop over Time Steps
Compute 3D Stencil for Receiver Wavefield (a,r) - 8th order in space, 2nd order in time
Propagate over Time to Create r(t,z,y,x) & a(t,z,y,x)
Read Receiver Trace and Inject Receiver Wavelet
Apply Absorption Boundary Conditioning (a)
Update Wavefield States and Pointers
Communicate Boundary Information
Read in Source Wavefield Snapshot (out-of-core) or Pop Off of Stack (in-core)
Cross-correlate Source and Receiver Wavefields
Update image using image conditioning parameters

Write 3D Depth Image i(z,x,y) = Sum over time steps s(t,z,x,y) \* r(t,z,x,y) or other imaging conditions.

Key Points and Best Practices

This demonstration application represents a full Reverse Time Migration solution. Many references to the RTM application tend to focus on the compute kernel and ignore the complexity that the input, communication, and output bring to the task.

Image File MPI Write Performance Tuning

Changing the Image File Write from MPI non-blocking to MPI blocking and setting Oracle Message Passing Toolkit MPI environment variables revealed an 18x improvement in write performance to the Sun Storage 7410 system going from:

    86.8 to 4.8 seconds for the 1243 x 1151 x 1231 grid size
    183.1 to 10.2 seconds for the 2486 x 1151 x 1231 grid size

The Swat Sun Storage 7410 analytics data capture indicated an initial write performance of about 100 MB/sec with the MPI non-blocking implementation. After modifying to MPI blocking writes, Swat showed between 1.3 and 1.8 GB/sec with up to 13000 write ops/sec to write the final output image. The Swat results are consistent with the actual measured performance and provide valuable insight into the Reverse Time Migration application I/O performance.

The reason for this vast improvement has to do with whether the MPI file mode is sequential or not (MPI_MODE_SEQUENTIAL, O_SYNC, O_DSYNC). The MPI non-blocking routines, MPI_File_iwrite_at and MPI_wait, typically used for overlapping I/O and computation, do not support sequential file access mode. Therefore, the application could not take full performance advantages of the Sun Storage 7410 system write optimized cache. In contrast, the MPI blocking routine, MPI_File_write_at, defaults to MPI sequential mode and the performance advantages of the write optimized cache are realized. Since writing the final image is at the end of RTM execution, there is no need to overlap the I/O with computation.

Additional MPI parameters used:

    setenv SUNW_MP_PROCBIND true
    setenv MPI_SPIN 1
    setenv MPI_PROC_BIND 1

Adjusting the Level of Multithreading for Performance

The level of multithreading (8, 10, 12, 22, or 24) for various components of the RTM should be adjustable based on the type of computation taking place. Best to use OpenMP num_threads clause to adjust the level of multi-threading for each particular work task. Use numactl to specify how the threads are allocated to cores in accordance to the OpenMP parallelism level.

See Also

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 12/07/2010.

Tuesday Oct 26, 2010

3D VTI Reverse Time Migration Scalability On Sun Fire X2270-M2 Cluster with Sun Storage 7210

This Oil & Gas benchmark shows the Sun Storage 7210 system delivers almost 2 GB/sec bandwidth and realizes near-linear scaling performance on a cluster of 16 Sun Fire X2270 M2 servers.

Oracle's Sun Storage 7210 system attached via QDR InfiniBand to a cluster of sixteen of Oracle's Sun Fire X2270 M2 servers was used to demonstrate the performance of a Reverse Time Migration application, an important application in the Oil & Gas industry. The total application throughput and computational kernel scaling are presented for two production sized grids of 800 samples.

  • Both the Reverse Time Migration I/O and combined computation shows near-linear scaling from 8 to 16 nodes on the Sun Storage 7210 system connected via QDR InfiniBand to a Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231: 2.0x improvement
      2486 x 1151 x 1231: 1.7x improvement
  • The computational kernel of the Reverse Time Migration has linear to super-linear scaling from 8 to 16 nodes in Oracle's Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231 : 2.2x improvement
      2486 x 1151 x 1231 : 2.0x improvement
  • Intel Hyper-Threading provides additional performance benefits to both the Reverse Time Migration I/O and computation when going from 12 to 24 OpenMP threads on the Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231: 8% - computational kernel; 2% - total application throughput
      2486 x 1151 x 1231: 12% - computational kernel; 6% - total application throughput
  • The Sun Storage 7210 system delivers the Velocity, Epsilon, and Delta data to the Reverse Time Migration at a steady rate even when timing includes memory initialization and data object creation:

      1243 x 1151 x 1231: 1.4 to 1.6 GBytes/sec
      2486 x 1151 x 1231: 1.2 to 1.3 GBytes/sec

    One can see that when doubling the size of the problem, the additional complexity of overlapping I/O and multiple node file contention only produces a small reduction in read performance.

Performance Landscape

Application Scaling

Performance and scaling results of the total application, including I/O, for the reverse time migration demonstration application are presented. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server. The servers were running with hyperthreading enabled, allowing for 24 OpenMP threads per server.

Application Scaling Across Multiple Nodes
Number Nodes Grid Size - 1243 x 1151 x 1231 Grid Size - 2486 x 1151 x1231
Total Time (sec) Kernel Time (sec) Total Speedup Kernel Speedup Total Time (sec) Kernel Time (sec) Total Speedup Kernel Speedup
16 504 259 2.0 2.2\* 1024 551 1.7 2.0
14 565 279 1.8 2.0 1191 677 1.5 1.6
12 662 343 1.6 1.6 1426 817 1.2 1.4
10 784 394 1.3 1.4 1501 856 1.2 1.3
8 1024 560 1.0 1.0 1745 1108 1.0 1.0

\* Super-linear scaling due to the compute kernel fitting better into available cache

Application Scaling – Hyper-Threading Study

The affects of hyperthreading are presented when running the reverse time migration demonstration application. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server.

Hyper-Threading Comparison – 12 versus 24 OpenMP Threads
Number Nodes Thread per Node Grid Size - 1243 x 1151 x 1231 Grid Size - 2486 x 1151 x1231
Total Time (sec) Kernel Time (sec) Total HT Speedup Kernel HT Speedup Total Time (sec) Kernel Time (sec) Total HT Speedup Kernel HT Speedup
16 24 504 259 1.02 1.08 1024 551 1.06 1.12
16 12 515 279 1.00 1.00 1088 616 1.00 1.00

Read Performance

Read performance is presented for the velocity, epsilon and delta files running the reverse time migration demonstration application. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server. The servers were running with hyperthreading enabled, allowing for 24 OpenMP threads per server.

Velocity, Epsilon, and Delta File Read and Memory Initialization Performance
Number Nodes Overlap MBytes Read Grid Size - 1243 x 1151 x 1231 Grid Size - 2486 x 1151 x1231
Time (sec) Time Relative 8-node Total GBytes Read Read Rate GB/s Time (sec) Time Relative 8-node Total GBytes Read Read Rate GB/s
16 2040 16.7 1.1 23.2 1.4 36.8 1.1 44.3 1.2
8 951
14.8 1.0 22.1 1.6 33.0 1.0 43.2 1.3

Configuration Summary

Hardware Configuration:

16 x Sun Fire X2270 M2 servers, each with
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory (12 x 4 GB at 1333 MHz)

Sun Storage 7210 system connected via QDR InfiniBand
2 x 18 GB SATA SSD (logzilla)
40 x 1 TB 7200 RM SATA disk

Software Configuration:

SUSE Linux Enterprise Server SLES 10 SP 2
Oracle Message Passing Toolkit 8.2.1 (for MPI)
Sun Studio 12 Update 1 C++, Fortran, OpenMP

Benchmark Description

This Reverse Time Migration (RTM) demonstration application measures the total time it takes to image 800 samples of various production size grids and write the final image to disk. In this version, each node reads in only the trace, velocity, and conditioning data to be processed by that node plus a four element inline 3-D array pad (spatial order of eight) shared with its neighbors to the left and right during the initialization phase. It represents a full RTM application including the data input, computation, communication, and final output image to be used by the next work flow step involving 3D volumetric seismic interpretation.

Key Points and Best Practices

This demonstration application represents a full Reverse Time Migration solution. Many references to the RTM application tend to focus on the compute kernel and ignore the complexity that the input, communication, and output bring to the task.

I/O Characterization without Optimal Checkpointing

Velocity, Epsilon, and Delta Files - Grid Reading

The additional amount of overlapping reads to share velocity, epsilon, and delta edge data with neighbors can be calculated using the following equation:

    (number_nodes - 1) x (order_in_space) x (y_dimension) x (z_dimension) x (4 bytes) x (3 files)

For this particular benchmark study, the additional 3-D pad overlap for the 16 and 8 node cases is:

    16 nodes: 15 x 8 x 1151 x 1231 x 4 x 3 = 2.04 GB extra
    8 nodes: 7 x 8 x 1151 x 1231 x 4 x 3 = 0.95 GB extra

For the first of the two test cases, the total size of the three files used for the 1243 x 1151 x 1231 case is

    1243 x 1151 x 1231 x 4 bytes = 7.05 GB per file x 3 files = 21.13 GB

With the additional 3-D pad, the total amount of data read is:

    16 nodes: 2.04 GB + 21.13 GB = 23.2 GB
    8 nodes: 0.95 GB + 21.13 GB = 22.1 GB

For the second of the two test cases, the total size of the three files used for the 2486 x 1151 x 1231 case is

    2486 x 1151 x 1231 x 4 bytes = 14.09 GB per file x 3 files = 42.27 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: 2.04 GB + 42.27 GB = 44.3 GB
    8 nodes: 0.95 GB + 42.27 GB = 43.2 GB

Note that the amount of overlapping data read increases, not only by the number of nodes, but as the y dimension and/or the z dimension increases.

Trace Reading

The additional amount of overlapping reads to share trace edge data with neighbors for can be calculated using the following equation:

    (number_nodes - 1) x (order_in_space) x (y_dimension) x (4 bytes) x (number_of_time_slices)

For this particular benchmark study, the additional overlap for the 16 and 8 node cases is:

    16 nodes: 15 x 8 x 1151 x 4 x 800 = 442MB extra
    8 nodes: 7 x 8 x 1151 x 4 x 800 = 206MB extra

For the first case the size of the trace data file used for the 1243 x 1151 x 1231 case is

    1243 x 1151 x 4 bytes x 800 = 4.578 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: .442 GB + 4.578 GB = 5.0 GB
    8 nodes: .206 GB + 4.578 GB = 4.8 GB

For the second case the size of the trace data file used for the 2486 x 1151 x 1231 case is

    2486 x 1151 x 4 bytes x 800 = 9.156 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: .442 GB + 9.156 GB = 9.6 GB
    8 nodes: .206 GB + 9.156 GB = 9.4 GB

As the number of nodes is increased, the overlap causes more disk lock contention.

Writing Final Output Image

1243x1151x1231 - 7.1 GB per file:

    16 nodes: 78 x 1151 x 1231 x 4 = 442MB/node (7.1 GB total)
    8 nodes: 156 x 1151 x 1231 x 4 = 884MB/node (7.1 GB total)

2486x1151x1231 - 14.1 GB per file:

    16 nodes: 156 x 1151 x 1231 x 4 = 930 MB/node (14.1 GB total)
    8 nodes: 311 x 1151 x 1231 x 4 = 1808 MB/node (14.1 GB total)

Resource Allocation

It is best to allocate one node as the Oracle Grid Engine resource scheduler and MPI master host. This is especially true when running with 24 OpenMP threads in hyperthreading mode to avoid oversubscribing a node that is cooperating in delivering the solution.

See Also

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 10/20/2010.

Tuesday Sep 21, 2010

ProMAX Performance and Throughput on Sun Fire X2270 and Sun Storage 7410

Halliburton/Landmark's ProMAX 3D Prestack Kirchhoff Time Migration's single job scalability and multiple job throughput using various scheduling methods are evaluated on a cluster of Oracle's Sun Fire X2270 servers attached via QDR InfiniBand to Oracle's Sun Storage 7410 system.

Two resource scheduling methods, compact and distributed, are compared while increasing the system load with additional concurrent ProMAX jobs.

  • A single ProMAX job has near linear scaling of 5.5x on 6 nodes of a Sun Fire X2270 cluster.

  • A single ProMAX job has near linear scaling of 7.5x on a Sun Fire X2270 server when running from 1 to 8 threads.

  • ProMAX can take advantage of Oracle's Sun Storage 7410 system features compared to dedicated local disks. There was no significant difference in run time observed when running up to 8 concurrent 16 thread jobs.

  • The 8-thread ProMAX job throughput using the distributed scheduling method is equivalent or slightly faster than the compact scheme for 1 to 4 concurrent jobs.

  • The 16-thread ProMAX job throughput using the distributed scheduling method is up to 8% faster when compared to the compact scheme on an 8-node Sun Fire X2270 cluster.

The multiple job throughput characterization revealed in this benchmark study are key in pre-configuring Oracle Grid Engine resource scheduling for ProMAX on a Sun Fire X2270 cluster and provide valuable insight for server consolidation.

Performance Landscape

Single Job Scaling

Single job performance on a single node is near linear up the number of cores in the node, i.e. 2 Intel Xeon X5570s with 4 cores each. With hyperthreading (2 active threads per core) enabled, more ProMAX threads are used increasing the load on the CPU's memory architecture causing the reduced speedups.
ProMAX single job performance on the 6-node cluster shows near linear speedup node to node.
Single Job 6-Node Scalability
Hyperthreading Enabled - 16 Threads/Node Maximum
Number of Nodes Threads Per Node Speedup to 1 Thread Speedup to 1 Node
6 16 54.2 5.5
4 16 36.2 3.6
3 16 26.1 2.6
2 16 17.6 1.8
1 16 10.0 1.0
1 14 9.2
1 12 8.6
1 10 7.2\*
1 8 7.5
1 6 5.9
1 4 3.9
1 3 3.0
1 2 2.0
1 1 1.0

\* 2 threads contend with two master node daemons

Multiple Job Throughput Scaling, Compact Scheduling

With the Sun Storage 7410 system, performance of 8 concurrent jobs on the cluster using compact scheduling is equivalent to a single job.

Multiple Job Throughput Scalability
Hyperthreading Enabled - 16 Threads/Node Maximum
Number of Nodes Number of Nodes per Job Threads Per Node per Job Performance Relative to 1 Job Total Nodes Percent Cluster Used
1 1 16 1.00 1 13
2 1 16 1.00 2 25
4 1 16 1.00 4 50
8 1 16 1.00 8 100

Multiple 8-Thread Job Throughput Scaling, Compact vs. Distributed Scheduling

These results report the difference of different distributed method resource scheduling levels to 1, 2, and 4 concurrent job compact method baselines.

Multiple 8-Thread Job Scheduling
HyperThreading Enabled - Use 8 Threads/Node Maximum
Number of Jobs Number of Nodes per Job Threads Per Node per Job Performance Relative to 1 Job Total Nodes Total Threads per Node Used Percent of PVM Master 8 Threads Used
1 1 8 1.00 1 8 100
1 4 2 1.01 4 2 25
1 8 1 1.01 8 1 13

2 1 8 1.00 2 8 100
2 4 2 1.01 4 4 50
2 8 1 1.01 8 2 25

4 1 8 1.00 4 8 100
4 4 2 1.00 4 8 100
4 8 1 1.01 8 4 100

Multiple 16-Thread Job Throughput Scaling, Compact vs. Distributed Scheduling

The results are reported relative to the performance of 1, 2, 4, and 8 concurrent 2-node, 8-thread jobs.

Multiple 16-Thread Job Scheduling
HyperThreading Enabled - 16 Threads/Node Available
Number of Jobs Number of Nodes per Job Threads Per Node per Job Performance Relative to 1 Job Total Nodes Total Threads per Node Used Percent of PVM Master 16 Threads Used
1 1 16 0.66 1 16 100\*
1 2 8 1.00 2 8 50
1 4 4 1.03 4 4 25
1 8 2 1.06 8 2 13

2 1 16 0.70 2 16 100\*
2 2 8 1.00 4 8 50
2 4 4 1.07 8 4 25
2 8 2 1.08 8 4 25

4 1 16 0.74 4 16 100\*
4 4 4 0.74 4 16 100\*
4 2 8 1.00 8 8 50
4 4 4 1.05 8 8 50
4 8 2 1.04 8 8 50

8 1 16 1.00 8 16 100\*
8 4 4 1.00 8 16 100\*
8 8 2 1.00 8 16 100\*

\* master PVM host; running 20 to 21 total threads (over-subscribed)

Results and Configuration Summary

Hardware Configuration:

8 x Sun Fire X2270 servers, each with
2 x 2.93 GHz Intel Xeon X5570 processors
48 GB memory at 1333 MHz
1 x 500 GB SATA
Sun Storage 7410 system
4 x 2.3 GHz AMD Opteron 8356 processors
128 GB memory
2 Internal 233GB SAS drives = 466 GB
2 Internal 93 GB read optimized SSD = 186 GB
1 External Sun Storage J4400 array with 22 1TB SATA drives and 2 18GB write optimized SSD
11 TB mirrored data and mirrored write optimized SSD

Software Configuration:

SUSE Linux Enterprise Server 10 SP 2
Parallel Virtual Machine 3.3.11
Oracle Grid Engine
Intel 11.1 Compilers
OpenWorks Database requires Oracle 10g Enterprise Edition
Libraries: pthreads 2.4, Java 1.6.0_01, BLAS, Stanford Exploration Project Libraries

Benchmark Description

The ProMAX family of seismic data processing tools is the most widely used Oil and Gas Industry seismic processing application. ProMAX is used for multiple applications, from field processing and quality control, to interpretive project-oriented reprocessing at oil companies and production processing at service companies. ProMAX is integrated with Halliburton's OpenWorks Geoscience Oracle Database to index prestack seismic data and populate the database with processed seismic.

This benchmark evaluates single job scalability and multiple job throughput of the ProMAX 3D Prestack Kirchhoff Time Migration while processing the Halliburton benchmark data set containing 70,808 traces with 8 msec sample interval and trace length of 4992 msec. Alternative thread scheduling methods are compared for optimizing single and multiple job throughput. The compact scheme schedules the threads of a single job in as few nodes as possible, whereas, the distributed scheme schedules the threads across a many nodes as possible. The effects of load on the Sun Storage 7410 system are measured. This information provides valuable insight into determining the Oracle Grid Engine resource management policies.

Hyperthreading is enabled for all of the tests. It should be noted that every node is running a PVM daemon and ProMAX license server daemon. On the master PVM daemon node, there are three additional ProMAX daemons running.

The first test measures single job scalability across a 6-node cluster with an additional node serving as the master PVM host. The speedup relative to a single node, single thread are reported.

The second test measures multiple job scalability running 1 to 8 concurrent 16-thread jobs using the Sun Storage 7410 system. The performance is reported relative to a single job.

The third test compares 8-thread multiple job throughput using different job scheduling methods on a cluster. The compact method involves putting all 8 threads for a job on the same node. The distributed method involves spreading the 8 threads of job across multiple nodes. The results report the difference of different distributed method resource scheduling levels to 1, 2, and 4 concurrent job compact method baselines.

The fourth test is similar to the second test except running 16-thread ProMAX jobs. The results are reported relative to the performance of 1, 2, 4, and 8 concurrent 2-node, 8-thread jobs.

The ProMAX processing parameters used for this benchmark:

Minimum output inline = 65
Maximum output inline = 85
Inline output sampling interval = 1
Minimum output xline = 1
Maximum output xline = 200 (fold)
Xline output sampling interval = 1
Antialias inline spacing = 15
Antialias xline spacing = 15
Stretch Mute Aperature Limit with Maximum Stretch = 15
Image Gather Type = Full Offset Image Traces
No Block Moveout
Number of Alias Bands = 10
3D Amplitude Phase Correction
No compression
Maximum Number of Cache Blocks = 500000

Key Points and Best Practices

  • The application was rebuilt with the Intel 11.1 Fortran and C++ compilers with these flags.

    -xSSE4.2 -O3 -ipo -no-prec-div -static -m64 -ftz -fast-transcendentals -fp-speculation=fast
  • There are additional execution threads associated with a ProMAX node. There are two threads that run on each node: the license server and PVM daemon. There are at least three additional daemon threads that run on the PVM master server: the ProMAX interface GUI, the ProMAX job execution - SuperExec, and the PVM console and control. It is best to allocate one node as the master PVM server to handle the additional 5+ threads. Otherwise, hyperthreading can be enabled and the master PVM host can support up to 8 ProMAX job threads.

  • When hyperthreading is enabled in on one of the non-master PVM hosts, there is a 7% penalty going from 8 to 10 threads. However, 12 threads are 11 percent faster than 8. This can be contributed to the two additional support threads when hyperthreading initiates.

  • Single job performance on a single node is near linear up the number of cores in the node, i.e. 2 Intel Xeon X5570s with 4 cores each. With hyperthreading (2 active threads per core) enabled, more ProMAX threads are used increasing the load on the CPU's memory architecture causing the reduced speedups.

    Users need to be aware of these performance differences and how it effects their production environment.

See Also

Disclosure Statement

The following are trademarks or registered trademarks of Halliburton/Landmark Graphics: ProMAX. Results as of 9/20/2010.

About

BestPerf is the source of Oracle performance expertise. In this blog, Oracle's Strategic Applications Engineering group explores Oracle's performance results and shares best practices learned from working on Enterprise-wide Applications.

Index Pages
Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today