Monday Sep 19, 2011

Halliburton ProMAX® Seismic Processing on Sun Blade X6270 M2 with Sun ZFS Storage 7320

Halliburton/Landmark's ProMAX® 3D Pre-Stack Kirchhoff Time Migration's (PSTM) single workflow scalability and multiple workflow throughput using various scheduling methods are evaluated on a cluster of Oracle's Sun Blade X6270 M2 server modules attached to Oracle's Sun ZFS Storage 7320 appliance.

Two resource scheduling methods, compact and distributed, are compared while increasing the system load with additional concurrent ProMAX® workflows.

  • Multiple concurrent 24-process ProMAX® PSTM workflow throughput is constant; 10 workflows on 10 nodes finish as fast as 1 workflow on one compute node. Additionally, processing twice the data volume yields similar traces/second throughput performance.

  • A single ProMAX® PSTM workflow has good scaling from 1 to 10 nodes of a Sun Blade X6270 M2 cluster scaling 4.5X. ProMAX® scales to 4.7X on 10 nodes with one input data set and 6.3X with two consecutive input data sets (i.e. twice the data).

  • A single ProMAX® PSTM workflow has near linear scaling of 11x on a Sun Blade X6270 M2 server module when running from 1 to 12 processes.

  • The 12-thread ProMAX® workflow throughput using the distributed scheduling method is equivalent or slightly faster than the compact scheme for 1 to 6 concurrent workflows.

Performance Landscape

Multiple 24-Process Workflow Throughput Scaling

This test measures the system throughput scalability as concurrent 24-process workflows are added, one workflow per node. The per workflow throughput and the system scalability are reported.

Aggregate system throughput scales linearly. Ten concurrent workflows finish in the same time as does one workflow on a single compute node.

Halliburton ProMAX® Pre-Stack Time Migration - Multiple Workflow Scaling


Single Workflow Scaling

This test measures single workflow scalability across a 10-node cluster. Utilizing a single data set, performance exhibits near linear scaling of 11x at 12 processes, and per-node scaling of 4x at 6 nodes; performance flattens quickly reaching a peak of 60x at 240 processors and per-node scaling of 4.7x with 10 nodes.

Running with two consecutive input data sets in the workflow, scaling is considerably improved with peak scaling ~35% higher than obtained using a single data set. Doubling the data set size minimizes time spent in workflow initialization, data input and output.

Halliburton ProMAX® Pre-Stack Time Migration - Single Workflow Scaling

This next test measures single workflow scalability across a 10-node cluster (as above) but limiting scheduling to a maximum of 12-process per node; effectively restricting a maximum of one process per physical core. The speedup relative to a single process, and single node are reported.

Utilizing a single data set, performance exhibits near linear scaling of 37x at 48 processes, and per-node scaling of 4.3x at 6 nodes. Performance of 55x at 120 processors and per-node scaling of 5x with 10 nodes is reached and scalability is trending higher more strongly compared to the the case of two processes running per physical core above. For equivalent total process counts, multi-node runs using only a single process per physical core appear to run between 28-64% more efficiently (96 and 24 processes respectively). With a full compliment of 10 nodes (120 processes) the peak performance is only 9.5% lower than with 2 processes per vcpu (240 processes).

Running with two consecutive input data sets in the workflow, scaling is considerably improved with peak scaling ~35% higher than obtained using a single data set.

Halliburton ProMAX® Pre-Stack Time Migration - Single Workflow Scaling

Multiple 12-Process Workflow Throughput Scaling, Compact vs. Distributed Scheduling

The fourth test compares compact and distributed scheduling of 1, 2, 4, and 6 concurrent 12-processor workflows.

All things being equal, the system bi-section bandwidth should improve with distributed scheduling of a fixed-size workflow; as more nodes are used for a workflow, more memory and system cache is employed and any node memory bandwidth bottlenecks can be offset by distributing communication across the network (provided the network and inter-node communication stack do not become a bottleneck). When physical cores are not over-subscribed, compact and distributed scheduling performance is within 3% suggesting that there may be little memory contention for this workflow on the benchmarked system configuration.

With compact scheduling of two concurrent 12-processor workflows, the physical cores become over-subscribed and performance degrades 36% per workflow. With four concurrent workflows, physical cores are oversubscribed 4x and performance is seen to degrade 66% per workflow. With six concurrent workflows over-subscribed compact scheduling performance degrades 77% per workflow. As multiple 12-processor workflows become more and more distributed, the performance approaches the non over-subscribed case.

Halliburton ProMAX® Pre-Stack Time Migration - Multiple Workflow Scaling

141616 traces x 624 samples


Test Notes

All tests were performed with one input data set (70808 traces x 624 samples) and two consecutive input data sets (2 * (70808 traces x 624 samples)) in the workflow. All results reported are the average of at least 3 runs and performance is based on reported total wall-clock time by the application.

All tests were run with NFS attached Sun ZFS Storage 7320 appliance and then with NFS attached legacy Sun Fire X4500 server. The StorageTek Workload Analysis Tool (SWAT) was invoked to measure the I/O characteristics of the NFS attached storage used on separate runs of all workflows.

Configuration Summary

Hardware Configuration:

10 x Sun Blade X6270 M2 server modules, each with
2 x 3.33 GHz Intel Xeon X5680 processors
48 GB DDR3-1333 memory
4 x 146 GB, Internal 10000 RPM SAS-2 HDD
10 GbE
Hyper-Threading enabled

Sun ZFS Storage 7320 Appliance
1 x Storage Controller
2 x 2.4 GHz Intel Xeon 5620 processors
48 GB memory (12 x 4 GB DDR3-1333)
2 TB Read Cache (4 x 512 GB Read Flash Accelerator)
10 GbE
1 x Disk Shelf
20.0 TB RAID-Z (20 x 1 TB SAS-2, 7200 RPM HDD)
4 x Write Flash Accelerators

Sun Fire X4500
2 x 2.8 GHz AMD 290 processors
16 GB DDR1-400 memory
34.5 TB RAID-Z (46 x 750 GB SATA-II, 7200 RPM HDD)
10 GbE

Software Configuration:

Oracle Linux 5.5
Parallel Virtual Machine 3.3.11 (bundled with ProMAX)
Intel 11.1.038 Compilers
Libraries: pthreads 2.4, Java 1.6.0_01, BLAS, Stanford Exploration Project Libraries

Benchmark Description

The ProMAX® family of seismic data processing tools is the most widely used Oil and Gas Industry seismic processing application. ProMAX® is used for multiple applications, from field processing and quality control, to interpretive project-oriented reprocessing at oil companies and production processing at service companies. ProMAX® is integrated with Halliburton's OpenWorks® Geoscience Oracle Database to index prestack seismic data and populate the database with processed seismic.

This benchmark evaluates single workflow scalability and multiple workflow throughput of the ProMAX® 3D Prestack Kirchhoff Time Migration (PSTM) while processing the Halliburton benchmark data set containing 70,808 traces with 8 msec sample interval and trace length of 4992 msec. Benchmarks were performed with both one and two consecutive input data sets.

Each workflow consisted of:

  • reading the previously constructed MPEG encoded processing parameter file
  • reading the compressed seismic data traces from disk
  • performing the PSTM imaging
  • writing the result to disk

Workflows using two input data sets were constructed by simply adding a second identical seismic data read task immediately after the first in the processing parameter file. This effectively doubled the data volume read, processed, and written.

This version of ProMAX® currently only uses Parallel Virtual Machine (PVM) as the parallel processing paradigm. The PVM software only used TCP networking and has no internal facility for assigning memory affinity and processor binding. Every compute node is running a PVM daemon.

The ProMAX® processing parameters used for this benchmark:

Minimum output inline = 65
Maximum output inline = 85
Inline output sampling interval = 1
Minimum output xline = 1
Maximum output xline = 200 (fold)
Xline output sampling interval = 1
Antialias inline spacing = 15
Antialias xline spacing = 15
Stretch Mute Aperature Limit with Maximum Stretch = 15
Image Gather Type = Full Offset Image Traces
No Block Moveout
Number of Alias Bands = 10
3D Amplitude Phase Correction
No compression
Maximum Number of Cache Blocks = 500000

Primary PSTM business metrics are typically time-to-solution and accuracy of the subsurface imaging solution.

Key Points and Best Practices

  • Multiple job system throughput scales perfectly; ten concurrent workflows on 10 nodes each completes in the same time and has the same throughput as a single workflow running on one node.
  • Best single workflow scaling is 6.6x using 10 nodes.

    When tasked with processing several similar workflows, while individual time-to-solution will be longer, the most efficient way to run is to fully distribute them one workflow per node (or even across two nodes) and run these concurrently, rather than to use all nodes for each workflow and running consecutively. For example, while the best-case configuration used here will run 6.6 times faster using all ten nodes compared to a single node, ten such 10-node jobs running consecutively will overall take over 50% longer to complete than ten jobs one per node running concurrently.

  • Throughput was seen to scale better with larger workflows. While throughput with both large and small workflows are similar with only one node, the larger dataset exhibits 11% and 35% more throughput with four and 10 nodes respectively.

  • 200 processes appears to be a scalability asymptote with these workflows on the systems used.
  • Hyperthreading marginally helps throughput. For the largest model run on 10 nodes, 240 processes delivers 11% more performance than with 120 processes.

  • The workflows do not exhibit significant I/O bandwidth demands. Even with 10 concurrent 24-process jobs, the measured aggregate system I/O did not exceed 100 MB/s.

  • 10 GbE was the only network used and, though shared for all interprocess communication and network attached storage, it appears to have sufficient bandwidth for all test cases run.

See Also

Disclosure Statement

The following are trademarks or registered trademarks of Halliburton/Landmark Graphics: ProMAX®, GeoProbe®, OpenWorks®. Results as of 9/1/2011.

Tuesday Jul 14, 2009

Vdbench: Sun StorageTek Vdbench, a storage I/O workload generator.

Vdbench is written in Java (and a little C) and runs on Solaris Sparc and X86, Windows, AIX, Linux, zLinux, HP/UX, and OS/X.

I wrote the SPC1 and SPC2 workload generator using the Vdbench base code for the Storage Performance Council: http://www.storageperformance.org

Vdbench is a disk and tape I/O workload generator, allowing detailed control over numerous workload parameters like:

Options:

· For raw disk (and tape) and large disk files:

o Read vs. write

o Random vs. sequential or skip-sequential

o I/O rate

o Data transfer size

o Cache hit rates

o I/O queue depth control

o Unlimited amount of concurrent devices and workloads

o Compression (tape)

· For file systems:

o Number of directory and files

o File sizes

o Read vs. write

o Data transfer size

o Directory create/delete, file create/delete,

o Unlimited amount of concurrent file systems and workloads

Single host or Multi-host:

All work is centrally controlled, running either on a single host or on multiple hosts concurrently.

Reporting:

Centralized reporting, reporting and reporting using the simple idea that you can't understand performance of a workload unless you can see the detail. If you just look at run totals you'll miss the fact that for some reason the storage configuration was idle for several seconds or even minutes!

  • Second by second detail of by Vdbench accumulated performance statistics for total workload and for each individual logical device used by Vdbench.
  • For Solaris Sparc and X86: second by second detail of Kstat statistics for total workload and for each physical lun or NFS mounted device used.
  • All Vdbench reports are HTML files. Just point your browser to the summary.html file in your Vdbench output directory and all the reports link together.
  • Swat (an other of my tools) allows you to display performance charts of the data created by Vdbench: Just start SPM, then 'File' 'Import Vdbench data'.
  • Vdbench will (optionally) automatically call Swat to create JPG files of your performance charts.
  • Vdbench has a GUI that will allow you to compare the results of two different Vdbench workload executions. It shows the differences between the two runs in different grades of green, yellow and red. Green is good, red is bad.

Data Validation:

Data Validation is a highly sophisticated methodology to assure data integrity by always writing unique data contents to each block and then doing a compare after the next read or before the next write. The history tables containing information about what is written where is maintained in memory and optionally in journal files. Journaling allows data to be written to disk in one execution of Vdbench with Data Validation and then continued in a future Vdbench execution to make sure that after a system shutdown all data is still there. Great for testing mirrors: write some data using journaling, break the mirror, and have Vdbench validate the contents of the mirror.

I/O Replay

A disk I/O workload traced using Swat (an other of my tools) can be replayed using Vdbench on any test system to any type of storage. This allows you to trace a production I/O workload, bring the trace data to your lab, and then replay your I/O workload on whatever storage you want. Want to see how the storage performs when the I/O rate doubles? Vdbench Replay will show you. With this you can test your production workload without the hassle of having to get your data base software and licenses, your application software, or even your production data on your test system.

For more detailed information about Vdbench go to http://vdbench.org where you can download the documentation or the latest GA version of Vdbench.

You can find continuing updates about Swat and Vdbench on my blog: http://blogs.sun.com/henk/

Henk Vandenbergh

PS: If you're wondering where the name Vdbench came from :  Henk Vandenbergh benchmarking.

Storage performance and workload analysis using Swat.

Swat (Sun StorageTek Workload Analysis Tool) is a host-based, storage-centric Java application that thoroughly captures, summarizes, and analyzes storage workloads for both Solaris and Windows environments.

This tool was written to help Sun’s engineering, sales and service organizations and Sun’s customers understand storage I/O workloads.


 Sample screenshot:


Swat can be used for among many other reasons:

  • Problem analysis
  • Configuration sizing (just buying x GB of storage just won't do anymore)
  • Trend analysis: is my workload growing, and can I identify/resolve problems before they happen?

Swat is storage agnostic, so it does not matter what type or brand of storage you are trying to report on. Swat reports the host's view of the storage performance and workload, using the same Kstat (Solaris) data that iostat uses.

Swat consists of several different major functions:

· Swat Performance Monitor (SPM)

· Swat Trace Facility (STF)

· Swat Trace Monitor (STM)

· Swat Real Time Monitor

· Swat Local Real Time Monitor

· Swat Reporter

Swat Performance Monitor (SPM):

Works on Solaris and Windows. An attempt has been made in the current Swat 3.02 to also collect data on AIX and Linux. Swat 3.02 also reports Network Adapter statistics on Solaris, Windows, and Linux. A Swat Data Collector (agent) runs on some or all of your servers/hosts, collecting I/O performance statistics every 5, 10, or 15 minutes and writes the data to a disk file, one new file every day, automatically switched at midnight.

The data then can be analyzed using the Swat Reporter.

Swat Trace Facility (STF):

For Solaris and Windows. STF collects detailed I/O trace information. This data then goes through a data Extraction and Analysis phase that generates hundreds or thousands of second-by-second statistics counters. That data then can be analyzed using the Swat Reporter. You create this trace for between 30 and 60 minutes for instance at a time when you know you will have a performance problem.

A disk I/O workload traced using Swat can be replayed on any test system to any type of storage using Vdbench (an other of my tools, available at http://vdbench.org). This allows you to trace a production I/O workload, bring the trace data to your lab, and then replay that I/O workload on whatever storage you want. Want to see how the storage performs when the I/O rate doubles or triples? Vdbench Replay will show you. With this you can test your production workload without the hassle of having to get your data base software and licenses, your application software and licenses, or even your production data.

Note: STF is currently limited to the collection of about 20,000 IOPS. Some development effort is required to handle the current increase in IOPS made possible by Solid State Devices (SSDs).

Note: STF, while collecting the trace data is the only Swat function that requires root access. This functionality is all handled by one single KSH script which can be run independently. (Script uses TNF and ADB).

Swat Trace Monitor (STM):

With STF you need to know when the performance problem will occur so that you can schedule the trace data to be collected. Not every performance problem however is predictable. STM will run an in-memory trace and then monitors the overall storage performance. Once a certain threshold is reach, for instance response time greater than 100 milliseconds, the in-memory trace buffer is dumped to disk and the trace then continues collecting trace data for an amount of seconds before terminating.

Swat Real Time Monitor:

When a Data Collector is active on your current or any network-connected host, Swat Real Time Monitor will open a Java socket connection with that host, allowing you to actively monitor the current storage performance either from your local or any of your remote hosts.

Swat Local Real Time Monitor:

Local Real Time Monitor is the quickest way to start using Swat. Just enter './swat -l' and Swat will start a private Data Collector for your local system and then will show you exactly what is happening to your current storage workload. No more fiddling trying to get some useful data out of a pile of iostat output.

Swat Reporter:

The Swat Reporter ties everything together. All data collected by the above Swat functions can be displayed using this powerful GUI reporting and charting function. You can generate hundreds of different performance charts or tabulated reports giving you intimate understanding of your storage workload and performance. Swat will even create JPG files for you that then can be included in documents and/or presentations. There is even a batch utility (Swat Batch Reporter) that will automate the JPG generation for you. If you want, Swat will even create a script for this batch utility for you.

Some of the many available charts:

  • Response time per controller or device
  • I/O rate per controller or device
  • Read percentage
  • Data transfer size
  • Queue depth
  • Random vs. sequential (STF only)
  • CPU usage
  • Device skew
  • Etc. etc.

Swat has been written in Java. This means, that once your data has been collected on its originating system, the data can be displayed and analyzed using the Swat Reporter on ANY Java enabled system, including any type of laptop.

For more detailed information go to (long URL)where you can download the latest release, Swat 3.02.

You can find continuing updates about Swat and Vdbench on my blog: http://blogs.sun.com/henk/

Henk Vandenbergh

About

BestPerf is the source of Oracle performance expertise. In this blog, Oracle's Strategic Applications Engineering group explores Oracle's performance results and shares best practices learned from working on Enterprise-wide Applications.

Index Pages
Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today