Monday Sep 20, 2010

Schlumberger's ECLIPSE 300 Performance Throughput On Sun Fire X2270 Cluster with Sun Storage 7410

Oracle's Sun Storage 7410 system, attached via QDR InfiniBand to a cluster of eight of Oracle's Sun Fire X2270 servers, was used to evaluate multiple job throughput of Schlumberger's Linux-64 ECLIPSE 300 compositional reservoir simulator processing their standard 2 Million Cell benchmark model with 8 rank parallelism (MM8 job).

  • The Sun Storage 7410 system showed little difference in performance (2%) compared to running the MM8 job with dedicated local disk.

  • When running 8 concurrent jobs on 8 different nodes, all to the Sun Storage 7410 system, performance degraded only slightly (5%) compared to a single MM8 job running on dedicated local disk.

Experiments were run changing how the cluster was utilized in scheduling jobs. Rather than running with the default compact mode, tests were run distributing the single job among the various nodes. Performance improvements were measured when changing from the default compact scheduling scheme (1 job to 1 node) to a distributed scheduling scheme (1 job to multiple nodes).

  • When running at up to 75% of the cluster capacity, distributed scheduling outperformed compact scheduling by as much as 34%. Even when running at 100% of the cluster capacity, distributed scheduling is still slightly faster than compact scheduling.

  • When combining workloads, using the distributed scheduling allowed two MM8 jobs to finish 19% faster than the reference time and a concurrent PSTM workload to finish 2% faster.

The Oracle Solaris Studio Performance Analyzer and Sun Storage 7410 system analytics were used to identify a 3D Prestack Kirchhoff Time Migration (PSTM) as a potential candidate for consolidating with ECLIPSE. Both scheduling schemes are compared while running various job mixes of these two applications using the Sun Storage 7410 system for I/O.

These experiments showed a potential opportunity for consolidating applications using Oracle Grid Engine resource scheduling and Oracle Virtual Machine templates.

Performance Landscape

Results are presented below on a variety of experiments run using the 2009.2 ECLIPSE 300 2 Million Cell Performance Benchmark (MM8). The compute nodes are a cluster of Sun Fire X2270 servers connected with QDR InfiniBand. First, some definitions used in the tables below:

Local HDD: Each job runs on a single node to its dedicated direct attached storage.
NFSoIB: One node hosts its local disk for NFS mounting to other nodes over InfiniBand.
IB 7410: Sun Storage 7410 system over QDR InfiniBand.
Compact Scheduling: All 8 MM8 MPI processes run on a single node.
Distributed Scheduling: Allocate the 8 MM8 MPI processes across all available nodes.
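
The two scheduling definitions can be illustrated with a minimal sketch. The node names and the `place_ranks` helper are hypothetical, and the block-wise assignment of consecutive ranks to a node is an assumption; the source does not say how the MPI launcher ordered ranks across nodes.

```python
def place_ranks(num_ranks, nodes, scheme):
    """Return a list mapping each MPI rank (by index) to a node name."""
    if scheme == "compact":
        # All ranks of the job share one node (here, the first in the list).
        return [nodes[0]] * num_ranks
    if scheme == "distributed":
        # Spread ranks evenly over all nodes, in consecutive blocks (assumed).
        if num_ranks % len(nodes) != 0:
            raise ValueError("ranks must divide evenly across nodes")
        per_node = num_ranks // len(nodes)
        return [nodes[rank // per_node] for rank in range(num_ranks)]
    raise ValueError(f"unknown scheme: {scheme}")


# Hypothetical node names for an 8 node cluster.
nodes8 = [f"node{i}" for i in range(1, 9)]

compact = place_ranks(8, nodes8, "compact")                # 8 ranks on node1
distributed8 = place_ranks(8, nodes8, "distributed")       # 1 rank per node
distributed4 = place_ranks(8, nodes8[:4], "distributed")   # 2 ranks per node

print(compact)
print(distributed8)
print(distributed4)
```

With 8 nodes each distributed MM8 job places 1 MPI process per node, and with 4 nodes it places 2 per node, which matches the configurations used in the tests below.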

First Test

The first test compares the performance of a single MM8 job on a single node using local storage to running a number of jobs across the cluster, and shows the effect of different storage solutions.

Compact Scheduling
Multiple Job Throughput Results Relative to Single Job
2009.2 ECLIPSE 300 MM8 2 Million Cell Performance Benchmark

| Cluster Load | Number of MM8 Jobs | Local HDD Relative Throughput | NFSoIB Relative Throughput | IB 7410 Relative Throughput |
|---|---|---|---|---|
| 13% | 1 | 1.00 | 1.00\* | 0.98 |
| 25% | 2 | 0.98 | 0.97 | 0.98 |
| 50% | 4 | 0.98 | 0.96 | 0.97 |
| 75% | 6 | 0.98 | 0.95 | 0.95 |
| 100% | 8 | 0.98 | 0.95 | 0.95 |

\* Performance measured on node hosting its local disk to other nodes in the cluster.

Second Test

This next test uses the Sun Storage 7410 system and compares the performance of running the MM8 job on 1 node using compact scheduling to running multiple jobs with compact scheduling and to running multiple jobs with distributed scheduling. The tests are run on an 8 node cluster, so each distributed job has only 1 MPI process per node.

Comparing Compact and Distributed Scheduling
Multiple Job Throughput Results Relative to Single Job
2009.2 ECLIPSE 300 MM8 2 Million Cell Performance Benchmark

| Cluster Load | Number of MM8 Jobs | Compact Scheduling Relative Throughput | Distributed Scheduling\* Relative Throughput |
|---|---|---|---|
| 13% | 1 | 1.00 | 1.34 |
| 25% | 2 | 1.00 | 1.32 |
| 50% | 4 | 0.99 | 1.25 |
| 75% | 6 | 0.97 | 1.10 |
| 100% | 8 | 0.97 | 0.98 |

\* Each distributed job has 1 MPI process per node.
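
As a worked check on the second test, the gain of distributed over compact scheduling at each load level can be derived from the relative throughput values in the table above. This is only a sketch: the `gain_percent` helper is a name introduced here, and the gain is assumed to be the simple ratio of the two relative throughputs.

```python
# (cluster load, MM8 jobs, compact, distributed) from the second test table.
rows = [
    ("13%", 1, 1.00, 1.34),
    ("25%", 2, 1.00, 1.32),
    ("50%", 4, 0.99, 1.25),
    ("75%", 6, 0.97, 1.10),
    ("100%", 8, 0.97, 0.98),
]


def gain_percent(compact, distributed):
    """Percent by which distributed throughput exceeds compact throughput."""
    return (distributed / compact - 1.0) * 100.0


gains = {load: round(gain_percent(c, d), 1) for load, _jobs, c, d in rows}

for load, gain in gains.items():
    print(f"{load:>4} load: distributed is {gain:.1f}% faster than compact")
```

The largest gain (34%) appears at the lightest load, and the advantage shrinks to about 1% when the cluster is fully loaded.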

Third Test

This next test uses the Sun Storage 7410 system and compares the performance of running the MM8 job on 1 node using the compact scheduling to running multiple jobs with compact scheduling and to running multiple jobs with the distributed schedule. This test only uses 4 nodes, so each distributed job has two MPI processes per node.

Comparing Compact and Distributed Scheduling on 4 Nodes
Multiple Job Throughput Results Relative to Single Job
2009.2 ECLIPSE 300 MM8 2 Million Cell Performance Benchmark

| Cluster Load | Number of MM8 Jobs | Compact Scheduling Relative Throughput | Distributed Scheduling\* Relative Throughput |
|---|---|---|---|
| 25% | 1 | 1.00 | 1.39 |
| 50% | 2 | 1.00 | 1.28 |
| 100% | 4 | 1.00 | 1.00 |

\* Each distributed job has two MPI processes per node.

Fourth Test

The last test involves running two different applications on the 4 node cluster. It compares the performance of running the cluster fully loaded and changing how the applications are run, either compact or distributed. The comparisons are made against the individual application running the compact strategy (as few nodes as possible). It shows that appropriately mixing jobs can give better job performance than running just one kind of application on a single cluster.

Multiple Job, Multiple Application Throughput Results
Comparing Scheduling Strategies
2009.2 ECLIPSE 300 MM8 2 Million Cell and 3D Kirchhoff Time Migration (PSTM)

| Number of PSTM Jobs | Number of MM8 Jobs | ECLIPSE Compact Scheduling (1 node x 8 processes per job) | ECLIPSE Distributed Scheduling (4 nodes x 2 processes per job) | PSTM Distributed Scheduling (4 nodes x 4 processes per job) | PSTM Compact Scheduling (2 nodes x 8 processes per job) | Cluster Load |
|---|---|---|---|---|---|---|
| 0 | 1 | 1.00 | 1.40 | | | 25% |
| 0 | 2 | 1.00 | 1.27 | | | 50% |
| 0 | 4 | 0.99 | 0.98 | | | 100% |
| 1 | 2 | | 1.19 | 1.02 | | 100% |
| 2 | 0 | | | 1.07 | 0.96 | 100% |
| 1 | 0 | | | 1.08 | 1.00 | 50% |

Results and Configuration Summary

Hardware Configuration:

8 x Sun Fire X2270 servers, each with
    2 x 2.93 GHz Intel Xeon X5570 processors
    24 GB memory (6 x 4 GB memory at 1333 MHz)
    1 x 500 GB SATA
Sun Storage 7410 system, 24 TB total, QDR InfiniBand
    4 x 2.3 GHz AMD Opteron 8356 processors
    128 GB memory
    2 internal 233 GB SAS drives (466 GB total)
    2 internal 93 GB read optimized SSD (186 GB total)
    1 Sun Storage J4400 with 22 1 TB SATA drives and 2 18 GB write optimized SSD
        20 TB RAID-Z2 (double parity) data and 2-way striped write optimized SSD or
        11 TB mirrored data and mirrored write optimized SSD
QDR InfiniBand Switch

Software Configuration:

SUSE Linux Enterprise Server 10 SP 2
Scali MPI Connect 5.6.6
GNU C 4.1.2 compiler
2009.2 ECLIPSE 300
ECLIPSE license daemon flexlm v11.3.0.0
3D Kirchhoff Time Migration

Benchmark Description

The benchmark is a home-grown study in resource usage options when running the Schlumberger ECLIPSE 300 Compositional reservoir simulator with 8 rank parallelism (MM8) to process Schlumberger's standard 2 Million Cell benchmark model. Schlumberger pre-built executables were used to process a 260x327x73 (2 Million Cell) sub-grid with 6,206,460 total grid cells and model 7 different compositional components within a reservoir. No source code modifications or executable rebuilds were conducted.

The ECLIPSE 300 MM8 job uses 8 MPI processes. It can run within a single node (compact) or across multiple nodes of a cluster (distributed). By using the MM8 job, it is possible to compare the performance between running each job on a separate node using local disk to using a shared network attached storage solution. The benchmark tests study the effect of increasing the number of MM8 jobs in a throughput model.

The first test compares the performance of running 1, 2, 4, 6 and 8 jobs on a cluster of 8 nodes using local disk, NFSoIB disk, and the Sun Storage 7410 system connected via InfiniBand. Results are compared against the time it takes to run 1 job with local disk. This test shows what performance impact there is when loading down a cluster.

The second test compares different methods of scheduling jobs on a cluster. The compact method involves putting all 8 MPI processes for a job on the same node. The distributed method involves using 1 MPI process per node. The results compare the performance against 1 job on one node.

The third test is similar to the second test, but uses only 4 nodes in the cluster, so when running distributed, there are 2 MPI processes per node.

The fourth test compares the compact and distributed scheduling methods on 4 nodes while running 2 MM8 jobs and one 16-way parallel 3D Prestack Kirchhoff Time Migration (PSTM).
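
The cluster load percentages quoted in the result tables appear to follow from the number of MPI processes relative to the cores available. The sketch below rests on that assumption, and on each Sun Fire X2270 providing 8 cores (2 quad-core Xeon X5570 processors); the constant and function names are introduced here for illustration only.

```python
CORES_PER_NODE = 8  # assumption: 2 x quad-core Intel Xeon X5570 per node
MM8 = 8             # MPI processes in one ECLIPSE MM8 job
PSTM = 16           # MPI processes in one 16-way PSTM job


def cluster_load_percent(job_process_counts, nodes):
    """Whole-number percent of cores in use, rounding halves up."""
    used = sum(job_process_counts)
    total = nodes * CORES_PER_NODE
    # Integer form of floor(100 * used / total + 0.5).
    return (200 * used + total) // (2 * total)


# 8 node cluster, 1 to 8 MM8 jobs (first and second tests).
for jobs in (1, 2, 4, 6, 8):
    print(jobs, "MM8 jobs on 8 nodes:", cluster_load_percent([MM8] * jobs, 8), "%")

# 4 node cluster, mixed workload (fourth test).
print("1 PSTM + 2 MM8 on 4 nodes:", cluster_load_percent([PSTM, MM8, MM8], 4), "%")
print("1 PSTM on 4 nodes:", cluster_load_percent([PSTM], 4), "%")
```

Under these assumptions one MM8 job on 8 nodes uses 8 of 64 cores (12.5%, shown as 13%), and one PSTM job plus two MM8 jobs fills all 32 cores of the 4 node cluster.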

Key Points and Best Practices

  • ECLIPSE is very sensitive to memory bandwidth and needs to be run on 1333 MHz or greater memory speeds. In order to maintain 1333 MHz memory, the maximum memory configuration for the processors used in this benchmark is 24 GB. BIOS upgrades now allow 1333 MHz memory for up to 48 GB of memory. Additional nodes can be used to handle data sets that require more memory than available per node. Allocating at least 20% of memory per node for I/O caching helps application performance.

  • If allocating an 8-way parallel job (MM8) to a single node, it is best to use an ECLIPSE license for that particular node to avoid any additional network overhead of sharing a global license with all the nodes in a cluster.

  • Understanding the ECLIPSE MM8 I/O access patterns is essential to optimizing a shared storage solution. The analytics available on the Oracle Unified Storage 7410 provide valuable I/O characterization information even without source code access. A single MM8 job run shows an initial read and write load related to reading the input grid, parsing Petrel ASCII input parameter files, and creating an initial solution grid and runtime specifications. This is followed by a very long running simulation that writes data and restart files and generates reports to the 7410. Due to the nature of the small block I/O, the mirrored configuration for the 7410 outperformed the RAID-Z2 configuration.

    A single MM8 job reads, processes, and writes approximately 240 MB of grid and property data in the first 36 seconds of execution. The actual read and write of the grid data, which is intermixed with this first stage of processing, is done at a rate of 240 MB/sec to the 7410 for each of the two operations.

    Then, it calculates and reports the well connections at an average write rate of 260 KB/second with 32 operations/second (32 x 8 KB writes/second). However, the actual size of each I/O operation varies between 2 and 100 KB and there are peaks every 20 seconds. The write cache operates on average at 8 accesses/second at approximately 61 KB/second (8 x 8 KB writes/sec). As the number of concurrent jobs increases, the interconnect traffic and random I/O operations per second to the 7410 increase.

  • MM8 multiple job startup time is reduced on shared file systems if each job uses separate input files.
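
The I/O rates quoted for the well-connection phase can be cross-checked with simple arithmetic. The sketch below uses the nominal 8 KB write size from the text. The extrapolation to 8 concurrent jobs is a hypothetical linear scaling added for illustration, not a measured result.

```python
KB_PER_OP = 8  # nominal write size used in the text (actual sizes vary, 2 to 100 KB)


def kb_per_second(ops_per_second, kb_per_op=KB_PER_OP):
    """Average write bandwidth implied by an operation rate."""
    return ops_per_second * kb_per_op


well_report_kbps = kb_per_second(32)  # text reports about 260 KB/second
write_cache_kbps = kb_per_second(8)   # text reports about 61 KB/second

# Hypothetical: assume each concurrent job adds the same load (not measured).
concurrent_jobs = 8
aggregate_ops = 32 * concurrent_jobs
aggregate_kbps = well_report_kbps * concurrent_jobs

print("well connection reporting:", well_report_kbps, "KB/s")
print("write cache:", write_cache_kbps, "KB/s")
print("8 jobs (linear assumption):", aggregate_ops, "ops/s,", aggregate_kbps, "KB/s")
```

The nominal products (256 and 64 KB/second) land close to the reported averages of 260 and 61 KB/second, which is consistent with the observation that individual I/O sizes vary around the 8 KB figure.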

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 9/20/2010.

Friday Jul 10, 2009

World Record TPC-H@300GB Price-Performance for Windows on Sun Fire X4600 M2

Significance of Results

Sun and Microsoft combined to deliver World Record price performance for Windows based results on the TPC-H benchmark at the 300GB scale factor. Using Microsoft's SQL Server 2008 Enterprise database along with Microsoft Windows Server 2008 operating system on the Sun Fire X4600 M2 server, the result of 2.80 $/QphH@300GB (USD) was delivered.

  • The Sun Fire X4600 M2 provides World Record price-performance of 2.80 $/QphH@300GB (USD) among Windows based TPC-H results at the 300GB scale factor. This result is 14% better price performance than the HP DL785 result.
  • The Sun Fire X4600 M2 trails HP's World Record single system performance (HP: 57,684 QphH@300GB, Sun: 55,158 QphH@300GB) by less than 5%.
  • The Sun/SQL Server solution used fewer disks for the database (168) than the other top performance leaders @300GB.
  • IBM required 79% more disks (300 total) than Sun to get a result of 46,034 QphH@300GB; Sun's QphH is 20% higher.
  • HP required 21% more disks (204 total) than Sun to achieve a result of 3.24 $/QphH@300GB (USD) which is 16% worse than Sun's price performance.

This is Sun's first published TPC-H SQL Server benchmark.

Performance Landscape

ch/co/th = chips, cores, threads
$/QphH = TPC-H Price/Performance metric (smaller is better)

| System | ch/co/th | Processor | Database | QphH | $/QphH | Price | Disks | Available |
|---|---|---|---|---|---|---|---|---|
| Sun Fire X4600 M2 | 8/32/32 | 2.7 Opteron 8384 | SQL Server 2008 | 55,158 | 2.80 | $154,284 | 168 | 07/06/09 |
| HP DL785 | 8/32/32 | 2.7 Opteron 8384 | SQL Server 2008 | 57,684 | 3.24 | $186,700 | 204 | 11/17/08 |
| IBM x3950 M2 | 8/32/32 | 2.93 Intel X7350 | SQL Server 2005 | 46,034 | 5.40 | $248,635 | 300 | 03/07/08 |

Complete benchmark results may be found at the TPC benchmark website http://www.tpc.org.
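
The percentage comparisons in this post can be reproduced from the table values. The following is a small sketch with helper names introduced here; it simply applies "percent more" and "percent less" arithmetic to the published figures.

```python
# Published values from the Performance Landscape table.
results = {
    "Sun": {"qphh": 55158, "price_perf": 2.80, "disks": 168},
    "HP":  {"qphh": 57684, "price_perf": 3.24, "disks": 204},
    "IBM": {"qphh": 46034, "price_perf": 5.40, "disks": 300},
}


def pct_more(a, b):
    """Percent by which a exceeds b."""
    return (a / b - 1.0) * 100.0


def pct_less(a, b):
    """Percent by which a falls short of b."""
    return (1.0 - a / b) * 100.0


sun, hp, ibm = results["Sun"], results["HP"], results["IBM"]

hp_more_disks = round(pct_more(hp["disks"], sun["disks"]))            # vs Sun's 168
ibm_more_disks = round(pct_more(ibm["disks"], sun["disks"]))
sun_better_pp = round(pct_less(sun["price_perf"], hp["price_perf"]))  # lower is better
hp_worse_pp = round(pct_more(hp["price_perf"], sun["price_perf"]))
sun_trails_hp = round(pct_less(sun["qphh"], hp["qphh"]), 1)
sun_above_ibm = round(pct_more(sun["qphh"], ibm["qphh"]))
ibm_below_sun = round(pct_less(ibm["qphh"], sun["qphh"]))

print(hp_more_disks, ibm_more_disks, sun_better_pp, hp_worse_pp)
print(sun_trails_hp, sun_above_ibm, ibm_below_sun)
```

Note that a percentage depends on which result is used as the base: Sun's QphH is about 20% above IBM's, while IBM's QphH is about 17% below Sun's.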

Results and Configuration Summary

Server:

    Sun Fire X4600 M2 with:
      8 x AMD Opteron 8384, 2.7 GHz QC processors
      256 GB memory
      3 x 73GB (15K RPM) internal SAS disks

Storage:

    14 x Sun Storage J4200 each consisting of 12 x 146GB 15,000 RPM SAS disks

Software:

    Operating System: Microsoft Windows Server 2008 Enterprise x64 Edition SP1
    Database Manager: SQL Server 2008 Enterprise x64 Edition SP1

Audited Results:

    Database Size: 300GB (Scale Factor)
    TPC-H Composite: 55,157.5 QphH@300GB
    Price/performance: $2.80 / QphH@300GB (USD)
    Available: July 6, 2009
    Total 3 Year Cost: $154,284.19 (USD)
    TPC-H Power: 67,095.6
    TPC-H Throughput: 45,343.5
    Database Load Time: 17 hours 29 minutes
    Storage Ratio: 76.82

Benchmark Description

The TPC-H benchmark is a performance benchmark established by the Transaction Processing Performance Council (TPC) to measure Data Warehousing/Decision Support Systems (DSS). TPC-H measurements are produced for customers to evaluate the performance of various DSS systems. The benchmark's queries and updates are executed against a standard database under controlled conditions. Performance projections and comparisons between different TPC-H Database sizes (100GB, 300GB, 1000GB, 3000GB and 10000GB) are not allowed by the TPC.

TPC-H is a data warehousing-oriented, non-industry-specific benchmark that consists of a large number of complex queries typical of decision support applications. It also includes some insert and delete activity that is intended to simulate loading and purging data from a warehouse. TPC-H measures the combined performance of a particular database manager on a specific computer system.

The main performance metric reported by TPC-H is called the TPC-H Composite Query-per-Hour Performance Metric (QphH@SF, where SF is the number of GB of raw data, referred to as the scale factor). QphH@SF is intended to summarize the ability of the system to process queries in both single and multi user modes. The benchmark requires reporting of price/performance, which is the ratio of total HW/SW cost plus 3 years maintenance to QphH. A secondary metric is the storage efficiency, which is the ratio of total configured disk space in GB to the scale factor.
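
As a worked example, the audited figures above are consistent with the TPC-H definition of the composite metric as the geometric mean of the Power (single user) and Throughput (multi user) results, and with price/performance as total cost divided by QphH. The variable names in this sketch are introduced here for illustration.

```python
import math

# Audited values from the Results and Configuration Summary.
tpch_power = 67095.6
tpch_throughput = 45343.5
total_3yr_cost_usd = 154284.19

# Composite metric: geometric mean of the Power and Throughput results.
qphh = math.sqrt(tpch_power * tpch_throughput)

# Price/performance: total 3 year cost divided by QphH (smaller is better).
price_perf = total_3yr_cost_usd / qphh

print(f"QphH@300GB: {qphh:.1f}")
print(f"$/QphH@300GB: {price_perf:.2f}")
```

The computed values round to the published 55,157.5 QphH@300GB and $2.80 / QphH@300GB.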

Key Points and Best Practices

SQL Server 2008 is able to take advantage of the lower latency that local memory access provides on the Sun Fire X4600 M2 server. This was achieved by setting the NUMA initialization parameter to enable all NUMA optimizations.

Enabling the Windows large-page feature provided a significant performance improvement because SQL Server 2008 manages its own memory buffer. Note that to use large pages, an application must be part of the large-page group of the OS (Windows).

The 64-bit Windows OS and 64-bit SQL Server software were able to utilize the 256GB of memory available on the Sun Fire X4600 M2 server.

Disclosure Statement

TPC-H@300GB: Sun Fire X4600 M2 55,158 QphH@300GB, $2.80/QphH@300GB, availability 7/6/09; HP DL785, 57,684 QphH@300GB, $3.24/QphH@300GB, availability 11/17/08; IBM x3950 M2, 46,034 QphH@300GB, $5.40/QphH@300GB, availability 03/07/08; TPC-H, QphH, $/QphH tm of Transaction Processing Performance Council (TPC). More info www.tpc.org.

Friday Jun 05, 2009

Interpreting Sun's SPECpower_ssj2008 Publications

Sun recently entered the SPECpower fray with the publication of three results on the SPECpower_ssj2008 benchmark.  Strangely, the three publications documented results on the same hardware platform (Sun Netra X4250) running identical software stacks, but the results were markedly different.  What exactly were we trying to get at?

Benchmark Configurations

Sun produces robust industrial-grade servers with a range of redundancy features we believe benefit our customers.   These features increase reliability, at the cost of additional power consumption. For example, redundant power supplies and redundant fans allow servers to tolerate faults, and hot-swap capabilities further minimize downtime.

The benchmark run and reporting rules require the incorporation within the tested configuration of all components implied by the model name.  Within these limitations, the first publication was intended to be the best result (that is, the lowest power consumption per unit of performance) achievable on the Sun Netra X4250 platform, by minimizing the configured hardware to the greatest extent possible.

Common Components

All tested configurations had the following components in common:

  • System:  Sun Netra X4250
  • Processor: 2 x Intel L5408 QC @ 2.13GHz
  • 2 x 658 watt redundant AC power supplies
  • redundant fans
  • standard I/O expansion mezzanine
  • standard Telco dry contact alarm

And the same software stack:

  • OS: Windows Server 2003 R2 Enterprise X64 Edition SP2
  • Drivers: platform-specific drivers from Sun Netra X4250 Tools and Drivers DVD Version 2.1N
  • JVM: Java HotSpot 32-Bit Server VM on Windows, version 1.6.0_14

Tiny Configuration

In addition to the common hardware components, the tiny configuration was limited to:

  • 8 GB of Memory (4 x 2048 MB as PC2-5300F 2Rx8)
  • 1 x Sun 146 GB 10K RPM SAS internal drive

This is called the tiny configuration because it seems unlikely that most customers would configure an 8-core server with only one disk and only 1 GB available per core. Nevertheless, from a benchmark point of view, this configuration gave the best result.

Typical Configuration

The other two results were both produced on a configuration we considered much more typical of configurations that are actually ordered by customers. In addition to the common hardware, these typical configurations included:

  • 32 GB of Memory (8 x 4096 MB as PC2-5300F)
  • 4 x Sun 146 GB 10K RPM SAS internal drives
  • 1 x Sun x8 PCIe Quad Gigabit Ethernet option card (X4447A-Z)

Nothing special was done with the additional components. The added memory increased the performance component of the benchmark. The other components were installed and configured but allowed to sit idle, so they consumed less power than they would have under load.

One Other Thing: Tuning for Performance

So one thing we're getting at is the difference in power consumption between a small configuration optimized for a power-performance benchmark and a typical configuration optimized for customer workloads.  Hardware (power consumption) is only half of the benchmark--the other half being the performance achieved by the System Under Test (SUT).

Tuning Choices 

In all three publications the identical tunings were applied at the software level: identical java command-line arguments and JVM-to-processor affinity.  We also applied, in the case of the better results, the common (but usually non-default) BIOS-level optimization of disabling hardware prefetcher and adjacent cache line prefetch.  These optimizations are commonly applied to produce optimized SPECpower_ssj2008 results but it is unlikely that many production applications would benefit from these settings.  To demonstrate the effect of this tuning, the final result was generated with standard BIOS settings.

 And just so we couldn't be accused of sand-bagging the results, the number of JVMs was increased in the typical configurations to take advantage of the additional memory populated over and above the tiny configuration.  Additional performance was achieved but sadly it doesn't compensate for the higher power consumption of all that memory.

So in summary we tuned:

  • Tiny Configuration: non-default BIOS settings
  • Typical Configuration 1: non-default BIOS settings; additional JVMs to utilize added memory
  • Typical Configuration 2: default BIOS settings; additional JVMs to utilize added memory

At the OS level, all tunings  were identical.

Results 

The results are summarized in this table:

| System | Processor Model | GHz | Overall Metric (ssj_ops/watt) | Peak Performance (ssj_ops) | Peak Power (watts) | Idle Power (watts) |
|---|---|---|---|---|---|---|
| Sun Netra X4250 (8GB non-default BIOS) | L5408 | 2.13 | 600 | 244832 | 226 | 174 |
| Sun Netra X4250 (32GB non-default BIOS) | L5408 | 2.13 | 478 | 251555 | 294 | 226 |
| Sun Netra X4250 (32GB default BIOS) | L5408 | 2.13 | 437 | 229828 | 296 | 225 |

Conclusions

  • The measurement and reporting methods of the benchmark encourage small memory configurations.  Comparing the first and second result, adding additional memory yielded minimal performance improvement (from 244832 to 251555) but a large increase in power consumption, 68 watts at peak.

  • In our opinion, unrealistically small configurations yield the best results on this benchmark.  On the more typical system, the benchmark overall metric decreased from 600 overall ssj_ops per watt to 478 overall ssj_ops per watt, despite our best effort to utilize the additional configured memory.

  • On typical configurations, reverting to default BIOS settings resulted in a significant decrease in performance (from 251555 to 229828) with no corresponding decrease in power consumption (essentially identical for both results).

Configurations typical of customer systems (with adequate memory, internal disks, and option cards) consume more power than configurations which are commonly benchmarked, while providing no corresponding improvement in SPECpower_ssj2008 benchmark performance. The result is a lower overall power-performance metric on typical configurations and a lack of published benchmark results on robust systems with the capacities and redundancies that enterprise customers desire.
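
The size of these effects can be quantified directly from the results table. The sketch below computes the relative changes between configurations; the dictionary keys and the `pct_change` helper are names introduced here for illustration.

```python
# Values from the results table: overall ssj_ops/watt, peak ssj_ops, peak watts.
configs = {
    "tiny":            {"overall": 600, "peak_ops": 244832, "peak_watts": 226},
    "typical_tuned":   {"overall": 478, "peak_ops": 251555, "peak_watts": 294},
    "typical_default": {"overall": 437, "peak_ops": 229828, "peak_watts": 296},
}


def pct_change(new, old):
    """Percent change from old to new (negative means a decrease)."""
    return (new - old) / old * 100.0


tiny = configs["tiny"]
tuned = configs["typical_tuned"]
default = configs["typical_default"]

# Tiny versus typical configuration (both with non-default BIOS settings).
perf_gain = round(pct_change(tuned["peak_ops"], tiny["peak_ops"]), 1)
power_delta_watts = tuned["peak_watts"] - tiny["peak_watts"]
power_gain = round(pct_change(tuned["peak_watts"], tiny["peak_watts"]), 1)
metric_change = round(pct_change(tuned["overall"], tiny["overall"]), 1)

# Typical configuration: non-default versus default BIOS settings.
bios_perf_change = round(pct_change(default["peak_ops"], tuned["peak_ops"]), 1)

print(perf_gain, power_delta_watts, power_gain, metric_change, bios_perf_change)
```

Adding memory raised peak performance by under 3% while raising peak power by about 30% (68 watts), and the overall metric fell by about 20%. Reverting to default BIOS settings cost roughly 9% of peak performance.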

Fair Use Disclosure

SPEC, SPECpower, and SPECpower_ssj are trademarks of the Standard Performance Evaluation Corporation. All results from the SPEC website (www.spec.org) as of June 5, 2009. For a complete set of accepted results refer to that site.

Wednesday Jun 03, 2009

Wide Variety of Topics to be discussed on BestPerf

A sample of the various Sun and partner technologies to be discussed:
OpenSolaris, Solaris, Linux, Windows, VMware, gcc, Java, GlassFish, MySQL, Sun Studio, ZFS, DTrace, perflib, Oracle, DB2, Sybase, OpenStorage, CMT, SPARC64, X64, X86, Intel, AMD

About

BestPerf is the source of Oracle performance expertise. In this blog, Oracle's Strategic Applications Engineering group explores Oracle's performance results and shares best practices learned from working on Enterprise-wide Applications.
