Monday Oct 26, 2015

Graph PageRank: SPARC M7-8 Beats x86 E5 v3 Per Chip

Graph algorithms are used in many big data and analytics workloads. This report presents performance results for the PageRank algorithm. Oracle's SPARC M7 processor based systems provide better performance than an x86 E5 v3 based system.

  • Oracle's SPARC M7-8 server delivered 3.2 times better per-chip performance than a two-chip x86 E5 v3 server running a PageRank algorithm implemented with Parallel Graph AnalytiX (PGX) from Oracle Labs on a medium-sized graph.

Performance Landscape

The graph used for these results has 41,652,230 nodes and 1,468,365,182 edges using 22 GB of memory. All of the following results were run as part of this benchmark effort. Performance is a measure of processing rate, bigger is better.

PageRank Algorithm
Server                                          Performance   SPARC Advantage
SPARC M7-8
  8 x SPARC M7 (4.13 GHz, 8x 32 cores)          281.1         3.2x faster per chip
x86 E5 v3 server
  2 x Intel E5-2699 v3 (2.3 GHz, 2x 18 cores)   22.2          1.0

The number of cores listed is per processor.
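
The per-chip advantage follows directly from the table; as a quick sanity check, here is a one-line sketch in Python, assuming the chip counts shown above:

    # Per-chip PageRank processing rates, taken from the table above.
    sparc_per_chip = 281.1 / 8   # SPARC M7-8 has 8 SPARC M7 chips
    x86_per_chip = 22.2 / 2      # x86 E5 v3 server has 2 chips
    print(round(sparc_per_chip / x86_per_chip, 1))   # prints 3.2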

Configuration Summary

Systems Under Test:

SPARC M7-8 server with
8 x SPARC M7 processors (4.13 GHz)
4 TB memory
Oracle Solaris 11.3
Oracle Solaris Studio 12.4

Oracle Server X5-2 with
2 x Intel Xeon Processor E5-2699 v3 (2.3 GHz)
384 GB memory
Oracle Linux
gcc 4.7.4

Benchmark Description

Graphs are a core part of many analytics workloads. They are very data intensive and stress both memory and compute resources. A graph algorithm typically traverses the entire graph multiple times, performing arithmetic operations during each traversal, which may include single- or double-precision floating point operations.

The mathematics of PageRank are entirely general and apply to any graph or network in any domain. Thus, PageRank is now regularly used in bibliometrics, social and information network analysis, and for link prediction and recommendation. The PageRank algorithm counts the number and quality of links to a page to determine a rough estimate of that page's importance.
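
As an illustration of the computation being timed, below is a minimal PageRank power-iteration sketch in Python. It is not the PGX implementation used in this benchmark; the damping factor, tolerance, and adjacency-list representation are assumptions chosen only to keep the example self-contained.

    # Minimal, illustrative PageRank (power iteration); not the benchmarked PGX code.
    def pagerank(out_edges, damping=0.85, tol=1e-6, max_iters=100):
        """out_edges: dict mapping every node to the list of nodes it links to."""
        nodes = list(out_edges)
        n = len(nodes)
        rank = {v: 1.0 / n for v in nodes}
        for _ in range(max_iters):
            new_rank = {v: (1.0 - damping) / n for v in nodes}
            for v, targets in out_edges.items():
                if targets:
                    share = damping * rank[v] / len(targets)
                    for w in targets:
                        new_rank[w] += share
                else:  # dangling node: spread its rank uniformly
                    for w in nodes:
                        new_rank[w] += damping * rank[v] / n
            converged = sum(abs(new_rank[v] - rank[v]) for v in nodes) < tol
            rank = new_rank
            if converged:
                break
        return rank

    # Tiny usage example on a three-node graph.
    print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))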

Key Points and Best Practices

  • This algorithm is implemented using PGX (Parallel Graph AnalytiX) from Oracle Labs, a fast, parallel, in-memory graph analytic framework.
  • The graph used for these results is based on real world data from Twitter and has 41,652,230 nodes and 1,468,365,182 edges using 22 GB of memory.

See Also

Disclosure Statement

Copyright 2015, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of October 25, 2015.

Graph Breadth First Search Algorithm: SPARC T7-4 Beats 4-Chip x86 E7 v2

Graph algorithms are used in many big data and analytics workloads. Oracle's SPARC T7 processor based systems provide better performance than x86 systems with the Intel Xeon E7 v2 family of processors.

  • The SPARC T7-4 server delivered 3.1x better performance than a four-chip x86 server running a breadth-first search (BFS) on a large graph.

Performance Landscape

The problem is identified by "Scale" and the approximate amount of memory used. Results are listed as edge traversals in billions (ETB) per second (bigger is better). The SPARC M7 processor results were run as part of this benchmark effort. The x86 results were taken from a previous benchmark effort.

Breadth-First Search (BFS)
Scale   Dataset (GB)   ETB/sec SPARC T7-4   ETB/sec x86 (4x E7 v2)   Speedup T7-4/x86
30      580            1.68                 0.54                     3.1x
29      282            1.76                 0.62                     2.8x
28      140            1.70                 0.99                     1.7x
27      70             1.56                 1.07                     1.5x
26      35             1.67                 1.19                     1.4x

Configuration Summary

Systems Under Test:

SPARC T7-4 server with
4 x SPARC M7 processors (4.13 GHz)
1 TB memory
Oracle Solaris 11.3
Oracle Solaris Studio 12.4

Sun Server X4-4 system with
4 x Intel Xeon E7-8895 v2 processors (2.8 GHz)
1 TB memory
Oracle Solaris 11.2
Oracle Solaris Studio 12.4

Benchmark Description

Graphs are a core part of many analytics workloads. They are very data intensive and stress computers. This benchmark does a breadth-first search on a randomly generated graph. It reports the number of graph edges traversed (in billions) per second (ETB/sec). To generate the graph, the data generator from the graph500 benchmark was used.

A description of breadth-first search, taken from Introduction to Algorithms, page 594:

Given a graph G = (V, E) and a distinguished source vertex s, breadth-first search systematically explores the edges of G to "discover" every vertex that is reachable from s. It computes the distance (smallest number of edges) from s to each reachable vertex. It also produces a "breadth-first tree" with root s that contains all reachable vertices. For any vertex v reachable from s, the simple path in the breadth-first tree from s to v corresponds to a "shortest path" from s to v in G, that is, a path containing the smallest number of edges. The algorithm works on both directed and undirected graphs.

Cormen, Thomas H., Leiserson, Charles E., Rivest, Ronald L., Stein, Clifford (2009) [1990]. Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. ISBN 0-262-03384-4.
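
For reference, the traversal being measured can be sketched serially in a few lines of Python. The benchmarked code is a parallel in-memory implementation, so this is only an illustration of the algorithm and of how the edge-traversal metric is derived.

    from collections import deque

    # Illustrative serial breadth-first search; not the parallel code benchmarked here.
    def bfs(adj, source):
        """adj: dict mapping each vertex to an iterable of its neighbors."""
        dist = {source: 0}        # smallest number of edges from source
        parent = {source: None}   # breadth-first tree
        queue = deque([source])
        edges_traversed = 0
        while queue:
            v = queue.popleft()
            for w in adj.get(v, ()):
                edges_traversed += 1
                if w not in dist:              # first discovery = shortest path
                    dist[w] = dist[v] + 1
                    parent[w] = v
                    queue.append(w)
        return dist, parent, edges_traversed

    # The benchmark metric, edge traversals in billions per second:
    # etb_per_sec = edges_traversed / elapsed_seconds / 1e9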

See Also

Disclosure Statement

Copyright 2015, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of October 25, 2015.

SPARC T7-1 Delivers 1-Chip World Records for SPEC CPU2006 Rate Benchmarks

This page has been updated on November 19, 2015. The SPARC T7-1 server results have been published at www.spec.org.

Oracle's SPARC T7-1 server delivered world record SPEC CPU2006 rate benchmark results for systems with one chip. This was accomplished with Oracle Solaris 11.3 and Oracle Solaris Studio 12.4 software.

  • The SPARC T7-1 server achieved world record scores of 1200 SPECint_rate2006, 1120 SPECint_rate_base2006, 832 SPECfp_rate2006, and 801 SPECfp_rate_base2006.

  • The SPARC T7-1 server beat the 1 chip Fujitsu CELSIUS C740 with an Intel Xeon Processor E5-2699 v3 by 1.7x on the SPECint_rate2006 benchmark. The SPARC T7-1 server beat the 1 chip NEC Express5800/R120f-1M with an Intel Xeon Processor E5-2699 v3 by 1.8x on the SPECfp_rate2006 benchmark.

  • The SPARC T7-1 server beat the 1 chip IBM Power S812LC server with a POWER8 processor by 1.9 times on the SPECint_rate2006 benchmark and by 1.8 times on the SPECfp_rate2006 benchmark.

  • The SPARC T7-1 server beat the 1 chip Fujitsu SPARC M10-4S with a SPARC64 X+ processor by 2.2x on the SPECint_rate2006 benchmark and by 1.6x on the SPECfp_rate2006 benchmark.

  • The SPARC T7-1 server improved upon the previous generation SPARC platform, which used the SPARC T5 processor, by 2.5x on the SPECint_rate2006 benchmark and by 2.3x on the SPECfp_rate2006 benchmark.

The SPEC CPU2006 benchmarks are derived from the compute-intensive portions of real applications, stressing chip, memory hierarchy, and compilers. The benchmarks are not intended to stress other computer components such as networking, the operating system, or the I/O system. Note that there are many other SPEC benchmarks, including benchmarks that specifically focus on Java computing, enterprise computing, and network file systems.

Performance Landscape

Complete benchmark results are at the SPEC website. The tables below provide the new Oracle results, as well as select results from other vendors.

Presented are single chip SPEC CPU2006 rate results. Only the best results published at www.spec.org per chip type are presented (best Intel, IBM, Fujitsu, Oracle chips).

SPEC CPU2006 Rate Results – One Chip
System Chip Peak Base
  SPECint_rate2006
SPARC T7-1 SPARC M7 (4.13 GHz, 32 cores) 1200 1120
Fujitsu CELSIUS C740 Intel E5-2699 v3 (2.3 GHz, 18 cores) 715 693
IBM Power S812LC POWER8 (2.92 GHz, 10 cores) 642 482
Fujitsu SPARC M10-4S SPARC64 X+ (3.7 GHz, 16 cores) 546 479
SPARC T5-1B SPARC T5 (3.6 GHz, 16 cores) 489 441
IBM Power 710 Express POWER7 (3.55 GHz, 8 cores) 289 255
  SPECfp_rate2006
SPARC T7-1 SPARC M7 (4.13 GHz, 32 cores) 832 801
NEC Express5800/R120f-1M Intel E5-2699 v3 (2.3 GHz, 18 cores) 474 460
IBM Power S812LC POWER8 (2.92 GHz, 10 cores) 468 394
Fujitsu SPARC M10-4S SPARC64 X+ (3.7 GHz, 16 cores) 462 418
SPARC T5-1B SPARC T5 (3.6 GHz, 16 cores) 369 350
IBM Power 710 Express POWER7 (3.55 GHz, 8 cores) 248 229

The following table compares the performance of the single-chip SPARC M7 processor based server to the best published two-chip POWER8 processor based server.

SPEC CPU2006 Rate Results
Comparing One SPARC M7 Chip to Two POWER8 Chips
System Chip Peak Base
  SPECint_rate2006
SPARC T7-1 1 x SPARC M7 (4.13 GHz, 32 cores) 1200 1120
IBM Power S822LC 2 x POWER8 (2.92 GHz, 2x 10 cores) 1100 853
  SPECfp_rate2006
SPARC T7-1 1 x SPARC M7 (4.13 GHz, 32 cores) 832 801
IBM Power S822LC 2 x POWER8 (2.92 GHz, 2x 10 cores) 888 745

Configuration Summary

System Under Test:

SPARC T7-1
1 x SPARC M7 processor (4.13 GHz)
512 GB memory (16 x 32 GB DIMMs)
800 GB on 4 x 400 GB SAS SSD (mirrored)
Oracle Solaris 11.3
Oracle Solaris Studio 12.4 with 4/15 Patch Set

Benchmark Description

SPEC CPU2006 is SPEC's most popular benchmark. It measures:

  • Speed — single copy performance of chip, memory, compiler
  • Rate — multiple copy (throughput)

The benchmark is also divided into integer intensive applications and floating point intensive applications:

  • integer: 12 benchmarks derived from applications such as artificial intelligence chess playing, artificial intelligence go playing, quantum computer simulation, perl, gcc, XML processing, and pathfinding
  • floating point: 17 benchmarks derived from applications, including chemistry, physics, genetics, and weather.

It is also divided depending upon the amount of optimization allowed:

  • base: optimization is consistent per compiled language, all benchmarks must be compiled with the same flags per language.
  • peak: specific compiler optimization is allowed per application.

The overall metrics for the benchmark which are commonly used are:

  • SPECint_rate2006, SPECint_rate_base2006: integer, rate
  • SPECfp_rate2006, SPECfp_rate_base2006: floating point, rate
  • SPECint2006, SPECint_base2006: integer, speed
  • SPECfp2006, SPECfp_base2006: floating point, speed

Key Points and Best Practices

  • Jobs were bound using pbind.

See Also

Disclosure Statement

SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Results as of November 19, 2015 from www.spec.org.
SPARC T7-1: 1200 SPECint_rate2006, 1120 SPECint_rate_base2006, 832 SPECfp_rate2006, 801 SPECfp_rate_base2006; SPARC T5-1B: 489 SPECint_rate2006, 440 SPECint_rate_base2006, 369 SPECfp_rate2006, 350 SPECfp_rate_base2006; Fujitsu SPARC M10-4S: 546 SPECint_rate2006, 479 SPECint_rate_base2006, 462 SPECfp_rate2006, 418 SPECfp_rate_base2006. IBM Power 710 Express: 289 SPECint_rate2006, 255 SPECint_rate_base2006, 248 SPECfp_rate2006, 229 SPECfp_rate_base2006; Fujitsu CELSIUS C740: 715 SPECint_rate2006, 693 SPECint_rate_base2006; NEC Express5800/R120f-1M: 474 SPECfp_rate2006, 460 SPECfp_rate_base2006; IBM Power S822LC: 1100 SPECint_rate2006, 853 SPECint_rate_base2006, 888 SPECfp_rate2006, 745 SPECfp_rate_base2006; IBM Power S812LC: 642 SPECint_rate2006, 482 SPECint_rate_base2006, 468 SPECfp_rate2006, 394 SPECfp_rate_base2006.

SPARC T7-4 Delivers 4-Chip World Record for SPEC OMP2012

Oracle's SPARC T7-4 server delivered world record performance on the SPEC OMP2012 benchmark for systems with four chips. This was accomplished with Oracle Solaris 11.3 and Oracle Solaris Studio 12.4 software.

  • The SPARC T7-4 server delivered world record results for systems with four chips: 27.9 SPECompG_peak2012 and 26.4 SPECompG_base2012.

  • The SPARC T7-4 server beat the four chip HP ProLiant DL580 Gen9 with Intel Xeon Processor E7-8890 v3 by 29% on the SPECompG_peak2012 benchmark.

  • This SPEC OMP2012 benchmark result demonstrates that the SPARC M7 processor performs well on floating-point intensive technical computing and modeling workloads.

Performance Landscape

Complete benchmark results are at the SPEC website, SPEC OMP2012 Results. The table below provides the new Oracle result as well as the previous best four chip results.

SPEC OMP2012 Results
Four Chip Results
System Processor Peak Base
SPARC T7-4 SPARC M7, 4.13 GHz 27.9 26.4
HP ProLiant DL580 Gen9 Intel Xeon E7-8890 v3, 2.5 GHz 21.5 20.4
Cisco UCS C460 M4 Intel Xeon E7-8890 v3, 2.5 GHz -- 20.8

Configuration Summary

Systems Under Test:

SPARC T7-4
4 x 4.13 GHz SPARC M7 processors
2 TB memory (64 x 32 GB DIMMs)
4 x 600 GB SAS 10,000 RPM HDD (mirrored)
Oracle Solaris 11.3 (11.3.0.30.0)
Oracle Solaris Studio 12.4 with 4/15 Patch Set

Benchmark Description

The following was taken from the SPEC website.

SPEC OMP2012 focuses on compute intensive performance, which means these benchmarks emphasize the performance of:

  • the computer processor (CPU),
  • the memory architecture,
  • the parallel support libraries, and
  • the compilers.

It is important to remember the contribution of the latter three components. SPEC OMP performance intentionally depends on more than just the processor.

SPEC OMP2012 contains a suite that focuses on parallel computing performance using the OpenMP parallelism standard.

The suite can be used to measure along the following vector:

  • Compilation method: Consistent compiler options across all programs of a given language (the base metrics) and, optionally, compiler options tuned to each program (the peak metrics).

SPEC OMP2012 is not intended to stress other computer components such as networking, the operating system, graphics, or the I/O system. Note that there are many other SPEC benchmarks, including benchmarks that specifically focus on graphics, distributed Java computing, webservers, and network file systems.

Key Points and Best Practices

  • Jobs were bound using OMP_PLACES.

See Also

Disclosure Statement

SPEC and the benchmark name SPEC OMP are registered trademarks of the Standard Performance Evaluation Corporation. Results as of November 11, 2015 from www.spec.org. SPARC T7-4 (4 chips, 128 cores, 1024 threads): 27.9 SPECompG_peak2012, 26.4 SPECompG_base2012; HP ProLiant DL580 Gen9 (4 chips, 72 cores, 144 threads): 21.5 SPECompG_peak2012, 20.4 SPECompG_base2012; Cisco UCS C460 M4 (4 chips, 72 cores, 144 threads): 20.8 SPECompG_base2012.

Tuesday Mar 26, 2013

SPARC T5 Systems Deliver SPEC CPU2006 Rate Benchmark Multiple World Records

Oracle's SPARC T5 processor based systems delivered world record performance on the SPEC CPU2006 rate benchmarks. This was accomplished with Oracle Solaris 11.1 and Oracle Solaris Studio 12.3 software.

SPARC T5-8

  • The SPARC T5-8 server delivered world record SPEC CPU2006 rate benchmark results for systems with eight processors.

  • The SPARC T5-8 server achieved scores of 3750 SPECint_rate2006, 3490 SPECint_rate_base2006, 3020 SPECfp_rate2006, and 2770 SPECfp_rate_base2006.

  • The SPARC T5-8 server beat the 8 processor IBM Power 760 with POWER7+ processors by 1.7x on the SPECint_rate2006 benchmark and 2.2x on the SPECfp_rate2006 benchmark.

  • The SPARC T5-8 server beat the 8 processor IBM Power 780 with POWER7 processors by 35% on the SPECint_rate2006 benchmark and 14% on the SPECfp_rate2006 benchmark.

  • The SPARC T5-8 server beat the 8 processor HP DL980 G7 with Intel Xeon E7-4870 processors by 1.7x on the SPECint_rate2006 benchmark and 2.1x on the SPECfp_rate2006 benchmark.

SPARC T5-1B

  • The SPARC T5-1B server module delivered world record SPEC CPU2006 rate benchmark results for systems with one processor.

  • The SPARC T5-1B server module achieved scores of 467 SPECint_rate2006, 436 SPECint_rate_base2006, 369 SPECfp_rate2006, and 350 SPECfp_rate_base2006.

  • The SPARC T5-1B server module beat the 1 processor IBM Power 710 Express with a POWER7 processor by 62% on the SPECint_rate2006 benchmark and 49% on the SPECfp_rate2006 benchmark.

  • The SPARC T5-1B server module beat the 1 processor NEC Express5800/R120d-1M with an Intel Xeon E5-2690 processor by 31% on the SPECint_rate2006 benchmark. The SPARC T5-1B server module beat the 1 processor Huawei RH2288 V2 with an Intel Xeon E5-2690 processor by 44% on the SPECfp_rate2006 benchmark.

  • The SPARC T5-1B server module beat the 1 processor Supermicro A+ 1012G-MTF with an AMD Opteron 6386 SE processor by 51% on the SPECint_rate2006 benchmark and 65% on the SPECfp_rate2006 benchmark.

Performance Landscape

Complete benchmark results are at the SPEC website, SPEC CPU2006 Results. The tables below provide the new Oracle results, as well as select results from other vendors.

SPEC CPU2006 Rate Results – Eight Processors
System Processor ch/co/th * Peak Base
SPECint_rate2006
SPARC T5-8 SPARC T5, 3.6 GHz 8/128/1024 3750 3490
IBM Power 780 POWER7, 3.92 GHz 8/64/256 2770 2420
HP DL980 G7 Xeon E7-4870, 2.4 GHz 8/80/160 2180 2070
IBM Power 760 POWER7+, 3.42 GHz 8/48/192 2170 1480
Dell PowerEdge C6145 Opteron 6180 SE, 2.5 GHz 8/96/96 1670 1440
SPECfp_rate2006
SPARC T5-8 SPARC T5, 3.6 GHz 8/128/1024 3020 2770
IBM Power 780 POWER7, 3.92 GHz 8/64/256 2640 2410
HP DL980 G7 Xeon E7-4870, 2.4 GHz 8/80/160 1430 1380
IBM Power 760 POWER7+, 3.42 GHz 8/48/192 1400 1130
Dell PowerEdge C6145 Opteron 6180 SE, 2.5 GHz 8/96/96 1310 1200

* ch/co/th — chips / cores / threads enabled

SPEC CPU2006 Rate Results – One Processor
System Processor ch/co/th * Peak Base
SPECint_rate2006
SPARC T5-1B SPARC T5, 3.6 GHz 1/16/128 467 436
NEC Express5800/R120d-1M Xeon E5-2690, 2.9 GHz 1/8/16 357 343
Supermicro A+ 1012G-MTF Opteron 6386 SE, 2.8 GHz 1/16/16 309 269
IBM Power 710 Express POWER7, 3.556 GHz 1/8/32 289 255
SPECfp_rate2006
SPARC T5-1B SPARC T5, 3.6 GHz 1/16/128 369 350
Huawei RH2288 V2 Xeon E5-2690, 2.9 GHz 1/8/16 257 250
IBM Power 710 Express POWER7, 3.556 GHz 1/8/32 248 229
Supermicro A+ 1012G-MTF Opteron 6386 SE, 2.8 GHz 1/16/16 223 199

* ch/co/th — chips / cores / threads enabled

Configuration Summary

Systems Under Test:

SPARC T5-8
8 x 3.6 GHz SPARC T5 processors
4 TB memory (128 x 32 GB DIMMs)
2 TB on 8 x 600 GB 10K RPM SAS disks, arranged as 4 x 2-way mirrors
Oracle Solaris 11.1 (SRU 4.6)
Oracle Solaris Studio 12.3 1/13 PSE

SPARC T5-1B
1 x 3.6 GHz SPARC T5 processor
256 GB memory (16 x 16 GB DIMMs)
157 GB on 2 x 300 GB 10K RPM SAS disks (mirrored)
Oracle Solaris 11.1 (SRU 3.4)
Oracle Solaris Studio 12.3 1/13 PSE

Benchmark Description

SPEC CPU2006 is SPEC's most popular benchmark. It measures:

  • Speed — single copy performance of chip, memory, compiler
  • Rate — multiple copy (throughput)

The benchmark is also divided into integer intensive applications and floating point intensive applications:

  • integer: 12 benchmarks derived from real applications such as perl, gcc, XML processing, and pathfinding
  • floating point: 17 benchmarks derived from real applications, including chemistry, physics, genetics, and weather.

It is also divided depending upon the amount of optimization allowed:

  • base: optimization is consistent per compiled language, all benchmarks must be compiled with the same flags per language.
  • peak: specific compiler optimization is allowed per application.

The overall metrics for the benchmark which are commonly used are:

  • SPECint_rate2006, SPECint_rate_base2006: integer, rate
  • SPECfp_rate2006, SPECfp_rate_base2006: floating point, rate
  • SPECint2006, SPECint_base2006: integer, speed
  • SPECfp2006, SPECfp_base2006: floating point, speed

See Also

Disclosure Statement

SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Results as of March 26, 2013 from www.spec.org and this report. SPARC T5-8: 3750 SPECint_rate2006, 3490 SPECint_rate_base2006, 3020 SPECfp_rate2006, 2770 SPECfp_rate_base2006; SPARC T5-1B: 467 SPECint_rate2006, 436 SPECint_rate_base2006, 369 SPECfp_rate2006, 350 SPECfp_rate_base2006.

Tuesday Apr 10, 2012

SPEC CPU2006 Results on Oracle's Sun x86 Servers

Oracle's new Sun x86 servers delivered world records on the benchmarks SPECfp2006 and SPECint_rate2006 for two processor servers. This was accomplished with Oracle Solaris 11 and Oracle Solaris Studio 12.3 software.

  • The Sun Fire X4170 M3 (now known as Sun Server X3-2) server achieved a world record result for the SPECfp2006 benchmark with a score of 96.8.

  • The Sun Blade X6270 M3 server module (now known as Sun Blade X3-2B) produced best integer throughput performance for all 2-socket servers with a SPECint_rate2006 score of 705.

  • The Sun x86 servers with Intel Xeon E5-2690 2.9 GHz processors produced a cross-generational performance improvement of up to 1.8x over the previous generation Sun x86 M2 servers.

Performance Landscape

Complete benchmark results are at the SPEC website, SPEC CPU2006 Results. The tables below provide the new Oracle results, as well as select results from other vendors.

SPECint2006
System Processor c/c/t * Peak Base O/S Compiler
Fujitsu PRIMERGY BX924 S3 Intel E5-2690, 2.9 GHz 2/16/16 60.8 56.0 RHEL 6.2 Intel 12.1.2.273
Sun Fire X4170 M3 Intel E5-2690, 2.9 GHz 2/16/32 58.5 54.3 Oracle Linux 6.1 Intel 12.1.0.225
Sun Fire X4270 M2 Intel X5690, 3.47 GHz 2/12/12 46.2 43.9 Oracle Linux 5.5 Intel 12.0.1.116

SPECfp2006
System Processor c/c/t * Peak Base O/S Compiler
Sun Fire X4170 M3 Intel E5-2690, 2.9 GHz 2/16/32 96.8 86.4 Oracle Solaris 11 Studio 12.3
Sun Blade X6270 M3 Intel E5-2690, 2.9 GHz 2/16/32 96.0 85.2 Oracle Solaris 11 Studio 12.3
Sun Fire X4270 M3 Intel E5-2690, 2.9 GHz 2/16/32 95.9 85.1 Oracle Solaris 11 Studio 12.3
Fujitsu CELSIUS R920 Intel E5-2687, 2.9 GHz 2/16/16 93.8 87.6 RHEL 6.1 Intel 12.1.2.273
Sun Fire X4270 M2 Intel X5690, 3.47 GHz 2/12/24 64.2 59.2 Oracle Solaris 10 Studio 12.2

Only 2-chip server systems are listed below; workstations are excluded.

SPECint_rate2006
System Processor Base Copies c/c/t * Peak Base O/S Compiler
Sun Blade X6270 M3 Intel E5-2690, 2.9 GHz 32 2/16/32 705 632 Oracle Solaris 11 Studio 12.3
Sun Fire X4270 M3 Intel E5-2690, 2.9 GHz 32 2/16/32 705 630 Oracle Solaris 11 Studio 12.3
Sun Fire X4170 M3 Intel E5-2690, 2.9 GHz 32 2/16/32 702 628 Oracle Solaris 11 Studio 12.3
Cisco UCS C220 M3 Intel E5-2690, 2.9 GHz 32 2/16/32 697 671 RHEL 6.2 Intel 12.1.0.225
Sun Blade X6270 M2 Intel X5690, 3.47 GHz 24 2/12/24 410 386 Oracle Linux 5.5 Intel 12.0.1.116

SPECfp_rate2006
System Processor Base Copies c/c/t * Peak Base O/S Compiler
Cisco UCS C240 M3 Intel E5-2690, 2.9 GHz 32 2/16/32 510 496 RHEL 6.2 Intel 12.1.2.273
Sun Fire X4270 M3 Intel E5-2690, 2.9 GHz 64 2/16/32 497 461 Oracle Solaris 11 Studio 12.3
Sun Blade X6270 M3 Intel E5-2690, 2.9 GHz 32 2/16/32 497 460 Oracle Solaris 11 Studio 12.3
Sun Fire X4170 M3 Intel E5-2690, 2.9 GHz 64 2/16/32 495 464 Oracle Solaris 11 Studio 12.3
Sun Fire X4270 M2 Intel X5690, 3.47 GHz 24 2/12/24 273 265 Oracle Linux 5.5 Intel 12.0.1.116

* c/c/t — chips / cores / threads enabled

Configuration Summary and Results

Hardware Configuration:

Sun Fire X4170 M3 server
2 x 2.90 GHz Intel Xeon E5-2690 processors
128 GB memory (16 x 8 GB 2Rx4 PC3-12800R-11, ECC)

Sun Fire X4270 M3 server
2 x 2.90 GHz Intel Xeon E5-2690 processors
128 GB memory (16 x 8 GB 2Rx4 PC3-12800R-11, ECC)

Sun Blade X6270 M3 server module
2 x 2.90 GHz Intel Xeon E5-2690 processors
128 GB memory (16 x 8 GB 2Rx4 PC3-12800R-11, ECC)

Software Configuration:

Oracle Solaris 11 11/11 (SRU2)
Oracle Solaris Studio 12.3 (patch update 1 nightly build 120313)
Oracle Linux Server Release 6.1
Intel C++ Studio XE 12.1.0.225
SPEC CPU2006 V1.2

Benchmark Description

SPEC CPU2006 is SPEC's most popular benchmark. It measures:

  • Speed — single copy performance of chip, memory, compiler
  • Rate — multiple copy (throughput)

The benchmark is also divided into integer intensive applications and floating point intensive applications:

  • integer: 12 benchmarks derived from real applications such as perl, gcc, XML processing, and pathfinding
  • floating point: 17 benchmarks derived from real applications, including chemistry, physics, genetics, and weather.

It is also divided depending upon the amount of optimization allowed:

  • base: optimization is consistent per compiled language, all benchmarks must be compiled with the same flags per language.
  • peak: specific compiler optimization is allowed per application.

The overall metrics for the benchmark which are commonly used are:

  • SPECint_rate2006, SPECint_rate_base2006: integer, rate
  • SPECfp_rate2006, SPECfp_rate_base2006: floating point, rate
  • SPECint2006, SPECint_base2006: integer, speed
  • SPECfp2006, SPECfp_base2006: floating point, speed

See here for additional information.

See Also

Disclosure Statement

SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Results as of 10 April 2012 from www.spec.org and this report.

Tuesday Oct 26, 2010

3D VTI Reverse Time Migration Scalability On Sun Fire X2270-M2 Cluster with Sun Storage 7210

This Oil & Gas benchmark shows the Sun Storage 7210 system delivers almost 2 GB/sec bandwidth and realizes near-linear scaling performance on a cluster of 16 Sun Fire X2270 M2 servers.

Oracle's Sun Storage 7210 system attached via QDR InfiniBand to a cluster of sixteen of Oracle's Sun Fire X2270 M2 servers was used to demonstrate the performance of a Reverse Time Migration application, an important application in the Oil & Gas industry. The total application throughput and computational kernel scaling are presented for two production sized grids of 800 samples.

  • Both the Reverse Time Migration I/O and combined computation shows near-linear scaling from 8 to 16 nodes on the Sun Storage 7210 system connected via QDR InfiniBand to a Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231: 2.0x improvement
      2486 x 1151 x 1231: 1.7x improvement
  • The computational kernel of the Reverse Time Migration has linear to super-linear scaling from 8 to 16 nodes in Oracle's Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231 : 2.2x improvement
      2486 x 1151 x 1231 : 2.0x improvement
  • Intel Hyper-Threading provides additional performance benefits to both the Reverse Time Migration I/O and computation when going from 12 to 24 OpenMP threads on the Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231: 8% - computational kernel; 2% - total application throughput
      2486 x 1151 x 1231: 12% - computational kernel; 6% - total application throughput
  • The Sun Storage 7210 system delivers the Velocity, Epsilon, and Delta data to the Reverse Time Migration at a steady rate even when timing includes memory initialization and data object creation:

      1243 x 1151 x 1231: 1.4 to 1.6 GBytes/sec
      2486 x 1151 x 1231: 1.2 to 1.3 GBytes/sec

    One can see that when doubling the size of the problem, the additional complexity of overlapping I/O and multiple node file contention only produces a small reduction in read performance.

Performance Landscape

Application Scaling

Performance and scaling results of the total application, including I/O, for the reverse time migration demonstration application are presented. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server. The servers were running with hyperthreading enabled, allowing for 24 OpenMP threads per server.

Application Scaling Across Multiple Nodes

Grid Size 1243 x 1151 x 1231
Nodes   Total Time (sec)   Kernel Time (sec)   Total Speedup   Kernel Speedup
16      504                259                 2.0             2.2*
14      565                279                 1.8             2.0
12      662                343                 1.6             1.6
10      784                394                 1.3             1.4
8       1024               560                 1.0             1.0

Grid Size 2486 x 1151 x 1231
Nodes   Total Time (sec)   Kernel Time (sec)   Total Speedup   Kernel Speedup
16      1024               551                 1.7             2.0
14      1191               677                 1.5             1.6
12      1426               817                 1.2             1.4
10      1501               856                 1.2             1.3
8       1745               1108                1.0             1.0

* Super-linear scaling due to the compute kernel fitting better into available cache

Application Scaling – Hyper-Threading Study

The effects of hyper-threading are presented when running the reverse time migration demonstration application. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server.

Hyper-Threading Comparison – 12 versus 24 OpenMP Threads

Grid Size 1243 x 1151 x 1231
Nodes   Threads per Node   Total Time (sec)   Kernel Time (sec)   Total HT Speedup   Kernel HT Speedup
16      24                 504                259                 1.02               1.08
16      12                 515                279                 1.00               1.00

Grid Size 2486 x 1151 x 1231
Nodes   Threads per Node   Total Time (sec)   Kernel Time (sec)   Total HT Speedup   Kernel HT Speedup
16      24                 1024               551                 1.06               1.12
16      12                 1088               616                 1.00               1.00

Read Performance

Read performance is presented for the velocity, epsilon and delta files running the reverse time migration demonstration application. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server. The servers were running with hyperthreading enabled, allowing for 24 OpenMP threads per server.

Velocity, Epsilon, and Delta File Read and Memory Initialization Performance

Grid Size 1243 x 1151 x 1231
Nodes   Overlap MBytes Read   Time (sec)   Time Relative to 8-node   Total GBytes Read   Read Rate GB/s
16      2040                  16.7         1.1                       23.2                1.4
8       951                   14.8         1.0                       22.1                1.6

Grid Size 2486 x 1151 x 1231
Nodes   Overlap MBytes Read   Time (sec)   Time Relative to 8-node   Total GBytes Read   Read Rate GB/s
16      2040                  36.8         1.1                       44.3                1.2
8       951                   33.0         1.0                       43.2                1.3

Configuration Summary

Hardware Configuration:

16 x Sun Fire X2270 M2 servers, each with
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory (12 x 4 GB at 1333 MHz)

Sun Storage 7210 system connected via QDR InfiniBand
2 x 18 GB SATA SSD (logzilla)
40 x 1 TB 7200 RPM SATA disks

Software Configuration:

SUSE Linux Enterprise Server SLES 10 SP 2
Oracle Message Passing Toolkit 8.2.1 (for MPI)
Sun Studio 12 Update 1 C++, Fortran, OpenMP

Benchmark Description

This Reverse Time Migration (RTM) demonstration application measures the total time it takes to image 800 samples of various production size grids and write the final image to disk. In this version, each node reads in only the trace, velocity, and conditioning data to be processed by that node plus a four element inline 3-D array pad (spatial order of eight) shared with its neighbors to the left and right during the initialization phase. It represents a full RTM application including the data input, computation, communication, and final output image to be used by the next work flow step involving 3D volumetric seismic interpretation.

Key Points and Best Practices

This demonstration application represents a full Reverse Time Migration solution. Many references to the RTM application tend to focus on the compute kernel and ignore the complexity that the input, communication, and output bring to the task.

I/O Characterization without Optimal Checkpointing

Velocity, Epsilon, and Delta Files - Grid Reading

The additional amount of overlapping reads to share velocity, epsilon, and delta edge data with neighbors can be calculated using the following equation:

    (number_nodes - 1) x (order_in_space) x (y_dimension) x (z_dimension) x (4 bytes) x (3 files)

For this particular benchmark study, the additional 3-D pad overlap for the 16 and 8 node cases is:

    16 nodes: 15 x 8 x 1151 x 1231 x 4 x 3 = 2.04 GB extra
    8 nodes: 7 x 8 x 1151 x 1231 x 4 x 3 = 0.95 GB extra

For the first of the two test cases, the total size of the three files used for the 1243 x 1151 x 1231 case is

    1243 x 1151 x 1231 x 4 bytes = 7.05 GB per file x 3 files = 21.13 GB

With the additional 3-D pad, the total amount of data read is:

    16 nodes: 2.04 GB + 21.13 GB = 23.2 GB
    8 nodes: 0.95 GB + 21.13 GB = 22.1 GB

For the second of the two test cases, the total size of the three files used for the 2486 x 1151 x 1231 case is

    2486 x 1151 x 1231 x 4 bytes = 14.09 GB per file x 3 files = 42.27 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: 2.04 GB + 42.27 GB = 44.3 GB
    8 nodes: 0.95 GB + 42.27 GB = 43.2 GB

Note that the amount of overlapping data read increases, not only by the number of nodes, but as the y dimension and/or the z dimension increases.

Trace Reading

The additional amount of overlapping reads to share trace edge data with neighbors can be calculated using the following equation:

    (number_nodes - 1) x (order_in_space) x (y_dimension) x (4 bytes) x (number_of_time_slices)

For this particular benchmark study, the additional overlap for the 16 and 8 node cases is:

    16 nodes: 15 x 8 x 1151 x 4 x 800 = 442 MB extra
    8 nodes: 7 x 8 x 1151 x 4 x 800 = 206 MB extra

For the first case the size of the trace data file used for the 1243 x 1151 x 1231 case is

    1243 x 1151 x 4 bytes x 800 = 4.578 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: 0.442 GB + 4.578 GB = 5.0 GB
    8 nodes: 0.206 GB + 4.578 GB = 4.8 GB

For the second case the size of the trace data file used for the 2486 x 1151 x 1231 case is

    2486 x 1151 x 4 bytes x 800 = 9.156 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: 0.442 GB + 9.156 GB = 9.6 GB
    8 nodes: 0.206 GB + 9.156 GB = 9.4 GB

As the number of nodes is increased, the overlap causes more disk lock contention.
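
The read-volume arithmetic above reduces to a few products. The short Python sketch below reproduces the grid-file and trace-file totals for both test cases; all sizes are decimal GB, the dimensions and constants are taken from the text, and the helper names are illustrative only.

    # Reproduce the grid-file and trace-file read volumes described above.
    BYTES_PER_SAMPLE = 4
    ORDER_IN_SPACE = 8
    TIME_SLICES = 800

    def grid_read_gb(nodes, x, y, z, files=3):
        base = x * y * z * BYTES_PER_SAMPLE * files
        overlap = (nodes - 1) * ORDER_IN_SPACE * y * z * BYTES_PER_SAMPLE * files
        return base / 1e9, overlap / 1e9

    def trace_read_gb(nodes, x, y):
        base = x * y * BYTES_PER_SAMPLE * TIME_SLICES
        overlap = (nodes - 1) * ORDER_IN_SPACE * y * BYTES_PER_SAMPLE * TIME_SLICES
        return base / 1e9, overlap / 1e9

    for nodes in (16, 8):
        for x in (1243, 2486):
            gb, go = grid_read_gb(nodes, x, 1151, 1231)
            tb, to = trace_read_gb(nodes, x, 1151)
            print(f"{nodes:2d} nodes, x={x}: grid {gb + go:5.1f} GB "
                  f"(overlap {go:.2f}), trace {tb + to:4.1f} GB (overlap {to:.3f})")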

Writing Final Output Image

1243x1151x1231 - 7.1 GB per file:

    16 nodes: 78 x 1151 x 1231 x 4 = 442MB/node (7.1 GB total)
    8 nodes: 156 x 1151 x 1231 x 4 = 884MB/node (7.1 GB total)

2486x1151x1231 - 14.1 GB per file:

    16 nodes: 156 x 1151 x 1231 x 4 = 930 MB/node (14.1 GB total)
    8 nodes: 311 x 1151 x 1231 x 4 = 1808 MB/node (14.1 GB total)

Resource Allocation

It is best to allocate one node as the Oracle Grid Engine resource scheduler and MPI master host. This is especially true when running with 24 OpenMP threads in hyperthreading mode to avoid oversubscribing a node that is cooperating in delivering the solution.

See Also

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 10/20/2010.

About

BestPerf is the source of Oracle performance expertise. In this blog, Oracle's Strategic Applications Engineering group explores Oracle's performance results and shares best practices learned from working on Enterprise-wide Applications.
