Everything you want and need to know about Oracle SPARC systems performance

Memory and Bisection Bandwidth: SPARC T7 and M7 Servers Faster Than x86 and POWER8

Brian Whitney
Principal Software Engineer

The STREAM benchmark measures delivered memory bandwidth on a variety of memory-intensive tasks.  Delivered memory bandwidth is key to a server delivering high performance on a wide variety of workloads.  The STREAM benchmark is typically run so that each chip in the system gets its memory requests satisfied from local memory.  This report presents the performance of Oracle's SPARC M7 processor-based servers and compares it to that of x86 and IBM POWER8 servers.

Bisection bandwidth on a server is a measure of the cross-chip data bandwidth between the processors of a system where no memory access is local to the processor.  Systems with large cross-chip penalties show dramatically lower bisection bandwidth.  Real-world ad hoc workloads tend to perform better on systems with better bisection bandwidth because their memory usage characteristics tend to be chaotic.

IBM says the sustained or delivered bandwidth of the IBM POWER8 12-core chip is 230 GB/sec. This number is a peak bandwidth calculation: 230.4 GB/sec = 9.6 GHz * 3 (r+w) * 8 bytes. IBM uses a similar calculation for the POWER8 dual-chip-module (two 6-core chips) to show a sustained or delivered bandwidth of 192 GB/sec (192.0 GB/sec = 8.0 GHz * 3 (r+w) * 8 bytes).  Peaks are the theoretical limits used for marketing hype, but true measured delivered bandwidth is the only useful comparison to help one understand delivered performance of real applications.
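As a worked check of the arithmetic above, the following C sketch multiplies the three factors quoted for each peak figure. The function and parameter names are illustrative labels for the factors as they appear in the text; they are assumptions made for this example, not IBM terminology.

```c
#include <assert.h>

/* Peak bandwidth in GB/sec as the plain product of the three factors
 * quoted in the text: rate (GHz) * read+write factor * width (bytes).
 * The parameter names are hypothetical labels chosen for this sketch. */
static double peak_bandwidth_gbs(double rate_ghz, double rw_factor,
                                 double width_bytes)
{
    assert(rate_ghz > 0.0 && rw_factor > 0.0 && width_bytes > 0.0);
    return rate_ghz * rw_factor * width_bytes;
}

/* Fraction of a peak figure that a measured bandwidth actually reaches.
 * Both arguments must use the same unit. */
static double fraction_of_peak(double measured_gbs, double peak_gbs)
{
    assert(peak_gbs > 0.0);
    return measured_gbs / peak_gbs;
}
```

Calling `peak_bandwidth_gbs(9.6, 3.0, 8.0)` and `peak_bandwidth_gbs(8.0, 3.0, 8.0)` reproduces the 230.4 GB/sec and 192.0 GB/sec figures up to floating-point rounding; `fraction_of_peak` is there so that a bandwidth measured on a given system can be set against such a peak.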

The STREAM benchmark is easy to run, and anyone can measure memory bandwidth on a target system (see the Key Points and Best Practices section).

  • The SPARC M7-8 server delivers over 1 TB/sec on the STREAM benchmark.  This is over 2.4 times the triad bandwidth of an eight-chip x86 E7 v3 server.

  • The SPARC T7-4 delivered 2.2 times the STREAM triad bandwidth of a four-chip x86 E7 v3 server and 1.7 times the triad bandwidth of a four-chip IBM Power System S824 server.

  • The SPARC T7-2 delivered 2.5 times the STREAM triad bandwidth of a two-chip x86 E5 v3 server.

  • The SPARC M7-8 server delivered over 8.5 times the triad bisection bandwidth of an eight-chip x86 E7 v3 server.

  • The SPARC T7-4 server delivered over 2.7 times the triad bisection bandwidth of a four-chip x86 E7 v3 server and 2.3 times the triad bisection bandwidth of a four-chip IBM Power System S824 server.

  • The SPARC T7-2 server delivered over 2.7 times the triad bisection bandwidth of a two-chip x86 E5 v3 server.
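The ratios in the list above can be recomputed from the Triad columns of the tables in the Performance Landscape section. The C sketch below does that; the helper names are hypothetical, and the constants are copied from those tables.

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of two bandwidth figures given in the same unit (here MB/sec). */
static double bandwidth_ratio(double system_mbs, double baseline_mbs)
{
    assert(baseline_mbs > 0.0);
    return system_mbs / baseline_mbs;
}

/* Print the Triad ratios; every constant is a Triad value taken from the
 * STREAM and bisection bandwidth tables in this report. */
static void print_triad_ratios(void)
{
    /* Maximum STREAM (local memory) */
    printf("M7-8 vs 8-chip x86 E7 v3: %.2fx\n",
           bandwidth_ratio(1086305.0, 442184.0));
    printf("T7-4 vs 4-chip x86 E7 v3: %.2fx\n",
           bandwidth_ratio(555374.0, 251161.0));
    printf("T7-4 vs IBM S824:         %.2fx\n",
           bandwidth_ratio(555374.0, 319561.0));
    printf("T7-2 vs 2-chip x86 E5 v3: %.2fx\n",
           bandwidth_ratio(285905.0, 112521.0));

    /* Bisection bandwidth (nonlocal STREAM) */
    printf("M7-8 vs 8-chip x86 E7 v3 (bisection): %.2fx\n",
           bandwidth_ratio(375851.0, 43744.0));
    printf("T7-4 vs 4-chip x86 E7 v3 (bisection): %.2fx\n",
           bandwidth_ratio(142729.0, 51333.0));
    printf("T7-4 vs IBM S824 (bisection):         %.2fx\n",
           bandwidth_ratio(142729.0, 60939.0));
    printf("T7-2 vs 2-chip x86 E5 v3 (bisection): %.2fx\n",
           bandwidth_ratio(129592.0, 47251.0));
}
```

Calling `print_triad_ratios()` from a `main` function prints one ratio per claim, so each bullet can be checked against the measured data.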

 

Performance Landscape

The following SPARC, x86, and IBM S824 STREAM results were run as part of this benchmark effort. The IBM S822L result is from the referenced web location.  The following SPARC results were all run using 32 GB DIMMs.

Maximum STREAM Benchmark Performance
Bandwidth in MB/sec (MB = 10^6 bytes)

System            Chips       Copy      Scale        Add      Triad
SPARC M7-8            8    995,402    995,727  1,092,742  1,086,305
x86 E7 v3             8    346,771    354,679    445,550    442,184

SPARC T7-4            4    512,080    510,387    556,184    555,374
IBM S824              4    251,533    253,216    322,399    319,561
IBM S822L             4    252,743    247,314    295,556    305,955
x86 E7 v3             4    230,027    232,092    248,761    251,161

SPARC T7-2            2    259,198    259,380    285,835    285,905
x86 E5 v4             2    120,939    121,417    129,775    130,242
x86 E5 v3 (COD)       2    103,927    105,262    117,688    117,680
x86 E5 v3             2    105,622    105,808    113,116    112,521

SPARC T7-1            1    131,323    131,308    144,956    144,706
 

All of the following bisection bandwidth results were run as part of this benchmark effort.

Bisection Bandwidth Benchmark Performance (Nonlocal STREAM)
Bandwidth in MB/sec (MB = 10^6 bytes)

System            Chips       Copy      Scale        Add      Triad
SPARC M7-8            8    383,479    381,219    375,371    375,851
SPARC T5-8            8    172,195    172,354    250,620    250,858
x86 E7 v3             8     42,636     42,839     43,753     43,744

SPARC T7-4            4    142,549    142,548    142,645    142,729
SPARC T5-4            4     75,926     75,947     76,975     77,061
IBM S824              4     53,940     54,107     60,746     60,939
x86 E7 v3             4     41,636     47,740     51,206     51,333

SPARC T7-2            2    127,372    127,097    129,833    129,592
SPARC T5-2            2     91,530     91,597     91,761     91,984
x86 E5 v4             2     50,153     50,366     50,266     50,265
x86 E5 v3             2     45,211     45,331     47,414     47,251
 

The following SPARC results were all run using 16 GB DIMMs.

SPARC T7 Servers – 16 GB DIMMs
Maximum STREAM Benchmark Performance
Bandwidth in MB/sec (MB = 10^6 bytes)

System            Chips       Copy      Scale        Add      Triad
SPARC T7-4            4    520,779    521,113    602,137    600,330
SPARC T7-2            2    262,586    262,760    302,758    302,085
SPARC T7-1            1    132,154    132,132    168,677    168,654
 

Configuration Summary

SPARC Configurations:

SPARC M7-8
8 x SPARC M7 processors (4.13 GHz)
4 TB memory (128 x 32 GB dimms)
 
SPARC T7-4
4 x SPARC M7 processors (4.13 GHz)
2 TB memory (64 x 32 GB dimms)
1 TB memory (64 x 16 GB dimms)
 
SPARC T7-2
2 x SPARC M7 processors (4.13 GHz)
1 TB memory (32 x 32 GB dimms)
512 GB memory (32 x 16 GB dimms)
 
SPARC T7-1
1 x SPARC M7 processor (4.13 GHz)
512 GB memory (16 x 32 GB dimms)
256 GB memory (16 x 16 GB dimms)

Oracle Solaris 11.3
Oracle Solaris Studio 12.4
 

x86 Configurations:

Oracle Server X5-8
8 x Intel Xeon Processor E7-8895 v3
2 TB memory (128 x 16 GB dimms)
 
Oracle Server X5-4
4 x Intel Xeon Processor E7-8895 v3
1 TB memory (64 x 16 GB dimms)
 
Oracle Server X6-2
2 x Intel Xeon Processor E5-2699 v4
256 GB memory (16 x 16 GB dimms)
 
Oracle Server X5-2
2 x Intel Xeon Processor E5-2699 v3
256 GB memory (16 x 16 GB dimms)

Oracle Linux 7.1
Intel Parallel Studio XE Composer Version 2016 compilers
 

Benchmark Description

STREAM

The STREAM benchmark measures sustainable memory bandwidth (in MB/s) for simple vector compute kernels. All memory accesses are sequential, so a picture of how fast regular data may be moved through the system is portrayed. Properly run, the benchmark displays the characteristics of the memory system of the machine and not the advantages of running from the system's memory caches.

STREAM counts the bytes read plus the bytes written to memory. For the simple Copy kernel, this is exactly twice the number obtained from the bcopy convention. STREAM does this because three of the four kernels (Scale, Add and Triad) do arithmetic, so it makes sense to count both the data read into the CPU and the data written back from the CPU.  The Copy kernel does no arithmetic, but, for consistency, counts bytes the same way as the other three.
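To make the byte-counting convention concrete, here is a minimal C sketch of the four kernels and of how a reported bandwidth follows from the counted bytes and the elapsed time. It is a simplified illustration, not the stream.c source: the function names are hypothetical, the arrays are assumed to hold `double` elements, and timing is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* The four STREAM kernels over double arrays of length n. */
static void stream_copy(double *c, const double *a, size_t n)
{
    for (size_t j = 0; j < n; j++) c[j] = a[j];
}

static void stream_scale(double *b, const double *c, double s, size_t n)
{
    for (size_t j = 0; j < n; j++) b[j] = s * c[j];
}

static void stream_add(double *c, const double *a, const double *b, size_t n)
{
    for (size_t j = 0; j < n; j++) c[j] = a[j] + b[j];
}

static void stream_triad(double *a, const double *b, const double *c,
                         double s, size_t n)
{
    for (size_t j = 0; j < n; j++) a[j] = b[j] + s * c[j];
}

/* Bytes counted for one pass of a kernel: every array read or written
 * counts once. Copy and Scale touch 2 arrays; Add and Triad touch 3. */
static double stream_bytes(int arrays_touched, size_t n)
{
    assert(arrays_touched > 0);
    return (double)arrays_touched * (double)sizeof(double) * (double)n;
}

/* Bandwidth in MB/sec, with MB = 10^6 bytes as in the tables. */
static double stream_mbs(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return 1.0e-6 * bytes / seconds;
}
```

Under this convention, a Copy pass over arrays of n elements counts 2 * 8 * n bytes (one array read, one written), which is why the figure is twice what the bcopy convention would report.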

The sequential nature of the memory references is the benchmark's biggest weakness. Because every access is sequential, the benchmark does not expose limitations in how well a system's interconnect moves data between arbitrary points in the system.

Bisection Bandwidth – Easy Modification of STREAM Benchmark

To test for bisection bandwidth, processes are bound to processors in sequential order. The memory is allocated in reverse order, so that the memory is placed non-local to the process. The benchmark is then run.  If the system is capable of page migration, this feature must be turned off.
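The effect of allocating in reverse order can be modeled with a few lines of C. This is only a sketch under stated assumptions: loop iterations are split into equal contiguous blocks across threads (as a static OpenMP schedule typically does), a page is placed on the chip of the thread that first touches it, threads are bound to chips in sequential order, and the element count divides evenly. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Thread that runs loop iteration `iter` when n iterations are split into
 * equal contiguous blocks over `threads` threads (n % threads == 0). */
static size_t block_owner(size_t iter, size_t n, size_t threads)
{
    assert(threads > 0 && n % threads == 0 && iter < n);
    return iter / (n / threads);
}

/* Thread that first touches element j when the initialization loop runs
 * from j = n-1 down to 0: iteration number n-1-j visits element j. */
static size_t reverse_init_thread(size_t j, size_t n, size_t threads)
{
    return block_owner(n - 1 - j, n, threads);
}

/* Thread that accesses element j in the forward-running kernels. */
static size_t kernel_thread(size_t j, size_t n, size_t threads)
{
    return block_owner(j, n, threads);
}

/* Chip hosting thread t when threads are bound to chips sequentially. */
static size_t chip_of_thread(size_t t, size_t threads, size_t chips)
{
    assert(chips > 0 && threads % chips == 0 && t < threads);
    return t / (threads / chips);
}

/* Returns 1 if every element is placed on one chip and used from another. */
static int all_accesses_nonlocal(size_t n, size_t threads, size_t chips)
{
    for (size_t j = 0; j < n; j++) {
        size_t home = chip_of_thread(reverse_init_thread(j, n, threads),
                                     threads, chips);
        size_t user = chip_of_thread(kernel_thread(j, n, threads),
                                     threads, chips);
        if (home == user)
            return 0;
    }
    return 1;
}
```

In this model the memory of chip k is used by chip (chips - 1 - k), so with an even chip count, such as the 2-, 4-, and 8-chip systems measured here, no access is local. With an odd chip count the middle chip would still see local memory, and a different placement would be needed.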


Key Points and Best Practices

The STREAM benchmark code was compiled for the SPARC M7 processor-based systems with the following flags (using cc):

-fast -m64 -W2,-Avector:aggressive -xautopar -xreduction -xpagesize=4m

The benchmark code was compiled for the x86-based systems with the following flags (Intel icc compiler):

-O3 -m64 -xCORE-AVX2 -ipo -openmp -mcmodel=medium -fno-alias -nolib-inline

On Oracle Solaris, binding is accomplished by setting either the environment variable SUNW_MP_PROCBIND or the OpenMP variables OMP_PROC_BIND and OMP_PLACES.

export OMP_NUM_THREADS=512
export SUNW_MP_PROCBIND=0-511

On Oracle Linux systems using the Intel compiler, binding is accomplished by setting the environment variable KMP_AFFINITY.

export OMP_NUM_THREADS=72
export KMP_AFFINITY='verbose,granularity=fine,proclist=[0-71],explicit'

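The following source code change in the file stream.c performs the reverse allocation (the `<` block is the modified, reverse-order initialization loop; the `>` block is the original loop):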

<     for (j=STREAM_ARRAY_SIZE-1; j>=0; j--) { 
            a[j] = 1.0; 
            b[j] = 2.0; 
            c[j] = 0.0; 
        }
---
>     for (j=0; j<STREAM_ARRAY_SIZE; j++) {
            a[j] = 1.0; 
            b[j] = 2.0; 
            c[j] = 0.0; 
        }
 

Disclosure Statement

Copyright 2015, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 10/25/2015.

For E5-2699 v4 results, copyright 2016, Oracle and/or its affiliates. All rights reserved.  Oracle and Java are registered trademarks of Oracle and/or its affiliates.  Other names may be trademarks of their respective owners. Results as of 6/29/2016.
