
Everything you want and need to know about Oracle SPARC systems performance

Memory and Bisection Bandwidth: SPARC S7 Performance

Brian Whitney
Principal Software Engineer

The STREAM benchmark measures delivered memory bandwidth on a variety of memory-intensive tasks. Delivered memory bandwidth is key to a server delivering high performance on a wide variety of workloads. The STREAM benchmark is typically run so that each chip in the system gets its memory requests satisfied from local memory. This report presents the performance of Oracle's SPARC S7 processor based servers and compares it to that of 2-chip x86 servers.

Bisection bandwidth on a server is a measure of the cross-chip data bandwidth between the processors of a system where no memory access is local to the processor.  Systems with large cross-chip penalties show dramatically lower bisection bandwidth.  Real-world ad hoc workloads tend to perform better on systems with better bisection bandwidth because their memory usage characteristics tend to be chaotic.

The STREAM benchmark is easy to run and anyone can measure memory bandwidth on a target system (see Key Points and Best Practices section).

  • The SPARC S7-2L server delivers nearly 100 GB/sec on the STREAM benchmark.  

Performance Landscape

The following SPARC and x86 STREAM results were run as part of this benchmark effort. The SPARC S7-2L server delivers roughly 74% to 95% of the STREAM bandwidth of the x86 E5-2699 v4 and v3 based servers, depending on the kernel, but with fewer than half the cores, so more bandwidth is available per core.


Maximum STREAM Benchmark Performance

System (2 chips)            Total    Bandwidth (MB/sec, 1 MB = 10^6 bytes)
                            Cores    Copy       Scale      Add        Triad
SPARC S7-2L (16 x 32 GB)    16       98,581     93,274     96,431     96,315
SPARC S7-2 (16 x 64 GB)     16       90,285     90,163     87,178     87,051
x86 E5-2699 v4              44       120,939    121,417    129,775    130,242
x86 E5-2699 v3 (COD)        36       103,927    105,262    117,688    117,680
x86 E5-2699 v3              36       105,622    105,808    113,116    112,521
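
The "per core" comparison can be made concrete with a little arithmetic. The following is a minimal sketch that divides the Triad numbers from the table above by the total core counts; the struct and function names are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/* One row of the Maximum STREAM table (Triad column only).
   The type and field names are illustrative, not part of STREAM. */
struct stream_result {
    const char *system;
    int cores;          /* total cores across both chips */
    double triad_mb_s;  /* Triad bandwidth in MB/sec (10^6 bytes) */
};

/* Values copied from the table above. */
static const struct stream_result results[] = {
    { "SPARC S7-2L",          16,  96315.0 },
    { "SPARC S7-2",           16,  87051.0 },
    { "x86 E5-2699 v4",       44, 130242.0 },
    { "x86 E5-2699 v3 (COD)", 36, 117680.0 },
    { "x86 E5-2699 v3",       36, 112521.0 },
};

/* Triad bandwidth available per core, in MB/sec. */
double triad_per_core(const struct stream_result *r) {
    return r->triad_mb_s / (double)r->cores;
}
```

With these inputs, the SPARC S7-2L works out to about 6,020 MB/sec per core on Triad, against about 2,960 MB/sec per core for the x86 E5-2699 v4 system.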

All of the following bisection bandwidth results were run as part of this benchmark effort. The SPARC S7-2L server delivers more bisection bandwidth than the x86 E5-2699 v4 and v3 based servers (roughly 12% to 27% more, depending on the kernel and the x86 system) and does so with significantly fewer cores, so more bandwidth is available per core.

Bisection Bandwidth Benchmark Performance (Nonlocal STREAM)

System (2 chips)    Total    Bandwidth (MB/sec, 1 MB = 10^6 bytes)
                    Cores    Copy      Scale     Add       Triad
SPARC S7-2L         16       57,443    57,020    56,815    56,562
x86 E5-2699 v4      44       50,153    50,366    50,266    50,265
x86 E5-2699 v3      36       45,211    45,331    47,414    47,251
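
One way to read the cross-chip penalty from these results is to express the nonlocal bandwidth as a fraction of the local STREAM bandwidth for the same system. The sketch below does that for the Triad kernel; the function name is hypothetical, and the input values are the Triad figures reported in this post.

```c
/* Fraction of local STREAM bandwidth that is retained when every
   memory access is nonlocal. Both arguments are in MB/sec. */
double nonlocal_fraction(double nonlocal_mb_s, double local_mb_s) {
    return nonlocal_mb_s / local_mb_s;
}

/* Triad figures from this post (nonlocal, local):
     SPARC S7-2L     56562.0,  96315.0
     x86 E5-2699 v4  50265.0, 130242.0
     x86 E5-2699 v3  47251.0, 112521.0  */
```

On Triad, the SPARC S7-2L retains about 59% of its local bandwidth when all accesses cross chips, compared with about 39% for the E5-2699 v4 system and about 42% for the E5-2699 v3 system.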

Configuration Summary

SPARC Configurations:

SPARC S7-2L
2 x SPARC S7 processors (4.267 GHz)
512 GB memory (16 x 32 GB DIMMs)

SPARC S7-2
2 x SPARC S7 processors (4.267 GHz)
1 TB memory (16 x 64 GB DIMMs)

Oracle Solaris 11.3
Oracle Developer Studio 12.5

x86 Configurations:

Oracle Server X6-2
2 x Intel Xeon Processor E5-2699 v4
256 GB memory (16 x 16 GB DIMMs)

Oracle Server X5-2
2 x Intel Xeon Processor E5-2699 v3
256 GB memory (16 x 16 GB DIMMs)

Oracle Linux 7.1
Intel Parallel Studio XE Composer Version 2016 compilers

Benchmark Description

STREAM

The STREAM benchmark measures sustainable memory bandwidth (in MB/s) for simple vector compute kernels. All memory accesses are sequential, so a picture of how fast regular data may be moved through the system is portrayed. Properly run, the benchmark displays the characteristics of the memory system of the machine and not the advantages of running from the system's memory caches.
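
For readers who have not looked at the source, the four "simple vector compute kernels" can be sketched as follows. This is an illustration only: in stream.c the loops operate on global arrays and are timed inline, whereas the function wrappers and parameter names here are assumptions made for readability.

```c
#include <stddef.h>

/* Copy: c = a (no arithmetic). */
void stream_copy(double *c, const double *a, size_t n) {
    for (size_t j = 0; j < n; j++)
        c[j] = a[j];
}

/* Scale: b = scalar * c. */
void stream_scale(double *b, const double *c, double scalar, size_t n) {
    for (size_t j = 0; j < n; j++)
        b[j] = scalar * c[j];
}

/* Add: c = a + b. */
void stream_add(double *c, const double *a, const double *b, size_t n) {
    for (size_t j = 0; j < n; j++)
        c[j] = a[j] + b[j];
}

/* Triad: a = b + scalar * c. */
void stream_triad(double *a, const double *b, const double *c,
                  double scalar, size_t n) {
    for (size_t j = 0; j < n; j++)
        a[j] = b[j] + scalar * c[j];
}
```

Every loop walks its arrays from index 0 upward with unit stride, which is what makes the access pattern sequential.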

STREAM counts the bytes read plus the bytes written to memory. For the simple Copy kernel, this is exactly twice the number obtained from the bcopy convention. STREAM does this because three of the four kernels (Scale, Add and Triad) do arithmetic, so it makes sense to count both the data read into the CPU and the data written back from the CPU.  The Copy kernel does no arithmetic, but, for consistency, counts bytes the same way as the other three.
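
The counting convention can be written down directly. The sketch below assumes 8-byte double elements, the default in stream.c; the enum and function names are hypothetical. Copy and Scale touch two arrays per iteration (one read, one write), while Add and Triad touch three (two reads, one write).

```c
#include <stddef.h>

enum stream_kernel { STREAM_COPY, STREAM_SCALE, STREAM_ADD, STREAM_TRIAD };

/* Number of arrays each kernel reads plus writes per iteration. */
static int arrays_touched(enum stream_kernel k) {
    switch (k) {
    case STREAM_COPY:  return 2;  /* read a, write c */
    case STREAM_SCALE: return 2;  /* read c, write b */
    case STREAM_ADD:   return 3;  /* read a and b, write c */
    case STREAM_TRIAD: return 3;  /* read b and c, write a */
    }
    return 0;
}

/* Bytes STREAM counts for one pass of kernel k over n elements. */
double stream_bytes(enum stream_kernel k, size_t n) {
    return (double)arrays_touched(k) * (double)sizeof(double) * (double)n;
}

/* Reported bandwidth in MB/sec, where 1 MB = 10^6 bytes. */
double stream_mb_per_sec(enum stream_kernel k, size_t n, double seconds) {
    return 1.0e-6 * stream_bytes(k, n) / seconds;
}
```

For example, a Copy over 1,000 elements counts 16,000 bytes, whereas a count of only the bytes copied would give 8,000.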

The sequential nature of the memory references is the benchmark's biggest weakness. The benchmark does not expose limitations in a system's interconnect to move data from anywhere in the system to anywhere.

Bisection Bandwidth – Easy Modification of STREAM Benchmark

To test for bisection bandwidth, processes are bound to processors in sequential order. The memory is allocated in reverse order, so that the memory is placed non-local to the process. The benchmark is then run.  If the system is capable of page migration, this feature must be turned off.
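
The following sketch illustrates why reversing the allocation order makes the memory nonlocal. It rests on assumptions that are not stated in the text: that the operating system places each page near the thread that first touches it, and that both the initialization loop and the kernel loops are split evenly across threads in iteration order (a static schedule). The type and function names are hypothetical, and the compiler's actual loop scheduling may differ.

```c
#include <stddef.h>

/* Half-open range [lo, hi) of array elements. */
typedef struct { size_t lo, hi; } range_t;

/* Iterations given to thread t of nthreads under an even static split
   (assumed schedule). In a forward loop, iteration k touches element k,
   so this is also the range thread t works on in the STREAM kernels. */
range_t static_chunk(size_t n, int nthreads, int t) {
    range_t r;
    r.lo = n * (size_t)t / (size_t)nthreads;
    r.hi = n * (size_t)(t + 1) / (size_t)nthreads;
    return r;
}

/* Elements first touched by thread t when the initialization loop runs
   from n-1 down to 0: iteration k touches element n-1-k. */
range_t reverse_init_elements(size_t n, int nthreads, int t) {
    range_t it = static_chunk(n, nthreads, t);
    range_t r;
    r.lo = n - it.hi;
    r.hi = n - it.lo;
    return r;
}
```

With 1,024 elements and 4 threads, thread 0 runs the kernels over elements 0 to 255 but, under these assumptions, first touched elements 768 to 1,023. If threads 0 and 1 are bound to one chip and threads 2 and 3 to the other, every kernel access then lands on the other chip's memory.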

Key Points and Best Practices

The STREAM benchmark code was compiled for the SPARC S7 processor based systems with the following flags using the Oracle Developer Studio 12.5 C compiler:

-fast -m64 -W2,-Avector:aggressive -xautopar -xreduction -xpagesize=4m

The benchmark code was compiled for the x86 based systems with the following flags (Intel icc compiler):

-O3 -m64 -xCORE-AVX2 -ipo -openmp -mcmodel=medium -fno-alias -nolib-inline 

On Oracle Solaris, binding is accomplished by setting either the environment variable SUNW_MP_PROCBIND or the OpenMP variables OMP_PROC_BIND and OMP_PLACES.

export OMP_NUM_THREADS=128
export SUNW_MP_PROCBIND=0-127

On Oracle Linux systems using the Intel compiler, binding is accomplished by setting the environment variable KMP_AFFINITY.

export OMP_NUM_THREADS=72
export KMP_AFFINITY='verbose,granularity=fine,proclist=[0-71],explicit'

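The source code change in the file stream.c that performs the reverse allocation is shown below in diff format. Lines marked < are the modified loop, which initializes the arrays in reverse order, and lines marked > are the original forward loop: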

<     for (j=STREAM_ARRAY_SIZE-1; j>=0; j--) {
<             a[j] = 1.0;
<             b[j] = 2.0;
<             c[j] = 0.0;
<         }
---
>     for (j=0; j<STREAM_ARRAY_SIZE; j++) {
>             a[j] = 1.0;
>             b[j] = 2.0;
>             c[j] = 0.0;
>         }

Disclosure Statement

Copyright 2016, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 6/29/2016.
