Thursday Sep 13, 2007

Stream and the performance impact of compiler optimization


A couple of weeks back there was a discussion on the perf-discuss alias around memory bandwidth/throughput benchmarks, and as is customary in such discussions, stream got a mention. During the discussion a couple of suggestions were made regarding compiler options to use with stream. Since we gather some of this data on an ongoing basis as part of Sun's Performance Lifestyle, I decided to throw in a couple of extra experiments to demonstrate the kind of impact that various compiler options can have.

To keep the test reproducible for most people, and to avoid annoying others by taking up valuable test cycles on our larger machines, the numbers in this post were generated on a Sun Fire X2100 M2 Server, which is a single-CPU, dual-core amd64 box. (I would have liked to give one of our Sun Fire X4600s a run with all of these experiments, but they are very heavily utilised by the various engineering groups we work with.) The ARRAY_SIZE for our problems was set using the L2 cache size script mentioned previously.
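As a rough illustration of sizing the arrays from the cache size, here is a minimal sketch in C. The helper name stream_array_size is hypothetical, and the factor of four comes from STREAM's general guidance that each array should be at least four times the size of the available cache. Whether the script referred to above uses the same rule is an assumption.

```c
#include <stddef.h>

/* Hypothetical helper: choose a STREAM array length (in elements) from
 * the cache size. The factor of 4 follows STREAM's guidance that each
 * array be at least four times the available cache; it is an assumption
 * that the script mentioned in the post applies the same factor. */
size_t stream_array_size(size_t cache_bytes, size_t element_bytes)
{
    const size_t factor = 4;
    return (cache_bytes * factor) / element_bytes;
}
```

The returned element count is the kind of value that would be fed to ARRAY_SIZE when building stream.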

Secondly, as mentioned before, my group does not generate benchmark numbers for publication. In general we run stream with -xopenmp -fast as a more out-of-the-box type of compilation. Here, however, we take a baseline with no compiler options passed in, consisting of twenty iterations of stream, call that our 100% point, and express every other result as a percentage of that baseline. With all of that aside, let's move on to the experiments.
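The normalisation against the baseline can be sketched in a few lines of C. The function names are illustrative, and averaging the iterations is an assumption, since the post does not say how the twenty runs are combined.

```c
#include <stddef.h>

/* Mean of n bandwidth samples, e.g. the rate reported by each stream run.
 * Averaging is an assumption; the post does not state how the twenty
 * iterations are aggregated. */
double mean_bandwidth(const double *samples, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return n ? sum / (double)n : 0.0;
}

/* Express a result as a percentage of the baseline, so that the
 * baseline itself comes out as 100.0. */
double percent_of_baseline(double result, double baseline)
{
    return 100.0 * result / baseline;
}
```

With this scheme a run that sustains one and a half times the baseline bandwidth is reported as 150%.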

Experiment Details

The rig, as mentioned above, is a Sun Fire X2100 M2 with 1 x 2400MHz chip and 1GB of RAM. The OS used is the latest version of Solaris Developer Express (Nevada 70b for those following from OpenSolaris). The compiler is Sun Studio 12, 2007/05.

The compiler options used here are not an exhaustive comparison. Rather, they give a general indication of the kind of optimization that Studio 12 can do, and of the impact that your compiler can have on the performance of your application. It should be noted, though, that Stream is a benchmark which is highly suited to being optimized. For a more accurate set of results we use OpenMP. We have two tables of results below, one with OMP_NUM_THREADS set to the core count (two in this case) and one with OMP_NUM_THREADS set to the physical processor count (one in this case).
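For readers who have not looked inside stream, the four measured kernels boil down to loops like the ones below. This is a simplified sketch rather than the benchmark source: the function names are illustrative, and timing and validation are left out. The OpenMP pragmas only take effect when the compiler is asked to honour them (-xopenmp with Sun Studio, -fopenmp with gcc); otherwise they are ignored and the loops run serially. At run time OMP_NUM_THREADS then decides how many threads share each loop.

```c
/* Simplified versions of the four STREAM kernels. Names are illustrative. */

/* copy: c = a */
void stream_copy(double *c, const double *a, long n)
{
    #pragma omp parallel for
    for (long j = 0; j < n; j++)
        c[j] = a[j];
}

/* scale: b = scalar * c */
void stream_scale(double *b, const double *c, double scalar, long n)
{
    #pragma omp parallel for
    for (long j = 0; j < n; j++)
        b[j] = scalar * c[j];
}

/* add: c = a + b */
void stream_add(double *c, const double *a, const double *b, long n)
{
    #pragma omp parallel for
    for (long j = 0; j < n; j++)
        c[j] = a[j] + b[j];
}

/* triad: a = b + scalar * c */
void stream_triad(double *a, const double *b, const double *c,
                  double scalar, long n)
{
    #pragma omp parallel for
    for (long j = 0; j < n; j++)
        a[j] = b[j] + scalar * c[j];
}
```

Each loop does almost no arithmetic per byte moved, which is why the results respond so strongly to options that change how memory is accessed.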

The Results

The data here is pretty self-explanatory: each figure is a percentage of the baseline, and higher is better. Of the experiments we did here, the best-performing options were -fast -xopenmp -xvector=simd -xprefetch -xprefetch_level=3, run with OMP_NUM_THREADS=2 set in the environment.

1 OpenMP Thread
metric  no options  -fast    -fast -xopenmp  -fast -xopenmp  -fast -xopenmp            -fast -xvector=simd
                                                             -xvector=simd -xprefetch  -xprefetch -xprefetch_level=3
add     100.00%     118.50%  117.04%         130.08%         188.93%                   187.18%
copy    100.00%     126.39%  124.94%         217.72%         214.05%                   214.68%
scale   100.00%     123.97%  122.48%         214.20%         210.21%                   212.20%
triad   100.00%     118.34%  116.67%         129.84%         186.03%                   188.86%

2 OpenMP Threads
metric  no options  -fast    -fast -xopenmp  -fast -xopenmp  -fast -xopenmp            -fast -xvector=simd
                                                             -xvector=simd -xprefetch  -xprefetch -xprefetch_level=3
add     100.00%     118.50%  177.71%         245.01%         301.24%                   187.33%
copy    100.00%     126.44%  181.02%         323.97%         324.35%                   214.55%
scale   100.00%     124.10%  175.78%         319.18%         319.12%                   212.20%
triad   100.00%     118.41%  177.07%         245.63%         298.91%                   188.94%

Compiler Options

The compiler options used here represent a small subset of what you can do with Sun Studio 12, and are covered in a lot more detail in the extensive documentation. Most of the options used above are self-explanatory, but the two that may be of interest are -xvector and -xprefetch.

-xvector=simd Instructs the compiler to use SIMD (Single Instruction, Multiple Data) instructions. Basically this lets us process several chunks of data in one operation rather than one at a time. Stream is heavily vector oriented, so this gives us a sizeable performance gain.
-xprefetch -xprefetch_level=3 This option enables prefetching, at the highest level the compiler supports. Prefetching is a mechanism by which data is speculatively fetched from memory into the CPU cache. Certain processor architectures (e.g. SPARC, and in this case amd64) will do a certain amount of prefetching on their own, but you can instruct the compiler to insert even more prefetch instructions. In the case of stream we are processing large arrays, which lend themselves very well to this kind of optimization, but it's one that should be used with some caution.
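To give a feel for what inserting prefetch instructions means, here is a hand-written sketch of a triad-style loop with explicit prefetches. It is not what Studio 12 emits: it uses __builtin_prefetch, which is a GCC/Clang builtin standing in for the compiler-generated instructions, and PREFETCH_DISTANCE is a hypothetical tuning value chosen only for illustration.

```c
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead to prefetch. */
#define PREFETCH_DISTANCE 64

/* Triad-style loop, a[j] = b[j] + scalar * c[j], with explicit
 * prefetches. __builtin_prefetch(addr, rw, locality) is a GCC/Clang
 * builtin: rw is 0 for a read and 1 for a write, and locality 0 means
 * the data is not expected to be reused soon. */
void triad_with_prefetch(double *a, const double *b, const double *c,
                         double scalar, size_t n)
{
    for (size_t j = 0; j < n; j++) {
        /* Only form addresses that lie inside the arrays. */
        if (j + PREFETCH_DISTANCE < n) {
            __builtin_prefetch(&b[j + PREFETCH_DISTANCE], 0, 0);
            __builtin_prefetch(&c[j + PREFETCH_DISTANCE], 0, 0);
            __builtin_prefetch(&a[j + PREFETCH_DISTANCE], 1, 0);
        }
        a[j] = b[j] + scalar * c[j];
    }
}
```

The caution mentioned above applies here too: prefetching too far ahead, or prefetching data that is never used, costs memory bandwidth and cache space without giving anything back, so a distance that helps one loop on one machine may hurt elsewhere.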

Further Reading

The compiler folks are continuously publishing new articles with tips and suggestions on how to get the most out of your compiler, and these are well worth reading. It's also worth signing up to the Sun Developer Network to get the free downloads of Studio 12.