Stream and the performance impact of compiler optimization


A couple of weeks back there was a discussion on the perf-discuss alias about memory bandwidth/throughput benchmarks, and as is customary in such a discussion, stream got a mention. During the discussion a couple of suggestions were made regarding various compiler options to use with stream. Seeing as we gather some of this data on an ongoing basis as part of Sun's Performance Lifestyle, I decided at the time to throw in a couple of extra experiments to demonstrate the kind of impact that various compiler options can have.

To keep the test reproducible for most people, the numbers this post is based on were generated on a Sun Fire X2100 M2 Server, which is a single-CPU, dual-core amd64 box. (I would have liked to give one of our Sun Fire X4600s a run with all of these experiments, but they are very heavily utilised by the various engineering groups we work with, and I didn't want to annoy other people by taking up valuable test cycles on our larger machines.) The ARRAY_SIZE for our problems was set using the L2 cache size script mentioned previously.

Secondly, as mentioned before, my group does not generate benchmark numbers for publication. In general we run stream with -xopenmp -fast as a more out-of-the-box type of compilation. Here, however, we take a baseline with no compiler options passed in, consisting of twenty iterations of stream, call that our 100% point, and express all other results as a percentage of that baseline. With all of that aside, let's move on to the experiments.

Experiment Details

The rig, as mentioned above, is a Sun Fire X2100, 1 x 2400MHz M2 chip, 1GB of RAM. The OS used is the latest version of Solaris Developer Express (Nevada 70b for those following from OpenSolaris). The compiler is Sun Studio 12, 2007/05.

The compiler options used here are not an exhaustive comparison; rather, they give a general indication of the kind of optimization that Studio 12 can do, and of the impact that your compiler can have on the performance of your application. It should be noted, though, that Stream is a benchmark which is highly suited to being optimized. For a more accurate set of results we use OpenMP. We have two tables of results below, one with OMP_NUM_THREADS set to the core count (two in this case) and one with OMP_NUM_THREADS set to the physical processor count (one in this case).

The Results

The data here is pretty self-explanatory: the darker the green, the better the number. Of the experiments we did here, the best-performing options were -fast -xopenmp -xvector=simd -xprefetch -xprefetch_level=3, run with OMP_NUM_THREADS=2 set in our environment.

1 OpenMP Thread

Compiler options for each column:
  A: no options
  B: -fast
  C: -fast -xopenmp
  D: -fast -xopenmp
  E: -fast -xopenmp -xvector=simd -xprefetch
  F: -fast -xvector=simd -xprefetch -xprefetch_level=3

metric        A         B         C         D         E         F
add       100.00%   118.50%   117.04%   130.08%   188.93%   187.18%
copy      100.00%   126.39%   124.94%   217.72%   214.05%   214.68%
scale     100.00%   123.97%   122.48%   214.20%   210.21%   212.20%
triad     100.00%   118.34%   116.67%   129.84%   186.03%   188.86%

2 OpenMP Threads

Compiler options for each column:
  A: no options
  B: -fast
  C: -fast -xopenmp
  D: -fast -xopenmp
  E: -fast -xopenmp -xvector=simd -xprefetch
  F: -fast -xvector=simd -xprefetch -xprefetch_level=3

metric        A         B         C         D         E         F
add       100.00%   118.50%   177.71%   245.01%   301.24%   187.33%
copy      100.00%   126.44%   181.02%   323.97%   324.35%   214.55%
scale     100.00%   124.10%   175.78%   319.18%   319.12%   212.20%
triad     100.00%   118.41%   177.07%   245.63%   298.91%   188.94%

Compiler Options

The compiler options used here represent a small subset of what you can do with Sun Studio 12, and are covered in a lot more detail in the extensive documentation. Most of the options used above are self-explanatory, but the two that may be of interest are -xvector and -xprefetch.

-xvector=simd Instructs the compiler to use SIMD (Single Instruction, Multiple Data) instructions. Basically this allows us to deal with several chunks of data in one operation rather than in multiple ones. Stream is heavily vector oriented, so this gives us a sizeable performance gain.
-xprefetch -xprefetch_level=3 Enables prefetching, at the highest level the compiler supports. Prefetching is a mechanism by which data is speculatively fetched from memory into the CPU cache. Certain processor architectures (e.g. SPARC, and in this case amd64) will do a certain amount of prefetching themselves, but you can instruct the compiler to insert even more prefetch instructions. In the case of stream we are processing large arrays, which lends itself very well to this kind of optimization, but it's one that should be used with some caution.

Further Reading

The compiler folks are continuously publishing new articles containing tips and suggestions on how to get the most out of your compiler, which are well worth reading. It's also worth signing up to the Sun Developer Network to get the free downloads of Studio 12.

Next step (though it is probably very hard) is a compiler that does not need people trying every possible combination of options, and can automatically use the best optimizations...

Posted by Marc on September 13, 2007 at 11:07 AM IST #

In fairness I think -fast does solve most cases for people. The two extra optimizations noted here do give a massive boost on this workload, but probably wouldn't for a lot of applications.

On saying that though, yeah, I'd love to see a compiler that could work out the best optimization itself, but I think some human interaction is going to be needed for a bit longer.

Posted by fintanr on September 13, 2007 at 11:34 AM IST #

Just curious: Why didn't you try -xipo=2 (interprocedural optimizer) ?

Posted by Roland Mainz on September 13, 2007 at 01:35 PM IST #

Actually using -fast isn't realistic, as it is a macro which will expand to switches which optimize for that particular CPU ISA and timing. What that means is that the binary generated cannot be run across a range of different processors.

One could cook something up with isaexec, but even then -fast wouldn't be appropriate.

Posted by UX-admin on September 14, 2007 at 01:00 AM IST #

Hi Folks,

Roland, -xipo=2 was just one of the experiments I didn't add in. It would make an interesting datapoint though, so if I can get time on the rig again I'll gather the data.

UX - yep, I understand the -fast issue, but in the context of benchmarking on a particular box I think it's quite appropriate.

Posted by fintanr on September 14, 2007 at 03:55 AM IST #
