Thursday May 28, 2009

Offchip bandwidth enhancement using compression

My slides on using light-weight compression to enhance the available offchip bandwidth of future processors can now be found here. By way of introduction: we found that light-weight compression schemes can improve effective offchip bandwidth by over 3X on a wide variety of important workloads.

Monday Mar 16, 2009

Offchip bandwidth and compression presentation

I'm presenting on leveraging compression to increase the effective offchip bandwidth of multicore processors at this week's Multicore Expo in Santa Clara. Details here.

Monday Nov 17, 2008

SPECcpu and autopar

Browsing a few internet forums, it is interesting to note that there is quite a debate over the use of autopar in SPEC results, with some calling for its use to be disallowed. While it is true that when autopar is permitted the resulting performance is no longer single-thread performance, it is nevertheless an interesting measure of how well a particular processor/compiler combination performs. (Even before the common use of autopar, SPEC was never just a measure of processor performance; it was heavily influenced by the sophistication of the compiler.) Further, on today's multi-core processors, where caches and off-chip bandwidth are shared, even single-thread SPEC runs don't give all of the information one needs to fully understand the processor. Rather, SPEC ratio is merely an indication of how well the processor/system can run a single instance of an application (i.e. peak single-application performance), while SPECrate is a measure of peak throughput.

Finally, it can be argued that many HPC codes are amenable to autopar, so its use in SPECfp is relevant. Its use in SPECint is more problematic, as most integer codes are difficult to noticeably accelerate using autopar -- it's just unfortunate that SPECint2006 includes libquantum...

Friday Nov 14, 2008

OpenSPARC Internals


With my recent talk of T2s, CMTs and autopar, I thought it might be interesting to provide a link to Sun's recently released OpenSPARC Internals book. It provides detailed information on many of these topics and can be downloaded for free here. I contributed the first chapter, so it's probably safe to skip that one :-)

Thursday Nov 13, 2008

UltraSPARC T2, SPECcpu and autopar

Following on from previous discussions of the benefits of autopar for SPECcpu, the next logical question is: what kind of benefit does autopar provide on a CMT processor like the UltraSPARC T2? On UltraSPARC T2 we have a multitude of hardware strands available and, as previously discussed, the bare-metal inter-thread communication latencies are extremely low. I talked with some of the compiler gurus at Sun and, sure enough, this analysis had been undertaken for SPECfp. The results are as follows:

Pretty cool! 7 of the suite's benchmarks show some benefit, with 1 showing over a 30X speedup, a further 3 showing benefits of over 10X, and the remainder showing 2-4X improvements. Some of the benchmarks show peak performance when using fewer than 64 threads. This is not unexpected: this is an out-of-the-box run, and given that T2 has shared pipelines and, like most multi-core processors, shared caches and offchip bandwidth, some tweaking is required to maximize performance.

Tuesday Nov 11, 2008

SPECfp and autopar

Continuing the SPECcpu theme, an interesting paper from Intel describing the performance upticks from autopar, SSE and other optimizations can be found here. On a dual-core processor they show decent gains on 6 benchmarks and slight gains on a further two.

Similarly, using the Sun Studio compiler, 8 benchmarks benefit from autopar, delivering a 16% improvement in the geometric mean on a dual-core processor (as illustrated here).

Wednesday Oct 08, 2008

SPECcpu : an increasingly multi-threaded benchmark

It is interesting to look at recent SPEC2006 results and compare them with the results from just a year or so ago. In addition to the normal improvements one would expect as compilers become increasingly familiar with this fairly new benchmark suite, autoparallelism is being employed to boost scores on this traditionally single-thread performance benchmark.

For SPECint, the autopar gains seem to be limited to one benchmark – libquantum, while on the FP side, there are several. Looking at libquantum:

1) August 07,    3.00 GHz Intel Dual core, 4MB L2$  (4MB between 2-cores), no autopar : libquantum  31.6
2) August 07,    3.00 GHz Intel Dual core, 4MB L2$  (4MB between 2-cores), autopar :    libquantum  78.9
3) September 08, 3.20 GHz Intel Quad core, 12MB L2$ (6MB between 2-cores), autopar :    libquantum 283
4) September 08, 2.66 GHz Intel Six core,  9MB L2$ (3MB between 2-cores), autopar :     libquantum 316

For libquantum there are obviously other compiler optimizations at work in these latest benchmark submissions (as evidenced by various interim results, e.g. here), but the comparison of (3) and (4) alone illustrates the power of threading: a 50% reduction in per-core L2$ and a roughly 17% reduction in frequency, yet the libquantum score improves by 10%. Indeed, if you factor out the frequency difference, the 4-core to 6-core scaling is very good. It is also interesting to note that the new libquantum scores are around 10X higher than the scores for the other benchmarks. In fact, if you assume that (3) would be, say, 3X lower without autoparallelization support in the compiler, then the processor's whole SPECint ratio (the geometric mean of all of the benchmark scores) falls by nearly 10%. If the libquantum score is reduced to that seen in (1), the overall score falls by around 17%!
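The geometric-mean arithmetic here is easy to sanity-check. The following is an illustrative back-of-the-envelope sketch, assuming the 12 benchmarks of the SPECint2006 suite and ignoring base/peak and rounding details: scaling one benchmark's score by a factor r scales the whole geometric mean by r to the power 1/12.

```python
# Sensitivity of an n-benchmark geometric mean to a single benchmark's score:
# multiplying one component score by r multiplies the geometric mean by r**(1/n).

def geomean_scale(r, n=12):
    """Factor by which an n-benchmark geometric mean changes when one
    component score is scaled by r (SPECint2006 has 12 benchmarks)."""
    return r ** (1.0 / n)

# libquantum at a third of its autopar-assisted score: overall ratio falls ~9%
drop_3x = 1.0 - geomean_scale(1.0 / 3.0)
# libquantum all the way back from 283 to 31.6: overall ratio falls ~17%
drop_9x = 1.0 - geomean_scale(31.6 / 283.0)
print(f"{drop_3x:.1%}, {drop_9x:.1%}")  # 8.7%, 16.7%
```

The exact percentages in any published result also move with frequency and other compiler changes, so treat these as ballpark figures rather than a reconstruction of a specific submission.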

It is interesting to note that, comparing (1) and (3), the overall score has improved from 21 to 29.3 -- an almost 40% improvement. However, roughly half of that gain comes from threading libquantum! Of the remainder, some comes from frequency, some from compiler optimizations that significantly improve hmmer performance, and the rest from a number of other small increases.

It is interesting that the majority of the gains we have observed on this traditionally single-threaded benchmark over the last several processor generations come from multithreading...

Similarly, for SPECfp there are a number of benchmarks which benefit significantly from autoparallelization. This is not surprising, as FP workloads are typically more amenable to this style of optimization. Comparing recent results from a quad-core Opteron, both with and without autopar, there are 4 FP benchmarks that benefit significantly:

bwaves : 3X improvement

cactusADM : 7.6X improvement

GemsFDTD : 2.2X improvement

wrf : 1.48X improvement

Cumulatively, threading these four benchmarks delivers about a 30% improvement in the SPECfp score!
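That cumulative figure checks out: the combined effect on the geometric mean is the product of the individual speedups taken to the 1/17th power (SPECfp2006 has 17 benchmarks). A quick illustrative calculation:

```python
# Cumulative effect on the SPECfp2006 geometric mean (17 benchmarks)
# of the per-benchmark autopar speedups quoted above.
speedups = {"bwaves": 3.0, "cactusADM": 7.6, "GemsFDTD": 2.2, "wrf": 1.48}

product = 1.0
for s in speedups.values():
    product *= s

n_benchmarks = 17  # size of the SPECfp2006 suite
overall_gain = product ** (1.0 / n_benchmarks) - 1.0
print(f"{overall_gain:.1%}")  # ~29%, i.e. "about a 30% improvement"
```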

Thursday Aug 21, 2008

Accelerate Multithreaded Applications with CMT Processors -- crypto & microparallelism

A new CMT whitepaper discussing how to accelerate multithreaded applications on CMT processors can be found here. The whitepaper touches on high-performance cryptography on CMT processors and microparallelism techniques.

Friday Jun 13, 2008

Link compression

I've been spending some time over recent years looking into the potential for leveraging compression on various off-chip links. In July's issue of IEEE Transactions on Computers, there is an article covering some of this work, focusing specifically on links to memory. The paper can be found here.

Basically, the article illustrates that, on a wide variety of workloads (SPECcpu, media and commercial), a number of simple compression techniques yield significant reductions in the amount of data that needs to be transmitted over the link (for both reads and write-backs). Accordingly, leveraging these techniques can deliver significant benefits: cost reduction, power reduction and potentially even performance improvements.
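To give a flavor of how such value-locality schemes work (this is an illustrative toy encoding, not one of the schemes evaluated in the article): many of the words crossing the memory link are zero or small, so a short per-word tag plus a truncated payload can stand in for the full word.

```python
def compress_words(words, width=32):
    """Toy value-locality link compression for a line of 32-bit words.
    Each word gets a 2-bit tag: 0 = zero (no payload), 1 = fits in one
    sign-extended byte (1-byte payload), 2 = uncompressed (full payload).
    Returns the total number of bits transmitted (tags + payloads)."""
    bits = 0
    for w in words:
        bits += 2  # tag
        if w == 0:
            continue
        # Does the word fit in 8 bits after sign extension?
        signed = w - (1 << width) if w >= (1 << (width - 1)) else w
        if -128 <= signed <= 127:
            bits += 8
        else:
            bits += width
    return bits

# A 64-byte line (16 x 32-bit words) of mostly zero/small values:
line = [0, 0, 1, -5 & 0xFFFFFFFF, 100, 0, 7, 0xDEADBEEF] * 2
raw_bits = len(line) * 32
print(raw_bits, compress_words(line))  # 512 bits raw vs 160 compressed
```

In this toy example the line compresses by over 3X, which is in line with the kinds of reductions that make such schemes attractive; real schemes need to handle metadata framing and pick encodings tuned to the measured value distributions.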

Friday May 09, 2008

CommunityOne slides available -- Multicore Processors & Microparallelism

Thanks to everyone who attended my CommunityOne presentation earlier this week. The slides can be downloaded from here (search for Microparallelism) [Username: contentbuilder Password: doc789] -- there are significantly more microparallelism examples than in the recent Multicore Expo presentation.

Tuesday Apr 29, 2008

Multicore slides

My slides from Multicore are now available on the OpenSPARC website here.

Thursday Apr 10, 2008

CommunityOne and Microparallelism

I have a more detailed presentation on Microparallelism at the upcoming CommunityOne conference (May 5th). Details here.

Thursday Mar 13, 2008

Link compression

I have an article titled "Memory-Link Compression Schemes: A Value Locality Perspective", that will appear in the June issue of IEEE Transactions on Computers. The article can be found here.

Tuesday Mar 11, 2008

Multicore Expo 2008

I'll be presenting at the upcoming Multicore Expo on "Multicore Processors and Microparallelism". The agenda for the conference can be found here.

Wednesday Dec 05, 2007

T2 slideset

Located here

Dr. Spracklen is a senior staff engineer in the Architecture Technology Group (Sun Microelectronics), which is focused on architecting and modeling next-generation SPARC processors. His current focus is hardware accelerators.

