Friday Oct 14, 2005

Comparing Sun Performance Library to ACML, Goto, and ATLAS on Opteron

As promised, here are some comparisons of Perflib to the more popular BLAS implementations available for Opteron hardware. All these timings were run on the exact same hardware configuration. The Performance Library numbers were produced on a single CPU of a 4-way Opteron running at 2.2 GHz with 8 GB of RAM. The GOTO, ATLAS, and ACML runs were performed on the same hardware but running SuSE 9.3 Pro.

These are about as useful as the SPARC graphs I posted yesterday. Even such simple timers leave much room for variation. Bind to a processor or not? Run timers with a cold or warm dcache and icache? Use the 2.5 gnu64 ACML library or the 2.7 pgi64 library? A clever marketing critter could easily take this data and proclaim that virtually any of the libraries tested was the best BLAS implementation.

I contend that benchmarks or timers such as these, while marginally useful for producing glossy sales material, are next to useless for the application developer who is searching for performance delivered to a particular application. The much-referenced Linpack benchmark is only marginally superior to these contrived performance measurement tools, in my opinion.

The reader might notice that Perflib isn't the top performer in all graphs I posted. Even though I'm the technical lead for the Perflib product, I don't care. I don't care because it is my belief that timers in this class have very little value when trying to measure the worth of a scientific library. Tomorrow, I'll post some data showing what I believe to be a better method for judging the worth of a scientific library.

Thursday Oct 13, 2005

Performance Comparisons of Popular BLAS implementations

Historically, it's been hard to compare the performance of Perflib with competing libraries on SPARC hardware. For one, there were few credible competitors. Sure, you could download the netlib source code, compile with optimization, and use the resulting library, but you would be quite disappointed with the result. The routines which do the heavy lifting for the entire library (xGEMM, xAXPY, xDOT, xGEMV) would not perform anywhere near machine peak, and this poor performance would affect virtually the entire library.

A couple of clever fellows developed the Automatically Tuned Linear Algebra Software (ATLAS) system, which addresses many of these problems. You can get all the gory details here, but the big idea is that a series of scripts, automatic code generators, and some hand-written kernels can produce a very usable BLAS/LAPACK library.

Another clever fellow at the University of Texas by the name of Goto has produced some respectable BLAS implementations which used to be freely available. On certain platforms, this small number of routines turns in some really respectable performance. His homepage claims that "The GotoBLAS codes are currently the fastest implementations of the Basic Linear Algebra Subroutines (BLAS)." I'll present some data and let the reader draw their own conclusions.

These performance numbers were gathered on a 1.5 GHz UltraSPARC IV+ machine.

In the next couple of days, I'll post additional performance numbers for SPARC and Opteron hardware. There are more competitors on the commodity hardware (i.e., Opteron), and I'll present some comparisons among Goto, ATLAS, Perflib, and ACML.

Of course, the previous plots are the standard 100-1000 by steps of 100, which are nice because they're easy to produce, but they're mostly worthless since I've yet to see a benchmark or application which calls these routines in this manner. Future entries will present some of the application-specific comparisons which I alluded to in previous posts.

Monday Sep 19, 2005

On the benefits of a custom library

For the past five or six years, I've been advocating the repackaging of Perflib into what I call 'Custom Libraries' (pause for musical fanfare here). The idea first surfaced when I received a request from someone in the benchmarking group for a PerflibLite. It started me thinking about the benefits of such a thing with respect to customers and ISVs.

A custom library would contain only those routines necessary for the particular user or user application. If the application in question is expected to run on multiple chips, the library would contain the objects for each of the routines and each of the chips for optimal performance.

A custom library would greatly reduce the mass storage footprint of the Performance Library. I know that disk space is cheap, but that's no reason to keep useless bits around. There would also be a (perhaps negligible) benefit during the dynamic linking phase.

We already know that most applications use only a tiny fraction of any given scientific library during computation. That goes not only for Perflib but also for ATLAS, GOTO, ACML, MKL, NAG, and any number of other offerings. These libraries contain routines for a wide range of computation, but any given application accesses only a handful. The libraries also contain routines for various data types (single precision, double precision, complex, and double complex). It's a rare application which makes use of both single precision floating point and double precision complex routines.
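As a back-of-the-envelope illustration (the helper and the routine lists below are hypothetical, not a real packaging tool), the custom-library idea boils down to intersecting the library's contents with the symbols an application actually references:

```python
def custom_library_contents(library_routines, app_symbols):
    """Return only the routines an application actually references.
    Both the helper and the routine lists are hypothetical
    illustrations, not a real packaging tool."""
    return sorted(set(library_routines) & set(app_symbols))

# The full library carries every type variant; a typical application
# touches only a handful of routines:
full_library = ["daxpy", "ddot", "dgemm", "dgemv", "sgemm", "zgemm",
                "cgemm", "dasum", "dscal", "dnrm2"]
app_needs = ["dgemm", "daxpy", "dasum"]
print(custom_library_contents(full_library, app_needs))
# → ['dasum', 'daxpy', 'dgemm']
```

In practice, the list of referenced symbols could come from something like `nm -u` on the application binary, but the shipping mechanics would be more involved than this sketch suggests.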

One of the major benefits from a library provider's view is information about which routines are being used. This allows the developer/tuner some idea as to where to spend scarce tuning resources. Another benefit is the information concerning who is using a particular routine or class of routines. Given this information, users could be sent an e-mail informing them when a routine has had bugs fixed or performance enhanced.

If you think this idea has merit, drop me an e-mail or comment here.

Tuesday Sep 06, 2005

The Fortran Compiler (F95) and some clever optimizations

I thought I would write about some of the really good work being done in the Fortran compiler with respect to the MATMUL F90/F95 intrinsic. The compiler, or more specifically the middle end, is doing some very clever optimizations when it comes to handling calls to the MATMUL intrinsic.

In the simplest case, a user will use the MATMUL intrinsic to multiply two matrices together:

REAL*8, DIMENSION(10,10) :: A, B, C
C = MATMUL(A, B)

For those programmers familiar with the BLAS routines, this call can be re-written as follows:

call dgemm('n','n',10,10,10,1.0d0,A,10,B,10,0.0d0,C,10)

The BLAS3 interface for the xGEMM routines is very flexible and can handle such things as submatrices, alpha and beta with different values, etc. From a high level perspective, the F90/F95 language allows the same sort of flexibility.

C(1:5,1:5) = MATMUL(A(1:5,1:5),B(1:5,1:5))

While this offers the programmer an easy way to handle such things as submatrices, it may cause some performance issues depending on the matrix multiplication code that supports the intrinsic. If the low-level routine has the same flexibility as the BLAS3 xGEMM routine, the compiler could easily call the underlying routine as follows:

call dgemm('n','n',5,5,5,1.0d0,A,10,B,10,0.0d0,C,10)

If, on the other hand, the underlying routine is restrictive in terms of how it is called, the compiler might be forced to perform gather/scatter pairs on each of the matrices before and after the call to the computational routine. Obviously, this will have an adverse impact on program performance.
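To make the leading-dimension point concrete, here's a toy column-major matrix multiply in Python (a sketch only, not the compiler's actual support routine), showing that a 5x5 submatrix of a 10x10 array can be multiplied in place, with no gather/scatter, simply by passing lda=10:

```python
def dgemm_nn(m, n, k, alpha, A, lda, B, ldb, beta, C, ldc):
    """Toy column-major C = alpha*A*B + beta*C with explicit leading
    dimensions; arrays are flat Python lists. Illustrative sketch of
    the xGEMM calling convention, not the library routine."""
    for j in range(n):
        for i in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[i + p * lda] * B[p + j * ldb]
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc]

# A 5x5 submatrix of a 10x10 array is just m=n=k=5 with lda=10:
lda = 10
A = [1.0] * (lda * lda)
B = [1.0] * (lda * lda)
C = [0.0] * (lda * lda)
dgemm_nn(5, 5, 5, 1.0, A, lda, B, lda, 0.0, C, lda)
print(C[0])   # each C(i,j) inside the submatrix is 5.0
```

The leading dimension is what lets the routine stride over the storage of the full array while touching only the submatrix; without it, the compiler would have to copy the operands into contiguous temporaries first.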

Not only is the compiler handling sub-matrices, it's also handling various other forms of the general matrix multiply.

C = MATMUL(transpose(A),B) becomes call dgemm('t','n',10,10,10,1.0d0,A,10,B,10,0.0d0,C,10)

C = MATMUL(transpose(A)*alpha,B) becomes call dgemm('t','n',10,10,10,alpha,A,10,B,10,0.0d0,C,10)

C = C + MATMUL(A,B) becomes call dgemm('n','n',10,10,10,1.0d0,A,10,B,10,1.0d0,C,10)

C = C * beta + MATMUL(A*alpha,transpose(B)) becomes call dgemm('n','t',10,10,10,alpha,A,10,B,10,beta,C,10)
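These mappings can be sanity-checked with a toy Python model of the xGEMM semantics (alpha, beta, and op() handling only; illustrative, not the actual library routine):

```python
def matmul(A, B):
    # Plain MATMUL(A, B) on matrices stored as lists of rows.
    return [[sum(A[i][p] * B[p][j] for p in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(M):
    return [list(col) for col in zip(*M)]

def dgemm(transa, transb, alpha, A, B, beta, C):
    """Toy dense xGEMM: C = alpha*op(A)*op(B) + beta*C.
    A sketch of the semantics only, not the library routine."""
    opA = transpose(A) if transa == 't' else A
    opB = transpose(B) if transb == 't' else B
    P = matmul(opA, opB)
    return [[alpha * P[i][j] + beta * C[i][j] for j in range(len(C[0]))]
            for i in range(len(C))]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]

# C = C + MATMUL(A,B)  maps to  dgemm('n','n',...,1.0,...,1.0,C,...)
P = matmul(A, B)
lhs = [[C[i][j] + P[i][j] for j in range(2)] for i in range(2)]
assert lhs == dgemm('n', 'n', 1.0, A, B, 1.0, C)

# C = MATMUL(transpose(A)*alpha,B)  maps to  dgemm('t','n',...,alpha,...,0.0,...)
alpha = 2.0
scaled_At = [[alpha * x for x in row] for row in transpose(A)]
assert matmul(scaled_At, B) == dgemm('t', 'n', alpha, A, B, 0.0, C)
```

The point is that every one of the listed MATMUL forms is just a particular setting of transa/transb, alpha, and beta in the one general xGEMM call.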

The F90 compiler has been performing these transformations since Sun Studio 9, and additional optimizations are in the works in which MATMUL calls that are really matrix-vector (i.e., xGEMV) operations will be converted directly to the correct matrix-vector call.

Tuesday Aug 23, 2005

Gathering unique signatures from ISV applications

In yesterday's entry, I hinted at a way of gathering ISV application data to guide performance tuning. The method by which this is done isn't new, but how the data is used is, as far as I know, unique.

The PLG (Performance Library Group) has created an interpose library for every user-callable routine in the library. Greg Nakhimovsky wrote an introduction to writing interposers on the developers forum which is a good place to start if you're unfamiliar with interposers.

What this library does is intercept all external calls to Perflib, record some information about the call (we call this the signature of the call), and then pass control to the actual Perflib routine.

What is created after a run is a trace file that contains a list of unique signatures and a frequency count for each signature. A simple example might look like:

  • daxpy:146:2:1:1:512
  • daxpy:56:2:1:1:1536
  • dasum:152:1:2
  • dasum:60:1:18

What this tells me is that daxpy was called with N=146, Alpha=(not 1 and not 0), Incx=1, Incy=1 and it was called with those exact parameters 512 times. Also, dasum was called with N=152, Incx=1 and it was called with those parameters 2 times.

The reason the exact Alpha value isn't captured for daxpy is that it has no impact on the performance of the routine. This allows us to collapse all calls to daxpy that have common integer parameters into a single unique signature. This isn't the case when alpha is one or zero, since the code special-cases those values of alpha.
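The collapsing logic is easy to picture; here's a hypothetical sketch of the interposer's bookkeeping (not the actual Perflib trace code, and the helper name is mine):

```python
from collections import Counter

def daxpy_signature(n, alpha, incx, incy):
    """Collapse a daxpy call into its performance signature.
    Hypothetical sketch of the interposer's bookkeeping."""
    if alpha == 0.0:
        a = 0              # special-cased: y is left unchanged
    elif alpha == 1.0:
        a = 1              # special-cased: no multiply needed
    else:
        a = 2              # "not 1 and not 0" -- exact value irrelevant
    return f"daxpy:{n}:{a}:{incx}:{incy}"

counts = Counter()
# Calls with different alphas but identical integer parameters
# collapse into one unique signature with a frequency count of 3:
for alpha in (3.14, 2.72, 1.62):
    counts[daxpy_signature(146, alpha, 1, 1)] += 1
print(counts)    # Counter({'daxpy:146:2:1:1': 3})
```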

This information allows us to look deeply into the way an application is actually using the library. Most tuning tools (a la prof, analyzer, etc.) will tell you how much time you're spending in a particular routine, but that's about it. In a recent example, we were benchmarking the Nastran application against several competitors on the Opteron boxes. In most cases we were doing fine, but for one data set we were getting beaten quite soundly. We used the interpose library to find that dgemv was being called with M=1.5 million and N=7. It just so happens that our code was blocked for 8 elements, so the entire time was being spent in the cleanup code! After handling this case better, we were able to more than double the performance of the routine for that particular class of cases (tall skinny A matrices).
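The dgemv story comes down to simple blocking arithmetic. A sketch (assuming an 8-column unrolled block loop plus a one-column cleanup loop; the helper is illustrative, not our kernel):

```python
def dgemv_column_split(n, block=8):
    """Split the N columns of A between the unrolled block loop and
    the one-column cleanup loop. Illustrative only."""
    return (n // block) * block, n % block

# The Nastran calls had N=7, so the fast 8-column block loop never
# ran -- every column fell into the cleanup code:
print(dgemv_column_split(7))     # → (0, 7)
print(dgemv_column_split(100))   # → (96, 4)
```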

Some readers (I kid myself into thinking anyone has read this far) might wonder why on earth I'm writing about such wonkish details. The reason is simple: in upcoming entries I'm going to present some performance comparisons of Perflib to competing libraries on applications where one or both of the libraries may never have actually been used to run the application. I'm able to do this (to a certain extent) because once I have the signature profile of an application, I can project the impact a particular library will have on the performance of that application. Can I do it perfectly? No, but I can make some supportable generalizations about how Perflib would compare to a competing product on a particular application. As always, thanks for your attention.

Recent versions of the Sun Performance Library

In response to a recent query regarding my initial blog entry (thanks, Kristofer), I'd like to give some details about current Performance Library efforts. Most of the new development has gone into creating versions for 32-bit x86 (vanilla P3 or P4 chips without SSE2 support), 32-bit SSE2 (minimum requirements: a P4 with SSE2 support and the Solaris 9 update 6 or later OS), and finally 64-bit Opteron.

Kristofer asks, "Why does the x86 version of the library stink so bad compared to the SPARC offerings?" If the only library you have tested is the non-SSE2, 32-bit version which was released with the Sun Studio 9 compiler and tools, I'm sure you're disappointed with the results. It was our first effort after returning to the x86 architecture, and we had no OpenMP support in the compiler. The schedule was tight, and it was about all we could do to field a working, serial version of the library. However, we also shipped an SSE2-enabled version of the library with the Sun Studio 9 compiler and tools which you might give a try. You do need a P4 or better SSE2-enabled chip and the Solaris 9 update 6 or later OS. You'll find that our 32-bit performance is quite respectable. You can access that library using the -xarch=sse2 compiler option.

% f90 -fast -xarch=sse2 -o prog prog.f -xlic_lib=sunperf
% prog
3000 4831.38784

That's 4831 MFlops on a 2.8 GHz machine doing a double precision matrix multiply. Not world-record numbers, but certainly respectable (86+% of peak). The Studio 9 Performance Library doesn't have multi-threaded support due to the missing OpenMP support in the compiler.

Now, let's move forward a release to Studio 10. We rolled out 64-bit support, and all libraries are parallelized. Let's take our same 3000x3000 double precision matrix multiply and run it on a 2-way 2.2 GHz Opteron machine running Solaris 10.

% f90 -fast -xarch=amd64 -o prog prog.f -xlic_lib=sunperf
% prog
3000 3788.6042649

86% of peak ... not great but not too bad considering it was our first 64-bit attempt and we released the compilers and tools concurrently with the Solaris OS for 64-bit.
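For the curious, the percent-of-peak figures fall out of a one-liner (assuming a single-core peak of two floating-point operations per clock cycle, which is what these chips can sustain for double precision DGEMM):

```python
def percent_of_peak(mflops, clock_ghz, flops_per_cycle=2):
    """Single-core peak MFlops = clock rate (MHz) x flops per cycle.
    Two flops per cycle (a multiply and an add) is assumed here."""
    peak_mflops = clock_ghz * 1000 * flops_per_cycle
    return 100.0 * mflops / peak_mflops

print(round(percent_of_peak(4831.38784, 2.8)))     # SSE2 P4 run → 86
print(round(percent_of_peak(3788.6042649, 2.2)))   # Opteron run → 86
```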

Let's see what the parallel scaling looks like:

% setenv PARALLEL 2
% prog
3000 7237.128577

Not too bad considering this was on a two processor machine so we're fighting the last processor problem a bit. Let's see how she goes on a 4-way box ...

% setenv PARALLEL 4
% prog
3000 13987.367615136576
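Taking the earlier single-CPU number as the baseline, the parallel runs work out to roughly 1.91x on two processors and 3.69x on four:

```python
def speedup(parallel_mflops, serial_mflops):
    # Ratio of the parallel DGEMM rate to the serial rate.
    return parallel_mflops / serial_mflops

print(round(speedup(7237.128577, 3788.6042649), 2))          # 2 CPUs
print(round(speedup(13987.367615136576, 3788.6042649), 2))   # 4 CPUs
```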

Wow, 14 GFlop DGEMM on a commodity box. If Neil Lincoln and the guys at ETA could see me now! Everyone knows that benchmarking and publishing of performance numbers is voodoo at best. I can always find a test case where my library beats Vendor X's library and vice versa. The direction we've been moving lately is application based benchmarking. As time permits, I'll post some discussion of a unique way we've started collecting application data and how we're using it to better our product.

Thanks for reading this far.

Tuesday Aug 16, 2005

Sun Performance Library (aka Perflib aka SunPerf)

Since it seems that self-promotion is the way of the world, I'm going to start blogging on topics near and dear to my heart. The first is the product my group produces: the Sun Performance Library, also known as Perflib and SunPerf.

This product has been part of the Sun compilers and tools for more than 11 years but it's relatively unknown to all but a small population of High Performance Computing (HPC) folks, mostly SEs and customers who currently use the product.

I could go on and on with details and descriptions, but it's all been done much more professionally by the technical documentation folks and can be found here:

Sun Performance Library Reference
Sun Performance Library User's Guide

The short story is that the library provides the Basic Linear Algebra Subroutines (BLAS), Fast Fourier Transform (FFT) routines, Linear Algebra package (LAPACK) routines, and some odds and ends to help HPC customers get the most out of Sun hardware.

Since this is my first entry, I'll keep it short so as to not lose my one remaining reader (Hi Mom!). More later.



