Friday Jan 29, 2010

..TO BE CONTINUED

We'll continue where we left off in a few weeks, but under the Oracle logo.

Stay tuned.


Wednesday Oct 21, 2009

Updated Performance Analyzer Manual

The Sun Studio Performance Analyzer reference manual, updated for Sun Studio 12 update 1, is now available on docs.sun.com:

Developing high performance applications requires a combination of compiler features, libraries of optimized functions, and tools for performance analysis. The Performance Analyzer manual describes the tools that are available to help you assess the performance of your code, identify potential performance problems, and locate the part of the code where the problems occur.

http://docs.sun.com/app/docs/doc/821-0304

Tuesday Oct 13, 2009

HPC Profiling for Fun and Profit

Just released:

HPC Profiling with the Sun Studio Performance Tools
Marty Itzkowitz and Yukon Maruyama (Sun Microsystems) describe how to use the Sun Studio Performance Tools to understand the performance issues in single-threaded, multi-threaded, OpenMP, and MPI applications, and the techniques used to profile them. This paper was presented at the Third Parallel Tools Workshop held in Dresden, Germany, in September.

The link to the article is:

http://developers.sun.com/sunstudio/documentation/techart/hpc_profiling.pdf

Tuesday Sep 01, 2009

Going Parallel

Ever wanted to know more about parallel and multithreaded programming, but were afraid to ask?

Here's a really good set of seven tutorials presented by Ruud van der Pas called:

"An Introduction to Parallel Programming"

Monday Jun 22, 2009

Sun Studio 12 Update 1 Supports OpenMP 3.0

Just released today, the latest Sun Studio 12 Update 1 compilers and tools support the OpenMP 3.0 tasking features natively.
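
Tasking makes it easier to parallelize irregular work such as walking a linked list. Here's a minimal sketch of an OpenMP 3.0 task construct (my own illustration, with made-up names, compiled with something like cc -xO3 -xopenmp tasks.c):

#include <stdio.h>
#include <omp.h>

typedef struct node { int value; struct node *next; } node_t;

/* Each list node is handed off as an independent OpenMP 3.0 task. */
void walk(node_t *head)
{
    node_t *p;

    #pragma omp parallel          /* create the thread team               */
    {
        #pragma omp single        /* one thread generates all the tasks   */
        {
            for (p = head; p != NULL; p = p->next) {
                #pragma omp task firstprivate(p)    /* one task per node  */
                printf("node %d processed by thread %d\n",
                       p->value, omp_get_thread_num());
            }
        }
    }                             /* implicit barrier waits for the tasks */
}

int main(void)
{
    node_t nodes[4];
    int i;

    for (i = 0; i < 4; i++) {
        nodes[i].value = i;
        nodes[i].next = (i < 3) ? &nodes[i + 1] : NULL;
    }
    walk(&nodes[0]);
    return 0;
}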

You can download the entire suite of software tools from the Sun Studio portal and the OpenSolaris repository.

Also, there's an updated OpenMP 3.0 API User's Guide.

And release notes, readmes, and other documentation.

Thursday Jun 18, 2009

OpenMP 3.0

OpenMP is a shared memory multithreading API that utilizes source code directives to turn a serial program into a parallel program. In most cases it doesn't require any reprogramming, just the insertion of directives around loops.

The Sun Studio compilers recognize OpenMP directives and generate the appropriate parallel code automatically. 
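
For example, here's a minimal sketch (my own illustration, with made-up names) of a serial loop parallelized by a single directive; compile it with the -xopenmp flag, for instance cc -xO3 -xopenmp loop.c:

#include <stdio.h>

#define N 1000000

static double a[N], b[N];

int main(void)
{
    int i;

    /* The pragma is the only change to the serial code: the loop
       iterations are divided among the threads in the team. */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = 2.0 * b[i] + 1.0;

    printf("a[0] = %f\n", a[0]);
    return 0;
}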

The OpenMP API specification is under constant development by a committee of international volunteers. The official website is openmp.org.

The annual International Workshop on OpenMP (IWOMP) was held in Dresden, Germany last week and a number of interesting presentations were given. Most of the slides are now available on the IWOMP 2009 website.

Here are some of particular interest:

  • An Overview of OpenMP 3.0, Ruud van der Pas (Sun Microsystems)
  • Tasking in OpenMP 3.0, Alejandro Duran (Barcelona Supercomputing Center)
  • Sun Studio OpenMP Compilers and Tools, Ruud van der Pas (Sun Microsystems)
  • OpenMP And Performance, Ruud van der Pas (Sun Microsystems)

Monday Feb 16, 2009

Feedback

Another useful optimization option available with Sun Studio compilers is profile feedback.

This option can be especially helpful with codes that contain a lot of branching. The compiler is unable to determine from the source code alone which branches in an IF or CASE statement are the most likely to be taken. Using the profile feedback feature, you can run an instrumented version of the code using typical data to collect statistics on code coverage and branching, and then recompile the code using this collected data.
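
As a quick sketch (hypothetical code, not from the original post): in a function like the one below, the source alone doesn't say which branch is the common case, but a collected profile does, and the compiler can then lay out the frequent path as straight-line code.

/* Which branch is "hot" depends entirely on the input data. With
   profile feedback the compiler can make the common case the
   fall-through path and move the rare cases out of the way. */
int classify(int x)
{
    if (x < 0)
        return -1;      /* rare for typical input? only the profile knows */
    else if (x == 0)
        return 0;
    else
        return 1;
}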

Darryl Gove has a great description of profile feedback in his book Solaris Application Programming.

With profile feedback, the compiler is better able to do certain optimizations that it cannot do by just analyzing the source code:

  • Lay out the compiled code so that branches are rarely taken. The most frequent branches "fall through" to the next memory location, avoiding a fetch and branch to a distant location.
  • Inline routines called many times. This avoids costly function calls.
  • Move infrequently executed code out of the "hot" parts of the code. This improves utilization of the instruction cache.
  • Many more optimizations based on how variables are and are not used along the most likely paths the program will take.

Of course, all these optimizations depend on how typical the test data collected in the profile really is. In some cases it might be useful to identify several sets of "typical data", collect a profile for each set, and compile multiple versions of the program using each profile. It all depends on the application.

To use profile feedback, compilation proceeds in three phases:

  1. Compile with -xprofile=collect to produce an instrumented executable.
  2. Run the instrumented executable with a typical data set to create a performance profile.
  3. Recompile with -xprofile=use and -xO5 to produce the optimized executable.

 % cc -xO3 -xprofile=collect:/tmp/profile myapp.c
 % a.out
 % cc -xO5 -xprofile=use:/tmp/profile -o myapp myapp.c


Read about profile feedback in the compiler man pages: C++, C, Fortran

Wednesday Jan 28, 2009

See Compiler Run, Run Compiler, Run...

If you've ever wondered what the compiler is doing when it optimizes your code, you can use the command-line tool, er_src, which is part of the Sun Studio Performance Analyzer, to view the "compiler commentary".

Just compile with some optimization level and -g, and then pass the object file to er_src.

>f95 -O3 -g -c fall.f95 ; er_src fall.o
Source file: fall.f95
Object file: fall.o
Load Object: fall.o

     1.         parameter (n=100)
        <Function: MAIN>
     2.         real psi(n,n)
     3.         a = 1E6
     4.         tpi = 2*3.14159265
     5.         di = tpi/float(n)
     6.         dj = di

    Source loop below has tag L1
    Source loop below has tag L2
    L1 could not be pipelined because it contains calls
     7.     forall (j=1:n, i=1:n) psi(i,j)= a*sin((float(i)-.5) * di) * sin((float(j)-.5)*dj)
     8.         print*, psi(50,50)
     9.         end

This is a little test example using a Fortran 95 FORALL loop, compiled at optimization level O3.

Let's try it again, but this time with -fast for full optimization:

>f95 -fast -g -c fall.f95 ; er_src fall.o
Source file: fall.f95
Object file: fall.o
Load Object: fall.o

     1.         parameter (n=100)
        <Function: MAIN>
     2.         real psi(n,n)
     3.         a = 1E6
     4.         tpi = 2*3.14159265
     5.         di = tpi/float(n)
     6.         dj = di

    Source loop below has tag L1
    Source loop below has tag L2
    L1 fissioned into 2 loops, generating: L3, L4
    L1 transformed to use calls to vector intrinsics: __vsinf_
    L4 scheduled with steady-state cycle count = 2
    L4 unrolled 3 times
    L4 has 1 loads, 1 stores, 0 prefetches, 0 FPadds, 1 FPmuls, and 0 FPdivs per iteration
    L4 has 0 int-loads, 0 int-stores, 4 alu-ops, 0 muls, 0 int-divs and 0 shifts per iteration
    L3 scheduled with steady-state cycle count = 4
    L3 unrolled 2 times
    L3 has 0 loads, 1 stores, 0 prefetches, 3 FPadds, 1 FPmuls, and 0 FPdivs per iteration
    L3 has 0 int-loads, 0 int-stores, 3 alu-ops, 0 muls, 0 int-divs and 0 shifts per iteration
     7.     forall (j=1:n, i=1:n) psi(i,j)= a*sin((float(i)-.5) * di) * sin((float(j)-.5)*dj)
     8.         print*, psi(50,50)
     9.         end

There's a lot more going on here. Note that the compiler transforms the FORALL into two loops and then unrolls them. It also uses a vector version of the sin() function to process a bunch of arguments in a single call.

While the compiler commentary can get a bit cryptic, you can still get a feel for the kinds of optimizations the compiler is performing on your code.

It's also useful when using the auto parallelization options. We'll have more to say about that. But it's worth using er_src to get an idea about what the compiler can and cannot do. And don't forget to also compile with -g.

Friday Jan 23, 2009

GNU to Sun

While we're talking about Sun Studio compiler options, there's a new article out on the translation of familiar GNU compiler options (gcc/g++/gfortran) to the Sun Studio cc, CC, and f95 compilers.


Translating gcc/g++/gfortran Options to Sun Studio Compiler Options

Wednesday Jan 21, 2009

Not So Simple - The -fsimple Option

As mentioned earlier, the -fast compiler option is a good way to start because it is a combination of options that results in good execution performance.

But one of the options included in -fast is -fsimple=2. What does this do?

Unless directed otherwise, the compiler does not attempt to simplify floating-point computations (the default is -fsimple=0). -fsimple=2 enables the optimizer to make aggressive simplifications, with the understanding that this might cause some programs to produce slightly different results due to rounding effects.

Here's what the man page says:

-fsimple=1 

Allow conservative simplifications. The resulting code does not strictly conform to IEEE 754, but numeric results of most programs are unchanged.

With -fsimple=1, the optimizer can assume the following:

  • IEEE 754 default rounding/trapping modes do not change after process initialization.

  • Computations producing no visible result other than potential floating point exceptions may be deleted.

  • Computations with NaNs (“Not a Number”) as operands need not propagate NaNs to their results; for example, x*0 may be replaced by 0.

  • Computations do not depend on sign of zero.

With -fsimple=1, the optimizer is not allowed to optimize completely without regard to roundoff or exceptions. In particular, a floating-point computation cannot be replaced by one that produces different results with rounding modes held constant at run time.

-fsimple=2

In addition to -fsimple=1, permit aggressive floating point optimizations. This can cause some programs to produce different numeric results due to changes in the way expressions are evaluated. In particular, the standard rule requiring compilers to honor explicit parentheses around subexpressions to control expression evaluation order may be broken with -fsimple=2. This could result in numerical rounding differences with programs that depend on this rule.

For example, with -fsimple=2, the compiler may evaluate C-(A-B) as (C-A)+B, breaking the standard's rule about explicit parentheses, if the resulting code is better optimized. The compiler might also replace repeated computations of x/y with x*z, where z=1/y is computed once and saved in a temporary, to eliminate the costly divide operations.
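
As a sketch of that second case (my example, not from the man page), with -fsimple=2 the compiler may rewrite the repeated divides in a loop like this one as multiplies by a precomputed reciprocal, which can change the last bits of the results:

/* Every iteration divides by the same y. Under -fsimple=2 the compiler
   may compute z = 1.0/y once and replace a[i]/y with a[i]*z. */
void scale(double *a, int n, double y)
{
    int i;

    for (i = 0; i < n; i++)
        a[i] = a[i] / y;
}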

Programs that depend on particular properties of floating-point arithmetic should not be compiled with -fsimple=2.

Even with -fsimple=2, the optimizer still is not permitted to introduce a floating point exception in a program that otherwise produces none.

So if you use -fast, some programs that are numerically unstable might get different results than when not compiled with -fast. If this happens to your program, you can experiment by overriding the -fsimple=2 component of -fast, compiling instead with -fast -fsimple=0.
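
For example (with a made-up file name), since a later option on the command line overrides an earlier one:

 % cc -fast -fsimple=0 -o myapp myapp.c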

Thursday Jan 15, 2009

On Target

To completely specify the target architecture for which the compiler should generate optimized code, there are three option flags you can use:

-xarch=keyword    choose the target instruction set by keyword
Examples: generic, native, sparc, sparcvis, sparcvis2, sparcfmaf, sparcima, 386, pentium_pro, sse, sse2, amd64, pentium_proa, ssea, sse2a, amd64a, sse3, ssse3

-xchip=keyword     choose the target processor for optimization
Examples: generic, native, sparc64vi, sparc64vii, ultra, ultra2, ultra2e, ultra2i, ultra3, ultra3cu, ultra3i, ultra4, ultra4plus, ultraT1, ultraT2, core2, opteron, pentium, pentium_pro, pentium3, pentium4, nehalem

-xcache=spec         choose the target processor's cache specifications
Examples: generic, native, level1spec:level2spec:level3spec

In all cases, generic compiles for good performance on most platforms, and native compiles for the platform the compiler is running on. Combine these with -m32 or -m64 and you have a complete set of options for 32-bit and 64-bit target processors.

But knowing the right combination of options for your processor may be too much to deal with, so the compiler also provides a macro to set all three options to some standard values. This is the -xtarget=keyword option:

-xtarget=keyword    choose the target processor to compile for
Examples: generic, native, ultra, ultra2, ultra2i, ultra3, ultra3cu, ultra3i, ultra4, ultra4plus, ultraT1, ultraT2, sparc64vi, sparc64vii, pentium, pentium_pro, pentium3, pentium4, woodcrest, penryn, nehalem, opteron, and others.

Each keyword expands into a unique -xarch/-xchip/-xcache setting, like:

-xtarget=ultra4 is equivalent to
-xarch=sparcvis  -xcache=64/32/4:8192/128/2 -xchip=ultra4

-xtarget=woodcrest is equivalent to
-xarch=ssse3 -xcache=32/64/8:4096/64/16 -xchip=core2

Keep in mind that -fast (discussed in an earlier post) sets a number of reasonable optimization options, including -xtarget=native. For example, on my AMD64 Turion laptop:

>f95 -xtarget=native -dryrun
###     command line files and options (expanded):
### -xarch=sse3a -xcache=64/64/2:1024/64/16 -xchip=opteron -dryrun

So if I compile on my laptop, but want to deploy the binary application on an Intel Woodcrest system, I would have to override the native target if I still want to use -fast:

>f95 -fast -xtarget=woodcrest -dryrun
###     command line files and options (expanded):
### -xO5 -dalign -fsimple=2 -fns=yes -ftrap=common -xlibmil -xlibmopt -nofstore -xregs=frameptr -xarch=ssse3 -xcache=32/64/8:4096/64/16 -xchip=core2 -dryrun -xdepend=yes

What you can't do is cross-compile between SPARC and x86/x64 -- you can't compile on a SPARC system to generate code for an Intel or AMD platform, and vice versa.

But knowing your target can be important.

(The details are in the compiler man pages)

Tuesday Jan 13, 2009

What Am I Compiling For?

It's worth thinking about the target processor you intend your code to run on. If performance is not an issue, then you can go with whatever default the compiler offers. But overall performance will improve if you can be more specific about the target hardware.

Both SPARC and x86 processors have 32-bit and 64-bit modes. Which is best for your code? And are you letting the compiler generate code that utilizes the full instruction set of the target processor?

32-bit mode is fine for most applications, and a 32-bit binary will run even if the target system is running in 64-bit mode. But the opposite is not true .. an application compiled for 64-bit must be run on a system with a 64-bit kernel; it will get errors on a 32-bit system.

How do you find out if the (Solaris) system you're running on is in 32-bit or 64-bit mode? Use the isainfo -v command:

 >isainfo -v
64-bit sparcv9 applications
        vis2 vis
32-bit sparc applications
        vis2 vis v8plus div32 mul32

This SPARC system is running in 64-bit mode. The command also tells me that this processor has the VIS2 instruction set.

On another system, isainfo reports this:

 >isainfo -v
64-bit amd64 applications
    sse2 sse fxsr amd_3dnowx amd_3dnow amd_mmx mmx cmov amd_sysc cx8 tsc fpu
32-bit i386 applications
    sse2 sse fxsr amd_3dnowx amd_3dnow amd_mmx mmx cmov amd_sysc cx8 tsc fpu

On UltraSPARC systems, the only advantage to running a code in 64-bit mode is the ability to access very large address spaces. Otherwise there is very little performance gain, and some codes might even run slower. On x86/x64 systems, there is the added advantage of being able to utilize additional machine instructions and additional registers. For both, compiling for 64-bit may increase the binary size of the program (long data and pointers become 8 instead of 4 bytes). But if you're intending your code to run on x86/x64 systems, compiling for 64-bit is probably a good idea. It might even run faster.

So how do you do it?

The compiler options -m64 and -m32 specify compiling for 64-bit or 32-bit execution. And it's important to note that 64-bit and 32-bit objects and libraries cannot be intermixed in a single executable. Also, on Solaris systems -m32 is the default, but on 64-bit x64 Linux systems -m64 -xarch=sse2 is the default.

>f95 -m32 -o ran ran.f
>file ran
ran:    ELF 32-bit LSB executable 80386 Version 1 [FPU], dynamically linked, not stripped
>f95 -m64 -o ran64 ran.f
>file ran64
ran64:  ELF 64-bit LSB executable AMD64 Version 1 [SSE FXSR FPU], dynamically linked, not stripped
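
Another quick way to see the difference is a tiny test like this one (hypothetical file name sizes.c); built with -m32 it typically reports 4 bytes for both, and with -m64 it reports 8 bytes for both:

#include <stdio.h>

int main(void)
{
    /* long and pointer sizes follow the ILP32 (32-bit) or LP64 (64-bit) data model */
    printf("sizeof(long) = %d, sizeof(void *) = %d\n",
           (int) sizeof(long), (int) sizeof(void *));
    return 0;
}

 % cc -m32 -o sizes32 sizes.c
 % cc -m64 -o sizes64 sizes.c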

It's also helpful to tell the compiler what processor you intend to run the application on. The default is to produce a generic binary that will run well on most current processors. But that leaves out a lot of opportunities for optimization. As newer and newer processors are made available, new machine instructions or other hardware features are added to the basic architecture to improve performance. The compiler needs to be told whether or not to utilize these new features. However, this can produce backward incompatibilities, rendering the binary code unable to run on older systems. To handle this, application developers will make various binary versions available for current and legacy platforms.

For example, if you compile with the -fast option, the compiler will generate the best code it can for the processor it is compiling on. -fast   includes -xtarget=native. You can override this choice by adding a different -xtarget after the -fast option on the command line (the command line is processed from left to right).  For example, to compile for an UltraSPARC T2 system when that is not the native system you are compiling on, use -fast -xtarget=ultraT2.
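
For example (hypothetical file name), on a SPARC build machine that isn't itself a T2:

 % f95 -fast -xtarget=ultraT2 -o myapp myapp.f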

New processors appear on the scene often. And with each new release of the Sun Studio compilers, the list of -xtarget options expands to handle them. These new processor values are usually announced in the Sun Studio compiler READMEs. Tipping the compiler off about the target processor helps performance.

More about -xtarget and what it means next time.

(For details, check the compiler man pages)

Monday Jan 12, 2009

Debugging

Debugging your code is a necessary evil. Things never seem to work the way you expect them to. Programs crash or get the wrong results, leaving you wondering why. So the next step is to invoke a debugger on the code.

Most compilers, like the Sun Studio compilers, have a -g option or equivalent, which adds debugging information, like symbol tables, to the object code. The Sun Studio debugging tool, dbx, reads the object code, symbol tables, and a core dump if available, and tries to locate the spot in the program where it died. Now you can look at what was happening when the code crashed and try to determine the cause.
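
Here's a minimal sketch of that workflow (the program name is made up, and the dbx commands shown are just the basics: where prints the call stack at the point of the crash, print examines a variable):

 % cc -g -o myapp myapp.c
 % ./myapp
 Segmentation Fault (core dumped)
 % dbx myapp core
 (dbx) where
 (dbx) print count
 (dbx) quit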

But debugging code is an art. There's no good book on the subject of debugging code that I know of. Programmers learn by accumulating experience.

Debuggers have typically been command-line tools, like dbx. But it helps to use a GUI debugger that can reference the source code.

A new standalone lightweight GUI debugger, dbxtool, is part of the Sun Studio Express 11/08 release and is fully integrated into the Sun Studio NetBeans-based IDE.

There's a new screencast you can watch to learn about dbxtool and the features of the dbx debugger, presented by Dave Ford from the Sun Studio dbx engineering team. Click on the image to start the screencast.

UPDATE: dbxtool is now part of the current Sun Studio 12 Update 1 release.

Saturday Jan 10, 2009

What Am I Optimizing?

Let's think about this a little bit more.

If I add an optimization option, like -xO3 or -fast, to my compile command-line, what does that actually mean?

Well, it means that everything in that compilation unit (source files) will be compiled with a certain set of optimization strategies. The compiler will try to produce the best code it can at that level. But ambiguities in the source code might inhibit some optimizations, because the compiler has to make sure that the machine code it generates will always do the right thing .. that is, do what the programmer expects it to do.

Note that all the routines, functions, modules, procedures, and classes compiled in that compilation unit will be compiled with the same options. In some cases the extra time spent by the compiler might be wasted on routines that are rarely called and do not really participate in the compute-intensive parts of the program.

For short programs, this hardly matters .. compile time is short, and you might only compile infrequently.

But this can become an issue with "industrial-strength" codes consisting of thousands of lines and hundreds of program units (routines, functions, etc.). Compile time might become a major concern, so we would probably want to apply aggressive optimization only to those routines that factor into the overall performance of the complete program.

That means you really need to know where your program is spending most of its CPU time, and focus your performance optimization efforts primarily on those program units. This goes for any kind of performance optimization .. you do need to know and understand the flow of the program -- its footprint.

The Sun Studio Performance Analyzer is the tool to do that. While it does provide extensive features for gathering every piece of information about your program's execution, it also has a simple command-line interface that you can use immediately to find out where the program is spending most of its time.

Compile your code with the -g option (to produce a symbol table) and run the executable under the collect command.

>f95 -g -fixed -o shal shalow.f90

>collect shal

Creating experiment database test.1.er ...

1NUMBER OF POINTS IN THE X DIRECTION     256

 NUMBER OF POINTS IN THE Y DIRECTION     256

....

Running under the collect command generates runtime execution data in test.1.er/ that can be used by the er_print command of the Performance Analyzer:

>er_print -functions test.1.er
Functions sorted by metric: Exclusive User CPU Time

Excl.     Incl.      Name  
User CPU  User CPU         
  sec.      sec.      
18.113    18.113     <Total>
 6.805     6.805     calc1_
 6.384     6.384     calc2_
 4.893     4.893     calc3_
 0.020     0.020     inital_
 0.010     0.010     calc3z_
 0.        0.        cosf
 0.        0.        cputim_
 0.        0.        etime_
 0.        0.        getrusage
 0.       18.113     main
 0.       18.113     MAIN
 0.        0.        __rusagesys
 0.        0.        sinf
 0.       18.113     _start


The er_print -functions command gives us a quick way of seeing timings for all routines (this was a Fortran 95 program), including library routines. Right away I know that calc1, calc2, and calc3 do all the work, as expected. But we also see that calc3 is not as significant as calc1. ("Inclusive Time" includes time spent in the routines called by that routine, while "Exclusive Time" only counts time spent in the routine, exclusive of any calls to other routines.)

Well, this is a start. Note that no optimization was specified here. Let's see what happens with -fast.

>f95 -o shalfast -fast -fixed -g shalow.f90
>collect shalfast
Creating experiment database test.3.er ...
....
>er_print -functions test.3.er
Functions sorted by metric: Exclusive User CPU Time

Excl.     Incl.      Name  
User CPU  User CPU         
 sec.      sec.       
7.695     7.695      <Total>
7.675     7.695      MAIN
0.020     0.020      __rusagesys
0.        0.020      etime_
0.        0.020      getrusage



Yikes! What happened?

Clearly, with -fast the compiler compressed the program as much as it could, replacing the calls to the calc routines by compiling them inline into one hunk of code. Note also the better than 2x improvement in performance.

Of course, this was a little toy test program. Things would look a lot more complicated with a large "industrial" program.

But you get the idea.

More information on er_print and collect.

Friday Jan 09, 2009

Optimization Levels

Sun Studio compilers provide five levels of optimization, -xO1 through -xO5, and each increasing level adds more optimization strategies for the compiler, with -xO5 being the highest level.

And the higher the optimization level, the longer the compilation time, depending on the complexity of the source code, which is understandable because the compiler has more work to do.

The default, when an optimization level is not specified on the command line, is to do no optimization at all. This is good when you just want to get the code to compile with minimal compile time, checking for syntax errors in the source and for the right runtime behavior.

So, if you are concerned about runtime performance you need to specify an optimization level at compile time. A good starting point is to use the -fast macro, as described in an earlier post, which includes -xO5, the highest optimization level. Or, compile with an explicit level, like -xO3, which provides a reasonable amount of optimization without increasing compilation time significantly.
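
In practice (hypothetical file name), that means starting with something like one of these:

 % cc -fast -o myapp myapp.c
 % cc -xO3 -o myapp myapp.c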

But keep in mind that the effectiveness of the compiler's optimization strategies depends on the source code being compiled. This is especially true in C and C++, where the use of pointers can frustrate the compiler's attempts at generating optimal code due to the side effects such optimizations can cause. (But, of course, there are other options, like -xalias_level, you can use to help the compiler make assumptions about the use of pointers in the source code.)

Another concern is whether or not you might need to use the debugger, dbx, during or after execution of the program.  For the debugger to provide useful information, it needs to see the symbol tables and linker data that are usually thrown away after compilation. The -g debug option preserves these tables in the executable file so the debugger can read them and associate the binary dump file with the symbolic program.

But the optimized code that the compiler generates may mix things up so that it's hard to tell where the code for one source statement starts and another ends. So that's why the compiler man pages talk a lot about the interaction between optimization levels and debugging. With optimization levels greater than 3,  the compiler provides best-effort symbolic information for the debugger.

Bottom line, you almost always get better performance by specifying an optimization level (or -fast which includes -xO5) on the compile command.

(Find out more...)

Thursday Jan 08, 2009

Optimization Shortcut with -fast

So I've got a code and I've already compiled it without any real options, so I know it will compile. Where do I start when trying to get the best performance?

Well, the Sun Studio compilers have many options for performance optimization. You can try them all one by one and see what works. 

Or, you can start off by compiling with -fast.

-fast is a macro -- it's a set of options that are all invoked simultaneously. Some of the options that it uses can be problematic for some codes. Also, compiling with -fast may increase compile time. But the resulting executable should run faster than compiling with default options for most codes.

Also, the set of options that makes up -fast is different for each compiler and depends on whether you're compiling on a SPARC or x86/x64 processor.

One way to see what the component options of -fast are is to use the compiler's -dryrun or -# options.

For example, on a SPARC Solaris system:

edgard:/home/rchrd<42>f95 -dryrun -fast | grep ###
###     command line files and options (expanded):
### -dryrun -xO5 -xarch=sparcvis2 -xcache=64/32/4:1024/64/4 -xchip=ultra3i -xpad=local -xvector=lib -dalign -fsimple=2 -fns=yes -ftrap=common -xlibmil -xlibmopt -fround=nearest

edgard:/home/rchrd<43>CC -dryrun -fast | grep ###
###     command line files and options (expanded):
### -dryrun -xO5 -xarch=sparcvis2 -xcache=64/32/4:1024/64/4 -xchip=ultra3i -xmemalign=8s -fsimple=2 -fns=yes -ftrap=%none -xlibmil -xlibmopt -xbuiltin=%all -D__MATHERR_ERRNO_DONTCARE


On my AMD64 OpenSolaris laptop we see:

FerrariOS:/export/home/rchrd<25>CC -dryrun -fast | grep ###
###     command line files and options (expanded):
### -dryrun -xO5 -xarch=sse3a -xcache=64/64/2:1024/64/16 -xchip=opteron -xdepend=yes -fsimple=2 -fns=yes -ftrap=%none -xlibmil -xlibmopt -xbuiltin=%all -D__MATHERR_ERRNO_DONTCARE -nofstore -xregs=frameptr -Qoption CC -iropt -Qoption CC -xcallee64

FerrariOS:/export/home/rchrd<22>cc -fast -# no.c |& grep ###
###     command line files and options (expanded):
### -D__MATHERR_ERRNO_DONTCARE -fns -nofstore -fsimple=2 -fsingle -xalias_level=basic -xarch=sse3a -xbuiltin=%all -xcache=64/64/2:1024/64/16 -xchip=opteron -xdepend -xlibmil -xlibmopt -xO5 -xregs=frameptr no.c



The particular options are chosen to get the best performance on the host platform ... so this assumes that you're going to run the executable binary on the same processor that compiled it.

I have one computationally intensive Fortran 95 program that runs on an UltraSPARC IIIi system in 54.4 seconds using just default compiler options. Just adding -fast to the compile command line gives me an executable that runs in only 12.2 seconds .. almost one-fifth the time. The same program on my AMD64 laptop runs about three times as fast with -fast as without it.

But you do have to be careful. Check the manuals, which caution:

Because -fast invokes -dalign, -fns, -fsimple=2, programs compiled with -fast can result in nonstandard floating-point arithmetic, nonstandard alignment of data, and nonstandard ordering of expression evaluation. These selections might not be appropriate for most programs.

Looks like we may have some more explaining to do.

 
  

When Run Was A Compiler

Back in the day (I mean around 1965), the Fortran compiler for the CDC 6600 (the supercomputer of the moment, pictured at the left) was called "run".

Odd choice perhaps. Seymour Cray, the 6600 designer, and Garner McCrossen, the programmer of the run compiler, figured that all you needed to put on a command line (actually a punched card in the control deck) was

run

and the system would invoke the compiler to read the Fortran source cards in the deck and run the program.

There were no compiler options of any significance.

The compiler was written in assembly language for the 6600 and was a remarkable piece of code.

Click on the photo and it will take you to a Google search for more images of the CDC 6600. (Back in the day, I was a systems programmer at NYU on the serial 4 machine in 1965, maintaining the run compiler and library).

Wednesday Jan 07, 2009

Options Not Optional

Compiler options can be mysterious. They can have kind of a "don't go there" mystique about them. But actually they're there to help. 

There are some things the compiler can't do without help from the programmer. So that's when the compiler designers say "let's leave it to the programmer and create an option". Options also accumulate with time, so that's why there are so many of them. Some are "legacy" options, needed for certain situations that rarely come up these days. But the rest are really quite useful, and can greatly improve the kind of code the compiler generates from your source.

There are compiler command-line options for various things, like code optimization levels and run-time performance, parallelization, numeric and floating-point issues, data alignment, debugging, performance profiling, target processor and instruction sets, source code conventions, output mode, linker and library choices, warning and error message filtering, and more. Choosing the right set of options to compile with can make a great difference on how your code performs on a variety of platforms.

Darryl Gove has a great article on selecting the right compiler options.

Over the next couple of weeks I'll be taking a look at individual compiler options, dissecting them one at a time.

 

In the meantime, you can find a detailed list of Sun Studio 12 compiler options organized by function and source language here.

Testing

Testing, 1-2-3-4

Is this mike on?

About


Deep thoughts on compiling C, C++, and Fortran codes with Oracle Solaris Studio compilers, especially optimization and parallelization, from the Solaris Studio documentation lead, Richard Friedman. Email him at
Richard dot Friedman at Oracle dot com
