Wednesday May 21, 2008

Presenting at OpenSolaris Users group tomorrow

Tomorrow I'll be presenting at the Silicon Valley OpenSolaris Users group. Alan DuBoff has asked me to try and avoid the monolithic presentation, so I'll be aiming to have a couple of short presentations. The idea is to push the balance towards communications rather than presentations.

Monday May 19, 2008

Compiler forensics

If you need to find out which version of the compiler is installed, use:

$ cc -V
cc: Sun C 5.9 SunOS_sparc 2007/05/03

I've not been able to find a table which maps the version numbers back to the product names. For the record this is Sun Studio 12.

A more interesting question is what compiler generated an executable.

$ mcs -p test.o
acomp: Sun C 5.8 2005/10/13
as: Sun Compiler Common 11 2005/10/13

The test file was generated by Sun Studio 11.

The final question is which flags were used to generate an executable. Use dwarfdump and grep for "command" for binaries generated with Sun Studio 12, or for C code compiled with Sun Studio 11.

$ dwarfdump test.o|grep command
                DW_AT_SUN_command_line       /opt/SUNWspro/prod/bin/cc -xtarget=generic64 -c  test.c
<   13>   DW_AT_SUN_command_line      DW_FORM_string

For older compilers use dumpstabs and grep for "CMD":

$ dumpstabs getr.o|grep CMD
   2:  .stabs "/export/home; /opt/SUNWspro/bin/../prod/bin/cc -c -O  getr.c",N_CMDLINE,0x0,0x0,0x0

Of course it's very unlikely that shipped binaries will contain information about compiler flags.

Thursday May 15, 2008

Crossfile inlining and inline templates

Found an interesting 'feature' of using crossfile (-xipo) optimisation together with inline templates. Suppose you have a 'library' routine which is defined in one file and uses an inline template. This library routine is used all over the code. Here's an example of such a routine:

int T(int);
int W(int i)
{
  return T(i);
}

The routine W relies on an inline template (T) to do the work. The inline template contains some code like:

.inline T,0
  add  %o0,%o0,%o0
.end

The main routine resides in another file, and uses the routine W:

int W(int);
int main() { return W(1); }  /* argument value chosen arbitrarily */

To use inline templates you compile the file that contains the call to the inline template together with the inline template file that it calls (assumed here to be named t.il) - like this:

$ cc -c -xO4 m.c
$ cc -c -xO4 w.c t.il
$ cc -xO4 m.o w.o

However, when crossfile optimisation (-xipo) is used, the routine W is inlined into main, and now main has a dependence on the inline template. But when m.o is recompiled after W has been inlined into main, the compiler cannot see the inline template for T because it was not present on the initial compile line for m.c. The result of this is an error like:

$ cc -c -xO4 -xipo m.c
$ cc -c -xO4 -xipo w.c t.il
$ cc -xO4 -xipo m.o w.o
Undefined                     first referenced
 symbol                             in file
T                                   m.o
ld: fatal: Symbol referencing errors. No output written to a.out

As you might guess from the above description, the workaround is not intuitive. You need to add the inline template to the initial compile of the file m.c:

$ cc -c -xO4 -xipo m.c t.il
$ cc -c -xO4 -xipo w.c t.il
$ cc -xO4 -xipo m.o w.o

It is not sufficient to add the inline template to the final compile line.

Looking beyond the simple test case shown above, the problem really is that when crossfile optimisation is used, the developer is no longer aware of the places in the code where inlining has happened (which is as it should be). So the developer can't know which initial compile lines to add the inline template to.

Hence, the conclusion is that whenever you are compiling code that relies on inline templates with crossfile optimisation, it is necessary to include the inline template on the compile line of every file.

Sun Studio technical articles

Another list of Sun Studio technical articles.

Redistributable libraries

Steve Clamage and I just put together a short article on using the redistributable libraries that are shipped as part of the compiler. The particular one we focus on is stlport4 since this library is commonly substituted for the default libCstd.

There are two points to take away from the article. The first is that the required libraries should be copied into a new directory structure for distribution with your application - this makes it easy to patch them, and ensures that the correct version is picked up. The second is to use the $ORIGIN token when linking the application to specify the path, relative to the location of the executable, where the library will be found at runtime.

Runtime linking is one of my bugbears. I really get fed up with software that requires libraries to be located in particular places in order for it to run, or worse, software that requires LD_LIBRARY_PATH to be set for the application to locate the libraries (see Rod Evans' blog entry).

Monday May 12, 2008

Slides for CommunityOne

All the slides for last week's CommunityOne conference are available for download. I was presenting in the CMT stream, and you can find my slides here. Note that to download the slides, you'll need to use the username and password shown on the page.

My talk was on parallelisation: what's supported by the compiler, the steps to do it, and the tools that support it. I ended with an overview of microparallelisation.

Friday May 02, 2008

Official reschedule notice for CommunityOne

Session ID: S297077
Session Title: Techniques for Utilizing CMT
Track: Chip Multithreading (CMT): OpenSPARC™
Room: Esplanade 302
Date: 2008-05-05
Start Time: 13:30 

The official timetable has also been updated

OpenSolaris Summit

I'm stepping in to present at the OpenSolaris Summit on Sunday. The presentation is titled "Optimizing for OpenSolaris", and I believe someone else has already prepared the slideset - we'll see. Anyway, I'm looking forward to going; the attendees list contains many familiar names.

Embedded Systems Conference Presentation

I got the opportunity to present at the Embedded Systems Conference in San Jose a couple of weeks back. My presentation covered parallelising a serial application: a quick tour of what to do, together with an overview of the tools that Sun Studio provides to help out. The presentation is now available on the OpenSPARC website.

Thursday Apr 24, 2008

Second life slides and script

Just completed the Second Life presentation. It appeared well attended, and I got a bundle of great questions at the end. If you were there, thank you! I've uploaded a screen shot that I managed to get before the presentation started. Unfortunately, I didn't get a picture of the stage setup with the life-size books, a very nice touch.

Thursday Mar 20, 2008

The much maligned -fast

The compiler flag -fast gets an unfair rap. Even the compiler reports:

cc: Warning: -xarch=native has been explicitly specified, or 
implicitly specified by a macro option, -xarch=native on this 
architecture implies -xarch=sparcvis2 which generates code that 
does not run on pre UltraSPARC III processors

which is hardly fair given that the UltraSPARC III line came out about 8 years ago! So I want to quickly discuss what's good about the option, and what reasons there are to be cautious.

The first thing to talk about is the warning message. -xtarget=native is a good option to use when the target platform is also the deployment platform. For me, this is the common case, but for people producing applications that are more generally deployed, it's not. The best way to avoid the warning and produce binaries that work with the widest range of hardware is to add the flag -xtarget=generic after -fast (compiler flags are parsed from left to right, so the rightmost flag is the one that gets obeyed). The generic target represents a mix of all the important processors, and it produces code that should work well on all of them.

The next option which is in -fast for C that might cause some concern is -xalias_level=basic. This tells the compiler to assume that pointers of different basic types (e.g. integers, floats etc.) don't alias. Most people code to this, and the C standard actually has higher demands on the level of aliasing the compiler can assume. So code that conforms to the C standard will work correctly with this option. Of course, it's still worth being aware that the compiler is making the assumption.
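To make the assumption concrete, here is a minimal sketch; the function names are made up for illustration and do not come from the compiler documentation. In the first routine the compiler is allowed to assume that the int pointer and the float pointer never refer to the same memory. The second routine shows the usual portable way to look at the bits of a float, which is to copy the bytes rather than cast the pointer to a different basic type.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical example: count and scale point to different basic types, so
   under -xalias_level=basic the compiler may assume the store to *count
   cannot change *scale. */
float scale_and_count(int *count, const float *scale, float value)
{
    *count += 1;              /* assumed not to modify *scale */
    return value * *scale;
}

/* Portable way to inspect a float's bit pattern: copy the bytes instead of
   reading the float through an integer pointer. Assumes a 32-bit float. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Code that reads a float through an int pointer instead of using something like float_bits is the kind of code that can break when the compiler makes this assumption.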

The final area is floating point simplification. That's the flags -fsimple=2 which allows the compiler to reorder floating point expressions, -fns which allows the processor to flush subnormal numbers to zero, and some other flags that use faster floating point libraries or inline templates. I've previously written about my rather odd views on floating point maths. Basically it comes down to this: if these options make a difference to the performance of your code, then you should investigate why they make a difference.
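As a small illustration of why reordering matters, the sketch below adds the same three values with two different groupings. The function names are invented for this example, and the volatile temporaries are only there to keep each grouping exactly as written, whatever flags are used.

```c
/* Sum three values as (a + b) + c. */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;   /* volatile pins the grouping as written */
    return t + c;
}

/* Sum the same three values as a + (b + c). */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}
```

With IEEE double precision, sum_left(1e20, -1e20, 1.0) gives 1.0, while sum_right(1e20, -1e20, 1.0) gives 0.0, because the 1.0 is lost when it is added to -1e20. A compiler that is allowed to regroup the expression can turn one result into the other.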

Since -fast contains a number of flags which impact performance, it's probably a good plan to identify exactly those flags that do make a difference, and use only those. A tool like ats can really help here.

Performance tuning recipe

Dan Berger posted a comment about the compiler flags we'd used for Ruby. Basically, we've not done compiler flag tuning yet, so I'll write a quick outline of the first steps in tuning an application.

  • First of all profile the application with whatever flags it usually builds with. This is partly to get some idea of where the hot code is, but it's also useful to have some idea of what you're starting with. The other benefit is that it's tricky to debug build issues if you've already changed the defaults.
  • It should be pretty easy at this point to identify the build flags. Probably they will flash past on the screen, or in the worst case, they can be extracted (from non-stripped executables) using dumpstabs or dwarfdump. It can be necessary to check that the flags you want to use are actually the ones being passed into the compiler.
  • Of course, I'd certainly use spot to get all the profiles. One of the features spot has that is really useful is to archive the results, so that after multiple runs of the application with different builds it's still possible to look at the old code, and identify the compiler flags used.
  • I'd probably try -fast, which is a macro flag, meaning it enables a number of optimisations that typically improve application performance. I'll probably post later about this flag, since there's quite a bit to say about it. Performance under the flag should give an idea of what to expect with aggressive optimisations. If the flag does improve the application's performance, then you probably want to identify the exact options that provide the benefit and use those explicitly.
  • In the profile, I'd be looking for routines which seem to be taking too long, and I'd be trying to figure out what was causing the time to be spent. For SPARC, the execution count data from bit that's shown in the spot report is vital in being able to distinguish between code that runs slowly and code that runs fast but is executed many times. I'd probably try various other options based on what I saw. Memory stalls might make me try prefetch options, while TLB misses would make me tweak the -xpagesize options.
  • I'd also want to look at using crossfile optimisation, and profile feedback, both of which tend to benefit all codes, but particularly codes which have a lot of branches. The compiler can make pretty good guesses on loops ;)

These directions are more a list of possible experiments than necessarily an item-by-item checklist, but they form a good basis. And they are not an exhaustive list...

Friday Feb 01, 2008

The meaning of -xmemalign

I made some comments on a thread on the forums about memory alignment on SPARC and the -xmemalign flag. I've talked about memory alignment before, but this time the discussion was more about how the flag works. In brief:

  • The flag has two parts -xmemalign=[1|2|4|8][i|s]
  • The number specifies the alignment that the compiler should assume when compiling an object file. So if the compiler is not certain that the current variable is correctly aligned (say it's accessed through a pointer) then the compiler will assume the alignment given by the flag. Take a single precision floating point value that takes four bytes. Under -xmemalign=1[i|s] the compiler will assume that it is unaligned, so will issue four single byte loads to load the value. If the alignment is specified as -xmemalign=2[i|s] the compiler will assume two byte alignment, so will issue two loads to get the four byte value.
  • The suffix [i|s] tells the compiler how to behave if there is a misaligned access. For 32-bit codes the default is i which fixes the misaligned access and continues. For 64-bit codes the default is s which causes the app to die with a SIGBUS error. This is the part of the flag that has to be specified at link time because it causes different code to be linked into the binary depending on the desired behaviour. The C documentation captures this correctly, but the C++ and Fortran docs will be updated.
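To picture what "four single byte loads" means, here is a hand-written C sketch of the same idea. It only illustrates the technique and is not the code the compiler actually emits; the function name is made up.

```c
#include <stdint.h>

/* Build a 32-bit value from four single-byte loads, most significant byte
   first (SPARC is big-endian). Because each load is one byte wide, this
   works whatever the alignment of p. */
uint32_t load_u32_bytewise(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}
```

The cost is four loads plus shifts and ors in place of one load, which is why assuming the smallest alignment everywhere is slower than telling the compiler the alignment that the data really has.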

Monday Jan 28, 2008

Sample chapter from Solaris Application Programming available

There's a sample chapter from my book up on

It's chapter 4, which discusses the tools that come with Solaris and Sun Studio. The chapter exists because I find that there are some tools that I use every day, some tools that I might touch once a month, and some that I use even more rarely. The problems I hit are:

  • What was the name of the tool which ....?
  • What are the command line options to ...?
  • Is there a tool to ....?

Obviously I hit the third problem very infrequently, but I'm sometimes surprised when I discover a tool which I'd previously never heard of which just happens to do exactly what I need. Anyway I hope you find the chapter useful. It's one of my two solutions to this problem.

The other solution is spot, which attempts to collect all the data that you routinely need for performance analysis of an application. It calls the other tools, so you don't need to know their names or command lines. One thing that should be noticeable about spot is that it has few command-line options. I was hoping that we'd end up with none, but some are inevitable, and most of those are really house-keeping options (where to put the report, what to call it). The exception is -X, which generates an extended report: given the time it can take to gather the data, it seemed appropriate to do the high-value stuff quickly, with an option for the tool to take longer when the user says that is ok.

Thursday Jan 24, 2008

-xalias_level in C and C++

The compiler flag -xalias_level takes various options, and specifies the amount of aliasing that the compiler should assume. For example -xalias_level=any (the default at optimisation levels below -fast) means that the compiler should assume that any pointers may alias (i.e. point to the same location in memory).
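A short sketch of what aliasing means in practice; the routine is hypothetical and only serves to show the effect. When the two pointers may refer to the same location, the first store can change the value that the second statement reads, so the compiler has to reload it.

```c
/* Adds *b to *a twice. If a and b may alias, *b has to be reloaded after
   the first store, because that store may have changed it. */
void add_twice(int *a, const int *b)
{
    *a += *b;
    *a += *b;
}
```

Called with two distinct variables that both hold 1, the first ends up as 3. Called with the same variable for both arguments (holding 1), it ends up as 4. A compiler that is told the pointers cannot alias may load *b just once, which is only correct in the first case.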

In Sun Studio the available levels have different names for C and C++. adopted the C levels for both C and C++. The following table shows the mapping between the C and C++ names:


Monday Dec 17, 2007

Open source application tuning

My group has started a page on the Sun wiki detailing the steps necessary to compile and build a number of open source applications. The page also contains links to useful destinations in the compiler documentation. Feel free to suggest ideas for applications that we should cover there - I can't guarantee that we'll manage to look at them, but I'd love to know what's important to you!

Monday Nov 26, 2007

Other Sun Studio videos

There were two other videos posted to the hpc portal. Marty Itzkowitz talking about the Performance Analyzer and Vijay Tatkar talking about portable applications.

Multi-threading webcast

A long while back I was asked to contribute a video that talked about parallelising applications. The final format is a webcast (audio and slides) rather than the expected video. This choice ended up being made to provide the clearest visuals of the slides, plus the smallest download.

I did get the opportunity to do the entire presentation on video - which was an interesting experience. I found it surprisingly hard to present to just a camera - I think the contrast with presenting to an audience is that you can look around the room and get feedback as to the appropriate level of energy to project. A video camera gives you no such feedback, and worse, there's no other place to look. Still I was quite pleased with the final video. The change to a webcast was made after this, so the audio from the video was carried over, and you still get to see about 3 seconds of the original film, but the rest has gone. I also ended up reworking quite a few of the slides - adding animation to clarify some of the topics.

The topics, covered at a break-neck pace, are parallelising using Pthreads and OpenMP, autoparallelisation by the compiler, profiling parallel applications and, finally, detecting data races using the thread analyzer.

Tuesday Nov 06, 2007

Sun Studio support matrix

This table shows the available support for the various generations of Sun compilers. The name has evolved from Forte Developer, through Sun ONE Studio, to the current Sun Studio.

Friday Oct 26, 2007

What would you like to see in Sun Studio?

The Sun Studio team are inviting anyone to give them feedback about what features should be in the next version. Obviously, it is not going to be possible to implement every suggestion, but if you have any requests then use the following procedure before the cut-off date of 31st October (yup, that's not too far in the future). Alternatively, paste them as comments into my blog, and I'll make sure that they get put on the list.

The procedure for filing a request for enhancement (rfe) against Sun Studio is as follows:

  • Go to and check the box at the bottom of the page which says that you understand that the tool is not a support mechanism, then click on "Start a new Report"
  • Select the following:
    • Type = Request for Enhancement
    • Product/Category = "C/C++/Fortran Compilers and Tools - Misc"
    • Subcategory = "C/C++/Fortran Compilers and Tools - Misc"
    • Release = "Other"
    • Operating System = choose appropriate OS
  • Click on "Continue"

On the next screen, you only need to fill out the required fields indicated by asterisks. Some notes for two of the fields:

  • The "Synopsis" field should start with the identifier "VOC" (short for "Voice of Customer"), e.g. "VOC: Please add this feature".
  • Complete the "Justification" field with details on why the feature is important - the more credible the reason the better.
  • Finally, click on "Submit"

Thanks for your help.

BTW, the same form can be used to report bugs in the product too - but as the check box indicates it is not a way of getting support.

Wednesday Oct 10, 2007

Debugging OpenMP code

The compiler flag -xopenmp enables the recognition of OpenMP directives. As a side effect it also raises the optimisation level to -xO3. If you're trying to debug the code, you'll not want the optimisation level raised, so use the option -xopenmp=noopt, which enables the recognition of OpenMP directives but does not increase the optimisation level.

It's also worth compiling with the flags -xvpara and -xloopinfo which report parallelisation information.
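For reference, this is the sort of code the flags act on: a minimal sketch of a loop carrying an OpenMP directive. The routine name is made up for the example.

```c
/* Sum the integers 1..n. When OpenMP directives are recognised (-xopenmp or
   -xopenmp=noopt) the iterations are shared between threads and the partial
   sums are combined by the reduction clause. When they are not recognised
   the pragma is ignored and the loop runs serially with the same result. */
long sum_to(int n)
{
    long total = 0;
    int i;
    #pragma omp parallel for reduction(+:total)
    for (i = 1; i <= n; i++) {
        total += i;
    }
    return total;
}
```

Building this once with -xopenmp=noopt and once with -xopenmp is a quick way to tell a problem in the parallelisation apart from one that only appears at the higher optimisation level.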

Tuesday Oct 09, 2007

CMT Developer Tools on the UltraSPARC T2 systems

The CMT Developer Tools are included on the new UltraSPARC T2 based systems, together with Sun Studio 12, and GCC for SPARC Systems.

The CMT Developer Tools are installed in:

and are (unfortunately) not on the default path.

Compiling for the UltraSPARC T2

Today, Sun launched systems based on the UltraSPARC T2. A question that is bound to come up is what compiler flags should be used for the processor?

Sun Studio 12 has the flag -xtarget=ultraT2 to specifically target the UltraSPARC T2. But before jumping off and using this flag, let's take the flag apart and see what it actually means. There are three components that are set by the -xtarget flag :

  • -xcache flag. This flag tells the compiler to target a particular cache configuration. The flag will have an impact on floating point code where the loops can be tiled to fit into cache. Obviously not all codes are amenable to this optimisation, so the -xcache setting is usually unimportant.
  • -xchip flag. This sets the instruction latencies and instruction selection preferences. The UltraSPARC T2 (in common with the UltraSPARC T1) has a simple pipeline so there is nothing much to gain from accurately modelling the instruction latencies. There are also no real situations where it will do better with one instruction sequence in preference to another (unless one is longer than the other). So for the UltraSPARC T2 this flag has little impact on the generated code.
  • -xarch flag. The -xarch flag controls the target architecture. This is traditionally used principally to control whether 32-bit or 64-bit binaries are generated. However, Sun Studio 12 introduced the flags -m32 and -m64 to separate the address-size of the binary from the instruction set selection. There are no UltraSPARC T2 specific instructions which the compiler currently generates, so the default of the SPARC V9 ISA is fine.

To summarise, there is an UltraSPARC T2 specific compiler flag, but for most situations the best target to use would be -xtarget=generic which should give good performance over a wide range of processors.
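The tiling that the -xcache discussion refers to can be sketched by hand. The example below is an illustration of the general technique rather than what the compiler generates: it transposes a matrix one block at a time. The tile size of 32 and the function name are assumptions made for the example.

```c
#include <stddef.h>

#define TILE 32  /* assumed tile edge; a real choice depends on the cache */

/* Transpose an n x n row-major matrix block by block, so that the rows of
   src and dst touched by one block can stay in cache while it is processed. */
void transpose_tiled(const double *src, double *dst, size_t n)
{
    size_t ii, jj, i, j;
    for (ii = 0; ii < n; ii += TILE) {
        for (jj = 0; jj < n; jj += TILE) {
            /* Clamp the block edges for sizes that are not a multiple of TILE. */
            size_t i_end = (ii + TILE < n) ? ii + TILE : n;
            size_t j_end = (jj + TILE < n) ? jj + TILE : n;
            for (i = ii; i < i_end; i++) {
                for (j = jj; j < j_end; j++) {
                    dst[j * n + i] = src[i * n + j];
                }
            }
        }
    }
}
```

The result is identical to a plain two-loop transpose. Only the order in which memory is visited changes, and that order is what the cache configuration influences.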

Friday Oct 05, 2007

dbx commands and disassembly

Here's a list of the commands supported by dbx. The listi command shows the disassembled code for the next instructions:

% dbx a.out
(dbx) stop in main
dbx: warning: 'main' has no debugger info -- will trigger on first instruction
(2) stop in main
(dbx) run
Running: a.out
(process id 16158)
stopped in main at 0x00010cd0
0x00010cd0: main       :        save     %sp, -120, %sp
(dbx) listi
dbx: warning: no source lines at current PC; use "dis" instead
0x00010cd4: main+0x0004:        sethi    %hi(0x21000), %l0
0x00010cd8: main+0x0008:        clr      [%l0 + 0x00000164]

Friday Sep 28, 2007

Solaris Application Programming Table of Contents

A couple of folks requested that I post the table of contents for my book. This is the draft TOC, not the finished product. I assume that there will be a good correspondence, but the final version should definitely look neater.

Tuesday Sep 25, 2007

Solaris Application Programming book

I'm thrilled to see that my book is being listed for pre-order on Amazon in the US. It seems to take about a month for it to travel the Atlantic to Amazon UK.

Thursday Sep 13, 2007

Dwarf debug format

Just been looking at the dwarf debug format. In Sun Studio 12, all the compilers switched to using this format over the older stabs format. Chris Quenelle has put a useful guide to debug information up on the Sun Wiki. The dwarf format overview document is probably the easiest to read, the version 3 specification document is more detailed. The tool to extract the dwarf information from an object file is dwarfdump. For example the -l flag will extract line number info.

Wednesday Sep 12, 2007

Sun Studio 12 Performance Analyzer docs available

The documentation for the Sun Studio 12 version of the Performance Analyzer has gone live. The Sun Studio 12 docs collection is also available.

Article on discover (tool for detecting memory access errors)

My (brief) article on the Sun Memory Error Discovery Tool (aka discover) has gone live on the developer portal. The tool is designed to find memory access errors. It instruments the code and checks each memory access to see if it is valid. It can detect things like access to uninitialised memory, or accesses past the end of arrays. These can be really hard to find by inspecting the source, and even from core dumps - the symptoms of the problem often occur long after the memory access error has occurred. The one constraint on the tool is that it is currently only for single threaded apps.

Wednesday Aug 15, 2007

Comparing analyzer experiments

When performance tuning an application it is really helpful to be able to compare the performance of the current version of the code with an older version. This was one of the motivators for the performance reporting tool spot. spot takes great care to capture information about the system that the code was run on, the flags used to build the application, together with the obvious things like the profile of the application. The tool spot_diff, which is included in the latest release of spot, pulls out the data from multiple experiments and produces a very detailed comparison between them - indicating if, for example, one version had more TLB misses than another version.

However, there are situations where it's necessary to compare two analyzer experiments, and er_xcmp is a tool which does just that.

er_xcmp extracts the time spent in each routine for the input data that is passed to it, and presents this as a list of functions together with the time spent in each function from each data set. er_xcmp handles an arbitrary number of input files, so it's just as happy comparing three profiles as it is two. It's also able to handle data from bit, so comparisons of instruction count as well as user time are possible.

The input formats can be Analyzer experiments, fsummary output from er_print, or output directories from er_html - all three formats get boiled down to the same thing and handled in the same way by the script.

Here's some example output:

% er_xcmp
    2.8     8.8 <Total>
    1.8     2.6 foo
    1.0     6.2 main
    N/A     N/A _start

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge