Wednesday Feb 26, 2014

Article on RAW hazards

Feels like it's been a long while since I wrote up an article for OTN, so I'm pleased that I've finally got around to fixing that.

I've written about RAW hazards in the past. But I recently went through a patch of discovering them in a number of places, so I've written up a full article on them.

What is "nice" about RAW hazards is that once you recognise them for what they are (and that's the tricky bit), they are typically easy to avoid. So if you see 10 seconds of time attributable to RAW hazards in the profile, then you can often get the entire 10 seconds back by tweaking the code.

Tuesday Jan 10, 2012

What's inlined by -xlibmil

The compiler flag -xlibmil provides inline templates for some critical maths functions; as an optimisation, the inlined versions do not set errno. The functions it inlines can vary from release to release, so it's useful to be able to see which functions are inlined, and to determine whether you care that they don't set errno. You can see the list of functions using the command:

grep inline /compilerpath/prod/lib/
        .inline sqrtf,1
        .inline sqrt,2
        .inline ceil,2
        .inline ceilf,1
        .inline floor,2
        .inline floorf,1
        .inline rint,2
        .inline rintf,1

From a cursory glance at the list I got when I did this just now, I can only see sqrt as a function that sets errno. So if you use sqrt and you care about whether it set errno, then don't use -xlibmil.

Monday Apr 25, 2011

Using pragma opt

The Studio compiler has the ability to control the optimisation level that is applied to particular functions in an application. This can be useful if the functions are designed to work at a specific optimisation level, or if the application fails at a particular optimisation level, and you need to figure out where the problem lies.

The optimisation levels are controlled through pragma opt. The following steps need to be followed to use the pragma:

  • The directive needs to be inserted into the source file. The format of the directive is #pragma opt level (function). It needs to appear after the function's prototype, but before the start of the function definition.
  • The code needs to be compiled with the flag -xmaxopt=level. This sets the maximum optimisation level for all functions in the file - including those tagged with #pragma opt.

We can see this in action using the following code snippet, which contains two identical functions, both returning the square of a global variable. However, we use #pragma opt to control the optimisation level of the function f().

int f();
int g();

#pragma opt 2 (f)

int d;

int f()
{
  return d*d;
}

int g()
{
  return d*d;
}

The code is compiled with the flag -xmaxopt=5, which specifies the maximum optimisation level that can be applied to any function in the file:

$ cc -O -xmaxopt=5 -S opt.c

If we compare the disassembly for the functions f() and g(), we can see that g() is more optimal as it does not reload the global data.

/* 000000          0 */         sethi   %hi(d),%o5

!   10                !  return d*d;

/* 0x0004         10 */         ldsw    [%o5+%lo(d)],%o4 ! volatile    // First load of d
/* 0x0008            */         ldsw    [%o5+%lo(d)],%o3 ! volatile    // Second load of d
/* 0x000c            */         retl    ! Result =  %o0
/* 0x0010            */         mulx    %o4,%o3,%o0

/* 000000         14 */         sethi   %hi(d),%o5
/* 0x0004            */         ld      [%o5+%lo(d)],%o4               // Single load of d

!   15                !  return d*d;

/* 0x0008         15 */         sra     %o4,0,%o3
/* 0x000c            */         retl    ! Result =  %o0
/* 0x0010            */         mulx    %o3,%o3,%o0

Tuesday May 11, 2010

New Book: Multicore application programming

I'm very pleased to be able to talk about my next book Multicore Application Programming. I've been working on this for some time, and it's a great relief to be able to finally point to a webpage indicating that it really exists!

The release date is sometime around September/October. Amazon has it as the 11th October, which is probably about right. It takes a chunk of time for the text to go through editing, typesetting, and printing, before it's finally out in the shops. The current status is that it's a set of documents with a fair number of virtual sticky tags attached indicating points which need to be refined.

One thing that should immediately jump out from the subtitle is that the book (currently) covers Windows, Linux, and Solaris. In writing the book I felt it was critical to try and bridge the gaps between operating systems, and avoid writing it about only one.

Obviously the difference between Solaris and Linux is pretty minimal. The differences with Windows are much greater, but, when writing to the Windows native threading API, the actual differences are more syntactic than functional.

By this I mean that the name of the function changes and the parameters change a bit, but the meaning of the function call does not. For example, on POSIX systems you might call pthread_create(), while on Windows you might call _beginthreadex(); the name and some of the parameters differ, but both calls create a new thread.

I'll write a follow up post containing more details about the contents of the book.

Monday Sep 28, 2009

Updated compiler flags article

Just updated the Selecting The Best Compiler Options article for the developer portal. Minor changes, mainly a bit more clarification on floating point optimisations.

Thursday Jun 11, 2009

Code complete: burn this chapter

That's a somewhat inflammatory title for this post, but continuing from my previous post on the book Code Complete, I think that the chapters (25 & 26) on performance really do not contain good advice or good examples.

To make this more concrete, consider the example on pg 593 where Steve McConnell compares the performance of these two code fragments:

for i = 1 to 10
  a[ i ] = i
end for

a[ 1 ] = 1
a[ 2 ] = 2
a[ 3 ] = 3
a[ 4 ] = 4
a[ 5 ] = 5
a[ 6 ] = 6
a[ 7 ] = 7
a[ 8 ] = 8
a[ 9 ] = 9
a[ 10 ] = 10

Steve finds that Visual Basic and Java run the unrolled version of the loop faster.

There's a couple of examples that talk about incorrect access ordering for arrays. Here's some C code that illustrates the problem:

Slow code:

for (column=0; column < max_column; column++)
  for (row=0; row < max_row; row++)
    ...

Fast code:

for (row=0; row < max_row; row++)
  for (column=0; column < max_column; column++)
    ...

On page 599 it is suggested that the slow code is inefficient because it might cause paging to disk; on page 623 it is suggested that the higher trip count loop should be inside to amortise the initialisation overhead for each execution of the inner loop. Neither of these explanations is right. As I'm sure most of you recognise, the code is slow because of cache misses incurred when accessing non-adjacent memory locations. There is a cost to initialising the inner loop, but nothing significant. And yes, you could get paging to disk, but only if you are running out of memory (and if you're running out of memory, you're hosed anyway!). You're more likely to get TLB misses (and perhaps that is what Mr McConnell intended to say).

I consider the above issues to be quite serious, but, unfortunately, I'm not terribly happy with the rest of the material. Hence my recommendation to ignore (or burn ;) these chapters. I'll go through my other reservations now.

Lack of details. The timing information is presented with no additional information (pg 623) "C++ Straight Time = 4.75 Code-Tuned Time = 3.19 Time Savings = 33%". What was the compiler? What compiler flags were given? What was the test harness?

The book presents this as though "C++" itself runs the code slowly, but in reality it is more likely a test of the effectiveness of the compiler, and of the user's ability to use the compiler. I'd be surprised if any compiler with minimal optimisation enabled failed to do the loop interchange necessary to get good performance. Which leads to my next observation:

Don't compilers do this? I think the book falls into one of the common "optimisation book" traps, where lots of ink is spent describing and naming the various optimisations. This gives the false impression that it is necessary for the expert programmer to be able to identify these optimisations and apply them to their program. Most compilers will apply all these optimisations - after all, that is what compilers are supposed to do - take the drudgery out of producing optimal code. It's great for page count to enumerate all the possible ways that code might be restructured for performance, but for most situations the restructuring will lead to code that has the same performance.

Profiling. It's not there! To me the most critical thing that a developer can do to optimise their program is to profile it. Understanding where the time is being spent is the necessary first step towards improving the performance of the application. This omission is alarming. The chapter already encourages users to do manual optimisations where there might be no gains (at the cost of time spent doing restructuring that could be better spent writing new code, and the risk that the resulting code is less maintainable), but without profiling the application, the users are basically encouraged to do this over the entire source code, not just the lines that actually matter.

Assembly language. Yes, I love assembly language, there's nothing I enjoy better than working with it (no comment), but I wouldn't encourage people to drop into it for performance reasons, unless they had utterly exhausted every other option. The book includes an example using Delphi where the assembly language version ran faster than the high-level version. My guess is that the compilers had some trouble with aliasing, and hence had more loads than were necessary - a check of the assembly code that the compilers generated would indicate that, and then it's pretty straightforward to write assembly-language-like high level code that the compiler can produce optimal code for. [Note, that I view reading and analysing the code at the assembly language level to be very useful, but I wouldn't recommend leaping into writing assembly language without a good reason.]

So what would I recommend?

  • Profile. Always profile. This will indicate where the time is being spent, and what sort of gains you should expect from optimising parts of the application.
  • Know the tools. Make sure that you know what compiler flags are available, and that you are requesting the right kind of things from the compiler. All too often there are stories about how A is faster than B, which are due to people not knowing how to use the tools.
  • Identify those parts of the code where the time is spent, and examine them in detail to determine if it's a shortcoming of the compiler, the compiler flags, or an ambiguity in the source code, that causes time to be spent there. Many performance problems can be solved by adding a new flag, or perhaps a minor tweak to the source code.
  • Only when you have exhausted all other options, and you know that you can get a significant performance gain should you start wildly hacking at the source code, or recoding parts in assembly language.

The other thing to recommend is a read of Bart Smaalder's Performance Anti-patterns.

Wednesday May 20, 2009

Libraries (5) - Runtime costs - TLBs

The next consideration when using libraries is that each library will get mapped in on a new virtual page of memory; as shown in this pmap output:

% pmap 60500
60500:  a.out
00010000       8K r-x--  /libraries/a.out
00020000       8K rwx--  /libraries/a.out
FEEC0000      24K rwx--    [ anon ]
FEED0000       8K r-x--  /libraries/
FEEE0000       8K rwx--  /libraries/
FEEF0000       8K r-x--  /libraries/
FEF00000       8K rwx--  /libraries/
FEF10000       8K r-x--  /libraries/
FEF20000       8K rwx--  /libraries/
FEF30000       8K r-x--  /libraries/
FEF40000       8K rwx--  /libraries/
FEF50000       8K rwx--    [ anon ]
FEF60000       8K r-x--  /libraries/
FEF70000       8K rwx--  /libraries/
FEF80000       8K r-x--  /libraries/
FEF90000       8K rwx--  /libraries/
FEFA0000       8K r-x--  /libraries/
FEFB0000       8K rwx--  /libraries/
FEFC0000       8K r-x--  /libraries/

There are a finite number of TLB entries on a chip. If each library takes an entry, and the code jumps around between libraries, then a single application can utilise quite a few TLB entries. Take a CMT system where there are multiple applications (or copies of the same application) running, and the pressure on the TLB becomes considerable.

One of the enhancements in Solaris to support CMT processors is Shared Context. When multiple applications map the same library at the same address, then they can share a single context to map that library. This can lead to a significant reduction in the TLB pressure. Shared context only works for libraries that are loaded into the same memory locations in different contexts, so it can be defeated if the libraries are loaded in different orders or any other mechanisms that scramble the locations in memory.

If each library is mapped into a different TLB entry, then every call into a new library is a new ITLB entry, together with a jump through the PLT, together with the normal register spill/fill overhead. This can become quite a significant chunk of overhead.

To round this off, let's look at some figures from an artificial code run on an UltraSPARC T1 system that was hanging around here.

  • Application that jumps between 26 different routines a->b->c->...->z, all included in the same executable: 3s
  • The same 26 routines provided as a single library, so calls are routed through the PLT: 6s
  • The same 26 routines in a single library, but all declared static except for the initial routine called by main, so calls within the library avoid the PLT: 3s
  • Each routine defined in its own library, so calls go through the PLT and also require a new ITLB entry: 60s

Since the routines in this test code don't actually do anything, the overhead of calling through the PLT is clearly shown as a doubling of runtime. However, this is insignificant when compared with the costs of calling to separate libraries, which is about 10x slower than this.

Moving the experiment to look at the impact on CMT systems:

  • One copy of this executable per core of an UltraSPARC T1 processor: 1 minute
  • Two copies of this executable per core: 5 minutes
  • Four copies of this executable per core (fully loaded system): 8 minutes

Running multiple copies of the application has a significant impact on performance. The performance counters show very few instructions being executed, and much time being lost to ITLB misses. Now this performance is from a system without the shared context changes - so I would expect much better scaling on a system with these improvements (if I find one I'll rerun the experiment).

The conclusion is that care needs to be taken when deciding to split application code into libraries.

Libraries (4) - Runtime costs - Procedure Lookup Table (PLT)

Most applications spend the majority of their time running - rather than starting up. So it's useful to look at the costs of using libraries at runtime.

The most apparent cost of using libraries is that calls to routines now go indirectly to the target routine through the procedure look up table (PLT). Unless the developer explicitly limits the scope of a function, it is exported from the library as a global function, which means that even calls within the library will go through the PLT. Consider the following code snippet:

void func2()
{
}

void func1()
{
  func2();
}
If this is compiled into an executable the assembly code will look like:

        11104:  82 10 00 0f  mov        %o7, %g1
        11108:  7f ff ff f8  call       func2   ! 0x110e8
        1110c:  9e 10 00 01  mov        %g1, %o7

However, if this is compiled as part of a library then the code looks like:

         664:  82 10 00 0f  mov         %o7, %g1
         668:  40 00 40 b9  call        .plt+0x3c       ! 0x1094c
         66c:  9e 10 00 01  mov         %g1, %o7

This is a doubling of the cost of the call.

In C it's possible to limit the scope of the function using the static keyword. Declaring func1 as static will cause the compiler to generate a direct call to that routine. The downside is that the routine will only be visible within the source file that defines it. It is also possible to use other methods to limit the visibility of symbols.

Libraries (3) - Application startup costs

As can be seen from the previous graphs, even a simple application (like ssh) can pull in a fair number of libraries. Whenever a library is pulled in, the linker has to request memory, load the image from disk, and then link in all the routines. This effort takes time - it's basically a large chunk of the start up time of an application. If you profile the start up of an application, you'll probably not see a great deal, because most of this time is OS/disk activity mapping the libraries into memory.

Of course applications also have start up costs associated with initialising data structures etc. However, the biggest risk is that applications will pull in libraries that they don't need, or perhaps do need, but don't need yet. The best work-around for this is to lazy load the libraries. Of course it's fairly easy to write code that either breaks under lazy loading or breaks lazy loading. It's not hard to work around these issues with care, and doing so can have a substantial impact on start up time.

Thursday May 07, 2009

The perils of strlen

Just been looking at an interesting bit of code. Here's a suitably benign version of it:

#include <string.h>
#include <stdio.h>

void main()
{
  char string[50];
  int i;
  int j=0;
  for (i=0; i<strlen(string); i++)
  {
    if (string[i]=='1') {j=i;}
  }
}

Compiling this bit of code leads to a loop that looks like:

/* 0x002c         12 */         cmp     %i5,49
/* 0x0030         10 */         add     %i3,1,%i3
/* 0x0034         12 */         move    %icc,%i4,%i2
/* 0x0038         10 */         call    strlen  ! params =  %o0 ! Result =  %o0
/* 0x003c            */         or      %g0,%i1,%o0
/* 0x0040            */         add     %i4,1,%i4
/* 0x0044            */         cmp     %i4,%o0
/* 0x0048            */         bcs,a,pt        %icc,.L900000109
/* 0x004c         12 */         ldsb    [%i3],%i5

The problem being that for each character tested there's also a call to strlen! The reason for this is that the compiler cannot be sure what the call to strlen actually returns. The return value might depend on some external variable that could change as the loop progresses.

There are a lot of functions defined in the libraries that the compiler could optimise, if it was certain that it recognised them. The compiler flag that enables recognition of the "builtin" functions is -xbuiltin (which is included in -fast). This enables the compiler to do things like recognise calls to memcpy or memset and in some instances produce more optimal code. However, it doesn't recognise calls to strlen.

In terms of solving the problem, there are two approaches. The most portable approach is to hold the length of the string in a temporary variable:

  int length=strlen(string);
  for (i=0; i<length; i++)

Another, less portable approach, is to use #pragma no_side_effect. This pragma tells the compiler that the function has no side effects: its return value depends only on the parameters passed into the function. So the result of calling strlen only depends on the value of the constant string that is passed in. The modified code looks like:

#include <string.h>
#include <stdio.h>
#pragma no_side_effect(strlen)

void main()
{
  char string[50];
  int i;
  int j=0;
  for (i=0; i<strlen(string); i++)
  {
    if (string[i]=='1') {j=i;}
  }
}

And more importantly, the resulting disassembly looks like:

/* 0x0028          0 */         sub     %i1,49,%o7
/* 0x002c         11 */         add     %i3,1,%i3
/* 0x0030          0 */         sra     %o7,0,%o5
/* 0x0034         13 */         movrz   %o5,%i4,%i2
/* 0x0038         11 */         add     %i4,1,%i4
/* 0x003c            */         cmp     %i4,%i5
/* 0x0040            */         bcs,a,pt        %icc,.L900000109
/* 0x0044         13 */         ldsb    [%i3],%i1

Friday Mar 20, 2009

University of Washington Presentation

I was presenting at the University of Washington, Seattle, on Wednesday on Solaris and Sun Studio. The talk covers the tools that are available in Solaris and Sun Studio. This is my "Grand Unified" presentation, it covers tools, compilers, optimisation, parallelisation, and debug.

Wednesday Nov 05, 2008

OpenSPARC presentations

As part of the OpenSPARC book, we were asked to provide slideware and to present it. The details of what's available are listed on the OpenSPARC site, and the material is available for free from the site.

I contributed two sections. I produced the slides and did the voice over for the material on developing for CMT, the accompanying slides are also available. I also did a voice over for someone else's slides on Operating Systems for CMT (again slides available).

The recording sessions were ok, but a bit strange since it was just myself and the sound engineer working in a meeting room in Santa Clara. I get a lot of energy from live presentations, particularly the interactions with people, and I found the setup rather too quiet for my liking.

The Sun Studio presentation was relatively easy. It runs for nearly an hour, and there's a couple of places where I felt that additional slides would have helped the flow. The Operating Systems presentation was much harder as it was trying to weave a story around someone else's slide deck.

Friday May 23, 2008

Books on code tweaking

Following up on yesterday's entry on modulo arithmetic, I figured I'd note a couple of books that I've read in the past, and found interesting.

A while back a colleague pointed me to "Hacker's delight". There's recommendations from Josh Bloch and Guy Steele on the back cover. I found it an interesting read, but I remember being disappointed that there were not more general purpose tweaks. My copy of the book has disappeared somewhere or other, so I can't flick back through it and come up with anything better to say!

The book that most inspired me when I read it was Michael Abrash's "Zen of Code Optimization". It's probably one of the handful of reasons that I ended up doing what I do for a job. Together with his "Graphics Programming Black Book", it also caused me to spend so much time iterating on the graphics code in a game I was writing in my spare time, that I never got to write the game.... ;)

Anyway, the really cool thing about Abrash's books is that a few years ago they were released for free download in their entirety. Although the 386 processor that they focus on is no longer state-of-the-art, they're still interesting. They seem to still be available from various sources; a list is on Wikipedia. Chapter 14 on Boyer-Moore string searching is worth a read.

Monday Jan 07, 2008

putc in a multithreaded context

Just answering a question from a colleague. The application was running significantly slower when compiled as a multithreaded app compared to the original serial app. The profile showed mutex_unlock as being hot, but going up the callstack the routine that called mutex_unlock was putc.

This is the OpenSolaris source for putc, which shows a call to FLOCKFILE, which is defined in this file for MT programs. So for MT programs, a lock needs to be acquired before the character can be output.

Fortunately it is possible to avoid the locking using putc_unlocked. This call should not be used as a drop-in replacement for putc, but used after the appropriate mutex has been acquired. The details are in the Solaris Multi-threaded programming guide.

A test program that demonstrates this problem is:

#include <stdio.h>
#include <pthread.h>
#include <sys/time.h>

static double s_time;

void starttime()
{
  s_time = 1.0*gethrtime();
}

void endtime(long its)
{
  double e_time = 1.0*gethrtime();
  printf("Time per iteration %5.2f ns\n", (e_time-s_time)/(1.0*its));
}

void *dowork(void *params)
{
  FILE *s = fopen("/tmp/dldldldldld","w");
  starttime();
  for (int i=0; i<100000000; i++) { putc(i, s); }
  endtime(100000000);
  fclose(s);
  return 0;
}

void main()
{
  /* First run the loop with only the main thread active... */
  FILE *s = fopen("/tmp/dldldldldld","w");
  starttime();
  for (int i=0; i<100000000; i++) { putc(i, s); }
  endtime(100000000);
  fclose(s);
  /* ...then repeat it from a created thread. */
  pthread_t threads[1];
  pthread_create(&threads[0], 0, dowork, 0);
  pthread_join(threads[0], 0);
}

Here's the results of running the code on Solaris 10:

$ cc -mt putc.c
$ a.out
Time per iteration 30.55 ns
Time per iteration 165.76 ns

The situation on Solaris 10 is better than Solaris 9, since on Solaris 9 the cost of the mutex was incurred as soon as the code was compiled with the -mt flag, rather than only when multiple threads were actually active.


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge

