Tuesday May 26, 2015

Misaligned loads profiled (again)

A long time ago I described how misaligned loads appeared in profiles of 32-bit applications. Since then we've changed the level of detail presented in the Performance Analyzer. When I wrote the article the time spent on-cpu that wasn't User time was grouped as System time. We've now started showing more detail - and more detail is good.

Here's a similar bit of code:

#include <stdio.h>

static int i,j;
volatile double *d;

void main ()
  char a[10];
  for (i=0;i < j; i++) 

This code stores into a misaligned double, and that's all we need in order to generate misaligned traps and see how they are shown in the performance analyzer. Here's the hot loop:

Load Object: a.out

Inclusive       Inclusive       
User CPU Time   Trap CPU Time   Name
(sec.)          (sec.)          
1.131           0.510               [?]    10928:  inc         4, %i1
0.              0.                  [?]    1092c:  ldd         [%i2], %f2
0.811           0.380               [?]    10930:  faddd       %f2, %f0, %f4
0.              0.                  [?]    10934:  std         %f4, [%i2]
0.911           0.480               [?]    10938:  ldd         [%i2], %f2
1.121           0.370               [?]    1093c:  faddd       %f2, %f0, %f4
0.              0.                  [?]    10940:  std         %f4, [%i2]
0.761           0.410               [?]    10944:  ldd         [%i2], %f2
0.911           0.410               [?]    10948:  faddd       %f2, %f0, %f4
0.010           0.                  [?]    1094c:  std         %f4, [%i2]
0.941           0.450               [?]    10950:  ldd         [%i2], %f2
1.111           0.380               [?]    10954:  faddd       %f2, %f0, %f4
0.              0.                  [?]    10958:  cmp         %i1, %i5
0.              0.                  [?]    1095c:  ble,pt      %icc,0x10928
0.              0.                  [?]    10960:  std         %f4, [%i2]

So the first thing to notice is that we're now reporting Trap time rather than aggregating it into System time. This is useful because trap time is intrinsically different from system time, so it's worth displaying it differently. Fortunately the new overview screen highlights the trap time, so it's easy to recognise when to look for it.

Now, you should be familiar with the "previous instruction is to blame rule" for interpreting the output from the performance analyzer. Dealing with traps is no different, the time spent on the next instruction is due to the trap of the previous instruction. So the final load in the loop takes about 1.1s of user time and 0.38s of trap time.

Slight side track about the "blame the last instruction" rule. For misaligned accesses the problem instruction traps and its action is emulated. So the next instruction executed is the instruction following the misaligned access. That's why we see time attributed to the following instruction. However, there are situations where an instruction is retried after a trap, in those cases the next instruction is the instruction that caused the trap. Examples of this are TLB misses or save/restore instructions.

If we recompile the code as 64-bit and set -xmemalign=8i, then we get a different profile:

User CPU Time   Name
3.002           <Total>
2.882           __misalign_trap_handler
0.070           main
0.040           __do_misaligned_ldst_instr
0.010           getreg
0.              _start

For 64-bit code the misaligned operations are fixed in user-land. One (unexpected) advantage of this is that you can take a look at the routines that call the trap handler and identify exactly where the misaligned memory operations are:

0.480	main + 0x00000078
0.450	main + 0x00000054
0.410	main + 0x0000006C
0.380	main + 0x00000060
0.370	main + 0x00000088
0.310	main + 0x0000005C
0.270	main + 0x00000068
0.260	main + 0x00000074

This is really useful if there are a number of sites and your objective is to fix them in order of performance impact.

Tuesday May 05, 2015

C++ rules enforced in Studio 12.4

Studio 12.4 has improved adherence to the C++ standard, so some codes which were accepted by 12.3 might get reported as errors with the new compiler. The compiler documentation has a list of the improvements and examples of how to modify problem code to make it standard compliant.

Saturday Mar 21, 2015

New Studio C++ blogger

Please welcome another of my colleagues to blogs.oracle.com. Fedor's just posted some details about getting Boost to compile with Studio.

Tuesday Feb 17, 2015

Profiling the kernel

One of the incredibly useful features in Studio is the ability to profile the kernel. The tool to do this is er_kernel. It's based around dtrace, so you either need to run it with escalated privileges, or you need to edit /etc/user_attr to add something like:


The correct way to modify user_attr is with the command usermod:

usermod -K defaultpriv=basic,dtrace_user,dtrace_proc,dtrace_kernel <username>

There's two ways to run er_kernel. The default mode is to just profile the kernel:

$ er_kernel sleep 10
Creating experiment database ktest.1.er (Process ID: 7399) ...
$ er_print -limit 10 -func ktest.1.er
Functions sorted by metric: Exclusive Kernel CPU Time

Excl.     Incl.      Name
Kernel    Kernel
CPU sec.  CPU sec.
19.242    19.242     <Total>
14.869    14.869     <l_PID_7398>
 0.687     0.949     default_mutex_lock_delay
 0.263     0.263     mutex_enter
 0.202     0.202     <java_PID_248>
 0.162     0.162     gettick
 0.141     0.141     hv_ldc_tx_set_qtail

The we passed the command sleep 10 to er_kernel, this causes it to profile for 10 seconds. It might be better form to use the equivalent command line option -t 10.

In the profile we can see a couple of user processes together with some kernel activity. The other way to run er_kernel is to profile the kernel and user processes. We enable this mode with the command line option -F on:

$ er_kernel -F on sleep 10
Creating experiment database ktest.2.er (Process ID: 7630) ...
$ er_print -limit 5 -func ktest.2.er
Functions sorted by metric: Exclusive Total CPU Time

Excl.     Incl.     Excl.     Incl.      Name
Total     Total     Kernel    Kernel
CPU sec.  CPU sec.  CPU sec.  CPU sec.
15.384    15.384    16.333    16.333     <Total>
15.061    15.061     0.        0.        main
 0.061     0.061     0.        0.        ioctl
 0.051     0.141     0.        0.        dt_consume_cpu
 0.040     0.040     0.        0.        __nanosleep

In this case we can see all the userland activity as well as kernel activity. The -F option is very flexible, instead of just profiling everything, we can use -F =<regexp>syntax to specify either a PID or process name to profile:

$ er_kernel -F =7398

Monday Feb 09, 2015

New Studio blogger

One of my colleagues, Raj Prakash, has started a blog. If you saw the Studio videos from last year, you'll recall that Raj and I discussed memory error dectection, and appropriately enough his first post is on how you can use Studio's memory error detection capabilities on server-type applications.

Tuesday Feb 03, 2015

Complete set of bit manipulation posts combined

My recent blog posts on bit manipulation are now available as an article up on the OTN community pages. If you want to read the individual posts they are:

Bit manipulation: Gathering bits

In the last post on bit manipulation we looked at how we could identify bytes that were greater than a particular target value, and stop when we discovered one. The resulting vector of bytes contained a zero byte for those which did not meet the criteria, and a byte containing 0x80 for those that did. Obviously we could express the result much more efficiently if we assigned a single bit for each result. The following is "lightly" optimised code for producing a bit vector indicating the position of zero bytes:

void zeros( unsigned char * array, int length, unsigned char * result )
  for (int i=0;i < length; i+=8)
    result[i>>3] = 
    ( (array[i+0]==0) << 7) +
    ( (array[i+1]==0) << 6) +
    ( (array[i+2]==0) << 5) +
    ( (array[i+3]==0) << 4) +
    ( (array[i+4]==0) << 3) +
    ( (array[i+5]==0) << 2) +
    ( (array[i+6]==0) << 1) +
    ( (array[i+7]==0) << 0);

The code is "lightly" optimised because it works on eight values at a time. This helps performance because the code can store results a byte at a time. An even less optimised version would split the index into a byte and bit offset and use that to update the result vector.

When we previously looked at finding zero bytes we used Mycroft's algorithm that determines whether a zero byte is present or not. It does not indicate where the zero byte is to be found. For this new problem we want to identify exactly which bytes contain zero. So we can come up with two rules that both need be true:

  • The inverted byte must have a set upper bit.
  • If we invert the byte and select the lower bits, adding one to these must set the upper bit.

Putting these into a logical operation we get (~byte & ( (~byte & 0x7f) + 1) & 0x80). For non-zero input bytes we get a result of zero, for zero input bytes we get a result of 0x80. Next we need to convert these into a bit vector.

If you recall the population count example from earlier, we used a set of operations to combine adjacent bits. In this case we want to do something similar, but instead of adding bits we want to shift them so that they end up in the right places. The code to perform the comparison and shift the results is:

void zeros2( unsigned long long* array, int length, unsigned char* result )
  for (int i=0; i<length; i+=8)
    unsigned long long v, u;
    v = array[ i>>3 ];

    u = ~v;
    u = u & 0x7f7f7f7f7f7f7f7f;
    u = u + 0x0101010101010101;
    v = u & (~v);
    v = v & 0x8080808080808080;

    v = v | (v << 7);
    v = v | (v << 14);
    v = (v >> 56) | (v >> 28);
    result[ i>>3 ] = v;

The resulting code runs about four times faster than the original.

Concluding remarks

So that ends this brief series on bit manipulation, I hope you've found it interesting, if you want to investigate this further there are plenty of resources on the web, but it would be hard to skip mentioning the book "The Hacker's Delight", which is a great read on this domain.

There's a couple of concluding thoughts. First of all performance comes from doing operations on multiple items of data in the same instruction. This should sound familiar as "SIMD", so a processor might often have vector instructions that already get the benefits of single instruction, multiple data, and single SIMD instruction might replace several integer operations in the above codes. The other place the performance comes from is eliminating branch instructions - particularly the unpredictable ones, again vector instructions might offer a similar benefit.

Monday Feb 02, 2015

Bit manipulation: finding a range of values

We previously looked at finding zero values in an array. A similar problem is to find a value larger than some target. The vanilla code for this is pretty simple:

#include "timing.h"

int range(char * array, unsigned int length, unsigned char target)
  for (unsigned int i=0; i<length; i++)
    if (array[i]>target) { return i; }
  return -1;

It's possible to recode this to use bit operations, but there is a small complication. We need two versions of the routine depending on whether the target value is >127 or not. Let's start with the target greater than 127. There are two rules to finding bytes greater than this target:

  • The upper bit is set in the target value, this means that the upper bit must also be set in the bytes we examine. So we can AND the input value with 0x80, and this must be 0x80.
  • We want a bit more precision than testing the upper bit. We need to know that the value is greater than the target value. So if we clear the upper bit we get a number between 0 and 127. This is equivalent to subtracting 128 off all the bytes that have a value greater than 127. So instead of doing a comparison of is 132 greater than 192 we can do an equivalent check of is (132-128) greater than (192-128), or is 4 greater than 64? However, we want bytes where this is true to end up with their upper bit set. So we can do an ADD operation where we add sufficient to each byte to cause the result to be greater than 128 for the bytes with a value greater than the target. The operation for this is ( byte & 0x7f ) + (255-target).

The second condition is hard to understand, so consider an example where we are searching for values greater than 192. We have an input of 132. So the first of the two conditions produces 132 & 0x80 = 0x80. For the second condition we want to do (132 & 0x7f) + (255-192) = 4+63 = 68 so the second condition does not produce a value with the upper bit set. Trying again with an input of 193 we get 65 + 63 = 128 so the upper bit is set, and we get a result of 0x80 indicating that the byte is selected.

The full operation is (byte & ( (byte & 0x7f) + (255 - target) ) & 0x80).

If the target value is less than 128 we perform a similar set of operations. In this case if the upper bit is set then the byte is automatically greater than the target value. If the upper bit is not set we have to add sufficient on to cause the upper bit to be set by any value that meets the criteria.

The operation looks like (byte | ( (byte & 0x7f) + (127 - target) ) & 0x80).

Putting all this together we get the following code:

int range2( unsigned char* array, unsigned int length, unsigned char target )
  unsigned int i = 0;
  // Handle misalignment
  while ( (length > 0) && ( (unsigned long long) & array[i] & 7) )
    if ( array[i] > target ) { return i; }
  // Optimised code
  unsigned long long * p = (unsigned long long*) &array[i];
  if (target < 128)
    unsigned long long v8 = 127 - target;
    v8 = v8 | (v8 << 8);
    v8 = v8 | (v8 << 16);
    v8 = v8 | (v8 << 32);

    while (length > 8) 
      unsigned long long v = *p;
      unsigned long long u;
      u = v & 0x8080808080808080; // upper bit
      v = v & 0x7f7f7f7f7f7f7f7f; // lower bits
      v = v + v8;
      unsigned long long r = (v | u) & 0x8080808080808080;
      if (r) { break; }
    unsigned long long v8 = 255 - target;
    v8 = v8 | (v8 << 8);
    v8 = v8 | (v8 << 16);
    v8 = v8 | (v8 << 32);
    while (length > 8)
      unsigned long long v = *p;
      unsigned long long u;
      u = v & 0x8080808080808080; // upper bit
      v = v & 0x7f7f7f7f7f7f7f7f; // lower bits
      v = v + v8;
      unsigned long long r = v & u;
      if (r) { break; }

  // Handle trailing values
  while (length > 0)
    if (array[i] > target) { return i; }
  return -1;

The resulting code runs about 4x faster than the original version.

Wednesday Jan 28, 2015

Inline functions in C

Functions declared as inline are slightly more complex than might be expected. Douglas Walls has provided a chapter-and-verse write up. But the issue bears further explanation.

When a function is declared as inline it's a hint to the compiler that the function could be inlined. It is not a command to the compiler that the function must be inlined. If the compiler chooses not to inline the function, then the function will be left with a function call that needs to be resolved, and at link time it will be necessary for a function definition to be provided. Consider this example:

#include <stdio.h>

inline void foo() 
  printf(" In foo\n"); 

void main()

The code provides an inline definition of foo(), so if the compiler chooses to inline the function it will use this definition. However, if it chooses not to inline the function, you will get a link time error when the linker is unable to find a suitable definition for the symbol foo:

$ cc -o in in.c
Undefined                       first referenced
 symbol                             in file
foo                                 in.o
ld: fatal: symbol referencing errors. No output written to in

This can be worked around by adding either "static" or "extern" to the definition of the inline function.

If the function is declared to be a static inline then, as before the compiler may choose to inline the function. In addition the compiler will emit a locally scoped version of the function in the object file. There can be one static version per object file, so you may end up with multiple definitions of the same function, so this can be very space inefficient. Since all the functions are locally scoped, there is are no multiple definitions.

Another approach is to declare the function as extern inline. In this case the compiler may generate inline code, and will also generate a global instance of the function. Although multiple global instances of the function might be generated in all the object files, only one will be remain in the executable after linking. So declaring functions as extern inline is more space efficient.

This behaviour is defined by the standard. However, gcc takes a different approach, which is to treat inline functions by generating a global function and potentially inlining the code. Unfortunately this can cause multiply-defined symbol errors at link time, where the same extern inline function is declared in multiple files. For example, in the following code both in.c and in2.c include in.h which contains the definition of extern inline foo()....

$ gcc -o in in.c in2.c
ld: fatal: symbol 'foo' is multiply-defined:

The gcc behaviour for functions declared as extern inline is also different. It does not emit an external definition for these functions, leading to unresolved symbol errors at link time.

For gcc, it is best to either declare the functions as extern inline and, in additional module, provide a global definition of the function, or to declare the functions as static inline and live with the multiple local symbol definitions that this produces.

So for convenience it is tempting to use static inline for all compilers. This is a good work around (ignoring the issue of duplicate local copies of functions), except for an issue around unused code.

The keyword static tells the compiler to emit a locally-scoped version of the function. Solaris Studio emits that function even if the function does not get called within the module. If that function calls a non-existent function, then you may get a link time error. Suppose you have the following code:

void error_message();

static inline unused() 

void main()

Compiling this we get the following error message:

$ cc -O i.c
"i.c", line 3: warning: no explicit type given
Undefined                       first referenced
 symbol                             in file
error_message                       i.o

Even though the function call exists in code that is not used, there is a link error for the undefined function error_message(). The same error would occur if extern inline was used as this would cause a global version of the function to be emitted. The problem would not occur if the function were just declared as inline because in this case the compiler would not emit either a global or local version of the function. The same code compiles with gcc because the unused function is not generated.

So to summarise the options:

  • Declare everything static inline, and ensure that there are no undefined functions, and that there are no functions that call undefined functions.
  • Declare everything inline for Studio and extern inline for gcc. Then provide a global version of the function in a separate file.

Improving performance through bit manipulation: clear last set bit

Bit manipulation is one of those fun areas where you can get a performance gain from recoding a routine to use logical or arithmetic instructions rather than a more straight-forward code.

Of course, in doing this you need to avoid the pit fall of premature optimisation - where you needlessly make the code more obscure with no benefit, or a benefit that disappears as soon as you run your code on a different machine. So with that caveat in mind, let's take a look at a simple example.

Clear last set bit

This is a great starting point because it nicely demonstrates how we can sometimes replace a fair chunk of code with a much simpler set of instructions. Of course, the algorithm that uses fewer instructions is harder to understand, but in some situations the performance gain is worth it.

We'll start off with some classic code to solve the problem. The reason for this is two-fold. First of all we want to clearly understand the problem we're solving. Secondly, we want a reference version that we can test against to ensure that our fiendishly clever code is actually correct. So here's our starting point:

unsigned long long clearlastbit( unsigned long long value )
  int bit=1;
  if ( value== 0 ) { return 0; }
  while ( !(value & bit) )
    bit = bit << 1;
  value = value ^ bit;
  return value;

But before we start trying to improve it we need a timing harness to find out how fast it runs. The following harness uses the Solaris call gethrtime() to return a timestamp in nanoseconds.

#include <stdio.h>
#include <sys/time.h>

static double s_time;

void starttime()
  s_time = 1.0 * gethrtime();

void endtime(unsigned long long its)
  double e_time = 1.0 * gethrtime();
  printf( "Time per iteration %5.2f ns\n", (e_time-s_time) / (1.0*its) );
  s_time = 1.0 * gethrtime();

The next thing we need is a workload to test the current implementation. The workload iterates over a range of numbers and repeatedly calls clearlastbit() until all the bits in the current number have been cleared.

#define COUNT 1000000
void main()
  for (unsigned long long i = 0; i < COUNT; i++ )
    unsigned long long value = i;
    while (value) { value = clearlastbit(value); }

Big O notation

So let's take a break at this point to discuss big O notation. If we look at the code for clearlastbit() we can see that it contains a loop. We'll iterate around the loop once for each bit in the input value, so for a N bit number we might iterate up to N times. We say that this computation is "order N", meaning that the cost the calculation is somehow proportional to the number of bits in the input number. This is written as O(N).

The order N description is useful because it gives us some idea of the cost of calling the routine. From it we know that the routine will typically take twice as long if coded for 8 byte inputs than if we used 4 byte inputs. Order N is not too bad as costs go, the ones to look out for are order N squared, or cubed etc. For these higher orders the run time to complete a computation can become huge for even comparatively small values of N.

If we look at the test harness, we are iterating over the function COUNT times, so effectively the entire program is O(COUNT*N), and we're exploiting the fact that this is effectively an O(N^2) cost to provide a workload that has a reasonable duration.

So let's return to the problem of clearing the last set bit. One obvious optimisation would be to record the last bit that was cleared, and then start the next iteration of the loop from that point. This is potentially a nice gain, but does not fundamentally change the algorithm. A better approach is to take advantage of bit manipulation so that we can avoid the loop altogether.

unsigned long long clearlastbit2( unsigned long long value )
  return (value & (value-1) );

Ok, if you look at this code it is not immediately apparent what it does - most people would at first sight say "How can that possibly do anything useful?". The easiest way to understand it is to take an example. Suppose we pass the value ten into this function. In binary ten is encoded as 1010b. The first operation is the subtract operation which produces the result of nine, which is encoded as 1001b. We then take the AND of these two to get the result of 1000b or eight. We've cleared the last set bit because the subtract either removed the one bit (if it was set) or broke down the next largest set bit. The AND operation just keeps the bits to the left of the last set bit.

What is interesting about this snippet of code is that it is just three instructions. There's no loop and no branches - so most processors can execute this code very quickly. To demonstrate how much faster this code is, we need a test harness. The test harness should have two parts to it. The first part needs to validate that the new code produces the same result as the existing code. The second part needs to time the old and new code.

#define COUNT 1000000
void main()
  // Correctness test
  for (unsigned long long i = 0; i<COUNT; i++ )
    unsigned long long value = i;
    while (value) 
       unsigned long long v2 = value;
       value = clearlastbit(value);
       if (value != clearlastbit2(v2)) 
         printf(" Mismatch input %llx: %llx!= %llx\n", v2, value, clearlastbit2(v2)); 
  // Performance test
  for (unsigned long long i = 0; i<COUNT; i++ )
    unsigned long long value = i;
    while (value) { value = clearlastbit(value); }

  for (unsigned long long i = 0; i<COUNT; i++ )
    unsigned long long value = i;
    while (value) { value = clearlastbit2(value); }

The final result is that the bit manipulation version of this code is about 3x faster than the original code - on this workload. Of course one of the interesting things is that the performance does depend on the input values. For example, if there are no set bits, then both codes will run in about the same amount of time.

Tuesday Jan 06, 2015

The Performance Analyzer Overview screen

A while back I promised a more complete article about the Performance Analyzer Overview screen introduced in Studio 12.4. Well, here it is!

Just to recap, the Overview screen displays a summary of the contents of the experiment. This enables you to pick the appropriate metrics to display, so quickly allows you to find where the time is being spent, and then to use the rest of the tool to drill down into what is taking the time.

Tuesday Apr 29, 2014

Unsigned integers considered annoying

Let's talk about unsigned integers. These can be tricky because they wrap-around from big to small. Signed integers wrap-around from positive to negative. Let's look at an example. Suppose I want to do something for all iterations of a loop except for the last OFFSET of them. I could write something like:

  if (i < length - OFFSET) {}

If I assume OFFSET is 8 then for length 10, I'll do something for the first 2 iterations. The problem occurs when the length is less than OFFSET. If length is 2, then I'd expect not to do anything for any of the iterations. For a signed integer 2 minus 8 is -6 which is less than i, so I don't do anything. For an unsigned integer 2 minus 8 is 0xFFFFFFFA which is still greater than i. Hence we'll continue to do whatever it is we shouldn't be doing in this instance.

So the obvious fix for this is that for unsigned integers we do:

  if (i + OFFSET < length) {}

This works over the range that we might expect it to work. Of course we have a problem with signed integers if length happens to be close to INT_MAX, at this point adding OFFSET to a large value of i may cause it to overflow and become a large negative number - which will continue to be less than length.

With unsigned ints we encounter this same problem at UINT_MAX where adding OFFSET to i could generate a small value, which is less than the boundary.

So in these cases we might want to write:

  if (i < length - OFFSET) {}


So basically to cover all the situations we might want to write something like:

  if ( (length > OFFSET) && (i < length - OFFSET) ) {}

If this looks rather complex, then it's important to realise that we're handling a range check - and a range has upper and lower bounds. For signed integers zero - OFFSET is representable, so we can write:

  if (i < length - OFFSET) {}

without worrying about wrap-around. However for unsigned integers we need to define both the left and right ends of the range. Hence the more complex expression.

Thursday Apr 03, 2014

Discovering the Code Analyzer

We're doing something different with the Studio 12.4 Beta programme. We're also putting together some material about the compiler and features: videos, whitepapers, etc.

One of the first videos is now officially available. You might have seen the preproduction "leak" if you happen to follow Studio on either facebook or twitter.

This first video is an interview with Raj Prakash, the project lead for the Code Analyzer.

The Code Analyzer is our suite for checking the correctness of code. Something that you would run before you deliver an application to customers.

Monday Mar 31, 2014

Socialising Solaris Studio

I just figured that I'd talk about studio's social media presence.

First off, we have our own forums. One for the compilers and one for the tools. This is a good place to post comments and questions; posting here will get our attention.

We also have a presence on Facebook and Twitter.

Moving to the broader Oracle community, these pages list social media presence for a number of products.

Looking at Oracle blogs, the first stop probably has to be the entertaining The OTN Garage. It's also probably useful to browse the blogs by keywords, for example here's posts tagged with Solaris.

Tuesday Aug 27, 2013

My Oracle Open World and JavaOne schedule

I've got my schedule for Oracle Open World and JavaOne:

Note that on Thursday I have about 30 minutes between my two talks, so expect me to rush out of the database talk in order to get to the Java talk.

Thursday Mar 14, 2013

The pains of preprocessing

Ok, so I've encountered this twice in 24 hours. So it's probably worth talking about it.

The preprocessor does a simple text substitution as it works its way through your source files. Sometimes this has "unanticipated" side-effects. When this happens, you'll normally get a "hey, this makes no sense at all" error from the compiler. Here's an example:

$ more c.c
#include <ucontext.h>
#include <stdio.h>

int main()
  int FS;

$ CC c.c
$ CC c.c
"c.c", line 6: Error: Badly formed expression.
"c.c", line 7: Error: The left operand must be an lvalue.
2 Error(s) detected.

A similar thing happens with g++:

$  /pkg/gnu/bin/g++ c.c
c.c: In function 'int main()':
c.c:6:7: error: expected unqualified-id before numeric constant
c.c:7:6: error: lvalue required as left operand of assignment

The Studio C compiler gives a bit more of a clue what is going on. But it's not something you can rely on:

$ cc c.c
"c.c", line 6: syntax error before or at: 1
"c.c", line 7: left operand must be modifiable lvalue: op "="

As you can guess the issue is that FS gets substituted. We can find out what happens by examining the preprocessed source:

$ CC -P c.c
$ tail c.i
int main ( )
int 1 ;
1 = 0 ;
printf ( "FS=%i" , 1 ) ;

You can confirm this using -xdumpmacros to dump out the macros as they are defined. You can combine this with -H to see which header files are included:

$ CC -xdumpmacros c.c 2>&1 |grep FS
#define _SYS_ISA_DEFS_H
#define _FILE_OFFSET_BITS 32
#define REG_FSBASE 26
#define REG_FS 22
#define FS 1

If you're using gcc you should use the -E option to get preprocessed source, and the -dD option to get definitions of macros and the include files.

Thursday Mar 22, 2012

Tech day at the Santa Clara campus

The next OTN Sys Admin day is at the Santa Clara campus on 10th April. It has a half hour talk on Studio. More information at the OTN Garage.

Friday Feb 03, 2012

Using prtpicl to get cache sizes

If you are on a SPARC system you can get cache size information using the command fpversion, which is provided with Studio:

$ fpversion
 A SPARC-based CPU is available.
 Kernel says main memory's clock rate is 1012.0 MHz.

 Sun-4 floating-point controller version 0 found.
 An UltraSPARC chip is available.

 Use "-xtarget=sparc64vii -xcache=64/64/2:5120/256/10" code-generation option.

The cache parameters are output exactly as you would want to pass them into the compiler - for each cache it describes the size in KB, the line size in bytes, and the associativity.

fpversion doesn't exist on x86 systems. The next best thing is to use prtpicl to output system configuration information, and inspect that output for cache size. Here's the cache output for the same SPARC system using prtpicl.

$ prtpicl -v |grep cache
              :l1-icache-size    0x10000
              :l1-icache-line-size       0x40
              :l1-icache-associativity   0x2
              :l1-dcache-size    0x10000
              :l1-dcache-line-size       0x40
              :l1-dcache-associativity   0x2
              :l2-cache-size     0x500000
              :l2-cache-line-size        0x100
              :l2-cache-associativity    0xa

Friday Jan 13, 2012

C++ and inline templates

A while back I wrote an article on using inline templates. It's a bit of a niche article as I would generally advise people to write in C/C++, and tune the compiler flags and source code until the compiler generates the code that they want to see.

However, one thing that I didn't mention in the article, it's implied but not stated, is that inline templates are defined as C functions. When used from C++ they need to be declared as extern "C", otherwise you get linker errors. Here's an example template:

.inline nothing

And here's some code that calls it:

void nothing();

int main()

The code works when compiled as C, but not as C++:

$ cc i.c i.il
$ ./a.out
$ CC i.c i.il
Undefined                       first referenced
 symbol                             in file
void nothing()                   i.o
ld: fatal: Symbol referencing errors. No output written to a.out

To fix this, and make the code compilable with both C and C++ we use the __cplusplus feature test macro and conditionally include extern "C". Here's the modified source:

#ifdef __cplusplus
  extern "C"
    void nothing();
#ifdef __cplusplus

int main()

Wednesday Dec 14, 2011

Oracle Solaris Studio 12.3

Oracle Solaris Studio 12.3 was released today. You can download it here.

There's a bundle of exciting stuff that goes into every new release. The headlines are probably the introduction of the Code Analyzer tool which does dynamic and static error reporting on an application, and the ablity of the IDE to be run on a remote system while the builds are done on the host.

I have a couple of other favourite areas of change. First of all we've got spot running on a bunch of recent processors - in particular the SPARC T4 (I'll write more about this later). Secondly, the filtering in the Performance Analyzer has been pushed to the foreground. Let's discuss filtering now.

Filtering is one of those technologies that is very powerful, but has been quite hard to use in previous releases. The change in this release has been that the filters have been placed on the right-click menu. Here's an example:

Adding and removing filters is now just a matter of right clicking. This allows you to rapidly drill down on the profile data. For example filtering out activity by processor, call stack, and so on.

Wednesday Nov 02, 2011

Welcome to the (System) Developer's Edge

The Developer's Edge went out of print a while back. This was obviously frustrating, not just for me, but for the folks who contacted me asking what happened. Well, I'm thrilled to be able to announce that it's available as a pdf download.

This is essentially the same book as was previously available. I've not updated the links back to the original articles. It would have been problematic, in some instances the original articles no longer exist. There are only two significant changes, the first is the branding has been changed (there's no cover art, which keeps the download small). The second is the title of the book has been modified to include the word "system" to indicate that its focused towards the hardware end of the stack.

I hope you enjoy the System Developer's Edge.

Friday Oct 21, 2011


SPARC and x86 processors have different endianness. SPARC is big-endian and x86 is little-endian. Big-endian means that numbers are stored with the most significant data earlier in memory. Conversely little-endian means that numbers are stored with the least significant data earlier in memory.

Think of big endian as writing numbers as we would normally do. For example one thousand, one hundred and twenty would be written as 1120 using a big-endian format. However, writing as little endian it would be 0211 - the least significant digits would be recorded first.

For machines, this relates to which bytes are stored first. To make data portable between machines, a format needs to be agreed. For example in networking, data is defined as being big-endian. So to handle network packets, little-endian machines need to convert the data before using it.

Converting the bytes is a trivial matter, but it has some performance pitfalls. Let's start with a simple way of doing the conversion.

template <class T>
T swapslow(T in)
  T out;
  char * pcin = (char*)∈
  char * pcout = (char*)&out;

  for (int i=0; i<sizeof(T); i++)
    pcout[i] = pcin[sizeof(T)-i];
  return out;

The code uses templates to generalise it to different sizes of integers. But the following observations hold even if you use a C version for a particular size of input.

First thing to look at is instruction count. Assume I'm dealing with ints. I store the input to memory, then I access the input one byte at a time, storing each byte to a new location in memory, before finally loading the result. So for an int, I've got 10 memory operations.

Memory operations can be costly. Processors may be limited to only issuing one per cycle. In comparison most processors can issue two or more logical or integer arithmetic instructions per cycle. Loads are also costly as they have to access the cache, which takes a few cycles.

The other issue is more subtle, and I've discussed it in the past. There are RAW issues in this code. I'm storing an int, but loading it as four bytes. Then I'm storing four bytes, and loading them as an int.

A RAW hazard is a read-after-write hazard. The processor sees data being stored, but cannot convert that stored data into the format that the subsequent load requires. Hence the load has to wait until the result of the store reaches the cache before the load can complete. This can be multiple cycles of wait.

With endianness conversion, the data is already in the registers, so we can use logical operations to perform the conversion. This approach is shown in the next code snippet.

template <class T>
T swap(T in)
  T out=0;
  for (int i=0; i<sizeof(T); i++)
  return out;

In this case, we avoid the stores and loads, but instead we perform four logical operations per byte. This is higher cost than the load and store per byte. However, we can usually do more logical operations per cycle and the operations normally take a single cycle to complete. Overall, this is probably slightly faster than loads and stores.

However, you will usually see a greater performance gain from avoiding the RAW hazards. Obviously RAW hazards are hardware dependent - some processors may be engineered to avoid them. In which case you will only see a problem on some particular hardware. Which means that your application will run well on one machine, but poorly on another.

Tuesday Aug 09, 2011

Standards and headers

Every so often I encounter, or hear about, a problem with function definitions when the standard header files are included. Most often its mmap, but sometimes it's something else. Every time I think that I should write something up. Well, it's finally happened, a short paper on how to write portable code using the standard headers.

Monday Aug 01, 2011

Oracle Solaris Studio 12.3 Beta Programme

Last week, we started the beta programme for Oracle Solaris Studio 12.3. You can participate by downloading the software and reporting any issues.

As with any release, there's a lot of incremental improvements wherever we find opportunities, and there's a couple of new features. The two most interesting new features are:

    The Code Analyzer which reports possible errors in your application, both dynamic (ie memory access errors), and static. The static error detection is the newest feature, this goes beyond the compile time warnings or lint messages, and does much more detailed compile-time analysis of your code.
  • Remote development on Windows. I'm yet to try out this feature, but the IDE has the ability to run remotely on a Windows box seamlessly compiling and running on a remote server. In fact the improvements in the IDE are well worth a look.

Some of the Studio team are giving a webcast on Thursday 4th August at 9am PDT.

Wednesday May 18, 2011

Profiling running applications

Sometimes you want to profile an application, but you either want to profile it after it has started running, or you want to profile it for part of a run. There are a couple of approaches that enable you to do this.

If you want to profile a running application, then there is the option (-P <pid>) for collect to attach to a PID:

$ collect -P <pid>

Behind the scenes this generates a script and passes the script to dbx, which attaches to the process, starts profiling, and then stops profiling after about 5 minutes. If your application is sensitive to being stopped for dbx to attach, then this is not the best way to go. The alternative approach is to start the application under collect, then collect the profile over the period of interest.

The flag -y <signal> will run the application under collect, but collect will not gather any data until profiling is enabled by sending the selected signal to the application. Here's an example of doing this:

First of all we need an application that runs for a bit of time. Since the compiler doesn't optimise out floating point operations unless the flag -fsimple is used, we can quickly write an app that spends a long time doing nothing:

$ more slow.c
int main()
  double d=0.0;
  for (int i=0;i<10000000000; i++) {d+=d;}

$ cc -g slow.c

The next step is to run the application under collect with the option -y SIGUSR1 to indicate that collect should not start collecting data until it receives the signal USR1.

$ collect -y SIGUSR1 ./a.out &
[1] 1187
Creating experiment database test.1.er ...

If we look at the generated experiment we can see that it exists, but it contains no data.

$ er_print -func test.1.er
Functions sorted by metric: Exclusive User CPU Time

Excl.     Incl.      Name
User CPU  User CPU
 sec.      sec.
0.        0.         

To start gathering data we send SIGUSR1 to the application, sending the signal again stops data collection. Sending the signal twice we can collect two seconds of data:

$ kill -SIGUSR1 1187;sleep 2;kill -SIGUSR1 1187
$ er_print -func test.1.er
Functions sorted by metric: Exclusive User CPU Time

Excl.     Incl.      Name
User CPU  User CPU
 sec.      sec.
2.001     2.001      
2.001     2.001      main
0.        2.001      _start

Thursday Apr 28, 2011

Exploring Performance Analyzer experiments

I was recently profiling a script to see where the time went, and I ended up wanting to extract the profiles for just a single component. The structure of an analyzer experiment is that there's a single root directory (test.1.er) and inside that there's a single level of subdirectories representing all the child processes. Each subdirectory is a profile of a single process - these can all be loaded as individual experiments. Inside every experiment directory there is a log.xml file. This file contains a summary of what the experiment contains.

The name of the executable that was run is held on an xml "process" line. So the following script can extract a list of all the profiles of a particular application.

$ grep myapp `find test.1.er -name 'log.xml'`|grep process | sed 's/\\:.\*//' > myapp_profiles

Once we have a list of every time my application was run, I can now extract the times from that list, and sort them using the following line:

$ grep exit `cat <myapp_files`|sed 's/.\*tstamp=\\"//'|sed 's/\\".\*//'|sort -n 

Once I have the list of times I can use then locate an experiment with a particular runtime - it's probably going to be the longest runtime:

$ grep exit `cat <myapp_files`|grep 75.9 

Monday Apr 25, 2011

Using pragma opt

The Studio compiler has the ability to control the optimisation level that is applied to particular functions in an application. This can be useful if the functions are designed to work at a specific optimisation level, or if the application fails at a particular optimisation level, and you need to figure out where the problem lies.

The optimisation levels are controlled through pragma opt. The following steps need to be followed to use the pragma:

  • The directive needs to be inserted into the source file. The format of the directive is #pragma opt /level/ (/function/). This needs to be inserted into the code before the start of the function definition, but after the function header.
  • The code needs to be compiled with the flag -xmaxopt=level. This sets the maximum optimisation level for all functions in the file - including those tagged with #pragma opt.

We can see this in action using the following code snippet. This contains two identical functions, both return the square of a global variable. However, we are using #pragma opt to control the optimisation level of the function f().

int f();
int g();

#pragma opt 2 (f)

int d;

int f()
  return d\*d;

int g()
  return d\*d;

The code is compiled with the flag -xmaxopt=5, this specifies the maximum optimisation level that can be applied to any functions in the file.

$ cc -O -xmaxopt=5 -S opt.c

If we compare the disassembly for the functions f() and g(), we can see that g() is more optimal as it does not reload the global data.

/\* 000000          0 \*/         sethi   %hi(d),%o5

!   10                !  return d\*d;

/\* 0x0004         10 \*/         ldsw    [%o5+%lo(d)],%o4 ! volatile    // First load of d
/\* 0x0008            \*/         ldsw    [%o5+%lo(d)],%o3 ! volatile    // Second load of d
/\* 0x000c            \*/         retl    ! Result =  %o0
/\* 0x0010            \*/         mulx    %o4,%o3,%o0

/\* 000000         14 \*/         sethi   %hi(d),%o5
/\* 0x0004            \*/         ld      [%o5+%lo(d)],%o4               // Single load of d

!   15                !  return d\*d;

/\* 0x0008         15 \*/         sra     %o4,0,%o3
/\* 0x000c            \*/         retl    ! Result =  %o0
/\* 0x0010            \*/         mulx    %o3,%o3,%o0

Friday Apr 01, 2011

Profiling scripts

One feature that crept into the Oracle Solaris Studio 12.2 release was the ability for the performance analyzer to follow scripts. It is necessary to set the environment variable SP_COLLECTOR_SKIP_CHECKEXEC to use this feature - as shown below.

bash-3.00$ file `which which`
/bin/which:     executable /usr/bin/csh script
bash-3.00$ collect which
Target `which' is not a valid ELF executable
bash-3.00$ collect which
Creating experiment database test.1.er ...

Thursday Jan 27, 2011

Don't initialise local strings

Consider the following code:

void s(int i)
  char string[2048]="";
  sprinf(string,"Value = %i",i);
  printf("String = %s\\n",string);

The C standards require that if any elements of the character array string are initialised, then all of them should be. We can demonstrate this by compiling with gcc:

$ gcc -O -S f.c
$ more f.s
        .file   "f.c"
        .type   s, #function
        .proc   020
        save    %sp, -2160, %sp
        stx     %g0, [%fp-2064]
        add     %fp, -2056, %o0
        mov     0, %o1
        call    memset, 0
        mov    2040, %o2

You can see that explicitly initialising string caused all elements of string to be initialised with a call to memset(). Removing the explicit initialisation of string (the ="") avoids the call to memset().

Wednesday Dec 01, 2010


So the spot user's guide has been added to the Solaris Studio 12.2 documentation. There's also another collection of older articles.


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
Free Download


« July 2016
The Developer's Edge
Solaris Application Programming
OpenSPARC Book
Multicore Application Programming