Thursday May 28, 2015

Where does misaligned data come from?

A good question about data (mis)alignment is "Where did it come from?". So here's a reasonably detailed answer to that...

If the compiler has generated the code for you and you've not done anything "weird" then the data should be correctly aligned. So most apps don't have misaligned data, and most of the time you (as a developer) don't have to worry about it. For example, if you allocate a local variable, or a global variable, then the compiler will correctly align it. If you cast a call to malloc() into a pointer to a structure, then that structure will be correctly aligned. And so on.... so if the compiler is doing all this correctly, when could it every be possible to have misaligned data. There's a bunch of situations.

But first let's quickly review the -xmemalign flag. What it actually tells the compiler to do is to assume a particular alignment (and trap behaviour) for variables where it is unsure what the alignment is. If a variable is aligned, then the compiler will generate code exploiting that fact. So the -xmemalign only really applies to dynamically allocated data accessed through pointers. So what does this apply - the following is not an exhaustive list:

  • Packed data structures. If a data structure has been declared as being packed, then the compiler will squash the members together to occupy the minimum space - so the alignments may be wrong. If a structure is not packed the compiler adds padding to ensure that members are correctly aligned.
  • Buffers. Suppose your application gets packets across the network. If the packet contains an integer there's no guarantee that the integer will be placed on a four byte boundary.
  • Pointers into bytes. Suppose you have a string of characters and you want load 4 of them into an int - perhaps you're doing some bit-twiddling. Then you have to take care to handle strings that don't start at 4 byte boundaries.

The take away from this should be that alignment is not something that most developers need worry about. Most code gets the correct alignment out of the box - that's why the example program is so contrived: misalignment is the result of a developer choice, decision, or requirement. It does sometimes come up in porting, and that's why it's important to be able to diagnose when and where it happens, but most folks can get by assuming that they'll never see it! :)

Tuesday May 26, 2015

Misaligned loads profiled (again)

A long time ago I described how misaligned loads appeared in profiles of 32-bit applications. Since then we've changed the level of detail presented in the Performance Analyzer. When I wrote the article the time spent on-cpu that wasn't User time was grouped as System time. We've now started showing more detail - and more detail is good.

Here's a similar bit of code:

#include <stdio.h>

static int i,j;
volatile double *d;

void main ()
  char a[10];
  for (i=0;i < j; i++) 

This code stores into a misaligned double, and that's all we need in order to generate misaligned traps and see how they are shown in the performance analyzer. Here's the hot loop:

Load Object: a.out

Inclusive       Inclusive       
User CPU Time   Trap CPU Time   Name
(sec.)          (sec.)          
1.131           0.510               [?]    10928:  inc         4, %i1
0.              0.                  [?]    1092c:  ldd         [%i2], %f2
0.811           0.380               [?]    10930:  faddd       %f2, %f0, %f4
0.              0.                  [?]    10934:  std         %f4, [%i2]
0.911           0.480               [?]    10938:  ldd         [%i2], %f2
1.121           0.370               [?]    1093c:  faddd       %f2, %f0, %f4
0.              0.                  [?]    10940:  std         %f4, [%i2]
0.761           0.410               [?]    10944:  ldd         [%i2], %f2
0.911           0.410               [?]    10948:  faddd       %f2, %f0, %f4
0.010           0.                  [?]    1094c:  std         %f4, [%i2]
0.941           0.450               [?]    10950:  ldd         [%i2], %f2
1.111           0.380               [?]    10954:  faddd       %f2, %f0, %f4
0.              0.                  [?]    10958:  cmp         %i1, %i5
0.              0.                  [?]    1095c:  ble,pt      %icc,0x10928
0.              0.                  [?]    10960:  std         %f4, [%i2]

So the first thing to notice is that we're now reporting Trap time rather than aggregating it into System time. This is useful because trap time is intrinsically different from system time, so it's worth displaying it differently. Fortunately the new overview screen highlights the trap time, so it's easy to recognise when to look for it.

Now, you should be familiar with the "previous instruction is to blame rule" for interpreting the output from the performance analyzer. Dealing with traps is no different, the time spent on the next instruction is due to the trap of the previous instruction. So the final load in the loop takes about 1.1s of user time and 0.38s of trap time.

Slight side track about the "blame the last instruction" rule. For misaligned accesses the problem instruction traps and its action is emulated. So the next instruction executed is the instruction following the misaligned access. That's why we see time attributed to the following instruction. However, there are situations where an instruction is retried after a trap, in those cases the next instruction is the instruction that caused the trap. Examples of this are TLB misses or save/restore instructions.

If we recompile the code as 64-bit and set -xmemalign=8i, then we get a different profile:

User CPU Time   Name
3.002           <Total>
2.882           __misalign_trap_handler
0.070           main
0.040           __do_misaligned_ldst_instr
0.010           getreg
0.              _start

For 64-bit code the misaligned operations are fixed in user-land. One (unexpected) advantage of this is that you can take a look at the routines that call the trap handler and identify exactly where the misaligned memory operations are:

0.480	main + 0x00000078
0.450	main + 0x00000054
0.410	main + 0x0000006C
0.380	main + 0x00000060
0.370	main + 0x00000088
0.310	main + 0x0000005C
0.270	main + 0x00000068
0.260	main + 0x00000074

This is really useful if there are a number of sites and your objective is to fix them in order of performance impact.

Wednesday May 13, 2015

Misaligned loads in 64-bit apps

A while back I wrote up how to use dtrace to identify misaligned loads in 32-bit apps. Here's a script to do the same for 64-bit apps:

#!/usr/sbin/dtrace -s


Run it as './script '

Tuesday Feb 17, 2015

Profiling the kernel

One of the incredibly useful features in Studio is the ability to profile the kernel. The tool to do this is er_kernel. It's based around dtrace, so you either need to run it with escalated privileges, or you need to edit /etc/user_attr to add something like:


The correct way to modify user_attr is with the command usermod:

usermod -K defaultpriv=basic,dtrace_user,dtrace_proc,dtrace_kernel <username>

There's two ways to run er_kernel. The default mode is to just profile the kernel:

$ er_kernel sleep 10
Creating experiment database (Process ID: 7399) ...
$ er_print -limit 10 -func
Functions sorted by metric: Exclusive Kernel CPU Time

Excl.     Incl.      Name
Kernel    Kernel
CPU sec.  CPU sec.
19.242    19.242     <Total>
14.869    14.869     <l_PID_7398>
 0.687     0.949     default_mutex_lock_delay
 0.263     0.263     mutex_enter
 0.202     0.202     <java_PID_248>
 0.162     0.162     gettick
 0.141     0.141     hv_ldc_tx_set_qtail

The we passed the command sleep 10 to er_kernel, this causes it to profile for 10 seconds. It might be better form to use the equivalent command line option -t 10.

In the profile we can see a couple of user processes together with some kernel activity. The other way to run er_kernel is to profile the kernel and user processes. We enable this mode with the command line option -F on:

$ er_kernel -F on sleep 10
Creating experiment database (Process ID: 7630) ...
$ er_print -limit 5 -func
Functions sorted by metric: Exclusive Total CPU Time

Excl.     Incl.     Excl.     Incl.      Name
Total     Total     Kernel    Kernel
CPU sec.  CPU sec.  CPU sec.  CPU sec.
15.384    15.384    16.333    16.333     <Total>
15.061    15.061     0.        0.        main
 0.061     0.061     0.        0.        ioctl
 0.051     0.141     0.        0.        dt_consume_cpu
 0.040     0.040     0.        0.        __nanosleep

In this case we can see all the userland activity as well as kernel activity. The -F option is very flexible, instead of just profiling everything, we can use -F =<regexp>syntax to specify either a PID or process name to profile:

$ er_kernel -F =7398

Tuesday Feb 03, 2015

Complete set of bit manipulation posts combined

My recent blog posts on bit manipulation are now available as an article up on the OTN community pages. If you want to read the individual posts they are:

Bit manipulation: Gathering bits

In the last post on bit manipulation we looked at how we could identify bytes that were greater than a particular target value, and stop when we discovered one. The resulting vector of bytes contained a zero byte for those which did not meet the criteria, and a byte containing 0x80 for those that did. Obviously we could express the result much more efficiently if we assigned a single bit for each result. The following is "lightly" optimised code for producing a bit vector indicating the position of zero bytes:

void zeros( unsigned char * array, int length, unsigned char * result )
  for (int i=0;i < length; i+=8)
    result[i>>3] = 
    ( (array[i+0]==0) << 7) +
    ( (array[i+1]==0) << 6) +
    ( (array[i+2]==0) << 5) +
    ( (array[i+3]==0) << 4) +
    ( (array[i+4]==0) << 3) +
    ( (array[i+5]==0) << 2) +
    ( (array[i+6]==0) << 1) +
    ( (array[i+7]==0) << 0);

The code is "lightly" optimised because it works on eight values at a time. This helps performance because the code can store results a byte at a time. An even less optimised version would split the index into a byte and bit offset and use that to update the result vector.

When we previously looked at finding zero bytes we used Mycroft's algorithm that determines whether a zero byte is present or not. It does not indicate where the zero byte is to be found. For this new problem we want to identify exactly which bytes contain zero. So we can come up with two rules that both need be true:

  • The inverted byte must have a set upper bit.
  • If we invert the byte and select the lower bits, adding one to these must set the upper bit.

Putting these into a logical operation we get (~byte & ( (~byte & 0x7f) + 1) & 0x80). For non-zero input bytes we get a result of zero, for zero input bytes we get a result of 0x80. Next we need to convert these into a bit vector.

If you recall the population count example from earlier, we used a set of operations to combine adjacent bits. In this case we want to do something similar, but instead of adding bits we want to shift them so that they end up in the right places. The code to perform the comparison and shift the results is:

void zeros2( unsigned long long* array, int length, unsigned char* result )
  for (int i=0; i<length; i+=8)
    unsigned long long v, u;
    v = array[ i>>3 ];

    u = ~v;
    u = u & 0x7f7f7f7f7f7f7f7f;
    u = u + 0x0101010101010101;
    v = u & (~v);
    v = v & 0x8080808080808080;

    v = v | (v << 7);
    v = v | (v << 14);
    v = (v >> 56) | (v >> 28);
    result[ i>>3 ] = v;

The resulting code runs about four times faster than the original.

Concluding remarks

So that ends this brief series on bit manipulation, I hope you've found it interesting, if you want to investigate this further there are plenty of resources on the web, but it would be hard to skip mentioning the book "The Hacker's Delight", which is a great read on this domain.

There's a couple of concluding thoughts. First of all performance comes from doing operations on multiple items of data in the same instruction. This should sound familiar as "SIMD", so a processor might often have vector instructions that already get the benefits of single instruction, multiple data, and single SIMD instruction might replace several integer operations in the above codes. The other place the performance comes from is eliminating branch instructions - particularly the unpredictable ones, again vector instructions might offer a similar benefit.

Monday Feb 02, 2015

Bit manipulation: finding a range of values

We previously looked at finding zero values in an array. A similar problem is to find a value larger than some target. The vanilla code for this is pretty simple:

#include "timing.h"

int range(char * array, unsigned int length, unsigned char target)
  for (unsigned int i=0; i<length; i++)
    if (array[i]>target) { return i; }
  return -1;

It's possible to recode this to use bit operations, but there is a small complication. We need two versions of the routine depending on whether the target value is >127 or not. Let's start with the target greater than 127. There are two rules to finding bytes greater than this target:

  • The upper bit is set in the target value, this means that the upper bit must also be set in the bytes we examine. So we can AND the input value with 0x80, and this must be 0x80.
  • We want a bit more precision than testing the upper bit. We need to know that the value is greater than the target value. So if we clear the upper bit we get a number between 0 and 127. This is equivalent to subtracting 128 off all the bytes that have a value greater than 127. So instead of doing a comparison of is 132 greater than 192 we can do an equivalent check of is (132-128) greater than (192-128), or is 4 greater than 64? However, we want bytes where this is true to end up with their upper bit set. So we can do an ADD operation where we add sufficient to each byte to cause the result to be greater than 128 for the bytes with a value greater than the target. The operation for this is ( byte & 0x7f ) + (255-target).

The second condition is hard to understand, so consider an example where we are searching for values greater than 192. We have an input of 132. So the first of the two conditions produces 132 & 0x80 = 0x80. For the second condition we want to do (132 & 0x7f) + (255-192) = 4+63 = 68 so the second condition does not produce a value with the upper bit set. Trying again with an input of 193 we get 65 + 63 = 128 so the upper bit is set, and we get a result of 0x80 indicating that the byte is selected.

The full operation is (byte & ( (byte & 0x7f) + (255 - target) ) & 0x80).

If the target value is less than 128 we perform a similar set of operations. In this case if the upper bit is set then the byte is automatically greater than the target value. If the upper bit is not set we have to add sufficient on to cause the upper bit to be set by any value that meets the criteria.

The operation looks like (byte | ( (byte & 0x7f) + (127 - target) ) & 0x80).

Putting all this together we get the following code:

int range2( unsigned char* array, unsigned int length, unsigned char target )
  unsigned int i = 0;
  // Handle misalignment
  while ( (length > 0) && ( (unsigned long long) & array[i] & 7) )
    if ( array[i] > target ) { return i; }
  // Optimised code
  unsigned long long * p = (unsigned long long*) &array[i];
  if (target < 128)
    unsigned long long v8 = 127 - target;
    v8 = v8 | (v8 << 8);
    v8 = v8 | (v8 << 16);
    v8 = v8 | (v8 << 32);

    while (length > 8) 
      unsigned long long v = *p;
      unsigned long long u;
      u = v & 0x8080808080808080; // upper bit
      v = v & 0x7f7f7f7f7f7f7f7f; // lower bits
      v = v + v8;
      unsigned long long r = (v | u) & 0x8080808080808080;
      if (r) { break; }
    unsigned long long v8 = 255 - target;
    v8 = v8 | (v8 << 8);
    v8 = v8 | (v8 << 16);
    v8 = v8 | (v8 << 32);
    while (length > 8)
      unsigned long long v = *p;
      unsigned long long u;
      u = v & 0x8080808080808080; // upper bit
      v = v & 0x7f7f7f7f7f7f7f7f; // lower bits
      v = v + v8;
      unsigned long long r = v & u;
      if (r) { break; }

  // Handle trailing values
  while (length > 0)
    if (array[i] > target) { return i; }
  return -1;

The resulting code runs about 4x faster than the original version.

Friday Jan 30, 2015

Finding zero values in an array

A common thing to want to do is to find zero values in an array. This is obviously necessary for string length. So we'll start out with a test harness and a simple implementation:

#include "timing.h"

unsigned int len(char* array)
  unsigned int length = 0;
  while( array[length] ) { length++; }
  return length;

#define COUNT 100000
void main()
  char array[ COUNT ];
  for (int i=1; i<COUNT; i++)
    array[i-1] = 'a';
    array[i] = 0;
    if ( i != len(array) ) { printf( "Error at %i\n", i ); }
  for (int i=1; i<COUNT; i++)
    array[i-1] = 'a';
    array[i] = 0;

A chap called Alan Mycroft came up with a very neat algorithm to simultaneously examine multiple bytes and determine whether there is a zero in them. His algorithm starts off with the idea that there are two conditions that need to be true if a byte contains the value zero. First of all the upper bit of the byte must be zero, this is true for zero and all values less than 128, so on its own it is not sufficient. The second characteristic is that if one is subtracted from the value, then the upper bit must be one. This is true for zero and all values greater than 128. Although both conditions are individually satisfied by multiple values, the only value that satisfies both conditions is zero.

The following code uses the Mycroft test for a string length implementation. The code contains a pre-loop to get to an eight byte aligned address.

unsigned int len2(char* array)
  unsigned int length = 0;
  // Handle misaligned data
  while ( ( (unsigned long long) & array[length] ) &7 )
    if ( array[length] == 0 ) { return length; }

  unsigned long long * p = (unsigned long long *) & array[length];
  unsigned long long v8, v7;
    v8 = *p;
    v7 = v8 - 0x0101010101010101;
    v7 = (v7 & ~v8) & 0x8080808080808080;
  while ( !v7 );
  length = (char*)p - array-8;
  while ( array[length] ) { length++; }
  return length;

The algorithm has one weak point. It does not always report exactly which byte is zero, just that there is a zero byte somewhere. Hence the final loop where we work out exactly which byte is zero.

It is a trivial extension to use this to search for a byte of any value. If we XOR the input vector with a vector of bytes containing the target value, then we get a zero byte where the target value occurs, and a non-zero byte everywhere else.

It is also easy to extend the code to search for other zero bit patterns. For example, if we want to find zero nibbles (ie 4 bit values), then we can change the constants to be 0x1111111111111111 and 0x8888888888888888.

Thursday Jan 29, 2015

Bit manipulation: Population Count

Population count is one of the more esoteric instructions. It's the operation to count the number of set bits in a register. It comes up with sufficient frequency that most processors have a hardware instruction to do it. However, for this example, we're going to look at coding it in software. First of all we'll write a baseline version of the code:

int popc(unsigned long long value)
  unsigned long long bit = 1;
  int popc = 0;
  while ( bit )
    if ( value & bit ) { popc++; }
    bit = bit << 1;
  return popc;

The above code examines every bit in the input and counts the number of set bits. The number of iterations is proportional to the number of bits in the register.

Most people will immediately recognise that we could make this a bit faster using the code we discussed previously that clears the last set bit, whist there are set bits keep clearing them, otherwise you're done. The advantage of this approach is that you only iterate once for every set bit in the value. So if there are no set bits, then you do not do any iterations.

int popc2( unsigned long long value )
  int popc = 0;
  while ( value )
    value = value & (value-1);
  return popc;

The next thing to do is to put together a test harness that confirms that the new code produces the same results as the old code, and also measures the performance of the two implementations.

#define COUNT 1000000
void main()
  // Correctness test
  for (unsigned long long i = 0; i<COUNT; i++ )
    if (popc( i + (i<<32) ) != popc2( i + (i<<32) ) )
      printf(" Mismatch popc2 input %llx: %u!= %u\n", 
               i+(i<<32), popc(i+(i<<32)), popc2(i+(i<<32))); 
  // Performance test
  for (unsigned long long i = 0; i<COUNT; i++ )
  for (unsigned long long i = 0; i<COUNT; i++ )

The new code is about twice as fast as the old code. However, the new code still contains a loop, and this can be a bit of a problem.

Branch mispredictions

The trouble with loops, and with branches in general, is that processors don't know the next instruction that will be executed after the branch until the branch has been reached, but the processor needs to have already fetched instruction after the branch well before this. The problem is nicely summarised by Holly in Red Dwarf:

"Look, I'm trying to navigate at faster than the speed of light, which means that before you see something, you've already passed through it."

So processors use branch prediction to guess whether a branch is taken or not. If the prediction is correct there is no break in the instruction stream, but if the prediction is wrong, then the processor needs to throw away all the incorrectly predicted instructions, and fetch the instructions from correct address. This is a significant cost, so ideally you don't want mispredicted branches, and the best way of ensuring that is to not have branches at all!

The following code is a branchless sequence for computing population count

unsigned int popc3(unsigned long long value)
  unsigned long long v2;
  v2     = value &t;< 1;
  v2    &= 0x5555555555555555;
  value &= 0x5555555555555555;
  value += v2;
  v2     = value << 2;
  v2    &= 0x3333333333333333;
  value &= 0x3333333333333333;
  value += v2;
  v2     = value << 4;
  v2    &= 0x0f0f0f0f0f0f0f0f;
  value &= 0x0f0f0f0f0f0f0f0f;
  value += v2;
  v2     = value << 8;
  v2    &= 0x00ff00ff00ff00ff;
  value &= 0x00ff00ff00ff00ff;
  value += v2;
  v2     = value << 16;
  v2    &= 0x0000ffff0000ffff;
  value &= 0x0000ffff0000ffff;
  value += v2;
  v2     = value << 32;
  value += v2;
  return (unsigned int) value;

This instruction sequence computes the population count by initially adding adjacent bits to get a two bit result of 0, 1, or 2. It then adds the adjacent pairs of bits to get a 4 bit result of between 0 and 4. Next it adds adjacent nibbles to get a byte result, then adds pairs of bytes to get shorts, then adds shorts to get a pair of ints, which it adds to get the final value. The code contains a fair amount of AND operations to mask out the bits that are not part of the result.

This bit manipulation version is about two times faster than the clear-last-bit-set version, making it about four times faster than the original code. However, it is worth noting that this is a fixed cost. The routine takes the same amount of time regardless of the input value. In contrast the clear -last-bit-set version will exit early if there are no set bits. Consequently the performance gain for the code will depend on both the input value and the cost of mispredicted branches.

Wednesday Jan 28, 2015

Tracking application resource use

One question you might ask is "how much memory is my application consuming?". Obviously you can use prstat (prstat -cp <pid> or prstat -cmLp <pid>) to examine the behaviour of a process. But how about programmatically finding that information.

OTN has just published an article where I demonstrate how to find out about the resource use of a process, and incidentally how to put that functionality into a library that reports resource use over time.

Inline functions in C

Functions declared as inline are slightly more complex than might be expected. Douglas Walls has provided a chapter-and-verse write up. But the issue bears further explanation.

When a function is declared as inline it's a hint to the compiler that the function could be inlined. It is not a command to the compiler that the function must be inlined. If the compiler chooses not to inline the function, then the function will be left with a function call that needs to be resolved, and at link time it will be necessary for a function definition to be provided. Consider this example:

#include <stdio.h>

inline void foo() 
  printf(" In foo\n"); 

void main()

The code provides an inline definition of foo(), so if the compiler chooses to inline the function it will use this definition. However, if it chooses not to inline the function, you will get a link time error when the linker is unable to find a suitable definition for the symbol foo:

$ cc -o in in.c
Undefined                       first referenced
 symbol                             in file
foo                                 in.o
ld: fatal: symbol referencing errors. No output written to in

This can be worked around by adding either "static" or "extern" to the definition of the inline function.

If the function is declared to be a static inline then, as before the compiler may choose to inline the function. In addition the compiler will emit a locally scoped version of the function in the object file. There can be one static version per object file, so you may end up with multiple definitions of the same function, so this can be very space inefficient. Since all the functions are locally scoped, there is are no multiple definitions.

Another approach is to declare the function as extern inline. In this case the compiler may generate inline code, and will also generate a global instance of the function. Although multiple global instances of the function might be generated in all the object files, only one will be remain in the executable after linking. So declaring functions as extern inline is more space efficient.

This behaviour is defined by the standard. However, gcc takes a different approach, which is to treat inline functions by generating a global function and potentially inlining the code. Unfortunately this can cause multiply-defined symbol errors at link time, where the same extern inline function is declared in multiple files. For example, in the following code both in.c and in2.c include in.h which contains the definition of extern inline foo()....

$ gcc -o in in.c in2.c
ld: fatal: symbol 'foo' is multiply-defined:

The gcc behaviour for functions declared as extern inline is also different. It does not emit an external definition for these functions, leading to unresolved symbol errors at link time.

For gcc, it is best to either declare the functions as extern inline and, in additional module, provide a global definition of the function, or to declare the functions as static inline and live with the multiple local symbol definitions that this produces.

So for convenience it is tempting to use static inline for all compilers. This is a good work around (ignoring the issue of duplicate local copies of functions), except for an issue around unused code.

The keyword static tells the compiler to emit a locally-scoped version of the function. Solaris Studio emits that function even if the function does not get called within the module. If that function calls a non-existent function, then you may get a link time error. Suppose you have the following code:

void error_message();

static inline unused() 

void main()

Compiling this we get the following error message:

$ cc -O i.c
"i.c", line 3: warning: no explicit type given
Undefined                       first referenced
 symbol                             in file
error_message                       i.o

Even though the function call exists in code that is not used, there is a link error for the undefined function error_message(). The same error would occur if extern inline was used as this would cause a global version of the function to be emitted. The problem would not occur if the function were just declared as inline because in this case the compiler would not emit either a global or local version of the function. The same code compiles with gcc because the unused function is not generated.

So to summarise the options:

  • Declare everything static inline, and ensure that there are no undefined functions, and that there are no functions that call undefined functions.
  • Declare everything inline for Studio and extern inline for gcc. Then provide a global version of the function in a separate file.

Improving performance through bit manipulation: clear last set bit

Bit manipulation is one of those fun areas where you can get a performance gain from recoding a routine to use logical or arithmetic instructions rather than a more straight-forward code.

Of course, in doing this you need to avoid the pit fall of premature optimisation - where you needlessly make the code more obscure with no benefit, or a benefit that disappears as soon as you run your code on a different machine. So with that caveat in mind, let's take a look at a simple example.

Clear last set bit

This is a great starting point because it nicely demonstrates how we can sometimes replace a fair chunk of code with a much simpler set of instructions. Of course, the algorithm that uses fewer instructions is harder to understand, but in some situations the performance gain is worth it.

We'll start off with some classic code to solve the problem. The reason for this is two-fold. First of all we want to clearly understand the problem we're solving. Secondly, we want a reference version that we can test against to ensure that our fiendishly clever code is actually correct. So here's our starting point:

unsigned long long clearlastbit( unsigned long long value )
  int bit=1;
  if ( value== 0 ) { return 0; }
  while ( !(value & bit) )
    bit = bit << 1;
  value = value ^ bit;
  return value;

But before we start trying to improve it we need a timing harness to find out how fast it runs. The following harness uses the Solaris call gethrtime() to return a timestamp in nanoseconds.

#include <stdio.h>
#include <sys/time.h>

static double s_time;

void starttime()
  s_time = 1.0 * gethrtime();

void endtime(unsigned long long its)
  double e_time = 1.0 * gethrtime();
  printf( "Time per iteration %5.2f ns\n", (e_time-s_time) / (1.0*its) );
  s_time = 1.0 * gethrtime();

The next thing we need is a workload to test the current implementation. The workload iterates over a range of numbers and repeatedly calls clearlastbit() until all the bits in the current number have been cleared.

#define COUNT 1000000
void main()
  for (unsigned long long i = 0; i < COUNT; i++ )
    unsigned long long value = i;
    while (value) { value = clearlastbit(value); }

Big O notation

So let's take a break at this point to discuss big O notation. If we look at the code for clearlastbit() we can see that it contains a loop. We'll iterate around the loop once for each bit in the input value, so for a N bit number we might iterate up to N times. We say that this computation is "order N", meaning that the cost the calculation is somehow proportional to the number of bits in the input number. This is written as O(N).

The order N description is useful because it gives us some idea of the cost of calling the routine. From it we know that the routine will typically take twice as long if coded for 8 byte inputs than if we used 4 byte inputs. Order N is not too bad as costs go, the ones to look out for are order N squared, or cubed etc. For these higher orders the run time to complete a computation can become huge for even comparatively small values of N.

If we look at the test harness, we are iterating over the function COUNT times, so effectively the entire program is O(COUNT*N), and we're exploiting the fact that this is effectively an O(N^2) cost to provide a workload that has a reasonable duration.

So let's return to the problem of clearing the last set bit. One obvious optimisation would be to record the last bit that was cleared, and then start the next iteration of the loop from that point. This is potentially a nice gain, but does not fundamentally change the algorithm. A better approach is to take advantage of bit manipulation so that we can avoid the loop altogether.

unsigned long long clearlastbit2( unsigned long long value )
  return (value & (value-1) );

Ok, if you look at this code it is not immediately apparent what it does - most people would at first sight say "How can that possibly do anything useful?". The easiest way to understand it is to take an example. Suppose we pass the value ten into this function. In binary ten is encoded as 1010b. The first operation is the subtract operation which produces the result of nine, which is encoded as 1001b. We then take the AND of these two to get the result of 1000b or eight. We've cleared the last set bit because the subtract either removed the one bit (if it was set) or broke down the next largest set bit. The AND operation just keeps the bits to the left of the last set bit.

What is interesting about this snippet of code is that it is just three instructions. There's no loop and no branches - so most processors can execute this code very quickly. To demonstrate how much faster this code is, we need a test harness. The test harness should have two parts to it. The first part needs to validate that the new code produces the same result as the existing code. The second part needs to time the old and new code.

#define COUNT 1000000
void main()
  // Correctness test
  for (unsigned long long i = 0; i<COUNT; i++ )
    unsigned long long value = i;
    while (value) 
       unsigned long long v2 = value;
       value = clearlastbit(value);
       if (value != clearlastbit2(v2)) 
         printf(" Mismatch input %llx: %llx!= %llx\n", v2, value, clearlastbit2(v2)); 
  // Performance test
  for (unsigned long long i = 0; i<COUNT; i++ )
    unsigned long long value = i;
    while (value) { value = clearlastbit(value); }

  for (unsigned long long i = 0; i<COUNT; i++ )
    unsigned long long value = i;
    while (value) { value = clearlastbit2(value); }

The final result is that the bit manipulation version of this code is about 3x faster than the original code - on this workload. Of course one of the interesting things is that the performance does depend on the input values. For example, if there are no set bits, then both codes will run in about the same amount of time.

Monday Jan 26, 2015

Hands on with Solaris Studio

Over at the OTN Garage, Rick has published details of the Solaris Studio hands-on lab.

This is a great chance to learn about the Code Analyzer and the Performance Analyzer. Just to quickly recap, the Code Analyzer is really a suite of tools that do checking on application errors. This includes static analysis - ie extensive compile time checking - and dynamic analysis - run time error detection. If you regularly read this blog, you'll need no introduction to the Performance Analyzer.

The dates for these are:

  • Americas - Feb 11th 9am to 12:30pm PT
  • EMEA - Feb 24th - 9am to 12:30pm GMT
  • APAC - March 4th - 9:30am IST to 1pm IST

Tuesday Jan 06, 2015

New articles about Solaris Studio

We've started posting new articles directly into the communities section of the Oracle website. If you're not familiar with this location, it's also where you can post questions on languages or tools.

With the change it should be easier to find articles relevant to developers, and it should be easy to comment on them. So hopefully this works out well. There's currently three articles listed on the content page. I've already posted about the article on the Performance Analyzer Overview screen, so I'll quickly highlight the other two:

  • In Studio 12.4 we've introducedfiner control over debug information. This allows you to reduce object file size by excluding debug info that you don't need. There's substantial control, but probably the easiest new option is -g1 which includes a minimal set debug info.
  • A change in Studio 12.4 is in the default way that C++ handles templates. The summary is that the compiler's default mode is inline with the way that other compilers work - you need to include template definitions in the source file being compiled. Previously the compiler would try to find the definitions in external files. This old behaviour could be confusing, so it's good that the default has changed. But it's possible that you may encounter code that was written with the expectation that the Studio compiler behaved in the old way, in this case you'll need to modify the source, or tell the compiler to revert to the older behaviour. Hopefully, most people won't even notice the change, but it's worth knowing the signs in case you encounter a problem.

The Performance Analyzer Overview screen

A while back I promised a more complete article about the Performance Analyzer Overview screen introduced in Studio 12.4. Well, here it is!

Just to recap, the Overview screen displays a summary of the contents of the experiment. This enables you to pick the appropriate metrics to display, so quickly allows you to find where the time is being spent, and then to use the rest of the tool to drill down into what is taking the time.

Wednesday Nov 19, 2014

Writing inline templates

Writing some inline templates today... I've written about doing this kind of stuff in the past here and, in more detail, here.

I happen to need to pass a bundle of parameters on to the routine. The best way of checking how the parameters will be passed is to get the compiler to provide some initial template. Here's an example routine:

int parameters (int p0, int * p1, int * p2, int* p3, int * p4, int * p5, int * p6, int p7)
  return p0 + *p1 + *p2 + *p3 + *p4 + ((*p5)<<2) + ((*p6)<<3) + p7*p7;

In the routine I've tried to handle some of the parameters differently. I know that the first parameters get passed in registers, and then the later ones get passed on the stack. By handling them differently I can work out which loads from the stack correspond to which variables. The disassembly looks like:

-bash-4.1$ cc -g -O parameters.c -c
-bash-4.1$ dis -F parameters parameters.o
disassembly for parameters.o

    parameters:             ca 02 60 00  ld        [%o1], %g5
    parameters+0x4:         c4 02 e0 00  ld        [%o3], %g2
    parameters+0x8:         c2 02 a0 00  ld        [%o2], %g1
    parameters+0xc:         c6 03 a0 60  ld        [%sp + 0x60], %g3  // load of p7
    parameters+0x10:        88 02 00 05  add       %o0, %g5, %g4
    parameters+0x14:        d0 03 60 00  ld        [%o5], %o0
    parameters+0x18:        ca 03 20 00  ld        [%o4], %g5
    parameters+0x1c:        92 00 80 01  add       %g2, %g1, %o1
    parameters+0x20:        87 38 e0 00  sra       %g3, 0x0, %g3
    parameters+0x24:        82 01 00 09  add       %g4, %o1, %g1
    parameters+0x28:        d2 03 a0 5c  ld        [%sp + 0x5c], %o1 // load of p6
    parameters+0x2c:        88 48 c0 03  mulx      %g3, %g3, %g4     // %g4 = %g3*%g3
    parameters+0x30:        97 2a 20 02  sll       %o0, 0x2, %o3
    parameters+0x34:        94 00 40 05  add       %g1, %g5, %o2
    parameters+0x38:        da 02 60 00  ld        [%o1], %o5       
    parameters+0x3c:        84 02 c0 0a  add       %o3, %o2, %g2
    parameters+0x40:        99 2b 60 03  sll       %o5, 0x3, %o4     // %o4 = %o5<<3
    parameters+0x44:        90 00 80 0c  add       %g2, %o4, %o0
    parameters+0x48:        81 c3 e0 08  retl
    parameters+0x4c:        90 02 00 04  add       %o0, %g4, %o0

Thursday Nov 13, 2014

Software in Silicon Cloud

I missed this press release about Software in Silicon Cloud. It's the announcement for a service where you can try out a SPARC M7 processor. There's an accompanying website which has the sign up plus some more information about the service.

What's particularly exciting is that it talks a bit more about Application Data Integrity (ADI). Larry Ellison called this "the most important piece of engineering we’ve done in a long, long time.".

Incorrect handling of pointers is a large contributor to bugs in software. ADI tackles this by making the hardware check that the pointer being used is valid for the region of memory it is pointing to. If it's not valid the hardware flags it as an error. Since it's done by hardware, there's minimal performance impact - it's at hardware speed, so developers can check their application in realtime.

There's a nice demo of how ADI protects against exploits like HeartBleed.

Wednesday Nov 12, 2014

Oracle Solaris Studio playlist

There's an extensive list of Solaris Studio videos on youtube. In particular there's a bunch of tutorials covering the features of the IDE. The IDE doesn't often get the attention it deserves. It's based off NetBeans, and is full of useful code refactoring tools, navigation tools, etc. To find out more, take a look at some of the videos.

Tuesday Nov 11, 2014

New Performance Analyzer Overview screen

I love using the Performance Analyzer, but the question I often get when I show it to people, is "Where do I start?". So one of the improvements in Solaris Studio 12.4 is an Overview screen to help people get started with the tool. Here's what it looks like:

The reason this is important, is that many applications spend time in various place - like waiting on disk, or in user locks - and it's not always obvious where is going to be the most effective place to look for performance gains.

The Overview screen is meant to be the "one-stop" place where people can find out what their application is doing. When we put it back into the product I expected it to be the screen that I glanced at then never went back to. I was most surprised when this turned out not to be the case.

During performance analysis, I'm often exploring different ideas as to where it might be possible to get performance improvements. The Overview screen allows me to select the metrics that I'm interested in, then take a look at the resulting profiles. So I might start with system time, and just enable the system time metrics. Once I'm done with that, I might move on to user time, and select those metrics. So what was surprising about the Overview screen was how often I returned to it to change the metrics I was using.

So what does the screen contain? The overview shows all the available metrics. The bars indicate which metrics contribute the most time. So it's easy to pick (and explore) the metrics that contribute the most time.

If the profile contains performance counter metrics, then those also appear. If the counters include instructions and cycles, then the synthetic CPI/IPC metrics are also available. The Overview screen is really useful for hardware counter metrics.

I use performance counters in a couple of ways: to confirm a hypothesis about performance or to estimate time spent on a type of event. For example, if I think a load is taking a lot of time due to TLB misses, then profiling on the TLB miss performance counter will tell me whether that load has a lot of misses or not. Alternatively, if I've got TLB miss counter data, then I can scale this by the cost per TLB miss, and get an estimate of the total runtime lost to TLB misses.

Where the Overview screen comes into this is that I will often want to minimise the number of columns of data that are shown (to fit everything onto my monitor), but sometimes I want to quickly enable a counter to see whether that event happens at the bit of code where I'm looking. Hence I end up flipping to the Overview screen and then returning to the code.

So what I thought would be a nice feature, actually became pretty central to my work-flow.

I should have a more detailed paper about the Overview screen up on OTN soon.

Performance made easy

The big news of the day is that Oracle Solaris Studio 12.4 is available for download. I'd like to thank all those people who tried out the beta releases and gave us feedback.

There's a number of things that are new in this release. The most obvious one is C++11 support, I've written a bit about the lambda expression support, tuples, and unordered containers.

My favourite tool, the Performance Analyzer, has also had a bit of a facelift. I'll talk about the Overview screen in a separate post (and in an article), but there's some other fantastic features. The syntax highlighting, and hyperlinking, has made navigating profiles much easier. There's been a large number of improvements in filtering - a feature that's been in the product a long time, but these changes elevate it to being much more accessible (an article on filtering is long overdue!). There's also the default hardware counters - which makes it a no-brainer to get hardware counter data, which is really helpful in understanding exactly what an application is doing.

Over the development cycle I've made much use of the other tools. The Thread Analyzer for identifying data races has had some improvements. The Code Analyzer tools have made some great gains in rapidly identifying potential coding errors. And so on....

Anyway, please download the new version, try it out, try out the tools, and let us know what you think of it.

Tuesday Nov 04, 2014

SPARC Software in Silicon

Short video by Juan Loaiza about the Software in Silicon work in the upcoming SPARC processor.

Friday Oct 10, 2014

OpenWorld and JavaOne slides available for download

Thanks everyone who attended my talks last week. My slides for OpenWorld and JavaOne are available for download:

Sunday Sep 28, 2014

SPARC Processor Documentation

I'm pretty excited, we've now got documentation up for the SPARC processors. Take a look at the SPARC T4 supplement, the SPARC T4 performance instrumentation supplement, the SPARC M5 supplement, or the familiar SPARC 2011 Architecture.

Tuesday Sep 23, 2014

Comparing constant duration profiles

I was putting together my slides for Open World, and in one of them I'm showing profile data from a server-style workload. ie one that keeps running until stopped. In this case the profile can be an arbitrary duration, and it's the work done in that time which is the important metric, not the total amount of time taken.

Profiling for a constant duration is a slightly unusual situation. We normally profile a workload that takes N seconds, do some tuning, and it now takes (N-S) seconds, and we can say that we improved performance by S/N percent. This is represented by the left pair of boxes in the following diagram:

In the diagram you can see that the routine B got optimised and therefore the entire runtime, for completing the same amount of work, reduced by an amount corresponding to the performance improvement for B.

Let's run through the same scenario, but instead of profiling for a constant amount of work, we profile for a constant duration. In the diagram this is represented by the outermost pair of boxes.

Both profiles run for the same total amount of time, but the right hand profile has less time spent in routine B() than the left profile, because the time in B() has reduced more time is spent in A(). This is natural, I've made some part of the code more efficient, I'm observing for the same amount of time, so I must spend more time in the part of the code that I've not optimised.

So what's the performance gain? In this case we're more likely to look at the gain in throughput. It's a safe assumption that the amount of time in A() corresponds to the amount of work done - ie that if we did T units of work, then the average cost per unit work A()/T is the same across the pair of experiments. So if we did T units of work in the first experiment, then in the second experiment we'd do T * A'()/A(). ie the throughput increases by S = A'()/A() where S is the scaling factor. What is interesting about this is that A() represents any measure of time spent in code which was not optimised. So A() could be a single routine or it could be all the routines that are untouched by the optimisation.

Friday Sep 05, 2014

Fun with signal handlers

I recently had a couple of projects where I needed to write some signal handling code. I figured it would be helpful to write up a short article on my experiences.

The article contains two examples. The first is using a timer to write a simple profiler for an application - so you can find out what code is currently being executed. The second is potentially more esoteric - handling illegal instructions. This is probably worth explaining a bit.

When a SPARC processor hits an instruction that it does not understand, it traps. You typically see this if an application has gone off into the weeds and started executing the data segment or something. However, you can use this feature for doing something whenever the processor encounters an illegal instruction. If it's a valid instruction that isn't available on the processor, you could write emulation code. Or you could use it as a kind of break point that you insert into the code. Or you could use it to make up your own instruction set. That bit's left as an exercise for you. The article provides the template of how to do it.

Monday Aug 25, 2014

My schedule for JavaOne and Oracle Open World

I'm very excited to have got my schedule for Open World and JavaOne:

CON8108: Engineering Insights: Best Practices for Optimizing Oracle Software for Oracle Hardware
Venue / Room: Intercontinental - Grand Ballroom C
Date and Time: 10/1/14, 16:45 - 17:30

CON2654: Java Performance: Hardware, Structures, and Algorithms
Venue / Room: Hilton - Imperial Ballroom A
Date and Time: 9/29/14, 17:30 - 18:30

The first talk will be about some of the techniques I use when performance tuning software. We get very involved in looking at how Oracle software works on Oracle hardware. The things we do work for any software, but we have the advantage of good working relationships with the critical teams.

The second talk is with Charlie Hunt, it's a follow on from the talk we gave at JavaOne last year. We got Rock Star awards for that, so the pressure's on a bit for this sequel. Fortunately there's still plenty to talk about when you look at how Java programs interact with the hardware, and how careful choices of data structures and algorithms can have a significant impact on delivered performance.

Anyway, I hope to see a bunch of people there, if you're reading this, please come and introduce yourself. If you don't make it I'm looking forward to putting links to the presentations up.

Friday Jul 11, 2014

Studio 12.4 Beta Refresh, performance counters, and CPI

We've just released the refresh beta for Solaris Studio 12.4 - free download. This release features quite a lot of changes to a number of components. It's worth calling out improvements in the C++11 support and other tools. We've had few comments and posts on the Studio forums, and a bunch of these have resulted in improvements in this refresh.

One of the features that is deserving of greater attention is default hardware counters in the Performance Analyzer.

Default hardware counters

There's a lot of potential hardware counters that you can profile your application on. Some of them are easy to understand, some require a bit more thought, and some are delightfully cryptic (for example, I'm sure that op_stv_wait_sxmiss_ex means something to someone). Consequently most people don't pay them much attention.

On the other hand, some of us get very excited about hardware performance counters, and the information that they can provide. It's good to be able to reveal that we've made some steps along the path of making that information more generally available.

The new feature in the Performance Analyzer is default hardware counters. For most platforms we've selected a set of meaningful performance counters. You get these if you add -h on to the flags passed to collect. For example:

$ collect -h on ./a.out

Using the counters

Typically the counters will gather cycles, instructions, and cache misses - these are relatively easy to understand and often provide very useful information. In particular, given a count of instructions and a count of cycles, it's easy to compute Cycles per Instruction (CPI) or Instructions per Cycle(IPC).

I'm not a great fan of CPI or IPC as absolute measurements - working in the compiler team there are plenty of ways to change these metrics by controlling the I (instructions) when I really care most about the C (cycles). But, the two measurements have a very useful purpose when examining a profile.

A high CPI means lots cycles were spent somewhere, and very few instructions were issued in that time. This means lots of stall, which means that there's some potential for performance gains. So a good rule of thumb for where to focus first is routines that take a lot of time, and have a high CPI.

IPC is useful for a different reason. A processor can issue a maximum number of instructions per cycle. For example, a T4 processor can issue two instructions per cycle. If I see an IPC of 2 for one routine, I know that the code is not stalled, and is limited by instruction count. So when I look at a code with a high IPC I can focus on optimisations that reduce the instruction count.

So both IPC and CPI are meaningful metrics. Reflecting this, the Performance Analyzer will compute the metrics if the hardware counter data is available. Here's an example:

This code was deliberately contrived so that all the routines had ludicrously high CPI. But isn't that cool - I can immediately see what kinds of opportunities might be lurking in the code.

This is not restricted to just the functions view, CPI and/or IPC are presented in every view - so you can look at CPI for each thread, line of source, line of disassembly. Of course, as the counter data gets spread over more "lines" you have less data per line, and consequently more noise. So CPI data at the disassembly level is not likely to be that useful for very short running experiments. But when aggregated, the CPI can often be meaningful even for short experiments.

Friday May 23, 2014

Generic hardware counter events

A while back, Solaris introduced support for PAPI - which is probably as close as we can get to a de-facto standard for performance counter naming. For performance counter geeks like me, this is not quite enough information, I actually want to know the names of the raw counters used. Fortunately this is provided in the generic_events man page:

$ man generic_events
Reformatting page.  Please Wait... done

CPU Performance Counters Library Functions   generic_events(3CPC)

     generic_events - generic performance counter events

     The Solaris  cpc(3CPC)  subsystem  implements  a  number  of
     predefined, generic performance counter events. Each generic
   Intel Pentium Pro/II/III Processor
       Generic Event          Platform Event          Event Mask
     PAPI_ca_shr          l2_ifetch                 0xf
     PAPI_ca_cln          bus_tran_rfo              0x0
     PAPI_ca_itv          bus_tran_inval            0x0
     PAPI_tlb_im          itlb_miss                 0x0
     PAPI_btac_m          btb_misses                0x0
     PAPI_hw_int          hw_int_rx                 0x0

Tuesday May 06, 2014

Introduction to OpenMP

Recently, I had the opportunity to talk with Nawal Copty, our OpenMP representative, about writing parallel applications using OpenMP, and about the new features in the OpenMP 4.0 specification. The video is available on youtube.

The set of recent videos can be accessed either through the Oracle Learning Library, or as a playlist on youtube.

Thursday Apr 17, 2014

What's new in C++11

I always enjoy chatting with Steve Clamage about C++, and I was really pleased to get to interview him about what we should expect from the new 2011 standard.


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
Free Download


« June 2016
The Developer's Edge
Solaris Application Programming
OpenSPARC Book
Multicore Application Programming