Tuesday Mar 31, 2015

Using the Solaris Studio IDE for remote development

Solaris Studio has an IDE based on NetBeans.org, one of the features of the IDE is its ability to do remote development - ie do your development work on a Windows laptop while doing the builds on a remote Solaris or Linux box. Vladimir has written up a nice how-to guide covering the three models that Studio supports:

As shown in the image, the three models are:

  • Fully remote. The source and builds remain on the remote system.
  • Mixed/shared. The source is on a shared network location, but the builds are performed remotely.
  • Simple. Where the source is cached on the local machine for editing, but sent to the remote machine for builds.

Wednesday Mar 25, 2015

Community redesign...

Rick has a nice post about the changes to the Oracle community pages. Useful quick read.

Saturday Mar 21, 2015

New Studio C++ blogger

Please welcome another of my colleagues to blogs.oracle.com. Fedor's just posted some details about getting Boost to compile with Studio.

Monday Mar 02, 2015

Building xerces 2.8.0

Some notes on the building xerces 2.8.0 on Solaris. You can find the build instructions on the xerces site. But there's some changes that are needed to make it work with recent Studio compilers.

  • Remove -ptr from Makefile.incl. This is a deprecated option and should be removed from the Makefile.
  • Add -library=stlport4 into Makefile.incl. Need this flag on the compile and link lines in order that stlport4 is used instead of the default libCstd. Most recent C++ codes require a more up to date version of the Standard Library than the default provides.
  • -xarch=sparcv9 with -m64 in runConfigure. This stops the compiler from emitting a warning message about deprecated options.

Build with:

$ ./runConfigure -p solaris -c cc -x CC -r pthread -b 64
$ ./configure
$ ./gmake

Hopefully this gives you sufficient information to build the 2.8.0 version of the library.

Update: On 19th March 2015 they removed the 2.8 branch of xerces. So the instructions are rather redundant now :)

Building old code

Just been looking at a strange link time error:

ld.so.1: lddstub: fatal: tr/dev/nul: open failed: No such file or directory

I got this compiling C++ code that was expecting one of the old Studio compilers (probably Workshop vintage). The clue to figuring out what was wrong was this warning in the build:

CC: Warning: Option -ptr/dev/nul passed to ld, if ld is invoked, ignored otherwise

What was happening was that the library was building using the long-since-deprecated -ptr option. This specified the location of the template repository in really old versions of the Sun compilers. We've long since moved from using a repository for templates. The deprecation process is that the option gets a warning message for a while, then eventually the compiler stops giving the warning and starts ignoring the option. In this case, however, the option gets passed to the linker, and unfortunately it happens to be a meaningful option for the linker:

        [-p auditlib]   identify audit library to accompany this object

So the linker acts on this, and you end up with the fatal link time error.

Friday Feb 27, 2015

Improper member use error

I hit this Studio compiler error message, and it took me a few minutes to work out what was going wrong. So I'm writing it up in case anyone else hits it. Consider this code:

typedef struct t
{
   int t1;
   int t2;
} t_t;

struct q
{
   int q1;
   int q2;
};

void main()
{
   struct t v; // Instantiate one structure
   v.q1 = 0;   // Use member of a different structure
}

The C compiler produces this error:

$ cc odd.c
"odd.c", line 16: improper member use: q1
cc: acomp failed for odd.c 

The compiler recognises the structure member, and works out that it's not valid to use it in the context of the other structure - hence the error message. But the error message is not exactly clear about it.

Tuesday Feb 17, 2015

Profiling the kernel

One of the incredibly useful features in Studio is the ability to profile the kernel. The tool to do this is er_kernel. It's based around dtrace, so you either need to run it with escalated privileges, or you need to edit /etc/user_attr to add something like:

<username>::::defaultpriv=basic,dtrace_user,dtrace_proc,dtrace_kernel

The correct way to modify user_attr is with the command usermod:

usermod -K defaultpriv=basic,dtrace_user,dtrace_proc,dtrace_kernel <username>

There's two ways to run er_kernel. The default mode is to just profile the kernel:

$ er_kernel sleep 10
....
Creating experiment database ktest.1.er (Process ID: 7399) ...
....
$ er_print -limit 10 -func ktest.1.er
Functions sorted by metric: Exclusive Kernel CPU Time

Excl.     Incl.      Name
Kernel    Kernel
CPU sec.  CPU sec.
19.242    19.242     <Total>
14.869    14.869     <l_PID_7398>
 0.687     0.949     default_mutex_lock_delay
 0.263     0.263     mutex_enter
 0.202     0.202     <java_PID_248>
 0.162     0.162     gettick
 0.141     0.141     hv_ldc_tx_set_qtail
...

The we passed the command sleep 10 to er_kernel, this causes it to profile for 10 seconds. It might be better form to use the equivalent command line option -t 10.

In the profile we can see a couple of user processes together with some kernel activity. The other way to run er_kernel is to profile the kernel and user processes. We enable this mode with the command line option -F on:

$ er_kernel -F on sleep 10
...
Creating experiment database ktest.2.er (Process ID: 7630) ...
...
$ er_print -limit 5 -func ktest.2.er
Functions sorted by metric: Exclusive Total CPU Time

Excl.     Incl.     Excl.     Incl.      Name
Total     Total     Kernel    Kernel
CPU sec.  CPU sec.  CPU sec.  CPU sec.
15.384    15.384    16.333    16.333     <Total>
15.061    15.061     0.        0.        main
 0.061     0.061     0.        0.        ioctl
 0.051     0.141     0.        0.        dt_consume_cpu
 0.040     0.040     0.        0.        __nanosleep
...

In this case we can see all the userland activity as well as kernel activity. The -F option is very flexible, instead of just profiling everything, we can use -F =<regexp>syntax to specify either a PID or process name to profile:

$ er_kernel -F =7398

Tuesday Feb 10, 2015

Printing out arguments

Rather unexciting application for printing out the arguments passed into an application:

#include <stdio.h>


void main(int argc, char** argv)
{
  for (int i=0; i<argc; i++)
  {
    printf(" %i = \"%s\"\n",i, argv[i]);
  }
}

Monday Feb 09, 2015

New Studio blogger

One of my colleagues, Raj Prakash, has started a blog. If you saw the Studio videos from last year, you'll recall that Raj and I discussed memory error dectection, and appropriately enough his first post is on how you can use Studio's memory error detection capabilities on server-type applications.

Thursday Feb 05, 2015

Digging into microstate accounting

Solaris has support for microstate accounting. This gives huge insight into where an application and its threads are spending their time. It breaks down time into the (obvious) user and system, but also allows you to see the time spent waiting on page faults and other useful-to-know states.

This level of detail is available through the usage file in /proc/pid, there's a corresponding file for each lwp in /proc/pid/lwp/lwpid/lwpusage. You can find more details about the /proc file system in documentation, or reading my recent article about tracking memory use.

Here's an example of using it to report idle time, ie time when the process wasn't busy:

#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>
#include <fcntl.h>
#include <procfs.h>

void busy()
{
  for (int i=0; i<100000; i++)
  {
   double d = i;
   while (d>0) { d=d *0.5; }
  }
}

void lazy()
{
  sleep(10);
}

double convert(timestruc_t ts)
{
  return ts.tv_sec + ts.tv_nsec/1000000000.0;
}

void report_idle()
{
  prusage_t prusage;
  int fd;
  fd = open( "/proc/self/usage", O_RDONLY);
  if (fd == -1) { return; }
  read( fd, &prusage, sizeof(prusage) );
  close( fd );
  printf("Idle percent = %3.2f\n",
  100.0*(1.0 - (convert(prusage.pr_utime) + convert(prusage.pr_stime))
               /convert(prusage.pr_rtime) ) );
}

void main()
{
  report_idle();
  busy();
  report_idle();
  lazy();
  report_idle();
}

The code has two functions that take time. The first does some redundant FP computation (that cannot be optimised out unless you tell the compiler to do FP optimisations), this part of the code is CPU bound. When run the program reports low idle time for this section of the code. The second routine calls sleep(), so the program is idle at this point waiting for the sleep time to expire, hence this section is reported as being high idle time.

Wednesday Feb 04, 2015

Namespaces in C++

A porting problem I hit with regularity is using functions in the standard namespace. Fortunately, it's a relatively easy problem to diagnose and fix. But it is very common, and it's worth discussing how it happens.

C++ namespaces are a very useful feature that allows an application to use identical names for symbols in different contexts. Here's an example where we define two namespaces and place identically named functions in the two of them.

#include <iostream>

namespace ns1
{
  void hello() 
  { std::cout << "Hello ns1\n"; }
}

namespace ns2
{
  void hello() 
  { std::cout << "Hello ns2\n"; }
}

int main()
{
  ns1::hello();
  ns2::hello();
}

The construct namespace optional_name is used to introduce a namespace. In this example we have introduced two namespaces ns1 and ns2. Both namespaces have a routine called hello, but both routines can happily co-exist because they exist in different namespaces.

Behind the scenes, the namespace becomes part of the symbol name:

$ nm a.out|grep hello
[63]    |     68640|        44|FUNC |GLOB |0    |11     |__1cDns1Fhello6F_v_
[56]    |     68704|        44|FUNC |GLOB |0    |11     |__1cDns2Fhello6F_v_

When it comes to using functions declared in namespaces we can prepend the namespace to the name of the symbol, this uniquely identifies the symbol. You can see this in the example where the calls to hello() from the different namespaces are prefixed with the namespace.

However, prefixing every function call with its namespace can rapidly become very tedious. So there is a way to make this easier. First of all, let's quickly discuss the global namespace. The global namespace is the namespace that is searched if you do not specify a namespace - kind of the default namespace. If you declare a function foo() in your code, then it naturally resides in the global namespace.

We can add symbols from other namespaces into the global namespace using the using keyword. There are two ways we can do this. One way is to add the entire namespace into the global namespace. The other way is to symbols individually into the name space. To do this write using namespace <namespace>; to import the entire namespace into the global namespace, or using <namespace>::<function>; to just import a single function into the global namespace. Here's the earlier example modified to show both approaches:

#include <iostream>

namespace ns1
{
  void hello() 
  { std::cout << "Hello ns1\n"; }
}

namespace ns2
{
  void hello() 
  { std::cout << "Hello ns2\n"; }
}

int main()
{
  { 
    using namespace ns1; 
    hello(); 
  }
  { 
    using ns2::hello;
    hello(); 
  }
}

The other thing you will notice in the example is the use of std::cout. Notice that this is prefixed with the std:: namespace. This is an example of a situation where you might encounter porting problems.

The C++03 standard (17.4.1.1) says this about the C++ Standard Library "All library entities except macros, operator new and operator delete are defined within the namespace std or namespaces nested within the namespace std.". This means that, according to the standard, if you include iostream then cout will be defined in the std namespace. That's the only place you can rely on it being available.

Now, sometimes you might find a function that is in the std namespace is already available in the general namespace. For example, gcc puts all the functions that are in the std namespace into the general namespace.

Other times, you might include a header file which has already imported an entire namespace, or particular symbols from a namespace. This can happen if you change the Standard Library that you are using and the new header files contain a different set of includes and using statements.

There's one other area where you can encounter this, and that is using C library routines. All the C header files have a C++ counterpart. For example stdio.h has the counterpart cstdio. One difference between the two headers is the namespace where the routines are placed. If the C headers are used, then the symbols get placed into the global namespace, if the C++ headers are used the symbols get placed into the C++ namespace. This behaviour is defined by section D.5 of the C++03 standard. Here's an example where we use both the C and C++ header files, and need to specify the namespace for the functions from the C++ header file:

#include <cstdio>
#include <strings.h>

int main()
{
  char string[100];
  strcpy( string, "Hello" );
  std::printf( "%s\n", string );
}

Tuesday Feb 03, 2015

Complete set of bit manipulation posts combined

My recent blog posts on bit manipulation are now available as an article up on the OTN community pages. If you want to read the individual posts they are:

Bit manipulation: Gathering bits

In the last post on bit manipulation we looked at how we could identify bytes that were greater than a particular target value, and stop when we discovered one. The resulting vector of bytes contained a zero byte for those which did not meet the criteria, and a byte containing 0x80 for those that did. Obviously we could express the result much more efficiently if we assigned a single bit for each result. The following is "lightly" optimised code for producing a bit vector indicating the position of zero bytes:

void zeros( unsigned char * array, int length, unsigned char * result )
{
  for (int i=0;i < length; i+=8)
  {
    result[i>>3] = 
    ( (array[i+0]==0) << 7) +
    ( (array[i+1]==0) << 6) +
    ( (array[i+2]==0) << 5) +
    ( (array[i+3]==0) << 4) +
    ( (array[i+4]==0) << 3) +
    ( (array[i+5]==0) << 2) +
    ( (array[i+6]==0) << 1) +
    ( (array[i+7]==0) << 0);
  }
}

The code is "lightly" optimised because it works on eight values at a time. This helps performance because the code can store results a byte at a time. An even less optimised version would split the index into a byte and bit offset and use that to update the result vector.

When we previously looked at finding zero bytes we used Mycroft's algorithm that determines whether a zero byte is present or not. It does not indicate where the zero byte is to be found. For this new problem we want to identify exactly which bytes contain zero. So we can come up with two rules that both need be true:

  • The inverted byte must have a set upper bit.
  • If we invert the byte and select the lower bits, adding one to these must set the upper bit.

Putting these into a logical operation we get (~byte & ( (~byte & 0x7f) + 1) & 0x80). For non-zero input bytes we get a result of zero, for zero input bytes we get a result of 0x80. Next we need to convert these into a bit vector.

If you recall the population count example from earlier, we used a set of operations to combine adjacent bits. In this case we want to do something similar, but instead of adding bits we want to shift them so that they end up in the right places. The code to perform the comparison and shift the results is:

void zeros2( unsigned long long* array, int length, unsigned char* result )
{
  for (int i=0; i<length; i+=8)
  {
    unsigned long long v, u;
    v = array[ i>>3 ];

    u = ~v;
    u = u & 0x7f7f7f7f7f7f7f7f;
    u = u + 0x0101010101010101;
    v = u & (~v);
    v = v & 0x8080808080808080;

    v = v | (v << 7);
    v = v | (v << 14);
    v = (v >> 56) | (v >> 28);
    result[ i>>3 ] = v;
  }
}

The resulting code runs about four times faster than the original.

Concluding remarks

So that ends this brief series on bit manipulation, I hope you've found it interesting, if you want to investigate this further there are plenty of resources on the web, but it would be hard to skip mentioning the book "The Hacker's Delight", which is a great read on this domain.

There's a couple of concluding thoughts. First of all performance comes from doing operations on multiple items of data in the same instruction. This should sound familiar as "SIMD", so a processor might often have vector instructions that already get the benefits of single instruction, multiple data, and single SIMD instruction might replace several integer operations in the above codes. The other place the performance comes from is eliminating branch instructions - particularly the unpredictable ones, again vector instructions might offer a similar benefit.

Monday Feb 02, 2015

Bit manipulation: finding a range of values

We previously looked at finding zero values in an array. A similar problem is to find a value larger than some target. The vanilla code for this is pretty simple:

#include "timing.h"

int range(char * array, unsigned int length, unsigned char target)
{
  for (unsigned int i=0; i<length; i++)
  {
    if (array[i]>target) { return i; }
  }
  return -1;
}

It's possible to recode this to use bit operations, but there is a small complication. We need two versions of the routine depending on whether the target value is >127 or not. Let's start with the target greater than 127. There are two rules to finding bytes greater than this target:

  • The upper bit is set in the target value, this means that the upper bit must also be set in the bytes we examine. So we can AND the input value with 0x80, and this must be 0x80.
  • We want a bit more precision than testing the upper bit. We need to know that the value is greater than the target value. So if we clear the upper bit we get a number between 0 and 127. This is equivalent to subtracting 128 off all the bytes that have a value greater than 127. So instead of doing a comparison of is 132 greater than 192 we can do an equivalent check of is (132-128) greater than (192-128), or is 4 greater than 64? However, we want bytes where this is true to end up with their upper bit set. So we can do an ADD operation where we add sufficient to each byte to cause the result to be greater than 128 for the bytes with a value greater than the target. The operation for this is ( byte & 0x7f ) + (255-target).

The second condition is hard to understand, so consider an example where we are searching for values greater than 192. We have an input of 132. So the first of the two conditions produces 132 & 0x80 = 0x80. For the second condition we want to do (132 & 0x7f) + (255-192) = 4+63 = 68 so the second condition does not produce a value with the upper bit set. Trying again with an input of 193 we get 65 + 63 = 128 so the upper bit is set, and we get a result of 0x80 indicating that the byte is selected.

The full operation is (byte & ( (byte & 0x7f) + (255 - target) ) & 0x80).

If the target value is less than 128 we perform a similar set of operations. In this case if the upper bit is set then the byte is automatically greater than the target value. If the upper bit is not set we have to add sufficient on to cause the upper bit to be set by any value that meets the criteria.

The operation looks like (byte | ( (byte & 0x7f) + (127 - target) ) & 0x80).

Putting all this together we get the following code:

int range2( unsigned char* array, unsigned int length, unsigned char target )
{
  unsigned int i = 0;
  // Handle misalignment
  while ( (length > 0) && ( (unsigned long long) & array[i] & 7) )
  { 
    if ( array[i] > target ) { return i; }
    i++;
    length--;
  }
  // Optimised code
  unsigned long long * p = (unsigned long long*) &array[i];
  if (target < 128)
  {
    unsigned long long v8 = 127 - target;
    v8 = v8 | (v8 << 8);
    v8 = v8 | (v8 << 16);
    v8 = v8 | (v8 << 32);

    while (length > 8) 
    {
      unsigned long long v = *p;
      unsigned long long u;
      u = v & 0x8080808080808080; // upper bit
      v = v & 0x7f7f7f7f7f7f7f7f; // lower bits
      v = v + v8;
      unsigned long long r = (v | u) & 0x8080808080808080;
      if (r) { break; }
      length-=8;
      p++;
      i+=8;
    }
  }
  else
  {
    unsigned long long v8 = 255 - target;
    v8 = v8 | (v8 << 8);
    v8 = v8 | (v8 << 16);
    v8 = v8 | (v8 << 32);
    while (length > 8)
    {
      unsigned long long v = *p;
      unsigned long long u;
      u = v & 0x8080808080808080; // upper bit
      v = v & 0x7f7f7f7f7f7f7f7f; // lower bits
      v = v + v8;
      unsigned long long r = v & u;
      if (r) { break; }
      length-=8;
      p++;
      i+=8;
    }
  }

  // Handle trailing values
  while (length > 0)
  { 
    if (array[i] > target) { return i; }
    i++;
    length--;
  }
  return -1;
}

The resulting code runs about 4x faster than the original version.

Friday Jan 30, 2015

Finding zero values in an array

A common thing to want to do is to find zero values in an array. This is obviously necessary for string length. So we'll start out with a test harness and a simple implementation:

#include "timing.h"

unsigned int len(char* array)
{
  unsigned int length = 0;
  while( array[length] ) { length++; }
  return length;
}

#define COUNT 100000
void main()
{
  char array[ COUNT ];
  for (int i=1; i<COUNT; i++)
  {
    array[i-1] = 'a';
    array[i] = 0;
    if ( i != len(array) ) { printf( "Error at %i\n", i ); }
  }
  starttime();
  for (int i=1; i<COUNT; i++)
  {
    array[i-1] = 'a';
    array[i] = 0;
    len(array);
  }
  endtime(COUNT);
}

A chap called Alan Mycroft came up with a very neat algorithm to simultaneously examine multiple bytes and determine whether there is a zero in them. His algorithm starts off with the idea that there are two conditions that need to be true if a byte contains the value zero. First of all the upper bit of the byte must be zero, this is true for zero and all values less than 128, so on its own it is not sufficient. The second characteristic is that if one is subtracted from the value, then the upper bit must be one. This is true for zero and all values greater than 128. Although both conditions are individually satisfied by multiple values, the only value that satisfies both conditions is zero.

The following code uses the Mycroft test for a string length implementation. The code contains a pre-loop to get to an eight byte aligned address.

unsigned int len2(char* array)
{
  unsigned int length = 0;
  // Handle misaligned data
  while ( ( (unsigned long long) & array[length] ) &7 )
  {
    if ( array[length] == 0 ) { return length; }
    length++;
  }

  unsigned long long * p = (unsigned long long *) & array[length];
  unsigned long long v8, v7;
  do
  {
    v8 = *p;
    v7 = v8 - 0x0101010101010101;
    v7 = (v7 & ~v8) & 0x8080808080808080;
    p++;
  }
  while ( !v7 );
  length = (char*)p - array-8;
  while ( array[length] ) { length++; }
  return length;
}

The algorithm has one weak point. It does not always report exactly which byte is zero, just that there is a zero byte somewhere. Hence the final loop where we work out exactly which byte is zero.

It is a trivial extension to use this to search for a byte of any value. If we XOR the input vector with a vector of bytes containing the target value, then we get a zero byte where the target value occurs, and a non-zero byte everywhere else.

It is also easy to extend the code to search for other zero bit patterns. For example, if we want to find zero nibbles (ie 4 bit values), then we can change the constants to be 0x1111111111111111 and 0x8888888888888888.

Thursday Jan 29, 2015

gedit troubles

Hit a couple of issues with gedit, just documenting them in case others hit the same problems.

X11 connection rejected because of wrong authentication.

This turned out to be because there was already a copy of gedit running on the system.

GConf Error: Failed to contact configuration server; some possible causes are that you need to 
enable TCP/IP networking for ORBit, or you have stale NFS locks due to a system crash. See 
http://projects.gnome.org/gconf/ for information. (Details -  1: Failed to get connection 
to session: /usr/bin/dbus-launch terminated abnormally without any error message)

This turned out to be dbus-launch not being installed on the system.

Bit manipulation: Population Count

Population count is one of the more esoteric instructions. It's the operation to count the number of set bits in a register. It comes up with sufficient frequency that most processors have a hardware instruction to do it. However, for this example, we're going to look at coding it in software. First of all we'll write a baseline version of the code:

int popc(unsigned long long value)
{
  unsigned long long bit = 1;
  int popc = 0;
  while ( bit )
  {
    if ( value & bit ) { popc++; }
    bit = bit << 1;
  }
  return popc;
}

The above code examines every bit in the input and counts the number of set bits. The number of iterations is proportional to the number of bits in the register.

Most people will immediately recognise that we could make this a bit faster using the code we discussed previously that clears the last set bit, whist there are set bits keep clearing them, otherwise you're done. The advantage of this approach is that you only iterate once for every set bit in the value. So if there are no set bits, then you do not do any iterations.

int popc2( unsigned long long value )
{
  int popc = 0;
  while ( value )
  {
    popc++;
    value = value & (value-1);
  }
  return popc;
}

The next thing to do is to put together a test harness that confirms that the new code produces the same results as the old code, and also measures the performance of the two implementations.

#define COUNT 1000000
void main()
{
  // Correctness test
  for (unsigned long long i = 0; i<COUNT; i++ )
  {
    if (popc( i + (i<<32) ) != popc2( i + (i<<32) ) )
    { 
      printf(" Mismatch popc2 input %llx: %u!= %u\n", 
               i+(i<<32), popc(i+(i<<32)), popc2(i+(i<<32))); 
    }
  }
  
  // Performance test
  starttime();
  for (unsigned long long i = 0; i<COUNT; i++ )
  {
    popc(i+(i<<32));
  }
  endtime(COUNT);
  starttime();
  for (unsigned long long i = 0; i<COUNT; i++ )
  {
    popc2(i+(i<<32));
  }
}

The new code is about twice as fast as the old code. However, the new code still contains a loop, and this can be a bit of a problem.

Branch mispredictions

The trouble with loops, and with branches in general, is that processors don't know the next instruction that will be executed after the branch until the branch has been reached, but the processor needs to have already fetched instruction after the branch well before this. The problem is nicely summarised by Holly in Red Dwarf:

"Look, I'm trying to navigate at faster than the speed of light, which means that before you see something, you've already passed through it."

So processors use branch prediction to guess whether a branch is taken or not. If the prediction is correct there is no break in the instruction stream, but if the prediction is wrong, then the processor needs to throw away all the incorrectly predicted instructions, and fetch the instructions from correct address. This is a significant cost, so ideally you don't want mispredicted branches, and the best way of ensuring that is to not have branches at all!

The following code is a branchless sequence for computing population count

unsigned int popc3(unsigned long long value)
{
  unsigned long long v2;
  v2     = value &t;< 1;
  v2    &= 0x5555555555555555;
  value &= 0x5555555555555555;
  value += v2;
  v2     = value << 2;
  v2    &= 0x3333333333333333;
  value &= 0x3333333333333333;
  value += v2;
  v2     = value << 4;
  v2    &= 0x0f0f0f0f0f0f0f0f;
  value &= 0x0f0f0f0f0f0f0f0f;
  value += v2;
  v2     = value << 8;
  v2    &= 0x00ff00ff00ff00ff;
  value &= 0x00ff00ff00ff00ff;
  value += v2;
  v2     = value << 16;
  v2    &= 0x0000ffff0000ffff;
  value &= 0x0000ffff0000ffff;
  value += v2;
  v2     = value << 32;
  value += v2;
  return (unsigned int) value;
}

This instruction sequence computes the population count by initially adding adjacent bits to get a two bit result of 0, 1, or 2. It then adds the adjacent pairs of bits to get a 4 bit result of between 0 and 4. Next it adds adjacent nibbles to get a byte result, then adds pairs of bytes to get shorts, then adds shorts to get a pair of ints, which it adds to get the final value. The code contains a fair amount of AND operations to mask out the bits that are not part of the result.

This bit manipulation version is about two times faster than the clear-last-bit-set version, making it about four times faster than the original code. However, it is worth noting that this is a fixed cost. The routine takes the same amount of time regardless of the input value. In contrast the clear -last-bit-set version will exit early if there are no set bits. Consequently the performance gain for the code will depend on both the input value and the cost of mispredicted branches.

Wednesday Jan 28, 2015

Tracking application resource use

One question you might ask is "how much memory is my application consuming?". Obviously you can use prstat (prstat -cp <pid> or prstat -cmLp <pid>) to examine the behaviour of a process. But how about programmatically finding that information.

OTN has just published an article where I demonstrate how to find out about the resource use of a process, and incidentally how to put that functionality into a library that reports resource use over time.

Inline functions in C

Functions declared as inline are slightly more complex than might be expected. Douglas Walls has provided a chapter-and-verse write up. But the issue bears further explanation.

When a function is declared as inline it's a hint to the compiler that the function could be inlined. It is not a command to the compiler that the function must be inlined. If the compiler chooses not to inline the function, then the function will be left with a function call that needs to be resolved, and at link time it will be necessary for a function definition to be provided. Consider this example:

#include <stdio.h>

inline void foo() 
{
  printf(" In foo\n"); 
}

void main()
{
  foo();
}

The code provides an inline definition of foo(), so if the compiler chooses to inline the function it will use this definition. However, if it chooses not to inline the function, you will get a link time error when the linker is unable to find a suitable definition for the symbol foo:

$ cc -o in in.c
Undefined                       first referenced
 symbol                             in file
foo                                 in.o
ld: fatal: symbol referencing errors. No output written to in

This can be worked around by adding either "static" or "extern" to the definition of the inline function.

If the function is declared to be a static inline then, as before the compiler may choose to inline the function. In addition the compiler will emit a locally scoped version of the function in the object file. There can be one static version per object file, so you may end up with multiple definitions of the same function, so this can be very space inefficient. Since all the functions are locally scoped, there is are no multiple definitions.

Another approach is to declare the function as extern inline. In this case the compiler may generate inline code, and will also generate a global instance of the function. Although multiple global instances of the function might be generated in all the object files, only one will be remain in the executable after linking. So declaring functions as extern inline is more space efficient.

This behaviour is defined by the standard. However, gcc takes a different approach, which is to treat inline functions by generating a global function and potentially inlining the code. Unfortunately this can cause multiply-defined symbol errors at link time, where the same extern inline function is declared in multiple files. For example, in the following code both in.c and in2.c include in.h which contains the definition of extern inline foo()....

$ gcc -o in in.c in2.c
ld: fatal: symbol 'foo' is multiply-defined:

The gcc behaviour for functions declared as extern inline is also different. It does not emit an external definition for these functions, leading to unresolved symbol errors at link time.

For gcc, it is best to either declare the functions as extern inline and, in additional module, provide a global definition of the function, or to declare the functions as static inline and live with the multiple local symbol definitions that this produces.

So for convenience it is tempting to use static inline for all compilers. This is a good work around (ignoring the issue of duplicate local copies of functions), except for an issue around unused code.

The keyword static tells the compiler to emit a locally-scoped version of the function. Solaris Studio emits that function even if the function does not get called within the module. If that function calls a non-existent function, then you may get a link time error. Suppose you have the following code:

void error_message();

static inline unused() 
{
  error_message(); 
}

void main()
{
}

Compiling this we get the following error message:

$ cc -O i.c
"i.c", line 3: warning: no explicit type given
Undefined                       first referenced
 symbol                             in file
error_message                       i.o

Even though the function call exists in code that is not used, there is a link error for the undefined function error_message(). The same error would occur if extern inline was used as this would cause a global version of the function to be emitted. The problem would not occur if the function were just declared as inline because in this case the compiler would not emit either a global or local version of the function. The same code compiles with gcc because the unused function is not generated.

So to summarise the options:

  • Declare everything static inline, and ensure that there are no undefined functions, and that there are no functions that call undefined functions.
  • Declare everything inline for Studio and extern inline for gcc. Then provide a global version of the function in a separate file.

Improving performance through bit manipulation: clear last set bit

Bit manipulation is one of those fun areas where you can get a performance gain from recoding a routine to use logical or arithmetic instructions rather than a more straight-forward code.

Of course, in doing this you need to avoid the pit fall of premature optimisation - where you needlessly make the code more obscure with no benefit, or a benefit that disappears as soon as you run your code on a different machine. So with that caveat in mind, let's take a look at a simple example.

Clear last set bit

This is a great starting point because it nicely demonstrates how we can sometimes replace a fair chunk of code with a much simpler set of instructions. Of course, the algorithm that uses fewer instructions is harder to understand, but in some situations the performance gain is worth it.

We'll start off with some classic code to solve the problem. The reason for this is two-fold. First of all we want to clearly understand the problem we're solving. Secondly, we want a reference version that we can test against to ensure that our fiendishly clever code is actually correct. So here's our starting point:

unsigned long long clearlastbit( unsigned long long value )
{
  int bit=1;
  if ( value== 0 ) { return 0; }
  while ( !(value & bit) )
  {
    bit = bit << 1;
  }
  value = value ^ bit;
  return value;
}

But before we start trying to improve it we need a timing harness to find out how fast it runs. The following harness uses the Solaris call gethrtime() to return a timestamp in nanoseconds.

#include <stdio.h>
#include <sys/time.h>

static double s_time;

void starttime()
{
  s_time = 1.0 * gethrtime();
}

void endtime(unsigned long long its)
{
  double e_time = 1.0 * gethrtime();
  printf( "Time per iteration %5.2f ns\n", (e_time-s_time) / (1.0*its) );
  s_time = 1.0 * gethrtime();
}

The next thing we need is a workload to test the current implementation. The workload iterates over a range of numbers and repeatedly calls clearlastbit() until all the bits in the current number have been cleared.

#define COUNT 1000000
void main()
{
  starttime();
  for (unsigned long long i = 0; i < COUNT; i++ )
  {
    unsigned long long value = i;
    while (value) { value = clearlastbit(value); }
  }
  endtime(COUNT);
}

Big O notation

So let's take a break at this point to discuss big O notation. If we look at the code for clearlastbit() we can see that it contains a loop. We'll iterate around the loop once for each bit in the input value, so for a N bit number we might iterate up to N times. We say that this computation is "order N", meaning that the cost the calculation is somehow proportional to the number of bits in the input number. This is written as O(N).

The order N description is useful because it gives us some idea of the cost of calling the routine. From it we know that the routine will typically take twice as long if coded for 8 byte inputs than if we used 4 byte inputs. Order N is not too bad as costs go, the ones to look out for are order N squared, or cubed etc. For these higher orders the run time to complete a computation can become huge for even comparatively small values of N.

If we look at the test harness, we are iterating over the function COUNT times, so effectively the entire program is O(COUNT*N), and we're exploiting the fact that this is effectively an O(N^2) cost to provide a workload that has a reasonable duration.

So let's return to the problem of clearing the last set bit. One obvious optimisation would be to record the last bit that was cleared, and then start the next iteration of the loop from that point. This is potentially a nice gain, but does not fundamentally change the algorithm. A better approach is to take advantage of bit manipulation so that we can avoid the loop altogether.

unsigned long long clearlastbit2( unsigned long long value )
{
  return (value & (value-1) );
}

Ok, if you look at this code it is not immediately apparent what it does - most people would at first sight say "How can that possibly do anything useful?". The easiest way to understand it is to take an example. Suppose we pass the value ten into this function. In binary ten is encoded as 1010b. The first operation is the subtract operation which produces the result of nine, which is encoded as 1001b. We then take the AND of these two to get the result of 1000b or eight. We've cleared the last set bit because the subtract either removed the one bit (if it was set) or broke down the next largest set bit. The AND operation just keeps the bits to the left of the last set bit.

What is interesting about this snippet of code is that it is just three instructions. There's no loop and no branches - so most processors can execute this code very quickly. To demonstrate how much faster this code is, we need a test harness. The test harness should have two parts to it. The first part needs to validate that the new code produces the same result as the existing code. The second part needs to time the old and new code.

#define COUNT 1000000
void main()
{
  // Correctness test
  for (unsigned long long i = 0; i<COUNT; i++ )
  {
    unsigned long long value = i;
    while (value) 
    { 
       unsigned long long v2 = value;
       value = clearlastbit(value);
       if (value != clearlastbit2(v2)) 
       { 
         printf(" Mismatch input %llx: %llx!= %llx\n", v2, value, clearlastbit2(v2)); 
       }
    }
  }
  
  // Performance test
  starttime();
  for (unsigned long long i = 0; i<COUNT; i++ )
  {
    unsigned long long value = i;
    while (value) { value = clearlastbit(value); }
  }
  endtime(COUNT);

  starttime();
  for (unsigned long long i = 0; i<COUNT; i++ )
  {
    unsigned long long value = i;
    while (value) { value = clearlastbit2(value); }
  }
  endtime(COUNT);
}

The final result is that the bit manipulation version of this code is about 3x faster than the original code - on this workload. Of course one of the interesting things is that the performance does depend on the input values. For example, if there are no set bits, then both codes will run in about the same amount of time.

Monday Jan 26, 2015

Hands on with Solaris Studio

Over at the OTN Garage, Rick has published details of the Solaris Studio hands-on lab.

This is a great chance to learn about the Code Analyzer and the Performance Analyzer. Just to quickly recap, the Code Analyzer is really a suite of tools that do checking on application errors. This includes static analysis - ie extensive compile time checking - and dynamic analysis - run time error detection. If you regularly read this blog, you'll need no introduction to the Performance Analyzer.

The dates for these are:

  • Americas - Feb 11th 9am to 12:30pm PT
  • EMEA - Feb 24th - 9am to 12:30pm GMT
  • APAC - March 4th - 9:30am IST to 1pm IST

Tuesday Jan 13, 2015

Missing semi-colon

Thought I'd highlight this error message:

class foo
{
  foo();
}

foo::foo() 
{ }
$ CC -c c.cpp
"c.cpp", line 6: Error: A constructor may not have a return type specification.
1 Error(s) detected.

The problem is that the class definition is not terminated with a semi-colon. It should be:

class foo
{
  foo();
};  // Semi-colon

foo::foo() 
{ }

Wednesday Jan 07, 2015

Behaviour of std::list::splice in the 2003 and 2011 C++ standards

There's an interesting corner case in the behaviour of std::list::splice. In the C++98/C++03 standards it is defined such that iterators referring to the spliced element(s) are invalidated. This behaviour changes in the C++11 standard, where iterators remain valid.

The text of the 2003 standard (section 23.2.2.4, p2, p7, p12) describes the splice operation as "destructively" moving elements from one list to another. If one list is spliced into another, then all iterators and references to that list become invalid. If an element is spliced into a list, then any iterators and references to that element become invalid, similarly if a range of elements is spliced then iterators and references to those elements become invalid.

This is changed in the 2011 standard (section 23.3.5.5, p2, p4, p7, p12) where the operation is still described as being destructive, but all the iterators and references to the spliced element(s) remain valid.

The following code demonstrates the problem:

#include <list>
#include <iostream>

int main()
{
  std::list<int> list;
  std::list<int>::iterator i;

  i=list.begin();
  list.insert(i,5);
  list.insert(i,10);
  list.insert(i,3);
  list.insert(i,4); // i points to end
  // list contains 5 10 3 4
  i--; // i points to 4
  i--; // i points to 3
  i--; // i points to 10

  std::cout << " List contains: ";
  for (std::list<int>::iterator l=list.begin(); l!=list.end(); l++)
  {
    std::cout << " >" << *l << "< ";
  }
  std::cout << "\n element at i = " << *i << "\n";

  std::list<int>::iterator element;
  element = list.begin();
  element++; // points to 10
  element++; // points to 3
  std::cout << " element at element = " << *element << "\n";

  list.splice(i,list,element); // Swap 10 and 3

  std::cout << " List contains :";
  for (std::list<int>::iterator l=list.begin(); l!=list.end(); l++)
  {
    std::cout << " >" << *l << "< ";
  }

  std::cout << "\n element at element = " << *element << '\n';
  element++; // C++03, access to invalid iterator
  std::cout << " element at element = " << *element << '\n';
}

When compiled to the 2011 standard the code is expected to work and produce output like:

 List contains:  >5<  >10<  >3<  >4<
 element at i = 10
 element at element = 3
 List contains : >5<  >3<  >10<  >4<
 element at element = 3
 element at element = 10

However, the behaviour when compiled to the 2003 standard is indeterminate. It might work - if the iterator happens to remain valid, but it could also fail:

 List contains:  >5<  >10<  >3<  >4<
 element at i = 10
 element at element = 3
 List contains : >5<  >3<  >10<  >4<
 element at element = 3
 element at element = Segmentation Fault (core dumped)

Tuesday Jan 06, 2015

New articles about Solaris Studio

We've started posting new articles directly into the communities section of the Oracle website. If you're not familiar with this location, it's also where you can post questions on languages or tools.

With the change it should be easier to find articles relevant to developers, and it should be easy to comment on them. So hopefully this works out well. There's currently three articles listed on the content page. I've already posted about the article on the Performance Analyzer Overview screen, so I'll quickly highlight the other two:

  • In Studio 12.4 we've introducedfiner control over debug information. This allows you to reduce object file size by excluding debug info that you don't need. There's substantial control, but probably the easiest new option is -g1 which includes a minimal set debug info.
  • A change in Studio 12.4 is in the default way that C++ handles templates. The summary is that the compiler's default mode is inline with the way that other compilers work - you need to include template definitions in the source file being compiled. Previously the compiler would try to find the definitions in external files. This old behaviour could be confusing, so it's good that the default has changed. But it's possible that you may encounter code that was written with the expectation that the Studio compiler behaved in the old way, in this case you'll need to modify the source, or tell the compiler to revert to the older behaviour. Hopefully, most people won't even notice the change, but it's worth knowing the signs in case you encounter a problem.

The Performance Analyzer Overview screen

A while back I promised a more complete article about the Performance Analyzer Overview screen introduced in Studio 12.4. Well, here it is!

Just to recap, the Overview screen displays a summary of the contents of the experiment. This enables you to pick the appropriate metrics to display, so quickly allows you to find where the time is being spent, and then to use the rest of the tool to drill down into what is taking the time.

Thursday Nov 13, 2014

Software in Silicon Cloud

I missed this press release about Software in Silicon Cloud. It's the announcement for a service where you can try out a SPARC M7 processor. There's an accompanying website which has the sign up plus some more information about the service.

What's particularly exciting is that it talks a bit more about Application Data Integrity (ADI). Larry Ellison called this "the most important piece of engineering we’ve done in a long, long time.".

Incorrect handling of pointers is a large contributor to bugs in software. ADI tackles this by making the hardware check that the pointer being used is valid for the region of memory it is pointing to. If it's not valid the hardware flags it as an error. Since it's done by hardware, there's minimal performance impact - it's at hardware speed, so developers can check their application in realtime.

There's a nice demo of how ADI protects against exploits like HeartBleed.

Tuesday Nov 11, 2014

New Performance Analyzer Overview screen

I love using the Performance Analyzer, but the question I often get when I show it to people, is "Where do I start?". So one of the improvements in Solaris Studio 12.4 is an Overview screen to help people get started with the tool. Here's what it looks like:


The reason this is important, is that many applications spend time in various place - like waiting on disk, or in user locks - and it's not always obvious where is going to be the most effective place to look for performance gains.

The Overview screen is meant to be the "one-stop" place where people can find out what their application is doing. When we put it back into the product I expected it to be the screen that I glanced at then never went back to. I was most surprised when this turned out not to be the case.

During performance analysis, I'm often exploring different ideas as to where it might be possible to get performance improvements. The Overview screen allows me to select the metrics that I'm interested in, then take a look at the resulting profiles. So I might start with system time, and just enable the system time metrics. Once I'm done with that, I might move on to user time, and select those metrics. So what was surprising about the Overview screen was how often I returned to it to change the metrics I was using.

So what does the screen contain? The overview shows all the available metrics. The bars indicate which metrics contribute the most time. So it's easy to pick (and explore) the metrics that contribute the most time.

If the profile contains performance counter metrics, then those also appear. If the counters include instructions and cycles, then the synthetic CPI/IPC metrics are also available. The Overview screen is really useful for hardware counter metrics.

I use performance counters in a couple of ways: to confirm a hypothesis about performance or to estimate time spent on a type of event. For example, if I think a load is taking a lot of time due to TLB misses, then profiling on the TLB miss performance counter will tell me whether that load has a lot of misses or not. Alternatively, if I've got TLB miss counter data, then I can scale this by the cost per TLB miss, and get an estimate of the total runtime lost to TLB misses.

Where the Overview screen comes into this is that I will often want to minimise the number of columns of data that are shown (to fit everything onto my monitor), but sometimes I want to quickly enable a counter to see whether that event happens at the bit of code where I'm looking. Hence I end up flipping to the Overview screen and then returning to the code.

So what I thought would be a nice feature, actually became pretty central to my work-flow.

I should have a more detailed paper about the Overview screen up on OTN soon.

Tuesday Nov 04, 2014

SPARC Software in Silicon

Short video by Juan Loaiza about the Software in Silicon work in the upcoming SPARC processor.

Friday Jul 11, 2014

Studio 12.4 Beta Refresh, performance counters, and CPI

We've just released the refresh beta for Solaris Studio 12.4 - free download. This release features quite a lot of changes to a number of components. It's worth calling out improvements in the C++11 support and other tools. We've had few comments and posts on the Studio forums, and a bunch of these have resulted in improvements in this refresh.

One of the features that is deserving of greater attention is default hardware counters in the Performance Analyzer.

Default hardware counters

There's a lot of potential hardware counters that you can profile your application on. Some of them are easy to understand, some require a bit more thought, and some are delightfully cryptic (for example, I'm sure that op_stv_wait_sxmiss_ex means something to someone). Consequently most people don't pay them much attention.

On the other hand, some of us get very excited about hardware performance counters, and the information that they can provide. It's good to be able to reveal that we've made some steps along the path of making that information more generally available.

The new feature in the Performance Analyzer is default hardware counters. For most platforms we've selected a set of meaningful performance counters. You get these if you add -h on to the flags passed to collect. For example:

$ collect -h on ./a.out

Using the counters

Typically the counters will gather cycles, instructions, and cache misses - these are relatively easy to understand and often provide very useful information. In particular, given a count of instructions and a count of cycles, it's easy to compute Cycles per Instruction (CPI) or Instructions per Cycle(IPC).

I'm not a great fan of CPI or IPC as absolute measurements - working in the compiler team there are plenty of ways to change these metrics by controlling the I (instructions) when I really care most about the C (cycles). But, the two measurements have a very useful purpose when examining a profile.

A high CPI means lots cycles were spent somewhere, and very few instructions were issued in that time. This means lots of stall, which means that there's some potential for performance gains. So a good rule of thumb for where to focus first is routines that take a lot of time, and have a high CPI.

IPC is useful for a different reason. A processor can issue a maximum number of instructions per cycle. For example, a T4 processor can issue two instructions per cycle. If I see an IPC of 2 for one routine, I know that the code is not stalled, and is limited by instruction count. So when I look at a code with a high IPC I can focus on optimisations that reduce the instruction count.

So both IPC and CPI are meaningful metrics. Reflecting this, the Performance Analyzer will compute the metrics if the hardware counter data is available. Here's an example:


This code was deliberately contrived so that all the routines had ludicrously high CPI. But isn't that cool - I can immediately see what kinds of opportunities might be lurking in the code.

This is not restricted to just the functions view, CPI and/or IPC are presented in every view - so you can look at CPI for each thread, line of source, line of disassembly. Of course, as the counter data gets spread over more "lines" you have less data per line, and consequently more noise. So CPI data at the disassembly level is not likely to be that useful for very short running experiments. But when aggregated, the CPI can often be meaningful even for short experiments.

Wednesday Jun 25, 2014

Guest post on the OTN Garage

Contributed a post on how compilers handle constants to the OTN Garage. The whole OTN blog is worth reading because as well as serving up useful info, Rick has a good irreverent style of writing.

About

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
Free Download

Search

Categories
Archives
« March 2015
SunMonTueWedThuFriSat
1
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
22
23
24
26
27
28
29
30
31    
       
Today
Bookmarks
The Developer's Edge
Solaris Application Programming
Publications
Webcasts
Presentations
OpenSPARC Book
Multicore Application Programming
Docs