Wednesday Mar 25, 2015

Community redesign...

Rick has a nice post about the changes to the Oracle community pages. Useful quick read.

Tuesday Feb 03, 2015

Bit manipulation: Gathering bits

In the last post on bit manipulation we looked at how we could identify bytes that were greater than a particular target value, and stop when we discovered one. The resulting vector of bytes contained a zero byte for those which did not meet the criteria, and a byte containing 0x80 for those that did. Obviously we could express the result much more efficiently if we assigned a single bit for each result. The following is "lightly" optimised code for producing a bit vector indicating the position of zero bytes:

void zeros( unsigned char * array, int length, unsigned char * result )
  for (int i=0;i < length; i+=8)
    result[i>>3] = 
    ( (array[i+0]==0) << 7) +
    ( (array[i+1]==0) << 6) +
    ( (array[i+2]==0) << 5) +
    ( (array[i+3]==0) << 4) +
    ( (array[i+4]==0) << 3) +
    ( (array[i+5]==0) << 2) +
    ( (array[i+6]==0) << 1) +
    ( (array[i+7]==0) << 0);

The code is "lightly" optimised because it works on eight values at a time. This helps performance because the code can store results a byte at a time. An even less optimised version would split the index into a byte and bit offset and use that to update the result vector.

When we previously looked at finding zero bytes we used Mycroft's algorithm that determines whether a zero byte is present or not. It does not indicate where the zero byte is to be found. For this new problem we want to identify exactly which bytes contain zero. So we can come up with two rules that both need be true:

  • The inverted byte must have a set upper bit.
  • If we invert the byte and select the lower bits, adding one to these must set the upper bit.

Putting these into a logical operation we get (~byte & ( (~byte & 0x7f) + 1) & 0x80). For non-zero input bytes we get a result of zero, for zero input bytes we get a result of 0x80. Next we need to convert these into a bit vector.

If you recall the population count example from earlier, we used a set of operations to combine adjacent bits. In this case we want to do something similar, but instead of adding bits we want to shift them so that they end up in the right places. The code to perform the comparison and shift the results is:

void zeros2( unsigned long long* array, int length, unsigned char* result )
  for (int i=0; i<length; i+=8)
    unsigned long long v, u;
    v = array[ i>>3 ];

    u = ~v;
    u = u & 0x7f7f7f7f7f7f7f7f;
    u = u + 0x0101010101010101;
    v = u & (~v);
    v = v & 0x8080808080808080;

    v = v | (v << 7);
    v = v | (v << 14);
    v = (v >> 56) | (v >> 28);
    result[ i>>3 ] = v;

The resulting code runs about four times faster than the original.

Concluding remarks

So that ends this brief series on bit manipulation, I hope you've found it interesting, if you want to investigate this further there are plenty of resources on the web, but it would be hard to skip mentioning the book "The Hacker's Delight", which is a great read on this domain.

There's a couple of concluding thoughts. First of all performance comes from doing operations on multiple items of data in the same instruction. This should sound familiar as "SIMD", so a processor might often have vector instructions that already get the benefits of single instruction, multiple data, and single SIMD instruction might replace several integer operations in the above codes. The other place the performance comes from is eliminating branch instructions - particularly the unpredictable ones, again vector instructions might offer a similar benefit.

Friday Jan 30, 2015

Finding zero values in an array

A common thing to want to do is to find zero values in an array. This is obviously necessary for string length. So we'll start out with a test harness and a simple implementation:

#include "timing.h"

unsigned int len(char* array)
  unsigned int length = 0;
  while( array[length] ) { length++; }
  return length;

#define COUNT 100000
void main()
  char array[ COUNT ];
  for (int i=1; i<COUNT; i++)
    array[i-1] = 'a';
    array[i] = 0;
    if ( i != len(array) ) { printf( "Error at %i\n", i ); }
  for (int i=1; i<COUNT; i++)
    array[i-1] = 'a';
    array[i] = 0;

A chap called Alan Mycroft came up with a very neat algorithm to simultaneously examine multiple bytes and determine whether there is a zero in them. His algorithm starts off with the idea that there are two conditions that need to be true if a byte contains the value zero. First of all the upper bit of the byte must be zero, this is true for zero and all values less than 128, so on its own it is not sufficient. The second characteristic is that if one is subtracted from the value, then the upper bit must be one. This is true for zero and all values greater than 128. Although both conditions are individually satisfied by multiple values, the only value that satisfies both conditions is zero.

The following code uses the Mycroft test for a string length implementation. The code contains a pre-loop to get to an eight byte aligned address.

unsigned int len2(char* array)
  unsigned int length = 0;
  // Handle misaligned data
  while ( ( (unsigned long long) & array[length] ) &7 )
    if ( array[length] == 0 ) { return length; }

  unsigned long long * p = (unsigned long long *) & array[length];
  unsigned long long v8, v7;
    v8 = *p;
    v7 = v8 - 0x0101010101010101;
    v7 = (v7 & ~v8) & 0x8080808080808080;
  while ( !v7 );
  length = (char*)p - array-8;
  while ( array[length] ) { length++; }
  return length;

The algorithm has one weak point. It does not always report exactly which byte is zero, just that there is a zero byte somewhere. Hence the final loop where we work out exactly which byte is zero.

It is a trivial extension to use this to search for a byte of any value. If we XOR the input vector with a vector of bytes containing the target value, then we get a zero byte where the target value occurs, and a non-zero byte everywhere else.

It is also easy to extend the code to search for other zero bit patterns. For example, if we want to find zero nibbles (ie 4 bit values), then we can change the constants to be 0x1111111111111111 and 0x8888888888888888.

Wednesday Jan 28, 2015

Tracking application resource use

One question you might ask is "how much memory is my application consuming?". Obviously you can use prstat (prstat -cp <pid> or prstat -cmLp <pid>) to examine the behaviour of a process. But how about programmatically finding that information.

OTN has just published an article where I demonstrate how to find out about the resource use of a process, and incidentally how to put that functionality into a library that reports resource use over time.

Monday Jan 26, 2015

Hands on with Solaris Studio

Over at the OTN Garage, Rick has published details of the Solaris Studio hands-on lab.

This is a great chance to learn about the Code Analyzer and the Performance Analyzer. Just to quickly recap, the Code Analyzer is really a suite of tools that do checking on application errors. This includes static analysis - ie extensive compile time checking - and dynamic analysis - run time error detection. If you regularly read this blog, you'll need no introduction to the Performance Analyzer.

The dates for these are:

  • Americas - Feb 11th 9am to 12:30pm PT
  • EMEA - Feb 24th - 9am to 12:30pm GMT
  • APAC - March 4th - 9:30am IST to 1pm IST

Tuesday Jan 06, 2015

The Performance Analyzer Overview screen

A while back I promised a more complete article about the Performance Analyzer Overview screen introduced in Studio 12.4. Well, here it is!

Just to recap, the Overview screen displays a summary of the contents of the experiment. This enables you to pick the appropriate metrics to display, so quickly allows you to find where the time is being spent, and then to use the rest of the tool to drill down into what is taking the time.

Tuesday Nov 11, 2014

New Performance Analyzer Overview screen

I love using the Performance Analyzer, but the question I often get when I show it to people, is "Where do I start?". So one of the improvements in Solaris Studio 12.4 is an Overview screen to help people get started with the tool. Here's what it looks like:

The reason this is important, is that many applications spend time in various place - like waiting on disk, or in user locks - and it's not always obvious where is going to be the most effective place to look for performance gains.

The Overview screen is meant to be the "one-stop" place where people can find out what their application is doing. When we put it back into the product I expected it to be the screen that I glanced at then never went back to. I was most surprised when this turned out not to be the case.

During performance analysis, I'm often exploring different ideas as to where it might be possible to get performance improvements. The Overview screen allows me to select the metrics that I'm interested in, then take a look at the resulting profiles. So I might start with system time, and just enable the system time metrics. Once I'm done with that, I might move on to user time, and select those metrics. So what was surprising about the Overview screen was how often I returned to it to change the metrics I was using.

So what does the screen contain? The overview shows all the available metrics. The bars indicate which metrics contribute the most time. So it's easy to pick (and explore) the metrics that contribute the most time.

If the profile contains performance counter metrics, then those also appear. If the counters include instructions and cycles, then the synthetic CPI/IPC metrics are also available. The Overview screen is really useful for hardware counter metrics.

I use performance counters in a couple of ways: to confirm a hypothesis about performance or to estimate time spent on a type of event. For example, if I think a load is taking a lot of time due to TLB misses, then profiling on the TLB miss performance counter will tell me whether that load has a lot of misses or not. Alternatively, if I've got TLB miss counter data, then I can scale this by the cost per TLB miss, and get an estimate of the total runtime lost to TLB misses.

Where the Overview screen comes into this is that I will often want to minimise the number of columns of data that are shown (to fit everything onto my monitor), but sometimes I want to quickly enable a counter to see whether that event happens at the bit of code where I'm looking. Hence I end up flipping to the Overview screen and then returning to the code.

So what I thought would be a nice feature, actually became pretty central to my work-flow.

I should have a more detailed paper about the Overview screen up on OTN soon.

Tuesday Sep 23, 2014

Comparing constant duration profiles

I was putting together my slides for Open World, and in one of them I'm showing profile data from a server-style workload. ie one that keeps running until stopped. In this case the profile can be an arbitrary duration, and it's the work done in that time which is the important metric, not the total amount of time taken.

Profiling for a constant duration is a slightly unusual situation. We normally profile a workload that takes N seconds, do some tuning, and it now takes (N-S) seconds, and we can say that we improved performance by S/N percent. This is represented by the left pair of boxes in the following diagram:

In the diagram you can see that the routine B got optimised and therefore the entire runtime, for completing the same amount of work, reduced by an amount corresponding to the performance improvement for B.

Let's run through the same scenario, but instead of profiling for a constant amount of work, we profile for a constant duration. In the diagram this is represented by the outermost pair of boxes.

Both profiles run for the same total amount of time, but the right hand profile has less time spent in routine B() than the left profile, because the time in B() has reduced more time is spent in A(). This is natural, I've made some part of the code more efficient, I'm observing for the same amount of time, so I must spend more time in the part of the code that I've not optimised.

So what's the performance gain? In this case we're more likely to look at the gain in throughput. It's a safe assumption that the amount of time in A() corresponds to the amount of work done - ie that if we did T units of work, then the average cost per unit work A()/T is the same across the pair of experiments. So if we did T units of work in the first experiment, then in the second experiment we'd do T * A'()/A(). ie the throughput increases by S = A'()/A() where S is the scaling factor. What is interesting about this is that A() represents any measure of time spent in code which was not optimised. So A() could be a single routine or it could be all the routines that are untouched by the optimisation.

Thursday Sep 04, 2014

C++11 Array and Tuple Containers

This article came out a week or so back. It's a quick overview, from Steve Clamage and myself, of the C++11 tuple and array containers.

When you take a look at the page, I want you to take a look at the "about the authors" section on the right. I've been chatting to various people and we came up with this as a way to make the page more interesting, and also to make the "see also" suggestions more obvious. Let me know if you have any ideas for further improvements.

Monday Aug 25, 2014

My schedule for JavaOne and Oracle Open World

I'm very excited to have got my schedule for Open World and JavaOne:

CON8108: Engineering Insights: Best Practices for Optimizing Oracle Software for Oracle Hardware
Venue / Room: Intercontinental - Grand Ballroom C
Date and Time: 10/1/14, 16:45 - 17:30

CON2654: Java Performance: Hardware, Structures, and Algorithms
Venue / Room: Hilton - Imperial Ballroom A
Date and Time: 9/29/14, 17:30 - 18:30

The first talk will be about some of the techniques I use when performance tuning software. We get very involved in looking at how Oracle software works on Oracle hardware. The things we do work for any software, but we have the advantage of good working relationships with the critical teams.

The second talk is with Charlie Hunt, it's a follow on from the talk we gave at JavaOne last year. We got Rock Star awards for that, so the pressure's on a bit for this sequel. Fortunately there's still plenty to talk about when you look at how Java programs interact with the hardware, and how careful choices of data structures and algorithms can have a significant impact on delivered performance.

Anyway, I hope to see a bunch of people there, if you're reading this, please come and introduce yourself. If you don't make it I'm looking forward to putting links to the presentations up.

Wednesday Jun 25, 2014

Guest post on the OTN Garage

Contributed a post on how compilers handle constants to the OTN Garage. The whole OTN blog is worth reading because as well as serving up useful info, Rick has a good irreverent style of writing.

Wednesday Apr 16, 2014

Lambda expressions in C++11

Lambda expressions are, IMO, one of the interesting features of C++11. At first glance they do seem a bit hard to parse, but once you get used to them you start to appreciate how useful they are. Steve Clamage and I have put together a short paper introducing lambda expressions.

Friday Apr 11, 2014

New set and map containers in the C++11 Standard Library

We've just published a short article on the std::unordered_map, std::unordered_set, std::multimap, and std::multiset containers in the C++ Standard Library.


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
Free Download


« July 2016
The Developer's Edge
Solaris Application Programming
OpenSPARC Book
Multicore Application Programming