## Tuesday Feb 03, 2015

### Bit manipulation: Gathering bits

In the last post on bit manipulation we looked at how we could identify bytes that were greater than a particular target value, and stop when we discovered one. The resulting vector of bytes contained a zero byte for those which did not meet the criteria, and a byte containing 0x80 for those that did. Obviously we could express the result much more efficiently if we assigned a single bit for each result. The following is "lightly" optimised code for producing a bit vector indicating the position of zero bytes:

```void zeros( unsigned char * array, int length, unsigned char * result )
{
for (int i=0;i < length; i+=8)
{
result[i>>3] =
( (array[i+0]==0) << 7) +
( (array[i+1]==0) << 6) +
( (array[i+2]==0) << 5) +
( (array[i+3]==0) << 4) +
( (array[i+4]==0) << 3) +
( (array[i+5]==0) << 2) +
( (array[i+6]==0) << 1) +
( (array[i+7]==0) << 0);
}
}
```

The code is "lightly" optimised because it works on eight values at a time. This helps performance because the code can store results a byte at a time. An even less optimised version would split the index into a byte and bit offset and use that to update the result vector.

When we previously looked at finding zero bytes we used Mycroft's algorithm that determines whether a zero byte is present or not. It does not indicate where the zero byte is to be found. For this new problem we want to identify exactly which bytes contain zero. So we can come up with two rules that both need be true:

• The inverted byte must have a set upper bit.
• If we invert the byte and select the lower bits, adding one to these must set the upper bit.

Putting these into a logical operation we get ` (~byte & ( (~byte & 0x7f) + 1) & 0x80)`. For non-zero input bytes we get a result of zero, for zero input bytes we get a result of 0x80. Next we need to convert these into a bit vector.

If you recall the population count example from earlier, we used a set of operations to combine adjacent bits. In this case we want to do something similar, but instead of adding bits we want to shift them so that they end up in the right places. The code to perform the comparison and shift the results is:

```void zeros2( unsigned long long* array, int length, unsigned char* result )
{
for (int i=0; i<length; i+=8)
{
unsigned long long v, u;
v = array[ i>>3 ];

u = ~v;
u = u & 0x7f7f7f7f7f7f7f7f;
u = u + 0x0101010101010101;
v = u & (~v);
v = v & 0x8080808080808080;

v = v | (v << 7);
v = v | (v << 14);
v = (v >> 56) | (v >> 28);
result[ i>>3 ] = v;
}
}
```

The resulting code runs about four times faster than the original.

Concluding remarks

So that ends this brief series on bit manipulation, I hope you've found it interesting, if you want to investigate this further there are plenty of resources on the web, but it would be hard to skip mentioning the book "The Hacker's Delight", which is a great read on this domain.

There's a couple of concluding thoughts. First of all performance comes from doing operations on multiple items of data in the same instruction. This should sound familiar as "SIMD", so a processor might often have vector instructions that already get the benefits of single instruction, multiple data, and single SIMD instruction might replace several integer operations in the above codes. The other place the performance comes from is eliminating branch instructions - particularly the unpredictable ones, again vector instructions might offer a similar benefit.

## Monday Feb 02, 2015

### Bit manipulation: finding a range of values

We previously looked at finding zero values in an array. A similar problem is to find a value larger than some target. The vanilla code for this is pretty simple:

```#include "timing.h"

int range(char * array, unsigned int length, unsigned char target)
{
for (unsigned int i=0; i<length; i++)
{
if (array[i]>target) { return i; }
}
return -1;
}
```

It's possible to recode this to use bit operations, but there is a small complication. We need two versions of the routine depending on whether the target value is >127 or not. Let's start with the target greater than 127. There are two rules to finding bytes greater than this target:

• The upper bit is set in the target value, this means that the upper bit must also be set in the bytes we examine. So we can AND the input value with 0x80, and this must be 0x80.
• We want a bit more precision than testing the upper bit. We need to know that the value is greater than the target value. So if we clear the upper bit we get a number between 0 and 127. This is equivalent to subtracting 128 off all the bytes that have a value greater than 127. So instead of doing a comparison of is 132 greater than 192 we can do an equivalent check of is (132-128) greater than (192-128), or is 4 greater than 64? However, we want bytes where this is true to end up with their upper bit set. So we can do an ADD operation where we add sufficient to each byte to cause the result to be greater than 128 for the bytes with a value greater than the target. The operation for this is `( byte & 0x7f ) + (255-target)`.

The second condition is hard to understand, so consider an example where we are searching for values greater than 192. We have an input of 132. So the first of the two conditions produces `132 & 0x80 = 0x80`. For the second condition we want to do `(132 & 0x7f) + (255-192) = 4+63 = 68` so the second condition does not produce a value with the upper bit set. Trying again with an input of 193 we get `65 + 63 = 128` so the upper bit is set, and we get a result of 0x80 indicating that the byte is selected.

The full operation is `(byte & ( (byte & 0x7f) + (255 - target) ) & 0x80)`.

If the target value is less than 128 we perform a similar set of operations. In this case if the upper bit is set then the byte is automatically greater than the target value. If the upper bit is not set we have to add sufficient on to cause the upper bit to be set by any value that meets the criteria.

The operation looks like `(byte | ( (byte & 0x7f) + (127 - target) ) & 0x80)`.

Putting all this together we get the following code:

```int range2( unsigned char* array, unsigned int length, unsigned char target )
{
unsigned int i = 0;
// Handle misalignment
while ( (length > 0) && ( (unsigned long long) & array[i] & 7) )
{
if ( array[i] > target ) { return i; }
i++;
length--;
}
// Optimised code
unsigned long long * p = (unsigned long long*) &array[i];
if (target < 128)
{
unsigned long long v8 = 127 - target;
v8 = v8 | (v8 << 8);
v8 = v8 | (v8 << 16);
v8 = v8 | (v8 << 32);

while (length > 8)
{
unsigned long long v = *p;
unsigned long long u;
u = v & 0x8080808080808080; // upper bit
v = v & 0x7f7f7f7f7f7f7f7f; // lower bits
v = v + v8;
unsigned long long r = (v | u) & 0x8080808080808080;
if (r) { break; }
length-=8;
p++;
i+=8;
}
}
else
{
unsigned long long v8 = 255 - target;
v8 = v8 | (v8 << 8);
v8 = v8 | (v8 << 16);
v8 = v8 | (v8 << 32);
while (length > 8)
{
unsigned long long v = *p;
unsigned long long u;
u = v & 0x8080808080808080; // upper bit
v = v & 0x7f7f7f7f7f7f7f7f; // lower bits
v = v + v8;
unsigned long long r = v & u;
if (r) { break; }
length-=8;
p++;
i+=8;
}
}

// Handle trailing values
while (length > 0)
{
if (array[i] > target) { return i; }
i++;
length--;
}
return -1;
}

```

The resulting code runs about 4x faster than the original version.

## Friday Jan 30, 2015

### Finding zero values in an array

A common thing to want to do is to find zero values in an array. This is obviously necessary for string length. So we'll start out with a test harness and a simple implementation:

```#include "timing.h"

unsigned int len(char* array)
{
unsigned int length = 0;
while( array[length] ) { length++; }
return length;
}

#define COUNT 100000
void main()
{
char array[ COUNT ];
for (int i=1; i<COUNT; i++)
{
array[i-1] = 'a';
array[i] = 0;
if ( i != len(array) ) { printf( "Error at %i\n", i ); }
}
starttime();
for (int i=1; i<COUNT; i++)
{
array[i-1] = 'a';
array[i] = 0;
len(array);
}
endtime(COUNT);
}
```

A chap called Alan Mycroft came up with a very neat algorithm to simultaneously examine multiple bytes and determine whether there is a zero in them. His algorithm starts off with the idea that there are two conditions that need to be true if a byte contains the value zero. First of all the upper bit of the byte must be zero, this is true for zero and all values less than 128, so on its own it is not sufficient. The second characteristic is that if one is subtracted from the value, then the upper bit must be one. This is true for zero and all values greater than 128. Although both conditions are individually satisfied by multiple values, the only value that satisfies both conditions is zero.

The following code uses the Mycroft test for a string length implementation. The code contains a pre-loop to get to an eight byte aligned address.

```unsigned int len2(char* array)
{
unsigned int length = 0;
// Handle misaligned data
while ( ( (unsigned long long) & array[length] ) &7 )
{
if ( array[length] == 0 ) { return length; }
length++;
}

unsigned long long * p = (unsigned long long *) & array[length];
unsigned long long v8, v7;
do
{
v8 = *p;
v7 = v8 - 0x0101010101010101;
v7 = (v7 & ~v8) & 0x8080808080808080;
p++;
}
while ( !v7 );
length = (char*)p - array-8;
while ( array[length] ) { length++; }
return length;
}
```

The algorithm has one weak point. It does not always report exactly which byte is zero, just that there is a zero byte somewhere. Hence the final loop where we work out exactly which byte is zero.

It is a trivial extension to use this to search for a byte of any value. If we XOR the input vector with a vector of bytes containing the target value, then we get a zero byte where the target value occurs, and a non-zero byte everywhere else.

It is also easy to extend the code to search for other zero bit patterns. For example, if we want to find zero nibbles (ie 4 bit values), then we can change the constants to be 0x1111111111111111 and 0x8888888888888888.

## Tuesday Mar 04, 2008

### Extracting more detailed instruction frequency data from BIT

BIT has the ability to print out instruction frequency data, but there may be occasions where more detail is required. One way of getting more complete data is to ask bit to dump the disassembly for the entire application (using the option -a dis) and then write a script to parse the resulting output. The example script here extracts the integer/logical operations and reports them by type.

## Wednesday Feb 27, 2008

### Parsing branch data from BIT

I've an old article on using data from BIT to examine the behaviour of branches in code, and using that to see whether a training workload is representative of the real work the application does. The article contains a couple of scripts for parsing the output from BIT to generate the appropriate results and graphs.

I've just been helping out in looking at branch data once again, this time breaking down the branch probabilities by the condition codes of the branch. The code to do this is relatively easy, but its annoying to have to write it yourself, this is my code for breaking down branch frequencies by condition code. The output is somewhat cryptic, for each condition code it will print out columns for whether the branch was annulling (a) or not (n), and predicted taken (pt), predicted not taken (pn), or not predicted(no).

## Wednesday Nov 28, 2007

### CMT Developer Tools webcast

I've just had another webcast posted. This time it's discussing the CMT Developer Tools. The tools are add-ons for both Sun Studio 11 and Sun Studio 12. Of particular excitement was getting two of the tools ported to x64. There's a bit more information on the individual tools on my post back in July when we released them.

The CMT Developer Tools webcast covers installing and using the tools. The presentation is part slideware and part demo.

## Thursday Nov 08, 2007

### Cool Tools platform matrix

A matrix showing the platforms that the various cool tools are available on.

ToolSPARC/Solarisx86/Solarisx86/Linux
GCC for SPARC SystemsYES
Sun Studio 12YESYESYES
ATSYESYES
BITYES
DiscoverYES
SPOTYESYES
FabanYESYESYES
Sun Application Porting AssistantYESYES
Cool TunerYES
Cool StackYESYES
CooltstYESYESYES
Consolidation ToolYESYESYES
corestatYES

## Tuesday Jul 24, 2007

### CMT Developer Tools

We've just released the CMT Developer Tools for Sun Studio 12. The tools are available for both SPARC and x64 systems, although not all tools are on the x64 platform. The list of tools is as follows:

• ATS - Automatic Tuning and Troubleshooting System. Automatically finds the best compiler flags or locates optimisation bugs in an application without access to the source code. (SPARC and x64)
• SPOT - Simple Performance Optimisation Tool. Generates an html report detailing where the application is spending time, and information about why it might be spending time there.(SPARC and x64)
• BIT - Binary Improvement Tool. Reports information about application coverage also produces instruction execution counts, branch taken data etc. (SPARC only)
• Discover - Sun Memory Error Discovery Tool. Detects memory access issues such in an application. Examples are accesses to uninitialised memory, accesses beyond array bounds, memory leaks, etc. (SPARC only)

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge

##### Archives
Sun Mon Tue Wed Thu Fri Sat « July 2016 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 Today