
Recent Posts

Personal

Where does misaligned data come from?

A good question about data (mis)alignment is "Where did it come from?". So here's a reasonably detailed answer to that...

If the compiler has generated the code for you and you've not done anything "weird", then the data should be correctly aligned. So most apps don't have misaligned data, and most of the time you (as a developer) don't have to worry about it. For example, if you allocate a local variable or a global variable, then the compiler will correctly align it. If you cast the result of malloc() into a pointer to a structure, then that structure will be correctly aligned. And so on... So if the compiler is doing all this correctly, when could it ever be possible to have misaligned data? There's a bunch of situations.

But first let's quickly review the -xmemalign flag. What it actually tells the compiler to do is to assume a particular alignment (and trap behaviour) for variables where it is unsure what the alignment is. If a variable is known to be aligned, then the compiler will generate code exploiting that fact. So -xmemalign only really applies to dynamically allocated data accessed through pointers. So where does this apply? The following is not an exhaustive list:

Packed data structures. If a data structure has been declared as packed, then the compiler will squash the members together to occupy the minimum space, so the alignments may be wrong. If a structure is not packed, the compiler adds padding to ensure that members are correctly aligned.

Buffers. Suppose your application gets packets across the network. If the packet contains an integer, there's no guarantee that the integer will be placed on a four-byte boundary.

Pointers into bytes. Suppose you have a string of characters and you want to load four of them into an int - perhaps you're doing some bit-twiddling. Then you have to take care to handle strings that don't start at a four-byte boundary.

The takeaway from this should be that alignment is not something that most developers need to worry about. Most code gets the correct alignment out of the box - that's why the example program is so contrived: misalignment is the result of a developer choice, decision, or requirement. It does sometimes come up in porting, and that's why it's important to be able to diagnose when and where it happens, but most folks can get by assuming that they'll never see it! :)
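
To make the packed-structure and buffer cases concrete, here's a minimal sketch (the packet layout and field names are invented for illustration, and it assumes a compiler that honours #pragma pack):

#include <stdint.h>

#pragma pack(1)                        /* packed: members squashed together */
struct message
{
  char     type;
  uint32_t length;                     /* sits at offset 1, so it is misaligned */
};
#pragma pack()

uint32_t read_length(const unsigned char *buffer)
{
  /* Buffer case: an integer inside a received packet need not be aligned. */
  const uint32_t *p = (const uint32_t *)(buffer + 1);
  return *p;                           /* may trap or be fixed up, depending on -xmemalign */
}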


Work

Misaligned loads profiled (again)

A long time ago I described how misaligned loads appeared in profiles of 32-bit applications. Since then we've changed the level of detail presented in the Performance Analyzer. When I wrote the article, the time spent on-CPU that wasn't User time was grouped as System time. We've now started showing more detail - and more detail is good. Here's a similar bit of code:

#include <stdio.h>

static int i,j;
volatile double *d;

void main ()
{
  char a[10];
  d=(double*)&a[1];
  j=100000000;
  for (i=0; i<j; i++)
  {
    *d+=5.0;
  }
  printf("%f",*d);
}

This code stores into a misaligned double, and that's all we need in order to generate misaligned traps and see how they are shown in the Performance Analyzer. Here's the hot loop:

Load Object: a.out

Inclusive       Inclusive
User CPU Time   Trap CPU Time   Name
(sec.)          (sec.)
 1.131           0.510          [?]    10928:  inc     4, %i1
 0.              0.             [?]    1092c:  ldd     [%i2], %f2
 0.811           0.380          [?]    10930:  faddd   %f2, %f0, %f4
 0.              0.             [?]    10934:  std     %f4, [%i2]
 0.911           0.480          [?]    10938:  ldd     [%i2], %f2
 1.121           0.370          [?]    1093c:  faddd   %f2, %f0, %f4
 0.              0.             [?]    10940:  std     %f4, [%i2]
 0.761           0.410          [?]    10944:  ldd     [%i2], %f2
 0.911           0.410          [?]    10948:  faddd   %f2, %f0, %f4
 0.010           0.             [?]    1094c:  std     %f4, [%i2]
 0.941           0.450          [?]    10950:  ldd     [%i2], %f2
 1.111           0.380          [?]    10954:  faddd   %f2, %f0, %f4
 0.              0.             [?]    10958:  cmp     %i1, %i5
 0.              0.             [?]    1095c:  ble,pt  %icc,0x10928
 0.              0.             [?]    10960:  std     %f4, [%i2]

So the first thing to notice is that we're now reporting Trap time rather than aggregating it into System time. This is useful because trap time is intrinsically different from system time, so it's worth displaying it differently. Fortunately the new Overview screen highlights the trap time, so it's easy to recognise when to look for it.

Now, you should be familiar with the "previous instruction is to blame" rule for interpreting the output from the Performance Analyzer. Dealing with traps is no different: the time spent on the next instruction is due to the trap of the previous instruction. So the final load in the loop takes about 1.1s of user time and 0.38s of trap time.

A slight side track about the "blame the last instruction" rule. For misaligned accesses the problem instruction traps and its action is emulated, so the next instruction executed is the instruction following the misaligned access. That's why we see time attributed to the following instruction. However, there are situations where an instruction is retried after a trap; in those cases the next instruction executed is the instruction that caused the trap. Examples of this are TLB misses or save/restore instructions.

If we recompile the code as 64-bit and set -xmemalign=8i, then we get a different profile:

Exclusive
User CPU Time   Name
(sec.)
 3.002          <Total>
 2.882          __misalign_trap_handler
 0.070          main
 0.040          __do_misaligned_ldst_instr
 0.010          getreg
 0.             _start

For 64-bit code the misaligned operations are fixed up in user-land. One (unexpected) advantage of this is that you can take a look at the routines that call the trap handler and identify exactly where the misaligned memory operations are:

 0.480          main + 0x00000078
 0.450          main + 0x00000054
 0.410          main + 0x0000006C
 0.380          main + 0x00000060
 0.370          main + 0x00000088
 0.310          main + 0x0000005C
 0.270          main + 0x00000068
 0.260          main + 0x00000074

This is really useful if there are a number of sites and your objective is to fix them in order of performance impact.
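
Once the profile has pointed at the offending access, a common portable fix is to go through an aligned temporary rather than dereference the misaligned pointer. A minimal sketch (not from the original article, reusing the same contrived example):

#include <string.h>

/* Instead of *(double*)&a[1] += 5.0, which traps on SPARC, copy the value
   out to an aligned temporary, operate on it, and copy it back. */
static void add_to_misaligned(char *p, double increment)
{
  double tmp;
  memcpy(&tmp, p, sizeof(tmp));       /* byte copy has no alignment requirement */
  tmp += increment;
  memcpy(p, &tmp, sizeof(tmp));
}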


Personal

Profiling the kernel

One of the incredibly useful features in Studio is the ability to profile the kernel. The tool to do this is er_kernel. It's based around DTrace, so you either need to run it with escalated privileges, or you need to edit /etc/user_attr to add something like:

<username>::::defaultpriv=basic,dtrace_user,dtrace_proc,dtrace_kernel

The correct way to modify user_attr is with the command usermod:

usermod -K defaultpriv=basic,dtrace_user,dtrace_proc,dtrace_kernel <username>

There are two ways to run er_kernel. The default mode is to just profile the kernel:

$ er_kernel sleep 10
....
Creating experiment database ktest.1.er (Process ID: 7399) ...
....
$ er_print -limit 10 -func ktest.1.er
Functions sorted by metric: Exclusive Kernel CPU Time

Excl.      Incl.      Name
Kernel     Kernel
CPU sec.   CPU sec.
19.242     19.242     <Total>
14.869     14.869     <l_PID_7398>
 0.687      0.949     default_mutex_lock_delay
 0.263      0.263     mutex_enter
 0.202      0.202     <java_PID_248>
 0.162      0.162     gettick
 0.141      0.141     hv_ldc_tx_set_qtail
...

Here we passed the command sleep 10 to er_kernel; this causes it to profile for 10 seconds. It might be better form to use the equivalent command-line option -t 10.

In the profile we can see a couple of user processes together with some kernel activity. The other way to run er_kernel is to profile the kernel and the user processes. We enable this mode with the command-line option -F on:

$ er_kernel -F on sleep 10
...
Creating experiment database ktest.2.er (Process ID: 7630) ...
...
$ er_print -limit 5 -func ktest.2.er
Functions sorted by metric: Exclusive Total CPU Time

Excl.      Incl.      Excl.      Incl.      Name
Total      Total      Kernel     Kernel
CPU sec.   CPU sec.   CPU sec.   CPU sec.
15.384     15.384     16.333     16.333     <Total>
15.061     15.061      0.         0.        main
 0.061      0.061      0.         0.        ioctl
 0.051      0.141      0.         0.        dt_consume_cpu
 0.040      0.040      0.         0.        __nanosleep
...

In this case we can see all the userland activity as well as the kernel activity. The -F option is very flexible: instead of just profiling everything, we can use the -F =<regexp> syntax to specify either a PID or a process name to profile:

$ er_kernel -F =7398


Personal

Digging into microstate accounting

Solaris has support for microstate accounting. This gives huge insight into where an application and its threads are spending their time. It breaks down time into the (obvious) user and system, but also allows you to see the time spent waiting on page faults and other useful-to-know states.

This level of detail is available through the usage file in /proc/pid; there's a corresponding file for each lwp in /proc/pid/lwp/lwpid/lwpusage. You can find more details about the /proc file system in the documentation, or by reading my recent article about tracking memory use.

Here's an example of using it to report idle time, i.e. time when the process wasn't busy:

#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>
#include <fcntl.h>
#include <procfs.h>

void busy()
{
  for (int i=0; i<100000; i++)
  {
    double d = i;
    while (d>0) { d=d*0.5; }
  }
}

void lazy()
{
  sleep(10);
}

double convert(timestruc_t ts)
{
  return ts.tv_sec + ts.tv_nsec/1000000000.0;
}

void report_idle()
{
  prusage_t prusage;
  int fd;
  fd = open( "/proc/self/usage", O_RDONLY);
  if (fd == -1) { return; }
  read( fd, &prusage, sizeof(prusage) );
  close( fd );
  printf("Idle percent = %3.2f\n",
         100.0*(1.0 - (convert(prusage.pr_utime) + convert(prusage.pr_stime))
                      /convert(prusage.pr_rtime) ) );
}

void main()
{
  report_idle();
  busy();
  report_idle();
  lazy();
  report_idle();
}

The code has two functions that take time. The first does some redundant FP computation (that cannot be optimised out unless you tell the compiler to do FP optimisations); this part of the code is CPU bound. When run, the program reports low idle time for this section of the code. The second routine calls sleep(), so the program is idle at this point waiting for the sleep time to expire; hence this section is reported as having high idle time.
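
The same prusage_t structure can break the non-running time down further. Here's a sketch (the pr_tftime, pr_dftime, pr_ltime and pr_slptime fields are taken from the proc(4) description of microstate accounting - check your procfs.h before relying on them):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <procfs.h>

/* Sketch: report the main wait states as a percentage of elapsed time. */
void report_waits()
{
  prusage_t u;
  int fd = open("/proc/self/usage", O_RDONLY);
  if (fd == -1) { return; }
  read(fd, &u, sizeof(u));
  close(fd);
  double real = u.pr_rtime.tv_sec + u.pr_rtime.tv_nsec/1e9;
  printf("text faults %.2f%%, data faults %.2f%%, lock waits %.2f%%, other sleep %.2f%%\n",
         100.0*(u.pr_tftime.tv_sec  + u.pr_tftime.tv_nsec/1e9)/real,
         100.0*(u.pr_dftime.tv_sec  + u.pr_dftime.tv_nsec/1e9)/real,
         100.0*(u.pr_ltime.tv_sec   + u.pr_ltime.tv_nsec/1e9)/real,
         100.0*(u.pr_slptime.tv_sec + u.pr_slptime.tv_nsec/1e9)/real);
}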


Personal

Namespaces in C++

A porting problem I hit with regularity is using functions in the standard namespace. Fortunately, it's a relatively easy problem to diagnose and fix. But it is very common, and it's worth discussing how it happens.

C++ namespaces are a very useful feature that allows an application to use identical names for symbols in different contexts. Here's an example where we define two namespaces and place identically named functions in the two of them:

#include <iostream>

namespace ns1
{
  void hello() { std::cout << "Hello ns1\n"; }
}

namespace ns2
{
  void hello() { std::cout << "Hello ns2\n"; }
}

int main()
{
  ns1::hello();
  ns2::hello();
}

The construct namespace optional_name is used to introduce a namespace. In this example we have introduced two namespaces, ns1 and ns2. Both namespaces have a routine called hello, but both routines can happily co-exist because they exist in different namespaces. Behind the scenes, the namespace becomes part of the symbol name:

$ nm a.out|grep hello
[63] | 68640| 44|FUNC |GLOB |0 |11 |__1cDns1Fhello6F_v_
[56] | 68704| 44|FUNC |GLOB |0 |11 |__1cDns2Fhello6F_v_

When it comes to using functions declared in namespaces we can prepend the namespace to the name of the symbol; this uniquely identifies the symbol. You can see this in the example, where the calls to hello() from the different namespaces are prefixed with the namespace.

However, prefixing every function call with its namespace can rapidly become very tedious, so there is a way to make this easier. First of all, let's quickly discuss the global namespace. The global namespace is the namespace that is searched if you do not specify a namespace - kind of the default namespace. If you declare a function foo() in your code, then it naturally resides in the global namespace.

We can add symbols from other namespaces into the global namespace using the using keyword. There are two ways we can do this. One way is to add the entire namespace into the global namespace; the other is to import symbols individually. To do this, write using namespace <namespace>; to import the entire namespace into the global namespace, or using <namespace>::<function>; to import just a single function into the global namespace. Here's the earlier example modified to show both approaches:

#include <iostream>

namespace ns1
{
  void hello() { std::cout << "Hello ns1\n"; }
}

namespace ns2
{
  void hello() { std::cout << "Hello ns2\n"; }
}

int main()
{
  {
    using namespace ns1;
    hello();
  }
  {
    using ns2::hello;
    hello();
  }
}

The other thing you will notice in the example is the use of std::cout. Notice that this is prefixed with the std:: namespace. This is an example of a situation where you might encounter porting problems.

The C++03 standard (17.4.1.1) says this about the C++ Standard Library: "All library entities except macros, operator new and operator delete are defined within the namespace std or namespaces nested within the namespace std." This means that, according to the standard, if you include iostream then cout will be defined in the std namespace. That's the only place you can rely on it being available.

Now, sometimes you might find that a function which is in the std namespace is also available in the global namespace. For example, gcc puts all the functions that are in the std namespace into the global namespace as well. Other times, you might include a header file which has already imported an entire namespace, or particular symbols from a namespace. This can happen if you change the Standard Library that you are using and the new header files contain a different set of includes and using statements.

There's one other area where you can encounter this, and that is using C library routines. All the C header files have a C++ counterpart; for example, stdio.h has the counterpart cstdio. One difference between the two headers is the namespace where the routines are placed. If the C headers are used, then the symbols are placed into the global namespace; if the C++ headers are used, the symbols are placed into the std namespace. This behaviour is defined by section D.5 of the C++03 standard. Here's an example where we use both the C and C++ header files, and need to specify the namespace for the functions from the C++ header file:

#include <cstdio>
#include <string.h>

int main()
{
  char string[100];
  strcpy( string, "Hello" );        // strcpy comes from the C header, global namespace
  std::printf( "%s\n", string );    // printf comes from <cstdio>, namespace std
}


Personal

Bit manipulation: Gathering bits

In the last post on bit manipulation we looked at how we could identify bytes that were greater than a particular target value, and stop when we discovered one. The resulting vector of bytes contained a zero byte for those which did not meet the criteria, and a byte containing 0x80 for those that did. Obviously we could express the result much more efficiently if we assigned a single bit for each result. The following is "lightly" optimised code for producing a bit vector indicating the positions of zero bytes:

void zeros( unsigned char * array, int length, unsigned char * result )
{
  for (int i=0; i<length; i+=8)
  {
    result[i>>3] =
      ( (array[i+0]==0) << 7) +
      ( (array[i+1]==0) << 6) +
      ( (array[i+2]==0) << 5) +
      ( (array[i+3]==0) << 4) +
      ( (array[i+4]==0) << 3) +
      ( (array[i+5]==0) << 2) +
      ( (array[i+6]==0) << 1) +
      ( (array[i+7]==0) << 0);
  }
}

The code is "lightly" optimised because it works on eight values at a time. This helps performance because the code can store results a byte at a time. An even less optimised version would split the index into a byte and bit offset and use that to update the result vector.

When we previously looked at finding zero bytes we used Mycroft's algorithm, which determines whether a zero byte is present or not. It does not indicate where the zero byte is to be found. For this new problem we want to identify exactly which bytes contain zero. So we can come up with two rules that both need to be true:

The inverted byte must have a set upper bit.

If we invert the byte and select the lower bits, adding one to these must set the upper bit.

Putting these into a logical operation we get (~byte & ( (~byte & 0x7f) + 1) & 0x80). For non-zero input bytes we get a result of zero; for zero input bytes we get a result of 0x80. Next we need to convert these into a bit vector.

If you recall the population count example from earlier, we used a set of operations to combine adjacent bits. In this case we want to do something similar, but instead of adding bits we want to shift them so that they end up in the right places. The code to perform the comparison and shift the results is:

void zeros2( unsigned long long* array, int length, unsigned char* result )
{
  for (int i=0; i<length; i+=8)
  {
    unsigned long long v, u;
    v = array[ i>>3 ];
    u = ~v;
    u = u & 0x7f7f7f7f7f7f7f7f;
    u = u + 0x0101010101010101;
    v = u & (~v);
    v = v & 0x8080808080808080;
    v = v | (v << 7);
    v = v | (v << 14);
    v = (v >> 56) | (v >> 28);
    result[ i>>3 ] = v;
  }
}

The resulting code runs about four times faster than the original.

Concluding remarks

So that ends this brief series on bit manipulation; I hope you've found it interesting. If you want to investigate this further there are plenty of resources on the web, but it would be hard to skip mentioning the book "Hacker's Delight", which is a great read on this domain.

There's a couple of concluding thoughts. First of all, performance comes from doing operations on multiple items of data in the same instruction. This should sound familiar as "SIMD", and a processor will often have vector instructions that already get the benefits of single instruction, multiple data - a single SIMD instruction might replace several of the integer operations in the codes above. The other place the performance comes from is eliminating branch instructions - particularly the unpredictable ones; again, vector instructions might offer a similar benefit.


Personal

Bit manipulation: finding a range of values

We previously looked at finding zero values in an array. A similar problem is to find a value larger than some target. The vanilla code for this is pretty simple:

#include "timing.h"

int range(char * array, unsigned int length, unsigned char target)
{
  for (unsigned int i=0; i<length; i++)
  {
    if (array[i]>target) { return i; }
  }
  return -1;
}

It's possible to recode this to use bit operations, but there is a small complication. We need two versions of the routine depending on whether the target value is >127 or not. Let's start with the target greater than 127. There are two rules to finding bytes greater than this target:

The upper bit is set in the target value, which means that the upper bit must also be set in the bytes we examine. So we can AND the input value with 0x80, and this must be 0x80.

We want a bit more precision than testing the upper bit; we need to know that the value is greater than the target value. If we clear the upper bit we get a number between 0 and 127. This is equivalent to subtracting 128 off all the bytes that have a value greater than 127. So instead of asking "is 132 greater than 192?" we can do the equivalent check "is (132-128) greater than (192-128)?", or "is 4 greater than 64?". However, we want bytes where this is true to end up with their upper bit set. So we can do an ADD operation where we add sufficient to each byte to cause the result to be greater than 128 for the bytes with a value greater than the target. The operation for this is ( byte & 0x7f ) + (255-target).

The second condition is hard to understand, so consider an example where we are searching for values greater than 192. We have an input of 132. The first of the two conditions produces 132 & 0x80 = 0x80. For the second condition we compute (132 & 0x7f) + (255-192) = 4 + 63 = 67, so the second condition does not produce a value with the upper bit set. Trying again with an input of 193 we get 65 + 63 = 128, so the upper bit is set, and we get a result of 0x80 indicating that the byte is selected.

The full operation is (byte & ( (byte & 0x7f) + (255 - target) ) & 0x80).

If the target value is less than 128 we perform a similar set of operations. In this case, if the upper bit is set then the byte is automatically greater than the target value. If the upper bit is not set we have to add sufficient on to cause the upper bit to be set by any value that meets the criteria. The operation looks like (byte | ( (byte & 0x7f) + (127 - target) ) & 0x80).

Putting all this together we get the following code:

int range2( unsigned char* array, unsigned int length, unsigned char target )
{
  unsigned int i = 0;

  // Handle misalignment
  while ( (length > 0) && ( (unsigned long long) &array[i] & 7) )
  {
    if ( array[i] > target ) { return i; }
    i++;
    length--;
  }

  // Optimised code
  unsigned long long * p = (unsigned long long*) &array[i];
  if (target < 128)
  {
    unsigned long long v8 = 127 - target;
    v8 = v8 | (v8 << 8);
    v8 = v8 | (v8 << 16);
    v8 = v8 | (v8 << 32);
    while (length > 8)
    {
      unsigned long long v = *p;
      unsigned long long u;
      u = v & 0x8080808080808080;   // upper bit
      v = v & 0x7f7f7f7f7f7f7f7f;   // lower bits
      v = v + v8;
      unsigned long long r = (v | u) & 0x8080808080808080;
      if (r) { break; }
      length-=8;
      p++;
      i+=8;
    }
  }
  else
  {
    unsigned long long v8 = 255 - target;
    v8 = v8 | (v8 << 8);
    v8 = v8 | (v8 << 16);
    v8 = v8 | (v8 << 32);
    while (length > 8)
    {
      unsigned long long v = *p;
      unsigned long long u;
      u = v & 0x8080808080808080;   // upper bit
      v = v & 0x7f7f7f7f7f7f7f7f;   // lower bits
      v = v + v8;
      unsigned long long r = v & u;
      if (r) { break; }
      length-=8;
      p++;
      i+=8;
    }
  }

  // Handle trailing values
  while (length > 0)
  {
    if (array[i] > target) { return i; }
    i++;
    length--;
  }
  return -1;
}

The resulting code runs about 4x faster than the original version.


Personal

Finding zero values in an array

A common thing to want to do is to find zero values in an array. This is obviously necessary for string length. So we'll start out with a test harness and a simple implementation:

#include "timing.h"

unsigned int len(char* array)
{
  unsigned int length = 0;
  while( array[length] ) { length++; }
  return length;
}

#define COUNT 100000

void main()
{
  char array[ COUNT ];
  for (int i=1; i<COUNT; i++)
  {
    array[i-1] = 'a';
    array[i] = 0;
    if ( i != len(array) ) { printf( "Error at %i\n", i ); }
  }
  starttime();
  for (int i=1; i<COUNT; i++)
  {
    array[i-1] = 'a';
    array[i] = 0;
    len(array);
  }
  endtime(COUNT);
}

A chap called Alan Mycroft came up with a very neat algorithm to simultaneously examine multiple bytes and determine whether there is a zero in them. His algorithm starts off with the idea that there are two conditions that need to be true if a byte contains the value zero. First of all, the upper bit of the byte must be zero; this is true for zero and all values less than 128, so on its own it is not sufficient. The second characteristic is that if one is subtracted from the value, then the upper bit must be one. This is true for zero and all values greater than 128. Although both conditions are individually satisfied by multiple values, the only value that satisfies both conditions is zero.

The following code uses the Mycroft test for a string length implementation. The code contains a pre-loop to get to an eight-byte aligned address.

unsigned int len2(char* array)
{
  unsigned int length = 0;

  // Handle misaligned data
  while ( ( (unsigned long long) &array[length] ) & 7 )
  {
    if ( array[length] == 0 ) { return length; }
    length++;
  }

  unsigned long long * p = (unsigned long long *) &array[length];
  unsigned long long v8, v7;
  do
  {
    v8 = *p;
    v7 = v8 - 0x0101010101010101;
    v7 = (v7 & ~v8) & 0x8080808080808080;
    p++;
  } while ( !v7 );

  length = (char*)p - array - 8;
  while ( array[length] ) { length++; }
  return length;
}

The algorithm has one weak point. It does not always report exactly which byte is zero, just that there is a zero byte somewhere. Hence the final loop, where we work out exactly which byte is zero.

It is a trivial extension to use this to search for a byte of any value. If we XOR the input vector with a vector of bytes containing the target value, then we get a zero byte where the target value occurs, and a non-zero byte everywhere else.

It is also easy to extend the code to search for other zero bit patterns. For example, if we want to find zero nibbles (i.e. 4-bit values), then we can change the constants to 0x1111111111111111 and 0x8888888888888888.
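
As a sketch of that XOR extension (this block is not from the original post): to search for an arbitrary byte value, replicate the target into every byte of a word and XOR it with the input before applying Mycroft's test.

// Return non-zero if the 8-byte word v contains the byte 'target'.
static unsigned long long contains_byte(unsigned long long v, unsigned char target)
{
  unsigned long long t = target;
  t |= t << 8;
  t |= t << 16;
  t |= t << 32;                     // target byte replicated into every byte
  unsigned long long x = v ^ t;     // matching bytes become zero
  return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
}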


Personal

Bit manipulation: Population Count

Population count is one of the more esoteric instructions. It's the operation to count the number of set bits in a register. It comes up with sufficient frequency that most processors have a hardware instruction to do it. However, for this example, we're going to look at coding it in software. First of all we'll write a baseline version of the code:

int popc(unsigned long long value)
{
  unsigned long long bit = 1;
  int popc = 0;
  while ( bit )
  {
    if ( value & bit ) { popc++; }
    bit = bit << 1;
  }
  return popc;
}

The above code examines every bit in the input and counts the number of set bits. The number of iterations is proportional to the number of bits in the register.

Most people will immediately recognise that we could make this a bit faster using the code we discussed previously that clears the last set bit: while there are set bits, keep clearing them; otherwise you're done. The advantage of this approach is that you only iterate once for every set bit in the value. So if there are no set bits, then you do not do any iterations.

int popc2( unsigned long long value )
{
  int popc = 0;
  while ( value )
  {
    popc++;
    value = value & (value-1);
  }
  return popc;
}

The next thing to do is to put together a test harness that confirms that the new code produces the same results as the old code, and also measures the performance of the two implementations.

#define COUNT 1000000

void main()
{
  // Correctness test
  for (unsigned long long i = 0; i<COUNT; i++ )
  {
    if (popc( i + (i<<32) ) != popc2( i + (i<<32) ) )
    {
      printf(" Mismatch popc2 input %llx: %u != %u\n",
             i+(i<<32), popc(i+(i<<32)), popc2(i+(i<<32)));
    }
  }

  // Performance test
  starttime();
  for (unsigned long long i = 0; i<COUNT; i++ )
  {
    popc(i+(i<<32));
  }
  endtime(COUNT);

  starttime();
  for (unsigned long long i = 0; i<COUNT; i++ )
  {
    popc2(i+(i<<32));
  }
  endtime(COUNT);
}

The new code is about twice as fast as the old code. However, the new code still contains a loop, and this can be a bit of a problem.

Branch mispredictions

The trouble with loops, and with branches in general, is that processors don't know the next instruction that will be executed after the branch until the branch has been reached, but the processor needs to have already fetched the instructions after the branch well before this. The problem is nicely summarised by Holly in Red Dwarf:

"Look, I'm trying to navigate at faster than the speed of light, which means that before you see something, you've already passed through it."

So processors use branch prediction to guess whether a branch is taken or not. If the prediction is correct there is no break in the instruction stream; but if the prediction is wrong, then the processor needs to throw away all the incorrectly predicted instructions and fetch the instructions from the correct address. This is a significant cost, so ideally you don't want mispredicted branches, and the best way of ensuring that is to not have branches at all!

The following code is a branchless sequence for computing population count:

unsigned int popc3(unsigned long long value)
{
  unsigned long long v2;
  v2 = value >> 1;
  v2 &= 0x5555555555555555;
  value &= 0x5555555555555555;
  value += v2;
  v2 = value >> 2;
  v2 &= 0x3333333333333333;
  value &= 0x3333333333333333;
  value += v2;
  v2 = value >> 4;
  v2 &= 0x0f0f0f0f0f0f0f0f;
  value &= 0x0f0f0f0f0f0f0f0f;
  value += v2;
  v2 = value >> 8;
  v2 &= 0x00ff00ff00ff00ff;
  value &= 0x00ff00ff00ff00ff;
  value += v2;
  v2 = value >> 16;
  v2 &= 0x0000ffff0000ffff;
  value &= 0x0000ffff0000ffff;
  value += v2;
  v2 = value >> 32;
  value += v2;
  return (unsigned int) value;
}

This instruction sequence computes the population count by initially adding adjacent bits to get a two-bit result of 0, 1, or 2. It then adds the adjacent pairs of bits to get a 4-bit result of between 0 and 4. Next it adds adjacent nibbles to get a byte result, then adds pairs of bytes to get shorts, then adds shorts to get a pair of ints, which it adds to get the final value. The code contains a fair amount of AND operations to mask out the bits that are not part of the result.

This bit manipulation version is about two times faster than the clear-last-set-bit version, making it about four times faster than the original code. However, it is worth noting that this is a fixed cost: the routine takes the same amount of time regardless of the input value. In contrast, the clear-last-set-bit version will exit early if there are no set bits. Consequently the performance gain for the code will depend on both the input value and the cost of mispredicted branches.
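
For comparison with the hardware instruction mentioned at the start of the post, compilers that provide the builtin (gcc and clang do - this is an assumption about your toolchain, not something from the original text) let you reach it without writing any of the above:

#include <stdio.h>

int main(void)
{
  unsigned long long v = 0xf0f0f0f0f0f0f0f0ULL;
  /* __builtin_popcountll maps to a population-count instruction where the
     hardware has one, and falls back to a software sequence otherwise.  */
  printf("%d\n", __builtin_popcountll(v));   /* prints 32 */
  return 0;
}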


Personal

Inline functions in C

Functions declared as inline are slightly more complex than might be expected. Douglas Walls has provided a chapter-and-verse write-up. But the issue bears further explanation.

When a function is declared as inline, it's a hint to the compiler that the function could be inlined. It is not a command to the compiler that the function must be inlined. If the compiler chooses not to inline the function, then the function will be left with a function call that needs to be resolved, and at link time it will be necessary for a function definition to be provided. Consider this example:

#include <stdio.h>

inline void foo() { printf(" In foo\n"); }

void main()
{
  foo();
}

The code provides an inline definition of foo(), so if the compiler chooses to inline the function it will use this definition. However, if it chooses not to inline the function, you will get a link-time error when the linker is unable to find a suitable definition for the symbol foo:

$ cc -o in in.c
Undefined                       first referenced
 symbol                             in file
 foo                                in.o
ld: fatal: symbol referencing errors. No output written to in

This can be worked around by adding either "static" or "extern" to the definition of the inline function.

If the function is declared as static inline then, as before, the compiler may choose to inline the function. In addition, the compiler will emit a locally scoped version of the function in the object file. There can be one static version per object file, so you may end up with multiple copies of the same function, which can be very space inefficient. But since all the copies are locally scoped, there are no multiple-definition errors.

Another approach is to declare the function as extern inline. In this case the compiler may generate inline code, and will also generate a global instance of the function. Although global instances of the function might be generated in all the object files, only one will remain in the executable after linking. So declaring functions as extern inline is more space efficient.

This behaviour is defined by the standard. However, gcc takes a different approach: it treats a plain inline function by generating a global function and potentially inlining the code. Unfortunately this can cause multiply-defined symbol errors at link time, where the same inline function is defined in multiple files. For example, in the following output both in.c and in2.c include in.h, which contains the definition of inline foo():

$ gcc -o in in.c in2.c
ld: fatal: symbol 'foo' is multiply-defined:

The gcc behaviour for functions declared as extern inline is also different. It does not emit an external definition for these functions, which leads to unresolved symbol errors at link time if no out-of-line definition is provided elsewhere.

For gcc, it is best either to declare the functions as extern inline and, in an additional module, provide a global definition of the function, or to declare the functions as static inline and live with the multiple local copies that this produces.

So for convenience it is tempting to use static inline for all compilers. This is a good workaround (ignoring the issue of duplicate local copies of functions), except for an issue around unused code.

The keyword static tells the compiler to emit a locally-scoped version of the function. Solaris Studio emits that function even if the function does not get called within the module. If that function calls a non-existent function, then you may get a link-time error. Suppose you have the following code:

void error_message();

static inline unused() { error_message(); }

void main()
{
}

Compiling this we get the following error message:

$ cc -O i.c
"i.c", line 3: warning: no explicit type given
Undefined                       first referenced
 symbol                             in file
 error_message                      i.o

Even though the function call exists in code that is not used, there is a link error for the undefined function error_message(). The same error would occur if extern inline were used, as this would cause a global version of the function to be emitted. The problem would not occur if the function were just declared as inline, because in this case the compiler would not emit either a global or local version of the function. The same code compiles with gcc because the unused function is not generated.

So to summarise the options:

Declare everything static inline, and ensure that there are no undefined functions, and that there are no functions that call undefined functions.

Declare everything inline for Studio and extern inline for gcc. Then provide a global version of each function in a separate file.
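
As a sketch of the second option (the file and function names here are invented, and this assumes the traditional gnu89 "extern inline" behaviour of gcc described above - recent gcc releases need -fgnu89-inline or -std=gnu89 to get it):

/* twice.h - inline definition visible to every module */
#ifndef TWICE_H
#define TWICE_H
#ifdef __GNUC__
#define MY_INLINE extern inline  /* gcc (gnu89): may inline, emits no definition */
#else
#define MY_INLINE inline         /* Studio: may inline, emits no definition */
#endif
MY_INLINE int twice(int x) { return x + x; }
#endif

/* twice_def.c - compiled once; provides the single global definition
   that the linker needs for any calls that were not inlined.         */
int twice(int x) { return x + x; }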


Personal

Improving performance through bit manipulation: clear last set bit

Bit manipulation is one of those fun areas where you can get a performance gain from recoding a routine to use logical or arithmetic instructions rather than more straightforward code.

Of course, in doing this you need to avoid the pitfall of premature optimisation - where you needlessly make the code more obscure with no benefit, or a benefit that disappears as soon as you run your code on a different machine. So with that caveat in mind, let's take a look at a simple example.

Clear last set bit

This is a great starting point because it nicely demonstrates how we can sometimes replace a fair chunk of code with a much simpler set of instructions. Of course, the algorithm that uses fewer instructions is harder to understand, but in some situations the performance gain is worth it.

We'll start off with some classic code to solve the problem. The reason for this is two-fold. First of all, we want to clearly understand the problem we're solving. Secondly, we want a reference version that we can test against to ensure that our fiendishly clever code is actually correct. So here's our starting point:

unsigned long long clearlastbit( unsigned long long value )
{
  unsigned long long bit = 1;
  if ( value == 0 ) { return 0; }
  while ( !(value & bit) )
  {
    bit = bit << 1;
  }
  value = value ^ bit;
  return value;
}

But before we start trying to improve it we need a timing harness to find out how fast it runs. The following harness uses the Solaris call gethrtime() to return a timestamp in nanoseconds.

#include <stdio.h>
#include <sys/time.h>

static double s_time;

void starttime()
{
  s_time = 1.0 * gethrtime();
}

void endtime(unsigned long long its)
{
  double e_time = 1.0 * gethrtime();
  printf( "Time per iteration %5.2f ns\n", (e_time-s_time) / (1.0*its) );
  s_time = 1.0 * gethrtime();
}

The next thing we need is a workload to test the current implementation. The workload iterates over a range of numbers and repeatedly calls clearlastbit() until all the bits in the current number have been cleared.

#define COUNT 1000000

void main()
{
  starttime();
  for (unsigned long long i = 0; i < COUNT; i++ )
  {
    unsigned long long value = i;
    while (value) { value = clearlastbit(value); }
  }
  endtime(COUNT);
}

Big O notation

So let's take a break at this point to discuss big O notation. If we look at the code for clearlastbit() we can see that it contains a loop. We'll iterate around the loop once for each bit in the input value, so for an N-bit number we might iterate up to N times. We say that this computation is "order N", meaning that the cost of the calculation is somehow proportional to the number of bits in the input number. This is written as O(N).

The order N description is useful because it gives us some idea of the cost of calling the routine. From it we know that the routine will typically take twice as long if coded for 8-byte inputs than if we used 4-byte inputs. Order N is not too bad as costs go; the ones to look out for are order N squared, or cubed, etc. For these higher orders the run time to complete a computation can become huge for even comparatively small values of N.

If we look at the test harness, we are iterating over the function COUNT times, so effectively the entire program is O(COUNT*N), and we're exploiting the fact that this is effectively an O(N^2) cost to provide a workload that has a reasonable duration.

So let's return to the problem of clearing the last set bit. One obvious optimisation would be to record the last bit that was cleared, and then start the next iteration of the loop from that point. This is potentially a nice gain, but it does not fundamentally change the algorithm. A better approach is to take advantage of bit manipulation so that we can avoid the loop altogether.

unsigned long long clearlastbit2( unsigned long long value )
{
  return ( value & (value-1) );
}

Ok, if you look at this code it is not immediately apparent what it does - most people would at first sight say "How can that possibly do anything useful?". The easiest way to understand it is to take an example. Suppose we pass the value ten into this function. In binary, ten is encoded as 1010b. The first operation is the subtraction, which produces the result of nine, encoded as 1001b. We then take the AND of these two to get the result of 1000b, or eight. We've cleared the last set bit because the subtract either removed the one bit (if it was set) or broke down the next largest set bit. The AND operation just keeps the bits to the left of the last set bit.

What is interesting about this snippet of code is that it is just three instructions. There's no loop and no branches - so most processors can execute this code very quickly. To demonstrate how much faster this code is, we need a test harness. The test harness should have two parts to it. The first part needs to validate that the new code produces the same result as the existing code. The second part needs to time the old and new code.

#define COUNT 1000000

void main()
{
  // Correctness test
  for (unsigned long long i = 0; i<COUNT; i++ )
  {
    unsigned long long value = i;
    while (value)
    {
      unsigned long long v2 = value;
      value = clearlastbit(value);
      if (value != clearlastbit2(v2))
      {
        printf(" Mismatch input %llx: %llx != %llx\n", v2, value, clearlastbit2(v2));
      }
    }
  }

  // Performance test
  starttime();
  for (unsigned long long i = 0; i<COUNT; i++ )
  {
    unsigned long long value = i;
    while (value) { value = clearlastbit(value); }
  }
  endtime(COUNT);

  starttime();
  for (unsigned long long i = 0; i<COUNT; i++ )
  {
    unsigned long long value = i;
    while (value) { value = clearlastbit2(value); }
  }
  endtime(COUNT);
}

The final result is that the bit manipulation version of this code is about 3x faster than the original code - on this workload. Of course, one of the interesting things is that the performance does depend on the input values. For example, if there are no set bits, then both versions will run in about the same amount of time.


Personal

Behaviour of std::list::splice in the 2003 and 2011 C++ standards

There's an interesting corner case in the behaviour of std::list::splice. In the C++98/C++03 standards it is defined such that iterators referring to the spliced element(s) are invalidated. This behaviour changes in the C++11 standard, where the iterators remain valid.

The text of the 2003 standard (section 23.2.2.4, p2, p7, p12) describes the splice operation as "destructively" moving elements from one list to another. If one list is spliced into another, then all iterators and references to that list become invalid. If an element is spliced into a list, then any iterators and references to that element become invalid; similarly, if a range of elements is spliced, then iterators and references to those elements become invalid.

This is changed in the 2011 standard (section 23.3.5.5, p2, p4, p7, p12), where the operation is still described as being destructive, but all the iterators and references to the spliced element(s) remain valid.

The following code demonstrates the problem:

#include <list>
#include <iostream>

int main()
{
  std::list<int> list;
  std::list<int>::iterator i;

  i=list.begin();
  list.insert(i,5);
  list.insert(i,10);
  list.insert(i,3);
  list.insert(i,4);
  // i points to end
  // list contains 5 10 3 4

  i--; // i points to 4
  i--; // i points to 3
  i--; // i points to 10

  std::cout << " List contains: ";
  for (std::list<int>::iterator l=list.begin(); l!=list.end(); l++)
  {
    std::cout << " >" << *l << "< ";
  }
  std::cout << "\n element at i = " << *i << "\n";

  std::list<int>::iterator element;
  element = list.begin();
  element++; // points to 10
  element++; // points to 3
  std::cout << " element at element = " << *element << "\n";

  list.splice(i,list,element); // Swap 10 and 3

  std::cout << " List contains :";
  for (std::list<int>::iterator l=list.begin(); l!=list.end(); l++)
  {
    std::cout << " >" << *l << "< ";
  }
  std::cout << "\n element at element = " << *element << '\n';
  element++; // C++03: access to an invalid iterator
  std::cout << " element at element = " << *element << '\n';
}

When compiled to the 2011 standard the code is expected to work and produce output like:

 List contains:  >5<  >10<  >3<  >4<
 element at i = 10
 element at element = 3
 List contains : >5<  >3<  >10<  >4<
 element at element = 3
 element at element = 10

However, the behaviour when compiled to the 2003 standard is indeterminate. It might work - if the iterator happens to remain valid - but it could also fail:

 List contains:  >5<  >10<  >3<  >4<
 element at i = 10
 element at element = 3
 List contains : >5<  >3<  >10<  >4<
 element at element = 3
 element at element = Segmentation Fault (core dumped)


Personal

New articles about Solaris Studio

We've started posting new articles directly into the communities section of the Oracle website. If you're not familiar with this location, it's also where you can post questions on languages or tools. With the change it should be easier to find articles relevant to developers, and it should be easy to comment on them. So hopefully this works out well.

There are currently three articles listed on the content page. I've already posted about the article on the Performance Analyzer Overview screen, so I'll quickly highlight the other two:

In Studio 12.4 we've introduced finer control over debug information. This allows you to reduce object file size by excluding debug info that you don't need. There's substantial control, but probably the easiest new option is -g1, which includes a minimal set of debug info.

A change in Studio 12.4 is in the default way that C++ handles templates. The summary is that the compiler's default mode is now in line with the way that other compilers work: you need to include template definitions in the source file being compiled. Previously the compiler would try to find the definitions in external files. This old behaviour could be confusing, so it's good that the default has changed. But it's possible that you may encounter code that was written with the expectation that the Studio compiler behaved in the old way; in this case you'll need to modify the source, or tell the compiler to revert to the older behaviour. Hopefully most people won't even notice the change, but it's worth knowing the signs in case you encounter a problem.


Work

Writing inline templates

Writing some inline templates today... I've written about doing this kind of stuff in the past here and, in more detail, here.

I happen to need to pass a bundle of parameters on to the routine. The best way of checking how the parameters will be passed is to get the compiler to provide some initial template. Here's an example routine:

int parameters (int p0, int * p1, int * p2, int * p3, int * p4, int * p5, int * p6, int p7)
{
  return p0 + *p1 + *p2 + *p3 + *p4 + ((*p5)<<2) + ((*p6)<<3) + p7*p7;
}

In the routine I've tried to handle some of the parameters differently. I know that the first parameters get passed in registers, and then the later ones get passed on the stack. By handling them differently I can work out which loads from the stack correspond to which variables. The disassembly looks like:

-bash-4.1$ cc -g -O parameters.c -c
-bash-4.1$ dis -F parameters parameters.o
disassembly for parameters.o

parameters()
    parameters:         ca 02 60 00  ld      [%o1], %g5
    parameters+0x4:     c4 02 e0 00  ld      [%o3], %g2
    parameters+0x8:     c2 02 a0 00  ld      [%o2], %g1
    parameters+0xc:     c6 03 a0 60  ld      [%sp + 0x60], %g3    // load of p7
    parameters+0x10:    88 02 00 05  add     %o0, %g5, %g4
    parameters+0x14:    d0 03 60 00  ld      [%o5], %o0
    parameters+0x18:    ca 03 20 00  ld      [%o4], %g5
    parameters+0x1c:    92 00 80 01  add     %g2, %g1, %o1
    parameters+0x20:    87 38 e0 00  sra     %g3, 0x0, %g3
    parameters+0x24:    82 01 00 09  add     %g4, %o1, %g1
    parameters+0x28:    d2 03 a0 5c  ld      [%sp + 0x5c], %o1    // load of p6
    parameters+0x2c:    88 48 c0 03  mulx    %g3, %g3, %g4        // %g4 = %g3*%g3
    parameters+0x30:    97 2a 20 02  sll     %o0, 0x2, %o3
    parameters+0x34:    94 00 40 05  add     %g1, %g5, %o2
    parameters+0x38:    da 02 60 00  ld      [%o1], %o5
    parameters+0x3c:    84 02 c0 0a  add     %o3, %o2, %g2
    parameters+0x40:    99 2b 60 03  sll     %o5, 0x3, %o4        // %o4 = %o5


Personal

New Performance Analyzer Overview screen

I love using the Performance Analyzer, but the question I often get when I show it to people is "Where do I start?". So one of the improvements in Solaris Studio 12.4 is an Overview screen to help people get started with the tool. Here's what it looks like:

The reason this is important is that many applications spend time in various places - like waiting on disk, or in user locks - and it's not always obvious where the most effective place to look for performance gains is going to be.

The Overview screen is meant to be the "one-stop" place where people can find out what their application is doing. When we put it back into the product I expected it to be the screen that I glanced at and then never went back to. I was most surprised when this turned out not to be the case.

During performance analysis, I'm often exploring different ideas as to where it might be possible to get performance improvements. The Overview screen allows me to select the metrics that I'm interested in, then take a look at the resulting profiles. So I might start with system time, and just enable the system time metrics. Once I'm done with that, I might move on to user time, and select those metrics. So what was surprising about the Overview screen was how often I returned to it to change the metrics I was using.

So what does the screen contain? The overview shows all the available metrics. The bars indicate which metrics contribute the most time, so it's easy to pick (and explore) the metrics that contribute the most time.

If the profile contains performance counter metrics, then those also appear. If the counters include instructions and cycles, then the synthetic CPI/IPC metrics are also available. The Overview screen is really useful for hardware counter metrics.

I use performance counters in a couple of ways: to confirm a hypothesis about performance, or to estimate time spent on a type of event. For example, if I think a load is taking a lot of time due to TLB misses, then profiling on the TLB miss performance counter will tell me whether that load has a lot of misses or not. Alternatively, if I've got TLB miss counter data, then I can scale this by the cost per TLB miss, and get an estimate of the total runtime lost to TLB misses.

Where the Overview screen comes into this is that I will often want to minimise the number of columns of data that are shown (to fit everything onto my monitor), but sometimes I want to quickly enable a counter to see whether that event happens at the bit of code where I'm looking. Hence I end up flipping to the Overview screen and then returning to the code.

So what I thought would be a nice feature actually became pretty central to my work-flow. I should have a more detailed paper about the Overview screen up on OTN soon.
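
As a made-up illustration of the scaling estimate described above (the counts and the per-miss cost are invented numbers, not measurements): if the TLB-miss counter attributes 2 x 10^8 misses to a routine and a miss costs roughly 100ns on the machine, that scales to about 2 x 10^8 * 100ns = 20 seconds of runtime potentially recoverable by reducing TLB misses - a number you can then compare against the routine's measured time to decide whether it's worth chasing.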


Work

Performance made easy

The big news of the day is that Oracle Solaris Studio 12.4 is available for download. I'd like to thank all those people who tried out the beta releases and gave us feedback.

There are a number of things that are new in this release. The most obvious one is C++11 support; I've written a bit about the lambda expression support, tuples, and unordered containers.

My favourite tool, the Performance Analyzer, has also had a bit of a facelift. I'll talk about the Overview screen in a separate post (and in an article), but there are some other fantastic features. The syntax highlighting, and hyperlinking, has made navigating profiles much easier. There's been a large number of improvements in filtering - a feature that's been in the product a long time, but these changes elevate it to being much more accessible (an article on filtering is long overdue!). There's also the default hardware counters - which make it a no-brainer to get hardware counter data, which is really helpful in understanding exactly what an application is doing.

Over the development cycle I've made much use of the other tools. The Thread Analyzer for identifying data races has had some improvements. The Code Analyzer tools have made some great gains in rapidly identifying potential coding errors. And so on...

Anyway, please download the new version, try it out, try out the tools, and let us know what you think of it.


Work

Comparing constant duration profiles

I was putting together my slides for Open World, and in one of them I'm showing profile data from a server-style workload, i.e. one that keeps running until stopped. In this case the profile can be an arbitrary duration, and it's the work done in that time which is the important metric, not the total amount of time taken.

Profiling for a constant duration is a slightly unusual situation. We normally profile a workload that takes N seconds, do some tuning, and it now takes (N-S) seconds, and we can say that we improved performance by S/N percent. This is represented by the left pair of boxes in the following diagram:

In the diagram you can see that the routine B got optimised, and therefore the entire runtime, for completing the same amount of work, reduced by an amount corresponding to the performance improvement for B.

Let's run through the same scenario, but instead of profiling for a constant amount of work, we profile for a constant duration. In the diagram this is represented by the outermost pair of boxes.

Both profiles run for the same total amount of time, but the right-hand profile has less time spent in routine B() than the left profile; because the time in B() has reduced, more time is spent in A(). This is natural: I've made some part of the code more efficient, I'm observing for the same amount of time, so I must spend more time in the part of the code that I've not optimised.

So what's the performance gain? In this case we're more likely to look at the gain in throughput. It's a safe assumption that the amount of time in A() corresponds to the amount of work done - i.e. that if we did T units of work, then the average cost per unit of work, A()/T, is the same across the pair of experiments. So if we did T units of work in the first experiment, then in the second experiment we'd do T * A'()/A(), i.e. the throughput increases by the scaling factor S = A'()/A(). What is interesting about this is that A() represents any measure of time spent in code which was not optimised. So A() could be a single routine or it could be all the routines that are untouched by the optimisation.
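
As a made-up worked example (the numbers are invented for illustration): suppose the first 60-second profile shows 30 seconds in the untouched code A() and the second 60-second profile shows 40 seconds in A(). Then the scaling factor is S = A'()/A() = 40/30, roughly 1.33, i.e. the optimised build gets through about a third more work in the same observation window.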


Work

My schedule for JavaOne and Oracle Open World

I'm very excited to have got my schedule for Open World and JavaOne:

CON8108: Engineering Insights: Best Practices for Optimizing Oracle Software for Oracle Hardware
Venue / Room: Intercontinental - Grand Ballroom C
Date and Time: 10/1/14, 16:45 - 17:30

CON2654: Java Performance: Hardware, Structures, and Algorithms
Venue / Room: Hilton - Imperial Ballroom A
Date and Time: 9/29/14, 17:30 - 18:30

The first talk will be about some of the techniques I use when performance tuning software. We get very involved in looking at how Oracle software works on Oracle hardware. The things we do work for any software, but we have the advantage of good working relationships with the critical teams.

The second talk is with Charlie Hunt; it's a follow-on from the talk we gave at JavaOne last year. We got Rock Star awards for that, so the pressure's on a bit for this sequel. Fortunately there's still plenty to talk about when you look at how Java programs interact with the hardware, and how careful choices of data structures and algorithms can have a significant impact on delivered performance.

Anyway, I hope to see a bunch of people there. If you're reading this, please come and introduce yourself. If you don't make it, I'm looking forward to putting links to the presentations up.


Personal

Studio 12.4 Beta Refresh, performance counters, and CPI

We've just released the refresh beta for Solaris Studio 12.4 - free download. This release features quite a lot of changes to a number of components. It's worth calling out improvements in the C++11 support and other tools. We've had a few comments and posts on the Studio forums, and a bunch of these have resulted in improvements in this refresh.

One of the features that is deserving of greater attention is default hardware counters in the Performance Analyzer.

Default hardware counters

There are a lot of potential hardware counters that you can profile your application on. Some of them are easy to understand, some require a bit more thought, and some are delightfully cryptic (for example, I'm sure that op_stv_wait_sxmiss_ex means something to someone). Consequently most people don't pay them much attention.

On the other hand, some of us get very excited about hardware performance counters, and the information that they can provide. It's good to be able to reveal that we've made some steps along the path of making that information more generally available.

The new feature in the Performance Analyzer is default hardware counters. For most platforms we've selected a set of meaningful performance counters. You get these if you add -h on to the flags passed to collect. For example:

$ collect -h on ./a.out

Using the counters

Typically the counters will gather cycles, instructions, and cache misses - these are relatively easy to understand and often provide very useful information. In particular, given a count of instructions and a count of cycles, it's easy to compute Cycles Per Instruction (CPI) or Instructions Per Cycle (IPC).

I'm not a great fan of CPI or IPC as absolute measurements - working in the compiler team, there are plenty of ways to change these metrics by controlling the I (instructions) when I really care most about the C (cycles). But the two measurements have a very useful purpose when examining a profile.

A high CPI means lots of cycles were spent somewhere, and very few instructions were issued in that time. This means lots of stall, which means that there's some potential for performance gains. So a good rule of thumb for where to focus first is routines that take a lot of time and have a high CPI.

IPC is useful for a different reason. A processor can issue a maximum number of instructions per cycle. For example, a T4 processor can issue two instructions per cycle. If I see an IPC of 2 for one routine, I know that the code is not stalled and is limited by instruction count. So when I look at code with a high IPC I can focus on optimisations that reduce the instruction count.

So both IPC and CPI are meaningful metrics. Reflecting this, the Performance Analyzer will compute the metrics if the hardware counter data is available. Here's an example:

This code was deliberately contrived so that all the routines had ludicrously high CPI. But isn't that cool - I can immediately see what kinds of opportunities might be lurking in the code.

This is not restricted to just the Functions view; CPI and/or IPC are presented in every view - so you can look at CPI for each thread, line of source, or line of disassembly. Of course, as the counter data gets spread over more "lines" you have less data per line, and consequently more noise. So CPI data at the disassembly level is not likely to be that useful for very short-running experiments. But when aggregated, the CPI can often be meaningful even for short experiments.
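
As a made-up illustration of the arithmetic (the numbers are invented, not from a real profile): a routine that accumulates 4 x 10^9 cycles while issuing 1 x 10^9 instructions has a CPI of 4, or an IPC of 0.25. On a core that can issue two instructions per cycle that is a long way below the achievable IPC of 2, which is exactly the "lots of stall, worth a closer look" signal described above.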

Personal

Presenting at JavaOne and Oracle Open World

Once again I'll be presenting at Oracle Open World and JavaOne. You can search the full catalogue on the web. The details of my two talks are:

Engineering Insights: Best Practices for Optimizing Oracle Software for Oracle Hardware [CON8108]
Oracle Solaris Studio is an indispensable toolset for optimizing key Oracle software running on Oracle hardware. This presentation steps through a series of case studies from real Oracle applications, illustrating how the various Oracle Solaris Studio development tools have proven instrumental in ensuring that Oracle software is fully tuned and optimized for Oracle hardware. Learn the secrets of how Oracle uses these powerful compilers and performance, memory, and thread analysis tools to write optimal, well-tested enterprise code for Oracle hardware, and hear about best practices you can use to optimize your existing applications for the latest Oracle systems.

Java Performance: Hardware, Structures, and Algorithms [CON2654]
Many developers consider the deployment platform to be a black box that the JVM abstracts away. In reality, this is not the case. The characteristics of the hardware do have a measurable impact on the performance of any Java application. In this session, two Java Rock Star presenters explore how hardware features influence the performance of your application. You will not only learn how to measure this impact but also find out how to improve the performance of your applications by writing hardware-friendly code.

Personal

Enabling large file support

For 32-bit apps the "default" maximum file size is 2GB. This is because the interfaces use the long datatype, which is a signed int for 32-bit apps and a signed long long for 64-bit apps. For many apps this is insufficient. Solaris already has huge numbers of large-file-aware commands; these are listed under man largefile.

For a developer wanting to support larger files, the obvious solution is to port to 64-bit; however, there is also a way to remain with 32-bit apps: compile with large file support.

Large file support provides a new set of interfaces that take 64-bit integers, enabling support of files greater than 2GB in size. In a number of cases these interfaces replace the existing ones, so you don't need to change the source. However, there are some interfaces where the long type is part of the ABI; in these cases there is a new interface to use.

The way to find out what flags to use is through the command getconf LFS_CFLAGS. The getconf command returns environment settings, and in this case we're asking it to provide the C flags needed to compile with large file support. It's useful to take a look at the other information that getconf can provide.

The documentation for compiling with large file support talks about both the flags that are needed and the functions that need to be changed. There are two functions that do not map directly onto large file equivalents because they have a long data type in their prototypes. These two functions are fseek and ftell; calls to these two functions need to be replaced by calls to fseeko and ftello.
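
Here's a minimal sketch of what the source change looks like (the file name is made up, and the exact compile flags should come from getconf LFS_CFLAGS on your system rather than being hard-coded):

/* lfs.c - 32-bit code using the large file interfaces.
   Build with the flags reported by getconf LFS_CFLAGS, e.g.
   cc `getconf LFS_CFLAGS` lfs.c
   Without large file support off_t is 32 bits and this offset would overflow. */
#include <sys/types.h>
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("big.dat", "r");
    if (f == NULL) { return 1; }

    /* fseeko/ftello take and return off_t (64 bits with large file support)
       instead of the long used by fseek/ftell. */
    fseeko(f, (off_t)3 * 1024 * 1024 * 1024, SEEK_SET);  /* seek past 2GB */
    off_t pos = ftello(f);
    printf("offset is %lld bytes\n", (long long)pos);

    fclose(f);
    return 0;
}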

Personal

Unsigned integers considered annoying

Let's talk about unsigned integers. These can be tricky because they wrap around from big to small. Signed integers wrap around from positive to negative. Let's look at an example. Suppose I want to do something for all iterations of a loop except for the last OFFSET of them. I could write something like:

 if (i < length - OFFSET) {}

If I assume OFFSET is 8, then for length 10 I'll do something for the first 2 iterations. The problem occurs when the length is less than OFFSET. If length is 2, then I'd expect not to do anything for any of the iterations. For a signed integer, 2 minus 8 is -6, which is less than i, so I don't do anything. For an unsigned integer, 2 minus 8 is 0xFFFFFFFA, which is still greater than i. Hence we'll continue to do whatever it is we shouldn't be doing in this instance.

So the obvious fix for this is that for unsigned integers we do:

 if (i + OFFSET < length) {}

This works over the range that we might expect it to work. Of course we have a problem with signed integers if length happens to be close to INT_MAX; at this point adding OFFSET to a large value of i may cause it to overflow and become a large negative number - which will continue to be less than length. With unsigned ints we encounter the same problem at UINT_MAX, where adding OFFSET to i could generate a small value, which is less than the boundary.

So in these cases we might want to write:

 if (i < length - OFFSET) {}

Oh....

So basically to cover all the situations we might want to write something like:

 if ( (length > OFFSET) && (i < length - OFFSET) ) {}

If this looks rather complex, then it's important to realise that we're handling a range check - and a range has upper and lower bounds. For signed integers zero - OFFSET is representable, so we can write:

 if (i < length - OFFSET) {}

without worrying about wrap-around. However for unsigned integers we need to define both the left and right ends of the range. Hence the more complex expression.
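
Here's a small, self-contained sketch (the values of OFFSET and length are just illustrative) showing the naive test misbehaving for unsigned integers and the guarded test doing the right thing:

#include <stdio.h>

#define OFFSET 8u

int main(void)
{
    unsigned int length = 2;   /* shorter than OFFSET */

    for (unsigned int i = 0; i < 10; i++) {
        /* Naive test: length - OFFSET wraps to a huge unsigned value,
           so this is true for every i - the opposite of what we want. */
        int naive = (i < length - OFFSET);

        /* Guarded test: only subtract once we know length > OFFSET. */
        int guarded = (length > OFFSET) && (i < length - OFFSET);

        printf("i=%u naive=%d guarded=%d\n", i, naive, guarded);
    }
    return 0;
}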

Personal

Solaris Studio 12.4 Beta now available

The beta programme for Solaris Studio 12.4 has opened. So download the bits and take them for a spin! There's a whole bunch of features - you can read about them in the what's new document, but I just want to pick a couple of my favourites:

C++ 2011 support. If you've not read about it, C++ 2011 is a big change. There's a lot of stuff that has gone into it - improvements to the Standard Library, lambda expressions, etc. So there is plenty to try out. However, there are some features not supported in the beta, so take a look at the what's new pages for C++.

Improvements to the Performance Analyzer. If you regularly read my blog, you'll know that this is the tool that I spend most of my time with. The new version has some very cool features; these might not sound like much, but they fundamentally change the way you interact with the tool: an overview screen that helps you rapidly figure out what is of interest, improved filtering, mini-views which show more slices of the data on a single screen, etc. All of these massively improve the experience and the ease of using the tool.

There are a couple more things. If you try out features in the beta and you want to suggest improvements, or tell us about problems, please use the forums. There is a forum for the compiler and one for the tools.

Oh, and watch this space. I will be writing up some material on the new features....

Personal

Privileges

Before I start, this is not about security; it's probably the antithesis of security. So I'd recommend starting by reading about how using privileges can break the security of your system.

There are three tools that I regularly use that require escalated privileges: dtrace, cpustat, and busstat. You can read up on the way that Solaris manages privileges. But if you know what you want to do, the process to figure out how to get the necessary privileges is reasonably straightforward.

To find out what privileges you have, you can use the ppriv -v $$ command. This will report all the privileges for the current shell.

To find out what privileges are stopping you from running a command, you should run it under ppriv -eD command. For example:

 ppriv -eD cpustat -c instruction_counts 1 1
cpustat[13222]: missing privilege "sys_resource" (euid = 84945, syscall = 128) needed at rctl_rlimit_set+0x98
cpustat[13222]: missing privilege "cpc_cpu" (euid = 84945, syscall = 5) needed at kcpc_open+0x4
...

It is also possible to list all the privileges on the system using ppriv -l. This is helpful if the privilege has a name that maps onto what you want to do. The privileges for dtrace are good examples of this:

$ ppriv -l | grep dtrace
dtrace_kernel
dtrace_proc
dtrace_user

You can then use usermod -K ... to assign the necessary privileges to a user. For example:

$ usermod -K defaultpriv=basic,sys_resource,cpc_cpu username

Information about privileges for users is recorded in /etc/user_attr, so it is possible to directly edit that file to add or remove privileges.

Using this approach you can determine that busstat needs sys_config, cpustat needs sys_resource and cpc_cpu, and dtrace needs dtrace_kernel, dtrace_proc, and dtrace_user.

Work

One executable, many platforms

Different processors have different optimal sequences of code. Fortunately, most of the time the differences are minor, and we can easily accommodate them by generating generic code. If you need more than this, then the "old" model was to use dynamic string tokens to pick the best library for the platform. This works well, and was the mechanism that libc.so used. However, the downside is that you now need to ship a bundle of libraries with the application; this can get (and look) a bit messy.

There's a "new" approach that uses a family of capability functions. The idea here is that multiple versions of the routine are linked into the executable, and the runtime linker picks the best for the platform that the application is running on. The routines are denoted with a suffix, after a percentage sign, indicating the platform. For example, here's the family of memcpy() implementations in libc:

$ elfdump -H /usr/lib/libc.so.1 2>&1 | grep memcpy
 [10] 0x0010627c 0x00001280 FUNC LOCL D 0 .text memcpy%sun4u
 [11] 0x001094d0 0x00000b8c FUNC LOCL D 0 .text memcpy%sun4u-opl
 [12] 0x0010a448 0x000005f0 FUNC LOCL D 0 .text memcpy%sun4v-hwcap1
...

It takes a bit of effort to produce a family of implementations. Imagine we want to print something different when an application is run on a sun4v machine. First of all we'll have a bit of code that prints out a compile-time defined string indicating the platform we're running on:

#include <stdio.h>

static char name[] = PLATFORM;

void platform()
{
  printf("Running on %s\n", name);
}

To compile this code we need to provide the definition for PLATFORM - suitably escaped. We will need to provide two versions: a generic version that can always run, and a platform-specific version that runs on sun4v platforms:

$ cc -c -o generic.o p.c -DPLATFORM=\"Generic\"
$ cc -c -o sun4v.o p.c -DPLATFORM=\"sun4v\"

Now we have a specialised version of the routine platform(), but it has the same name as the generic version, so we cannot link the two into the same executable. What we need to do is tag it as being the version we want to run on sun4v platforms.

This is a two-step process. The first step is to tag the object file as being a sun4v object file. This step is only necessary if the compiler has not already tagged the object file. The compiler will tag the object file appropriately if it uses instructions from a particular architecture - for example if you compiled explicitly targeting T4 using -xtarget=t4. However, if you need to tag the object file, then you can use a mapfile to add the appropriate hardware capabilities:

$mapfile_version 2
CAPABILITY sun4v {
  MACHINE=sun4v;
};

We can then ask the linker to apply these hardware capabilities from the mapfile to the object file:

$ ld -r -o sun4v_1.o -Mmapfile.sun4v sun4v.o

You can see that the capabilities have been applied using elfdump:

$ elfdump -H sun4v_1.o

Capabilities Section: .SUNW_cap

 Object Capabilities:
 index tag value
 [0] CA_SUNW_ID sun4v
 [1] CA_SUNW_MACH sun4v

The second step is to take these capabilities and apply them to the functions. We do this using the linker option -z symbolcap:

$ ld -r -o sun4v_2.o -z symbolcap sun4v_1.o

You can now see that the platform function has been tagged as being for sun4v hardware:

$ elfdump -H sun4v_2.o

Capabilities Section: .SUNW_cap

 Symbol Capabilities:
 index tag value
 [1] CA_SUNW_ID sun4v
 [2] CA_SUNW_MACH sun4v

 Symbols:
 index value size type bind oth ver shndx name
 [24] 0x00000010 0x00000070 FUNC LOCL D 0 .text platform%sun4v

And finally you can combine the object files into a single executable. The main() routine of the executable calls platform(), which will print out a different message depending on the platform. Here's the source to main():

extern void platform();

int main()
{
  platform();
}

Here's what happens when the program is compiled and run on a non-sun4v platform:

$ cc -o main -O main.c sun4v_2.o generic.o
$ ./main
Running on Generic

Here's the same executable running on a sun4v platform:

$ ./main
Running on sun4v

Work

The pains of preprocessing

Ok, so I've encountered this twice in 24 hours, so it's probably worth talking about it. The preprocessor does a simple text substitution as it works its way through your source files. Sometimes this has "unanticipated" side-effects. When this happens, you'll normally get a "hey, this makes no sense at all" error from the compiler. Here's an example:

$ more c.c
#include <ucontext.h>
#include <stdio.h>

int main()
{
  int FS;
  FS=0;
  printf("FS=%i",FS);
}

$ CC c.c
"c.c", line 6: Error: Badly formed expression.
"c.c", line 7: Error: The left operand must be an lvalue.
2 Error(s) detected.

A similar thing happens with g++:

$ /pkg/gnu/bin/g++ c.c
c.c: In function 'int main()':
c.c:6:7: error: expected unqualified-id before numeric constant
c.c:7:6: error: lvalue required as left operand of assignment

The Studio C compiler gives a bit more of a clue what is going on. But it's not something you can rely on:

$ cc c.c
"c.c", line 6: syntax error before or at: 1
"c.c", line 7: left operand must be modifiable lvalue: op "="

As you can guess, the issue is that FS gets substituted. We can find out what happens by examining the preprocessed source:

$ CC -P c.c
$ tail c.i
int main ( )
{
int 1 ;
1 = 0 ;
printf ( "FS=%i" , 1 ) ;
}

You can confirm this using -xdumpmacros to dump out the macros as they are defined. You can combine this with -H to see which header files are included:

$ CC -xdumpmacros c.c 2>&1 | grep FS
#define _SYS_ISA_DEFS_H
#define _FILE_OFFSET_BITS 32
#define REG_FSBASE 26
#define REG_FS 22
#define FS 1
....

If you're using gcc you should use the -E option to get preprocessed source, and the -dD option to get definitions of macros and the include files.
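
If you really do need a variable with the clashing name, one workaround - a sketch rather than a recommendation, since renaming the variable is usually the cleaner fix - is to remove the macro after the header has been included:

#include <ucontext.h>
#include <stdio.h>

/* The included headers define FS as a macro (a register index),
   so undefine it before using the identifier as a variable name. */
#ifdef FS
#undef FS
#endif

int main()
{
    int FS;
    FS = 0;
    printf("FS=%i\n", FS);
    return 0;
}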

Work

Compiling for T4

I've recently had quite a few queries about compiling for T4 based systems. So it's probably a good time to review what I consider to be the best practices.

Always use the latest compiler. Being in the compiler team, this is bound to be something I'd recommend :) But the serious points are that (a) every release the tools get better and better, so you are going to be much more effective using the latest release, (b) every release we improve the generated code, so you will see things get better, and (c) old releases cannot know about new hardware.

Always use optimisation. You should use at least -O to get some amount of optimisation. -xO4 is typically even better as this will add within-file inlining.

Always generate debug information, using -g. This allows the tools to attribute information to lines of source. This is particularly important when profiling an application.

The default target of -xtarget=generic is often sufficient. This setting is designed to produce a binary that runs well across all supported platforms. If the binary is going to be deployed on only a subset of architectures, then it is possible to produce a binary that only uses the instructions supported on those architectures, which may lead to some performance gains. I've previously discussed which chips support which architectures, and I'd recommend that you take a look at the chart that goes with the discussion.

Crossfile optimisation (-xipo) can be very useful - particularly when the hot source code is distributed across multiple source files. If you're allowed to have something as geeky as a favourite compiler optimisation, then this is mine!

Profile feedback (-xprofile=[collect: | use:]) will help the compiler make the best code layout decisions, and is particularly effective with crossfile optimisations. But what makes this optimisation really useful is that codes that are dominated by branch instructions don't typically improve much with "traditional" compiler optimisation, but often do respond well to being built with profile feedback.

The macro flag -fast aims to provide a one-stop "give me a fast application" flag. This usually gives a best performing binary, but with a few caveats. It assumes the build platform is also the deployment platform, it enables floating point optimisations, and it makes some relatively weak assumptions about pointer aliasing. It's worth investigating.

The SPARC64 processors, T3, and T4 implement floating point multiply accumulate instructions. These can substantially improve floating point performance. To generate them the compiler needs the flag -fma=fused and also needs an architecture that supports the instruction (at least -xarch=sparcfmaf). There's a small sketch of this kind of code at the end of this post.

The most critical advice is that anyone doing performance work should profile their application. I cannot overstate how important it is to look at where the time is going in order to determine what can be done to improve it.

I also presented at Oracle OpenWorld on this topic, so it might be helpful to review those slides.
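
Here's a small sketch of the kind of loop that benefits from fused multiply accumulate (the function and file names are made up; the compile line in the comment is only an example and should be adjusted to the architectures you actually deploy on):

/* daxpy.c - the multiply-and-add in the loop body is a candidate for a
   fused multiply accumulate instruction when built with, for example:
   cc -O -xarch=sparcfmaf -fma=fused -c daxpy.c */
void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];   /* multiply + add in one operation with fma */
    }
}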

Work

Library order is important

I've written quite extensively about link ordering issues, but I've not discussed the interaction between archive libraries and shared libraries. So let's take a simple program that calls a maths library function:

#include <math.h>

int main()
{
  for (int i=0; i<10000000; i++)
  {
    sin(i);
  }
}

We compile and run it to get the following performance:

bash-3.2$ cc -g -O fp.c -lm
bash-3.2$ timex ./a.out
real 6.06
user 6.04
sys 0.01

Now most people will have heard of the optimised maths library, which is added by the flag -xlibmopt. This contains optimised versions of key mathematical functions; in this instance, using the library doubles performance:

bash-3.2$ cc -g -O -xlibmopt fp.c -lm
bash-3.2$ timex ./a.out
real 2.70
user 2.69
sys 0.00

The optimised maths library is provided as an archive library (libmopt.a), and the driver adds it to the link line just before the maths library - this causes the linker to pick the definitions provided by the static library in preference to those provided by libm. We can see the processing by asking the compiler to print out the link line:

bash-3.2$ cc -### -g -O -xlibmopt fp.c -lm
/usr/ccs/bin/ld ... fp.o -lmopt -lm -o a.out ...

The flag to the linker is -lmopt, and this is placed before the -lm flag. So what happens when the -lm flag is in the wrong place on the command line?

bash-3.2$ cc -g -O -xlibmopt -lm fp.c
bash-3.2$ timex ./a.out
real 6.02
user 6.01
sys 0.01

If the -lm flag is before the source file (or object file for that matter), we get the slower performance from the system maths library. Why's that? If we look at the link line we can see the following ordering:

/usr/ccs/bin/ld ... -lmopt -lm fp.o -o a.out

So the optimised maths library is still placed before the system maths library, but the object file is placed afterwards. This would be ok if the optimised maths library were a shared library, but it is not - instead it's an archive library, and archive library processing is different - as described in the linker and library guide:

"The link-editor searches an archive only to resolve undefined or tentative external references that have previously been encountered."

An archive library can only be used to resolve symbols that are outstanding at that point in the link processing. When fp.o is placed before the libmopt.a archive library, the linker has an unresolved symbol defined in fp.o, and it will search the archive library to resolve that symbol. If the archive library is placed before fp.o then there are no unresolved symbols at that point, and so the linker doesn't need to use the archive library. This is why libmopt needs to be placed after the object files on the link line.

On the other hand, if the linker has observed any shared libraries, then at any point these are checked for any unresolved symbols. The consequence of this is that once the linker "sees" libm it will resolve any symbols it can to that library, and it will not check the archive library to resolve them. This is why libmopt needs to be placed before libm on the link line.

This leads to the following order for placing files on the link line:

Object files
Archive libraries
Shared libraries

If you use this order, then things will consistently get resolved to the archive libraries rather than to the shared libraries.

Work

It could be worse....

As "guest" pointed out, in my file I/O test I didn't open the file with O_SYNC, so in fact the time was spent in OS code rather than in disk I/O. It's a straightforward change to add O_SYNC to the open() call, but it's also useful to reduce the iteration count - since the cost per write is much higher:

...
#define SIZE 1024

void test_write()
{
  starttime();
  int file = open("./test.dat",O_WRONLY|O_CREAT|O_SYNC,S_IWGRP|S_IWOTH|S_IWUSR);
...

Running this gave the following results:

Time per iteration 0.000065606310 MB/s
Time per iteration 2.709711563906 MB/s
Time per iteration 0.178590114758 MB/s

Yup, disk I/O is way slower than the original I/O calls. However, it's not a very fair comparison, since disks get written in large blocks of data and we're deliberately sending a single byte. A fairer result would be to look at the I/O operations per second, which is about 65 - pretty much what I'd expect for this system.

It's also interesting to examine the profiles for the two cases. When the write() was trapping into the OS, the profile indicated that all the time was being spent in system time. When the data was being written to disk, the time got attributed to sleep. This gives us an indication of how to interpret profiles from apps doing I/O: it's the sleep time that indicates disk activity.

Work

Write and fprintf for file I/O

fprintf() does buffered I/O, whereas write() does unbuffered I/O. So once the write() completes, the data is in the file, whereas for fprintf() it may take a while for the file to get updated to reflect the output. This results in a significant performance difference - the write works at disk speed. The following is a program to test this:

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>

static double s_time;

void starttime()
{
  s_time=1.0*gethrtime();
}

void endtime(long its)
{
  double e_time=1.0*gethrtime();
  printf("Time per iteration %5.2f MB/s\n", (1.0*its)/(e_time-s_time*1.0)*1000);
  s_time=1.0*gethrtime();
}

#define SIZE 10*1024*1024

void test_write()
{
  starttime();
  int file = open("./test.dat",O_WRONLY|O_CREAT,S_IWGRP|S_IWOTH|S_IWUSR);
  for (int i=0; i<SIZE; i++)
  {
    write(file,"a",1);
  }
  close(file);
  endtime(SIZE);
}

void test_fprintf()
{
  starttime();
  FILE* file = fopen("./test.dat","w");
  for (int i=0; i<SIZE; i++)
  {
    fprintf(file,"a");
  }
  fclose(file);
  endtime(SIZE);
}

void test_flush()
{
  starttime();
  FILE* file = fopen("./test.dat","w");
  for (int i=0; i<SIZE; i++)
  {
    fprintf(file,"a");
    fflush(file);
  }
  fclose(file);
  endtime(SIZE);
}

int main()
{
  test_write();
  test_fprintf();
  test_flush();
}

Compiling and running I get 0.2MB/s for write() and 6MB/s for fprintf(). A large difference. There are three tests in this example; the third test uses fprintf() and fflush(). This is equivalent to write() both in performance and in functionality. Which leads to the suggestion that fprintf() (and other buffered I/O functions) are the fastest way of writing to files, and that fflush() should be used to enforce synchronisation of the file contents.

Work

Current SPARC Architectures

Different generations of SPARC processors implement different architectures. The architecture that the compiler targets is controlled implicitly by the -xtarget flag and explicitly by the -xarch flag.

If an application targets a recent architecture, then the compiler gets to play with all the instructions that the new architecture provides. The downside is that the application won't work on older processors that don't have the new instructions. So for developers there is a trade-off between performance and portability.

The way we have solved this in the compiler is to assume a "generic" architecture, and we've made this the default behaviour of the compiler. The only flag that doesn't make this assumption is -fast, which tells the compiler to assume that the build machine is also the deployment machine - so the compiler can use all the instructions that the build machine provides.

The -xtarget=generic flag tells the compiler explicitly to use this generic model. We work hard on making generic code work well across all processors. So in most cases this is a very good choice.

It is also of interest to know what processors support the various architectures. The following Venn diagram attempts to show this; a textual description is as follows:

The T1 and T2 processors, in addition to most other SPARC processors that were shipped in the last 10+ years, supported V9b, or sparcvis2.

The SPARC64 processors from Fujitsu, used in the M-series machines, added support for the floating point multiply accumulate instruction in the sparcfmaf architecture. Support for this instruction also appeared in the T3 - this is called sparcvis3.

Later SPARC64 processors added the integer multiply accumulate instruction; this architecture is sparcima.

Finally, the T4 includes support for both the integer and floating point multiply accumulate instructions in the sparc4 architecture.

So the conclusion should be:

Floating point multiply accumulate is supported in both the T-series and M-series machines, so it should be a relatively safe bet to start using it.

The T4 is a very good machine to deploy to because it supports all the current instruction sets.

Personal

Square roots

If you are spending significant time calling sqrt() then to improve this you should compile with -xlibmil. Here's some example code that calls both fabs() and sqrt():

#include <math.h>
#include <stdio.h>

int main()
{
  double d=23.3;
  printf("%f\n",fabs(d));
  printf("%f\n",sqrt(d));
}

If we compile this with Studio 12.2 we will see calls to both fabs() and sqrt():

$ cc -S -O m.c
$ grep call m.s | grep -v printf
/* 0x0018 */ call fabs ! params = %o0 %o1 ! Result = %f0 %f1
/* 0x0044 */ call sqrt ! params = %o0 %o1 ! Result = %f0 %f1

If we add -xlibmil then these calls get replaced by equivalent instructions:

$ cc -S -O -xlibmil m.c
$ grep abs m.s | grep -v print; grep sqrt m.s | grep -v print
/* 0x0018 7 */ fabsd %f4,%f0
/* 0x0038 */ fsqrtd %f6,%f2

The default for Studio 12.3 is to inline fabs(), but you still need to add -xlibmil for the compiler to inline sqrt(), so it is a good idea to include the flag.

You can see the functions that are replaced by inline versions by grepping the inline template file (libm.il) for the word "inline":

$ grep inline libm.il
 .inline sqrtf,1
 .inline sqrt,2
 .inline ceil,2
 .inline ceilf,1
 .inline floor,2
 .inline floorf,1
 .inline rint,2
 .inline rintf,1
 .inline min_subnormal,0
 .inline min_subnormalf,0
 .inline max_subnormal,0
...

The caveat with -xlibmil is documented:

 However, these substitutions can cause the setting of errno to become unreliable. If your program depends on the value of errno, avoid this option. See the NOTES section at the end of this man page for more information.

An optimisation in the inline versions of these functions is that they do not set errno. Which can be a problem for some codes, but most codes don't read errno.

Work

What is -xcode=abs44?

I've talked about building 64-bit libraries with position independent code. When building 64-bit applications there are two options for the code that the compiler generates: -xcode=abs64 or -xcode=abs44; the default is -xcode=abs44. These are documented in the user guides. The abs44 and abs64 options produce 64-bit applications that constrain the code + data + BSS to either 44 bits or 64 bits of address.

These options constrain the addresses statically encoded in the application to either 44 or 64 bits. They do not restrict the address range for pointers (dynamically allocated memory) - those remain 64 bits. The restriction is in locating the address of a routine or a variable within the executable.

This is easier to understand from the perspective of an example. Suppose we have a variable "data" that we want to return the address of. Here's the code to do such a thing:

extern int data;

int * address()
{
  return &data;
}

If we compile this as a 32-bit app we get the following disassembly:

/* 000000 4 */ sethi %hi(data),%o5
/* 0x0004 */ retl ! Result = %o0
/* 0x0008 */ add %o5,%lo(data),%o0

So it takes two instructions to generate the address of the variable "data". At link time the linker will go through the code, locate references to the variable "data" and replace them with the actual address of the variable, so these two instructions will get modified. If we compile this as 64-bit code with full 64-bit address generation (-xcode=abs64) we get the following:

/* 000000 4 */ sethi %hh(data),%o5
/* 0x0004 */ sethi %lm(data),%o2
/* 0x0008 */ or %o5,%hm(data),%o4
/* 0x000c */ sllx %o4,32,%o3
/* 0x0010 */ or %o3,%o2,%o1
/* 0x0014 */ retl ! Result = %o0
/* 0x0018 */ add %o1,%lo(data),%o0

So to do the same thing for a 64-bit application with full 64-bit address generation takes 6 instructions. Now, most hardware cannot address the full 64 bits; hardware typically can address somewhere around 40+ bits of address (example). So being able to generate a full 64-bit address is currently unnecessary. This is where abs44 comes in. A 44-bit address can be generated in four instructions, which slightly cuts the instruction count without practically compromising the range of memory that an application can address:

/* 000000 4 */ sethi %h44(data),%o5
/* 0x0004 */ or %o5,%m44(data),%o4
/* 0x0008 */ sllx %o4,12,%o3
/* 0x000c */ retl ! Result = %o0
/* 0x0010 */ add %o3,%l44(data),%o0
