Friday Oct 21, 2011


SPARC and x86 processors have different endianness. SPARC is big-endian and x86 is little-endian. Big-endian means that numbers are stored with the most significant data earlier in memory. Conversely little-endian means that numbers are stored with the least significant data earlier in memory.
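A minimal C sketch of checking this at runtime (the function name is mine): store a known 32-bit value and look at which byte lands first in memory.

```c
#include <stdint.h>

/* Runtime endianness check: store a known 32-bit value and inspect
   which byte ends up first in memory. */
int is_little_endian(void) {
    uint32_t probe = 1;                    /* 0x00000001 */
    return *(unsigned char *)&probe == 1;  /* LSB first => little-endian */
}
```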

Think of big endian as writing numbers as we would normally do. For example, one thousand, one hundred and twenty would be written as 1120 using a big-endian format. However, written as little endian it would be 0211 - the least significant digits would be recorded first.

For machines, this relates to which bytes are stored first. To make data portable between machines, a format needs to be agreed. For example in networking, data is defined as being big-endian. So to handle network packets, little-endian machines need to convert the data before using it.
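For example, the standard htonl/ntohl routines convert 32-bit values between host and network (big-endian) order; on a big-endian machine such as SPARC they are no-ops. A small sketch (the wrapper names are mine):

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Network byte order is big-endian; htonl/ntohl perform the conversion
   on little-endian hosts and do nothing on big-endian hosts. */
uint32_t to_network(uint32_t host_value)  { return htonl(host_value); }
uint32_t from_network(uint32_t net_value) { return ntohl(net_value); }
```

Whatever the host endianness, a round trip through the two routines returns the original value.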

Converting the bytes is a trivial matter, but it has some performance pitfalls. Let's start with a simple way of doing the conversion.

template <class T>
T swapslow(T in)
{
  T out;
  char * pcin  = (char*)&in;
  char * pcout = (char*)&out;

  for (unsigned int i=0; i<sizeof(T); i++)
  {
    pcout[i] = pcin[sizeof(T)-1-i];
  }
  return out;
}
The code uses templates to generalise it to different sizes of integers. But the following observations hold even if you use a C version for a particular size of input.

The first thing to look at is the instruction count. Assume I'm dealing with ints. I store the input to memory, then access it one byte at a time, storing each byte to a new location in memory, before finally loading the result. So for an int, I've got 10 memory operations.

Memory operations can be costly. Processors may be limited to only issuing one per cycle. In comparison most processors can issue two or more logical or integer arithmetic instructions per cycle. Loads are also costly as they have to access the cache, which takes a few cycles.

The other issue is more subtle, and I've discussed it in the past. There are RAW issues in this code. I'm storing an int, but loading it as four bytes. Then I'm storing four bytes, and loading them as an int.

A RAW hazard is a read-after-write hazard. The processor sees data being stored, but cannot convert that stored data into the format that the subsequent load requires. Hence the load has to wait until the result of the store reaches the cache before the load can complete. This can be multiple cycles of wait.

With endianness conversion, the data is already in the registers, so we can use logical operations to perform the conversion. This approach is shown in the next code snippet.

template <class T>
T swap(T in)
{
  T out=0;
  for (unsigned int i=0; i<sizeof(T); i++)
  {
    out = (out << 8) | (in & 0xff); // shift result up, mask off low byte, merge
    in >>= 8;                       // move next byte into position
  }
  return out;
}

In this case, we avoid the stores and loads, but instead we perform four logical operations per byte. This is higher cost than the load and store per byte. However, we can usually do more logical operations per cycle and the operations normally take a single cycle to complete. Overall, this is probably slightly faster than loads and stores.
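For a concrete, non-template illustration, a 32-bit byte swap using only shifts, masks, and ORs might look like this (a sketch; the function name is mine):

```c
#include <stdint.h>

/* Reverse the bytes of a 32-bit value entirely in registers:
   shifts, masks, and ORs, with no loads or stores. */
uint32_t bswap32(uint32_t x) {
    return (x >> 24)                  /* byte 3 -> byte 0 */
         | ((x >> 8) & 0x0000FF00u)   /* byte 2 -> byte 1 */
         | ((x << 8) & 0x00FF0000u)   /* byte 1 -> byte 2 */
         |  (x << 24);                /* byte 0 -> byte 3 */
}
```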

However, you will usually see a greater performance gain from avoiding the RAW hazards. Obviously RAW hazards are hardware dependent - some processors may be engineered to avoid them. In which case you will only see a problem on some particular hardware. Which means that your application will run well on one machine, but poorly on another.

Tuesday Jan 11, 2011

RAW pipeline hazards

When a processor stores an item of data back to memory it actually goes through quite a complex set of operations. A sketch of the activities is as follows. The first thing that needs to be done is that the cache line containing the target address of the store needs to be fetched from memory. While this is happening, the data to be stored there is placed on a store queue. When the store is the oldest item in the queue, and the cache line has been successfully fetched from memory, the data can be placed into the cache line and removed from the queue.

This works very well if data is stored and either never reused, or reused after a relatively long delay. Unfortunately it is common for data to be needed almost immediately. There are plenty of reasons why this is the case. If parameters are passed through the stack, then they will be stored to the stack, and then immediately reloaded. If a register is spilled to the stack, then the data will be reloaded from the stack shortly afterwards.

It could take some considerable number of cycles if the loads had to wait for the stores to exit the queue before they could fetch the data. So many processors implement some kind of bypassing. If a load finds the data it needs in the store queue, then it can fetch it from there. There are often some caveats associated with this bypass. For example, the store and load often have to be of the same size and to the same address - i.e. you cannot bypass a byte from a store of a word. If the bypass fails, then the situation is referred to as a "RAW" hazard, meaning "Read-After-Write", and the load has to wait until the store has completed before it can retrieve the new value - this can take many cycles.

As a general rule it is best to avoid potential RAWs. It is hardware, and runtime situation dependent whether there will be a RAW hazard or not, so avoiding the possibility is the best defense. Consider the following code which uses loads and stores of bytes to construct an integer.

#include <stdio.h>
#include <sys/time.h>

void tick()
{
  hrtime_t now = gethrtime();
  static hrtime_t then = 0;
  if (then>0) printf("Elapsed = %f\n", 1.0*(now-then)/100000000.0);
  then = now;
}

int func(char * value)
{
  int temp;
  ((char*)&temp)[0] = value[3];
  ((char*)&temp)[1] = value[2];
  ((char*)&temp)[2] = value[1];
  ((char*)&temp)[3] = value[0];
  return temp;
}

int main()
{
  int value = 0x01020304;
  tick();
  for (int i=0; i<100000000; i++) { func((char*)&value); }
  tick();
}
In the above code we're reversing the byte order by loading the bytes one-by-one, storing them into an integer in the correct positions, and then loading the integer. Running this code on a test machine, it reports 12ns per iteration.

However, it is possible to perform the same reordering using logical operations (shifts and ORs) as follows:

int func2(char* value)
{
  return (value[0]<<24) | (value[1]<<16) | (value[2]<<8) | value[3];
}

This modified routine takes about 8ns per iteration, which is significantly faster than the original code.

The actual speed up observed will depend on many factors, the most obvious being how often the code is encountered. The more subtle observation is that the speed up depends on the platform: some platforms will be more sensitive to the impact of RAWs than others. So the best advice is, wherever possible, to avoid passing data through the stack.

Monday Oct 19, 2009

Fishing with cputrack

I'm a great fan of the hardware performance counters that you find on most processors. Often you can look at the profile and instantly identify what the issue is. Sometimes though, it is not obvious, and that's where the performance counters can really help out.

I was looking at one such issue last week. The performance of the application was showing some variation, and it wasn't immediately obvious what the issue was. The usual suspects in these cases are:

  • Excessive system time
  • Process migration
  • Memory placement
  • Page size
  • etc.

Unfortunately, none of these seemed to explain the issue. So I hacked together a script, cputrackall, which ran the test code under cputrack for each of the possible performance counters. I dumped the output into a spreadsheet, and compared the fast and slow runs of the app. This is something of a "fishing trip" script, just gathering as much data as possible in the hope that something leaps out, but sometimes that's exactly what's needed. I regularly get to sit in front of a new chip before tools like ripc have been ported, and in those situations the easiest thing to do is to look for hardware counter events that might explain the runtime performance. In this particular instance, it helped me to confirm my suspicion that a difference in branch misprediction rates was causing the issue.

Monday Sep 28, 2009

Updated compiler flags article

Just updated the Selecting The Best Compiler Options article for the developer portal. Minor changes, mainly a bit more clarification on floating point optimisations.

Monday Jun 15, 2009

Glasgow Haskell Compiler successfully ported to OpenSPARC

Ben Lippmeier has been working on the native port of the Glasgow Haskell Compiler (GHC) to SPARC. The port was completed a few months back, and since then he's been using an UltraSPARC T2 system to look at thread activity and scaling as the number of threads is increased. The full details of the project are on his GHC on SPARC blog. The latest SPARC/Solaris binary can be downloaded here, although the full set of changes probably won't be available for a couple of months.

Thursday Apr 02, 2009

Strong prefetch

Looked at an interesting problem yesterday. One of the hot routines in an application was showing a lot of system time. The overall impact on the application was minor, but any unexpected system time is worth investigating.

Profiling under the performance analyzer indicated a pair of instructions, a prefetch followed by a load. The load had all the time attributed to it, but it's usual for the performance analyzer to attribute time to the instruction following the one that is slow. Something like (this is synthetic data, measured data later):

User  System
 0.5     0.5     prefetch [%o0],23
10.0    20.0     ld [%i1],%i2

So there are three things that sprang to mind:

  • Misaligned memory accesses. This would probably be the most common cause of system time on memory operations. However, it didn't look like a perfect fit, since the system time would usually be reported on the instruction afterwards.
  • TSB misses. The TLB is the on-chip structure that holds virtual to physical mappings. If the mapping is not in the TLB, then the mapping gets fetched from a software managed structure called the TSB. This is a TLB miss. However, the TSB has a capacity limit too, and it is possible for there to be a TSB miss where the mapping is fetched from memory and placed into the TSB. TLB misses are fast traps and don't get recorded under system time. However, TSB misses do cause the switch. This is quite a strong candidate, however trapstat -t didn't show much activity.
  • The third option was the prefetch instruction.

A prefetch instruction requests data from memory in advance of the data being used. For many codes this results in a large gain in performance. There are a number of variants of prefetch, and their exact operation depends on the processor. The SPARC Architecture 2007 manual, pages 300-303, gives an outline of the variants: there's prefetch to read, prefetch to write, prefetch for single use, and prefetch for multiple use. The processor has the option to do something different depending on which particular prefetch was used to get the data.

The system I was running on was an UltraSPARC IV+, which introduced the strong prefetch variant.

The idea of the strong prefetch variant is that it will cause a TLB miss to be taken if the mapping of the requested address is not present in the TLB. This is useful for situations where TLB misses are frequently encountered and the prefetch needs to be able to deal with them. To explain why this is important, consider this example. I have a loop which strides through a large range of memory, the loop optimally takes only a few cycles to execute - so long as the data is resident in cache. It takes a couple of hundred cycles to fetch data from memory, so I end up fetching for eight iterations ahead (eight is the maximum number of outstanding prefetches on that processor). If the data resides on a new page, and prefetches do not cause TLB misses, then eight iterations will complete before a load or store touches the new page of memory and brings the mapping into the TLB. The eight prefetches that were issued for those iterations will have been dropped and all the iterations will see the full memory latency (a total of 200*8=1600 cycles of cost). However, if the prefetches are strong prefetches, then the first strong prefetch will cause the mapping to be brought into the TLB, and all the prefetches will hit in the TLB.
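A sketch of that striding loop, using GCC's __builtin_prefetch as a stand-in for the SPARC prefetch instruction (the function name and the eight-ahead distance here are illustrative):

```c
/* Stride through an array, requesting data eight iterations ahead of use.
   __builtin_prefetch is a hint; on hardware that drops prefetches on TLB
   misses the requests for a new page may be silently discarded. */
long sum_ahead(const long *data, int n) {
    long total = 0;
    for (int i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&data[i + 8]); /* fetch 8 iterations ahead */
        total += data[i];
    }
    return total;
}
```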

However, there's a corner case. A strong prefetch of a page that is not mapped will still cause the TLB miss, and the corresponding system time while the kernel figures out that it should drop the access to a non-existent page.

This was my suspicion. The code had a loop, and each iteration of the loop would prefetch the next item in the linked list. If the code had reached the end of the list, then it would be issuing a prefetch for address 0, which (of course) is not mapped.
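That linked-list pattern might look something like this sketch (again using __builtin_prefetch as a stand-in; at the list tail the prefetch is issued for the NULL pointer):

```c
#include <stddef.h>

struct node { struct node *next; long value; };

/* Walk a list, prefetching the next node each iteration.
   On the final node p->next is NULL, so a prefetch of address 0 is issued. */
long walk(struct node *head) {
    long total = 0;
    for (struct node *p = head; p != NULL; p = p->next) {
        __builtin_prefetch(p->next);
        total += p->value;
    }
    return total;
}
```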

It's relatively easy to provide a test case for this, just loop around issuing prefetches for address zero and see how long the code takes to run. Prefetches are defined in the header file <sun_prefetch.h>. So the first code looks like:

#include <sun_prefetch.h>

void main()
{
  for (int i=0; i<1000000; i++)
  {
    sparc_prefetch_write_many(0);
  }
}

Compiling and running this gives:

$ cc pref.c
$ timex a.out

real           2.59
user           1.24
sys            1.34

So the code has substantial system time - which fits with the profile of the application. Next thing to look at is the location of that system time:

$ collect a.out
$ er_print -metrics e.user:e.system -dis main
   Excl.     Excl.
   User CPU  Sys. CPU
    sec.      sec.
   0.        0.                 [?]    10b8c:  bset        576, %o1 ! 0xf4240
   0.        0.                 [?]    10b90:  prefetch    [0], #n_writes_strong
## 1.571     1.181              [?]    10b94:  inc         %i5
   0.        0.                 [?]    10b98:  cmp         %i5, %o1

So the system time is recorded on the instruction following the prefetch.

The next step is to investigate ways to solve it. The <sun_prefetch.h> header file does not define any variants of weak prefetch. So it's necessary to write an inline template to provide the support:

.inline weak,4
  prefetch [%o0],2
.end

.inline strong,4
  prefetch [%o0],22
.end

The identifier for many writes strong is #22, for many writes weak it is #2. The initial code uses strong since I want to validate that I get the same behaviour with my new code:

void weak(int*);
void strong(int*);

void main()
{
  for (int i=0; i<1000000; i++)
  {
    strong(0);
  }
}
Running this gives:

$ cc pref.c
$ collect a.out
$ er_print -metrics e.user:e.system -dis main
   Excl.     Excl.
   User CPU  Sys. CPU
    sec.      sec.
   0.        0.                 [?]    10b90:  clr         %o0
   0.        0.                 [?]    10b94:  nop
   0.        0.                 [?]    10b98:  prefetch    [%o0], #n_writes_strong
## 1.761     1.051              [?]    10b9c:  inc         %i5

So that looks good. Replacing the strong prefetch with a weak prefetch:

void weak(int*);
void strong(int*);

void main()
{
  for (int i=0; i<1000000; i++)
  {
    weak(0);
  }
}
And then repeat the profiling experiment:

   Excl.     Excl.
   User CPU  Sys. CPU
    sec.      sec.
   0.        0.                 [?]    10b90:  clr         %o0
   0.        0.                 [?]    10b94:  nop
   0.        0.                 [?]    10b98:  prefetch    [%o0], #n_writes
   0.        0.                 [?]    10b9c:  inc         %i5

Running the code under timex:

$ timex a.out

real           0.01
user           0.00
sys            0.00

So that got rid of the system time. Of course with this change the prefetches issued will now be dropped if there is a TLB miss, and it is possible that could cause a slowdown for the loads in the loop - which is a trade-off. However, examining the application using pmap -xs <pid> indicated that the data resided on 4MB pages, so the number of TLB misses should be pretty low (this was confirmed by the trapstat data I gathered earlier).

One final comment is that this behaviour is platform dependent. Running the same strong prefetch code on an UltraSPARC T2, for example, shows no system time.

Thursday Mar 26, 2009

OpenSPARC workshop - Brussels

April 24th and 25th I'm going to be in Brussels rerunning the OpenSPARC workshop. The workshop leverages the material in the OpenSPARC Internals book, together with the OpenSPARC presentations. There's a poster with all the details (for those with acute eyesight, the poster on the left is from the December workshop!).

Tuesday Dec 16, 2008

OpenSPARC Internals available on Amazon

OpenSPARC Internals is now available from Amazon. As well as print-on-demand from lulu, and as a free (after registration) download.

Thursday Nov 13, 2008

How to learn SPARC assembly language

Got a question this morning about how to learn SPARC assembly language. It's a topic that I cover briefly in my book, however, the coverage in the book was never meant to be complete. The text in my book is meant as a quick guide to reading SPARC (and x86) assembly, so that the later examples make some kind of sense. The basics are the instruction format:

[instruction]  [source register 1], [source register 2], [destination register]

For example:

faddd    %f0, %f2, %f4

which is equivalent to:

%f4 = %f0 + %f2

The other thing to learn that's different about SPARC is the branch delay slot: the instruction placed after the branch is actually executed as part of the branch. This is different from x86, where a branch instruction is the delimiter of the block of code.

With those basics out of the way, the next thing to do would be to take a look at the SPARC Architecture manual, which is a very detailed reference to all the software visible implementation details.

Finally, I'd suggest just writing some simple codes, and profiling them using the Sun Studio Performance Analyzer. Use the disassembly view tab and the architecture manual to see how the instructions are used in practice.

Tuesday Nov 04, 2008

Job available in this performance analysis team

We're advertising a job opening in this group. We're looking for someone who's keen on doing performance analysis on x86 and SPARC platforms. The req number is 561456. If you have any questions, please do feel free to contact me.

Monday Jul 28, 2008

Atomic operations

Solaris 10 provides atomic operations in libc. Atomic operations are sequences of instructions that behave as if they are a single atomic instruction that no other thread can interrupt. The way that the operations are implemented uses the compare-and-swap instruction (cas). A rough outline is as follows:

  do
    Load existing value
    New value = existing value + increment
    return value = compare and swap(existing value, new value)
  while (return value != existing value)

The compare-and-swap instruction atomically swaps the value in a register with the value held in memory, but only if the value held in memory equals the value held in another register. The return value is the value held in memory. The pseudo-code uses this so that the value stored back to memory will only be the incremented value if the current value in memory is equal to the value before the increment. If the value held in memory is not the one that was expected, the code retries the operation until it does succeed.

Breaking the compare-and-swap instruction into pseudo code looks like:

  CAS (Value held in memory, Old value, New value)
    Existing value = *Value held in memory;
    if (Existing value == Old value)
      *Value held in memory = New value;
    return Existing value;
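The same retry loop can also be written portably using C11 atomics as a stand-in for the SPARC cas instruction (a sketch; the function name is mine):

```c
#include <stdatomic.h>

/* Atomic add implemented as a compare-and-swap retry loop, mirroring
   the pseudo-code above. On failure, atomic_compare_exchange_weak
   reloads the current memory value into 'existing', so the next
   attempt retries with the latest value. */
int cas_add(_Atomic int *target, int increment) {
    int existing = atomic_load(target);
    int desired;
    do {
        desired = existing + increment;
    } while (!atomic_compare_exchange_weak(target, &existing, desired));
    return desired;
}
```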

One of the questions from last week was how to use atomic operations in Solaris 9 if they are only available on Solaris 10. The answer is to write your own. The following code snippet demonstrates how to write an atomic increment operation:

.inline atomic_add,8
  ld [%o0],%o2       /* Load existing value            */
1:
  add %o2, %o1, %o3  /* Generate new value             */
  cas [%o0],%o2,%o3  /* Swap memory contents           */
  cmp %o2, %o3       /* Compare old value with return  */
  bne 1b             /* Fail if the old value does not */
                     /* equal the value returned from  */
                     /* memory                         */
  mov %o3,%o2        /* Delay slot: retry using latest */
                     /* value from memory              */
.end

It's probably a good idea to read this article on atomic operations on SPARC, and it may also be useful to read up on inline templates.

Friday Jul 11, 2008

Reading the %tick counter

It's often necessary to time the duration of a piece of code. To do this I often use gethrtime(), which is typically a quick call. Obviously, the faster the call, the more accurate I can expect my timing to be. Sometimes the call to gethrtime() is too long (or perhaps too costly, because it's done so frequently). The next thing to try is to read the %tick register.

The %tick register is a 64-bit register that gets incremented on every cycle. Since it's stored on the processor, reading the value of the register is a low cost operation. The only complexity is that, being a 64-bit value, it needs special handling in 32-bit code, where a 64-bit return value from a function is passed in the %o0 (upper bits) and %o1 (lower bits) registers.
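For reference, reassembling the two 32-bit halves into a 64-bit value looks like this in C (a sketch; names are mine):

```c
#include <stdint.h>

/* How a 32-bit caller reassembles the split return value:
   %o0 holds the upper 32 bits, %o1 the lower 32 bits. */
uint64_t combine(uint32_t upper, uint32_t lower) {
    return ((uint64_t)upper << 32) | lower;
}
```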

The inline template to read the %tick register in a 64-bit code is very simple:

.inline tk,0
   rd %tick,%o0
.end

The 32-bit version requires two more instructions to get the return value into the appropriate registers:

.inline tk,0
   rd %tick,%o2
   srlx %o2,32,%o0
   sra %o2,0,%o1
.end

Here's an example code which uses the %tick register to get the current value plus an estimate of the cost of reading the %tick register:


#include <stdio.h>

long long tk();

void main()
{
  printf("value=%llu duration=%llu\n", tk(), -tk()+tk());
}

The compile line is:

$ cc -O tk.c

The output should be something like:

$ a.out
value=4974674895272956 duration=32

Indicating that 32 cycles (in this instance) elapsed between the two reads of the %tick register. Looking at the disassembly, there are certainly a number of cycles of overhead:

        10bbc:  95 41 00 00  rd         %tick, %o2
        10bc0:  91 32 b0 20  srlx       %o2, 32, %o0
        10bc4:  93 3a a0 00  sra        %o2, 0, %o1
        10bc8:  97 32 60 00  srl        %o1, 0, %o3 
        10bcc:  99 2a 30 20  sllx       %o0, 32, %o4
        10bd0:  88 12 c0 0c  or         %o3, %o4, %g4
        10bd4:  c8 73 a0 60  stx        %g4, [%sp + 96]
        10bd8:  95 41 00 00  rd         %tick, %o2

This overhead can be reduced by treating the %tick register as a 32-bit read, and effectively ignoring the upper bits. For very short duration codes this is probably acceptable, but is unsuitable for longer running code blocks. With this (inelegant hack) the following code is generated:

        10644:  91 41 00 00  rd         %tick, %o0
        10648:  b0 07 62 7c  add        %i5, 636, %i0
        1064c:  b8 10 00 08  mov        %o0, %i4
        10650:  91 41 00 00  rd         %tick, %o0

Which usually returns a value of 8 cycles on the same platform.

Thursday May 22, 2008

Alloca on SPARC

On SPARC there's a slight complication: the load and store instructions have a signed immediate offset range of -4096 to +4095. To use a larger offset than that it is necessary to put the offset into a register and use that to calculate the address.

If the size of the local variables are less than 4KB, then a load or store instruction can use the frame pointer together with an offset in order to access the memory on the stack. If the stack is greater than 4KB, then it's possible to use the frame pointer to access memory in the upper 4KB range, and the stack pointer to access memory in the lower 4KB. Rather like this diagram shows:

frame pointer -> top of stack
               | Upper 4KB can be accessed
               v using offset + frame pointer
               | Lower 4KB can be accessed
               v using offset + stack pointer
stack pointer -> bottom of stack

The complication is when temporary memory is allocated on the stack using alloca, and the size of the local variables exceeds 4KB. In this case it's not possible to just shift the stack pointer downwards - since that may cause variables that were previously accessed through the stack pointer to fall outside the 4KB offset range, or it would change the offset from the stack pointer where variables are stored (by an amount which may only be known at runtime). Either of these situations would not be good.

Instead of just shifting the stack pointer, a slightly more complex operation has to be carried out. The memory gets allocated in the middle of the range, and the lower memory gets shifted (or copied) downwards. The end result is something like this:

frame pointer -> top of stack
               | Upper 4KB can be accessed
               v using offset + frame pointer
               [Alloca'd memory]
               | Lower 4KB can be accessed
               v using offset + stack pointer
stack pointer -> bottom of stack

The routine that does this manipulation of memory is called __builtin_alloca. You can see in the code that it moves the stack pointer, and then has a copy loop to move the contents of the stack.

Unfortunately, the need to copy the data means that it takes longer to allocate memory. So if the function __builtin_alloca appears in a profile, the first thing to do is to see whether it's possible to reduce the amount of local variables/stack space needed for the routine.
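As a hypothetical illustration (the names and sizes here are mine), a routine of this shape - more than 4KB of fixed locals plus a runtime-sized alloca - is the kind of code that ends up going through __builtin_alloca:

```c
#include <alloca.h>
#include <string.h>

/* The fixed locals already exceed the 4KB offset range, and the alloca
   size is only known at runtime, so the stack contents below the
   allocation have to be shifted downwards. */
int sum_with_scratch(const int *src, int n) {
    char locals[8192];                       /* >4KB of fixed locals */
    int *scratch = alloca(n * sizeof(int));  /* runtime-sized allocation */
    memset(locals, 0, sizeof(locals));
    int total = locals[0];
    for (int i = 0; i < n; i++) {
        scratch[i] = src[i];
        total += scratch[i];
    }
    return total;
}
```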

As a footnote, take a look at the equivalent code for the x86 version of __builtin_alloca. The x86, being CISC, does not have the limit on the size of the offset that can be used. Hence the x86 code does not need the copy routine to move variables in the stack around.

Wednesday May 07, 2008

When to use membars

membar instructions are SPARC assembly language instructions that enforce memory ordering. They tell the processor to ensure that memory operations are completed before it continues execution. However, the basic rule is that the instructions are usually only necessary in "unusual" circumstances - which fortunately will mean that most people don't encounter them.

The UltraSPARC Architecture manual documents the situation very well in section 9.5. It gives these rules which cover the default behaviour:

  • Each load instruction behaves as if it were followed by a MEMBAR #LoadLoad and #LoadStore.
  • Each store instruction behaves as if it were followed by a MEMBAR #StoreStore.
  • Each atomic load-store behaves as if it were followed by a MEMBAR #LoadLoad, #LoadStore, and #StoreStore.

There's a table in section 9.5.3 which covers when membars are necessary. Basically, membars are necessary for ordering of block loads and stores, and for ordering non-cacheable loads and stores. There is an interesting note where it indicates that a membar is necessary to order a store followed by a load to a different address; if the address is the same, the load will get the correct data. This at first glance seems odd - why worry about whether the store is complete if the load is of independent data? However, I can imagine this being useful in situations where the same physical memory is mapped using different virtual address ranges - not something that happens often, but it could happen in the kernel.
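For comparison, in portable C the rough equivalent of a full membar between a store and a subsequent load is a sequentially consistent fence (a sketch; with plain, non-atomic variables this is only illustrative):

```c
#include <stdatomic.h>

/* Store to one address, then load from a different address, with a
   full fence between the two so the store is ordered before the load. */
int store_then_load(int *a, int *b) {
    *a = 1;
    atomic_thread_fence(memory_order_seq_cst);  /* full barrier */
    return *b;
}
```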

As a footnote, the equivalent x86 instruction is the mfence. There's a good discussion of memory ordering in section 7.2 of the Intel Systems Programming Guide.

There's some more discussion of this topic on Dave Dice's weblog.

Tuesday Mar 25, 2008

Conference schedule

The next two months are likely to be a bit hectic for me. I'm presenting at three different conferences, as well as a chat session in Second Life. So I figured I'd put the information up in case anyone reading this is also going to one or other of the events. So in date order:

I'll be talking about parallelisation at the various conferences, but the talks will be different. The multi-core expo talk focuses on microparallelisation. The ESC talk will probably be higher level, and the CommunityOne talk will probably be wider ranging, and I hope more interactive.

In the Second Life event I'll be talking about the book, although the whole idea of appearing is to do Q&A, so I hope that will be more of a discussion.

Thursday Mar 20, 2008

The much maligned -fast

The compiler flag -fast gets an unfair rap. Even the compiler reports:

cc: Warning: -xarch=native has been explicitly specified, or 
implicitly specified by a macro option, -xarch=native on this 
architecture implies -xarch=sparcvis2 which generates code that 
does not run on pre UltraSPARC III processors

which is hardly fair given that the UltraSPARC III line came out about 8 years ago! So I want to quickly discuss what's good about the option, and what reasons there are to be cautious.

The first thing to talk about is the warning message. -xtarget=native is a good option to use when the target platform is also the deployment platform. For me, this is the common case, but for people producing applications that are more generally deployed, it's not the common case. The best thing to do to avoid the warning and produce binaries that work with the widest range of hardware is to add the flag -xtarget=generic after -fast (compiler flags are parsed from left to right, so the rightmost flag is the one that gets obeyed). The generic target represents a mix of all the important processors, the mix produces code that should work well on all of them.

The next option which is in -fast for C that might cause some concern is -xalias_level=basic. This tells the compiler to assume that pointers of different basic types (e.g. integers, floats etc.) don't alias. Most people code to this, and the C standard actually has higher demands on the level of aliasing the compiler can assume. So code that conforms to the C standard will work correctly with this option. Of course, it's still worth being aware that the compiler is making the assumption.
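As an illustration (my example, not from the compiler documentation), this is the kind of code that relies on the assumption: under -xalias_level=basic the compiler may assume ip and fp never point to the same memory.

```c
/* With basic aliasing, the compiler may assume the float store cannot
   modify *ip, and so may keep the value 1 in a register for the return
   rather than reloading it from memory. */
int assume_no_alias(int *ip, float *fp) {
    *ip = 1;
    *fp = 2.0f;   /* assumed not to alias *ip */
    return *ip;
}
```

Called with genuinely distinct objects, as the C standard requires, the assumption holds and the code behaves identically at any aliasing level.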

The final area is floating point simplification. That's the flag -fsimple=2, which allows the compiler to reorder floating point expressions; -fns, which allows the processor to flush subnormal numbers to zero; and some other flags that use faster floating point libraries or inline templates. I've previously written about my rather odd views on floating point maths. Basically it comes down to: if these options make a difference to the performance of your code, then you should investigate why they make a difference.

Since -fast contains a number of flags which impact performance, it's probably a good plan to identify exactly those flags that do make a difference, and use only those. A tool like ats can really help here.

Performance tuning recipe

Dan Berger posted a comment about the compiler flags we'd used for Ruby. Basically, we've not done compiler flag tuning yet, so I'll write a quick outline of the first steps in tuning an application.

  • First of all profile the application with whatever flags it usually builds with. This is partly to get some idea of where the hot code is, but it's also useful to have some idea of what you're starting with. The other benefit is that it's tricky to debug build issues if you've already changed the defaults.
  • It should be pretty easy at this point to identify the build flags. Probably they will flash past on the screen, or in the worst case, they can be extracted (from non-stripped executables) using dumpstabs or dwarfdump. It can be necessary to check that the flags you want to use are actually the ones being passed into the compiler.
  • Of course, I'd certainly use spot to get all the profiles. One of the features spot has that is really useful is to archive the results, so that after multiple runs of the application with different builds it's still possible to look at the old code, and identify the compiler flags used.
  • I'd probably try -fast, which is a macro flag, meaning it enables a number of optimisations that typically improve application performance. I'll probably post later about this flag, since there's quite a bit to say about it. Performance under the flag should give an idea of what to expect with aggressive optimisations. If the flag does improve the application's performance, then you probably want to identify the exact options that provide the benefit and use those explicitly.
  • In the profile, I'd be looking for routines which seem to be taking too long, and I'd be trying to figure out what was causing the time to be spent. For SPARC, the execution count data from bit that's shown in the spot report is vital for distinguishing between code that runs slowly and code that runs fast but is executed many times. I'd probably try various other options based on what I saw. Memory stalls might make me try prefetch options; TLB misses would make me tweak the -xpagesize options.
  • I'd also want to look at using crossfile optimisation, and profile feedback, both of which tend to benefit all codes, but particularly codes which have a lot of branches. The compiler can make pretty good guesses on loops ;)

These directions are more a list of possible experiments than necessarily an item-by-item checklist, but they form a good basis. And they are not an exhaustive list...

Cross-linking support in Nevada

Ali Bahrami has just written an interesting post about cross-linking support going into Nevada. This is the facility to enable the linking of SPARC object files to produce SPARC executables on an x86 box (or the other way around).

Friday Mar 14, 2008

32-bits good, 64-bits better?

One of the questions people ask is when to develop 64-bit apps vs 32-bit apps. The answer is not totally clear cut; it depends on the application and the platform. So here's my take on it.

First, let's discuss SPARC. The SPARC V8 architecture defines the 32-bit SPARC ISA (Instruction Set Architecture). This was found on a selection of SPARC processors that appeared before the UltraSPARC I, i.e. quite a long time ago. Later the SPARC V9 architecture appeared. (As an aside, these are open standards, so anyone can download the specs and make one; of course, not everyone has the time and resources to do that, but it's nice in theory ;)

The SPARC V9 architecture added a few instructions, but mainly added the capability to use 64-bit addressing. The ABI (Application Binary Interface) was also improved (floating point values passed in FP registers rather than the odd V8 choice of using the integer registers). The UltraSPARC I and onwards have implemented the V9 architecture, which means that they can execute both V8 and V9 binaries.

One of the things that it's easy to get taken in by is the Animal Farm idea that if 32 bits is good, 64 bits must be better. The trouble with a 64-bit address space is that it takes more instructions to set up an address, pointers go from 4 bytes to 8 bytes, and the memory footprint of the application increases.

A hybrid mode was also defined, which took the 32-bit address space, together with a selection of the new instructions. This was called v8plus, or more recently sparcvis. This has been the default architecture for SPARC binaries for quite a while now, and it combines the smaller memory footprint of SPARC V8 with the more recent instructions from SPARC V9. For applications that don't require 64-bit address space, v8plus or sparcvis is the way to go.

Moving to the x86 side, things are slightly more complex. The high level view is similar. You have the x86 ISA, or IA32 as it's been called. Then you have the 64-bit ISA, called AMD64 or EM64T. This gives you 64-bit addressing, a new ABI, a number of new instructions, and perhaps most importantly a bundle of new registers. The x86 has impressively few registers; EM64T fixes that quite nicely.

In the same way as SPARC, moving to a 64-bit address space does cost some performance due to the increased memory footprint. However, the x64 gains a number of additional registers, which usually more than make up for this loss in performance. So the general rule is that 64-bits is better, unless the application makes extensive use of pointers.

Unlike SPARC, EM64T does not currently provide a mode which gives the instruction set extensions and registers with a 32-bit address space.

Thursday Mar 13, 2008

SPARC64 VI docs

Programmer's guide for the Fujitsu SPARC64 VI.

Friday Mar 07, 2008

Ruby performance gains on SPARC

The programming language Ruby is run on a VM. So the VM is responsible for context switches as well as garbage collection. Consequently, the code contains calls to flush register windows. A colleague of mine, Miriam Blatt, has been examining the code and we think we've found some places where the calls to flush register windows are unnecessary. The code appears in versions 1.8/1.9 of Ruby, but I'll focus on 1.8.* in this discussion.

As outlined in my blog entry on register windows, the act of flushing them is both high cost and rarely needed. The key points at which it is necessary to flush the register windows to memory are on context switches and before garbage collection.

Ruby defines a macro called FLUSH_REGISTER_WINDOWS in defines.h. The macro only does something on IA64 and SPARC, so the changes I'll discuss here are defined so that they leave the behaviour on IA64 unchanged. My suspicion is that the changes are equally valid for IA64, but I lack an IA64 system to check them on.

The FLUSH_REGISTER_WINDOWS macro gets used in eval.c in the EXEC_TAG macro, THREAD_SAVE_CONTEXT macro, rb_thread_save_context routine, and rb_thread_restore_context routine. (There's also a call in gc.c for the garbage collection.)

The first thing to notice is that the THREAD_SAVE_CONTEXT macro calls rb_thread_save_context, so the FLUSH_REGISTER_WINDOWS call in the THREAD_SAVE_CONTEXT macro is unnecessary (the register windows have already been flushed). However, we've not seen this particular flush cause any performance issues in our tests (although it's possible that the tests didn't stress multithreading).

The more important call is the one in EXEC_TAG. This is executed very frequently in Ruby codes, but this flush does not appear to be at all necessary. It is neither a context switch nor the start of garbage collection. Removing this call to flush register windows leads to significant performance gains (upwards of 10% when measured on an older V880 box; some of the benchmarks nearly doubled in performance).

The source code modifications for 1.8.6 are as follows:

$ diff defines.h.orig defines.h.mod
> #  define EXEC_FLUSH_REGISTER_WINDOWS ((void)0)

The change to defines.h adds new variants of the FLUSH_REGISTER_WINDOWS macro to be used for the EXEC_TAG and THREAD_SAVE_CONTEXT macros. To preserve the current behaviour on IA64, they are defined as ((void)0) on all architectures except IA64, where they are defined as FLUSH_REGISTER_WINDOWS.

$ diff eval.c.orig eval.c.mod
< #define EXEC_TAG()    (FLUSH_REGISTER_WINDOWS, ruby_setjmp(((void)0), prot_tag->buf))
> #define EXEC_TAG()    (EXEC_FLUSH_REGISTER_WINDOWS, ruby_setjmp(((void)0), prot_tag->buf))
<     (rb_thread_switch((FLUSH_REGISTER_WINDOWS, ruby_setjmp(rb_thread_save_context(th), (th)->context))))
>     (rb_thread_switch((SWITCH_FLUSH_REGISTER_WINDOWS, ruby_setjmp(rb_thread_save_context(th), (th)->context))))

The changes to eval.c just use the new macros instead of the old FLUSH_REGISTER_WINDOWS call.

These code changes have worked on all the tests we've used (including `gmake test-all`). However, I can't be certain that there is not a workload which requires these flushes. This appears to be the putback that added the flush call to EXEC_TAG, and the comment suggests that the change may not be necessary. I'd love to hear comments either agreeing with the analysis, or pointing out why the flushes are necessary.

Update: adding diff -u output
$ diff -u defines.h.orig defines.h.mod
--- defines.h.orig      Tue Mar  4 16:32:05 2008
+++ defines.h.mod       Wed Mar  5 14:22:06 2008
@@ -226,12 +226,18 @@
 #  define FLUSH_REGISTER_WINDOWS flush_register_windows()
+#  define EXEC_FLUSH_REGISTER_WINDOWS ((void)0)
 #elif defined(__ia64)
 void *rb_ia64_bsp(void);
 void rb_ia64_flushrs(void);
 #  define FLUSH_REGISTER_WINDOWS rb_ia64_flushrs()
 #  define FLUSH_REGISTER_WINDOWS ((void)0)

 #if defined(DOSISH)
$ diff -u eval.c.orig eval.c.mod
--- eval.c.orig Tue Mar  4 16:32:00 2008
+++ eval.c.mod  Wed Mar  5 14:22:13 2008
@@ -1022,7 +1022,7 @@
 #define PROT_LAMBDA INT2FIX(2) /* 5 */
 #define PROT_YIELD  INT2FIX(3) /* 7 */

-#define EXEC_TAG()    (FLUSH_REGISTER_WINDOWS, ruby_setjmp(((void)0), prot_tag->buf))
+#define EXEC_TAG()    (EXEC_FLUSH_REGISTER_WINDOWS, ruby_setjmp(((void)0), prot_tag->buf))

 #define JUMP_TAG(st) do {              \
     ruby_frame = prot_tag->frame;      \
@@ -10287,7 +10287,7 @@

 #define THREAD_SAVE_CONTEXT(th) \
-    (rb_thread_switch((FLUSH_REGISTER_WINDOWS, ruby_setjmp(rb_thread_save_context(th), (th)->context))))
+    (rb_thread_switch((SWITCH_FLUSH_REGISTER_WINDOWS, ruby_setjmp(rb_thread_save_context(th), (th)->context))))

 NORETURN(static void rb_thread_restore_context _((rb_thread_t,int)));
 NORETURN(NOINLINE(static void rb_thread_restore_context_0(rb_thread_t,int,void*)));

Flush register windows

The SPARC architecture has an interesting feature called Register Windows. The idea is that the processor should contain multiple sets of registers on chip. When a new routine is called, the processor can give a fresh set of registers to the new routine, preserving the value of the old registers. When the new routine completes and control returns to the calling routine, the register values for the old routine are also restored. The idea is for the chip not to have to save and load the values held in registers whenever a routine is called; this reduces memory traffic and should improve performance.

The trouble with register windows is that each chip can only hold a finite number of them. Once all the register windows are full, the processor has to spill a complete set of registers to memory. This is in contrast with the situation where the program is responsible for spilling and filling registers - the program need only spill a single register if that is all that the routine requires.

Most SPARC processors have about seven sets of register windows, so if the program remains in a call stack depth of about seven, there is no register spill/fill cost associated with calls of other routines. Beyond this stack depth, there is a cost for the spills and fills of the register windows.

The SPARC architecture book contains a more detailed description of register windows in section 5.2.2.

Most of the time software is completely unaware of this architectural decision; in fact, user code should never have to be aware of it. There are two situations where software does need to know about register windows, and these really only impact virtual machines or operating systems:

  • Context switches. In a context switch the processor changes to executing another software thread, so all the state from that thread needs to be saved for the thread to later resume execution. Note that setjmp and longjmp which are sometimes used as part of code to implement context switching already have the appropriate flushes in them.
  • Garbage collection. Garbage collection involves inspecting the state of the objects held in memory and determining whether each object is live or dead. Live objects are identified by having other live objects point to them. So all the registers need to be stored in memory so that they can be inspected to check whether they point to any objects that should be considered live.

The SPARC V9 instruction flushw will cause the processor to store all the register windows in a thread to memory. For SPARC V8, the same effect is attained through trap 0x03. Either way, the cost can be quite high, since the processor needs to store up to about seven sets of register windows to memory. Each set is 16 8-byte registers, so a full flush can move nearly a kilobyte, which results in potentially a lot of memory traffic and cycles.

Tuesday Feb 12, 2008

CMT related books

There is a side bar featuring books that are relevant to CMT. Mine is featured as one of the links. The other two books are Computer Architecture - a quantitative approach, which features the UltraSPARC T1, and Chip-Multiprocessor Architecture, which is by members of the team responsible for the UltraSPARC T1 processor.

Friday Feb 01, 2008

The meaning of -xmemalign

I made some comments on a thread on the forums about memory alignment on SPARC and the -xmemalign flag. I've talked about memory alignment before, but this time the discussion was more about how the flag works. In brief:

  • The flag has two parts -xmemalign=[1|2|4|8][i|s]
  • The number specifies the alignment that the compiler should assume when compiling an object file. So if the compiler is not certain that the current variable is correctly aligned (say it's accessed through a pointer) then the compiler will assume the alignment given by the flag. Take a single precision floating point value that takes four bytes. Under -xmemalign=1[i|s] the compiler will assume that it is unaligned, so will issue four single-byte loads to load the value. If the alignment is specified as -xmemalign=2[i|s] the compiler will assume two-byte alignment, so will issue two loads to get the four-byte value.
  • The suffix [i|s] tells the compiler how to behave if there is a misaligned access. For 32-bit codes the default is i which fixes the misaligned access and continues. For 64-bit codes the default is s which causes the app to die with a SIGBUS error. This is the part of the flag that has to be specified at link time because it causes different code to be linked into the binary depending on the desired behaviour. The C documentation captures this correctly, but the C++ and Fortran docs will be updated.

Tuesday Oct 16, 2007

Building shared libraries for SPARCV9

By default, the SPARC compiler assumes that SPARCV9 objects are built with -xcode=abs44, which means that 44 bits are used to hold the absolute address of any object. Shared libraries should be built using position independent code, either -xcode=pic13 or -xcode=pic32 (replacing the deprecated -Kpic and -KPIC options).

If one of the object files in a library is built with abs44, then the linker will report the following error:

ld: fatal: relocation error: R_SPARC_H44: file file.o: symbol <unknown>:
relocations based on the ABS44 coding model can not be used in building
a shared object
Further details on this can be found in the compiler documentation.

Tuesday Sep 25, 2007

Solaris Application Programming book

I'm thrilled to see that my book is being listed for pre-order on Amazon in the US. It seems to take about a month for it to travel the Atlantic to Amazon UK.

Friday Aug 31, 2007

Register windows and context switches

Interesting paper on register windows and context switching.

Total Store Ordering (TSO)

Total Store Ordering is the name for the ordering implemented on multi-processor SPARC systems; the SPARC architecture also defines a number of weaker orderings. TSO guarantees that stores from a processor become visible to other processors in the order in which they occurred, that loads are not reordered with other loads, and that a load sees the value of any earlier store by the same processor. The one reordering TSO does allow is for a later load to complete before an earlier store to a different address has become globally visible (the store can sit in the store buffer), which is why a membar is still sometimes needed. There's a very helpful description of this in the SPARC architecture documentation on pages 418-427 (section 9.5). Particularly useful is table 9-2 on page 422, which gives a useful summary of where membar operations are required.


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge

