Thursday Dec 11, 2014

Checking whether hardware supports crypto instructions

A quick example of how to tell if the machine that you're running on supports crypto instructions.

The 2011 SPARC Architecture manual tells you to read the cfr register before using the instruction. The cfr register contains a bit for every implemented crypto instruction. However, the cfr register is not implemented on all processors. So you would need to check whether this register is implemented before reading it....

So there has to be a better way. Fortunately, Solaris implements a getisax() call which provides this information without the user needing to muck around with the low level details. The following code shows how this call can be used to check whether the AES instruction is implemented or not:

#include <sys/auxv.h>
#include <stdio.h>

void main()
  unsigned int array[10];
  unsigned int count = getisax(array,10);
  if (count>0)
    printf(" AES: ");
    if (array[0] & AV_SPARC_AES) { printf("Yes\n"); } else { printf("No\n"); }
    printf("Error: getisax() call returned no results\n");

Wednesday Nov 19, 2014

Writing inline templates

Writing some inline templates today... I've written about doing this kind of stuff in the past here and, in more detail, here.

I happen to need to pass a bundle of parameters on to the routine. The best way of checking how the parameters will be passed is to get the compiler to provide some initial template. Here's an example routine:

int parameters (int p0, int * p1, int * p2, int* p3, int * p4, int * p5, int * p6, int p7)
  return p0 + *p1 + *p2 + *p3 + *p4 + ((*p5)<<2) + ((*p6)<<3) + p7*p7;

In the routine I've tried to handle some of the parameters differently. I know that the first parameters get passed in registers, and then the later ones get passed on the stack. By handling them differently I can work out which loads from the stack correspond to which variables. The disassembly looks like:

-bash-4.1$ cc -g -O parameters.c -c
-bash-4.1$ dis -F parameters parameters.o
disassembly for parameters.o

    parameters:             ca 02 60 00  ld        [%o1], %g5
    parameters+0x4:         c4 02 e0 00  ld        [%o3], %g2
    parameters+0x8:         c2 02 a0 00  ld        [%o2], %g1
    parameters+0xc:         c6 03 a0 60  ld        [%sp + 0x60], %g3  // load of p7
    parameters+0x10:        88 02 00 05  add       %o0, %g5, %g4
    parameters+0x14:        d0 03 60 00  ld        [%o5], %o0
    parameters+0x18:        ca 03 20 00  ld        [%o4], %g5
    parameters+0x1c:        92 00 80 01  add       %g2, %g1, %o1
    parameters+0x20:        87 38 e0 00  sra       %g3, 0x0, %g3
    parameters+0x24:        82 01 00 09  add       %g4, %o1, %g1
    parameters+0x28:        d2 03 a0 5c  ld        [%sp + 0x5c], %o1 // load of p6
    parameters+0x2c:        88 48 c0 03  mulx      %g3, %g3, %g4     // %g4 = %g3*%g3
    parameters+0x30:        97 2a 20 02  sll       %o0, 0x2, %o3
    parameters+0x34:        94 00 40 05  add       %g1, %g5, %o2
    parameters+0x38:        da 02 60 00  ld        [%o1], %o5       
    parameters+0x3c:        84 02 c0 0a  add       %o3, %o2, %g2
    parameters+0x40:        99 2b 60 03  sll       %o5, 0x3, %o4     // %o4 = %o5<<3
    parameters+0x44:        90 00 80 0c  add       %g2, %o4, %o0
    parameters+0x48:        81 c3 e0 08  retl
    parameters+0x4c:        90 02 00 04  add       %o0, %g4, %o0

Thursday Nov 13, 2014

Software in Silicon Cloud

I missed this press release about Software in Silicon Cloud. It's the announcement for a service where you can try out a SPARC M7 processor. There's an accompanying website which has the sign up plus some more information about the service.

What's particularly exciting is that it talks a bit more about Application Data Integrity (ADI). Larry Ellison called this "the most important piece of engineering we’ve done in a long, long time.".

Incorrect handling of pointers is a large contributor to bugs in software. ADI tackles this by making the hardware check that the pointer being used is valid for the region of memory it is pointing to. If it's not valid the hardware flags it as an error. Since it's done by hardware, there's minimal performance impact - it's at hardware speed, so developers can check their application in realtime.

There's a nice demo of how ADI protects against exploits like HeartBleed.

Wednesday Nov 12, 2014

Oracle Solaris Studio playlist

There's an extensive list of Solaris Studio videos on youtube. In particular there's a bunch of tutorials covering the features of the IDE. The IDE doesn't often get the attention it deserves. It's based off NetBeans, and is full of useful code refactoring tools, navigation tools, etc. To find out more, take a look at some of the videos.

Tuesday Nov 11, 2014

New Performance Analyzer Overview screen

I love using the Performance Analyzer, but the question I often get when I show it to people, is "Where do I start?". So one of the improvements in Solaris Studio 12.4 is an Overview screen to help people get started with the tool. Here's what it looks like:

The reason this is important, is that many applications spend time in various place - like waiting on disk, or in user locks - and it's not always obvious where is going to be the most effective place to look for performance gains.

The Overview screen is meant to be the "one-stop" place where people can find out what their application is doing. When we put it back into the product I expected it to be the screen that I glanced at then never went back to. I was most surprised when this turned out not to be the case.

During performance analysis, I'm often exploring different ideas as to where it might be possible to get performance improvements. The Overview screen allows me to select the metrics that I'm interested in, then take a look at the resulting profiles. So I might start with system time, and just enable the system time metrics. Once I'm done with that, I might move on to user time, and select those metrics. So what was surprising about the Overview screen was how often I returned to it to change the metrics I was using.

So what does the screen contain? The overview shows all the available metrics. The bars indicate which metrics contribute the most time. So it's easy to pick (and explore) the metrics that contribute the most time.

If the profile contains performance counter metrics, then those also appear. If the counters include instructions and cycles, then the synthetic CPI/IPC metrics are also available. The Overview screen is really useful for hardware counter metrics.

I use performance counters in a couple of ways: to confirm a hypothesis about performance or to estimate time spent on a type of event. For example, if I think a load is taking a lot of time due to TLB misses, then profiling on the TLB miss performance counter will tell me whether that load has a lot of misses or not. Alternatively, if I've got TLB miss counter data, then I can scale this by the cost per TLB miss, and get an estimate of the total runtime lost to TLB misses.

Where the Overview screen comes into this is that I will often want to minimise the number of columns of data that are shown (to fit everything onto my monitor), but sometimes I want to quickly enable a counter to see whether that event happens at the bit of code where I'm looking. Hence I end up flipping to the Overview screen and then returning to the code.

So what I thought would be a nice feature, actually became pretty central to my work-flow.

I should have a more detailed paper about the Overview screen up on OTN soon.

Performance made easy

The big news of the day is that Oracle Solaris Studio 12.4 is available for download. I'd like to thank all those people who tried out the beta releases and gave us feedback.

There's a number of things that are new in this release. The most obvious one is C++11 support, I've written a bit about the lambda expression support, tuples, and unordered containers.

My favourite tool, the Performance Analyzer, has also had a bit of a facelift. I'll talk about the Overview screen in a separate post (and in an article), but there's some other fantastic features. The syntax highlighting, and hyperlinking, has made navigating profiles much easier. There's been a large number of improvements in filtering - a feature that's been in the product a long time, but these changes elevate it to being much more accessible (an article on filtering is long overdue!). There's also the default hardware counters - which makes it a no-brainer to get hardware counter data, which is really helpful in understanding exactly what an application is doing.

Over the development cycle I've made much use of the other tools. The Thread Analyzer for identifying data races has had some improvements. The Code Analyzer tools have made some great gains in rapidly identifying potential coding errors. And so on....

Anyway, please download the new version, try it out, try out the tools, and let us know what you think of it.

Tuesday Nov 04, 2014

SPARC Software in Silicon

Short video by Juan Loaiza about the Software in Silicon work in the upcoming SPARC processor.

Friday Oct 10, 2014

OpenWorld and JavaOne slides available for download

Thanks everyone who attended my talks last week. My slides for OpenWorld and JavaOne are available for download:

Sunday Sep 28, 2014

SPARC Processor Documentation

I'm pretty excited, we've now got documentation up for the SPARC processors. Take a look at the SPARC T4 supplement, the SPARC T4 performance instrumentation supplement, the SPARC M5 supplement, or the familiar SPARC 2011 Architecture.

Friday Aug 15, 2014

Providing feedback on the Solaris Studio 12.4 Beta

Obviously, the point of the Solaris Studio 12.4 Beta programme was for everyone to try out the new version of the compiler and tools, and for us to gather feedback on what was working, what was broken, and what was missing. We've had lots of useful feedback - you can see some of it on the forums. But we're after more.

Hence we have a Solaris Studio 12.4 Beta survey where you can tell us more about your experiences. Your comments are really helpful to us. Thanks.

Friday Jul 11, 2014

Studio 12.4 Beta Refresh, performance counters, and CPI

We've just released the refresh beta for Solaris Studio 12.4 - free download. This release features quite a lot of changes to a number of components. It's worth calling out improvements in the C++11 support and other tools. We've had few comments and posts on the Studio forums, and a bunch of these have resulted in improvements in this refresh.

One of the features that is deserving of greater attention is default hardware counters in the Performance Analyzer.

Default hardware counters

There's a lot of potential hardware counters that you can profile your application on. Some of them are easy to understand, some require a bit more thought, and some are delightfully cryptic (for example, I'm sure that op_stv_wait_sxmiss_ex means something to someone). Consequently most people don't pay them much attention.

On the other hand, some of us get very excited about hardware performance counters, and the information that they can provide. It's good to be able to reveal that we've made some steps along the path of making that information more generally available.

The new feature in the Performance Analyzer is default hardware counters. For most platforms we've selected a set of meaningful performance counters. You get these if you add -h on to the flags passed to collect. For example:

$ collect -h on ./a.out

Using the counters

Typically the counters will gather cycles, instructions, and cache misses - these are relatively easy to understand and often provide very useful information. In particular, given a count of instructions and a count of cycles, it's easy to compute Cycles per Instruction (CPI) or Instructions per Cycle(IPC).

I'm not a great fan of CPI or IPC as absolute measurements - working in the compiler team there are plenty of ways to change these metrics by controlling the I (instructions) when I really care most about the C (cycles). But, the two measurements have a very useful purpose when examining a profile.

A high CPI means lots cycles were spent somewhere, and very few instructions were issued in that time. This means lots of stall, which means that there's some potential for performance gains. So a good rule of thumb for where to focus first is routines that take a lot of time, and have a high CPI.

IPC is useful for a different reason. A processor can issue a maximum number of instructions per cycle. For example, a T4 processor can issue two instructions per cycle. If I see an IPC of 2 for one routine, I know that the code is not stalled, and is limited by instruction count. So when I look at a code with a high IPC I can focus on optimisations that reduce the instruction count.

So both IPC and CPI are meaningful metrics. Reflecting this, the Performance Analyzer will compute the metrics if the hardware counter data is available. Here's an example:

This code was deliberately contrived so that all the routines had ludicrously high CPI. But isn't that cool - I can immediately see what kinds of opportunities might be lurking in the code.

This is not restricted to just the functions view, CPI and/or IPC are presented in every view - so you can look at CPI for each thread, line of source, line of disassembly. Of course, as the counter data gets spread over more "lines" you have less data per line, and consequently more noise. So CPI data at the disassembly level is not likely to be that useful for very short running experiments. But when aggregated, the CPI can often be meaningful even for short experiments.

Friday May 23, 2014

Generic hardware counter events

A while back, Solaris introduced support for PAPI - which is probably as close as we can get to a de-facto standard for performance counter naming. For performance counter geeks like me, this is not quite enough information, I actually want to know the names of the raw counters used. Fortunately this is provided in the generic_events man page:

$ man generic_events
Reformatting page.  Please Wait... done

CPU Performance Counters Library Functions   generic_events(3CPC)

     generic_events - generic performance counter events

     The Solaris  cpc(3CPC)  subsystem  implements  a  number  of
     predefined, generic performance counter events. Each generic
   Intel Pentium Pro/II/III Processor
       Generic Event          Platform Event          Event Mask
     PAPI_ca_shr          l2_ifetch                 0xf
     PAPI_ca_cln          bus_tran_rfo              0x0
     PAPI_ca_itv          bus_tran_inval            0x0
     PAPI_tlb_im          itlb_miss                 0x0
     PAPI_btac_m          btb_misses                0x0
     PAPI_hw_int          hw_int_rx                 0x0

Monday Apr 07, 2014

RAW hazards revisited (again)

I've talked about RAW hazards in the past, and even written articles about them. They are an interesting topic because they are situation where a small tweak to the code can avoid the problem.

In the article on RAW hazards there is some code that demonstrates various types of RAW hazard. One common situation is writing code to copy misaligned data into a buffer. The example code contains a test for this kind of copying, the results from this test, compiled with Solaris Studio 12.3, on my system look like:

Misaligned load v1 (bad) memcpy()
Elapsed = 16.486042 ns
Misaligned load v2 (bad) byte copy
Elapsed = 9.176913 ns
Misaligned load good
Elapsed = 5.243858 ns

However, we've done some work in the compiler on better identification of potential RAW hazards. If I recompile using the 12.4 Beta compiler I get the following results:

Misaligned load v1 (bad) memcpy()
Elapsed = 4.756911 ns
Misaligned load v2 (bad) byte copy
Elapsed = 5.005309 ns
Misaligned load good
Elapsed = 5.597687 ns

All three variants of the code produce the same performance - the RAW hazards have been eliminated!

Thursday Apr 03, 2014

SPARC roadmap

A new SPARC roadmap has been published. We have some very cool stuff coming :)

Wednesday Feb 26, 2014

Multicore Application Programming available in Chinese!

This was a complete surprise to me. A box arrived on my doorstep, and inside were copies of Multicore Application Programming in Chinese. They look good, and have a glossy cover rather than the matte cover of the English version.

Article on RAW hazards

Feels like it's been a long while since I wrote up an article for OTN, so I'm pleased that I've finally got around to fixing that.

I've written about RAW hazards in the past. But I recently went through a patch of discovering them in a number of places, so I've written up a full article on them.

What is "nice" about RAW hazards is that once you recognise them for what they are (and that's the tricky bit), they are typically easy to avoid. So if you see 10 seconds of time attributable to RAW hazards in the profile, then you can often get the entire 10 seconds back by tweaking the code.

Wednesday Sep 18, 2013

SPARC processor documentation

The SPARC processor documentation can be found here. What is really exciting though is that you can finally download the Oracle SPARC Architecture 2011 spec, which describes the current SPARC instruction set.

Friday Aug 09, 2013

How to use a lot of threads ....

The SPARC M5-32 has 1,536 virtual CPUs. There is a lot that you can do with that much resource and a bunch of us sat down to write a white paper discussing the options.

There are a couple of key observations in there. First of all it is quite likely that such a large system will not end up running a single instance of a single application. Therefore it is useful to understand the various options for virtualising the system. The second observation is that there are a number of details to take care of when writing an application that is expected to scale to large numbers of threads.

Anyhow, I hope you find it a useful read.

Tuesday Jun 11, 2013

SPARC family

This is a nice table showing the various SPARC processors being shipped by Oracle.

Wednesday Dec 12, 2012

Compiling for T4

I've recently had quite a few queries about compiling for T4 based systems. So it's probably a good time to review what I consider to be the best practices.

  • Always use the latest compiler. Being in the compiler team, this is bound to be something I'd recommend :) But the serious points are that (a) Every release the tools get better and better, so you are going to be much more effective using the latest release (b) Every release we improve the generated code, so you will see things get better (c) Old releases cannot know about new hardware.
  • Always use optimisation. You should use at least -O to get some amount of optimisation. -xO4 is typically even better as this will add within-file inlining.
  • Always generate debug information, using -g. This allows the tools to attribute information to lines of source. This is particularly important when profiling an application.
  • The default target of -xtarget=generic is often sufficient. This setting is designed to produce a binary that runs well across all supported platforms. If the binary is going to be deployed on only a subset of architectures, then it is possible to produce a binary that only uses the instructions supported on these architectures, which may lead to some performance gains. I've previously discussed which chips support which architectures, and I'd recommend that you take a look at the chart that goes with the discussion.
  • Crossfile optimisation (-xipo) can be very useful - particularly when the hot source code is distributed across multiple source files. If you're allowed to have something as geeky as favourite compiler optimisations, then this is mine!
  • Profile feedback (-xprofile=[collect: | use:]) will help the compiler make the best code layout decisions, and is particularly effective with crossfile optimisations. But what makes this optimisation really useful is that codes that are dominated by branch instructions don't typically improve much with "traditional" compiler optimisation, but often do respond well to being built with profile feedback.
  • The macro flag -fast aims to provide a one-stop "give me a fast application" flag. This usually gives a best performing binary, but with a few caveats. It assumes the build platform is also the deployment platform, it enables floating point optimisations, and it makes some relatively weak assumptions about pointer aliasing. It's worth investigating.
  • SPARC64 processor, T3, and T4 implement floating point multiply accumulate instructions. These can substantially improve floating point performance. To generate them the compiler needs the flag -fma=fused and also needs an architecture that supports the instruction (at least -xarch=sparcfmaf).
  • The most critical advise is that anyone doing performance work should profile their application. I cannot overstate how important it is to look at where the time is going in order to determine what can be done to improve it.

I also presented at Oracle OpenWorld on this topic, so it might be helpful to review those slides.

Thursday Oct 18, 2012

Maximising T4 performance

My presentation from Oracle Open World is available for download.

Tuesday Oct 09, 2012

25 years of SPARC

Looks like an interesting event at the Computer History Museum on 1 November. A panel discussing SPARC at 25: Past, Present and Future. Free sign up.

Friday Sep 14, 2012

Current SPARC Architectures

Different generations of SPARC processors implement different architectures. The architecture that the compiler targets is controlled implicitly by the -xtarget flag and explicitly by the -arch flag.

If an application targets a recent architecture, then the compiler gets to play with all the instructions that the new architecture provides. The downside is that the application won't work on older processors that don't have the new instructions. So for developer's there is a trade-off between performance and portability.

The way we have solved this in the compiler is to assume a "generic" architecture, and we've made this the default behaviour of the compiler. The only flag that doesn't make this assumption is -fast which tells the compiler to assume that the build machine is also the deployment machine - so the compiler can use all the instructions that the build machine provides.

The -xtarget=generic flag tells the compiler explicitly to use this generic model. We work hard on making generic code work well across all processors. So in most cases this is a very good choice.

It is also of interest to know what processors support the various architectures. The following Venn diagram attempts to show this:

A textual description is as follows:

  • The T1 and T2 processors, in addition to most other SPARC processors that were shipped in the last 10+ years supported V9b, or sparcvis2.
  • The SPARC64 processors from Fujitsu, used in the M-series machines, added support for the floating point multiply accumulate instruction in the sparcfmaf architecture.
  • Support for this instruction also appeared in the T3 - this is called sparcvis3
  • Later SPARC64 processors added the integer multiply accumulate instruction, this architecture is sparcima.
  • Finally the T4 includes support for both the integer and floating point multiply accumulate instructions in the sparc4 architecture.

So the conclusion should be:

  • Floating point multiply accumulate is supported in both the T-series and M-series machines, so it should be a relatively safe bet to start using it.
  • The T4 is a very good machine to deploy to because it supports all the current instruction sets.

Thursday Aug 30, 2012

SPARC Architecture 2011

With what appears to be minimal fanfare, an update of the SPARC Architecture has been released. If you ever look at SPARC disassembly code, then this is the document that you need to bookmark. If you are not familiar with it, then it basically describes how a SPARC processor should behave - it doesn't describe a particular implementation, just the "generic" processor. As with all revisions, it supercedes the SPARC v9 book published back in the 90s, having both corrections, and definitions of new instructions. Anyway, should be an interesting read :)

Friday Apr 20, 2012

What is -xcode=abs44?

I've talked about building 64-bit libraries with position independent code. When building 64-bit applications there are two options for the code that the compiler generates: -xcode=abs64 or -xcode=abs44, the default is -xcode=abs44. These are documented in the user guides. The abs44 and abs64 options produce 64-bit applications that constrain the code + data + BSS to either 44 bit or 64 bits of address.

These options constrain the addresses statically encoded in the application to either 44 or 64 bits. It does not restrict the address range for pointers (dynamically allocated memory) - they remain 64-bits. The restriction is in locating the address of a routine or a variable within the executable.

This is easier to understand from the perspective of an example. Suppose we have a variable "data" that we want to return the address of. Here's the code to do such a thing:

extern int data;

int * address()
  return &data;

If we compile this as a 32-bit app we get the following disassembly:

/* 000000          4 */         sethi   %hi(data),%o5
/* 0x0004            */         retl    ! Result =  %o0
/* 0x0008            */         add     %o5,%lo(data),%o0

So it takes two instructions to generate the address of the variable "data". At link time the linker will go through the code, locate references to the variable "data" and replace them with the actual address of the variable, so these two instructions will get modified. If we compile this as a 64-bit code with full 64-bit address generation (-xcode=abs64) we get the following:

/* 000000          4 */         sethi   %hh(data),%o5
/* 0x0004            */         sethi   %lm(data),%o2
/* 0x0008            */         or      %o5,%hm(data),%o4
/* 0x000c            */         sllx    %o4,32,%o3
/* 0x0010            */         or      %o3,%o2,%o1
/* 0x0014            */         retl    ! Result =  %o0
/* 0x0018            */         add     %o1,%lo(data),%o0

So to do the same thing for a 64-bit application with full 64-bit address generation takes 6 instructions. Now, most hardware cannot address the full 64-bits, hardware typically can address somewhere around 40+ bits of address (example). So being able to generate a full 64-bit address is currently unnecessary. This is where abs44 comes in. A 44 bit address can be generated in four instructions, so slightly cuts the instruction count without practically compromising the range of memory that an application can address:

/* 000000          4 */         sethi   %h44(data),%o5
/* 0x0004            */         or      %o5,%m44(data),%o4
/* 0x0008            */         sllx    %o4,12,%o3
/* 0x000c            */         retl    ! Result =  %o0
/* 0x0010            */         add     %o3,%l44(data),%o0

Friday Oct 21, 2011


SPARC and x86 processors have different endianness. SPARC is big-endian and x86 is little-endian. Big-endian means that numbers are stored with the most significant data earlier in memory. Conversely little-endian means that numbers are stored with the least significant data earlier in memory.

Think of big endian as writing numbers as we would normally do. For example one thousand, one hundred and twenty would be written as 1120 using a big-endian format. However, writing as little endian it would be 0211 - the least significant digits would be recorded first.

For machines, this relates to which bytes are stored first. To make data portable between machines, a format needs to be agreed. For example in networking, data is defined as being big-endian. So to handle network packets, little-endian machines need to convert the data before using it.

Converting the bytes is a trivial matter, but it has some performance pitfalls. Let's start with a simple way of doing the conversion.

template <class T>
T swapslow(T in)
  T out;
  char * pcin = (char*)∈
  char * pcout = (char*)&out;

  for (int i=0; i<sizeof(T); i++)
    pcout[i] = pcin[sizeof(T)-i];
  return out;

The code uses templates to generalise it to different sizes of integers. But the following observations hold even if you use a C version for a particular size of input.

First thing to look at is instruction count. Assume I'm dealing with ints. I store the input to memory, then I access the input one byte at a time, storing each byte to a new location in memory, before finally loading the result. So for an int, I've got 10 memory operations.

Memory operations can be costly. Processors may be limited to only issuing one per cycle. In comparison most processors can issue two or more logical or integer arithmetic instructions per cycle. Loads are also costly as they have to access the cache, which takes a few cycles.

The other issue is more subtle, and I've discussed it in the past. There are RAW issues in this code. I'm storing an int, but loading it as four bytes. Then I'm storing four bytes, and loading them as an int.

A RAW hazard is a read-after-write hazard. The processor sees data being stored, but cannot convert that stored data into the format that the subsequent load requires. Hence the load has to wait until the result of the store reaches the cache before the load can complete. This can be multiple cycles of wait.

With endianness conversion, the data is already in the registers, so we can use logical operations to perform the conversion. This approach is shown in the next code snippet.

template <class T>
T swap(T in)
  T out=0;
  for (int i=0; i<sizeof(T); i++)
  return out;

In this case, we avoid the stores and loads, but instead we perform four logical operations per byte. This is higher cost than the load and store per byte. However, we can usually do more logical operations per cycle and the operations normally take a single cycle to complete. Overall, this is probably slightly faster than loads and stores.

However, you will usually see a greater performance gain from avoiding the RAW hazards. Obviously RAW hazards are hardware dependent - some processors may be engineered to avoid them. In which case you will only see a problem on some particular hardware. Which means that your application will run well on one machine, but poorly on another.

Tuesday Jan 11, 2011

RAW pipeline hazards

When a processor stores an item of data back to memory it actually goes through quite a complex set of operations. A sketch of the activities is as follows. The first thing that needs to be done is that the cache line containing the target address of the store needs to be fetched from memory. While this is happening, the data to be stored there is placed on a store queue. When the store is the oldest item in the queue, and the cache line has been successfully fetched from memory, the data can be placed into the cache line and removed from the queue.

This works very well if data is stored and either never reused, or reused after a relatively long delay. Unfortunately it is common for data to be needed almost immediately. There are plenty of reasons why this is the case. If parameters are passed through the stack, then they will be stored to the stack, and then immediately reloaded. If a register is spilled to the stack, then the data will be reloaded from the stack shortly afterwards.

It could take some considerable number of cycles if the loads had to wait for the stores to exit the queue before they could fetch the data. So many processors implement some kind of bypassing. If a load finds the data it needs in the store queue, then it can fetch it from there. There are often some caveats associated with this bypass. For example, the store and load often have to be of the same size to the same address. i.e. you cannot bypass a byte from a store of a word. If the bypass fails, then the situation is referred to as a "RAW" hazard, meaning "Read-After-Write". If the bypass fails, then the load has to wait until the store has completed before it can retrieve the new value - this can take many cycles.

As a general rule it is best to avoid potential RAWs. It is hardware, and runtime situation dependent whether there will be a RAW hazard or not, so avoiding the possibility is the best defense. Consider the following code which uses loads and stores of bytes to construct an integer.

#include <stdio.h>
#include <sys/time.h>

void tick()
  hrtime_t now = gethrtime();
  static hrtime_t then = 0;
  if (then>0) printf("Elapsed = %f\\n", 1.0\*(now-then)/100000000.0);
  then = now;

int func(char \* value)
  int temp;
  ((char\*)&temp)[0] = value[3];
  ((char\*)&temp)[1] = value[2];
  ((char\*)&temp)[2] = value[1];
  ((char\*)&temp)[3] = value[0];
  return temp;

int main()
  int value = 0x01020304;
  for (int i=0; i<100000000; i++) func((char\*)&value);

In the above code we're reversing the byte order by loading the bytes one-by-one, and storing them into an integer in the correct position, then loading the integer. Running this code on a test machine it reports 12ns per iteration.

However, it is possible to perform the same reordering using logical operations (shifts and ORs) as follows:

int func2(char\* value)
  return (value[0]<<24) | (value[1]<<16) | (value[2]<<8) | value[0];

This modified routine takes about 8ns per iteration. Which is significantly faster than the original code.

The actual speed up observed will depend on many factors, the most obvious being how often the code is encountered. The more observation is that the speed up depends on the platform. Some platforms will be more sensitive to the impact of RAWs than others. So the best advice is, whereever possible, to avoid passing data through the stack.

Monday Oct 19, 2009

Fishing with cputrack

I'm a great fan of the hardware performance counters that you find on most processors. Often you can look at the profile and instantly identify what the issue is. Sometimes though, it is not obvious, and that's where the performance counters can really help out.

I was looking at one such issue last week, the performance of the application was showing some variation, and it wasn't immediately obvious what the issue was. The usual suspects in these cases are:

  • Excessive system time
  • Process migration
  • Memory placement
  • Page size
  • etc.

Unfortunately, none of these seemed to explain the issue. So I hacked together the following script cputrackall which ran the test code under cputrack for all the possible performance counters. Dumped the output into a spreadsheet, and compared the fast and slow runs of the app. This is something of a "fishing trip" script, just gathering as much data as possible in the hope that something leaps out, but sometimes that's exactly what's needed. I regularly get to sit in front of a new chip before the tools like ripc have been ported, and in those situations the easiest thing to do is to look for hardware counter events that might explain the runtime performance. In this particular instance, it helped me to confirm my suspicion that there was a difference in branch misprediction rates that was causing the issue.

Monday Sep 28, 2009

Updated compiler flags article

Just updated the Selecting The Best Compiler Options article for the developer portal. Minor changes, mainly a bit more clarification on floating point optimisations.


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
Free Download


« February 2015
The Developer's Edge
Solaris Application Programming
OpenSPARC Book
Multicore Application Programming