Monday Apr 07, 2014

RAW hazards revisited (again)

I've talked about RAW hazards in the past, and even written articles about them. They are an interesting topic because they are a situation where a small tweak to the code can avoid the problem entirely.

In the article on RAW hazards there is some code that demonstrates various types of RAW hazard. One common situation is writing code to copy misaligned data into a buffer. The example code contains a test for this kind of copying; the results from this test, compiled with Solaris Studio 12.3, look like this on my system:

Misaligned load v1 (bad) memcpy()
Elapsed = 16.486042 ns
Misaligned load v2 (bad) byte copy
Elapsed = 9.176913 ns
Misaligned load good
Elapsed = 5.243858 ns

However, we've done some work in the compiler on better identification of potential RAW hazards. If I recompile using the 12.4 Beta compiler I get the following results:

Misaligned load v1 (bad) memcpy()
Elapsed = 4.756911 ns
Misaligned load v2 (bad) byte copy
Elapsed = 5.005309 ns
Misaligned load good
Elapsed = 5.597687 ns

All three variants of the code now produce essentially the same performance - the RAW hazards have been eliminated!
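
For context, the kind of construct the test exercises looks roughly like the following sketch (this is my illustration, not the actual benchmark code from the article; the function names are mine):

#include <string.h>

/* v1: copy the misaligned bytes with memcpy() - depending on how the
   compiler expands the memcpy(), this can turn into byte stores
   followed by a word load, a potential RAW hazard */
int load_v1(char * p)
{
  int v;
  memcpy(&v, p, sizeof(v));
  return v;
}

/* v2: store the bytes individually, then load them back as an int -
   storing bytes and loading a word is a potential RAW hazard */
int load_v2(char * p)
{
  int v;
  ((char*)&v)[0] = p[0];
  ((char*)&v)[1] = p[1];
  ((char*)&v)[2] = p[2];
  ((char*)&v)[3] = p[3];
  return v;
}

/* "good": assemble the value in registers using shifts and ORs,
   so nothing goes through memory (big-endian byte order assumed) */
int load_good(unsigned char * p)
{
  return (p[0]<<24) | (p[1]<<16) | (p[2]<<8) | p[3];
}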

Thursday Apr 03, 2014

SPARC roadmap

A new SPARC roadmap has been published. We have some very cool stuff coming :)

Wednesday Feb 26, 2014

Multicore Application Programming available in Chinese!

This was a complete surprise to me. A box arrived on my doorstep, and inside were copies of Multicore Application Programming in Chinese. They look good, and have a glossy cover rather than the matte cover of the English version.

Article on RAW hazards

Feels like it's been a long while since I wrote up an article for OTN, so I'm pleased that I've finally got around to fixing that.

I've written about RAW hazards in the past. But I recently went through a patch of discovering them in a number of places, so I've written up a full article on them.

What is "nice" about RAW hazards is that once you recognise them for what they are (and that's the tricky bit), they are typically easy to avoid. So if you see 10 seconds of time attributable to RAW hazards in the profile, then you can often get the entire 10 seconds back by tweaking the code.

Wednesday Sep 18, 2013

SPARC processor documentation

The SPARC processor documentation can be found here. What is really exciting though is that you can finally download the Oracle SPARC Architecture 2011 spec, which describes the current SPARC instruction set.

Friday Aug 09, 2013

How to use a lot of threads ....

The SPARC M5-32 has 1,536 virtual CPUs. There is a lot that you can do with that much resource and a bunch of us sat down to write a white paper discussing the options.

There are a couple of key observations in there. First of all it is quite likely that such a large system will not end up running a single instance of a single application. Therefore it is useful to understand the various options for virtualising the system. The second observation is that there are a number of details to take care of when writing an application that is expected to scale to large numbers of threads.

Anyhow, I hope you find it a useful read.

Tuesday Jun 11, 2013

SPARC family

This is a nice table showing the various SPARC processors being shipped by Oracle.

Wednesday Dec 12, 2012

Compiling for T4

I've recently had quite a few queries about compiling for T4 based systems. So it's probably a good time to review what I consider to be the best practices.

  • Always use the latest compiler. Being in the compiler team, this is bound to be something I'd recommend :) But the serious points are that (a) every release the tools get better, so you will be more effective using the latest release; (b) every release we improve the generated code, so the binaries get faster; and (c) old releases cannot know about new hardware.
  • Always use optimisation. You should use at least -O to get some amount of optimisation. -xO4 is typically even better as this will add within-file inlining.
  • Always generate debug information, using -g. This allows the tools to attribute information to lines of source. This is particularly important when profiling an application.
  • The default target of -xtarget=generic is often sufficient. This setting is designed to produce a binary that runs well across all supported platforms. If the binary is going to be deployed on only a subset of architectures, then it is possible to produce a binary that only uses the instructions supported on these architectures, which may lead to some performance gains. I've previously discussed which chips support which architectures, and I'd recommend that you take a look at the chart that goes with the discussion.
  • Crossfile optimisation (-xipo) can be very useful - particularly when the hot source code is distributed across multiple source files. If you're allowed to have something as geeky as favourite compiler optimisations, then this is mine!
  • Profile feedback (-xprofile=[collect: | use:]) will help the compiler make the best code layout decisions, and is particularly effective with crossfile optimisations. But what makes this optimisation really useful is that codes that are dominated by branch instructions don't typically improve much with "traditional" compiler optimisation, but often do respond well to being built with profile feedback.
  • The macro flag -fast aims to provide a one-stop "give me a fast application" flag. This usually gives a best performing binary, but with a few caveats. It assumes the build platform is also the deployment platform, it enables floating point optimisations, and it makes some relatively weak assumptions about pointer aliasing. It's worth investigating.
  • The SPARC64 processors, T3, and T4 implement floating point multiply accumulate instructions. These can substantially improve floating point performance. To generate them the compiler needs the flag -fma=fused and also needs an architecture that supports the instruction (at least -xarch=sparcfmaf).
  • The most critical advice is that anyone doing performance work should profile their application. I cannot overstate how important it is to look at where the time is going in order to determine what can be done to improve it.
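
Pulling those recommendations together, a typical build might look something like the following (the file names are illustrative, and this is just the flags discussed above rather than a definitive recipe):

$ cc -g -xO4 -xtarget=generic -xipo -c main.c compute.c
$ cc -g -xO4 -xtarget=generic -xipo -o app main.o compute.o

The flags are repeated at the link step because -xipo does its crossfile work at link time.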

I also presented at Oracle OpenWorld on this topic, so it might be helpful to review those slides.

Thursday Oct 18, 2012

Maximising T4 performance

My presentation from Oracle Open World is available for download.

Tuesday Oct 09, 2012

25 years of SPARC

Looks like an interesting event at the Computer History Museum on 1 November. A panel discussing SPARC at 25: Past, Present and Future. Free sign up.

Friday Sep 14, 2012

Current SPARC Architectures

Different generations of SPARC processors implement different architectures. The architecture that the compiler targets is controlled implicitly by the -xtarget flag and explicitly by the -xarch flag.

If an application targets a recent architecture, then the compiler gets to play with all the instructions that the new architecture provides. The downside is that the application won't work on older processors that don't have the new instructions. So for developers there is a trade-off between performance and portability.

The way we have solved this in the compiler is to assume a "generic" architecture, and we've made this the default behaviour of the compiler. The only flag that doesn't make this assumption is -fast which tells the compiler to assume that the build machine is also the deployment machine - so the compiler can use all the instructions that the build machine provides.

The -xtarget=generic flag tells the compiler explicitly to use this generic model. We work hard on making generic code work well across all processors. So in most cases this is a very good choice.

It is also of interest to know what processors support the various architectures. The following Venn diagram attempts to show this:


A textual description is as follows:

  • The T1 and T2 processors, in addition to most other SPARC processors shipped in the last 10+ years, support V9b, or sparcvis2.
  • The SPARC64 processors from Fujitsu, used in the M-series machines, added support for the floating point multiply accumulate instruction in the sparcfmaf architecture.
  • Support for this instruction also appeared in the T3 - this architecture is called sparcvis3.
  • Later SPARC64 processors added the integer multiply accumulate instruction; this architecture is sparcima.
  • Finally the T4 includes support for both the integer and floating point multiply accumulate instructions in the sparc4 architecture.

So the conclusion should be:

  • Floating point multiply accumulate is supported in both the T-series and M-series machines, so it should be a relatively safe bet to start using it.
  • The T4 is a very good machine to deploy to because it supports all the current instruction sets.
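
To give a feel for the multiply accumulate point, a loop like the following (an illustrative sketch, not code from any of the articles) is the sort of thing that can be compiled down to fused multiply-add instructions when built with -fma=fused and an architecture of at least -xarch=sparcfmaf:

void daxpy(int n, double a, double * x, double * y)
{
  for (int i=0; i<n; i++)
  {
    y[i] += a * x[i];   /* multiply and accumulate - a candidate for fma */
  }
}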

Thursday Aug 30, 2012

SPARC Architecture 2011

With what appears to be minimal fanfare, an update of the SPARC Architecture has been released. If you ever look at SPARC disassembly code, then this is the document that you need to bookmark. If you are not familiar with it, it basically describes how a SPARC processor should behave - it doesn't describe a particular implementation, just the "generic" processor. As with all revisions, it supersedes the SPARC V9 book published back in the 90s, with both corrections and definitions of new instructions. Anyway, it should be an interesting read :)

Friday Apr 20, 2012

What is -xcode=abs44?

I've talked about building 64-bit libraries with position independent code. When building 64-bit applications there are two options for the code that the compiler generates: -xcode=abs64 or -xcode=abs44; the default is -xcode=abs44. These are documented in the user guides. The abs44 and abs64 options produce 64-bit applications that constrain the code + data + BSS to either 44 bits or 64 bits of address.

These options constrain the addresses statically encoded in the application to either 44 or 64 bits. They do not restrict the address range for pointers (dynamically allocated memory) - those remain 64 bits. The restriction applies to locating the address of a routine or a variable within the executable.

This is easier to understand from the perspective of an example. Suppose we have a variable "data" that we want to return the address of. Here's the code to do such a thing:

extern int data;

int * address()
{
  return &data;
}

If we compile this as a 32-bit app we get the following disassembly:

/* 000000          4 */         sethi   %hi(data),%o5
/* 0x0004            */         retl    ! Result =  %o0
/* 0x0008            */         add     %o5,%lo(data),%o0

So it takes two instructions to generate the address of the variable "data". At link time the linker will go through the code, locate references to the variable "data" and replace them with the actual address of the variable, so these two instructions will get modified. If we compile this as a 64-bit code with full 64-bit address generation (-xcode=abs64) we get the following:

/* 000000          4 */         sethi   %hh(data),%o5
/* 0x0004            */         sethi   %lm(data),%o2
/* 0x0008            */         or      %o5,%hm(data),%o4
/* 0x000c            */         sllx    %o4,32,%o3
/* 0x0010            */         or      %o3,%o2,%o1
/* 0x0014            */         retl    ! Result =  %o0
/* 0x0018            */         add     %o1,%lo(data),%o0

So doing the same thing for a 64-bit application with full 64-bit address generation takes 6 instructions. Now, most hardware cannot address the full 64 bits; hardware typically supports somewhere around 40+ bits of virtual address. So being able to generate a full 64-bit address is currently unnecessary. This is where abs44 comes in. A 44-bit address can be generated in four instructions, which slightly cuts the instruction count without practically compromising the range of memory that an application can address:

/* 000000          4 */         sethi   %h44(data),%o5
/* 0x0004            */         or      %o5,%m44(data),%o4
/* 0x0008            */         sllx    %o4,12,%o3
/* 0x000c            */         retl    ! Result =  %o0
/* 0x0010            */         add     %o3,%l44(data),%o0
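
If you want to reproduce listings like these, builds along the following lines should work (assuming the routine above lives in a file called address.c; the exact annotations depend on which tool you use to inspect the generated code):

$ cc -m32 -xO3 -S address.c
$ cc -m64 -xcode=abs64 -xO3 -S address.c
$ cc -m64 -xcode=abs44 -xO3 -S address.c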

Friday Oct 21, 2011

Endianness

SPARC and x86 processors have different endianness. SPARC is big-endian and x86 is little-endian. Big-endian means that numbers are stored with the most significant data earlier in memory. Conversely little-endian means that numbers are stored with the least significant data earlier in memory.

Think of big endian as writing numbers as we would normally do. For example one thousand, one hundred and twenty would be written as 1120 using a big-endian format. However, writing as little endian it would be 0211 - the least significant digits would be recorded first.

For machines, this relates to which bytes are stored first. To make data portable between machines, a format needs to be agreed. For example in networking, data is defined as being big-endian. So to handle network packets, little-endian machines need to convert the data before using it.

Converting the bytes is a trivial matter, but it has some performance pitfalls. Let's start with a simple way of doing the conversion.

template <class T>
T swapslow(T in)
{
  T out;
  char * pcin = (char*)&in;
  char * pcout = (char*)&out;

  for (int i=0; i<sizeof(T); i++)
  {
    pcout[i] = pcin[sizeof(T)-1-i];
  }
  return out;
}

The code uses templates to generalise it to different sizes of integers. But the following observations hold even if you use a C version for a particular size of input.

First thing to look at is instruction count. Assume I'm dealing with ints. I store the input to memory, then I access the input one byte at a time, storing each byte to a new location in memory, before finally loading the result. So for an int, I've got 10 memory operations.

Memory operations can be costly. Processors may be limited to only issuing one per cycle. In comparison most processors can issue two or more logical or integer arithmetic instructions per cycle. Loads are also costly as they have to access the cache, which takes a few cycles.

The other issue is more subtle, and I've discussed it in the past. There are RAW issues in this code. I'm storing an int, but loading it as four bytes. Then I'm storing four bytes, and loading them as an int.

A RAW hazard is a read-after-write hazard. The processor sees data being stored, but cannot convert that stored data into the format that the subsequent load requires. Hence the load has to wait until the result of the store reaches the cache before the load can complete. This can be multiple cycles of wait.

With endianness conversion, the data is already in the registers, so we can use logical operations to perform the conversion. This approach is shown in the next code snippet.

template <class T>
T swap(T in)
{
  T out=0;
  for (int i=0; i<sizeof(T); i++)
  {
    out<<=8;
    out|=(in&255);
    in>>=8;
  }
  return out;
} 

In this case, we avoid the stores and loads, but instead we perform four logical operations per byte. This is higher cost than the load and store per byte. However, we can usually do more logical operations per cycle and the operations normally take a single cycle to complete. Overall, this is probably slightly faster than loads and stores.

However, you will usually see a greater performance gain from avoiding the RAW hazards. Obviously RAW hazards are hardware dependent - some processors may be engineered to avoid them, in which case you will only see a problem on particular hardware. That means your application might run well on one machine, but poorly on another.
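
As a quick usage sketch (assuming the templated swap() above is visible at the call site; this is my illustration rather than code from the original post):

#include <stdio.h>

int main()
{
  unsigned int v = 0x01020304;
  printf("%08x\n", swap(v));   /* prints 04030201 */
  return 0;
}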

Tuesday Jan 11, 2011

RAW pipeline hazards

When a processor stores an item of data back to memory it actually goes through quite a complex set of operations. A sketch of the activities is as follows. The first thing that needs to be done is that the cache line containing the target address of the store needs to be fetched from memory. While this is happening, the data to be stored there is placed on a store queue. When the store is the oldest item in the queue, and the cache line has been successfully fetched from memory, the data can be placed into the cache line and removed from the queue.

This works very well if data is stored and either never reused, or reused after a relatively long delay. Unfortunately it is common for data to be needed almost immediately. There are plenty of reasons why this is the case. If parameters are passed through the stack, then they will be stored to the stack, and then immediately reloaded. If a register is spilled to the stack, then the data will be reloaded from the stack shortly afterwards.

It could take some considerable number of cycles if the loads had to wait for the stores to exit the queue before they could fetch the data. So many processors implement some kind of bypassing. If a load finds the data it needs in the store queue, then it can fetch it from there. There are often some caveats associated with this bypass. For example, the store and load often have to be of the same size to the same address. i.e. you cannot bypass a byte from a store of a word. If the bypass fails, then the situation is referred to as a "RAW" hazard, meaning "Read-After-Write". If the bypass fails, then the load has to wait until the store has completed before it can retrieve the new value - this can take many cycles.

As a general rule it is best to avoid potential RAWs. Whether a RAW hazard actually occurs depends on the hardware and the runtime situation, so avoiding the possibility is the best defense. Consider the following code which uses loads and stores of bytes to construct an integer.

#include <stdio.h>
#include <sys/time.h>

void tick()
{
  hrtime_t now = gethrtime();
  static hrtime_t then = 0;
  if (then>0) printf("Elapsed = %f\n", 1.0*(now-then)/100000000.0);
  then = now;
}

int func(char * value)
{
  int temp;
  ((char*)&temp)[0] = value[3];
  ((char*)&temp)[1] = value[2];
  ((char*)&temp)[2] = value[1];
  ((char*)&temp)[3] = value[0];
  return temp;
}

int main()
{
  int value = 0x01020304;
  tick();
  for (int i=0; i<100000000; i++) func((char*)&value);
}

In the above code we're reversing the byte order by loading the bytes one-by-one, and storing them into an integer in the correct position, then loading the integer. Running this code on a test machine it reports 12ns per iteration.

However, it is possible to perform the same reordering using logical operations (shifts and ORs) as follows:

int func2(char* value)
{
  return (value[0]<<24) | (value[1]<<16) | (value[2]<<8) | value[3];
}

This modified routine takes about 8ns per iteration. Which is significantly faster than the original code.

The actual speed up observed will depend on many factors, the most obvious being how often the code is encountered. The more important observation is that the speed up depends on the platform. Some platforms will be more sensitive to the impact of RAWs than others. So the best advice is, wherever possible, to avoid passing data through the stack.

Monday Oct 19, 2009

Fishing with cputrack

I'm a great fan of the hardware performance counters that you find on most processors. Often you can look at the profile and instantly identify what the issue is. Sometimes though, it is not obvious, and that's where the performance counters can really help out.

I was looking at one such issue last week; the performance of the application was showing some variation, and it wasn't immediately obvious what the issue was. The usual suspects in these cases are:

  • Excessive system time
  • Process migration
  • Memory placement
  • Page size
  • etc.

Unfortunately, none of these seemed to explain the issue. So I hacked together a script, cputrackall, which ran the test code under cputrack for each of the possible performance counters. I dumped the output into a spreadsheet, and compared the fast and slow runs of the app. This is something of a "fishing trip" script, just gathering as much data as possible in the hope that something leaps out, but sometimes that's exactly what's needed. I regularly get to sit in front of a new chip before tools like ripc have been ported, and in those situations the easiest thing to do is to look for hardware counter events that might explain the runtime performance. In this particular instance, it helped me to confirm my suspicion that there was a difference in branch misprediction rates that was causing the issue.

Monday Sep 28, 2009

Updated compiler flags article

Just updated the Selecting The Best Compiler Options article for the developer portal. Minor changes, mainly a bit more clarification on floating point optimisations.

Monday Jun 15, 2009

Glasgow Haskell Compiler successfully ported to OpenSPARC

Ben Lippmeier has been working on the native port of the Glasgow Haskell Compiler (GHC) to SPARC. The port was completed a few months back, and since then he's been using an UltraSPARC T2 system to look at thread activity and scaling as the number of threads is increased. The full details of the project are on his GHC on SPARC blog. The latest SPARC/Solaris binary can be downloaded here, although the full set of changes probably won't be available for a couple of months.

Thursday Apr 02, 2009

Strong prefetch

Looked at an interesting problem yesterday. One of the hot routines in an application was showing a lot of system time. The overall impact on the application was minor, but any unexpected system time is worth investigating.

Profiling under the performance analyzer indicated a pair of instructions, a prefetch followed by a load. The load had all the time attributed to it, but it's usual for the performance analyzer to attribute time to the instruction following the one that is slow. Something like (this is synthetic data, measured data later):

User  System
 0.5     0.5     prefetch [%o0],23
10.0    20.0     ld [%i1],%i2

So there are three things that sprang to mind:

  • Misaligned memory accesses. This would probably be the most common cause of system time on memory operations. However, it didn't look like a perfect fit, since the system time would usually be reported on the instruction afterwards.
  • TSB misses. The TLB is the on-chip structure that holds virtual to physical mappings. If the mapping is not in the TLB, then it gets fetched from a software managed structure called the TSB; this is a TLB miss. However, the TSB has a capacity limit too, and it is possible for there to be a TSB miss where the mapping is fetched from memory and placed into the TSB. TLB misses are handled by fast traps and don't get recorded under system time. However, TSB misses do cause a switch into the kernel, so they show up as system time. This was quite a strong candidate; however, trapstat -t didn't show much activity.
  • The third option was the prefetch instruction.

A prefetch instruction requests data from memory in advance of the data being used. For many codes this results in a large gain in performance. There are a number of variants of prefetch, and their exact operation depends on the processor. The SPARC Architecture 2007 manual, on pages 300-303, gives an outline of the variants. There's prefetch to read, prefetch to write, prefetch for single use, and prefetch for multiple use. The processor has the option to do something different depending on which particular prefetch was used to get the data.

The system I was running on was an UltraSPARC IV+. This introduced the strong prefetch variant.

The idea of the strong prefetch variant is that it will cause a TLB miss to be taken if the mapping of the requested address is not present in the TLB. This is useful for situations where TLB misses are frequently encountered and the prefetch needs to be able to deal with them. To explain why this is important, consider this example. I have a loop which strides through a large range of memory; the loop optimally takes only a few cycles to execute - so long as the data is resident in cache. It takes a couple of hundred cycles to fetch data from memory, so I end up fetching for eight iterations ahead (eight is the maximum number of outstanding prefetches on that processor). If the data resides on a new page, and prefetches do not cause TLB misses, then eight iterations will complete before a load or store touches the new page of memory and brings the mapping into the TLB. The eight prefetches that were issued for those iterations will have been dropped and all the iterations will see the full memory latency (a total of 200*8=1600 cycles of cost). However, if the prefetches are strong prefetches, then the first strong prefetch will cause the mapping to be brought into the TLB, and all the prefetches will hit in the TLB.

However, there's a corner case. A strong prefetch of a page that is not mapped will still cause the TLB miss, and the corresponding system time while the kernel works out that it should drop the access to a non-existent page.

This was my suspicion. The code had a loop, and each iteration of the loop would prefetch the next item in the linked list. If the code had reached the end of the list, then it would be issuing a prefetch for address 0, which (of course) is not mapped.

It's relatively easy to provide a test case for this, just loop around issuing prefetches for address zero and see how long the code takes to run. Prefetches are defined in the header file <sun_prefetch.h>. So the first code looks like:

#include <sun_prefetch.h>

void main()
{
  for (int i=0; i<1000000; i++)
  {
   sparc_prefetch_write_many(0);
  }
}

Compiling and running this gives:

$ cc pref.c
$ timex a.out

real           2.59
user           1.24
sys            1.34

So the code has substantial system time - which fits with the profile of the application. Next thing to look at is the location of that system time:

$ collect a.out
$ er_print -metrics e.user:e.system -dis main test.1.er
...
   Excl.     Excl.
   User CPU  Sys. CPU
    sec.      sec.
                                
...
   0.        0.                 [?]    10b8c:  bset        576, %o1 ! 0xf4240
   0.        0.                 [?]    10b90:  prefetch    [0], #n_writes_strong
## 1.571     1.181              [?]    10b94:  inc         %i5
   0.        0.                 [?]    10b98:  cmp         %i5, %o1
...

So the system time is recorded on the instruction following the prefetch.

The next step is to investigate ways to solve it. The <sun_prefetch.h> header file does not define any variants of weak prefetch. So it's necessary to write an inline template to provide the support:

.inline weak,4
  prefetch [%o0],2
.end
.inline strong,4
  prefetch [%o0],22
.end

The identifier for many writes strong is #22, for many writes weak it is #2. The initial code uses strong since I want to validate that I get the same behaviour with my new code:

void weak(int*);
void strong(int*);

void main()
{
  for (int i=0; i<1000000; i++)
  {
   strong(0);
  }
}

Running this gives:

$ cc pref.c pref.il
$ collect a.out
$ er_print -metrics e.user:e.system -dis main test.1.er
...
   Excl.     Excl.
   User CPU  Sys. CPU
    sec.      sec.
                                
 ...
   0.        0.                 [?]    10b90:  clr         %o0
   0.        0.                 [?]    10b94:  nop
   0.        0.                 [?]    10b98:  prefetch    [%o0], #n_writes_strong
## 1.761     1.051              [?]    10b9c:  inc         %i5
...

So that looks good. Replacing the strong prefetch with a weak prefetch:

void weak(int*);
void strong(int*);

void main()
{
  for (int i=0; i<1000000; i++)
  {
   weak(0);
  }
}

And then repeat the profiling experiment:

   Excl.     Excl.
   User CPU  Sys. CPU
    sec.      sec.
                                
...
   0.        0.                 [?]    10b90:  clr         %o0
   0.        0.                 [?]    10b94:  nop
   0.        0.                 [?]    10b98:  prefetch    [%o0], #n_writes
   0.        0.                 [?]    10b9c:  inc         %i5
...

Running the code under timex:

$ timex a.out

real           0.01
user           0.00
sys            0.00

So that got rid of the system time. Of course with this change the prefetches issued will now be dropped if there is a TLB miss, and it is possible that could cause a slowdown for the loads in the loop - which is a trade-off. However, examining the application using pmap -xs <pid> indicated that the data resided on 4MB pages, so the number of TLB misses should be pretty low (this was confirmed by the trapstat data I gathered earlier).

One final comment is that this behaviour is platform dependent. Running the same strong prefetch code on an UltraSPARC T2, for example, shows no system time.

Thursday Mar 26, 2009

OpenSPARC workshop - Brussels

April 24th and 25th I'm going to be in Brussels rerunning the OpenSPARC workshop. The workshop leverages the material in the OpenSPARC Internals book, together with the OpenSPARC presentations. There's a poster with all the details (for those with acute eyesight, the poster on the left is from the December workshop!).

Tuesday Dec 16, 2008

OpenSPARC Internals available on Amazon

OpenSPARC Internals is now available from Amazon. As well as print-on-demand from lulu, and as a free (after registration) download.

Thursday Nov 13, 2008

How to learn SPARC assembly language

Got a question this morning about how to learn SPARC assembly language. It's a topic that I cover briefly in my book; however, the coverage there was never meant to be complete. The text in my book is meant as a quick guide to reading SPARC (and x86) assembly, so that the later examples make some kind of sense. The basics are the instruction format:

[instruction]  [source register 1], [source register 2], [destination register]

For example:

faddd    %f0, %f2, %f4

Means:

%f4 = %f0 + %f2

The other thing to learn that's different about SPARC is the branch delay slot: the instruction placed after the branch is actually executed as part of the branch. This is different from x86, where a branch instruction is the delimiter of the block of code.
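
For example, in a fragment like this (purely illustrative), the add in the delay slot is executed as part of the branch:

        cmp     %o0, 0
        be      is_zero          ! branch if %o0 == 0
        add     %o1, 1, %o1      ! delay slot: executed as part of the branch
        ...
is_zero: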

With those basics out of the way, the next thing to do would be to take a look at the SPARC Architecture manual, which is a very detailed reference to all the software-visible details.

Finally, I'd suggest just writing some simple codes, and profiling them using the Sun Studio Performance Analyzer. Use the disassembly view tab and the architecture manual to see how the instructions are used in practice.

Tuesday Nov 04, 2008

Job available in this performance analysis team

We're advertising a job opening in this group. We're looking for someone who's keen on doing performance analysis on x86 and SPARC platforms. The req number is 561456, and you can read the details on sun.com. If you have any questions, please do feel free to contact me.

Monday Jul 28, 2008

Atomic operations

Solaris 10 provides atomic operations in libc. Atomic operations are sequences of instructions that behave as if they are a single atomic instruction that no other thread can interrupt. The way that the operations are implemented uses the compare-and-swap instruction (cas). A rough outline is as follows:

do
{
  Load existing value
  New value = existing value + increment
  return value = compare and swap(existing value and new value)
}
while (return value != existing value)

The compare-and-swap instruction atomically swaps the value in a register with the value held in memory, but only if the value held in memory equals the value held in another register. The return value is the value held in memory. The pseudo-code uses this so that the value stored back to memory will only be the incremented value if the current value in memory is equal to the value before the increment. If the value held in memory is not the one that was expected, the code retries the operation until it does succeed.

Breaking the compare-and-swap instruction into pseudo code looks like:

  CAS (Value held in memory, Old value, New value)
  {
    Existing value = *Value held in memory;
    if (Existing value == Old value)
    {
       *Value held in memory = New value;
       return Old value;
    }
    else
    {
       return *Value held in memory;
    }
  }

One of the questions from last week was how to use atomic operations in Solaris 9 if they are only available on Solaris 10. The answer is to write your own. The following code snippet demonstrates how to write an atomic increment operation:

.inline atomic_add,8
  ld [%o0],%o2         /* Load existing value            */
1:
  add %o2, %o1, %o3    /* Generate new value             */
  cas [%o0],%o2,%o3    /* Swap memory contents           */
  cmp %o2, %o3         /* Compare old value with return  */
  bne 1b               /* Fail if the old value does not */
                       /* equal the value returned from  */
                       /* memory                         */
  mov %o3,%o2          /* Retry using latest value from  */
                       /* memory                         */
.end
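
A minimal sketch of calling this from C (in a 32-bit build, matching the 8 bytes of arguments declared in the template; the prototype and file names are my own illustration) might look like:

/* main.c - built with something like: cc main.c atomic.il */
void atomic_add(int * address, int increment);  /* provided by the inline template */

int counter = 0;

int main()
{
  atomic_add(&counter, 1);   /* atomically add 1 to counter */
  return 0;
}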

It's probably a good idea to read this article on atomic operations on SPARC, and it may also be useful to read up on inline templates.

Friday Jul 11, 2008

Reading the %tick counter

It's often necessary to time the duration of a piece of code. To do this I often use gethrtime(), which is typically a quick call. Obviously, the faster the call, the more accurate I can expect my timing to be. Sometimes the call to gethrtime() is too long (or perhaps too costly, because it's done so frequently). The next thing to try is to read the %tick register.

The %tick register is a 64-bit register that gets incremented on every cycle. Since it's held on the processor, reading the value of the register is a low cost operation. The only complexity is that, being a 64-bit value, it needs special handling in 32-bit code, where a 64-bit return value from a function is passed in the %o0 (upper bits) and %o1 (lower bits) registers.

The inline template to read the %tick register in 64-bit code is very simple:

.inline tk,0
   rd %tick,%o0
.end

The 32-bit version requires two more instructions to get the return value into the appropriate registers:

.inline tk,0
   rd %tick,%o2
   srlx %o2,32,%o0
   sra %o2,0,%o1
.end

Here's an example code which uses the %tick register to get the current value plus an estimate of the cost of reading the %tick register:

#include <stdio.h>

long long tk();

void main()
{
  printf("value=%llu duration=%llu\\n",tk(),-tk()+tk());
}

The compile line is:

$ cc -O tk.c tk.il

The output should be something like:

$ a.out
value=4974674895272956 duration=32

Indicating that 32 cycles (in this instance) elapsed between the two reads of the %tick register. Looking at the disassembly, there are certainly a number of cycles of overhead:

        10bbc:  95 41 00 00  rd         %tick, %o2
        10bc0:  91 32 b0 20  srlx       %o2, 32, %o0
        10bc4:  93 3a a0 00  sra        %o2, 0, %o1
        10bc8:  97 32 60 00  srl        %o1, 0, %o3 
        10bcc:  99 2a 30 20  sllx       %o0, 32, %o4
        10bd0:  88 12 c0 0c  or         %o3, %o4, %g4
        10bd4:  c8 73 a0 60  stx        %g4, [%sp + 96]
        10bd8:  95 41 00 00  rd         %tick, %o2

This overhead can be reduced by treating the %tick register as a 32-bit read, and effectively ignoring the upper bits. For very short duration codes this is probably acceptable, but it is unsuitable for longer running code blocks. With this (inelegant) hack the following code is generated:

        10644:  91 41 00 00  rd         %tick, %o0
        10648:  b0 07 62 7c  add        %i5, 636, %i0
        1064c:  b8 10 00 08  mov        %o0, %i4
        10650:  91 41 00 00  rd         %tick, %o0

This usually returns a value of 8 cycles on the same platform.
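
For reference, one way of implementing that hack (a sketch, not necessarily how the code above was built) is to keep the same read of %tick but declare the routine as returning a 32-bit value, so that only the low bits in %o0 are used:

.inline tk32,0
   rd %tick,%o0
.end

with a C declaration of unsigned int tk32(); at the call site.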

Thursday May 22, 2008

Alloca on SPARC

On SPARC there's a slight complication: the load and store instructions have an offset range of -4096 to +4095. To use a larger offset than that it is necessary to put the offset into a register and use that to calculate the address.

If the size of the local variables is less than 4KB, then a load or store instruction can use the frame pointer together with an offset in order to access the memory on the stack. If the frame is greater than 4KB, then it's possible to use the frame pointer to access memory in the upper 4KB range, and the stack pointer to access memory in the lower 4KB. Rather like this diagram shows:

frame pointer -> top of stack
               ^
               | Upper 4KB can be accessed
               v using offset + frame pointer
               ^
               | Lower 4KB can be accessed
               v using offset + stack pointer
stack pointer -> bottom of stack

The complication is when temporary memory is allocated on the stack using alloca, and the size of the local variables exceeds 4KB. In this case it's not possible to just shift the stack pointer downwards, since that may cause variables that were previously accessed through the stack pointer to move out of the 4KB offset range, or it would change the offset from the stack pointer at which variables are stored (by an amount which may only be known at runtime). Neither of these situations would be good.

Instead of just shifting the stack pointer, a slightly more complex operation has to be carried out. The memory gets allocated in the middle of the range, and the lower memory gets shifted (or copied) downwards. The end result is something like this:

frame pointer -> top of stack
               ^
               | Upper 4KB can be accessed
               v using offset + frame pointer
               [Alloca'd memory]
               ^
               | Lower 4KB can be accessed
               v using offset + stack pointer
stack pointer -> bottom of stack

The routine that does this manipulation of memory is called __builtin_alloca. You can see in the code that it moves the stack pointer, and then has a copy loop to move the contents of the stack.

Unfortunately, the need to copy the data means that it takes longer to allocate memory. So if the function __builtin_alloca appears in a profile, the first thing to do is to see whether it's possible to reduce the amount of local variables/stack space needed for the routine.
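
For what it's worth, a trivial (purely illustrative) example of the kind of routine that ends up going through __builtin_alloca is one that combines a large amount of local data with a call to alloca:

#include <alloca.h>
#include <string.h>

void scratch(int n)
{
  char locals[8192];           /* more than 4KB of local variables */
  char * tmp = alloca(n);      /* temporary memory allocated on the stack */
  memset(tmp, 0, n);
  memset(locals, 1, sizeof(locals));
}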

As a footnote, take a look at the equivalent code for the x86 version of __builtin_alloca. The x86, being CISC, does not have the limit on the size of the offset that can be used. Hence the x86 code does not need the copy routine to move variables in the stack around.

Wednesday May 07, 2008

When to use membars

membar instructions are SPARC assembly language instructions that enforce memory ordering. They tell the processor to ensure that memory operations are completed before it continues execution. However, the basic rule is that the instructions are usually only necessary in "unusual" circumstances - which fortunately means that most people don't need to use them.

The UltraSPARC Architecture manual documents the situation very well in section 9.5. It gives these rules which cover the default behaviour:

  • Each load instruction behaves as if it were followed by a MEMBAR #LoadLoad and #LoadStore.
  • Each store instruction behaves as if it were followed by a MEMBAR #StoreStore.
  • Each atomic load-store behaves as if it were followed by a MEMBAR #LoadLoad, #LoadStore, and #StoreStore.

There's a table in section 9.5.3 which covers when membars are necessary. Basically, membars are necessary for ordering of block loads and stores, and for ordering non-cacheable loads and stores. There is an interesting note where it indicates that a membar is necessary to order a store followed by a load to a different address; if the address is the same the load will get the correct data. This at first glance seems odd - why worry about whether the store is complete if the load is of independent data? However, I can imagine this being useful in situations where the same physical memory is mapped using different virtual address ranges - not something that happens often, but it could happen in the kernel.
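
If you do find yourself needing to enforce that store-then-load ordering explicitly, one way to issue the barrier (a sketch using the same inline template mechanism as elsewhere on this blog, not the only way to do it) is:

.inline storeload_barrier,0
  membar #StoreLoad
.end

which can then be declared in C as void storeload_barrier(); and called between the store and the load.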

As a footnote, the equivalent x86 instruction is the mfence. There's a good discussion of memory ordering in section 7.2 of the Intel Systems Programming Guide.

There's some more discussion of this topic on Dave Dice's weblog.

Tuesday Mar 25, 2008

Conference schedule

The next two months are likely to be a bit hectic for me. I'm presenting at three different conferences, as well as a chat session in Second Life. So I figured I'd put the information up in case anyone reading this is also going to one or other of the events. So in date order:

I'll be talking about parallelisation at the various conferences, but the talks will be different. The multi-core expo talk focuses on microparallelisation. The ESC talk will probably be higher level, and the CommunityOne talk will probably be wider ranging, and I hope more interactive.

In the Second Life event I'll be talking about the book, although the whole idea of appearing is to do Q&A, so I hope that will be more of a discussion.

Thursday Mar 20, 2008

The much maligned -fast

The compiler flag -fast gets an unfair rap. Even the compiler reports:

cc: Warning: -xarch=native has been explicitly specified, or 
implicitly specified by a macro option, -xarch=native on this 
architecture implies -xarch=sparcvis2 which generates code that 
does not run on pre UltraSPARC III processors

which is hardly fair given that the UltraSPARC III line came out about 8 years ago! So I want to quickly discuss what's good about the option, and what reasons there are to be cautious.

The first thing to talk about is the warning message. -xtarget=native is a good option to use when the build platform is also the deployment platform. For me, this is the common case, but for people producing applications that are more generally deployed, it's not. The best thing to do to avoid the warning and produce binaries that work with the widest range of hardware is to add the flag -xtarget=generic after -fast (compiler flags are parsed from left to right, so the rightmost flag is the one that gets obeyed). The generic target represents a mix of all the important processors, and it produces code that should work well on all of them.
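
For example, a build line along these lines (the file name is illustrative) keeps the aggressive optimisations from -fast but produces a binary targeting the generic architecture:

$ cc -fast -xtarget=generic -o app app.c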

The next option which is in -fast for C that might cause some concern is -xalias_level=basic. This tells the compiler to assume that pointers of different basic types (e.g. integers, floats etc.) don't alias. Most people code to this, and the C standard actually has higher demands on the level of aliasing the compiler can assume. So code that conforms to the C standard will work correctly with this option. Of course, it's still worth being aware that the compiler is making the assumption.

The final area is floating point simplification. That's the flag -fsimple=2, which allows the compiler to reorder floating point expressions; -fns, which allows the processor to flush subnormal numbers to zero; and some other flags that use faster floating point libraries or inline templates. I've previously written about my rather odd views on floating point maths. Basically it comes down to this: if these options make a difference to the performance of your code, then you should investigate why they make a difference.

Since -fast contains a number of flags which impact performance, it's probably a good plan to identify exactly those flags that do make a difference, and use only those. A tool like ats can really help here.

About

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
