Thursday May 28, 2015

Where does misaligned data come from?

A good question about data (mis)alignment is "Where did it come from?". So here's a reasonably detailed answer to that...

If the compiler has generated the code for you and you've not done anything "weird", then the data should be correctly aligned. So most apps don't have misaligned data, and most of the time you (as a developer) don't have to worry about it. For example, if you allocate a local variable or a global variable, then the compiler will correctly align it. If you cast the result of a call to malloc() to a pointer to a structure, then that structure will be correctly aligned. And so on.... so if the compiler is doing all this correctly, when could it ever be possible to have misaligned data? There are a bunch of situations.

But first let's quickly review the -xmemalign flag. What it actually tells the compiler to do is to assume a particular alignment (and trap behaviour) for variables where it is unsure what the alignment is. If the compiler knows a variable is aligned, then it will generate code exploiting that fact. So -xmemalign only really applies to dynamically allocated data accessed through pointers. So where does this apply? The following is not an exhaustive list:

  • Packed data structures. If a data structure has been declared as being packed, then the compiler will squash the members together to occupy the minimum space - so the alignments may be wrong. If a structure is not packed the compiler adds padding to ensure that members are correctly aligned.
  • Buffers. Suppose your application gets packets across the network. If the packet contains an integer there's no guarantee that the integer will be placed on a four byte boundary.
  • Pointers into bytes. Suppose you have a string of characters and you want to load 4 of them into an int - perhaps you're doing some bit-twiddling. Then you have to take care to handle strings that don't start at 4 byte boundaries. (There's a sketch of this below.)
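
To make the packed-structure and byte-pointer cases concrete, here's a minimal sketch (the structure and field names are hypothetical; #pragma pack is the packing directive supported by Studio and gcc):

#include <stdio.h>
#include <string.h>
#include <stddef.h>

/* A packed structure: member i lands at offset 1, so it is misaligned. */
#pragma pack(1)
struct packet
{
  char tag;
  int  i;
};
#pragma pack()

/* A safe way to pull an int out of an arbitrarily aligned buffer: memcpy.
   The compiler knows the alignment is unknown and generates safe code. */
int read_int(char *buffer)
{
  int value;
  memcpy(&value, buffer, sizeof(value));
  return value;
}

int main()
{
  char buffer[8] = {0, 1, 2, 3, 4, 5, 6, 7};
  printf("offset of i = %zu\n", offsetof(struct packet, i)); /* 1, not 4 */
  printf("%x\n", read_int(buffer + 1)); /* misaligned source, but safe */
  return 0;
}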

The take-away from this should be that alignment is not something that most developers need to worry about. Most code gets the correct alignment out of the box - that's why the example program is so contrived: misalignment is the result of a developer choice, decision, or requirement. It does sometimes come up in porting, and that's why it's important to be able to diagnose when and where it happens, but most folks can get by assuming that they'll never see it! :)

Thursday Dec 11, 2014

Checking whether hardware supports crypto instructions

A quick example of how to tell if the machine that you're running on supports crypto instructions.

The 2011 SPARC Architecture manual tells you to read the cfr register before using the instruction. The cfr register contains a bit for every implemented crypto instruction. However, the cfr register is not implemented on all processors. So you would need to check whether this register is implemented before reading it....

So there has to be a better way. Fortunately, Solaris implements a getisax() call which provides this information without the user needing to muck around with the low level details. The following code shows how this call can be used to check whether the AES instruction is implemented or not:

#include <sys/auxv.h>
#include <stdio.h>

int main()
{
  unsigned int array[10];
  /* getisax() fills the array with instruction-set-extension bits and
     returns the number of array entries that contain valid data. */
  unsigned int count = getisax(array, 10);
  if (count > 0)
  {
    printf(" AES: ");
    if (array[0] & AV_SPARC_AES) { printf("Yes\n"); } else { printf("No\n"); }
  }
  else
  {
    printf("Error: getisax() call returned no results\n");
  }
  return 0;
}
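
Compiling and running it needs nothing special (the file name is hypothetical):

$ cc -o checkaes checkaes.c
$ ./checkaes

On a processor that implements the crypto instructions (a T4, for example) you'd expect " AES: Yes"; on older hardware, " AES: No".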

Wednesday Nov 12, 2014

Oracle Solaris Studio playlist

There's an extensive list of Solaris Studio videos on YouTube. In particular there's a bunch of tutorials covering the features of the IDE. The IDE doesn't often get the attention it deserves. It's based on NetBeans, and is full of useful code refactoring tools, navigation tools, etc. To find out more, take a look at some of the videos.

Tuesday Nov 11, 2014

New Performance Analyzer Overview screen

I love using the Performance Analyzer, but the question I often get when I show it to people is "Where do I start?". So one of the improvements in Solaris Studio 12.4 is an Overview screen to help people get started with the tool. Here's what it looks like:


The reason this is important is that many applications spend time in various places - like waiting on disk, or in user locks - and it's not always obvious where the most effective place to look for performance gains will be.

The Overview screen is meant to be the "one-stop" place where people can find out what their application is doing. When we put it into the product I expected it to be a screen that I glanced at and then never went back to. I was most surprised when this turned out not to be the case.

During performance analysis, I'm often exploring different ideas as to where it might be possible to get performance improvements. The Overview screen allows me to select the metrics that I'm interested in, then take a look at the resulting profiles. So I might start with system time, and just enable the system time metrics. Once I'm done with that, I might move on to user time, and select those metrics. So what was surprising about the Overview screen was how often I returned to it to change the metrics I was using.

So what does the screen contain? The Overview shows all the available metrics. The bars indicate which metrics contribute the most time, so it's easy to pick (and explore) the ones that matter.

If the profile contains performance counter metrics, then those also appear. If the counters include instructions and cycles, then the synthetic CPI/IPC metrics are also available. The Overview screen is really useful for hardware counter metrics.

I use performance counters in a couple of ways: to confirm a hypothesis about performance or to estimate time spent on a type of event. For example, if I think a load is taking a lot of time due to TLB misses, then profiling on the TLB miss performance counter will tell me whether that load has a lot of misses or not. Alternatively, if I've got TLB miss counter data, then I can scale this by the cost per TLB miss, and get an estimate of the total runtime lost to TLB misses.

Where the Overview screen comes into this is that I will often want to minimise the number of columns of data that are shown (to fit everything onto my monitor), but sometimes I want to quickly enable a counter to see whether that event happens at the bit of code where I'm looking. Hence I end up flipping to the Overview screen and then returning to the code.

So what I thought would be a nice feature, actually became pretty central to my work-flow.

I should have a more detailed paper about the Overview screen up on OTN soon.

Performance made easy

The big news of the day is that Oracle Solaris Studio 12.4 is available for download. I'd like to thank all those people who tried out the beta releases and gave us feedback.

There are a number of things that are new in this release. The most obvious one is C++11 support; I've written a bit about the lambda expression support, tuples, and unordered containers.

My favourite tool, the Performance Analyzer, has also had a bit of a facelift. I'll talk about the Overview screen in a separate post (and in an article), but there are some other fantastic features. The syntax highlighting and hyperlinking have made navigating profiles much easier. There have been a large number of improvements in filtering - a feature that's been in the product a long time, but these changes elevate it to being much more accessible (an article on filtering is long overdue!). There are also the default hardware counters - which make it a no-brainer to get hardware counter data, which is really helpful in understanding exactly what an application is doing.

Over the development cycle I've made much use of the other tools. The Thread Analyzer for identifying data races has had some improvements. The Code Analyzer tools have made some great gains in rapidly identifying potential coding errors. And so on....

Anyway, please download the new version, try it out, try out the tools, and let us know what you think of it.

Friday Oct 10, 2014

OpenWorld and JavaOne slides available for download

Thanks to everyone who attended my talks last week. My slides for OpenWorld and JavaOne are available for download.

Friday Aug 15, 2014

Providing feedback on the Solaris Studio 12.4 Beta

Obviously, the point of the Solaris Studio 12.4 Beta programme was for everyone to try out the new version of the compiler and tools, and for us to gather feedback on what was working, what was broken, and what was missing. We've had lots of useful feedback - you can see some of it on the forums. But we're after more.

Hence we have a Solaris Studio 12.4 Beta survey where you can tell us more about your experiences. Your comments are really helpful to us. Thanks.

Friday Jul 11, 2014

Studio 12.4 Beta Refresh, performance counters, and CPI

We've just released the refresh beta for Solaris Studio 12.4 - a free download. This release features quite a lot of changes to a number of components. It's worth calling out improvements in the C++11 support and the other tools. We've had a few comments and posts on the Studio forums, and a bunch of these have resulted in improvements in this refresh.

One of the features that is deserving of greater attention is default hardware counters in the Performance Analyzer.

Default hardware counters

There's a lot of potential hardware counters that you can profile your application on. Some of them are easy to understand, some require a bit more thought, and some are delightfully cryptic (for example, I'm sure that op_stv_wait_sxmiss_ex means something to someone). Consequently most people don't pay them much attention.

On the other hand, some of us get very excited about hardware performance counters, and the information that they can provide. It's good to be able to reveal that we've made some steps along the path of making that information more generally available.

The new feature in the Performance Analyzer is default hardware counters. For most platforms we've selected a set of meaningful performance counters. You get these if you add -h on to the flags passed to collect. For example:

$ collect -h on ./a.out

Using the counters

Typically the counters will gather cycles, instructions, and cache misses - these are relatively easy to understand and often provide very useful information. In particular, given a count of instructions and a count of cycles, it's easy to compute Cycles per Instruction (CPI) or Instructions per Cycle (IPC).
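
For example (with hypothetical numbers): a routine that accumulates 12 billion cycles and 4 billion instructions has a CPI of 12/4 = 3; equivalently, its IPC is 4/12, or about 0.33.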

I'm not a great fan of CPI or IPC as absolute measurements - working in the compiler team there are plenty of ways to change these metrics by controlling the I (instructions) when I really care most about the C (cycles). But, the two measurements have a very useful purpose when examining a profile.

A high CPI means lots of cycles were spent somewhere, and very few instructions were issued in that time. This means lots of stall cycles, which means that there's some potential for performance gains. So a good rule of thumb for where to focus first is routines that take a lot of time and have a high CPI.

IPC is useful for a different reason. A processor can issue a maximum number of instructions per cycle. For example, a T4 processor can issue two instructions per cycle. If I see an IPC of 2 for one routine, I know that the code is not stalled, and is limited by instruction count. So when I look at a code with a high IPC I can focus on optimisations that reduce the instruction count.

So both IPC and CPI are meaningful metrics. Reflecting this, the Performance Analyzer will compute the metrics if the hardware counter data is available. Here's an example:


This code was deliberately contrived so that all the routines had ludicrously high CPI. But isn't that cool - I can immediately see what kinds of opportunities might be lurking in the code.

This is not restricted to just the functions view: CPI and/or IPC are presented in every view, so you can look at CPI for each thread, line of source, or line of disassembly. Of course, as the counter data gets spread over more "lines" you have less data per line, and consequently more noise. So CPI data at the disassembly level is not likely to be that useful for very short-running experiments. But when aggregated, the CPI can often be meaningful even for short experiments.

Friday May 23, 2014

Generic hardware counter events

A while back, Solaris introduced support for PAPI - which is probably as close as we can get to a de-facto standard for performance counter naming. For performance counter geeks like me, this is not quite enough information - I actually want to know the names of the raw counters used. Fortunately this is provided in the generic_events man page:

$ man generic_events
Reformatting page.  Please Wait... done

CPU Performance Counters Library Functions   generic_events(3CPC)

NAME
     generic_events - generic performance counter events

DESCRIPTION
     The Solaris  cpc(3CPC)  subsystem  implements  a  number  of
     predefined, generic performance counter events. Each generic
...
   Intel Pentium Pro/II/III Processor
       Generic Event          Platform Event          Event Mask
     _____________________________________________________________
     PAPI_ca_shr          l2_ifetch                 0xf
     PAPI_ca_cln          bus_tran_rfo              0x0
     PAPI_ca_itv          bus_tran_inval            0x0
     PAPI_tlb_im          itlb_miss                 0x0
     PAPI_btac_m          btb_misses                0x0
     PAPI_hw_int          hw_int_rx                 0x0
...

Wednesday Feb 26, 2014

Multicore Application Programming available in Chinese!

This was a complete surprise to me. A box arrived on my doorstep, and inside were copies of Multicore Application Programming in Chinese. They look good, and have a glossy cover rather than the matte cover of the English version.

Article on RAW hazards

Feels like it's been a long while since I wrote up an article for OTN, so I'm pleased that I've finally got around to fixing that.

I've written about RAW hazards in the past. But I recently went through a phase of discovering them in a number of places, so I've written up a full article on them.

What is "nice" about RAW hazards is that once you recognise them for what they are (and that's the tricky bit), they are typically easy to avoid. So if you see 10 seconds of time attributable to RAW hazards in the profile, then you can often get the entire 10 seconds back by tweaking the code.

Friday Oct 21, 2011

Endianness

SPARC and x86 processors have different endianness. SPARC is big-endian and x86 is little-endian. Big-endian means that numbers are stored with the most significant data earlier in memory. Conversely little-endian means that numbers are stored with the least significant data earlier in memory.

Think of big endian as writing numbers as we would normally do. For example one thousand, one hundred and twenty would be written as 1120 using a big-endian format. However, writing as little endian it would be 0211 - the least significant digits would be recorded first.

For machines, this relates to which bytes are stored first. To make data portable between machines, a format needs to be agreed. For example in networking, data is defined as being big-endian. So to handle network packets, little-endian machines need to convert the data before using it.
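
To see which ordering your machine uses, here's a minimal sketch that prints the bytes of a known value in memory order:

#include <stdio.h>

int main()
{
  unsigned int v = 0x01020304;
  unsigned char *p = (unsigned char *)&v;
  /* big-endian prints 01 02 03 04; little-endian prints 04 03 02 01 */
  printf("%02x %02x %02x %02x\n", p[0], p[1], p[2], p[3]);
  return 0;
}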

Converting the bytes is a trivial matter, but it has some performance pitfalls. Let's start with a simple way of doing the conversion.

template <class T>
T swapslow(T in)
{
  T out;
  char * pcin = (char*)&in;
  char * pcout = (char*)&out;

  for (unsigned int i=0; i<sizeof(T); i++)
  {
    pcout[i] = pcin[sizeof(T)-1-i]; // reverse the order of the bytes
  }
  return out;
}

The code uses templates to generalise it to different sizes of integers. But the following observations hold even if you use a C version for a particular size of input.

The first thing to look at is the instruction count. Assume I'm dealing with ints. I store the input to memory, then I access the input one byte at a time, storing each byte to a new location in memory, before finally loading the result. So for an int, I've got 10 memory operations.

Memory operations can be costly. Processors may be limited to only issuing one per cycle. In comparison most processors can issue two or more logical or integer arithmetic instructions per cycle. Loads are also costly as they have to access the cache, which takes a few cycles.

The other issue is more subtle, and I've discussed it in the past. There are RAW issues in this code. I'm storing an int, but loading it as four bytes. Then I'm storing four bytes, and loading them as an int.

A RAW hazard is a read-after-write hazard. The processor sees data being stored, but cannot convert that stored data into the format that the subsequent load requires. Hence the load has to wait until the result of the store reaches the cache before the load can complete. This can be multiple cycles of wait.

With endianness conversion, the data is already in the registers, so we can use logical operations to perform the conversion. This approach is shown in the next code snippet.

template <class T>
T swap(T in)
{
  T out=0;
  for (int i=0; i<sizeof(T); i++)
  {
    out<<=8;
    out|=(in&255);
    in>>=8;
  }
  return out;
} 

In this case, we avoid the stores and loads, but instead we perform four logical operations per byte. This is higher cost than the load and store per byte. However, we can usually do more logical operations per cycle and the operations normally take a single cycle to complete. Overall, this is probably slightly faster than loads and stores.
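
For illustration, a quick sanity check (a sketch - it assumes both templates above are in the same file):

#include <cstdio>

int main()
{
  unsigned int v = 0x01020304;
  // both conversions should agree: 04030201
  std::printf("%08x %08x\n", swapslow(v), swap(v));
  return 0;
}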

However, you will usually see a greater performance gain from avoiding the RAW hazards. Obviously RAW hazards are hardware dependent - some processors may be engineered to avoid them, in which case you will only see a problem on particular hardware. Which means that your application may run well on one machine, but poorly on another.

Tuesday Jan 11, 2011

RAW pipeline hazards

When a processor stores an item of data back to memory it actually goes through quite a complex set of operations. A sketch of the activities is as follows. The first thing that needs to be done is that the cache line containing the target address of the store needs to be fetched from memory. While this is happening, the data to be stored there is placed on a store queue. When the store is the oldest item in the queue, and the cache line has been successfully fetched from memory, the data can be placed into the cache line and removed from the queue.

This works very well if data is stored and either never reused, or reused after a relatively long delay. Unfortunately it is common for data to be needed almost immediately. There are plenty of reasons why this is the case. If parameters are passed through the stack, then they will be stored to the stack, and then immediately reloaded. If a register is spilled to the stack, then the data will be reloaded from the stack shortly afterwards.

It could take some considerable number of cycles if the loads had to wait for the stores to exit the queue before they could fetch the data. So many processors implement some kind of bypassing. If a load finds the data it needs in the store queue, then it can fetch it from there. There are often some caveats associated with this bypass. For example, the store and load often have to be of the same size to the same address. i.e. you cannot bypass a byte from a store of a word. If the bypass fails, then the situation is referred to as a "RAW" hazard, meaning "Read-After-Write". If the bypass fails, then the load has to wait until the store has completed before it can retrieve the new value - this can take many cycles.

As a general rule it is best to avoid potential RAWs. Whether a RAW hazard actually occurs depends on the hardware and the runtime situation, so avoiding the possibility is the best defense. Consider the following code which uses loads and stores of bytes to construct an integer.

#include <stdio.h>
#include <sys/time.h>

void tick()
{
  hrtime_t now = gethrtime();
  static hrtime_t then = 0;
  /* 100000000 iterations, so this prints ns per iteration */
  if (then>0) printf("Elapsed = %f\n", 1.0*(now-then)/100000000.0);
  then = now;
}

int func(char * value)
{
  int temp;
  ((char*)&temp)[0] = value[3];
  ((char*)&temp)[1] = value[2];
  ((char*)&temp)[2] = value[1];
  ((char*)&temp)[3] = value[0];
  return temp;
}

int main()
{
  int value = 0x01020304;
  tick();
  for (int i=0; i<100000000; i++) func((char*)&value);
  tick(); /* report the elapsed time */
  return 0;
}

In the above code we're reversing the byte order by loading the bytes one-by-one, storing them into an integer in the correct positions, and then loading the integer. Running this code on a test machine, it reports 12ns per iteration.

However, it is possible to perform the same reordering using logical operations (shifts and ORs) as follows:

int func2(char* value)
{
  /* mask each byte to avoid sign-extension of char */
  return ((value[0]&255)<<24) | ((value[1]&255)<<16) |
         ((value[2]&255)<<8)  |  (value[3]&255);
}

This modified routine takes about 8ns per iteration, which is significantly faster than the original code.

The actual speed-up observed will depend on many factors, the most obvious being how often the code is encountered. The more important observation is that the speed-up depends on the platform. Some platforms will be more sensitive to the impact of RAWs than others, which means that your application may run well on one machine, but poorly on another. So the best advice is, wherever possible, to avoid passing data through the stack.

Monday Feb 15, 2010

x86 performance tuning documents

An interesting set of x86 performance tuning documents.

Monday Oct 19, 2009

Fishing with cputrack

I'm a great fan of the hardware performance counters that you find on most processors. Often you can look at the profile and instantly identify what the issue is. Sometimes though, it is not obvious, and that's where the performance counters can really help out.

I was looking at one such issue last week: the performance of the application was showing some variation, and it wasn't immediately obvious what the issue was. The usual suspects in these cases are:

  • Excessive system time
  • Process migration
  • Memory placement
  • Page size
  • etc.

Unfortunately, none of these seemed to explain the issue. So I hacked together a script, cputrackall, which ran the test code under cputrack for all the possible performance counters, dumped the output into a spreadsheet, and compared the fast and slow runs of the app. This is something of a "fishing trip" script - just gathering as much data as possible in the hope that something leaps out - but sometimes that's exactly what's needed. I regularly get to sit in front of a new chip before tools like ripc have been ported, and in those situations the easiest thing to do is to look for hardware counter events that might explain the runtime performance. In this particular instance, it helped me to confirm my suspicion that there was a difference in branch misprediction rates that was causing the issue.
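
The script itself isn't reproduced here, but the idea is easy to sketch. This is a hypothetical reconstruction: the real script iterated over the full counter list reported by cpustat -h, and the available event names vary by platform:

#!/bin/sh
# Sketch of the "cputrackall" idea (hypothetical, not the original script).
# Run the application once per counter; get real event names from cpustat -h.
for event in PAPI_tot_cyc PAPI_tot_ins PAPI_br_msp PAPI_l2_dcm
do
  echo "== $event =="
  cputrack -c $event "$@" > $event.out 2>&1
done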

Monday Sep 28, 2009

Updated compiler flags article

Just updated the Selecting The Best Compiler Options article for the developer portal. Minor changes, mainly a bit more clarification on floating point optimisations.

Friday Apr 17, 2009

Compiling for Nehalem and other processors

First thing I'd suggest is to make sure that you're using a recent compiler. Practically that means Sun Studio 12 (which is quite old now), or the Sun Studio Express releases (aka Sun Studio 12 Update 1 Early Access). Obviously the latest features are only found in the latest compilers.

In terms of flags, the starting point should be -fast, you can later trim that if you need to remove floating point simplification, or are not happy with one or other of the flags that it includes.

The next flags to use are listed below (a combined example command line follows the list):

  • -xipo=2 This flag enables crossfile optimisation and tracking of allocated memory. I find crossfile optimisation useful because it limits the impact of the source code structure on the final application.
  • -xarch=sse4_2 This flag is included in -fast (assuming that the build system supports it). However, if you later plan to fine tune the compiler flags, it's best to start off with it explicitly. This allows the compiler to use the SSE4 instruction set. Probably -xarch=sse2 will be sufficient in most circumstances - it's a call depending on the system that the application will be deployed on.
  • -xvector=simd This flag tells the compiler to generate SIMD (single instruction multiple data) instructions - basically the combination of this and the architecture flag enables the compiler to generate applications that use SSE instructions. These instructions can lead to substantial performance gains in some floating point applications.
  • -m64 On x86 there's performance to be gained from using the 64-bit instruction set extensions. The code gets more registers and a better calling convention. These tend to outweigh the costs of the larger memory footprint of 64-bit applications.
  • -xpagesize=2M This tells the operating system to provide large pages to the application.
  • The other optimisation that I find very useful is profile feedback. This does complicate and lengthen the build process, but is often the most effective way of getting performance gains for codes dominated by branches and conditional code.
  • The other flags to consider are the aliasing flags -xalias_level=std for C, -xalias_level=compatible for C++, and -xrestrict. These flags do lead to performance gains, but require the developer to be comfortable that their code does conform to the requirements of the flags. (IMO, most code does.)
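
Putting those together, a hypothetical starting command line might look like the following - trim the list to match your deployment constraints, and layer profile feedback and the aliasing flags on top once you're comfortable with them:

$ cc -fast -xipo=2 -xarch=sse4_2 -xvector=simd -m64 -xpagesize=2M -o app app.c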

All this talk about flags should not be a replacement for what I consider to be the basic first step in optimising the performance of an application: to take a profile. Compiler flags tell the compiler how to do a good job of producing the code, but the compiler can't do much about the algorithms used. Profiling the application will often give a clue as to a way that the performance of the application can be improved by a change of algorithm - something that even the best compiler flags can't always do.

Tuesday Nov 04, 2008

Job available in this performance analysis team

We're advertising a job opening in this group. We're looking for someone who's keen on doing performance analysis on x86 and SPARC platforms. The req number is 561456, and you can read the details on sun.com. If you have any questions, please do feel free to contact me.

Tuesday Oct 28, 2008

More Sun Studio resources from AMD

Bao from AMD pointed me at these two additional resources. The first is a cheat sheet for Sun Studio - I disagree with its suggestion to use -xO2; I would suggest using -O instead. There's also a Solaris Developer Zone.

Monday Oct 27, 2008

x86 compiler flags

This AMD document summarises the optimisation flags available for many x86 compilers (Sun Studio, PGI, Intel, etc.). It's about a year old, but it looks OK for Sun Studio. However, it talks about -xcrossfile, which is ancient history - use -xipo instead!

Thursday Mar 20, 2008

The much maligned -fast

The compiler flag -fast gets an unfair rap. Even the compiler reports:

cc: Warning: -xarch=native has been explicitly specified, or 
implicitly specified by a macro option, -xarch=native on this 
architecture implies -xarch=sparcvis2 which generates code that 
does not run on pre UltraSPARC III processors

which is hardly fair given that the UltraSPARC III line came out about 8 years ago! So I want to quickly discuss what's good about the option, and what reasons there are to be cautious.

The first thing to talk about is the warning message. -xtarget=native is a good option to use when the target platform is also the deployment platform. For me, this is the common case, but for people producing applications that are more generally deployed, it's not the common case. The best thing to do to avoid the warning and produce binaries that work with the widest range of hardware is to add the flag -xtarget=generic after -fast (compiler flags are parsed from left to right, so the rightmost flag is the one that gets obeyed). The generic target represents a mix of all the important processors, the mix produces code that should work well on all of them.
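
For example, a build intended for wide deployment might look like:

$ cc -fast -xtarget=generic -o app app.c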

The next option in -fast that might cause some concern is -xalias_level=basic (it applies to C). This tells the compiler to assume that pointers to different basic types (e.g. integers, floats, etc.) don't alias. Most people code to this, and the C standard actually allows the compiler to assume an even stricter level of aliasing. So code that conforms to the C standard will work correctly with this option. Of course, it's still worth being aware that the compiler is making the assumption.

The final area is floating point simplification. That's the flag -fsimple=2, which allows the compiler to reorder floating point expressions; -fns, which allows the processor to flush subnormal numbers to zero; and some other flags that use faster floating point libraries or inline templates. I've previously written about my rather odd views on floating point maths. Basically it comes down to: if these options make a difference to the performance of your code, then you should investigate why they make a difference.

Since -fast contains a number of flags which impact performance, it's probably a good plan to identify exactly those flags that do make a difference, and use only those. A tool like ats can really help here.

Performance tuning recipe

Dan Berger posted a comment about the compiler flags we'd used for Ruby. Basically, we've not done compiler flag tuning yet, so I'll write a quick outline of the first steps in tuning an application.

  • First of all, profile the application with whatever flags it usually builds with. This is partly to get some idea of where the hot code is, but it's also useful to have some idea of what you're starting with. The other benefit is that it's tricky to debug build issues if you've already changed the defaults. (There's a minimal sketch of this step after the list.)
  • It should be pretty easy at this point to identify the build flags. Probably they will flash past on the screen, or in the worst case, they can be extracted (from non-stripped executables) using dumpstabs or dwarfdump. It can be necessary to check that the flags you want to use are actually the ones being passed into the compiler.
  • Of course, I'd certainly use spot to get all the profiles. One of the features spot has that is really useful is to archive the results, so that after multiple runs of the application with different builds it's still possible to look at the old code, and identify the compiler flags used.
  • I'd probably try -fast, which is a macro flag, meaning it enables a number of optimisations that typically improve application performance. I'll probably post later about this flag, since there's quite a bit to say about it. Performance under the flag should give an idea of what to expect with aggressive optimisations. If the flag does improve the application's performance, then you probably want to identify the exact options that provide the benefit and use those explicitly.
  • In the profile, I'd be looking for routines which seem to be taking too long, and I'd be trying to figure out what was causing the time to be spent. For SPARC, the execution count data from bit that's shown in the spot report is vital in being able to distinguish between code that runs slowly and code that runs fast but is executed many times. I'd probably try various other options based on what I saw. Memory stalls might make me try prefetch options, TLB misses would make me tweak the -xpagesize options.
  • I'd also want to look at using crossfile optimisation, and profile feedback, both of which tend to benefit all codes, but particularly codes which have a lot of branches. The compiler can make pretty good guesses on loops ;)
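
As a minimal sketch of that first profiling step (the experiment name is the collect default):

$ cc -O -o app app.c              # build with the usual flags first
$ collect ./app                   # writes a profile into test.1.er
$ er_print -functions test.1.er   # which routines take the time?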

These directions are more a list of possible experiments than necessarily an item-by-item checklist, but they form a good basis. And they are not an exhaustive list...

Cross-linking support in Nevada

Ali Bahrami has just written an interesting post about cross-linking support going into Nevada. This is the facility to enable the linking of SPARC object files to produce SPARC executables on an x86 box (or the other way around).

Friday Mar 14, 2008

AMD64 architecture documentation

Documentation on the AMD64 architecture.

32-bits good, 64-bits better?

One of the questions people ask is when to develop 64-bit apps vs 32-bit apps. The answer is not totally clear cut: it depends on the application and the platform. So here's my take on it.

First, let's discuss SPARC. The SPARC V8 architecture defines the 32-bit SPARC ISA (Instruction Set Architecture). This was found on a selection of SPARC processors that appeared before the UltraSPARC I, i.e. quite a long time ago. Later the SPARC V9 architecture appeared. (As an aside, these are open standards, so anyone can download the specs and make one. Of course, not everyone has the time and resources to do that, but it's nice in theory. ;)

The SPARC V9 architecture added a few instructions, but mainly added the capability to use 64-bit addressing. The ABI (Application Binary Interface) was also improved (floating point values passed in FP registers rather than the odd V8 choice of using the integer registers). The UltraSPARC I and onwards have implemented the V9 architecture, which means that they can execute both V8 and V9 binaries.

One of the things that it's easy to get taken in by is the Animal Farm idea that if 32 bits is good, 64 bits must be better. The trouble with a 64-bit address space is that it takes more instructions to set up an address, pointers go from 4 bytes to 8 bytes, and the memory footprint of the application increases.

A hybrid mode was also defined, which took the 32-bit address space, together with a selection of the new instructions. This was called v8plus, or more recently sparcvis. This has been the default architecture for SPARC binaries for quite a while now, and it combines the smaller memory footprint of SPARC V8 with the more recent instructions from SPARC V9. For applications that don't require 64-bit address space, v8plus or sparcvis is the way to go.

Moving to the x86 side, things are slightly more complex. The high level view is similar. You have the x86 ISA, or IA-32 as it's been called. Then you have the 64-bit ISA, called AMD64 or EM64T. This gives you 64-bit addressing, a new ABI, a number of new instructions, and perhaps most importantly a bundle of new registers. The x86 has impressively few registers; the 64-bit ISA fixes that quite nicely.

In the same way as SPARC, moving to a 64-bit address space does cost some performance due to the increased memory footprint. However, the x64 gains a number of additional registers, which usually more than make up for this loss in performance. So the general rule is that 64-bits is better, unless the application makes extensive use of pointers.

Unlike SPARC, AMD64/EM64T does not currently provide a mode which gives the instruction set extensions and registers with a 32-bit address space.
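
In compiler flag terms, the choice looks something like this (a sketch - exact defaults vary between compiler versions):

$ cc -m64 app.c                   # 64-bit ABI, on SPARC or x86
$ cc -m32 -xarch=sparcvis app.c   # SPARC: 32-bit addresses plus the newer instructions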

About

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge